Top Big Data trends in India – Interview with a data scientist

22nd Aug '15, 03:13 PM in Resources

Mastufa Ahmed Contributor

Navin Manaswi, a data scientist and big data specialist with General Mills, a U.S.-based food company, shares insights on how the company uses data science to forecast product demand.

Navin Manaswi

Give me an overview of how big data and analytics are being leveraged by Indian enterprises?

According to NASSCOM and CRISIL, the Indian Big Data market is projected to reach $1 billion by 2015, up from nearly $0.2 billion in 2011, and the global Big Data market is projected to grow at the same rate and reach nearly $25 billion by 2015. Only a few big companies are applying big data analytics to understand, help and attract customers and to maximize revenue; the others are just storing Big Data in data warehouses and trying to create some basic reports.

Gradually, e-commerce, telecommunications, large retail, media, oil exploration and mobile technology businesses are recruiting data science professionals, and with hefty pay packages at that, which indicates the boom in data science and big data analytics.

What are some of the top trends currently in the big data and analytics domain?

Leveraging machine learning algorithms and graph theory: Data scientists use machine learning on past data to predict future business trends and to plan around them. Every business stakeholder, whether manager, executive or investor, wants weekly, monthly, quarterly or yearly forecasts, in both aggregate and detailed form, presented as cross-tabulations and/or charts, so that they can optimize their resources and save millions of dollars.

Using R/Python programming languages: Nearly every part of data science, including machine learning, data manipulation, data visualization, and web-based interactive visualization, is done mostly in R or Python. Industry verticals such as finance and telecommunications use R or Python.

Working on distributed computing including Hadoop clusters: With the emergence of the Internet of Things and mobile devices, and the increasing interaction among them, the volume of data and the analysis it demands are growing extremely fast. So the need for faster real-time analytics and large-scale storage of mobile data has been growing just as quickly.

Using Tableau/Spotfire/QlikView and D3 for visualization: All industry verticals need to analyze interdependencies and complex relationships between indicators, including KPIs, using clear and powerful visualizations such as network graphs, treemaps, surface and contour plots, and hexbin plots.

What are your prime responsibilities at General Mills as a (supply chain) data scientist?

Forecasting the demand for each product accurately every week is the most important job in supply chain management. Understanding the linear or nonlinear relationships between different KPIs (key performance indicators) to get the best possible insights is the second most important. Understanding customers and suppliers from their data, and then classifying or clustering them, is another important job. For this, analysts use R and Python for predictive analytics and optimization.

How do these initiatives translate in terms of adding to the business?

We try to solve business problems more accurately with the help of machine learning. Having solved some of them and built the confidence of stakeholders, our team has started new projects on prediction, classification, clustering and optimization. All these initiatives have helped our company save millions of dollars through accurate demand prediction and a better understanding of the complex relationships between different KPIs. Note that even a one percent improvement in demand forecasting accuracy translates to huge savings for a company. In addition, minimizing production and transportation costs brings substantial savings in supply chain management.

With data science still at a nascent stage in India, what challenges do you face today?

Data formats and data management are not consistent throughout the system. The main challenge lies in making the data consistent and clean before applying various modeling and optimization techniques. In addition, we need to understand the business more holistically and more deeply before we start working on any project.

Can you share some insights on the tools that you use?

Time series clustering helps us cluster products based on various parameters. Bayesian networks and Markov models help us draw significant insights out of the data pool. (A Bayesian network is a statistical model that represents a set of random variables and their dependencies via a directed acyclic graph. For example, a Bayesian network could represent the relationships between diseases and symptoms. A Markov model is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.) Understanding the delivery networks through graph theory also helps us reduce transportation costs significantly.
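
To make the Markov model definition above concrete, here is a minimal Python sketch that estimates transition probabilities from an observed sequence of states. It is not the interviewee's code; the state labels and the sample sequence are hypothetical, and a real project would more likely use a dedicated statistics library.

```python
from collections import Counter, defaultdict


def estimate_transition_probabilities(sequence):
    """Estimate first-order Markov transition probabilities from a sequence.

    Returns a nested dict: probabilities[current_state][next_state].
    """
    counts = defaultdict(Counter)
    # Count how often each state is immediately followed by each other state.
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    # Normalize each row of counts so its probabilities sum to 1.
    return {
        state: {
            following: n / sum(followers.values())
            for following, n in followers.items()
        }
        for state, followers in counts.items()
    }


# Hypothetical weekly demand levels for one product.
weekly_demand_states = ["low", "low", "high", "low", "high", "high", "low"]
probabilities = estimate_transition_probabilities(weekly_demand_states)

for state, row in probabilities.items():
    for following, p in row.items():
        print(f"P({following} next week | {state} this week) = {p:.2f}")
```

Each row depends only on the current state, which is exactly the Markov property described above: next week's demand level is predicted from this week's level alone.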

What’s your advice for IT Managers on how to leverage analytics tools?

IT managers should be aware of the strengths and limitations of all popular machine learning algorithms and should think about where they can apply classification, clustering, prediction, association mining, optimization and mathematical modeling, so as to help business leaders make better decisions.

Republished with author’s permission. Originally appeared on ITNEXT.