
Machine Learning for Sales Forecasting: How to tackle the insufficient data issue

07th Sep '18, 04:22 PM in Machine Learning


Nikolay Savin, Contributor

Online and offline retailers alike understand that the external environment has become too complex and unpredictable, and that the number of products has grown too large to manage by hand. Retailers therefore use algorithms to set the right prices, forecast stock levels and optimize other business processes.

Before it can make predictions, an algorithm needs to be trained. Training is often conducted on historical data (supervised learning), where there is a target variable such as sales, revenue, profit or market share.

Through the learning process, the model analyzes the variables that affect sales (prices, traffic, etc.) and outputs a function describing sales. Once the algorithm has completed training and shown sufficient forecast accuracy, it can be used to recommend which values those variables should take in the future to maximize sales.


Learning algorithms need a significant amount of competitive data, which is why data is becoming the new oil for retail. It can turn a retailer into a market leader or leave it an ordinary player.

In this article, we describe the methods for obtaining missing data for forecasting algorithms: from data buying to machine learning modeling.

Causes of missing data

The main problem retailers face when using algorithms is that the “supervisor” (the historical data) is often incomplete or inconsistent and can’t be used for training. This can happen for several reasons:

  • The format of the data has changed. Different internal systems, IT solutions and data collection approaches (by day or by transaction) result in data being stored in different formats for different periods.
  • The data was collected for a different purpose. If the data was collected at a top level, for example to pay bonuses to category managers, it is not suitable for these algorithms.
  • The retailer’s time in business. A retailer may be new to the market, and during this initial stage sales are 90% dependent on site traffic. It is therefore impossible to assess how prices affect sales during this period.
  • Flash sales. If the retailer operates in flash sales mode (selling different categories or brands for very short periods of time), the algorithms can’t work with such heterogeneous sales data either.

If, for one of the reasons above or for some other reason, the data is insufficient for model training and forecasting, the retailer should try to make the most of the existing data or model the missing data.

Working with existing data

If the problem lies in the data format, the retailer needs to convert the data to a single format. If a certain amount of data has already been collected and the retailer starts gathering additional information on new factors (e.g., competitors’ prices), it needs to wait a certain amount of time (about a year) for enough new data to accumulate. Another option is to buy the missing data.
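As an illustration of unifying formats, the sketch below rolls transaction-level records up to daily totals so they can be combined with older records that were already stored by day. The record layouts, date formats and SKU names are assumptions made up for this example, not a description of any particular retail system.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transaction-level records from a newer system: (timestamp, sku, units)
transactions = [
    ("2018-03-01 09:15:00", "A1", 2),
    ("2018-03-01 17:40:00", "A1", 1),
    ("2018-03-02 11:05:00", "A1", 4),
]

# Hypothetical daily records from an older system: (date as DD.MM.YYYY, sku, units)
daily_legacy = [
    ("27.02.2018", "A1", 5),
    ("28.02.2018", "A1", 3),
]


def unify_daily(transactions, daily_legacy):
    """Return {(iso_date, sku): units}, combining both sources at daily granularity."""
    totals = defaultdict(int)
    # Roll each transaction up to its calendar day
    for stamp, sku, units in transactions:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date().isoformat()
        totals[(day, sku)] += units
    # Convert the legacy date format to the same ISO representation
    for day_str, sku, units in daily_legacy:
        day = datetime.strptime(day_str, "%d.%m.%Y").date().isoformat()
        totals[(day, sku)] += units
    return dict(totals)


unified = unify_daily(transactions, daily_legacy)
for key in sorted(unified):
    print(key, unified[key])
```

Daily granularity is chosen here only because it is the coarser of the two formats; a real pipeline would pick the level the forecasting model needs.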


At the same time, even without the data collected from the market, forecasting models can be created.


Such models will be less precise, take more time to build, and require more assumptions and more modeling of missing information. Still, they are efficient enough and widely used.

Simulation of missing data

Some methods can predict the missing values of one variable from existing data on other variables. For example, if the retailer has its own prices and sales for two years but a history of competitors’ prices for only 1.5 years, a simulation can estimate the competitors’ prices for the missing six months.
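The price example above can be sketched in a few lines. Because a price is a numerical value, this sketch swaps in a simple least-squares line rather than a classifier: it is fitted on the periods where both price series exist and then used to estimate the competitor's price where it is missing. The numbers and the choice of a single predictor (the retailer's own price) are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical weekly prices; None marks weeks without competitor data
own_price = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5]
competitor_price = [None, None, 10.8, 11.2, 11.9, 12.3]

# Training part: weeks where the competitor price is known
known = [(o, c) for o, c in zip(own_price, competitor_price) if c is not None]
intercept, slope = fit_line([o for o, _ in known], [c for _, c in known])

# Forecast part: fill only the gaps, leave observed values untouched
imputed = [
    c if c is not None else round(intercept + slope * o, 2)
    for o, c in zip(own_price, competitor_price)
]
print(imputed)
```

A real model would normally use more predictors (sales, seasonality, category) and would be validated on held-out periods before the imputed values are trusted.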


To solve such problems, classifiers are used. They predict missing values from other independent variables for which information is available. There are two main types of “smart” missing data imputation:

  • Predictive model: To predict the missing values, the data set is divided into two parts: the first contains the rows with existing data, and the second contains the rows with missing data. The first part becomes the training set, while the missing values in the second part become the forecast target. A binary classifier in this case answers the question of whether an event happened or not (for example, whether a product was present on the shelf). A categorical classifier assigns a product to a particular segment (e.g., a price segment).
  • KNN (k-nearest neighbors) method: Predicts the missing values from the “closest” observations, that is, those most similar to the one with the missing value. The distance between observations, calculated from the available variables, determines their similarity.
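The KNN idea can be sketched without any library. In the sketch below, a missing value is replaced by the average of that column over the k complete rows that are nearest in the remaining columns. The column layout, the data and the choice of k=2 are assumptions for illustration.

```python
import math


def knn_impute(rows, target_index, k=2):
    """Fill None values in column `target_index` with the mean of the k nearest complete rows.

    Distance is Euclidean over the remaining columns, which are assumed to be complete.
    """
    complete = [r for r in rows if r[target_index] is not None]
    filled = []
    for row in rows:
        if row[target_index] is not None:
            filled.append(list(row))
            continue

        def distance(other, row=row):
            # Compare every column except the one being imputed
            return math.sqrt(sum(
                (a - b) ** 2
                for i, (a, b) in enumerate(zip(row, other))
                if i != target_index
            ))

        neighbours = sorted(complete, key=distance)[:k]
        estimate = sum(n[target_index] for n in neighbours) / len(neighbours)
        new_row = list(row)
        new_row[target_index] = estimate
        filled.append(new_row)
    return filled


# Hypothetical rows: (own price, units sold, competitor price)
rows = [
    (10.0, 120, 10.2),
    (10.5, 110, 10.6),
    (12.0, 80, 12.4),
    (10.2, 118, None),
]
result = knn_impute(rows, target_index=2, k=2)
print(result[-1])
```

No scaling is applied here to keep the sketch short. In practice the columns would usually be standardized first, otherwise the variable with the largest range (units sold in this data) dominates the distance.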

The most common example of a classifier is a churn predictor, which shows the probability of customer churn for a retailer or service company. The five main classifier types are logistic regression, decision trees, neural networks, gradient boosting and Random Forest.

Once the missing values have been predicted, other types of algorithms (regressors) are used to predict the final target variable. What they predict is not a segment or a probability but a numerical value, which in retail is sales.

The most common regressor types are linear and polynomial regression, neural networks, regression trees and the Random Forest mentioned above.

Machine learning in working with data

If the retailer has a significant amount of data, it can use neural networks to recommend stock-ups or prices to maximize sales. If there is not enough data to create and train a neural network, there are other algorithms that need less data.

If the retailer’s portfolio has a sufficient sales history for only about 30% of the products, its traffic is small and sales are scarce, there is not enough data for a neural network to work with. In this case, tree-based algorithms can forecast sales at the level of individual products.

Examples of tree-based algorithms are XGBoost and its newer, ambitious competitors, LightGBM and CatBoost.

For algorithms of this type, an active sales history of 150+ days is sufficient to predict optimal prices. They have a disadvantage: they are typically poor at taking into account the interdependence of prices for different products. They can be used for KVI (key value item) products, while the remaining products can be managed with simple pricing scenarios (e.g., rule-based pricing).
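To make "rule-based pricing" concrete, here is a minimal sketch of one possible rule: undercut a competitor slightly, but never go below cost plus a minimum margin. The rule itself, the function name and the default percentages are hypothetical and chosen only for illustration.

```python
def rule_based_price(cost, competitor_price, min_margin=0.10, undercut=0.01):
    """Undercut the competitor slightly, but never drop below cost plus a minimum margin."""
    floor = cost * (1 + min_margin)               # lowest acceptable price
    target = competitor_price * (1 - undercut)    # price just below the competitor
    return round(max(floor, target), 2)


# The competitor can be undercut while keeping the margin
price_a = rule_based_price(cost=8.0, competitor_price=10.0)
# Undercutting would break the margin rule, so the floor price is used instead
price_b = rule_based_price(cost=9.5, competitor_price=10.0)
print(price_a, price_b)
```

Rules like this need no sales history at all, which is why they suit the long tail of products for which a model cannot be trained.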

For a small number of products (20-30 items), a retailer can use regression with 3-4 added variables to calculate the price elasticity of these goods. The result can be used for high-level decision making: is there room for a price increase, or is it better to refrain from one?

Example: linear or polynomial regression, or support vector machines (SVM)

This approach does not say what the price should be to maximize sales and margins, but it shows the trend.
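One common way to estimate price elasticity with regression is a log-log model, where the slope of ln(units) on ln(price) is read as the percentage change in units sold for a 1% change in price. The sketch below uses price as the only predictor for brevity, whereas the approach described above adds 3-4 variables. The data is synthetic and generated with a known elasticity so the result can be checked.

```python
import math


def price_elasticity(prices, units):
    """Slope of ln(units) regressed on ln(price), estimated by least squares."""
    log_p = [math.log(p) for p in prices]
    log_q = [math.log(q) for q in units]
    n = len(log_p)
    mean_p = sum(log_p) / n
    mean_q = sum(log_q) / n
    sxx = sum((x - mean_p) ** 2 for x in log_p)
    sxy = sum((x - mean_p) * (y - mean_q) for x, y in zip(log_p, log_q))
    return sxy / sxx


# Synthetic data generated from units = 1000 * price ** -1.5 (an assumed demand curve)
prices = [8.0, 9.0, 10.0, 11.0, 12.0]
units = [1000 * p ** -1.5 for p in prices]

elasticity = price_elasticity(prices, units)
print(round(elasticity, 2))
```

An elasticity below -1 suggests demand is elastic, so a price increase would likely reduce revenue. A value between -1 and 0 suggests there may be room to raise the price.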

Another method that is used when the retailer does not have enough data is A/B testing, which relies on statistics. Retailers that are just starting out can use it to assess the impact of ads and prices on sales.
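As a sketch of the statistics behind an A/B test, the function below runs a two-proportion z-test on conversion rates from two groups of visitors. The visitor and conversion counts are invented for the example, and the z-test is just one standard choice of test.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, expressed with math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Hypothetical experiment: group A sees the current price, group B a lower one
z, p_value = two_proportion_z_test(conv_a=120, n_a=2000, conv_b=160, n_b=2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) indicates that the difference in conversion between the two prices is unlikely to be due to chance alone.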

Example: conjoint analysis

Applied to the data collected through A/B testing, conjoint analysis allows a retailer to determine, even on a small sample, which combinations of price, promotion and advertising will work best. This method also shows the contribution of each of these factors and their optimal values.
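A very simplified version of this calculation is sketched below. It assumes a balanced design in which every combination of price, promo and ad channel was tested once, estimates each level's part-worth as its average result minus the overall average, and derives each factor's relative importance from the range of its part-worths. The observations are invented, and real conjoint studies typically fit a regression model instead.

```python
from collections import defaultdict

# Hypothetical test results: (price level, promo, ad channel, units sold)
observations = [
    ("low", "yes", "email", 135),
    ("low", "yes", "banner", 125),
    ("low", "no", "email", 115),
    ("low", "no", "banner", 105),
    ("high", "yes", "email", 95),
    ("high", "yes", "banner", 85),
    ("high", "no", "email", 75),
    ("high", "no", "banner", 65),
]
attributes = ["price", "promo", "ad"]


def part_worths(observations, attributes):
    """Mean outcome per level minus the grand mean (valid for balanced designs)."""
    grand_mean = sum(o[-1] for o in observations) / len(observations)
    worths = {}
    for idx, attr in enumerate(attributes):
        by_level = defaultdict(list)
        for obs in observations:
            by_level[obs[idx]].append(obs[-1])
        worths[attr] = {
            level: sum(vals) / len(vals) - grand_mean
            for level, vals in by_level.items()
        }
    return worths


def importance(worths):
    """Share of each attribute's part-worth range in the total of all ranges."""
    ranges = {a: max(w.values()) - min(w.values()) for a, w in worths.items()}
    total = sum(ranges.values())
    return {a: r / total for a, r in ranges.items()}


worths = part_worths(observations, attributes)
shares = importance(worths)
# The best combination picks the level with the highest part-worth per attribute
best = {a: max(w, key=w.get) for a, w in worths.items()}
print(worths)
print(shares)
print(best)
```

In this invented data, price has the largest part-worth range and therefore the largest contribution, and the best combination is the low price with a promo and the email channel.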


There are many methods a retailer can use to fill in missing data or forecast prices on small data volumes. Even so, the best option remains the collection and processing of historical data. It makes training neural networks much easier, and the resulting forecasts are much more reliable.