Extrapolation — Useful tool or dangerous data

Dennis Ash
Published in unpack
Jan 18, 2021

One cannot talk about extrapolation without referring to regression; in fact, regression is an integral part of extrapolation. A simple explanation of regression as used for extrapolation is this: finding the best-fit mathematical relationship between the dependent and independent variables in a data set where the dependent variable is continuous.

Extrapolation uses the relationship determined by the regression to predict dependent values at independent values that lie outside the data set used for training.

Understanding different types of data sets and where extrapolation can be applied:

A simple explanation of different types of data sets and their uses in deep learning models could go something like this:

Consider a data set of the types of animals one might encounter in the Kruger National Park. There is a finite number of types of deer, and images would be classified by how closely they fit what was determined to be the ideal animal the model was trained to recognize in each domain. If we showed the model a deer from the USA, it would not be able to identify it, since it falls outside of the domains it has categorized, and there is no way the model could predict or extrapolate that it was an American deer (to do this you would need to introduce a new domain for American deer). In this case extrapolation is not a useful tool.

Data sets with continuous variables, however, are exactly the type of data sets where data points outside the training set can be predicted, or extrapolated.

Using the example created for the previous article on regression, a simple linear regression formula would look like y = bx + a. The random data used there generated a linear regression with the following results for ‘a’ and ‘b’:

a = 4.8297; b = 0.4735
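
To make this concrete, here is a minimal sketch in Python of fitting and extrapolating a straight line. The data below is a synthetic stand-in (the random set from the previous article is not reproduced here), so the fitted coefficients will only approximate the a and b values above:

```python
import numpy as np

# Synthetic stand-in for the random data from the previous article;
# the generating coefficients here are hypothetical.
rng = np.random.default_rng(42)
x = np.arange(0, 20, dtype=float)
y = 0.4735 * x + 4.8297 + rng.normal(0, 1.0, size=x.size)

# Fit y = bx + a by ordinary least squares (a degree-1 polynomial fit).
b, a = np.polyfit(x, y, deg=1)
print(f"a = {a:.4f}, b = {b:.4f}")

# Extrapolate: apply the fitted line to x values outside the 0-19 range.
x_new = np.array([25.0, 30.0, 40.0])
print(b * x_new + a)
```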

The example of linear regression and the associated linear extrapolation shows how this can work, but it also highlights the inadequacies of linear extrapolation: the predictions will have a margin of error, which is clearly apparent in the graph:

Extrapolation is a useful tool for predicting or forecasting outcomes that lie outside the model's training set. Examples include forecasting speed with reference to drag and engine power (racing cars), forecasting the best time to harvest based on growth rates and weather forecasts, and predicting stock price levels, although there are parameters in the stock exchange that cannot be accounted for and that will affect the outcomes. A general rule of thumb is that extrapolation becomes less accurate the further you get from the known data/training set, and less accurate the more complex the regression expression for the data set is.

Creating a new example with random data and a polynomial regression, we see from the extrapolation it generates that the accuracy of the extrapolation depends on how well the mathematical expression fits the data. There will be points that do not lie on the regression line, so it is safe to say that the extrapolated points will not be exact.
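
A sketch of that polynomial case, again with made-up data: the fitted curve misses some training points, and those residuals carry straight through into the extrapolated values.

```python
import numpy as np

# Hypothetical noisy quadratic data standing in for the article's random set.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 0.3 * x**2 - x + 5 + rng.normal(0, 2.0, size=x.size)

# Fit a degree-2 polynomial; the fit will not pass through every point.
poly = np.poly1d(np.polyfit(x, y, deg=2))

# Points beyond the 0-10 training range inherit the fit's residual error.
x_out = np.array([12.0, 15.0])
print(poly(x_out))
```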

Extrapolation can only give a result based on the training data. For example, if a company wanted to predict sales for the next 12 months, the extrapolation would be based on available sales data; should a new product be introduced into the mix, the extrapolated results would need to be adjusted based on the new data available.

Extrapolation is useful for predicting nonlinear but regularly repeating anomalies in historical data, for instance holidays and special sales events such as Black Friday and 11.11, and it is essential for planning the logistics around these events.

The challenge, however, is how to improve accuracy. One way of doing this is to expand the area of prediction by adding upper and lower limits to the extrapolation model, presenting a range rather than a single number. While this does not improve point accuracy, it does allow the actual result to fall within the prediction. You can see in the example that the actual points that are not on the regression line now fall into the area between the upper and lower limits of the extrapolation (in this example we regressed the upper and lower extrapolations back into the known model to test it), so we now have an extrapolation that more reliably includes data points outside the original data set. It might not be point accurate, but it gives enough information to allow for planning that incorporates highs and lows.
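
The article does not spell out how the upper and lower limits were constructed; one standard choice that behaves the same way is the ordinary-least-squares prediction interval, which widens as you move away from the centre of the training data. A sketch under that assumption:

```python
import numpy as np

def linear_prediction_band(x, y, x_new, z=1.96):
    """Fit y = bx + a and return (lower, point, upper) at x_new.

    Standard OLS prediction interval: the band widens with distance
    from the mean of the training x values, so far-out extrapolations
    get a broader range instead of a single number.
    """
    n = x.size
    b, a = np.polyfit(x, y, deg=1)
    resid = y - (b * x + a)
    s = np.sqrt(resid @ resid / (n - 2))            # residual standard error
    sxx = np.sum((x - x.mean()) ** 2)
    half = z * s * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2 / sxx)
    point = b * x_new + a
    return point - half, point, point + half

# Example: the band at x = 30 is wider than at x = 22
# (the synthetic training data ends at x = 19).
rng = np.random.default_rng(42)
x = np.arange(0, 20, dtype=float)
y = 0.4735 * x + 4.8297 + rng.normal(0, 1.0, size=x.size)
print(linear_prediction_band(x, y, np.array([22.0, 30.0])))
```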

The example shows that the further away from the known data you get, the broader the range for the predicted outcome becomes, so it is worthwhile keeping the data updated as time passes; this in turn will help keep the prediction range quite narrow.

We use this a lot in supply chain management, where we can predict upstream supply based on current sales data; this helps keep the supply pipeline lean without running out of stock.

Another method, which can be used for complex data, is to generate slightly different regression relationships, use these to produce multiple extrapolations, and then average these out to get the final predictions. This can be more accurate, but the predictions can also be far off; in running a few models for these examples, one of the models created a completely incorrect extrapolation whose result made no sense. The figure on the right shows a calculation made by averaging out five different polynomial regression/extrapolation models. The upper and lower limits of the single-model extrapolation remain as a comparison, and you can see how different the predicted outcome can be the further you move from the actual data: there is little difference between the extrapolations of the different methods within the first 5 data points, but after that there is a significant difference.
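
The article does not say how the five models were made to differ; one plausible way to reproduce the idea is to fit polynomials of slightly different degree on bootstrap resamples of the data and average their extrapolations, as sketched below. The spread between the individual models also shows how their disagreement grows with distance from the data.

```python
import numpy as np

# Hypothetical noisy quadratic data, as in the earlier sketch.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 0.3 * x**2 - x + 5 + rng.normal(0, 2.0, size=x.size)

x_out = np.linspace(11, 16, 6)  # extrapolation region beyond the data

# Five slightly different models: varying degree + bootstrap resampling.
preds = []
for deg in (2, 2, 3, 3, 4):
    idx = rng.choice(x.size, size=x.size, replace=True)  # bootstrap sample
    coeffs = np.polyfit(x[idx], y[idx], deg=deg)
    preds.append(np.poly1d(coeffs)(x_out))

final = np.mean(preds, axis=0)   # averaged extrapolation
spread = np.std(preds, axis=0)   # model disagreement grows with distance
print(final)
print(spread)
```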

Where does extrapolation fail in Deep Learning?

While extrapolation works to some extent with linear and other types of regression, it does not work with decision trees and random forests. Decision trees and random forests look at the data differently: data is sorted and filtered down into leaf nodes that have no direct relation to other leaf nodes in the tree or forest. This means that while a random forest is great for sorting data, the results cannot be used for extrapolation, since the model does not know how to characterize data outside of its domain.

In the image on the right, the red dots show the extrapolation: as soon as the data falls outside of the domain, the line flattens out and the extrapolation is not useful.
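
That flattening is easy to reproduce. In the sketch below (my own illustrative data, not the article's), a scikit-learn random forest tracks a simple upward trend inside the training range, but outside it every tree falls into its outermost leaf, so the predictions plateau at roughly the value at the edge of the training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative data: a plain upward trend (y ≈ 2x) with a little noise.
rng = np.random.default_rng(2)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(0, 1.0, size=100)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

print(model.predict([[5.0], [9.0]]))            # inside the domain: follows the trend
print(model.predict([[12.0], [20.0], [50.0]]))  # outside: flattens near y ≈ 20
```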

Summary

Extrapolation is a useful tool, but it must be used with the right model characterizing the data, and with awareness of its limitations once you leave the training domain. Its uses lie in prediction where you have continuous data such as time, speed, etc. Prediction is notoriously inaccurate, and as the distance from the trained domain increases, the accuracy decreases. In cases where extrapolation is necessary, one should keep updating the model and retraining it in order to reduce the margin of error.
