Explaining the basic steps to create time series forecasts.
Patterns are everywhere around us: the four seasons in the weather, peak hours in traffic volume, your heartbeat, share prices in the stock market and the sales cycles of certain products.
Analyzing time series data can be extremely useful for checking these patterns and creating predictions for the future. There are several ways to create these forecasts; in this post I will cover the concepts behind the most basic and traditional methodologies.
All code is written in Python, and any additional information can be found on my GitHub.
So let’s start with the initial condition for analyzing time series:
A stationary time series is one whose statistical properties, such as mean, variance and autocorrelation, are relatively constant over time. Therefore, a non-stationary series is one whose statistical properties change over time.
Before starting any predictive modeling, it is necessary to verify whether these statistical properties are constant. I will explain each of these points below:
- Constant mean
- Constant variance
- Autocorrelation
A stationary series has a relatively constant mean over time, with no bullish or bearish trends. A constant mean with small variations around it makes it much easier to extrapolate into the future.
There are cases where the variance is small relative to the mean, and the mean itself may then be a good basis for predicting the future. The chart below shows a relatively constant mean in relation to the variance over time:
On the other hand, if the series is not stationary, a forecast based on the mean will not be effective, because the values drift significantly away from the mean, as can be seen in the chart below:
In the chart above, it is clear that there is a bullish trend and the mean is gradually rising. In this case, if the average were used to make future forecasts, the error would be significant, since the forecast prices would always be below the real prices.
When the series has constant variance, we have an idea of the typical deviation from the mean. When the variance is not constant (as in the image below), the forecast will probably have bigger errors in certain periods, and these periods will not be predictable, since the variance is expected to keep changing over time, including in the future.
To reduce the effect of a changing variance, a logarithmic transformation can be applied. A power transformation, such as the Box-Cox method, or an inflation adjustment can also be used.
When two variables tend to vary together over time, you can say that they are correlated. For instance, the incidence of heart disorders increases along with body weight: the greater the weight, the greater the incidence of heart problems. In this case, the correlation is positive and the graph would look something like this:
A case of negative correlation would be something like: the greater the investment in safety measures at work, the smaller the number of work-related accidents.
Here are several examples of scatter plots with correlation levels:
Autocorrelation means that there is a correlation between certain previous periods and the current period; the distance between the two correlated periods is called the lag. For instance, in a series with hourly measurements, today’s temperature at 12:00 is very similar to the temperature at 12:00 on the previous day. If you compare the temperatures across this 24-hour time frame, there will be an autocorrelation, in this case at the 24th lag.
Autocorrelation is a condition for creating forecasts with a single variable: if there is no correlation, you cannot use past values to predict the future. When there are several variables, you can check whether there is a correlation between the dependent variable and the lags of the independent variables.
If a series has no autocorrelation, it is a random and unpredictable sequence, and the best way to make a prediction is usually to use the value from the previous period. I will use more detailed charts and explanations below.
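To make the idea of a lag concrete, here is a minimal sketch in plain Python. The `autocorr` helper and the synthetic hourly temperatures are my own illustration (not the ethanol data); the helper computes the usual sample autocorrelation at a given lag.

```python
import math

def autocorr(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    num = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    return num / denom

# Hypothetical hourly temperatures: a clean 24-hour cycle repeated for 10 days.
temps = [20 + 5 * math.sin(2 * math.pi * h / 24) for h in range(24 * 10)]

r24 = autocorr(temps, 24)  # same hour on the previous day
r12 = autocorr(temps, 12)  # opposite side of the daily cycle
print(f"lag 24: {r24:.2f}, lag 12: {r12:.2f}")
```

The 24th lag comes out strongly positive and the 12th strongly negative, which is the kind of signature a seasonal series leaves in an autocorrelation chart.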
From here on, I will analyze the weekly hydrous ethanol prices from Esalq (a price reference for negotiating hydrous ethanol in Brazil); the data can be downloaded here.
The price is in Brazilian Reais per cubic meter (BRL/m3).
Before starting any analysis, let’s split the data into a training set and a test set.
Splitting the data into training and test sets
When creating a time series prediction model, it’s crucial to separate the data into two parts:
Training set: this data is the main basis for defining the coefficients/parameters of the model;
Test set: this data is held out and not seen by the model, so that it can be used to test whether the model works (generally the predictions are compared with these values using a walk-forward method, and finally the mean error is measured).
The size of the test set is usually about 20% of the total sample, although this percentage depends on the sample size that you have and also on how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required.
Unlike other prediction methods, such as classification and regression without the influence of time, in time series we cannot split the training and test data by random sampling. We must respect the time order of the series: the training data should always come before the test data.
In this example of Esalq hydrous ethanol prices we have 856 weeks. We will use the first 700 weeks as the training set and the last 156 weeks (3 years, about 18%) as the test set:
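A minimal sketch of such a chronological split is shown below. The `chronological_split` helper and the placeholder values are assumptions for illustration only; the real series is not reproduced here.

```python
def chronological_split(values, n_train):
    """Split a time-ordered sequence without shuffling.

    The first n_train observations form the training set,
    everything after them forms the test set.
    """
    return values[:n_train], values[n_train:]

# Hypothetical stand-in for the 856 weekly prices.
prices = [float(i) for i in range(856)]

train, test = chronological_split(prices, 700)
test_share = len(test) / len(prices)
print(len(train), len(test), round(test_share, 3))
```

Because the split is a plain slice, every training observation is older than every test observation, which is the property that random sampling would break.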
From now on we will use only the training set for our studies; the test set will be used only to validate the predictions we make.
Every time series can be broken down into 3 parts: trend, seasonality and residuals (what remains after removing the first two parts from the series). The separation of these parts is shown below:
Clearly the series has an uptrend, with peaks between the end and the beginning of each year and minimums between April and September (the beginning of the sugarcane crushing season in the center-south of Brazil).
First, we will use the Dickey-Fuller test. Its null hypothesis is that the series is non-stationary, and I will use a significance level of 5%: if the P-value is below 5%, the series is considered statistically stationary.
In addition, there is the test statistic, which can be compared with the critical values at 1%, 5% and 10%: if the test statistic is below the chosen critical value, the series is considered stationary:
In this case, the Dickey-Fuller test indicated that the series is not stationary (the P-value is 36% and the test statistic is above the 5% critical value).
Now we are going to analyze the series with the KPSS test. Unlike the Dickey-Fuller test, the KPSS test takes stationarity as its null hypothesis, and the series is considered non-stationary only if the P-value is less than 5% or the test statistic is greater than the chosen critical value:
Confirming the Dickey-Fuller test, the KPSS test also shows that the series is not stationary, because the P-value is 1% and the test statistic is above every critical value.
Next I will demonstrate ways to make a series stationary.
Making the series stationary
Differencing (called differentiation throughout this post) is used to remove trend signals and also to reduce the variance. It is simply the difference between the value of period T and the value of the previous period T-1.
To make it easier to understand, below we take only a fraction of the ethanol prices for better visualization. Note that from May/2005 prices start rising until mid-May/2006; these weekly rises accumulate, creating an uptrend. In this case, we have a non-stationary series.
When the first differentiation is made (graph below), we remove the cumulative effect of the series and show only the variation of period T against period T-1 throughout the whole series. So if the price two periods ago was BRL 800.00 and then changed to BRL 850.00, the value of the differentiation is BRL 50.00, and if today’s value is BRL 860.00, the difference is BRL 10.00.
Normally only one differentiation is necessary to make a series stationary, but if necessary a second differentiation can be applied. In that case, the differentiation is made on the values of the first differentiation (cases that need more than 2 differentiations are rare).
Using the same example, to make a second differentiation we take the first differentiation at T minus the first differentiation at T-1: BRL 2.9 - BRL 5.5 = -BRL 2.6, and so on.
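The arithmetic can be sketched in a few lines of plain Python. The `difference` helper and the four prices are illustrative values of my own, not figures from the dataset.

```python
def difference(values):
    """First difference: value at period T minus value at period T-1."""
    return [curr - prev for prev, curr in zip(values, values[1:])]

# Illustrative prices in BRL.
prices = [800.0, 850.0, 860.0, 855.0]

first = difference(prices)   # change from one period to the next
second = difference(first)   # change of the change
print(first)
print(second)
```

Each round of differencing shortens the series by one observation, which is worth remembering when aligning the result with the original dates.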
Let’s run the Dickey-Fuller test to see whether the series becomes stationary after the first differentiation:
In this case we confirm that the series is stationary: the P-value is practically zero, and the test statistic is far below the critical values.
In the next example we will try to make the series stationary using an inflation adjustment.
Prices are relative to the time at which they were traded. In 2002 the price of ethanol was BRL 680.00; if the product were traded at this price nowadays, many mills would certainly be closed, as it’s a very low price.
To try to make the series stationary, I will adjust the whole series to current values using the IPCA index (the Brazilian CPI), accumulated from the end of the training period (Apr/2016) back to the beginning of the study. The source of the data is the IBGE website.
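One possible way to do this kind of adjustment is sketched below. The `adjust_to_last_period` helper, the prices and the inflation rates are hypothetical (they are not real IPCA figures); the idea is simply to accumulate inflation backwards from the last period.

```python
def adjust_to_last_period(prices, inflation_rates):
    """Express every price at the price level of the last period.

    inflation_rates[t] is the inflation (as a fraction) between period t-1
    and period t, so inflation_rates[0] is never used.
    """
    adjusted = []
    factor = 1.0
    # Walk backwards, accumulating the inflation between each period and the last one.
    for t in range(len(prices) - 1, -1, -1):
        adjusted.append(prices[t] * factor)
        factor *= 1.0 + inflation_rates[t]
    return adjusted[::-1]

# Hypothetical prices (BRL) and per-period inflation.
prices = [680.0, 700.0, 750.0]
inflation = [0.0, 0.10, 0.05]

adjusted = adjust_to_last_period(prices, inflation)
print([round(p, 2) for p in adjusted])
```

The last price is left untouched, while older prices are scaled up by all the inflation that happened after them.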
Now let’s see how the series looks and whether it has become stationary.
As can be seen, the uptrend has disappeared, with only the seasonal oscillations remaining. The Dickey-Fuller test also confirms that the series is now stationary.
Just out of curiosity, see below the graph of the inflation-adjusted price against the original series.
The logarithm is usually used to transform series with exponential growth into series with more linear growth. In this example we will use the natural logarithm (ln), whose base is e ≈ 2.718; this type of logarithm is widely used in economic models.
The difference between consecutive values transformed with ln is approximately equal to the percentage variation of the values of the original series, which makes it a valid basis for reducing the variance in series whose price levels change over time. See the example below:
Suppose a product had a price increase in 2000 from BRL 50.00 to BRL 52.50, and some years later (2019), when the price was already BRL 100.00, it changed to BRL 105.00. The absolute differences between the prices are BRL 2.50 and BRL 5.00 respectively, but the percentage difference is 5% in both cases.
Applying ln to these prices we have: ln(52.50) - ln(50.00) = 3.9608 - 3.9120 = 0.0488, or about 4.9%. In the same way, for the second pair of prices we have: ln(105) - ln(100) = 4.6540 - 4.6052 = 0.0488, again about 4.9%, because both differences are equal to ln(1.05).
In this example, the log transform reduces the variation of the values by bringing everything to the same basis.
Below is the same example:
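A small sketch of this comparison using only the standard library; the two price pairs are the ones from the text, and the variable names are my own.

```python
import math

# (old price, new price) pairs in BRL.
pairs = [(50.00, 52.50), (100.00, 105.00)]

rows = []
for old, new in pairs:
    pct_change = (new - old) / old
    log_diff = math.log(new) - math.log(old)
    rows.append((pct_change, log_diff))
    print(f"{old:.2f} -> {new:.2f}: pct change={pct_change:.4f}, ln difference={log_diff:.4f}")
```

The absolute changes differ, but the ln difference is the same for both pairs and sits close to the 5% percentage change.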
Below is a table comparing the percentage variation of X with the variation of ln(X):
Let’s plot a comparison between the original series and the series with the ln transform:
Box-Cox Transformation (Power Transform)
The Box-Cox transformation is also a way to transform a series; the lambda (λ) value is the parameter that controls the transformation.
In short, this function is a family of power transformations, where we search for the value of lambda that transforms the series so that its distribution is as close as possible to a normal (Gaussian) distribution. A condition for using this transformation is that the series has only positive values. The formula is: y(λ) = (y^λ - 1) / λ when λ ≠ 0, and y(λ) = ln(y) when λ = 0.
Below I will plot the original series with its distribution and then the transformed series, using the optimal value of lambda, with its new distribution. To find the value of lambda we will use the boxcox function from the SciPy library, which returns the transformed series and the ideal lambda:
Below is an interactive chart where you can change the lambda value and check the change in the chart:
This tool is usually used to improve the performance of the model, since it brings the data closer to a normal distribution. Remember that after the model has made its predictions, you must return to the original scale by inverting the transformation: y = (λ · y(λ) + 1)^(1/λ) when λ ≠ 0, and y = exp(y(λ)) when λ = 0.
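To show the round trip, here is a minimal sketch of the standard Box-Cox transform and its inverse written by hand for a fixed lambda. The helper names, the prices and the lambda of 0.5 are arbitrary choices of mine; this sketch does not search for the optimal lambda, which is what SciPy's boxcox does.

```python
import math

def boxcox_transform(y, lam):
    """Box-Cox transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def boxcox_inverse(z, lam):
    """Bring a transformed value back to the original scale."""
    return math.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

prices = [680.0, 850.0, 1200.0]  # illustrative prices in BRL
lam = 0.5                        # arbitrary lambda, not an optimized one

transformed = [boxcox_transform(p, lam) for p in prices]
restored = [boxcox_inverse(z, lam) for z in transformed]
print([round(z, 3) for z in transformed])
print([round(p, 3) for p in restored])
```

In a forecasting workflow the model would be fitted on the transformed values, and `boxcox_inverse` would be applied to its predictions before computing errors in BRL/m3.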
Looking for correlated lags
To be predictable, a series with a single variable must have autocorrelation, that is, the current period must be explainable by an earlier period (a lag).
As this series has weekly periods and 1 year is approximately 52 weeks, I will plot the autocorrelation function for 60 lags to check the correlation of the current period with these lags.
Analyzing the autocorrelation chart above, it seems that all lags could be used to create forecasts, since they have a positive correlation close to 1 and are also outside the confidence interval. However, this slowly decaying pattern is characteristic of a non-stationary series.
Another very important function is the partial autocorrelation function, where the effect of the intermediate lags is removed and only the direct effect of the analyzed lag on the current period remains. For instance, the partial autocorrelation of the fourth lag removes the effects of the first, second and third lags.
Below is the partial autocorrelation graph:
As can be seen, almost no lag has an effect on the current period but, as demonstrated earlier, the series without differentiation is not stationary. We will now plot these two functions for the series with one differentiation to see how they change:
The autocorrelation plot changed significantly, showing that the series has a significant correlation only at the first lag and a seasonal effect with negative correlation around the 26th lag (26 weeks, or half a year).
To create forecasts, we must pay attention to an extremely important detail about finding correlated lags: there should be a reason behind the correlation. If there is no logical reason, the correlation may be due to chance and may disappear when you include more data.
Another important point is that the autocorrelation and partial autocorrelation graphs are very sensitive to outliers, so it’s important to analyze the time series itself and compare it with the two autocorrelation charts.
In this example the first lag has a high correlation with the current period, since prices historically do not vary significantly from one week to the next. Likewise, the 26th lag shows a negative correlation, indicating a tendency opposite to the current period, probably due to the different periods of supply and demand over the course of a year.
As the inflation-adjusted series has become stationary, we will use it to create our forecasts. Below are the autocorrelation and partial autocorrelation graphs of the adjusted series:
We will use only the first two lags as predictors for the auto-regressive model.
For more information, Duke University professor Robert Nau’s website is one of the best related to this subject.
Metrics to evaluate the model
To analyze whether the forecasts are close to the actual values, we must measure the error. The error (or residual) in this case is basically Y_real - Y_pred.
The error on the training data is evaluated to verify whether the model fits well, and the model is then validated by checking the error on the test data (data that was not “seen” by the model).
Comparing the error on the training data with the error on the test data is very important to verify whether your model is overfitting or underfitting.
Below are the key metrics used to evaluate time series models:
MEAN FORECAST ERROR — (BIAS)
It’s nothing more than the average of the errors of the evaluated series, so the value can be positive or negative. This metric indicates whether the model tends to make predictions above the real value (negative errors) or below the real value (positive errors), so it can also be said that the mean forecast error is the bias of the model.
MAE — MEAN ABSOLUTE ERROR
This metric is very similar to the mean forecast error above; the only difference is that negative errors are converted to positive values (the absolute value is taken) before the mean is calculated.
This metric is widely used in time series, since there are cases where the negative errors cancel the positive errors and give the impression that the model is accurate. This doesn’t happen with the MAE, because it shows how far the forecast is from the real values, regardless of whether it is above or below. See the case below:
MSE — MEAN SQUARED ERROR
This metric places more weight on larger errors because each individual error value is squared and then the mean is calculated. Thus, this metric is very sensitive to outliers and puts a lot of weight on predictions with more significant errors.
Unlike the MAE and MFE, the MSE values are in squared units rather than the units of the model.
RMSE — ROOT MEAN SQUARED ERROR
This metric is simply the square root of the MSE, which brings the error back to the unit of measure of the model (BRL/m3). It is widely used in time series because it’s more sensitive to bigger errors, due to the squaring step in its calculation.
MAPE — MEAN ABSOLUTE PERCENTAGE ERROR
This is another interesting metric, generally used in management reports because the error is measured in percentage terms, so the error of a product X can be compared with the error of a product Y.
The calculation of this metric takes the absolute value of the error divided by the actual value, and then the mean is calculated:
Let’s create a function to evaluate the errors of training and test data with several evaluation metrics:
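A minimal sketch of such a function in plain Python is shown below. The name `evaluate_forecast` and the toy values are my own assumptions; the error follows the convention used above (real minus predicted).

```python
import math

def evaluate_forecast(actual, predicted):
    """Return bias, MAE, MSE, RMSE and MAPE for two equally long sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]  # Y_real - Y_pred
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "bias": sum(errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        # MAPE assumes no actual value is zero.
        "mape": sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n,
    }

# Toy example: one forecast too high, one too low, one exact.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]

metrics = evaluate_forecast(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the bias is zero because the two errors cancel out, while the MAE and RMSE still show that the forecasts are off, which is exactly why the bias should not be read alone.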
Checking the residual values
It’s not enough to create the model and check the error values according to the chosen metric; you must also analyze the characteristics of the residuals themselves. There are cases where the model cannot capture all the information needed to make a good forecast, so the error still contains information that should be used to improve the forecast.
To check the residuals we will look at:
- Actual vs. predicted values (sequential chart);
- Residuals vs. predicted values (scatter plot):
It is very important to analyze this graph, since in it we can spot patterns that tell us whether some modification is needed in the model. Ideally, the errors are scattered evenly around zero along the whole range of predicted values.
- QQ plot of the residuals (scatter plot):
In short, this graph shows where the residuals should theoretically fall if they followed a Gaussian distribution, versus where they actually are.
- Residual autocorrelation (sequential chart):
Here no value should fall outside the confidence interval; otherwise the model is leaving information unused.
We need to create another function to plot these graphs:
Most basic ways to make a forecast
From now on we will create some models to forecast the price of hydrous ethanol. Below are the steps that we will follow for each model:
- Create prediction on the training data and subsequently validate on the test data;
- Check the error of each model according to the metrics mentioned above;
- Plot the model with the residual comparatives.
Let’s go to the models:
The simplest way to make a forecast is to use the value of the previous period. In some cases this is the best approach available, with a lower error than other forecasting methodologies.
Generally, this methodology doesn’t work well for predicting many periods ahead, as the errors tend to increase in relation to the real values.
Many people also use this approach as a baseline to be improved on with more complex models.
Below we will use the training and test data to make the simulations:
The QQ chart shows that some residuals are larger (up and down) than they theoretically should be; these are the so-called outliers. There is also still a significant autocorrelation at the first, sixth and seventh lags, which could be used to improve the model.
In the same way, we will now make the forecast on the test data. The first value of the predicted series will be the last value of the training data; after that, each prediction is updated step by step with the actual value of the test data from the previous period, and so on:
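The walk-forward logic of this simple model can be sketched as follows. The `naive_forecast` helper and the prices are illustrative assumptions of mine.

```python
def naive_forecast(train, test):
    """One-step-ahead naive forecast over the test period.

    The first prediction is the last training value; every later prediction
    is the actual test value of the previous period.
    """
    return [train[-1]] + list(test[:-1])

# Illustrative prices in BRL.
train = [800.0, 850.0, 860.0]
test = [855.0, 870.0, 865.0]

preds = naive_forecast(train, test)
errors = [actual - pred for actual, pred in zip(test, preds)]
print(preds)
print(errors)
```

The prediction for each test period only ever uses values that were already known at that point, so no future information leaks into the forecast.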
The RMSE and MAE errors were similar to those of the training data. In the QQ chart the residuals are more in line with what they theoretically should be, probably because there are fewer sample values than in the training data.
In the chart comparing the residuals with the predicted values, there is a tendency for the errors to increase in absolute value when prices increase; perhaps a logarithmic adjustment would reduce this error expansion. Finally, the residual correlation graph shows that there is still room for improvement, as there is a strong correlation at the first lag, so a regression based on the first lag could probably be added to improve the predictions. The next model is the simple average:
Another way to make predictions is to use the mean of the series. This form of forecasting usually works well when the values oscillate closely around the mean, with constant variance and no uptrend or downtrend, but better methods exist that can, for example, make use of seasonal patterns.
This model uses the mean from the beginning of the data up to the previous period, expanding by one period at a time until the end of the data. In the end, the line tends to become flat. We will now compare the error of this model with the first model:
On the test data, I will keep using the mean from the beginning of the training data and expand it with the values added from the test data:
The simple mean model failed to capture relevant information from the series, as can be seen in the Real vs. Forecast graph and also in the correlation and Residual vs. Predicted graphs.
Simple Moving Average:
The moving average is an average calculated over a fixed window (5 periods, for example) that moves along the series, so it is always calculated from the most recent window. In this case, we will always use the average of the last 5 periods to predict the value of the next period.
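A minimal sketch of this forecast in plain Python; the `moving_average_forecast` helper and the sample values are my own illustration.

```python
def moving_average_forecast(values, window=5):
    """Forecast each point as the mean of the previous `window` actual values.

    Positions with fewer than `window` past values get None,
    since no forecast can be made there.
    """
    preds = []
    for t in range(len(values)):
        if t < window:
            preds.append(None)
        else:
            preds.append(sum(values[t - window:t]) / window)
    return preds

values = [10.0, 12.0, 11.0, 13.0, 14.0, 15.0, 16.0]
preds = moving_average_forecast(values, window=5)
print(preds)
```

Note that the window ends at the previous period, so the value being predicted is never part of its own average.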
The error was lower than that of the simple average, but still above the simple model. Below is the model on the test data:
As with the training data, the moving average model is better than the simple average, but it still does not beat the simple model.
The residuals are autocorrelated at two lags, and the error has a very high variance in relation to the predicted values.
Exponential Moving Average:
The simple moving average model described above treats the last X observations equally and completely ignores all previous observations. Intuitively, past data should be discounted more gradually: the most recent observation should be slightly more important than the second most recent, the second most recent should be a little more important than the third most recent, and so on. The Exponential Moving Average (EMA) model does this.
With α (alpha) being a constant between 0 and 1, we calculate the forecast with the following formula: F(t) = F(t-1) + α · (Y(t-1) - F(t-1)), where F is the forecast and Y is the actual value.
The first value of the forecast is the respective actual value, and the other values are updated by α times the difference between the actual value and the forecast of the previous period. When α is zero, we have a constant equal to the first value of the forecast; when α is 1, we have the simple approach, because the result is the actual value of the previous period.
Below is a chart with several values of α:
The average age of the data in the EMA forecast is 1/α. For example, when α = 0.5 the lag is equivalent to 2 periods; when α = 0.2 the lag is 5 periods; when α = 0.1 the lag is 10 periods, and so on.
In this model, we will arbitrarily use an α of 0.50, but you can do a grid search to look for the α that reduces the error in training and also in validation. Let’s see how it looks:
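The update rule described above can be sketched as follows, together with a very small grid search over α. The helper names, the sample values and the grid are assumptions for illustration only.

```python
def ema_forecast(values, alpha):
    """Exponential moving average forecast.

    The first forecast equals the first actual value; each later forecast is the
    previous forecast plus alpha times the previous period's forecast error.
    """
    preds = [values[0]]
    for t in range(1, len(values)):
        preds.append(preds[-1] + alpha * (values[t - 1] - preds[-1]))
    return preds

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

values = [100.0, 110.0, 105.0, 120.0]  # illustrative prices

half = ema_forecast(values, 0.5)
flat = ema_forecast(values, 0.0)   # constant at the first value
naive = ema_forecast(values, 1.0)  # previous actual value

# Tiny grid search: pick the alpha with the lowest MAE on this data.
grid = [i / 10 for i in range(1, 11)]
best_alpha = min(grid, key=lambda a: mean_absolute_error(values, ema_forecast(values, a)))
print(half, flat, naive, best_alpha)
```

In practice the grid search would be scored on the training data and then confirmed on the validation data, rather than on four points as in this toy example.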
The error of this model was similar to the error of the moving average; however, we still have to validate the model on the test data:
On the validation data, the error is so far the second best among the models we have trained, but the residual graphs are very similar to those of the 5-period moving average model.
An auto-regressive model is basically a linear regression on the significantly correlated lags. The autocorrelation and partial autocorrelation charts should be plotted first to check whether there is anything relevant.
Below are the autocorrelation and partial autocorrelation charts of the training series, which show the signature of an auto-regressive model with 2 significantly correlated lags:
Below we will create the model based on the training data and, after obtaining its coefficients, multiply them by the lagged actual values as they become available in the test data:
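As a rough sketch of this idea, the snippet below fits a 2-lag auto-regressive model by ordinary least squares with NumPy and then forecasts the test period one step at a time. The simulated series, its coefficients and the helper names `fit_ar` and `predict_one_step` are assumptions of mine; this is not necessarily how the model in this post was fitted.

```python
import numpy as np

def fit_ar(series, n_lags=2):
    """Least-squares fit of y[t] = c + a1*y[t-1] + ... + ap*y[t-p].

    Returns the coefficients as [c, a1, ..., ap].
    """
    y = np.asarray(series, dtype=float)
    # Column k holds the values lagged by k+1 periods, aligned with y[n_lags:].
    lagged = [y[n_lags - k - 1: len(y) - k - 1] for k in range(n_lags)]
    X = np.column_stack([np.ones(len(y) - n_lags)] + lagged)
    coefs, *_ = np.linalg.lstsq(X, y[n_lags:], rcond=None)
    return coefs

def predict_one_step(coefs, history):
    """Predict the next value from the most recent actual values (oldest first)."""
    n_lags = len(coefs) - 1
    recent = history[-n_lags:][::-1]  # [y[t-1], y[t-2], ...]
    return float(coefs[0] + np.dot(coefs[1:], recent))

# Hypothetical AR(2) series: y[t] = 10 + 0.6*y[t-1] + 0.2*y[t-2] + noise.
rng = np.random.default_rng(42)
y = [50.0, 50.0]
for _ in range(3000):
    y.append(10.0 + 0.6 * y[-1] + 0.2 * y[-2] + rng.normal())

train, test = y[:2500], y[2500:]
coefs = fit_ar(train, n_lags=2)

# Walk forward: predict, then reveal the actual value before the next step.
history = list(train)
preds = []
for actual in test:
    preds.append(predict_one_step(coefs, history))
    history.append(actual)

print(np.round(coefs, 3))
```

The coefficients are estimated once on the training data and kept fixed; only the lagged actual values change as the forecast walks through the test set.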
In this model the error was the lowest of all the models that we trained. Now let’s use its coefficients to make the step-by-step forecast on the test data:
Note that on the test data the error did not remain stable and was even worse than that of the simple model. Note in the chart that the forecasts are almost always below the actual values; the bias measurement shows that the real values are on average BRL 50.19 above the predictions. Tuning some parameters of the model might reduce this difference.
To improve these models you can apply several transformations, such as those explained in this post, and you can also add external variables as predictors; however, that is a subject for another post.
Each time series model has its own characteristics and should be analyzed individually, so that we can extract as much information as possible to make good predictions and reduce the uncertainty about the future.
Checking for stationarity, transforming the data, creating the model on the training data, validating it on the test data and checking the residuals are key steps to create a good time series forecast.
See also my post related to ARIMA models.
I hope you liked this post. If you have any questions or need any information, don’t hesitate to contact me on LinkedIn or here in the comments.
Basic Principles to Create a Time Series Forecast was originally published in Towards Data Science on Medium.