If history repeats itself, the Fourier Transform is the key
Is it possible to clean your data and forecast new values using only a Fourier Transform?
Image by Jon Tyson https://unsplash.com/@jontyson
One of the sentences my professor used to say in high school was “History repeats itself”. The point of this sentence is obviously that we should learn from history, so that we don’t make the same mistakes we made in the past.
Now let’s talk about science. If you have a time series, then you have data recorded over a (preferably) long time. Let’s assume for a second that history actually does repeat itself. That would mean that by simply replicating the signal you could extend your data, obtaining a dataset twice as long. I know what you’re thinking, and you are right, I’m fooling you. What I’m essentially describing in these few lines is the idea behind the Fourier Transform: a periodic signal can be seen as a sum of sines and cosines. Once you have the amplitude and the phase of each frequency, you have your signal, and you can easily extend it.
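To make the “amplitude and phase” idea concrete, here is a minimal sketch in Python with numpy. The signal is synthetic and exactly periodic (an assumption made purely for illustration, it is not the load data), and `evaluate` is a hypothetical helper name. The sketch reads the amplitude and phase of each frequency, rebuilds the signal as a sum of cosines, and evaluates that sum over twice the original time span.

```python
import numpy as np

# Synthetic, exactly periodic signal (assumption): one period of 64 samples.
n = 64
t = np.arange(n)
signal = 2.0 * np.sin(2 * np.pi * 3 * t / n) + 0.5 * np.cos(2 * np.pi * 7 * t / n)

# One-sided spectrum of the real signal: amplitude and phase per frequency.
spectrum = np.fft.rfft(signal)
amplitude = np.abs(spectrum)
phase = np.angle(spectrum)


def evaluate(t_new):
    """Evaluate the sum of cosines at the given integer time steps."""
    k = np.arange(len(spectrum))
    # Every frequency counts twice in a one-sided spectrum, except the
    # constant term and (for even n) the Nyquist term.
    weights = np.full(len(spectrum), 2.0)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[-1] = 1.0
    angles = 2 * np.pi * np.outer(t_new, k) / n + phase
    return (weights * amplitude * np.cos(angles)).sum(axis=1) / n


# Evaluate over two periods: the second half repeats the first.
extended = evaluate(np.arange(2 * n))
print(np.allclose(extended, np.tile(signal, 2)))
```

The extension is only as good as the periodicity assumption: the sum of cosines simply repeats whatever it saw in the first period.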
I have good news and bad news. The bad news is that you can’t pretend that your data are periodic unless you have a reason to know that they actually are. The good news is that sometimes your data do show a semi-periodic pattern, and in those cases the Fourier Transform is worth a shot.
I know, I talk a lot. Let’s get practical.
Let’s take the Polish electrical load as an example and take a look at the data from 2008 to 2016:
Each year is incredibly easy to distinguish, and if we zoom in, so is each month and each week.
In cases like this, an approach as simple as the Fourier Transform could be an option. The idea behind applying it is that our signal is actually periodic, but it does not seem periodic because of noise, a general linear trend, or other unwanted effects that make it non-stationary. Let’s move forward!
The signal has a linear trend, and its mean is not 0. This can distort the Fourier spectrum: in general we don’t want the linear trend to show up as a low-frequency (long-period) component of the spectrum. Using the magical scipy library, the signal has been detrended and shifted so that its mean is 0.
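The exact calls are not shown here, so the following is only a sketch of one way to do this step, assuming `scipy.signal.detrend` with a linear fit. The `load` array is synthetic stand-in data, not the Polish load. A linear detrend subtracts the least-squares line (slope and intercept), so it removes the trend and sets the mean to 0 in one go.

```python
import numpy as np
from scipy.signal import detrend

# Synthetic stand-in for a daily load series (assumption): offset + linear
# trend + yearly cycle + noise.
rng = np.random.default_rng(0)
t = np.arange(3 * 365)
load = 100.0 + 0.02 * t + 10.0 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 1, t.size)

# Subtract the least-squares straight line from the signal.
detrended = detrend(load, type="linear")

print(f"mean before: {load.mean():.2f}, mean after: {detrended.mean():.2e}")
```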
Ok, we are ready.
Borrowing the Machine Learning terminology, the dataset is divided into three parts:
- Training set: Where the Fourier Transform is applied (From 2008 to 2014)
- Validation set: Where the best model is fitted (2015)
- Test set: Where the goodness of the model is revealed. (2016)
If you are an ML specialist, please don’t take the nomenclature too seriously, as it doesn’t match what you know very well. No offence. Names are names. :)
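The three-way split can be sketched as follows. This assumes one sample per day (the sampling rate is an assumption) and uses a synthetic `load` array as a placeholder for the real series.

```python
import numpy as np

# One timestamp per day from 2008-01-01 to 2016-12-31 (daily sampling is an assumption).
dates = np.arange("2008-01-01", "2017-01-01", dtype="datetime64[D]")
years = dates.astype("datetime64[Y]").astype(int) + 1970

# Placeholder values standing in for the real load series.
rng = np.random.default_rng(0)
load = rng.normal(0.0, 1.0, dates.size)

train = load[years <= 2014]       # 2008 to 2014: where the Fourier Transform is applied
validation = load[years == 2015]  # 2015: where the best model is chosen
test = load[years == 2016]        # 2016: where the model is evaluated

print(train.size, validation.size, test.size)
```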
Of course, if you naively apply the Fourier Transform you can’t expect good results. That’s why the algorithm is slightly (I hope) more complicated.
Divide et impera (Training set)
The Fourier transform assumes that the data are stationary, but life changes, and love stories end </3.
That’s why each year of the training set has to be taken on its own. Each year gives its own Fourier transform.
A mean Fourier transform (F) is obtained by summing the yearly Fourier transforms and dividing by 7 (there are 7 years in the training set).
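Here is a minimal sketch of the averaging step. It makes several assumptions: daily data, every year cut to 365 samples so the spectra line up, complex spectra (not just amplitudes) being averaged, and `np.fft.rfft` as the transform, which keeps only the non-redundant half of the spectrum of a real signal. The `years` array is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, n_days = 7, 365
day = np.arange(n_days)

# Synthetic training years (assumption): a shared yearly + weekly pattern,
# with different noise in each year. Shape: (7, 365).
pattern = 10 * np.sin(2 * np.pi * day / n_days) + 3 * np.sin(2 * np.pi * day / 7)
years = pattern + rng.normal(0, 1, (n_years, n_days))

# One Fourier transform per year, then the mean over the 7 years.
yearly_spectra = np.fft.rfft(years, axis=1)
mean_spectrum = yearly_spectra.sum(axis=0) / n_years

print(yearly_spectra.shape, mean_spectrum.shape)
```

Because the Fourier transform is linear, averaging the complex spectra this way gives the same result as transforming the average year.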
Cut the noise out (Validation Set)
Now let’s consider the validation set signal (the 2015 load). The idea is to take the Fourier Transform of the training set, transform it back, and compare the reconstructed signal with the validation one. The inverse Fourier transform is of course computed with numpy’s FFT routines.
The problem is that numpy’s FFT algorithm is so good that it reproduces the signal extremely well, maybe even too well. In fact, we want the noise to be left out of the reconstructed signal. But what actually is noise?
Nobody can really answer this question, but a safe guideline is that the error (the difference between the original signal, i.e. the 2015 load, and the reconstructed one) should be as uncorrelated with the signal as possible. Of course, on the other hand, a signal that is 0 everywhere is uncorrelated with the original one, but it has no predictive ability. In short, this is what we want from our model:
- Acceptably low RMSE between the original signal and the reconstructed one: if the RMSE is too high, the prediction is not doing its job.
- Low correlation between the error and the original signal: if the correlation is too high, our model is overfitting the data, and there will be unpleasant surprises at test time.
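The two criteria above can be sketched as two small functions. The names `rmse` and `error_correlation` are hypothetical, the correlation is assumed to be the Pearson coefficient (the article does not say which one is used), and the signals are synthetic.

```python
import numpy as np


def rmse(original, reconstructed):
    """Root mean squared error between the two signals."""
    return float(np.sqrt(np.mean((original - reconstructed) ** 2)))


def error_correlation(original, reconstructed):
    """Pearson correlation between the error and the original signal."""
    error = original - reconstructed
    return float(np.corrcoef(error, original)[0, 1])


# Synthetic example (assumption): the "reconstruction" is the clean pattern,
# the "original" is the same pattern plus noise.
rng = np.random.default_rng(2)
day = np.arange(365)
reconstructed = 10 * np.sin(2 * np.pi * day / 365)
original = reconstructed + rng.normal(0, 1, day.size)

print(f"RMSE: {rmse(original, reconstructed):.2f}")
print(f"correlation(error, original): {error_correlation(original, reconstructed):.2f}")
```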
To get there, we invoke a magic wizard called threshold. The threshold t is a value between 0 and 1 that is multiplied by the highest value of the mean spectrum of the training set (F). All frequency components whose amplitude is less than t * max(F) are set to 0, the signal is then transformed back to the original space, and both the RMSE and the correlation are computed. The best threshold is the one that gives an RMSE and a correlation that are both as low as possible.
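A sketch of the threshold search follows, under a few assumptions: the data are synthetic, `filter_spectrum` and `sweep_thresholds` are hypothetical names, “max(F)” is read as the largest amplitude |F|, and `np.fft.rfft`/`np.fft.irfft` are used for the real-valued signal. No rule is given here for combining the two criteria into a single score, so the sketch only tabulates both and leaves the choice to inspection.

```python
import numpy as np


def filter_spectrum(spectrum, t):
    """Zero every component whose amplitude is below t * (largest amplitude)."""
    amplitude = np.abs(spectrum)
    return np.where(amplitude >= t * amplitude.max(), spectrum, 0.0)


def sweep_thresholds(mean_spectrum, validation, thresholds):
    """Return one (t, RMSE, correlation) row per candidate threshold."""
    rows = []
    for t in thresholds:
        filtered = filter_spectrum(mean_spectrum, t)
        reconstructed = np.fft.irfft(filtered, n=validation.size)
        error = validation - reconstructed
        rmse = np.sqrt(np.mean(error ** 2))
        corr = np.corrcoef(error, validation)[0, 1]
        rows.append((float(t), float(rmse), float(corr)))
    return rows


# Synthetic stand-in data (assumption): 7 training years and 1 validation year.
rng = np.random.default_rng(3)
n_days = 365
day = np.arange(n_days)
pattern = 10 * np.sin(2 * np.pi * day / n_days) + 3 * np.sin(2 * np.pi * day / 7)
training_years = pattern + rng.normal(0, 1, (7, n_days))
validation = pattern + rng.normal(0, 1, n_days)
mean_spectrum = np.fft.rfft(training_years, axis=1).mean(axis=0)

thresholds = np.linspace(0.0, 1.0, 11)
for t, rmse, corr in sweep_thresholds(mean_spectrum, validation, thresholds):
    print(f"t={t:.1f}  RMSE={rmse:.2f}  corr={corr:.2f}")
```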
Test prediction (Test set)
This filtered Fourier spectrum has been used to predict the first 10 to 90 days of the 2016 Load, and the results are actually surprising.
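As a rough sketch of how such a forecast could be scored, assume the filtered spectrum is transformed back into one “typical year” and its first days are compared with the start of the test year. The data below are synthetic and the threshold value `t = 0.1` is a hypothetical placeholder, so the numbers it prints say nothing about the real 2016 results.

```python
import numpy as np

# Synthetic stand-in data (assumption): 7 training years and 1 test year.
rng = np.random.default_rng(4)
n_days = 365
day = np.arange(n_days)
pattern = 10 * np.sin(2 * np.pi * day / n_days) + 3 * np.sin(2 * np.pi * day / 7)
training_years = pattern + rng.normal(0, 1, (7, n_days))
test_year = pattern + rng.normal(0, 1, n_days)

# Mean training spectrum, filtered with a hypothetical threshold.
mean_spectrum = np.fft.rfft(training_years, axis=1).mean(axis=0)
t = 0.1
amplitude = np.abs(mean_spectrum)
filtered = np.where(amplitude >= t * amplitude.max(), mean_spectrum, 0.0)

# Back to the time domain: the predicted year.
predicted_year = np.fft.irfft(filtered, n=n_days)

# RMSE over the first 10, 30, 60 and 90 days of the test year.
horizon_rmse = {}
for horizon in (10, 30, 60, 90):
    error = test_year[:horizon] - predicted_year[:horizon]
    horizon_rmse[horizon] = float(np.sqrt(np.mean(error ** 2)))

for horizon, value in horizon_rmse.items():
    print(f"first {horizon} days: RMSE = {value:.2f}")
```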
The model is slightly overfitting, but the RMSE is still acceptably low, and the errors are partly due to statistical fluctuations, as can be seen from the fact that the correlation C is < 0.60. Moreover, the predictions are actually really good for the first days of the week, while the weekends seem to be harder to catch (guys, please, stop drinking too much!)
A lot of data science challenges are time series challenges (e.g. global climate reports, stock market analysis, solar cycle forecasting, etc.). Of course there are more powerful methods to model time series (RNN, ARMA, ARIMA, SARIMA, etc.) and get a forecast out of them. Nonetheless, as I hope I’ve managed to show with this report, sometimes a good use of simple math concepts can still give very decent results. Being simple math concepts, they also give you full control of what is happening, and you can tune them based on your own investigation.
By the way, I don’t really know if history repeats itself, but I do think that we are meant to relive some moments in a different form, with a different mindset and a different spirit. As Mark Twain once said:
“History doesn’t repeat itself, but it often rhymes”