We are going to learn how to build a polynomial regression in Python and use it to predict a future value.
The data set was fetched from the INE (the Spanish national statistics institute). It is the EPA (active population survey) series for the national total (Spain), both genders, people aged 16 and over who are unemployed (in thousands).
Example data:

| | label | serie | rate |
|---|---|---|---|
| 0 | 2002T1 | 0 | 2152.8 |
| 1 | 2002T2 | 1 | 2103.3 |
| 2 | 2002T3 | 2 | 2196.0 |
| 3 | 2002T4 | 3 | 2232.4 |
| 4 | 2003T1 | 4 | 2328.5 |
Data CSV can be downloaded here:
https://drive.google.com/file/d/1fwvAZe7lah5DX8-DDEpmfeUDYQhKcfzG/view?usp=sharing
Let's see what that data looks like:
As we can see, the data describes a curve, and that is why we want to use polynomial regression.
To approximate that curve we will use a polynomial of degree 2 or higher, because a degree-1 polynomial is just a straight line, and a straight line will not fit this data well:
As we can see, the linear model does not fit well. More technically:
R2 (coefficient of determination): approx. 0.3961
That R2 is not good (the closer to 1, the better).
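A minimal sketch of that degree-1 baseline (assuming the unemployment.csv from the link above, with serie and rate columns):
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load the EPA data: 'serie' is the quarter index, 'rate' is unemployment in thousands
data = pd.read_csv('unemployment.csv')
X = data[['serie']].values  # 2D array of shape (n_samples, 1)
y = data['rate'].values

# Fit a straight line and measure how well it explains the data
linear = LinearRegression().fit(X, y)
print('R2 (degree 1):', r2_score(y, linear.predict(X)))  # approx. 0.3961 on this data set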
So, how do we approximate it with a polynomial? The answer is polynomial regression.
To go from linear to quadratic we need to add new quadratic terms, and we can add them with scikit-learn's PolynomialFeatures.
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

data = pd.read_csv('unemployment.csv')
X = pd.DataFrame(data, columns=['serie']).values.reshape(-1, 1)  # we need a 2D array
y = pd.DataFrame(data, columns=['rate']).values.reshape(-1, 1)

pf = PolynomialFeatures(degree=2, include_bias=False)
x_transformed = pf.fit_transform(X)
The fit_transform method adds the new quadratic term, which is simply each value of x squared:
From X:
[[0]
 [1]
 [2]
 [3]
 ...]
to a new X:
[[0 0]
 [1 1]
 [2 4]
 [3 9]
 ...]
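A quick way to check this yourself (assuming the X and x_transformed from the snippet above):
print(X[:4].ravel())      # [0 1 2 3]
print(x_transformed[:4])  # [[0. 0.] [1. 1.] [2. 4.] [3. 9.]]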
And now, we just need to fit our regression using the new quadratic x_transformed and see how well it fits:
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(x_transformed, y)
y_predicted = regr.predict(x_transformed)
Well, it seems to fit better now. Let's check it with the coefficient of determination:
R2 (coefficient of determination): approx. 0.6160
As we can see, the fit is better than before (remember, the closer to 1 the better).
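That R2 can be computed with r2_score from sklearn.metrics; a minimal sketch, assuming the y and y_predicted defined above:
from sklearn.metrics import r2_score

# Compare the real series against the degree-2 predictions
print('R2 (degree 2):', r2_score(y, y_predicted))  # approx. 0.6160 on this data set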
Perfect, but what if we try a degree-3 polynomial? Let's check what happens:
R2 (coefficient of determination): approx. 0.8764
Now the fit is considerably better. But can we keep increasing the polynomial degree, looking for the perfect fit?
We can, but at the same time we would be overfitting our model, so it is better to stay at degree 2 or 3.
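To see how the fit improves with the degree (and why a high degree starts to memorise the data), here is a sketch that refits the model for several degrees and prints the R2 of each one, assuming the X and y from the earlier snippets:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

for degree in (1, 2, 3, 6):
    pf_d = PolynomialFeatures(degree=degree, include_bias=False)
    x_d = pf_d.fit_transform(X)
    model = LinearRegression().fit(x_d, y)
    print(f'degree {degree}: R2 = {r2_score(y, model.predict(x_d)):.4f}')
The in-sample R2 only goes up as the degree grows, which is exactly the overfitting trap described above.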
Just for fun, we are going to increase the degree to 6 and try to make a prediction for the next serie (serie 71):
R2 (coefficient of determination): approx. 0.9779
Our degree-6 model is close to perfection on the training data (remember, an overfitted model works great on the data set it was trained on, but not necessarily on future values).
And now, how do we predict the next unemployment figure (the last quarter of 2019)?
First we need to transform the serie we want to predict (71) to add the polynomial terms, and then feed it to the predict method to obtain the forecast:
X_transformed = pf.transform([[71]])  # transform (not re-fit) the new point
regr.predict(X_transformed)[0][0]
That returns a value of 3295.8 (in thousands).
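Here, pf and regr are the degree-6 versions fitted above. A minimal self-contained sketch of that final step (assuming the same unemployment.csv as before) could look like this:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import pandas as pd

data = pd.read_csv('unemployment.csv')
X = data[['serie']].values
y = data[['rate']].values

# Refit with a degree-6 polynomial, as in the text
pf = PolynomialFeatures(degree=6, include_bias=False)
regr = LinearRegression().fit(pf.fit_transform(X), y)

# Transform (not re-fit) the serie we want to predict and get the forecast
print(regr.predict(pf.transform([[71]]))[0][0])  # approx. 3295.8 thousand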
To obtain the last unemployment figure in the data set:
data['rate'].iloc[-1] # that value is 3214.4
That means:
3295.8 - 3214.4 = 81.4 thousand more people unemployed (I hope to be wrong, given the overfitting).
UPDATE:
The EPA for the last quarter has been published, and the figure is 3191.9 thousand people unemployed.
That means that our degree-6 model estimated 103.9 thousand more people than the real published value:
3295.8 - 3191.9 = 103.9 thousand people over.
That's a big deviation.
We need to keep in mind that the unemployment figure is affected by a lot of variables not considered here.
UPDATE 2:
After the Christmas hangover, a period in which employment always drops, we have the unemployment figure for January 2020, a month that gives a more realistic picture: 3253 thousand people.
That's: 3295.8 - 3253.0 = 42.8 thousand people overestimated.
We need to consider that this figure is for January and not for a quarter, so the two are not directly comparable; furthermore, they come from different statistics, so the way each one is calculated is different.
I'm making this comparison just to check against the post-Christmas period, which gives a more realistic unemployment figure.
I will write a new entry on my new blog repeating the same calculations with monthly figures instead of quarterly ones, for a more realistic comparison.
My new blog address is https://www.pylancer.com/blog/ and I hope to see you there.
Regards.
You can download the whole Jupyter notebook here:
https://drive.google.com/file/d/1GkZO3R14zs3uUrK6_6uVRGGO4Z755Nnu/view?usp=sharing