
Polynomial regression using Python

We are going to learn how to build a polynomial regression model and predict a future value using Python.

The data set has been fetched from the INE (Spanish National Statistics Institute). It comes from the EPA (Active Population Survey) and contains the national total of unemployed people in Spain, both genders, aged 16 and over (in thousands).

Example data:

    label  serie    rate
0  2002T1      0  2152.8
1  2002T2      1  2103.3
2  2002T3      2  2196.0
3  2002T4      3  2232.4
4  2003T1      4  2328.5

Data CSV can be downloaded here:
https://drive.google.com/file/d/1fwvAZe7lah5DX8-DDEpmfeUDYQhKcfzG/view?usp=sharing

Let's see what that data looks like:
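The original chart is an image, but a minimal plotting sketch could look like this (matplotlib is an assumption, since the post does not show its plotting code; it assumes the CSV from the link above is saved locally as unemployment.csv):

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('unemployment.csv')

# Plot the unemployment figure against the quarter index
plt.plot(data['serie'], data['rate'])
plt.xlabel('serie (quarter index)')
plt.ylabel('unemployed (thousands)')
plt.title('EPA unemployment, Spain, 16 and over')
plt.show()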

As we can see, the data describes a curve; that is exactly why we want to use polynomial regression.

To approximate that curve we will use a polynomial of degree 2 or higher, because if we try to approximate it with a degree 1 polynomial we are fitting a straight line, and that line will not fit correctly:

As we can see, the linear model doesn't fit correctly. More technically:
R2 (the coefficient of determination): approx 0.3961

That R2 is not good (the closer to 1, the better).
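For reference, here is a minimal sketch of how that linear baseline and its R2 could be computed (the post does not show this code; it reuses the data frame loaded in the plotting sketch above):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = data[['serie']].values  # 2D array of quarter indices, shape (n_samples, 1)
y = data[['rate']].values

linear = LinearRegression()
linear.fit(X, y)
print(f"R2: {r2_score(y, linear.predict(X)):.4f}")  # the post reports approx 0.3961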

So, how do we approximate the data with a polynomial? The answer is polynomial regression.

To go from linear to quadratic we need to add new quadratic terms, and we can add those terms with sklearn's PolynomialFeatures.

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd


data = pd.read_csv('unemployment.csv')

X = data[['serie']].values  # we need a 2D array of shape (n_samples, 1)
y = data[['rate']].values

pf = PolynomialFeatures(degree=2, include_bias=False)
x_transformed = pf.fit_transform(X)  # adds the squared term as a new column

The fit_transform method adds the new quadratic term, which is just each value of x squared:

From X:

[[0]
 [1]
 [2]
 [3]
 ...]

to the new X:

[[0 0]
 [1 1]
 [2 4]
 [3 9]
 ...]

And now, we just need to fit our regression using the new quadratic x_transformed and see how it does:

from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(x_transformed, y)  # fit on the quadratic features
y_predicted = regr.predict(x_transformed)


Well, it seems to fit better now. Let's check with the metric:
R2 (the coefficient of determination): approx 0.6160
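That value can be reproduced with sklearn's r2_score (a sketch; the post does not show the scoring code):

from sklearn.metrics import r2_score

print(f"R2: {r2_score(y, y_predicted):.4f}")  # the post reports approx 0.6160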

And as we can see, the fit is better than before (remember, closer to 1 is better).

Perfect, but what if we try with a degree 3 polynomial? Let's check what happens:
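We just repeat the same steps with degree=3. Here is a sketch wrapped in a small helper (fit_poly is a hypothetical name, not from the original post) so we can reuse it for other degrees:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures


def fit_poly(X, y, degree):
    # Add the polynomial terms, fit the regression and compute R2.
    pf = PolynomialFeatures(degree=degree, include_bias=False)
    x_transformed = pf.fit_transform(X)
    regr = LinearRegression()
    regr.fit(x_transformed, y)
    r2 = r2_score(y, regr.predict(x_transformed))
    return regr, pf, r2


regr, pf, r2 = fit_poly(X, y, degree=3)
print(f"R2: {r2:.4f}")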


R2 (the coefficient of determination): approx 0.8764

Now we have a much better R2. But can we keep increasing the polynomial degree, looking for the perfect fit?

We can, but at the same time we would be overfitting our model, so it is better to stay at degree 2 or 3.

Just for fun, we are going to increase the degree to 6 and try to predict the next value of the series (serie 71):
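Reusing the hypothetical fit_poly helper from the degree 3 sketch above:

regr, pf, r2 = fit_poly(X, y, degree=6)
print(f"R2: {r2:.4f}")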

R2 (the coefficient of determination): approx 0.9779

Our degree 6 polynomial model is close to perfection on paper (remember, we have an overfitted model that will work great on this data set, but not necessarily on future series).



And now, how do we predict the next unemployment figure (the last quarter of 2019)?
First, we need to transform the serie we want to predict (71) to add the polynomial terms, and then feed the result to the predict method to obtain the predicted future unemployment figure:

# pf and regr are the degree 6 transformer and model fitted above
X_transformed = pf.fit_transform([[71]])  # transform() alone would also work, pf is already fitted
regr.predict(X_transformed)[0][0]  # predicted unemployment (in thousands)

That returns the value 3295.8 (in thousands).
To obtain the last unemployment value in the data set:
data['rate'].iloc[-1]  # that value is 3214.4

That means:
3295.8 - 3214.4 = 81.4 (thousands) more people unemployed (I hope to be wrong, given the overfitting).

UPDATE:
The EPA for the last quarter has been published, and the figure for the last quarter is 3191.9 thousand people unemployed.

That means that our model, with a degree 6 polynomial, overestimated the published value by 103.9 thousand people.

3295.8 - 3191.9 = 103.9 thousand people too many.

That's a big deviation.
We need to consider that the unemployment rate is affected by many variables not considered here.

UPDATE 2:
After the Christmas hangover, a period in which there is always less employment, we now have the unemployment figure for January 2020, a month that shows a more realistic unemployment level: 3253.0 (thousands of people).

That's: 3295.8 - 3253.0 = 42.8 thousand people overestimated.

We need to consider that this figure is for January, not for a whole quarter, so the two can't really be compared; furthermore, the figures come from different statistics, so the way each one is calculated is different.

I'm making this comparison only because the period right after Christmas shows a more realistic unemployment level.

I will write a new entry on my new blog to check this data, doing the same calculations with monthly figures instead of quarterly ones, for more realistic results.

My new blog address is https://www.pylancer.com/blog/. I hope to see you there.

Regards.

You can download the whole Jupyter notebook here:
https://drive.google.com/file/d/1GkZO3R14zs3uUrK6_6uVRGGO4Z755Nnu/view?usp=sharing
