
Polynomial regression using Python

We are going to learn how to build a polynomial regression and use it to predict a future value with Python.

The data set comes from the INE (Spain's National Statistics Institute). It is the EPA (Active Population Survey) series of unemployed people aged 16 and over: national total, both genders, in thousands.

Example data:

    label  serie    rate
0  2002T1      0  2152.8
1  2002T2      1  2103.3
2  2002T3      2  2196.0
3  2002T4      3  2232.4
4  2003T1      4  2328.5

The CSV data can be downloaded here:
https://drive.google.com/file/d/1fwvAZe7lah5DX8-DDEpmfeUDYQhKcfzG/view?usp=sharing
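As a minimal sketch of loading data in this shape, the snippet below inlines the five example rows shown above so it runs without the download. It assumes the real file is comma-separated and uses the same headers (label, serie, rate); that layout is a guess based on the example, not something checked against the file.

```python
import io

import pandas as pd

# The five example rows from above, inlined as CSV text.
# Assumption: the downloaded file is comma-separated with these headers.
sample_csv = io.StringIO(
    "label,serie,rate\n"
    "2002T1,0,2152.8\n"
    "2002T2,1,2103.3\n"
    "2002T3,2,2196.0\n"
    "2002T4,3,2232.4\n"
    "2003T1,4,2328.5\n"
)

data = pd.read_csv(sample_csv)

print(data.dtypes)  # label is text, serie is an integer index, rate is a float
print(data.head())
```

With the real file, `io.StringIO(...)` would simply be replaced by the path to the CSV.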

Let's see what the data looks like:

As we can see, the data describes a curve, which is why we want to use a polynomial regression.

To approximate that curve we will use a polynomial of degree 2 or higher. A degree 1 polynomial is a straight line, and a straight line will not fit this data well:

As we can see, a linear model does not fit well. More technically:
R2 (coefficient of determination): approx. 0.3961

That R2 is not good (the closer to 1, the better).
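To make the R2 check concrete, here is a small sketch of fitting a straight line and scoring it. It uses a synthetic rise-and-fall curve as a stand-in, not the EPA series, so the number it prints will differ from the one above; the curve's coefficients and the noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for a curved series (NOT the EPA data):
# 71 points that rise, peak and fall, plus some noise.
rng = np.random.default_rng(0)
x = np.arange(71).reshape(-1, 1)  # 2D array: (n_samples, 1)
y = 2000 + 100 * x.ravel() - 1.1 * x.ravel() ** 2 + rng.normal(0, 50, 71)

# Fit a straight line (degree 1) and predict on the same points.
linear = LinearRegression().fit(x, y)
y_line = linear.predict(x)

# Two equivalent ways to get R2 for a regressor.
r2_from_metric = r2_score(y, y_line)
r2_from_model = linear.score(x, y)

print(f"R2 of the straight line: {r2_from_metric:.4f}")
```

R2 compares the model's squared error with that of always predicting the mean, so a value well below 1 on curved data like this is the signal that a line is not enough.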

So how do we approximate the data with a polynomial? The answer is polynomial regression.

To go from linear to quadratic we need to add quadratic terms, which we can do with sklearn's PolynomialFeatures.

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

data = pd.read_csv('unemployment.csv')

# sklearn expects 2D arrays of shape (n_samples, n_features)
X = data[['serie']].values
y = data[['rate']].values

# degree=2 adds the squared term; include_bias=False omits the constant column
pf = PolynomialFeatures(degree=2, include_bias=False)
x_transformed = pf.fit_transform(X)

fit_transform adds the new quadratic term by appending the square of each value of x as a second column:

From X:
[[ 0] [ 1] [ 2] [ 3] ...]

to a new X:
[[0 0] [1 1] [2 4] [3 9] ...]
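The transformation above can be reproduced on a tiny array. The sketch below also shows degree 3 and the default `include_bias=True`, which are not used at this point in the post and are included only for comparison.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[0], [1], [2], [3]])

# Degree 2 without the bias column: each row becomes [x, x**2].
pf2 = PolynomialFeatures(degree=2, include_bias=False)
X_deg2 = pf2.fit_transform(X_small)
print(X_deg2)

# Degree 3 appends one more column: [x, x**2, x**3].
pf3 = PolynomialFeatures(degree=3, include_bias=False)
X_deg3 = pf3.fit_transform(X_small)
print(X_deg3)

# With the default include_bias=True a leading column of ones is added.
# LinearRegression already fits an intercept, so that column is not needed here.
pf_bias = PolynomialFeatures(degree=2)
X_bias = pf_bias.fit_transform(X_small)
print(X_bias)
```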

And now we just need to fit our regression on the new quadratic x_transformed and see how it fits:
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(x_transformed, y)
y_predicted = regr.predict(x_transformed)


It seems to fit better now. Let's check with R2:
R2 (coefficient of determination): approx. 0.6160

And indeed the fit is better than before (remember, closer to 1 is better).

Good, but what if we try a degree 3 polynomial? Let's check what happens:


R2 or (coefficient of determination): approx 0.8764

Now R2 is much better. But can we keep raising the polynomial degree looking for the perfect fit?

We can, but at the same time we would be overfitting our model, so it is better to stay at degree 2 or 3.
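One way to see the overfitting risk is to hold out the most recent points and compare R2 on the training part with R2 on the held-out part. The sketch below does that on a synthetic curve, not the EPA data. The 10-point hold-out and the MinMaxScaler step are my own assumptions: the scaler only rescales serie to [0, 1] so that high powers stay numerically well behaved.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for the series (NOT the EPA data).
rng = np.random.default_rng(0)
x = np.arange(71).reshape(-1, 1)
y = 2000 + 100 * x.ravel() - 1.1 * x.ravel() ** 2 + rng.normal(0, 50, 71)

# Hold out the last 10 points to mimic predicting the future.
x_train, y_train = x[:-10], y[:-10]
x_test, y_test = x[-10:], y[-10:]

results = {}
for degree in (1, 2, 3, 6):
    model = make_pipeline(
        MinMaxScaler(),  # assumption: rescale x so x**6 stays well conditioned
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(x_train, y_train)
    # Pipeline.score returns the R2 of the final regressor.
    results[degree] = (model.score(x_train, y_train), model.score(x_test, y_test))
    print(f"degree {degree}: train R2 = {results[degree][0]:.4f}, "
          f"held-out R2 = {results[degree][1]:.4f}")
```

Training R2 can only go up with the degree, because each higher-degree model contains the lower ones. The held-out column is the one that shows whether the extra terms actually help on unseen points.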

Just for fun, we are going to increase the degree to 6 and try to predict the next point in the series (serie 71):

R2 or (coefficient of determination): approx 0.9779

Our degree 6 model is close to perfect on this data (remember, it is an overfitted model that works great on the data set but not necessarily on future values).



And now, how do we predict the next unemployment figure (the last quarter of 2019)?
First we transform the serie we want to predict (71) to add the polynomial terms, and then we feed the result to the predict method to obtain the predicted figure:

# pf is already fitted (here with degree=6), so transform is enough
X_transformed = pf.transform([[71]])
regr.predict(X_transformed)[0][0]

That returns 3295.8 (in thousands).
To get the last unemployment value in the data set:
data['rate'].iloc[-1]  # that value is 3214.4

That means:
3295.8 - 3214.4 = 81.4 thousand more people unemployed (I hope to be wrong because of the overfitting).
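The prediction step can be sketched end to end on a toy series where the answer is known in advance. The data below comes from a made-up quadratic, not from the EPA, so the forecast can be checked by hand; the variable names mirror the post's but the values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy series from a known quadratic (NOT the EPA data): y = 2x^2 + 3x + 1.
x = np.arange(10).reshape(-1, 1)
y = (2 * x ** 2 + 3 * x + 1).astype(float)  # 2D target, like `y` in the post

# Fit the transformer and the regression on the known points.
pf = PolynomialFeatures(degree=2, include_bias=False)
regr = LinearRegression().fit(pf.fit_transform(x), y)

# Reuse the fitted transformer for the new point: transform, not fit_transform.
next_x = pf.transform([[10]])
forecast = regr.predict(next_x)[0][0]  # [0][0] because y was 2D

# By hand: 2 * 100 + 3 * 10 + 1 = 231
print(round(forecast, 6))
```

Because the toy data is an exact quadratic, the degree 2 model recovers it and the forecast matches the hand calculation up to floating-point error. Real data gives no such guarantee.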

UPDATE:
The EPA for the last quarter has been published, and the unemployment figure for that quarter is 3191.9 thousand people.

That means our degree 6 model overestimated the published value by 103.9 thousand people:

3295.8 - 3191.9 = 103.9 thousand people too many.

That's a big deviation.
We need to keep in mind that unemployment is affected by many variables not considered here.
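To put that deviation in proportion, the same arithmetic can be written out with a relative error as well. Both input figures are the ones quoted above; the relative error is an extra derived from them.

```python
predicted = 3295.8  # degree 6 forecast for serie 71, in thousands
published = 3191.9  # EPA figure for the last quarter of 2019, in thousands

abs_error = predicted - published
rel_error = abs_error / published

print(f"absolute error: {abs_error:.1f} thousand")  # 103.9 thousand
print(f"relative error: {rel_error:.2%}")           # 3.26%
```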

UPDATE 2:
After the post-Christmas hangover, a period in which employment always drops, we have the unemployment figure for January 2020, a month that gives a more realistic picture. That figure is 3253 thousand people.

That's: 3295.8 - 3253 = 42.8 thousand people overestimated.

Keep in mind that this figure is for January, not for a quarter, so the two cannot be compared directly. Furthermore, the figures come from different statistics, so they are calculated differently.

I am making this comparison only to check against the period after Christmas, which gives a more realistic unemployment figure.

I will write a new entry on my new blog to repeat these calculations with monthly figures instead of quarterly ones, for a more realistic result.

My new blog address is: https://www.pylancer.com/blog/ I hope to see you there.

Regards.

You can download the whole Jupyter notebook here:
https://drive.google.com/file/d/1GkZO3R14zs3uUrK6_6uVRGGO4Z755Nnu/view?usp=sharing
