Showing 2 changed files with 1,200 additions and 0 deletions.
m04_machine_learning/m04_c06_ml_workflow/m04_c06_lab.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,331 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false", | ||
"slideshow": { | ||
"slide_type": "slide" | ||
} | ||
}, | ||
"source": [ | ||
"<img src=\"https://upload.wikimedia.org/wikipedia/commons/4/47/Logo_UTFSM.png\" width=\"200\" alt=\"utfsm-logo\" align=\"left\"/>\n", | ||
"\n", | ||
"# MAT281\n", | ||
"### Aplicaciones de la Matemática en la Ingeniería" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false", | ||
"slideshow": { | ||
"slide_type": "slide" | ||
} | ||
}, | |
"source": [ | |
"## Module 04\n", | |
"## Lab, Class 06: Machine Learning Projects" | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"source": [ | |
"### Instructions\n", | |
"\n", | |
"\n", | |
"* Fill in your personal details (name and USM rol, i.e. student ID number) in the next cell.\n", | |
"* The grading scale runs from 0 to 4, using integer values only.\n", | |
"* You must _push_ your changes to your personal course repository.\n", | |
"* As a backup, send a .zip file named `mXX_cYY_lab_apellido_nombre.zip` (`apellido` = surname, `nombre` = first name) to [email protected]. It must contain everything needed for every cell to run correctly: data, images, scripts, etc.\n", | |
"* The following will be graded:\n", | |
" - Solutions\n", | |
" - Code\n", | |
" - That Binder is configured correctly.\n", | |
" - Running `Kernel -> Restart Kernel and Run All Cells` must execute every cell without errors.\n", | |
"* __The lab is due at the end of this class.__" | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"source": [ | |
"__Name__:\n", | |
"\n", | |
"__Rol__:" | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"source": [ | ||
"## GapMinder" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import pandas as pd\n", | ||
"import altair as alt\n", | ||
"\n", | ||
"from vega_datasets import data\n", | ||
"\n", | ||
"alt.themes.enable('opaque')\n", | ||
"\n", | ||
"%matplotlib inline" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"gapminder = data.gapminder_health_income()\n", | ||
"gapminder.head()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"source": [ | |
"### 1. Exploratory analysis (1 pt)\n", | |
"\n", | |
"At a minimum, call `describe` on the dataframe and produce a suitable visualization: a _scatter matrix_ of the numeric values." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"## FIX ME PLEASE ##" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"## FIX ME PLEASE ##" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"source": [ | |
"### 2. Preprocessing (1 pt)" | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"source": [ | |
"Scale the data before applying the clustering algorithm. To do so, define the variable `X_raw` as a `numpy.array` holding the values of the _income_, _health_ and _population_ columns of the `gapminder` dataframe. Then define the variable `X` as the scaled version of `X_raw`." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from sklearn.preprocessing import StandardScaler" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"X_raw = ## FIX ME PLEASE ##\n", | ||
"X = ## FIX ME PLEASE ##" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"source": [ | |
"### 3. Clustering (1 pt)" | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from sklearn.cluster import KMeans" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"source": [ | |
"Define a `KMeans` _estimator_ with `k=3` clusters (the `n_clusters` parameter) and `random_state=42`, fit it on `X`, and add the resulting _labels_ to a new column of the `gapminder` dataframe named `cluster`. Finally, redraw the plot from the beginning, colored by the clusters obtained.\n", | |
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"k = 3\n", | ||
"kmeans =## FIX ME PLEASE ##\n", | ||
"kmeans## FIX ME PLEASE ##\n", | ||
"clusters = ## FIX ME PLEASE ##\n", | ||
"gapminder## FIX ME PLEASE ##" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"## FIX ME PLEASE ##" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "true" | ||
}, | |
"source": [ | |
"### 4. Elbow rule (1 pt)" | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"source": [ | |
"__How do you choose the best number of _clusters_?__\n", | |
"\n", | |
"In this exercise we set the number of clusters to 3. The fit (measured by inertia) keeps improving as the number of clusters grows, but that does not make a larger number of clusters appropriate. In fact, if we have $n$ points to fit, using $n$ clusters would clearly give a perfect fit, but it would not reveal whether the data really contain groups.\n", | |
"\n", | |
"When the number of clusters is not known a priori, the [elbow rule](https://jarroba.com/seleccion-del-numero-optimo-clusters/) is used: the most appropriate number is the one where the curve \"changes slope\", that is, where the sum of squared distances from each point to its cluster center, plotted against the number of clusters, stops decreasing quickly.\n", | |
"\n", | |
"The code below applies this rule to the clustering of the standardized data `X` defined in the preprocessing step." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"outputs": [], | |
"source": [ | |
"# Explicit float dtype: an empty Series otherwise defaults to object dtype\n", | |
"elbow = pd.Series(name=\"inertia\", dtype=float).rename_axis(index=\"k\")\n", | |
"for k in range(1, 10):\n", | |
" kmeans = KMeans(n_clusters=k, random_state=42).fit(X)\n", | |
" elbow.loc[k] = kmeans.inertia_ # Inertia: sum of squared distances of samples to their closest cluster center\n", | |
"elbow = elbow.reset_index()" | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"alt.Chart(elbow).mark_line(point=True).encode(\n", | ||
" x=\"k:O\",\n", | ||
" y=\"inertia:Q\"\n", | ||
").properties(\n", | ||
" height=600,\n", | ||
" width=800\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"source": [ | |
"__Question__\n", | |
"\n", | |
"Considering the data (countries) and the plot above, how many clusters would you choose?" | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"Collapsed": "false" | ||
}, | |
"source": [ | |
"## _YOUR ANSWER HERE_ ##" | |
] | ||
} | ||
], | ||
"metadata": { | ||
"celltoolbar": "Slideshow", | ||
"kernelspec": { | ||
"display_name": "Python [conda env:ds]", | ||
"language": "python", | ||
"name": "conda-env-ds-py" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
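
The elbow-rule discussion in section 4 of the notebook rests on two claims: inertia is the sum of squared distances from each point to its closest cluster center, and using as many clusters as points gives a perfect fit. The following is a minimal from-scratch sketch of both, not scikit-learn's `KMeans` implementation. The helper names (`squared_distance`, `kmeans_inertia`), the toy data, and the choice to seed the centers with the first `k` points are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_inertia(points: List[Point], k: int, max_iter: int = 100) -> float:
    """Run a basic Lloyd's k-means and return the final inertia.

    Assumption: the first k points seed the centers. This is deterministic,
    but it is not what scikit-learn does by default.
    """
    centers = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        # Assignment step: index of the closest center for each point.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    # Inertia: sum of squared distances to the closest center.
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

inertia_by_k = {
    k: kmeans_inertia(toy_points, k) for k in range(1, len(toy_points) + 1)
}
for k, value in inertia_by_k.items():
    print(f"k={k}: inertia={value:.3f}")
```

On this toy data the inertia drops sharply between `k=1` and `k=2`, which is the "elbow", and it is exactly zero when `k` equals the number of points, even though six clusters say nothing useful about the two groups.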