diff --git a/m04_machine_learning/m04_c06_ml_workflow/m04_c06_lab.ipynb b/m04_machine_learning/m04_c06_ml_workflow/m04_c06_lab.ipynb
new file mode 100644
index 0000000..b5df338
--- /dev/null
+++ b/m04_machine_learning/m04_c06_ml_workflow/m04_c06_lab.ipynb
@@ -0,0 +1,331 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "
\n",
+ "\n",
+ "# MAT281\n",
+ "### Aplicaciones de la Matemática en la Ingeniería"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Módulo 04\n",
+ "## Laboratorio Clase 06: Proyectos de Machine Learning"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "### Instrucciones\n",
+ "\n",
+ "\n",
+ "* Completa tus datos personales (nombre y rol USM) en siguiente celda.\n",
+ "* La escala es de 0 a 4 considerando solo valores enteros.\n",
+ "* Debes _pushear_ tus cambios a tu repositorio personal del curso.\n",
+ "* Como respaldo, debes enviar un archivo .zip con el siguiente formato `mXX_cYY_lab_apellido_nombre.zip` a alonso.ogueda@gmail.com, debe contener todo lo necesario para que se ejecute correctamente cada celda, ya sea datos, imágenes, scripts, etc.\n",
+ "* Se evaluará:\n",
+ " - Soluciones\n",
+ " - Código\n",
+ " - Que Binder esté bien configurado.\n",
+ " - Al presionar `Kernel -> Restart Kernel and Run All Cells` deben ejecutarse todas las celdas sin error.\n",
+ "* __La entrega es al final de esta clase.__"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "__Nombre__:\n",
+ "\n",
+ "__Rol__:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "## GapMinder"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import altair as alt\n",
+ "\n",
+ "from vega_datasets import data\n",
+ "\n",
+ "alt.themes.enable('opaque')\n",
+ "\n",
+ "%matplotlib inline"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "gapminder = data.gapminder_health_income()\n",
+ "gapminder.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "### 1. Análisis exploratorio (1 pto)\n",
+ "\n",
+ "Como mínimo, realizar un `describe` del dataframe y una visualización adecuada, una _scatter matrix_ con los valores numéricos."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "## FIX ME PLEASE ##"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "## FIX ME PLEASE ##"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "### 2. Preprocesamiento (1 pto)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "Aplicar un escalamiento a los datos antes de aplicar nuestro algoritmo de clustering. Para ello, definir la variable `X_raw` que corresponde a un `numpy.array` con los valores del dataframe `gapminder` en las columnas _income_, _health_ y _population_. Luego, definir la variable `X` que deben ser los datos escalados de `X_raw`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.preprocessing import StandardScaler"
+ ]
+ },
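+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "If you have not used `StandardScaler` before, here is a minimal sketch of the fit/transform pattern on a made-up toy array (the `toy` data below is ours, not the lab data)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "# Hypothetical toy example, not the lab data: the fit/transform pattern.\n",
+ "import numpy as np\n",
+ "\n",
+ "toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])\n",
+ "toy_scaled = StandardScaler().fit_transform(toy)  # each column: mean 0, unit variance\n",
+ "toy_scaled"
+ ]
+ },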
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "X_raw = ## FIX ME PLEASE ##\n",
+ "X = ## FIX ME PLEASE ##"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "### 3. Clustering (1 pto)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.cluster import KMeans"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "Definir un _estimator_ `KMeans` con `k=3` y `random_state=42`, luego ajustar con `X` y finalmente, agregar los _labels_ obtenidos a una nueva columna del dataframe `gapminder` llamada `cluster`. Finalmente, realizar el mismo gráfico del principio pero coloreado por los clusters obtenidos.\n",
+ "\n"
+ ]
+ },
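+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "As a hint (not the solution), here is a minimal sketch of the `KMeans` API on a made-up array: construct the estimator, fit it, and read its `labels_` attribute."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "# Hypothetical toy example, not the lab data: construct, fit, read labels_.\n",
+ "import numpy as np\n",
+ "\n",
+ "toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])\n",
+ "toy_km = KMeans(n_clusters=2, random_state=0).fit(toy)\n",
+ "toy_km.labels_  # one cluster label per row"
+ ]
+ },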
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "k = 3\n",
+ "kmeans =## FIX ME PLEASE ##\n",
+ "kmeans## FIX ME PLEASE ##\n",
+ "clusters = ## FIX ME PLEASE ##\n",
+ "gapminder## FIX ME PLEASE ##"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "## FIX ME PLEASE ##"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "true"
+ },
+ "source": [
+ "### 4. Regla del codo (1 pto)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "__¿Cómo escoger la mejor cantidad de _clusters_?__\n",
+ "\n",
+ "En este ejercicio hemos utilizado que el número de clusters es igual a 3. El ajuste del modelo siempre será mejor al aumentar el número de clusters, pero ello no significa que el número de clusters sea el apropiado. De hecho, si tenemos que ajustar $n$ puntos, claramente tomar $n$ clusters generaría un ajuste perfecto, pero no permitiría representar si existen realmente agrupaciones de datos.\n",
+ "\n",
+ "Cuando no se conoce el número de clusters a priori, se utiliza la [regla del codo](https://jarroba.com/seleccion-del-numero-optimo-clusters/), que indica que el número más apropiado es aquel donde \"cambia la pendiente\" de decrecimiento de la la suma de las distancias a los clusters para cada punto, en función del número de clusters.\n",
+ "\n",
+ "A continuación se provee el código para el caso de clustering sobre los datos estandarizados, leídos directamente de un archivo preparado especialmente."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "elbow = pd.Series(name=\"inertia\").rename_axis(index=\"k\")\n",
+ "for k in range(1, 10):\n",
+ " kmeans = KMeans(n_clusters=k, random_state=42).fit(X)\n",
+ " elbow.loc[k] = kmeans.inertia_ # Inertia: Sum of distances of samples to their closest cluster center\n",
+ "elbow = elbow.reset_index()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "alt.Chart(elbow).mark_line(point=True).encode(\n",
+ " x=\"k:O\",\n",
+ " y=\"inertia:Q\"\n",
+ ").properties(\n",
+ " height=600,\n",
+ " width=800\n",
+ ")"
+ ]
+ },
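+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "As a complementary check (our addition, not required by the lab), the mean silhouette score from `sklearn.metrics` can be computed for each candidate `k`; higher values suggest better-separated clusters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "# Optional sanity check: mean silhouette score per k (only defined for k >= 2).\n",
+ "from sklearn.metrics import silhouette_score\n",
+ "\n",
+ "for k in range(2, 10):\n",
+ "    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)\n",
+ "    print(k, silhouette_score(X, labels))"
+ ]
+ },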
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "__Pregunta__\n",
+ "\n",
+ "Considerando los datos (países) y el gráfico anterior, ¿Cuántos clusters escogerías?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "##_TU RESPUESTA AQUÍ_ ##"
+ ]
+ }
+ ],
+ "metadata": {
+ "celltoolbar": "Slideshow",
+ "kernelspec": {
+ "display_name": "Python [conda env:ds]",
+ "language": "python",
+ "name": "conda-env-ds-py"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/m04_machine_learning/m04_c06_ml_workflow/m04_c06_ml_workflow.ipynb b/m04_machine_learning/m04_c06_ml_workflow/m04_c06_ml_workflow.ipynb
new file mode 100644
index 0000000..6a1bbf8
--- /dev/null
+++ b/m04_machine_learning/m04_c06_ml_workflow/m04_c06_ml_workflow.ipynb
@@ -0,0 +1,869 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "
\n",
+ "\n",
+ "# MAT281\n",
+ "### Aplicaciones de la Matemática en la Ingeniería"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Módulo 04\n",
+ "## Clase 06: Proyectos de Machine Learning"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Objetivos\n",
+ "\n",
+ "* Resumir lo que aprendido en el módulo.\n",
+ "* Conocer el _workflow_ de un proyecto de _machine learning_."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Contenidos\n",
+ "* [Estimadores](#estimator)\n",
+ "* [Pre-Procesamiento](#preprocessing)\n",
+ "* [Pipelines](#pipelines)\n",
+ "* [Evaluación de Modelos](#model_evaluation)\n",
+ "* [Búsqueda de Hiper-Parámetros](#hyperparameter_search)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Estimadores"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "Ya sabemos que `scikit-learn` nos provee de múltiples algoritmos y modelos de Machine Learning, que oficialmente son llamados **estimadores** (_estimators_). Cada _estimator_ puede ser ajustado (o coloquialmente, _fiteado_) utilizando los datos adecuados.\n",
+ "\n",
+ "Por ejemplo, para motivar, la __Regresión Ridge__ es un tipo de regresión que agrega un parámetro de regularización, en particular, busca minimizar la suma de residuos pero penalizada, es decir:\n",
+ "\n",
+ "$$\n",
+ "\\min_\\beta \\vert \\vert y - X \\beta \\vert \\vert_2^2 + \\alpha \\vert \\vert \\beta \\vert \\vert_2^2\n",
+ "$$\n",
+ "\n",
+ "El hiper-parámetro $\\alpha > 0$ es usualmente conocido como parámetro penalización ridge. En realidad, en la literatura estadística se denota con $lambda$, pero como en `python` el nombre lambda está reservado para las funciones anónimas, `scikit-learn` optó por utilizar otra letra griega. La regresión ridge es una alternativa popularpara sobrellevar el problema de colinealidad.\n",
+ "\n",
+ "En `scikit-learn.linear_models` se encuentra el estimador `Ridge`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import Ridge\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "from sklearn.datasets import load_boston"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "X, y = load_boston(return_X_y=True)\n",
+ "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "rr_est = Ridge(alpha=0.1)"
+ ]
+ },
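+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "To tie the formula above to the code, the following sketch (our addition, not part of the original lesson) drops the intercept via `fit_intercept=False` and checks `Ridge` against the closed-form solution $\\beta = (X^T X + \\alpha I)^{-1} X^T y$, using the `X`, `y` loaded above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "# Our sketch: verify the ridge closed form.\n",
+ "# With fit_intercept=False, Ridge minimizes ||y - X b||^2 + alpha ||b||^2,\n",
+ "# whose minimizer is b = (X'X + alpha I)^{-1} X'y.\n",
+ "import numpy as np\n",
+ "\n",
+ "alpha = 0.1\n",
+ "beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_\n",
+ "beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)\n",
+ "np.allclose(beta_sklearn, beta_closed)"
+ ]
+ },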
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "Típicamente el método `fit` acepta dos inputs:\n",
+ "\n",
+ "* La matriz de diseño `X`, arreglo bidimensional que típicamente es `(n_samples, n_features)`.\n",
+ "* Los valores _target_ `y`.\n",
+ " - En tareas de regresión corresponden a números reales.\n",
+ " - En tareas de clasificación corresopnden a enteros (u otro conjunto de elementos discreto).\n",
+ " - Para aprendizaje no-supervisado este input no es necesario."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,\n",
+ " normalize=False, random_state=None, solver='auto', tol=0.001)"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rr_est.fit(X, y)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([-1.07473720e-01, 4.65716366e-02, 1.59989982e-02, 2.67001859e+00,\n",
+ " -1.66846452e+01, 3.81823322e+00, -2.69060598e-04, -1.45962557e+00,\n",
+ " 3.03515266e-01, -1.24205910e-02, -9.40758541e-01, 9.36807461e-03,\n",
+ " -5.25966203e-01])"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rr_est.coef_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "35.69365371165901"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rr_est.intercept_"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "El método `predict` necesita un arreglo bidimensional como input. Para ejemplificar podemos utilizar la misma _data_ de entrenamiento."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([30.04164633, 24.99087654, 30.56235738, 28.65418856, 27.98110937,\n",
+ " 25.28351105, 22.99401212, 19.49937732, 11.46728387, 18.90419332])"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rr_est.predict(X)[:10]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "En un flujo estándar ajustaríamos con los datos de entrenamiento, predeciríamos datos de test y luego calculamos alguna métrica, por ejemplo, para un caso de regresión, el error cuadrático medio."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,\n",
+ " normalize=False, random_state=None, solver='auto', tol=0.001)"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rr_est.fit(X_train, y_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "y_pred = rr_est.predict(X_test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import mean_squared_error"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "22.14223297423886"
+ ]
+ },
+ "execution_count": 36,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "mean_squared_error(y_pred, y_test)"
+ ]
+ },
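+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "As a complement (our addition), every `scikit-learn` regressor also exposes a `score` method, which returns the coefficient of determination $R^2$ on the given data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "# R^2 on the test set; score() is part of the standard estimator API.\n",
+ "rr_est.score(X_test, y_test)"
+ ]
+ },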
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Pre-Procesamiento"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "En el flujo de trabajo típico de un proyecto de machine learning es usual procesar y transformar los datos. En `scikit-learn` el pre-procesamiento y transformación siguen la misma API que los objetos _estimators_, pero que se denotan como _transformers_. Sin embargo, estos no poseen un método `predict` pero si uno de transformación, `transform`.\n",
+ "\n",
+ "Motivaremos con la típica estandarización."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.preprocessing import StandardScaler"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "# StandardScaler?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "Usualmente se ajusta y transformar los mismos datos, por lo que se aplican los métodos concatenados."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[-0.41978194, 0.28482986, -1.2879095 , ..., -1.45900038,\n",
+ " 0.44105193, -1.0755623 ],\n",
+ " [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,\n",
+ " 0.44105193, -0.49243937],\n",
+ " [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,\n",
+ " 0.39642699, -1.2087274 ],\n",
+ " ...,\n",
+ " [-0.41344658, -0.48772236, 0.11573841, ..., 1.17646583,\n",
+ " 0.44105193, -0.98304761],\n",
+ " [-0.40776407, -0.48772236, 0.11573841, ..., 1.17646583,\n",
+ " 0.4032249 , -0.86530163],\n",
+ " [-0.41500016, -0.48772236, 0.11573841, ..., 1.17646583,\n",
+ " 0.44105193, -0.66905833]])"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "StandardScaler().fit(X).transform(X)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "Sin embargo, muchos de estos objetos (si es que no es la totalidad de ellos), poseen el método `fit_transform`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[-0.41978194, 0.28482986, -1.2879095 , ..., -1.45900038,\n",
+ " 0.44105193, -1.0755623 ],\n",
+ " [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,\n",
+ " 0.44105193, -0.49243937],\n",
+ " [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,\n",
+ " 0.39642699, -1.2087274 ],\n",
+ " ...,\n",
+ " [-0.41344658, -0.48772236, 0.11573841, ..., 1.17646583,\n",
+ " 0.44105193, -0.98304761],\n",
+ " [-0.40776407, -0.48772236, 0.11573841, ..., 1.17646583,\n",
+ " 0.4032249 , -0.86530163],\n",
+ " [-0.41500016, -0.48772236, 0.11573841, ..., 1.17646583,\n",
+ " 0.44105193, -0.66905833]])"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "StandardScaler().fit_transform(X)"
+ ]
+ },
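+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "A caveat worth a sketch (our addition): with a train/test split, the scaler should be fitted on the training data only and then applied to both splits, so that test-set statistics do not leak into the preprocessing."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "# Fit the scaler on the training split only, then reuse it on the test split.\n",
+ "scaler = StandardScaler().fit(X_train)\n",
+ "X_train_scaled = scaler.transform(X_train)\n",
+ "X_test_scaled = scaler.transform(X_test)  # scaled with the training means/stds\n",
+ "X_test_scaled[:2]"
+ ]
+ },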
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Pipelines"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "`Scikit-learn` nos permite combinar _transformers_ y _estimators_ uniéndolos a través de \"tuberías\", objeto denotado como _pipeline_. Nuevamente, la API es consistente con un _estimator_, tanto como para ajustar como para predecir."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.pipeline import make_pipeline"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "pipe = make_pipeline(\n",
+ " StandardScaler(),\n",
+ " Ridge(alpha=0.1)\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Pipeline(memory=None,\n",
+ " steps=[('standardscaler',\n",
+ " StandardScaler(copy=True, with_mean=True, with_std=True)),\n",
+ " ('ridge',\n",
+ " Ridge(alpha=0.1, copy_X=True, fit_intercept=True,\n",
+ " max_iter=None, normalize=False, random_state=None,\n",
+ " solver='auto', tol=0.001))],\n",
+ " verbose=False)"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pipe.fit(X_train, y_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([28.83639192, 36.00279792, 15.09483565, 25.22983181, 18.87788941,\n",
+ " 23.21453831, 17.59519315, 14.30885051, 23.04885263, 20.62241378])"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pipe.predict(X_test)[:10]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "22.100507974094608"
+ ]
+ },
+ "execution_count": 37,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "mean_squared_error(pipe.predict(X_test), y_test)"
+ ]
+ },
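+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "The steps of a fitted pipeline remain accessible. For instance (our addition), `make_pipeline` names each step after its lowercased class name, so the ridge coefficients can be read back out:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "# Access a fitted step by its auto-generated name.\n",
+ "pipe.named_steps[\"ridge\"].coef_"
+ ]
+ },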
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Evaluación de Modelos"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "Ya sabemos que ajustar un modelo con datos conocidos no implica que se comportará de buena manera con datos nuevos, por lo que tenemos herramientas como _cross validation_ para evaluar los modelos con los datos conocidos."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import cross_validate"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "result = cross_validate(rr_est, X_train, y_train) # defaults to 5-fold CV"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "dict_keys(['fit_time', 'score_time', 'test_score'])"
+ ]
+ },
+ "execution_count": 40,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result.keys()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {
+ "Collapsed": "false",
+ "jupyter": {
+ "source_hidden": true
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([0.73386461, 0.64296157, 0.76353404, 0.77445777, 0.66149893])"
+ ]
+ },
+ "execution_count": 41,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result[\"test_score\"]"
+ ]
+ },
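+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "By default `cross_validate` uses the estimator's own `score` method (for `Ridge`, $R^2$). A sketch of requesting a different metric (our addition), using the built-in `\"neg_mean_squared_error\"` scorer:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "# Cross-validated MSE; scikit-learn negates error metrics so that greater is better.\n",
+ "result_mse = cross_validate(rr_est, X_train, y_train, scoring=\"neg_mean_squared_error\")\n",
+ "-result_mse[\"test_score\"]  # flip the sign back to plain MSE"
+ ]
+ },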
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "## Búsqueda de Hiper-parámetros"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "Para el caso de la regeresión ridge, el parámetro de penalización es un hiper-parámetro que necesita ser escogido con algún procedimiento. Aunque no lo creas, `scikit-learn` también provee herramientas para escoger automáticamente este tipo de hiper-parámetros. \n",
+ "\n",
+ "Por ejemplo `GridSearchCV` realiza una búsqueda exhaustiva entre los posibles valores especificados para los hiper-parámetros."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "from sklearn.model_selection import GridSearchCV"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 64,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "GridSearchCV(cv=None, error_score=nan,\n",
+ " estimator=Ridge(alpha=0.1, copy_X=True, fit_intercept=True,\n",
+ " max_iter=None, normalize=False, random_state=None,\n",
+ " solver='auto', tol=0.001),\n",
+ " iid='deprecated', n_jobs=None,\n",
+ " param_grid={'alpha': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])},\n",
+ " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n",
+ " scoring=None, verbose=0)"
+ ]
+ },
+ "execution_count": 64,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "param_grid = {\"alpha\": np.arange(0, 1, 0.1)}\n",
+ "\n",
+ "search = GridSearchCV(\n",
+ " estimator=rr_est,\n",
+ " param_grid=param_grid\n",
+ ")\n",
+ "\n",
+ "search.fit(X_train, y_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 65,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'alpha': 0.0}"
+ ]
+ },
+ "execution_count": 65,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "search.best_params_"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "El objeto `search` ahora es equivalente a un estimator `Ridge` pero con los mejores parámetros encontrados (`alpha` = 0)."
+ ]
+ },
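+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "source": [
+ "For instance (our addition), the refitted estimator itself is available through the `best_estimator_` attribute:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [],
+ "source": [
+ "# The Ridge refitted on the whole training set with the best alpha found.\n",
+ "search.best_estimator_"
+ ]
+ },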
+ {
+ "cell_type": "code",
+ "execution_count": 66,
+ "metadata": {
+ "Collapsed": "false"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.6844267283527127"
+ ]
+ },
+ "execution_count": 66,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "search.score(X_test, y_test)"
+ ]
+ }
+ ],
+ "metadata": {
+ "celltoolbar": "Slideshow",
+ "kernelspec": {
+ "display_name": "Python [conda env:ds]",
+ "language": "python",
+ "name": "conda-env-ds-py"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}