diff --git a/m04_machine_learning/m04_c06_ml_workflow/m04_c06_lab.ipynb b/m04_machine_learning/m04_c06_ml_workflow/m04_c06_lab.ipynb
new file mode 100644
index 0000000..b5df338
--- /dev/null
+++ b/m04_machine_learning/m04_c06_ml_workflow/m04_c06_lab.ipynb
@@ -0,0 +1,331 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "\"utfsm-logo\"\n",
+    "\n",
+    "# MAT281\n",
+    "### Aplicaciones de la Matemática en la Ingeniería"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Module 04\n",
+    "## Lab Class 06: Machine Learning Projects"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "### Instructions\n",
+    "\n",
+    "* Fill in your personal details (name and USM roll number) in the following cell.\n",
+    "* The grading scale is 0 to 4, integer values only.\n",
+    "* You must _push_ your changes to your personal course repository.\n",
+    "* As a backup, you must email a .zip file named `mXX_cYY_lab_apellido_nombre.zip` to alonso.ogueda@gmail.com; it must contain everything needed for every cell to run correctly: data, images, scripts, etc.\n",
+    "* The following will be graded:\n",
+    "  - Solutions\n",
+    "  - Code\n",
+    "  - That Binder is properly configured.\n",
+    "  - After clicking `Kernel -> Restart Kernel and Run All Cells`, every cell must run without errors.\n",
+    "* __The deadline is at the end of this class session.__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "__Name__:\n",
+    "\n",
+    "__Roll number__:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "## GapMinder"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import altair as alt\n",
+    "\n",
+    "from vega_datasets import data\n",
+    "\n",
+    "alt.themes.enable('opaque')\n",
+    "\n",
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "gapminder = data.gapminder_health_income()\n",
+    "gapminder.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "### 1. Exploratory Analysis (1 pt)\n",
+    "\n",
+    "At a minimum, run a `describe` on the dataframe and produce an appropriate visualization: a _scatter matrix_ of the numerical columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "## FIX ME PLEASE ##"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "## FIX ME PLEASE ##"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "### 2. Preprocessing (1 pt)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "Apply scaling to the data before running our clustering algorithm. To do so, define the variable `X_raw` as a `numpy.array` with the values of the `gapminder` dataframe in the columns _income_, _health_ and _population_. Then define the variable `X` as the scaled version of `X_raw`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.preprocessing import StandardScaler"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "X_raw = ## FIX ME PLEASE ##\n",
+    "X = ## FIX ME PLEASE ##"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "### 3. Clustering (1 pt)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.cluster import KMeans"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "Define a `KMeans` _estimator_ with `k=3` and `random_state=42`, fit it on `X`, and then add the resulting _labels_ to a new column of the `gapminder` dataframe called `cluster`. Finally, recreate the plot from the beginning, but colored by the clusters obtained."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "k = 3\n",
+    "kmeans = ## FIX ME PLEASE ##\n",
+    "kmeans## FIX ME PLEASE ##\n",
+    "clusters = ## FIX ME PLEASE ##\n",
+    "gapminder## FIX ME PLEASE ##"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "## FIX ME PLEASE ##"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "true"
+   },
+   "source": [
+    "### 4. Elbow Method (1 pt)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "__How do we choose the best number of _clusters_?__\n",
+    "\n",
+    "In this exercise we fixed the number of clusters at 3. The model fit always improves as the number of clusters grows, but that does not mean that number of clusters is the appropriate one. In fact, if we have to fit $n$ points, taking $n$ clusters clearly yields a perfect fit, yet it would not reveal whether real groupings exist in the data.\n",
+    "\n",
+    "When the number of clusters is not known a priori, the [elbow method](https://jarroba.com/seleccion-del-numero-optimo-clusters/) is used: the most appropriate number is the one at which the decreasing curve of the sum of squared distances from each point to its closest cluster center \"changes slope\", as a function of the number of clusters.\n",
+    "\n",
+    "The code below performs this clustering on the standardized data `X` defined above."
+   ]
+  },
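+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "The quantity tracked below is scikit-learn's `inertia_`, which the `KMeans` documentation defines as the sum of squared distances of the samples to their closest cluster center:\n",
+    "\n",
+    "$$\n",
+    "\\text{inertia} = \\sum_{i=1}^{n} \\min_{\\mu_j} \\Vert x_i - \\mu_j \\Vert^2,\n",
+    "$$\n",
+    "\n",
+    "where the $\\mu_j$ are the cluster centers. It always decreases as $k$ grows, which is why we look for the \"elbow\" rather than the minimum."
+   ]
+  },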
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "elbow = pd.Series(name=\"inertia\", dtype=float).rename_axis(index=\"k\")\n",
+    "for k in range(1, 10):\n",
+    "    kmeans = KMeans(n_clusters=k, random_state=42).fit(X)\n",
+    "    elbow.loc[k] = kmeans.inertia_  # inertia: sum of squared distances of samples to their closest cluster center\n",
+    "elbow = elbow.reset_index()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "alt.Chart(elbow).mark_line(point=True).encode(\n",
+    "    x=\"k:O\",\n",
+    "    y=\"inertia:Q\"\n",
+    ").properties(\n",
+    "    height=600,\n",
+    "    width=800\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "__Question__\n",
+    "\n",
+    "Considering the data (countries) and the chart above, how many clusters would you choose?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "## _YOUR ANSWER HERE_ ##"
+   ]
+  }
+ ],
+ "metadata": {
+  "celltoolbar": "Slideshow",
+  "kernelspec": {
+   "display_name": "Python [conda env:ds]",
+   "language": "python",
+   "name": "conda-env-ds-py"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/m04_machine_learning/m04_c06_ml_workflow/m04_c06_ml_workflow.ipynb b/m04_machine_learning/m04_c06_ml_workflow/m04_c06_ml_workflow.ipynb
new file mode 100644
index 0000000..6a1bbf8
--- /dev/null
+++ b/m04_machine_learning/m04_c06_ml_workflow/m04_c06_ml_workflow.ipynb
@@ -0,0 +1,869 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "\"utfsm-logo\"\n",
+    "\n",
+    "# MAT281\n",
+    "### Aplicaciones de la Matemática en la Ingeniería"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Module 04\n",
+    "## Class 06: Machine Learning Projects"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Objectives\n",
+    "\n",
+    "* Summarize what we have learned in this module.\n",
+    "* Get to know the _workflow_ of a _machine learning_ project."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "subslide"
+    }
+   },
+   "source": [
+    "## Contents\n",
+    "* [Estimators](#estimator)\n",
+    "* [Preprocessing](#preprocessing)\n",
+    "* [Pipelines](#pipelines)\n",
+    "* [Model Evaluation](#model_evaluation)\n",
+    "* [Hyperparameter Search](#hyperparameter_search)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "<a id=\"estimator\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Estimators"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "We already know that `scikit-learn` provides multiple Machine Learning algorithms and models, officially called **estimators**. Each _estimator_ can be fitted using suitable data.\n",
+    "\n",
+    "As a motivating example, __Ridge Regression__ is a type of regression that adds a regularization parameter; in particular, it minimizes a penalized residual sum of squares, namely:\n",
+    "\n",
+    "$$\n",
+    "\\min_\\beta \\vert \\vert y - X \\beta \\vert \\vert_2^2 + \\alpha \\vert \\vert \\beta \\vert \\vert_2^2\n",
+    "$$\n",
+    "\n",
+    "The hyperparameter $\\alpha > 0$ is usually known as the ridge penalty parameter. In the statistical literature it is denoted $\\lambda$, but since in `python` the name `lambda` is reserved for anonymous functions, `scikit-learn` opted for a different Greek letter. Ridge regression is a popular alternative for dealing with the problem of collinearity.\n",
+    "\n",
+    "The `Ridge` estimator can be found in `sklearn.linear_model`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.linear_model import Ridge\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "from sklearn.datasets import load_boston"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "X, y = load_boston(return_X_y=True)\n",
+    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "rr_est = Ridge(alpha=0.1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "Typically, the `fit` method accepts two inputs:\n",
+    "\n",
+    "* The design matrix `X`, a two-dimensional array, typically of shape `(n_samples, n_features)`.\n",
+    "* The _target_ values `y`.\n",
+    "  - In regression tasks these are real numbers.\n",
+    "  - In classification tasks these are integers (or another discrete set of elements).\n",
+    "  - For unsupervised learning this input is not needed."
+   ]
+  },
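+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "As a quick sanity check of these conventions — a minimal sketch, not executed in the original notebook — we can inspect the shapes of the arrays we just loaded:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "# X is (n_samples, n_features) and y is (n_samples,);\n",
+    "# for the Boston housing data that is (506, 13) and (506,).\n",
+    "X.shape, y.shape"
+   ]
+  },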
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,\n",
+       "      normalize=False, random_state=None, solver='auto', tol=0.001)"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rr_est.fit(X, y)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([-1.07473720e-01,  4.65716366e-02,  1.59989982e-02,  2.67001859e+00,\n",
+       "       -1.66846452e+01,  3.81823322e+00, -2.69060598e-04, -1.45962557e+00,\n",
+       "        3.03515266e-01, -1.24205910e-02, -9.40758541e-01,  9.36807461e-03,\n",
+       "       -5.25966203e-01])"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rr_est.coef_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "35.69365371165901"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rr_est.intercept_"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "The `predict` method needs a two-dimensional array as input. To illustrate, we can use the same training data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([30.04164633, 24.99087654, 30.56235738, 28.65418856, 27.98110937,\n",
+       "       25.28351105, 22.99401212, 19.49937732, 11.46728387, 18.90419332])"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rr_est.predict(X)[:10]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "In a standard workflow we would fit on the training data, predict on the test data, and then compute some metric; for example, in a regression setting, the mean squared error."
+   ]
+  },
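+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "For reference, for $n$ test observations with true values $y_i$ and predictions $\\hat{y}_i$, the mean squared error is\n",
+    "\n",
+    "$$\n",
+    "\\mathrm{MSE} = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2.\n",
+    "$$"
+   ]
+  },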
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,\n",
+       "      normalize=False, random_state=None, solver='auto', tol=0.001)"
+      ]
+     },
+     "execution_count": 23,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rr_est.fit(X_train, y_train)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "y_pred = rr_est.predict(X_test)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.metrics import mean_squared_error"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "22.14223297423886"
+      ]
+     },
+     "execution_count": 36,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "mean_squared_error(y_pred, y_test)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "<a id=\"preprocessing\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Preprocessing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "In the typical workflow of a machine learning project it is common to process and transform the data. In `scikit-learn`, preprocessing and transformation objects follow the same API as _estimators_, but are referred to as _transformers_. They do not have a `predict` method, but they do have a transformation method, `transform`.\n",
+    "\n",
+    "We will use the classic standardization as a motivating example."
+   ]
+  },
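+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "For reference, with its default settings `StandardScaler` standardizes each feature using the mean $\\mu$ and standard deviation $\\sigma$ estimated from the data passed to `fit`:\n",
+    "\n",
+    "$$\n",
+    "z = \\frac{x - \\mu}{\\sigma}\n",
+    "$$"
+   ]
+  },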
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.preprocessing import StandardScaler"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "# StandardScaler?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "Usually we fit and transform the same data, so the two methods are applied chained together."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[-0.41978194,  0.28482986, -1.2879095 , ..., -1.45900038,\n",
+       "         0.44105193, -1.0755623 ],\n",
+       "       [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,\n",
+       "         0.44105193, -0.49243937],\n",
+       "       [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,\n",
+       "         0.39642699, -1.2087274 ],\n",
+       "       ...,\n",
+       "       [-0.41344658, -0.48772236,  0.11573841, ...,  1.17646583,\n",
+       "         0.44105193, -0.98304761],\n",
+       "       [-0.40776407, -0.48772236,  0.11573841, ...,  1.17646583,\n",
+       "         0.4032249 , -0.86530163],\n",
+       "       [-0.41500016, -0.48772236,  0.11573841, ...,  1.17646583,\n",
+       "         0.44105193, -0.66905833]])"
+      ]
+     },
+     "execution_count": 19,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "StandardScaler().fit(X).transform(X)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "However, many of these objects (if not all of them) provide the `fit_transform` method."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[-0.41978194,  0.28482986, -1.2879095 , ..., -1.45900038,\n",
+       "         0.44105193, -1.0755623 ],\n",
+       "       [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,\n",
+       "         0.44105193, -0.49243937],\n",
+       "       [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,\n",
+       "         0.39642699, -1.2087274 ],\n",
+       "       ...,\n",
+       "       [-0.41344658, -0.48772236,  0.11573841, ...,  1.17646583,\n",
+       "         0.44105193, -0.98304761],\n",
+       "       [-0.40776407, -0.48772236,  0.11573841, ...,  1.17646583,\n",
+       "         0.4032249 , -0.86530163],\n",
+       "       [-0.41500016, -0.48772236,  0.11573841, ...,  1.17646583,\n",
+       "         0.44105193, -0.66905833]])"
+      ]
+     },
+     "execution_count": 20,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "StandardScaler().fit_transform(X)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "<a id=\"pipelines\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Pipelines"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "`Scikit-learn` lets us combine _transformers_ and _estimators_ by joining them through \"pipes\", an object known as a _pipeline_. Once again, the API is consistent with that of an _estimator_, both for fitting and for predicting."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.pipeline import make_pipeline"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "pipe = make_pipeline(\n",
+    "    StandardScaler(),\n",
+    "    Ridge(alpha=0.1)\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Pipeline(memory=None,\n",
+       "         steps=[('standardscaler',\n",
+       "                 StandardScaler(copy=True, with_mean=True, with_std=True)),\n",
+       "                ('ridge',\n",
+       "                 Ridge(alpha=0.1, copy_X=True, fit_intercept=True,\n",
+       "                       max_iter=None, normalize=False, random_state=None,\n",
+       "                       solver='auto', tol=0.001))],\n",
+       "         verbose=False)"
+      ]
+     },
+     "execution_count": 28,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pipe.fit(X_train, y_train)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([28.83639192, 36.00279792, 15.09483565, 25.22983181, 18.87788941,\n",
+       "       23.21453831, 17.59519315, 14.30885051, 23.04885263, 20.62241378])"
+      ]
+     },
+     "execution_count": 30,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pipe.predict(X_test)[:10]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "22.100507974094608"
+      ]
+     },
+     "execution_count": 37,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "mean_squared_error(pipe.predict(X_test), y_test)"
+   ]
+  },
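+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "As a side note — a minimal sketch, not executed in the original notebook — the fitted steps of a pipeline built with `make_pipeline` can be retrieved through `named_steps`, keyed by the lowercased class names:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "# Access the fitted Ridge inside the pipeline; note that its coefficients\n",
+    "# live in the standardized feature space, not the original one.\n",
+    "pipe.named_steps[\"ridge\"].coef_"
+   ]
+  },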
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "<a id=\"model_evaluation\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "## Model Evaluation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "We already know that fitting a model on known data does not imply that it will perform well on new data, which is why we have tools such as _cross validation_ to evaluate models using the known data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.model_selection import cross_validate"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "result = cross_validate(rr_est, X_train, y_train)  # defaults to 5-fold CV"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "dict_keys(['fit_time', 'score_time', 'test_score'])"
+      ]
+     },
+     "execution_count": 40,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "result.keys()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "metadata": {
+    "Collapsed": "false",
+    "jupyter": {
+     "source_hidden": true
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([0.73386461, 0.64296157, 0.76353404, 0.77445777, 0.66149893])"
+      ]
+     },
+     "execution_count": 41,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "result[\"test_score\"]"
+   ]
+  },
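+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "Since `cross_validate` uses the estimator's default scorer here (for `Ridge`, the coefficient of determination $R^2$), a common way to summarize the folds is with the mean and standard deviation — a minimal sketch, not executed in the original notebook:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "# Summarize the five fold scores (R^2 for a regressor) as mean and std\n",
+    "result[\"test_score\"].mean(), result[\"test_score\"].std()"
+   ]
+  },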
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false",
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "<a id=\"hyperparameter_search\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "## Hyperparameter Search"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "In the case of ridge regression, the penalty parameter is a hyperparameter that needs to be chosen by some procedure. Believe it or not, `scikit-learn` also provides tools to choose this kind of hyperparameter automatically.\n",
+    "\n",
+    "For example, `GridSearchCV` performs an exhaustive search over the specified candidate values for the hyperparameters."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from sklearn.model_selection import GridSearchCV"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 64,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "GridSearchCV(cv=None, error_score=nan,\n",
+       "             estimator=Ridge(alpha=0.1, copy_X=True, fit_intercept=True,\n",
+       "                             max_iter=None, normalize=False, random_state=None,\n",
+       "                             solver='auto', tol=0.001),\n",
+       "             iid='deprecated', n_jobs=None,\n",
+       "             param_grid={'alpha': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])},\n",
+       "             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n",
+       "             scoring=None, verbose=0)"
+      ]
+     },
+     "execution_count": 64,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "param_grid = {\"alpha\": np.arange(0, 1, 0.1)}\n",
+    "\n",
+    "search = GridSearchCV(\n",
+    "    estimator=rr_est,\n",
+    "    param_grid=param_grid\n",
+    ")\n",
+    "\n",
+    "search.fit(X_train, y_train)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 65,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'alpha': 0.0}"
+      ]
+     },
+     "execution_count": 65,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "search.best_params_"
+   ]
+  },
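+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "Beyond the best parameters, the search object also exposes the best mean cross-validated score and the full grid of results — a minimal sketch, not executed in the original notebook:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [],
+   "source": [
+    "# Mean cross-validated score of the best candidate, and the mean\n",
+    "# test score for each value of alpha in the grid.\n",
+    "search.best_score_, search.cv_results_[\"mean_test_score\"]"
+   ]
+  },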
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "The `search` object is now equivalent to a `Ridge` estimator fitted with the best parameters found (`alpha = 0`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 66,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.6844267283527127"
+      ]
+     },
+     "execution_count": 66,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "search.score(X_test, y_test)"
+   ]
+  }
+ ],
+ "metadata": {
+  "celltoolbar": "Slideshow",
+  "kernelspec": {
+   "display_name": "Python [conda env:ds]",
+   "language": "python",
+   "name": "conda-env-ds-py"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}