diff --git a/m04_machine_learning/m04_c03_classification/m04_c03_classification.ipynb b/m04_machine_learning/m04_c03_classification/m04_c03_classification.ipynb index de1328d..7b9cd5c 100644 --- a/m04_machine_learning/m04_c03_classification/m04_c03_classification.ipynb +++ b/m04_machine_learning/m04_c03_classification/m04_c03_classification.ipynb @@ -453,14 +453,102 @@ "metadata": { "Collapsed": "false" }, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
temp_fnm_bad_ringstemp_cis_failureds_failure
053311.671Falla
156113.331Falla
257113.891Falla
363017.220Éxito
466018.890Éxito
\n", + "
" + ], + "text/plain": [ + " temp_f nm_bad_rings temp_c is_failure ds_failure\n", + "0 53 3 11.67 1 Falla\n", + "1 56 1 13.33 1 Falla\n", + "2 57 1 13.89 1 Falla\n", + "3 63 0 17.22 0 Éxito\n", + "4 66 0 18.89 0 Éxito" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Un poco de procesamiento de datos\n", "challenger = challenger.assign(\n", " temp_c=lambda x: ((x[\"temp_f\"] - 32.) / 1.8).round(2),\n", " is_failure=lambda x: x[\"nm_bad_rings\"].ne(0).astype(np.int),\n", " ds_failure=lambda x: x[\"is_failure\"].map({1: \"Falla\", 0:\"Éxito\"})\n", - ")" + ")\n", + "challenger.head()" ] }, { @@ -4194,32 +4282,6 @@ "metadata": { "Collapsed": "false" }, - "outputs": [ - { - "data": { - "text/plain": [ - "array([0.85100948, 0.77559639, 0.74473325, 0.51574021, 0.39116945,\n", - " 0.35232342, 0.35232342, 0.35232342, 0.31468403, 0.27933151,\n", - " 0.24708448, 0.24708448, 0.24708448, 0.24708448, 0.18998087,\n", - " 0.16525956, 0.12395322, 0.12395322, 0.10698086, 0.10698086,\n", - " 0.07864538, 0.06739953, 0.05749692, 0.04911397])" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sigmoid(np.dot(X, theta_l)).flatten()" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": { - "Collapsed": "false" - }, "outputs": [ { "data": { @@ -4291,7 +4353,7 @@ "4 18.89 norm_2_error_prediction 0.354528" ] }, - "execution_count": 18, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" } @@ -4316,7 +4378,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 18, "metadata": { "Collapsed": "false" }, @@ -4840,7 +4902,7 @@ "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, - "execution_count": 19, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" } @@ -4877,7 +4939,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 19, "metadata": { "Collapsed": "false" }, @@ -4886,6 +4948,37 @@ "from sklearn.linear_model import LogisticRegression" ] }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "Collapsed": "false" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", + " intercept_scaling=1, l1_ratio=None, max_iter=100,\n", + " multi_class='auto', n_jobs=None, penalty='l2',\n", + " random_state=None, solver='lbfgs', tol=0.0001, verbose=0,\n", + " warm_start=False)" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "X = challenger[[\"temp_c\"]].values\n", + "y = challenger[\"is_failure\"].values\n", + "\n", + "# Fitting the model\n", + "Logit = LogisticRegression(solver=\"lbfgs\")\n", + "Logit.fit(X, y)" + ] + }, { "cell_type": "code", "execution_count": 21, @@ -4902,13 +4995,6 @@ } ], "source": [ - "X = challenger[[\"temp_c\"]]\n", - "y = challenger[\"is_failure\"]\n", - "\n", - "# Fitting the model\n", - "Logit = LogisticRegression(solver=\"lbfgs\")\n", - "Logit.fit(X, y)\n", - "\n", "# Obtain the coefficients\n", "print(Logit.intercept_, Logit.coef_ )\n", "\n", @@ -5227,7 +5313,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 30, "metadata": { "Collapsed": "false" }, @@ -5282,7 +5368,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 31, "metadata": { "Collapsed": "false" }, @@ -5307,7 +5393,7 @@ }, { "cell_type": "code", - "execution_count": 31, + 
"execution_count": 32, "metadata": { "Collapsed": "false" }, @@ -5331,7 +5417,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 33, "metadata": { "Collapsed": "false" }, diff --git a/m04_machine_learning/m04_c04_metrics_and_model_selection/m04_c04_metrics_and_model_selection.ipynb b/m04_machine_learning/m04_c04_metrics_and_model_selection/m04_c04_metrics_and_model_selection.ipynb index 53e9177..18a38a4 100644 --- a/m04_machine_learning/m04_c04_metrics_and_model_selection/m04_c04_metrics_and_model_selection.ipynb +++ b/m04_machine_learning/m04_c04_metrics_and_model_selection/m04_c04_metrics_and_model_selection.ipynb @@ -147814,6 +147814,15 @@ "chart_C + identity_line_C" ] }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "**Spoiler**: El cálculo de métricas es parte del laboratorio de esta semana." + ] + }, { "cell_type": "markdown", "metadata": { @@ -147826,6 +147835,508 @@ "## Validación Cruzada" ] }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "Aprender/ajustar los parámetros de una función de predicción y probarlo con los mismos datos es un error metodológico: un modelo que simplemente repita las etiquetas de las muestras que acaba de ver tendría una puntuación perfecta pero no podría predecir nada útil, en particular datos nuevos. Esta situación se llama **sobreajuste** (**_overfitting_**). Para evitarlo, una práctica común es cuando se realiza un experimento de aprendizaje automático (supervisado) mantener parte de los datos disponibles como un conjunto de pruebas `X_test`, `y_test`. Es importante tener en cuenta que la palabra _\"experimento\"_ no pretende denotar únicamente el uso académico, porque incluso en entornos comerciales, el aprendizaje automático generalmente comienza de manera experimental. Aquí hay un diagrama de flujo del flujo de trabajo típico de validación cruzada en la capacitación de modelos. Los mejores parámetros se pueden determinar mediante técnicas de búsqueda de cuadrícula (_grid search_).\n", + "\n", + "\"Validacion\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "La validación cruzada (_cross validation_ o solo CV), el más sencillo de todos es la llamada _k-fold_, la cual se inicia mediante el fraccionamiento de un conjunto de datos en un número $k$ de particiones (generalmente entre 5 y 10) llamadas _pliegues_ (_folds_). El proceso consiste en iterar entre los datos de evaluación y entrenamiento $k$ veces, de un modo particular. En cada iteración de la validación cruzada, un pliegue diferente se elige como los datos de evaluación. En esta iteración, los otros pliegues $k-1$ se combinan para formar los datos de entrenamiento. Por lo tanto, en cada iteración tenemos $(k-1) / k$ de los datos utilizados para el entrenamiento y $1 / k$ utilizado para la evaluación.\n", + "\n", + "Cada iteración produce un modelo, y por lo tanto una estimación del rendimiento de la generalización, por ejemplo, una estimación de la precisión. Una vez finalizada la validación cruzada, todos los ejemplos se han utilizado sólo una vez para evaluar pero $k -1$ veces para entrenar. En este punto tenemos estimaciones de rendimiento de todos los pliegues y podemos calcular la media y la desviación estándar de la precisión del modelo." 
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "#### Example with the Iris Dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)targettarget_names
05.13.51.40.20setosa
14.93.01.40.20setosa
24.73.21.30.20setosa
34.63.11.50.20setosa
45.03.61.40.20setosa
\n", + "
" + ], + "text/plain": [ + " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", + "0 5.1 3.5 1.4 0.2 \n", + "1 4.9 3.0 1.4 0.2 \n", + "2 4.7 3.2 1.3 0.2 \n", + "3 4.6 3.1 1.5 0.2 \n", + "4 5.0 3.6 1.4 0.2 \n", + "\n", + " target target_names \n", + "0 0 setosa \n", + "1 0 setosa \n", + "2 0 setosa \n", + "3 0 setosa \n", + "4 0 setosa " + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn import datasets \n", + "\n", + "iris = datasets.load_iris()\n", + "iris_df = (\n", + " pd.DataFrame(iris.data, columns=iris.feature_names)\n", + " .assign(\n", + " target=iris.target,\n", + " target_names=lambda x: x.target.map(dict(zip(range(3), iris.target_names)))\n", + " )\n", + ")\n", + "\n", + "iris_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "Tal como la semana pasada utilizaremos un modelo de Regresión Logística. Un _k-fold_ con $k=1$ es lo mismo que dividir nuestros datos en _train_ y _set_ como se hace usualmente, lo cual no es muy sorprendente." + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": { + "Collapsed": "false" + }, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "X = iris_df.drop(columns=[\"target\", \"target_names\"]).values\n", + "y = iris_df[\"target\"].values\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42) " + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": { + "Collapsed": "false" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[6.1 2.8 4.7 1.2]\n", + " [5.7 3.8 1.7 0.3]\n", + " [7.7 2.6 6.9 2.3]\n", + " [6. 2.9 4.5 1.5]\n", + " [6.8 2.8 4.8 1.4]\n", + " [5.4 3.4 1.5 0.4]\n", + " [5.6 2.9 3.6 1.3]\n", + " [6.9 3.1 5.1 2.3]\n", + " [6.2 2.2 4.5 1.5]\n", + " [5.8 2.7 3.9 1.2]\n", + " [6.5 3.2 5.1 2. ]\n", + " [4.8 3. 1.4 0.1]\n", + " [5.5 3.5 1.3 0.2]\n", + " [4.9 3.1 1.5 0.1]\n", + " [5.1 3.8 1.5 0.3]\n", + " [6.3 3.3 4.7 1.6]\n", + " [6.5 3. 5.8 2.2]\n", + " [5.6 2.5 3.9 1.1]\n", + " [5.7 2.8 4.5 1.3]\n", + " [6.4 2.8 5.6 2.2]\n", + " [4.7 3.2 1.6 0.2]\n", + " [6.1 3. 4.9 1.8]\n", + " [5. 3.4 1.6 0.4]\n", + " [6.4 2.8 5.6 2.1]\n", + " [7.9 3.8 6.4 2. ]\n", + " [6.7 3. 5.2 2.3]\n", + " [6.7 2.5 5.8 1.8]\n", + " [6.8 3.2 5.9 2.3]\n", + " [4.8 3. 
1.4 0.3]\n", + " [4.8 3.1 1.6 0.2]]\n" + ] + } + ], + "source": [ + "print(X_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "En `scikit-learn` la función de K-fold conserva el orden de los datos y no se ve afectado por clases ni grupos.\n", + "\n", + "![](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_0041.png)\n", + "\n", + "Para permutaciones aleatorias existe [ShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit)" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": { + "Collapsed": "false" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "K-fold iteración 1\n", + "\n", + "Train indices:\n", + " [ 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67\n", + " 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85\n", + " 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103\n", + " 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121\n", + " 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139\n", + " 140 141 142 143 144 145 146 147 148 149]\n", + "\n", + "Test indices:\n", + " [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23\n", + " 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47\n", + " 48 49]\n", + "\n", + "\n", + "K-fold iteración 2\n", + "\n", + "Train indices:\n", + " [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17\n", + " 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35\n", + " 36 37 38 39 40 41 42 43 44 45 46 47 48 49 100 101 102 103\n", + " 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121\n", + " 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139\n", + " 140 141 142 143 144 145 146 147 148 149]\n", + "\n", + "Test indices:\n", + " [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73\n", + " 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97\n", + " 98 99]\n", + "\n", + "\n", + "K-fold iteración 3\n", + "\n", + "Train indices:\n", + " [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23\n", + " 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47\n", + " 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71\n", + " 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95\n", + " 96 97 98 99]\n", + "\n", + "Test indices:\n", + " [100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117\n", + " 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135\n", + " 136 137 138 139 140 141 142 143 144 145 146 147 148 149]\n", + "\n", + "\n" + ] + } + ], + "source": [ + "from sklearn.model_selection import KFold\n", + "\n", + "kf = KFold(n_splits=3)\n", + "i = 1\n", + "for train, test in kf.split(X, y):\n", + " print(f\"K-fold iteración {i}\\n\")\n", + " print(f\"Train indices:\\n {train}\\n\")\n", + " print(f\"Test indices:\\n {test}\\n\\n\")\n", + " i += 1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "Para filtrar la data basta con usar los índices que retorna el método `split`." + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": { + "Collapsed": "false" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[6.3 3.3 6. 2.5]\n", + " [5.8 2.7 5.1 1.9]\n", + " [7.1 3. 
5.9 2.1]\n", + " [6.3 2.9 5.6 1.8]\n", + " [6.5 3. 5.8 2.2]\n", + " [7.6 3. 6.6 2.1]\n", + " [4.9 2.5 4.5 1.7]\n", + " [7.3 2.9 6.3 1.8]\n", + " [6.7 2.5 5.8 1.8]\n", + " [7.2 3.6 6.1 2.5]\n", + " [6.5 3.2 5.1 2. ]\n", + " [6.4 2.7 5.3 1.9]\n", + " [6.8 3. 5.5 2.1]\n", + " [5.7 2.5 5. 2. ]\n", + " [5.8 2.8 5.1 2.4]\n", + " [6.4 3.2 5.3 2.3]\n", + " [6.5 3. 5.5 1.8]\n", + " [7.7 3.8 6.7 2.2]\n", + " [7.7 2.6 6.9 2.3]\n", + " [6. 2.2 5. 1.5]\n", + " [6.9 3.2 5.7 2.3]\n", + " [5.6 2.8 4.9 2. ]\n", + " [7.7 2.8 6.7 2. ]\n", + " [6.3 2.7 4.9 1.8]\n", + " [6.7 3.3 5.7 2.1]\n", + " [7.2 3.2 6. 1.8]\n", + " [6.2 2.8 4.8 1.8]\n", + " [6.1 3. 4.9 1.8]\n", + " [6.4 2.8 5.6 2.1]\n", + " [7.2 3. 5.8 1.6]\n", + " [7.4 2.8 6.1 1.9]\n", + " [7.9 3.8 6.4 2. ]\n", + " [6.4 2.8 5.6 2.2]\n", + " [6.3 2.8 5.1 1.5]\n", + " [6.1 2.6 5.6 1.4]\n", + " [7.7 3. 6.1 2.3]\n", + " [6.3 3.4 5.6 2.4]\n", + " [6.4 3.1 5.5 1.8]\n", + " [6. 3. 4.8 1.8]\n", + " [6.9 3.1 5.4 2.1]\n", + " [6.7 3.1 5.6 2.4]\n", + " [6.9 3.1 5.1 2.3]\n", + " [5.8 2.7 5.1 1.9]\n", + " [6.8 3.2 5.9 2.3]\n", + " [6.7 3.3 5.7 2.5]\n", + " [6.7 3. 5.2 2.3]\n", + " [6.3 2.5 5. 1.9]\n", + " [6.5 3. 5.2 2. ]\n", + " [6.2 3.4 5.4 2.3]\n", + " [5.9 3. 5.1 1.8]]\n" + ] + } + ], + "source": [ + "X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]\n", + "print(X_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "Para balancear por clases se utiliza `K-fold Stratificado`, la representación gráfica lo dice todo.\n", + "\n", + "![](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_0071.png)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": { + "Collapsed": "false" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "train - [33 33 34] | test - [17 17 16]\n", + "train - [33 34 33] | test - [17 16 17]\n", + "train - [34 33 33] | test - [16 17 17]\n" + ] + } + ], + "source": [ + "from sklearn.model_selection import StratifiedKFold\n", + "\n", + "skf = StratifiedKFold(n_splits=3)\n", + "for train, test in skf.split(X, y):\n", + " print('train - {} | test - {}'.format(\n", + " np.bincount(y[train]), np.bincount(y[test])))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "Algo como el método `value_counts()` de `pandas` sería lo siguiente:" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": { + "Collapsed": "false" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 16],\n", + " [ 1, 17],\n", + " [ 2, 17]])" + ] + }, + "execution_count": 66, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array(np.unique(y[test], return_counts=True)).T" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "En cambio, si hacemos un k-fold común y corriente:" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": { + "Collapsed": "false" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "train - [ 0 50 50] | test - [50]\n", + "train - [50 0 50] | test - [ 0 50]\n", + "train - [50 50] | test - [ 0 0 50]\n" + ] + } + ], + "source": [ + "kf = KFold(n_splits=3)\n", + "for train, test in kf.split(X, y):\n", + " print('train - {} | test - {}'.format(\n", + " np.bincount(y[train]), np.bincount(y[test])))" + ] + }, { "cell_type": "code", "execution_count": null, @@ -147833,7 +148344,20 @@ "Collapsed": "false" }, 
"outputs": [], - "source": [] + "source": [ + "np.array(np.unique(y[test], return_counts=True)).T" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "No tiene por qué estar balanceado! Y esto repercutirá a la hora del entrenamiento.\n", + "\n", + "Más información y tipos de validación cruzada ya están implementado en `scikit-learn` en el siguiente [link](https://scikit-learn.org/stable/modules/cross_validation.html)." + ] }, { "cell_type": "markdown", @@ -147852,6 +148376,104 @@ "source": [ "## Métricas" ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "Una métrica es una función que define una distancia entre cada par de elementos de un conjunto. Para nuetro caso, se define una función de distancia entre los valores reales ($y$) y los valores predichos ($\\hat{y}$).\n", + "\n", + "Defeniremos algunas métricas bajo dos tipos de contexto: modelos de regresión y modelos de clasificación." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "### Métricas para Regresión" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "Sabemos que los modelos de regresión buscan ajustar un modelo a valores numéricos que no reprensentan etiquetas. La idea es cuantificar el error y seleccionar el mejor modelo. El error corresponde a la diferencia entre el valor original y el valor predicho, es decir:\n", + "\n", + "$$e_{i}=y_{i}-\\hat{y}_{i} $$\n", + "\n", + "\n", + "Lo que se busca es medir el error bajo cierta funciones de distancias o métricas. Dentro de las métricas más populares se encuentran:\n", + "\n", + "* Métricas absolutas: Las métricas absolutas o no escalada miden el error sin escalar los valores. Las métrica absolutas más ocupadas son:\n", + "\n", + " - Mean Absolute Error (MAE)\n", + "\n", + " $$\\textrm{MAE}(y,\\hat{y}) = \\dfrac{1}{n}\\sum_{t=1}^{n}\\left | y_{t}-\\hat{y}_{t}\\right |$$\n", + "\n", + " - Mean squared error (MSE):\n", + " \n", + " $$\\textrm{MSE}(y,\\hat{y}) =\\dfrac{1}{n}\\sum_{t=1}^{n}\\left | y_{t}-\\hat{y}_{t}\\right |^2$$\n", + "\n", + "* Métricas Porcentuales: Las métricas porcentuales o escaladas miden el error de manera escalada, es decir, se busca acotar el error entre valores de 0 a 1, donde 0 significa que el ajuste es perfecto, mientras que 1 sería un mal ajuste. Cabe destacar que muchas veces las métricas porcentuales puden tener valores mayores a 1.\n", + "\n", + " - Mean absolute percentage error (MAPE):\n", + "\n", + " $$\\textrm{MAPE}(y,\\hat{y}) = \\dfrac{1}{n}\\sum_{t=1}^{n}\\left | \\frac{y_{t}-\\hat{y}_{t}}{y_{t}} \\right |$$\n", + "\n", + " - Symmetric mean absolute percentage error (sMAPE):\n", + " $$\\textrm{sMAPE}(y,\\hat{y}) = \\dfrac{1}{n}\\sum_{t=1}^{n} \\frac{\\left |y_{t}-\\hat{y}_{t}\\right |}{(\\left | y_{t} \\right |^2+\\left | \\hat{y}_{t} \\right |^2)/2}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "### Métricas para Clasificación" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "Collapsed": "false" + }, + "source": [ + "Sabemos que los modelos de clasificación etiquetan a los datos a partir del entrenamiento. Por lo tanto es necesario introducir nuevos conceptos.\n", + "\n", + "Uno de ellos es la matriz de confusión (vista en la clase pasada). 
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "Collapsed": "false"
+   },
+   "source": [
+    "These are the most common ones and, as you might imagine, `scikit-learn` ships a whole arsenal for model evaluation and selection at this [link](https://scikit-learn.org/stable/modules/model_evaluation.html)."
+   ]
+  }
 ],
 "metadata": {