make book

katieburak · Jul 11, 2024 · b4ab9c8
1 parent 023fbb6
commit b4ab9c8
Showing 30 changed files with 4,220 additions and 559 deletions.
diff --git a/_build/.doctrees/README.doctree b/_build/.doctrees/README.doctree
diff --git a/_build/.doctrees/environment.pickle b/_build/.doctrees/environment.pickle
diff --git a/_build/.doctrees/notes/day1.doctree b/_build/.doctrees/notes/day1.doctree
diff --git a/_build/.doctrees/notes/day2.doctree b/_build/.doctrees/notes/day2.doctree
diff --git a/_build/.doctrees/notes/day3.doctree b/_build/.doctrees/notes/day3.doctree
diff --git a/_build/html/README.html b/_build/html/README.html
@@ -164,6 +164,7 @@
 <li class="toctree-l1"><a class="reference internal" href="notes/day2.html">Day 2 - Girls in Data Science</a></li>
 
 
+
 <li class="toctree-l1"><a class="reference internal" href="notes/day3.html">Day 3 - Girls in Data Science</a></li>
 
 
@@ -415,7 +416,7 @@ <h2>Topic Overview<a class="headerlink" href="#topic-overview" title="Permalink
 </tr>
 </tbody>
 </table>
-<p>A Jupyter notebook of the material is available <a class="reference external" href="https://katieburak.github.io/girls-in-DS/notes/day2.html">here</a>.</p>
+<p>A Jupyter notebook of the material is available <a class="reference external" href="https://katieburak.github.io/girls-in-DS/README.html">here</a>.</p>
 </section>
 <section id="references">
 <h2>References<a class="headerlink" href="#references" title="Permalink to this heading">#</a></h2>

diff --git a/_build/html/_sources/README.md b/_build/html/_sources/README.md
@@ -22,7 +22,7 @@ Lunch will be provided each day for the participants.
 | Day 2 | Measures of central tendency and spread, statistical inference and sampling, observational studies vs. experiments | 
 | Day 3 | Machine learning fundamentals, answering predictive questions (regression, classification) | 
 
-A Jupyter notebook of the material is available [here](https://katieburak.github.io/girls-in-DS/notes/day2.html).
+A Jupyter notebook of the material is available [here](https://katieburak.github.io/girls-in-DS/README.html).
 
 ## References 
 

diff --git a/_build/html/_sources/notes/day1.ipynb b/_build/html/_sources/notes/day1.ipynb
diff --git a/_build/html/_sources/notes/day2.ipynb b/_build/html/_sources/notes/day2.ipynb
diff --git a/_build/html/_sources/notes/day3.ipynb b/_build/html/_sources/notes/day3.ipynb
@@ -45,7 +45,7 @@
       "\u001b[31m✖\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mlag()\u001b[39m      masks \u001b[34mstats\u001b[39m::lag()\n",
       "\u001b[31m✖\u001b[39m \u001b[34myardstick\u001b[39m::\u001b[32mspec()\u001b[39m masks \u001b[34mreadr\u001b[39m::spec()\n",
       "\u001b[31m✖\u001b[39m \u001b[34mrecipes\u001b[39m::\u001b[32mstep()\u001b[39m   masks \u001b[34mstats\u001b[39m::step()\n",
-      "\u001b[34m•\u001b[39m Dig deeper into tidy modeling with R at \u001b[32mhttps://www.tmwr.org\u001b[39m\n",
+      "\u001b[34m•\u001b[39m Learn how to get started at \u001b[32mhttps://www.tidymodels.org/start/\u001b[39m\n",
       "\n",
       "\n",
       "Attaching package: ‘palmerpenguins’\n",
@@ -117,12 +117,49 @@
     "Descriptive, exploratory, predictive, inferential, causal, or mechanistic?"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "4d1ad8c2-49ed-44f0-84f3-3948d5fd1408",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## 1.2 Training and Test Sets\n",
+    "\n",
+    "When building a machine learning model, typically we start by dividing our data into two sets:\n",
+    "1) Training set\n",
+    "2) Test set\n",
+    "    \n",
+    "The **training set** is a subset of our data that is used to train or teach our model to perform some sort of predictive task. Then, using our **test set**, we can evaluate how well our model performs on unseen data. \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe206660-2ee8-4c43-9d6d-c3339f3cf870",
+   "metadata": {},
+   "source": [
+    "There are two important things to do when splitting data.\n",
+    "\n",
+    "1. **Shuffling:** randomly reorder the data before splitting\n",
+    "2. **Stratification:** make sure the training and test sets each contain roughly the same proportion of each label as the full dataset\n",
+    "\n",
+    "<center>\n",
+    "<img src=\"https://datasciencebook.ca/img/classification2/training_test.png\" width=\"1100\"/>\n",
+    "</center>\n",
+    "\n",
+    "#### Golden Rule of Machine Learning / Statistics:\n",
+    "\n",
+    "**Don't use your testing data to train your model!**\n",
+    "\n",
+    "Showing your classifier the labels of evaluation data is like cheating on a test; it'll look more accurate than it really is."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "3a604659-c9c4-4b17-9f1a-8c7f5cb8d3a8",
    "metadata": {},
    "source": [
-    "## 1.2 K-nearest neighbours classification\n",
+    "## 1.3 K-nearest neighbours classification\n",
     "\n",
     "*Predict the label / class for a new observation using the K closest points from our dataset.*\n"
    ]
@@ -215,7 +252,7 @@
    "id": "5de05126-4608-4a1b-8f3b-dcedb36821d0",
    "metadata": {},
    "source": [
-    "## 1.3 Standardizing Data\n",
+    "## 1.4 Standardizing Data\n",
     "<center><img src=\"img/scaling_example1.png\" width=\"600\"/></center>"
    ]
   },
@@ -287,7 +324,7 @@
    "id": "e27ffaac-1825-407a-8bd5-2dbc29006566",
    "metadata": {},
    "source": [
-    "## 1.4 `tidymodels` package in R\n",
+    "## 1.5 `tidymodels` package in R\n",
     "\n",
     "`tidymodels` is a collection of packages and handles computing distances, standardization, balancing, and prediction for us!\n",
     "\n",
@@ -659,7 +696,7 @@
    "id": "8a5c9ef2-9bb3-4f3c-a1cc-4fcd15728997",
    "metadata": {},
    "source": [
-    "## 1.5 How to measure classifier performance?\n",
+    "## 1.6 How to measure classifier performance?\n",
     "</br>\n",
     "\n",
     "### Accuracy\n",
@@ -805,57 +842,6 @@
     "We'll now talk about these steps below."
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "4d1ad8c2-49ed-44f0-84f3-3948d5fd1408",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "## 1.6 Training and Test Sets\n",
-    "\n",
-    "When building a machine learning model, typically we start by dividing our data into two sets:\n",
-    "1) Training set\n",
-    "2) Test set\n",
-    "    \n",
-    "The **training set** is a subset of our data that is used is to train or teach our model to perform sort of predictive task. Then, using our **test set**, we can evaluate how well our model performs on unseen data. \n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "905177de-cdd2-46dc-b468-08dfdf6c0ad3",
-   "metadata": {},
-   "source": [
-    "There are two important things to do when splitting data.\n",
-    "\n",
-    "1. **Shuffling:** randomly reorder the data before splitting\n",
-    "2. **Stratification:** make sure the two split subsets of data have roughly equal proportions of the different labels\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "59937715-de40-404f-b9f7-40249fc10dc9",
-   "metadata": {},
-   "source": [
-    "\n",
-    "<center>\n",
-    "<img src=\"https://datasciencebook.ca/img/classification2/training_test.png\" width=\"1100\"/>\n",
-    "</center>"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "278edce1-f89a-4fb3-835e-9ce6cb077c7d",
-   "metadata": {},
-   "source": [
-    "#### Golden Rule of Machine Learning / Statistics:\n",
-    "\n",
-    "**Don't use your testing data to train your model!**\n",
-    "\n",
-    "Showing your classifier the labels of evaluation data is like cheating on a test; it'll look more accurate than it really is."
-   ]
-  },
   {
    "cell_type": "markdown",
    "id": "cf171c32-67c8-474a-9e8d-25f42b8aee43",
@@ -1536,7 +1522,113 @@
    "source": [
     "#### Question\n",
     "\n",
-    "What are the precision and recall for the classifier on the test data?"
+    "What are the precision and recall for the classifier on the test data?\n",
+    "\n",
+    "We can use the precision and recall functions from `tidymodels`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "4da417d1-5f54-4332-95c1-b3f36e97eeeb",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<table class=\"dataframe\">\n",
+       "<caption>A tibble: 1 × 3</caption>\n",
+       "<thead>\n",
+       "\t<tr><th scope=col>.metric</th><th scope=col>.estimator</th><th scope=col>.estimate</th></tr>\n",
+       "\t<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
+       "</thead>\n",
+       "<tbody>\n",
+       "\t<tr><td>precision</td><td>binary</td><td>0.84</td></tr>\n",
+       "</tbody>\n",
+       "</table>\n"
+      ],
+      "text/latex": [
+       "A tibble: 1 × 3\n",
+       "\\begin{tabular}{lll}\n",
+       " .metric & .estimator & .estimate\\\\\n",
+       " <chr> & <chr> & <dbl>\\\\\n",
+       "\\hline\n",
+       "\t precision & binary & 0.84\\\\\n",
+       "\\end{tabular}\n"
+      ],
+      "text/markdown": [
+       "\n",
+       "A tibble: 1 × 3\n",
+       "\n",
+       "| .metric &lt;chr&gt; | .estimator &lt;chr&gt; | .estimate &lt;dbl&gt; |\n",
+       "|---|---|---|\n",
+       "| precision | binary | 0.84 |\n",
+       "\n"
+      ],
+      "text/plain": [
+       "  .metric   .estimator .estimate\n",
+       "1 precision binary     0.84     "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "tumor_test_predictions |>\n",
+    "  precision(truth = Class, estimate = .pred_class, event_level = \"first\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "29bc5624-c9a5-4cbe-bcb2-b78a9645b537",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<table class=\"dataframe\">\n",
+       "<caption>A tibble: 1 × 3</caption>\n",
+       "<thead>\n",
+       "\t<tr><th scope=col>.metric</th><th scope=col>.estimator</th><th scope=col>.estimate</th></tr>\n",
+       "\t<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
+       "</thead>\n",
+       "<tbody>\n",
+       "\t<tr><td>recall</td><td>binary</td><td>0.7924528</td></tr>\n",
+       "</tbody>\n",
+       "</table>\n"
+      ],
+      "text/latex": [
+       "A tibble: 1 × 3\n",
+       "\\begin{tabular}{lll}\n",
+       " .metric & .estimator & .estimate\\\\\n",
+       " <chr> & <chr> & <dbl>\\\\\n",
+       "\\hline\n",
+       "\t recall & binary & 0.7924528\\\\\n",
+       "\\end{tabular}\n"
+      ],
+      "text/markdown": [
+       "\n",
+       "A tibble: 1 × 3\n",
+       "\n",
+       "| .metric &lt;chr&gt; | .estimator &lt;chr&gt; | .estimate &lt;dbl&gt; |\n",
+       "|---|---|---|\n",
+       "| recall | binary | 0.7924528 |\n",
+       "\n"
+      ],
+      "text/plain": [
+       "  .metric .estimator .estimate\n",
+       "1 recall  binary     0.7924528"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "tumor_test_predictions |>\n",
+    "  recall(truth = Class, estimate = .pred_class, event_level = \"first\")"
    ]
   },
   {
@@ -1567,7 +1659,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": 20,
    "id": "cff7e157-d2b7-451e-8613-deabd3fdf6ce",
    "metadata": {
     "tags": []
@@ -1658,7 +1750,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 19,
+   "execution_count": 21,
    "id": "21f1e6a1-d095-4278-92fe-3792a58d6273",
    "metadata": {
     "tags": []
@@ -1855,7 +1947,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 20,
+   "execution_count": 22,
    "id": "f48976ac-de94-43b2-9027-50a3516b31f8",
    "metadata": {
     "tags": []
@@ -1933,7 +2025,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 21,
+   "execution_count": 23,
    "id": "63263f8f-e9cc-4992-a5dc-ee6ae0fdae9f",
    "metadata": {
     "tags": []
@@ -2005,7 +2097,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 22,
+   "execution_count": 24,
    "id": "dc937aa9-49e9-419f-99c9-c33c6f967f20",
    "metadata": {
     "tags": []
@@ -2075,7 +2167,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 23,
+   "execution_count": 25,
    "id": "8cb5c51a-15b9-4984-9b9b-8375a5d938d7",
    "metadata": {
     "tags": []
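
Note on the precision and recall cells added in this commit: precision is TP / (TP + FP) and recall is TP / (TP + FN), where the positive class is the one selected by `event_level = "first"`. The block below is a minimal plain-Python sketch of that arithmetic, independent of the notebook's R code. The function name `precision_recall` and the counts (42 true positives, 8 false positives, 11 false negatives) are assumptions, chosen only because they reproduce the reported 0.84 and 0.7924528. The actual confusion matrix is not shown in this diff.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from confusion-matrix counts."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    # Guard against division by zero when a class is never predicted or never present.
    precision = true_positives / predicted_positive if predicted_positive else float("nan")
    recall = true_positives / actual_positive if actual_positive else float("nan")
    return precision, recall


# Hypothetical counts (an assumption, not taken from the notebook output).
precision, recall = precision_recall(
    true_positives=42, false_positives=8, false_negatives=11
)
print(round(precision, 2), round(recall, 7))  # prints 0.84 0.7924528
```

Precision answers "of the observations predicted positive, how many really were?", while recall answers "of the truly positive observations, how many did the classifier find?".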

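Note on the "Training and Test Sets" cells moved in this commit: shuffling and stratification can be combined by shuffling the rows within each label and then cutting each label's rows at the same fraction. The block below is a rough plain-Python sketch of that idea, not the notebook's `tidymodels` code. The function name `stratified_split`, the 75% training fraction, the seed, and the label counts are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_indices, test_indices), stratified by label."""
    rng = random.Random(seed)

    # Group row indices by their label.
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    train, test = [], []
    for indices in indices_by_label.values():
        rng.shuffle(indices)  # shuffling: random order within each label
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])  # stratification: same fraction per label
        test.extend(indices[cut:])

    # Shuffle again so rows are not grouped by label in the output.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical data: 40 rows of one label and 60 of another.
labels = ["Malignant"] * 40 + ["Benign"] * 60
train, test = stratified_split(labels)
print(len(train), len(test))  # prints 75 25
```

With these counts, both the training and test sets keep the same 40% share of "Malignant" rows as the full data, and no row appears in both sets, which is what the golden rule requires.
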
diff --git a/_build/html/genindex.html b/_build/html/genindex.html
@@ -162,6 +162,7 @@
 <li class="toctree-l1"><a class="reference internal" href="notes/day2.html">Day 2 - Girls in Data Science</a></li>
 
 
+
 <li class="toctree-l1"><a class="reference internal" href="notes/day3.html">Day 3 - Girls in Data Science</a></li>