make book
katieburak committed Jul 11, 2024
1 parent 023fbb6 commit b4ab9c8
Showing 30 changed files with 4,220 additions and 559 deletions.
Binary file modified _build/.doctrees/README.doctree
Binary file not shown.
Binary file modified _build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified _build/.doctrees/notes/day1.doctree
Binary file not shown.
Binary file modified _build/.doctrees/notes/day2.doctree
Binary file not shown.
Binary file modified _build/.doctrees/notes/day3.doctree
Binary file not shown.
3 changes: 2 additions & 1 deletion _build/html/README.html
@@ -164,6 +164,7 @@
<li class="toctree-l1"><a class="reference internal" href="notes/day2.html">Day 2 - Girls in Data Science</a></li>



<li class="toctree-l1"><a class="reference internal" href="notes/day3.html">Day 3 - Girls in Data Science</a></li>


@@ -415,7 +416,7 @@ <h2>Topic Overview<a class="headerlink" href="#topic-overview" title="Permalink
</tr>
</tbody>
</table>
<p>A Jupyter notebook of the material is available <a class="reference external" href="https://katieburak.github.io/girls-in-DS/notes/day2.html">here</a>.</p>
<p>A Jupyter notebook of the material is available <a class="reference external" href="https://katieburak.github.io/girls-in-DS/README.html">here</a>.</p>
</section>
<section id="references">
<h2>References<a class="headerlink" href="#references" title="Permalink to this heading">#</a></h2>
2 changes: 1 addition & 1 deletion _build/html/_sources/README.md
@@ -22,7 +22,7 @@ Lunch will be provided each day for the participants.
| Day 2 | Measures of central tendency and spread, statistical inference and sampling, observational studies vs. experiments |
| Day 3 | Machine learning fundamentals, answering predictive questions (regression, classification) |

A Jupyter notebook of the material is available [here](https://katieburak.github.io/girls-in-DS/notes/day2.html).
A Jupyter notebook of the material is available [here](https://katieburak.github.io/girls-in-DS/README.html).

## References

144 changes: 125 additions & 19 deletions _build/html/_sources/notes/day1.ipynb

Large diffs are not rendered by default.

1,104 changes: 1,025 additions & 79 deletions _build/html/_sources/notes/day2.ipynb

Large diffs are not rendered by default.

218 changes: 155 additions & 63 deletions _build/html/_sources/notes/day3.ipynb
@@ -45,7 +45,7 @@
"\u001b[31m✖\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mlag()\u001b[39m masks \u001b[34mstats\u001b[39m::lag()\n",
"\u001b[31m✖\u001b[39m \u001b[34myardstick\u001b[39m::\u001b[32mspec()\u001b[39m masks \u001b[34mreadr\u001b[39m::spec()\n",
"\u001b[31m✖\u001b[39m \u001b[34mrecipes\u001b[39m::\u001b[32mstep()\u001b[39m masks \u001b[34mstats\u001b[39m::step()\n",
"\u001b[34m•\u001b[39m Dig deeper into tidy modeling with R at \u001b[32mhttps://www.tmwr.org\u001b[39m\n",
"\u001b[34m•\u001b[39m Learn how to get started at \u001b[32mhttps://www.tidymodels.org/start/\u001b[39m\n",
"\n",
"\n",
"Attaching package: ‘palmerpenguins’\n",
@@ -117,12 +117,49 @@
"Descriptive, exploratory, predictive, inferential, causal, or mechanistic?"
]
},
{
"cell_type": "markdown",
"id": "4d1ad8c2-49ed-44f0-84f3-3948d5fd1408",
"metadata": {
"tags": []
},
"source": [
"## 1.2 Training and Test Sets\n",
"\n",
"When building a machine learning model, typically we start by dividing our data into two sets:\n",
"1) Training set\n",
"2) Test set\n",
" \n",
"The **training set** is a subset of our data that is used to train or teach our model to perform some sort of predictive task. Then, using our **test set**, we can evaluate how well our model performs on unseen data. \n"
]
},
{
"cell_type": "markdown",
"id": "fe206660-2ee8-4c43-9d6d-c3339f3cf870",
"metadata": {},
"source": [
"There are two important things to do when splitting data.\n",
"\n",
"1. **Shuffling:** randomly reorder the data before splitting\n",
"2. **Stratification:** make sure the two split subsets of data have roughly equal proportions of the different labels\n",
"\n",
"<center>\n",
"<img src=\"https://datasciencebook.ca/img/classification2/training_test.png\" width=\"1100\"/>\n",
"</center>\n",
"\n",
"#### Golden Rule of Machine Learning / Statistics:\n",
"\n",
"**Don't use your testing data to train your model!**\n",
"\n",
"Showing your classifier the labels of evaluation data is like cheating on a test; it'll look more accurate than it really is."
]
},
{
"cell_type": "markdown",
"id": "3a604659-c9c4-4b17-9f1a-8c7f5cb8d3a8",
"metadata": {},
"source": [
"## 1.2 K-nearest neighbours classification\n",
"## 1.3 K-nearest neighbours classification\n",
"\n",
"*Predict the label / class for a new observation using the K closest points from our dataset.*\n"
]
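The shuffled, stratified split described in the new section 1.2 of this notebook can be sketched as follows. This is a minimal illustration, not code from the commit: it assumes the `rsample` package (part of `tidymodels`) and the `penguins` data from `palmerpenguins`, which the notebook attaches. The seed, the 75/25 proportion, and the object names are arbitrary choices made for the example.

```r
library(rsample)          # provides initial_split(), training(), testing()
library(palmerpenguins)   # provides the penguins data frame

set.seed(2024)  # arbitrary seed so the random shuffle is reproducible

# initial_split() assigns rows to the two sets at random (shuffling) and,
# because `strata = species`, keeps the label proportions roughly equal
# in both sets (stratification).
penguin_split <- initial_split(penguins, prop = 0.75, strata = species)
penguin_train <- training(penguin_split)
penguin_test  <- testing(penguin_split)

# Compare the label proportions in the two sets; they should be close.
train_props <- prop.table(table(penguin_train$species))
test_props  <- prop.table(table(penguin_test$species))
print(train_props)
print(test_props)
```

Only `penguin_train` would be passed to model fitting; `penguin_test` stays untouched until evaluation, in line with the golden rule quoted in the diff.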
@@ -215,7 +252,7 @@
"id": "5de05126-4608-4a1b-8f3b-dcedb36821d0",
"metadata": {},
"source": [
"## 1.3 Standardizing Data\n",
"## 1.4 Standardizing Data\n",
"<center><img src=\"img/scaling_example1.png\" width=\"600\"/></center>"
]
},
@@ -287,7 +324,7 @@
"id": "e27ffaac-1825-407a-8bd5-2dbc29006566",
"metadata": {},
"source": [
"## 1.4 `tidymodels` package in R\n",
"## 1.5 `tidymodels` package in R\n",
"\n",
"`tidymodels` is a collection of packages that handles computing distances, standardization, balancing, and prediction for us!\n",
"\n",
@@ -659,7 +696,7 @@
"id": "8a5c9ef2-9bb3-4f3c-a1cc-4fcd15728997",
"metadata": {},
"source": [
"## 1.5 How to measure classifier performance?\n",
"## 1.6 How to measure classifier performance?\n",
"<br>\n",
"\n",
"### Accuracy\n",
@@ -805,57 +842,6 @@
"We'll now talk about these steps below."
]
},
{
"cell_type": "markdown",
"id": "4d1ad8c2-49ed-44f0-84f3-3948d5fd1408",
"metadata": {
"tags": []
},
"source": [
"## 1.6 Training and Test Sets\n",
"\n",
"When building a machine learning model, typically we start by dividing our data into two sets:\n",
"1) Training set\n",
"2) Test set\n",
" \n",
"The **training set** is a subset of our data that is used to train or teach our model to perform some sort of predictive task. Then, using our **test set**, we can evaluate how well our model performs on unseen data. \n"
]
},
{
"cell_type": "markdown",
"id": "905177de-cdd2-46dc-b468-08dfdf6c0ad3",
"metadata": {},
"source": [
"There are two important things to do when splitting data.\n",
"\n",
"1. **Shuffling:** randomly reorder the data before splitting\n",
"2. **Stratification:** make sure the two split subsets of data have roughly equal proportions of the different labels\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "59937715-de40-404f-b9f7-40249fc10dc9",
"metadata": {},
"source": [
"\n",
"<center>\n",
"<img src=\"https://datasciencebook.ca/img/classification2/training_test.png\" width=\"1100\"/>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"id": "278edce1-f89a-4fb3-835e-9ce6cb077c7d",
"metadata": {},
"source": [
"#### Golden Rule of Machine Learning / Statistics:\n",
"\n",
"**Don't use your testing data to train your model!**\n",
"\n",
"Showing your classifier the labels of evaluation data is like cheating on a test; it'll look more accurate than it really is."
]
},
{
"cell_type": "markdown",
"id": "cf171c32-67c8-474a-9e8d-25f42b8aee43",
@@ -1536,7 +1522,113 @@
"source": [
"#### Question\n",
"\n",
"What are the precision and recall for the classifier on the test data?"
"What are the precision and recall for the classifier on the test data?\n",
"\n",
"We can use the precision and recall functions from `tidymodels`."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "4da417d1-5f54-4332-95c1-b3f36e97eeeb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"dataframe\">\n",
"<caption>A tibble: 1 × 3</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>.metric</th><th scope=col>.estimator</th><th scope=col>.estimate</th></tr>\n",
"\t<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>precision</td><td>binary</td><td>0.84</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"A tibble: 1 × 3\n",
"\\begin{tabular}{lll}\n",
" .metric & .estimator & .estimate\\\\\n",
" <chr> & <chr> & <dbl>\\\\\n",
"\\hline\n",
"\t precision & binary & 0.84\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 1 × 3\n",
"\n",
"| .metric &lt;chr&gt; | .estimator &lt;chr&gt; | .estimate &lt;dbl&gt; |\n",
"|---|---|---|\n",
"| precision | binary | 0.84 |\n",
"\n"
],
"text/plain": [
" .metric .estimator .estimate\n",
"1 precision binary 0.84 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tumor_test_predictions |>\n",
" precision(truth = Class, estimate = .pred_class, event_level = \"first\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "29bc5624-c9a5-4cbe-bcb2-b78a9645b537",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"dataframe\">\n",
"<caption>A tibble: 1 × 3</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>.metric</th><th scope=col>.estimator</th><th scope=col>.estimate</th></tr>\n",
"\t<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>recall</td><td>binary</td><td>0.7924528</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"A tibble: 1 × 3\n",
"\\begin{tabular}{lll}\n",
" .metric & .estimator & .estimate\\\\\n",
" <chr> & <chr> & <dbl>\\\\\n",
"\\hline\n",
"\t recall & binary & 0.7924528\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 1 × 3\n",
"\n",
"| .metric &lt;chr&gt; | .estimator &lt;chr&gt; | .estimate &lt;dbl&gt; |\n",
"|---|---|---|\n",
"| recall | binary | 0.7924528 |\n",
"\n"
],
"text/plain": [
" .metric .estimator .estimate\n",
"1 recall binary 0.7924528"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tumor_test_predictions |>\n",
" recall(truth = Class, estimate = .pred_class, event_level = \"first\")"
]
},
{
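The two cells above call `precision()` and `recall()` on `tumor_test_predictions`, which is not reproduced in this diff. As a hedged sketch of what those functions compute, the same two metrics can be worked out by hand on a small made-up data frame and compared with the `yardstick` results. The ten rows, the level names, and the object names below are hypothetical, not the notebook's data. As in the notebook, the first factor level is treated as the positive class (`event_level = "first"`).

```r
library(yardstick)  # provides precision() and recall()

# Hypothetical toy predictions; NOT the notebook's tumor data.
toy_predictions <- data.frame(
  Class = factor(
    c("Malignant", "Malignant", "Malignant", "Malignant",
      "Benign", "Benign", "Benign", "Benign", "Benign", "Benign"),
    levels = c("Malignant", "Benign")
  ),
  .pred_class = factor(
    c("Malignant", "Malignant", "Malignant", "Benign",
      "Malignant", "Malignant", "Benign", "Benign", "Benign", "Benign"),
    levels = c("Malignant", "Benign")
  )
)

# The positive class is the first factor level.
positive <- levels(toy_predictions$Class)[1]
predicted_pos <- toy_predictions$.pred_class == positive
actual_pos    <- toy_predictions$Class == positive

true_pos  <- sum(predicted_pos & actual_pos)    # correctly flagged positives
false_pos <- sum(predicted_pos & !actual_pos)   # negatives flagged as positive
false_neg <- sum(!predicted_pos & actual_pos)   # positives that were missed

# precision: of everything predicted positive, what fraction was positive?
precision_by_hand <- true_pos / (true_pos + false_pos)
# recall: of everything actually positive, what fraction did we find?
recall_by_hand <- true_pos / (true_pos + false_neg)

# The same metrics via yardstick, using the call shape shown in the notebook.
precision_yardstick <- precision(toy_predictions, truth = Class,
                                 estimate = .pred_class,
                                 event_level = "first")$.estimate
recall_yardstick <- recall(toy_predictions, truth = Class,
                           estimate = .pred_class,
                           event_level = "first")$.estimate
```

On this toy data there are 3 true positives, 2 false positives, and 1 false negative, so precision is 3/5 and recall is 3/4.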
@@ -1567,7 +1659,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 20,
"id": "cff7e157-d2b7-451e-8613-deabd3fdf6ce",
"metadata": {
"tags": []
@@ -1658,7 +1750,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 21,
"id": "21f1e6a1-d095-4278-92fe-3792a58d6273",
"metadata": {
"tags": []
@@ -1855,7 +1947,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 22,
"id": "f48976ac-de94-43b2-9027-50a3516b31f8",
"metadata": {
"tags": []
@@ -1933,7 +2025,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 23,
"id": "63263f8f-e9cc-4992-a5dc-ee6ae0fdae9f",
"metadata": {
"tags": []
@@ -2005,7 +2097,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 24,
"id": "dc937aa9-49e9-419f-99c9-c33c6f967f20",
"metadata": {
"tags": []
@@ -2075,7 +2167,7 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 25,
"id": "8cb5c51a-15b9-4984-9b9b-8375a5d938d7",
"metadata": {
"tags": []
1 change: 1 addition & 0 deletions _build/html/genindex.html
@@ -162,6 +162,7 @@
<li class="toctree-l1"><a class="reference internal" href="notes/day2.html">Day 2 - Girls in Data Science</a></li>



<li class="toctree-l1"><a class="reference internal" href="notes/day3.html">Day 3 - Girls in Data Science</a></li>

