day3 edits
katieburak committed Jul 10, 2024
1 parent c5ee867 commit 252e55f
Showing 6 changed files with 340 additions and 156 deletions.
256 changes: 175 additions & 81 deletions notes/.ipynb_checkpoints/day3-checkpoint.ipynb

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"data":{"layout-restorer:data":{"main":{"dock":{"type":"tab-area","currentIndex":0,"widgets":["notebook:day1.ipynb"]},"current":"notebook:day2.ipynb"},"down":{"size":0,"widgets":[]},"left":{"collapsed":true,"visible":false,"widgets":["filebrowser","running-sessions","@jupyterlab/toc:plugin","extensionmanager.main-view"],"widgetStates":{"jp-running-sessions":{"sizes":[0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666],"expansionStates":[false,false,false,false,false,false]},"extensionmanager.main-view":{"sizes":[0.3333333333333333,0.3333333333333333,0.3333333333333333],"expansionStates":[false,false,false]}}},"right":{"collapsed":true,"visible":false,"widgets":["jp-property-inspector","debugger-sidebar"],"widgetStates":{"jp-debugger-sidebar":{"sizes":[0.2,0.2,0.2,0.2,0.2],"expansionStates":[false,false,false,false,false]}}},"relativeSizes":[0,1,0],"top":{"simpleVisibility":true}},"file-browser-filebrowser:cwd":{"path":""},"notebook:day1.ipynb":{"data":{"path":"day1.ipynb","factory":"Notebook"}},"docmanager:recents":{"opened":[{"path":"","contentType":"directory","root":"~/Desktop/MDS/girls-in-DS/notes"},{"path":"day2.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/notes"},{"path":"day1.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/notes"}],"closed":[]},"notebook:day2.ipynb":{"data":{"path":"day2.ipynb","factory":"Notebook"}}},"metadata":{"id":"default"}}
{"data":{"layout-restorer:data":{"main":{"dock":{"type":"tab-area","currentIndex":0,"widgets":["notebook:day1.ipynb"]},"current":"notebook:day1.ipynb"},"down":{"size":0,"widgets":[]},"left":{"collapsed":true,"visible":false,"widgets":["filebrowser","running-sessions","@jupyterlab/toc:plugin","extensionmanager.main-view"],"widgetStates":{"jp-running-sessions":{"sizes":[0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666],"expansionStates":[false,false,false,false,false,false]},"extensionmanager.main-view":{"sizes":[0.3333333333333333,0.3333333333333333,0.3333333333333333],"expansionStates":[false,false,false]}}},"right":{"collapsed":true,"visible":false,"widgets":["jp-property-inspector","debugger-sidebar"],"widgetStates":{"jp-debugger-sidebar":{"sizes":[0.2,0.2,0.2,0.2,0.2],"expansionStates":[false,false,false,false,false]}}},"relativeSizes":[0,1,0],"top":{"simpleVisibility":true}},"file-browser-filebrowser:cwd":{"path":""},"notebook:day1.ipynb":{"data":{"path":"day1.ipynb","factory":"Notebook"}},"docmanager:recents":{"opened":[{"path":"","contentType":"directory","root":"~/Desktop/MDS/girls-in-DS/notes"},{"path":"day1.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/notes"},{"path":"day3.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/notes"},{"path":"day2.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/notes"}],"closed":[]},"notebook:day2.ipynb":{"data":{"path":"day2.ipynb","factory":"Notebook"}},"notebook:day3.ipynb":{"data":{"path":"day3.ipynb","factory":"Notebook"}}},"metadata":{"id":"default"}}
218 changes: 155 additions & 63 deletions notes/day3.ipynb
@@ -45,7 +45,7 @@
"\u001b[31m✖\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mlag()\u001b[39m masks \u001b[34mstats\u001b[39m::lag()\n",
"\u001b[31m✖\u001b[39m \u001b[34myardstick\u001b[39m::\u001b[32mspec()\u001b[39m masks \u001b[34mreadr\u001b[39m::spec()\n",
"\u001b[31m✖\u001b[39m \u001b[34mrecipes\u001b[39m::\u001b[32mstep()\u001b[39m masks \u001b[34mstats\u001b[39m::step()\n",
"\u001b[34m•\u001b[39m Dig deeper into tidy modeling with R at \u001b[32mhttps://www.tmwr.org\u001b[39m\n",
"\u001b[34m•\u001b[39m Learn how to get started at \u001b[32mhttps://www.tidymodels.org/start/\u001b[39m\n",
"\n",
"\n",
"Attaching package: ‘palmerpenguins’\n",
@@ -117,12 +117,49 @@
"Descriptive, exploratory, predictive, inferential, causal, or mechanistic?"
]
},
{
"cell_type": "markdown",
"id": "4d1ad8c2-49ed-44f0-84f3-3948d5fd1408",
"metadata": {
"tags": []
},
"source": [
"## 1.2 Training and Test Sets\n",
"\n",
"When building a machine learning model, typically we start by dividing our data into two sets:\n",
"1) Training set\n",
"2) Test set\n",
" \n",
"The **training set** is the subset of our data that is used to train (teach) our model to perform some sort of predictive task. Then, using our **test set**, we can evaluate how well our model performs on unseen data. \n"
]
},
{
"cell_type": "markdown",
"id": "fe206660-2ee8-4c43-9d6d-c3339f3cf870",
"metadata": {},
"source": [
"There are two important things to do when splitting data.\n",
"\n",
"1. **Shuffling:** randomly reorder the data before splitting\n",
"2. **Stratification:** make sure the two split subsets of data have roughly equal proportions of the different labels\n",
"\n",
"<center>\n",
"<img src=\"https://datasciencebook.ca/img/classification2/training_test.png\" width=\"1100\"/>\n",
"</center>\n",
"\n",
"#### Golden Rule of Machine Learning / Statistics:\n",
"\n",
"**Don't use your testing data to train your model!**\n",
"\n",
"Showing your classifier the labels of evaluation data is like cheating on a test; it'll look more accurate than it really is."
]
},
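A minimal sketch of a shuffled, stratified split, assuming the `rsample` package (loaded as part of `tidymodels`) is installed. The built-in `iris` data and the names `toy_split`, `toy_train`, and `toy_test` are stand-ins chosen for illustration; they do not appear in the notebook.

```r
library(rsample)  # part of tidymodels; provides initial_split()

set.seed(2024)  # make the random shuffle reproducible

# `iris` is only a stand-in data set; `Species` plays the role of the label.
# initial_split() shuffles the rows, and `strata` keeps the label
# proportions roughly equal in the two subsets.
toy_split <- initial_split(iris, prop = 0.75, strata = Species)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Compare the label proportions in the two subsets.
print(prop.table(table(toy_train$Species)))
print(prop.table(table(toy_test$Species)))
```

Only `toy_train` would then be used to fit the model; `toy_test` is held back until evaluation, in keeping with the golden rule above.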
{
"cell_type": "markdown",
"id": "3a604659-c9c4-4b17-9f1a-8c7f5cb8d3a8",
"metadata": {},
"source": [
"## 1.2 K-nearest neighbours classification\n",
"## 1.3 K-nearest neighbours classification\n",
"\n",
"*Predict the label / class for a new observation using the K closest points from our dataset.*\n"
]
@@ -215,7 +252,7 @@
"id": "5de05126-4608-4a1b-8f3b-dcedb36821d0",
"metadata": {},
"source": [
"## 1.3 Standardizing Data\n",
"## 1.4 Standardizing Data\n",
"<center><img src=\"img/scaling_example1.png\" width=\"600\"/></center>"
]
},
@@ -287,7 +324,7 @@
"id": "e27ffaac-1825-407a-8bd5-2dbc29006566",
"metadata": {},
"source": [
"## 1.4 `tidymodels` package in R\n",
"## 1.5 `tidymodels` package in R\n",
"\n",
"`tidymodels` is a collection of packages that handles computing distances, standardization, balancing, and prediction for us!\n",
"\n",
@@ -659,7 +696,7 @@
"id": "8a5c9ef2-9bb3-4f3c-a1cc-4fcd15728997",
"metadata": {},
"source": [
"## 1.5 How to measure classifier performance?\n",
"## 1.6 How to measure classifier performance?\n",
"<br>\n",
"\n",
"### Accuracy\n",
@@ -805,57 +842,6 @@
"We'll now talk about these steps below."
]
},
{
"cell_type": "markdown",
"id": "4d1ad8c2-49ed-44f0-84f3-3948d5fd1408",
"metadata": {
"tags": []
},
"source": [
"## 1.6 Training and Test Sets\n",
"\n",
"When building a machine learning model, typically we start by dividing our data into two sets:\n",
"1) Training set\n",
"2) Test set\n",
" \n",
"The **training set** is the subset of our data that is used to train (teach) our model to perform some sort of predictive task. Then, using our **test set**, we can evaluate how well our model performs on unseen data. \n"
]
},
{
"cell_type": "markdown",
"id": "905177de-cdd2-46dc-b468-08dfdf6c0ad3",
"metadata": {},
"source": [
"There are two important things to do when splitting data.\n",
"\n",
"1. **Shuffling:** randomly reorder the data before splitting\n",
"2. **Stratification:** make sure the two split subsets of data have roughly equal proportions of the different labels\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "59937715-de40-404f-b9f7-40249fc10dc9",
"metadata": {},
"source": [
"\n",
"<center>\n",
"<img src=\"https://datasciencebook.ca/img/classification2/training_test.png\" width=\"1100\"/>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"id": "278edce1-f89a-4fb3-835e-9ce6cb077c7d",
"metadata": {},
"source": [
"#### Golden Rule of Machine Learning / Statistics:\n",
"\n",
"**Don't use your testing data to train your model!**\n",
"\n",
"Showing your classifier the labels of evaluation data is like cheating on a test; it'll look more accurate than it really is."
]
},
{
"cell_type": "markdown",
"id": "cf171c32-67c8-474a-9e8d-25f42b8aee43",
@@ -1536,7 +1522,113 @@
"source": [
"#### Question\n",
"\n",
"What are the precision and recall for the classifier on the test data?"
"What are the precision and recall for the classifier on the test data?\n",
"\n",
"We can use the `precision()` and `recall()` functions from `tidymodels`."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "4da417d1-5f54-4332-95c1-b3f36e97eeeb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"dataframe\">\n",
"<caption>A tibble: 1 × 3</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>.metric</th><th scope=col>.estimator</th><th scope=col>.estimate</th></tr>\n",
"\t<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>precision</td><td>binary</td><td>0.84</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"A tibble: 1 × 3\n",
"\\begin{tabular}{lll}\n",
" .metric & .estimator & .estimate\\\\\n",
" <chr> & <chr> & <dbl>\\\\\n",
"\\hline\n",
"\t precision & binary & 0.84\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 1 × 3\n",
"\n",
"| .metric &lt;chr&gt; | .estimator &lt;chr&gt; | .estimate &lt;dbl&gt; |\n",
"|---|---|---|\n",
"| precision | binary | 0.84 |\n",
"\n"
],
"text/plain": [
" .metric .estimator .estimate\n",
"1 precision binary 0.84 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tumor_test_predictions |>\n",
" precision(truth = Class, estimate = .pred_class, event_level = \"first\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "29bc5624-c9a5-4cbe-bcb2-b78a9645b537",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"dataframe\">\n",
"<caption>A tibble: 1 × 3</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>.metric</th><th scope=col>.estimator</th><th scope=col>.estimate</th></tr>\n",
"\t<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>recall</td><td>binary</td><td>0.7924528</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"A tibble: 1 × 3\n",
"\\begin{tabular}{lll}\n",
" .metric & .estimator & .estimate\\\\\n",
" <chr> & <chr> & <dbl>\\\\\n",
"\\hline\n",
"\t recall & binary & 0.7924528\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 1 × 3\n",
"\n",
"| .metric &lt;chr&gt; | .estimator &lt;chr&gt; | .estimate &lt;dbl&gt; |\n",
"|---|---|---|\n",
"| recall | binary | 0.7924528 |\n",
"\n"
],
"text/plain": [
" .metric .estimator .estimate\n",
"1 recall binary 0.7924528"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tumor_test_predictions |>\n",
" recall(truth = Class, estimate = .pred_class, event_level = \"first\")"
]
},
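As a rough check on what these two estimates mean, the sketch below recomputes them by hand in base R. The counts `tp`, `fp`, and `fn` are assumptions: they were picked only because they reproduce the values reported above (0.84 and 0.7924528), and the actual confusion matrix is not shown in this diff.

```r
# Hypothetical counts, chosen only because they reproduce the reported
# estimates; the real confusion matrix is not part of this diff.
tp <- 42  # positive class predicted, and truly positive
fp <- 8   # positive class predicted, but truly negative
fn <- 11  # negative class predicted, but truly positive

# Precision: of everything predicted positive, what fraction was correct?
precision_est <- tp / (tp + fp)

# Recall: of everything truly positive, what fraction did we find?
recall_est <- tp / (tp + fn)

print(round(precision_est, 7))
print(round(recall_est, 7))
```

Which class counts as "positive" is what `event_level = "first"` controls in the calls above: the first factor level of `Class` is treated as the event of interest.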
{
@@ -1567,7 +1659,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 20,
"id": "cff7e157-d2b7-451e-8613-deabd3fdf6ce",
"metadata": {
"tags": []
@@ -1658,7 +1750,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 21,
"id": "21f1e6a1-d095-4278-92fe-3792a58d6273",
"metadata": {
"tags": []
@@ -1855,7 +1947,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 22,
"id": "f48976ac-de94-43b2-9027-50a3516b31f8",
"metadata": {
"tags": []
@@ -1933,7 +2025,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 23,
"id": "63263f8f-e9cc-4992-a5dc-ee6ae0fdae9f",
"metadata": {
"tags": []
@@ -2005,7 +2097,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 24,
"id": "dc937aa9-49e9-419f-99c9-c33c6f967f20",
"metadata": {
"tags": []
@@ -2075,7 +2167,7 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 25,
"id": "8cb5c51a-15b9-4984-9b9b-8375a5d938d7",
"metadata": {
"tags": []
10 changes: 5 additions & 5 deletions worksheets/.ipynb_checkpoints/worksheet3-checkpoint.ipynb
@@ -6,6 +6,8 @@
"metadata": {},
"source": [
"# Worksheet 3. Machine Learning\n",
"\n",
"Before getting started with the worksheet, please take a few minutes to fill out [this survey](https://forms.gle/z4JFsgJkhbiQ8PQZ9) about the camp. Thank you! \n",
"## Girls in Data Science Camp\n",
"\n",
"In this worksheet, we will be working with the [Nutrition Facts for McDonald's Menu](https://www.kaggle.com/datasets/mcdonalds/nutrition-facts?resource=download) data, which contains information about items sold at McDonald's across 9 food categories. \n",
@@ -278,9 +280,7 @@
"metadata": {},
"source": [
"#### 1.2 Removing Unwanted Columns\n",
"As discussed, let's exclude the `Item` and `Serving Size` columns from the dataset for the upcoming activities. Additionally, we will convert the `Category` column to the _factor_ class. \n",
"\n",
"Explain the distinctions between the `factor` and `character` classes in R, and illustrate the advantages of converting the `Category` column into a `factor`."
"As discussed, let's exclude the `Item` and `Serving Size` columns from the dataset for the upcoming activities. Additionally, we will convert the `Category` column to the _factor_ class (you can use `mutate(Category = as_factor(Category))`)."
]
},
{
@@ -452,11 +452,11 @@
"id": "3c6ce735",
"metadata": {},
"source": [
"#### 2.3 Fit the Data into the Workflow\n",
"#### Fit the Data into the Workflow\n",
"Let's combine the recipe and model we previously defined to fit our `mcdonalds_train` data into a workflow. Follow these steps:\n",
"1. Initialize the workflow.\n",
"2. Specify the recipe.\n",
"3. Incloude the model.\n",
"3. Include the model.\n",
"4. Fit the training data."
]
},
@@ -1 +1 @@
{"data":{"layout-restorer:data":{"main":{"dock":{"type":"tab-area","currentIndex":0,"widgets":["notebook:worksheet3.ipynb"]},"current":"notebook:worksheet2.ipynb"},"down":{"size":0,"widgets":[]},"left":{"collapsed":true,"visible":false,"widgets":["filebrowser","running-sessions","@jupyterlab/toc:plugin","extensionmanager.main-view"],"widgetStates":{"jp-running-sessions":{"sizes":[0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666],"expansionStates":[false,false,false,false,false,false]},"extensionmanager.main-view":{"sizes":[0.3333333333333333,0.3333333333333333,0.3333333333333333],"expansionStates":[false,false,false]}}},"right":{"collapsed":true,"visible":false,"widgets":["jp-property-inspector","debugger-sidebar"],"widgetStates":{"jp-debugger-sidebar":{"sizes":[0.2,0.2,0.2,0.2,0.2],"expansionStates":[false,false,false,false,false]}}},"relativeSizes":[0,1,0],"top":{"simpleVisibility":true}},"notebook:worksheet3.ipynb":{"data":{"path":"worksheet3.ipynb","factory":"Notebook"}},"notebook:worksheet1.ipynb":{"data":{"path":"worksheet1.ipynb","factory":"Notebook"}},"docmanager:recents":{"opened":[{"path":"","contentType":"directory","root":"~/Desktop/MDS/girls-in-DS/worksheets"},{"path":"worksheet2.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/worksheets"},{"path":"worksheet1.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/worksheets"},{"path":"worksheet3.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/worksheets"}],"closed":[]},"notebook:worksheet2.ipynb":{"data":{"path":"worksheet2.ipynb","factory":"Notebook"}}},"metadata":{"id":"default"}}
{"data":{"layout-restorer:data":{"main":{"dock":{"type":"tab-area","currentIndex":0,"widgets":["notebook:worksheet3.ipynb"]},"current":"notebook:worksheet3.ipynb"},"down":{"size":0,"widgets":[]},"left":{"collapsed":true,"visible":false,"widgets":["filebrowser","running-sessions","@jupyterlab/toc:plugin","extensionmanager.main-view"],"widgetStates":{"jp-running-sessions":{"sizes":[0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666],"expansionStates":[false,false,false,false,false,false]},"extensionmanager.main-view":{"sizes":[0.3333333333333333,0.3333333333333333,0.3333333333333333],"expansionStates":[false,false,false]}}},"right":{"collapsed":true,"visible":false,"widgets":["jp-property-inspector","debugger-sidebar"],"widgetStates":{"jp-debugger-sidebar":{"sizes":[0.2,0.2,0.2,0.2,0.2],"expansionStates":[false,false,false,false,false]}}},"relativeSizes":[0,1,0],"top":{"simpleVisibility":true}},"notebook:worksheet3.ipynb":{"data":{"path":"worksheet3.ipynb","factory":"Notebook"}},"notebook:worksheet1.ipynb":{"data":{"path":"worksheet1.ipynb","factory":"Notebook"}},"docmanager:recents":{"opened":[{"path":"","contentType":"directory","root":"~/Desktop/MDS/girls-in-DS/worksheets"},{"path":"worksheet3.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/worksheets"},{"path":"worksheet2.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/worksheets"},{"path":"worksheet1.ipynb","contentType":"notebook","factory":"Notebook","root":"~/Desktop/MDS/girls-in-DS/worksheets"}],"closed":[]},"notebook:worksheet2.ipynb":{"data":{"path":"worksheet2.ipynb","factory":"Notebook"}}},"metadata":{"id":"default"}}
8 changes: 3 additions & 5 deletions worksheets/worksheet3.ipynb
@@ -280,9 +280,7 @@
"metadata": {},
"source": [
"#### 1.2 Removing Unwanted Columns\n",
"As discussed, let's exclude the `Item` and `Serving Size` columns from the dataset for the upcoming activities. Additionally, we will convert the `Category` column to the _factor_ class. \n",
"\n",
"Explain the distinctions between the `factor` and `character` classes in R, and illustrate the advantages of converting the `Category` column into a `factor`."
"As discussed, let's exclude the `Item` and `Serving Size` columns from the dataset for the upcoming activities. Additionally, we will convert the `Category` column to the _factor_ class (you can use `mutate(Category = as_factor(Category))`)."
]
},
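The column removal and factor conversion described in this cell could look roughly like the sketch below. It assumes `dplyr` and `forcats` are installed; `toy_menu` and its values are a made-up three-row stand-in, not the real McDonald's data.

```r
library(dplyr)
library(forcats)  # provides as_factor()

# Hypothetical three-row stand-in for the McDonald's data (values are made up).
toy_menu <- tibble(
  Item = c("item_a", "item_b", "item_c"),
  `Serving Size` = c("size_a", "size_b", "size_c"),
  Category = c("Breakfast", "Salads", "Breakfast"),
  Calories = c(300, 20, 450)
)

toy_clean <- toy_menu |>
  select(-Item, -`Serving Size`) |>       # drop the two unwanted columns
  mutate(Category = as_factor(Category))  # character -> factor

# as_factor() orders the levels by first appearance in the data.
print(levels(toy_clean$Category))
```

Backticks are needed around `Serving Size` because the column name contains a space.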
{
@@ -454,11 +452,11 @@
"id": "3c6ce735",
"metadata": {},
"source": [
"#### 2.3 Fit the Data into the Workflow\n",
"#### Fit the Data into the Workflow\n",
"Let's combine the recipe and model we previously defined to fit our `mcdonalds_train` data into a workflow. Follow these steps:\n",
"1. Initialize the workflow.\n",
"2. Specify the recipe.\n",
"3. Incloude the model.\n",
"3. Include the model.\n",
"4. Fit the training data."
]
},
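The four steps listed in this cell can be sketched as follows. This is an illustration under several assumptions: the `tidymodels` and `kknn` packages are installed, `iris` stands in for `mcdonalds_train` (which is not shown in this diff), and the recipe and K-nearest neighbours specification (`toy_recipe`, `toy_spec`) are hypothetical substitutes for the ones defined earlier in the worksheet.

```r
library(tidymodels)

# Stand-in training data; the worksheet's `mcdonalds_train` is not shown here.
toy_train <- iris

# Hypothetical recipe: predict the label from all other columns,
# standardizing every predictor.
toy_recipe <- recipe(Species ~ ., data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Hypothetical model specification: K-nearest neighbours with K = 5.
toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

toy_fit <- workflow() |>      # 1. initialize the workflow
  add_recipe(toy_recipe) |>   # 2. specify the recipe
  add_model(toy_spec) |>      # 3. include the model
  fit(data = toy_train)       # 4. fit the training data

# The fitted workflow can then be used for prediction.
toy_preds <- predict(toy_fit, head(toy_train))
print(toy_preds)
```

Bundling the recipe and the model in one workflow means the same preprocessing is applied automatically whenever the fitted object is used to predict on new data.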