Commit b39afd5

Updated notebooks up to and including 2.2.*
1 parent c79004e commit b39afd5

13 files changed: +245 −58 lines

Diff for: notebooks/ready_for_review/Module 2.1.1 - Hypothesis Testing.ipynb

+21 −14
@@ -24,7 +24,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "(Above comic from https://xkcd.com/882/ Hint: count how many \"per-colour\" experiements were performed.)\n",
+ "(Above comic from https://xkcd.com/882/ Hint: count how many \"per-colour\" experiments were performed.)\n",
  "\n",
  "A common hypothesis pair is the following:\n",
  "\n",
@@ -117,7 +117,7 @@
  "\n",
  "The Alternative hypothesis is that our intervention caused some change. For instance, the new medicine reduces illness. Sales increased significantly from the new strategy. The sample is different from the population.\n",
  "\n",
- "Normally we are interested in computing some statistic and then identifying what the likelihood is of that statistic having occured by chance, based on our assumptions.\n",
+ "Normally we are interested in computing some statistic and then identifying what the likelihood is of that statistic having occurred by chance, based on our assumptions.\n",
  "\n",
  "A commonly used method here is a p-test, where we are testing if a mean is different (>, <, $\\neq$) to the population mean, when we assume the mean from a sample is otherwise drawn from a normal distribution. For instance, if we roll 100 dice, we get an expected value of 350, and a normal distribution of results centred around this value:\n"
  ]
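The p-test paragraph in the hunk above uses the example of rolling 100 dice, with an expected total of 350 and an approximately normal spread of outcomes. As a quick illustration (not code from the notebook), a minimal NumPy simulation of that claim might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate rolling 100 six-sided dice, many times, and record the sum of each trial.
n_trials = 100_000
rolls = rng.integers(1, 7, size=(n_trials, 100))  # values 1..6
totals = rolls.sum(axis=1)

# The sample mean sits near the expected value of 100 * 3.5 = 350, and the totals
# are approximately normally distributed (Central Limit Theorem).
print(totals.mean())        # ~350
print(totals.std(ddof=1))   # ~sqrt(100 * 35/12) ≈ 17.1
```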
@@ -300,17 +300,24 @@
  "\n",
  "#### Exercise\n",
  "\n",
- "1. Research the \"Multiple comparisons problem\" and identify a way to fix our hypothesis. We still want to check if any day has a significant value for our hypothesis, but we want to do it in a rigourous way.\n",
+ "1. Research the \"Multiple comparisons problem\" and identify a way to fix our hypothesis. We still want to check if any day has a significant value for our hypothesis, but we want to do it in a rigorous way.\n",
  "2. Does our finding hold after adjusting? The solution uses one specific method of fixing the thresholds - if you choose another, then you may get another answer."
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "*For solutions, see `solutions/multiple_comparisons.py`*"
+ ]
+ },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
  "### Simulations\n",
  "\n",
- "A topic we will get into in more detail later, but a useful one to touch on here, is the use of simulations for computing p values. When testing a hypothesis, you can use a simulation of your null hypothesis, and then with that simulation, estimate the likelihood of finding like your sample. For instance:\n",
+ "A topic we will get into in more detail later, but a useful one to touch on here, is the use of simulations for computing p values. When testing a hypothesis, you can use a simulation of your null hypothesis, and then with that simulation, estimate the likelihood of findings like your sample. For instance:\n",
  "\n",
  "$H_0$: The AUD/USD change is a random walk (that is, there is no pattern)\n",
  "\n",
@@ -344,6 +351,13 @@
  "3. Compute the p value and determine whether to accept or reject the null hypothesis."
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "*For solutions, see `solutions/hypothesis_two.py`*"
+ ]
+ },
  {
  "cell_type": "markdown",
  "metadata": {},
@@ -384,7 +398,7 @@
  "\n",
  "### Are two samples equal?\n",
  "\n",
- "These test test that, given two samples, they are effectively equal (i.e. they came from the same distribution):\n",
+ "These tests check whether, given two samples, they are effectively equal (i.e. they came from the same distribution):\n",
  "\n",
  "* Student's t-test, as identified earlier, implemented in quite a few methods in both scipy and statsmodels\n",
  "* Analysis of Variance Test (ANOVA), `scipy.stats.f_oneway` and `statsmodels.api.stats.anova_lm` (among a few other ways to call it).\n",
@@ -401,7 +415,7 @@
  "source": [
  "#### Extended Exercise\n",
  "\n",
- "Using a simluation, create your own function that can compute the t-test and p values for a Student's t-test.\n",
+ "Using a simulation, create your own function that can compute the t statistic and p value for a Student's t-test.\n",
  "\n",
  "For a comparison of two independent samples (i.e. \"here are two samples, do they come from the same distribution?\"), the t value is computed as:\n",
  "\n",
@@ -434,13 +448,6 @@
  "source": [
  "*For solutions, see `solutions/simulation_ttest.py`*"
  ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
  }
  ],
  "metadata": {
@@ -459,7 +466,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.6.8"
+ "version": "3.7.2"
  }
  },
  "nbformat": 4,

Diff for: notebooks/ready_for_review/Module 2.2.1 - Linear regression models.ipynb

+19 −12
@@ -57,7 +57,7 @@
  "\n",
  "If Melbourne has a temperature of 15°C, and Sydney has a temperature of 30°C, we can say that Sydney has a greater temperature than Melbourne. We **cannot** say that Sydney is twice as hot as Melbourne. \n",
  "\n",
- "While this would make intuitive sense in lay speak, it actually doesn't mean anything. The reason? The units matter. If we measure in Fahrenheit, we get Melbourne's temperature ast 59°F and Sydney's as 86°F, which is no longer twice, despite the underlying facts not changing.\n",
+ "While this would make intuitive sense in lay speak, it actually doesn't mean anything. The reason? The units matter. If we measure in Fahrenheit, we get Melbourne's temperature as 59°F and Sydney's as 86°F, which is no longer twice, despite the underlying facts not changing.\n",
  "\n",
  "Interval data is still very useful, and you can normally just deal with it as a ratio value. This is especially true with practical data where the data sits in a normal range.\n"
  ]
@@ -109,11 +109,11 @@
  "source": [
  "### Nominal data (Categorical)\n",
  "\n",
- "**Nominal** data, also called Categorical or Labels, are unordered categories of data. For instance, if I asked if your pet was a Dog or a Cat, there isn't a \"greater\" of these values (one could argue with this point, but not mathemtically).\n",
+ "**Nominal** data, also called Categorical or Labels, are unordered categories of data. For instance, if I asked if your pet was a Dog or a Cat, there isn't a \"greater\" of these values (one could argue with this point, but not mathematically).\n",
  "\n",
  "A common type of data here is Gender. Options could include \"Female\", \"Male\" and \"Other\". They could be presented in an order, and could even get numbered labels for storing, but you can't compute the average. You *can* compute the ratio of the total (e.g. 53% Female, 46% Male, 1% Other), but it would make no sense to say the average customer is 53% Female. If anything, that sends a very different message!\n",
  "\n",
- "A two-option nominal value is also called a dichtomous variable. This could be True/False, Correct/Incorrect, Married/Not Married, and lots of data data of the form X, not X."
+ "A two-option nominal value is also called a dichotomous variable. This could be True/False, Correct/Incorrect, Married/Not Married, and lots of data of the form X, not X."
  ]
  },
  {
@@ -143,19 +143,19 @@
  "source": [
  "### Dealing with nominal and ordinal data types\n",
  "\n",
- "In statistics and programming, the programs we write often assume data is at least Interval, and perhaps Ratio. We freely compute the mean of data (even if that doesn't make sense) and report the result. One of the key problems here is that we often use numbers to encode varaibles. For instance, in our gender question, we might assign the following numbers:\n",
+ "In statistics and programming, the programs we write often assume data is at least Interval, and perhaps Ratio. We freely compute the mean of data (even if that doesn't make sense) and report the result. One of the key problems here is that we often use numbers to encode variables. For instance, in our gender question, we might assign the following numbers:\n",
  "\n",
  "* Male: 0\n",
  "* Female: 1\n",
  "* Other: 2\n",
  "\n",
  "Therefore, our data looks like this: `gender = [0, 0, 1, 1, 1, 0, 2]`, indicating Male, Male, Female, Female, and so on. This means that technically we can compute `np.mean(gender)`, and that operation will work. Further, we could fit an OLS model on this data, and that will probably also show something.\n",
  "\n",
- "To better encode the gender variable, a one-hot encoding is normally recommended. This expands the data into multiple varaibles of the form \"Is Male?\", \"Is Female?\" and \"Is Other?\". Only, and exactly, one of these three will be 1, and the others will be always 0. This turns this into a dichtomous nominal varaible, which can actually be used as a ratio!\n",
+ "To better encode the gender variable, a one-hot encoding is normally recommended. This expands the data into multiple variables of the form \"Is Male?\", \"Is Female?\" and \"Is Other?\". Only, and exactly, one of these three will be 1, and the others will always be 0. This turns this into a dichotomous nominal variable, which can actually be used as a ratio!\n",
  "\n",
  "\n",
  "<div class=\"alert alert-warning\">\n",
- " Gender is <i>nearly</i> dichtomous for the general popuation - most people identify as either male or female. You can do things with dichtomous variables like compute the ratio from the mean, but if you do, you will run the risk of your results being meaningless in some samples.\n",
+ " Gender is <i>nearly</i> dichotomous for the general population - most people identify as either male or female. You can do things with dichotomous variables like compute the ratio from the mean, but if you do, you will run the risk of your results being meaningless in some samples.\n",
  "</div>\n",
  "\n",
  "\n",
@@ -177,6 +177,13 @@
  "1. Load some data that contains ordinal and nominal variables (e.g. Boston house prices). Convert the ordinal and nominal variables to encoded forms using the above scikit-learn classes."
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "*For solutions see `solutions/ordinal_encoding.py`*"
+ ]
+ },
  {
  "cell_type": "markdown",
  "metadata": {},
@@ -214,7 +221,7 @@
  "\n",
  "To date, these modules have been a little slack with terminology relating to samples versus populations. We will address that here before we continue further.\n",
  "\n",
- "If we remember our equation for our LInear Regression model, it has been presented so far like this:\n",
+ "If we remember our equation for our Linear Regression model, it has been presented so far like this:\n",
  "\n",
  "$Y = X \\beta + u$\n",
  "\n",
@@ -256,28 +263,28 @@
  "\n",
  "We will finish this module with a mid-point revision of some of the basics covered so far, but specifically with a view to prediction in the Linear Regression space.\n",
  "\n",
- "The most basic, reasonable, prediction algorithm is simply to predict the mean of your sample of data. This is the best estimator if you have just one variable. i.e. if we have *just* the heights for lots of dogs, and want to do the best guess fo the next dog's height, we just use the mean. This is mathematically proven as the best estimate you can get - known as the *expected value*.\n",
+ "The most basic, reasonable, prediction algorithm is simply to predict the mean of your sample of data. This is the best estimator if you have just one variable. i.e. if we have *just* the heights for lots of dogs, and want to do the best guess of the next dog's height, we just use the mean. This is mathematically proven as the best estimate you can get - known as the *expected value*.\n",
  "\n",
  "Our model is therefore of the form:\n",
  "\n",
  "$\\hat{y} = \\bar{x} + u$\n",
  "\n",
- "Where $\\bar{x}$ is the mean of $X$ and $u$ is the errors, the residuals. These residuals are the difference between the prediction and the actual value. You'll often hear this refered to as a measure of the \"Goodness of fit\".\n",
+ "Where $\\bar{x}$ is the mean of $X$ and $u$ is the errors, the residuals. These residuals are the difference between the prediction and the actual value. You'll often hear this referred to as a measure of the \"Goodness of fit\".\n",
  "\n",
  "The sum of squared error (SSE) is a commonly used metric in comparing two models to determine which is better. It is the sum of the squares of these residuals, therefore:\n",
  "\n",
  "$SSE = \\sum{u^2}$\n",
  "\n",
  "For a single set of values, the mean is the value that minimises the SSE. This is why, according to this model, the mean is the best predictor if we have no other data.\n",
  "\n",
- "The goal of linear regression is to minimise the SSE when we use multiple independent variables to predict a dependent variable. This is only one evaluation metric - there are dozens of commonly used ones, and for financial data we might be more concerned with \"absolute profit\" or some other metric more closely tied to business outcomes. Use the evaluation metric that works for your application. SSE has some nice properties algorithmic properties, for instance, you can compute the gradient at all points, allowing for solving the OLS algorithm quickly to minimise SSE. Not all metrics are as nice.\n",
+ "The goal of linear regression is to minimise the SSE when we use multiple independent variables to predict a dependent variable. This is only one evaluation metric - there are dozens of commonly used ones, and for financial data we might be more concerned with \"absolute profit\" or some other metric more closely tied to business outcomes. Use the evaluation metric that works for your application. SSE has some nice algorithmic properties, for instance, you can compute the gradient at all points, allowing the OLS problem to be solved quickly to minimise SSE. Not all metrics are as nice.\n",
  "\n",
  "When you fit any model to data, and want to evaluate how well it does in practice, you must fit your model on one set of data, and evaluate it on another set of data. This is known as a train/test split. You fit your model on the training data, and evaluate on the testing data. A common split is simply to split randomly 1/3 of the data for testing, and the remaining 2/3 for training. The exact numbers usually don't matter too much. If you are using a time series dataset, you'll need to split by time rather than randomly. Think always about how your model will be used in practice - for price prediction, you want to be predicting tomorrow's price, not some random day in the past. Therefore, your model needs to be evaluated when it's trying to predict data in the future.\n",
  "\n",
  "\n",
  "#### Exercise\n",
  "\n",
- "An issue with train/test splits, is that you must, by definition, lose the learning power of 1/3 of your dataset (or whatever you put into the test split. To address this, use cross-validation.\n",
+ "An issue with train/test splits is that you must, by definition, lose the learning power of 1/3 of your dataset (or whatever you put into the test split). To address this, use cross-validation.\n",
  "\n",
  "Review the documentation for scikit-learn's cross-validation functions at https://scikit-learn.org/stable/modules/cross_validation.html\n",
  "\n",
@@ -310,7 +317,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.6.8"
+ "version": "3.7.2"
  }
  },
  "nbformat": 4,

Diff for: notebooks/ready_for_review/Module 2.2.2 - ARIMA.ipynb

+32 −4
@@ -25,7 +25,7 @@
  "\n",
  "### Moving Average (MA)\n",
  "\n",
- "An Moving Average (MA) model is given as:\n",
+ "A Moving Average (MA) model is given as:\n",
  "\n",
  "$MA(p) X_t = \\mu + \\epsilon_{t} + \\sum_{i=1}^{p}\\theta_i\\epsilon_{t-i}$\n",
  "\n",
@@ -62,6 +62,13 @@
  "You can review module 1.6.4 for code on how to run the ARMA model in statsmodels."
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "*For solutions, see `solutions/arma_cryptocurrency.py`*"
+ ]
+ },
  {
  "cell_type": "markdown",
  "metadata": {},
@@ -105,9 +112,16 @@
  "source": [
  "#### Extended Exercise\n",
  "\n",
- "The ARIMA model is implemented in statsmodels under `statsmodels.tsa.arima_model.ARIMA` with a similar use case to the `ARMA` model previously used. Peform an ARIMA modelling on the cryptocurrency data from the previous exercise.\n",
+ "The ARIMA model is implemented in statsmodels under `statsmodels.tsa.arima_model.ARIMA` with a similar use case to the `ARMA` model previously used. Perform ARIMA modelling on the cryptocurrency data from the previous exercise.\n",
  "\n",
- "Normally, the value for $d$ is determined before running the model, but performing a test of stationarity. See Module 1.6.2 for information on performing these tests. Simply difference the datat, check for stationarity, and if it isn't, difference it again. Values more than 3 are abnormal - if you still aren't getting stationary data at that point, check your assumptions."
+ "Normally, the value for $d$ is determined before running the model, by performing a test of stationarity. See Module 1.6.2 for information on performing these tests. Simply difference the data, check for stationarity, and if it isn't, difference it again. Values more than 3 are abnormal - if you still aren't getting stationary data at that point, check your assumptions."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "*For solutions, see `solutions/arima_cryptocurrency.py`*"
  ]
  },
  {
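The exercise above describes choosing $d$ by repeatedly differencing and testing for stationarity. A minimal sketch of that loop, using the augmented Dickey-Fuller test from statsmodels on a synthetic random walk rather than the cryptocurrency data, could look like this:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)

# Synthetic non-stationary series (a random walk) standing in for the price data.
prices = pd.Series(np.cumsum(rng.normal(size=500)))

def choose_d(series, alpha=0.05, max_d=3):
    """Difference the series until the ADF test rejects a unit root."""
    for d in range(max_d + 1):
        p_value = adfuller(series.dropna())[1]  # element 1 of the result is the p-value
        if p_value < alpha:
            return d
        series = series.diff()
    return max_d  # more than 3 differences is unusual - recheck assumptions

print(choose_d(prices))  # a random walk typically needs d = 1
```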
@@ -676,6 +690,13 @@
  "Compute the SSE value on the predicted values from 2015 onwards. Remember to retrain your model after doing a train/test split before you evaluate!"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "*For solutions, see `solutions/arima_sse.py`*"
+ ]
+ },
  {
  "cell_type": "markdown",
  "metadata": {},
@@ -850,6 +871,13 @@
  "2. Choose a seasonal commodity from Quandl, such as Wheat, and apply a Seasonal ARIMA to the data."
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "*For solutions see `solutions/arima_seasonal.py`*"
+ ]
+ },
  {
  "cell_type": "markdown",
  "metadata": {},
@@ -888,7 +916,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.6.8"
+ "version": "3.7.2"
  }
  },
  "nbformat": 4,
