|
57 | 57 | "\n",
|
58 | 58 | "If Melbourne has a temperature of 15°C, and Sydney has a temperature of 30°C, we can say that Sydney has a greater temperature than Melbourne. We **cannot** say that Sydney is twice as hot at Melbourne. \n",
|
59 | 59 | "\n",
|
60 |
| - "While this would make intuitive sense in lay speak, it actually doesn't mean anything. The reason? The units matter. If we measure in Fahrenheit, we get Melbourne's temperature ast 59°F and Sydney's as 86°F, which is no longer twice, despite the underlying facts not changing.\n", |
| 60 | + "While this would make intuitive sense in lay speak, it actually doesn't mean anything. The reason? The units matter. If we measure in Fahrenheit, we get Melbourne's temperature as 59°F and Sydney's as 86°F, which is no longer twice, despite the underlying facts not changing.\n", |
61 | 61 | "\n",
|
62 | 62 | "Interval data is still very useful, and you can normally just deal with it as a ratio value. This is especially true with practical data where the data sits in a normal range.\n"
|
63 | 63 | ]
|
|
109 | 109 | "source": [
|
110 | 110 | "### Nominal data (Categorical)\n",
|
111 | 111 | "\n",
|
112 |
| - "**Nominal** data, also called Categorical or Labels, are unordered categories of data. For instance, if I asked if your pet was a Dog or a Cat, there isn't a \"greater\" of these values (one could argue with this point, but not mathemtically).\n", |
| 112 | + "**Nominal** data, also called Categorical or Labels, are unordered categories of data. For instance, if I asked if your pet was a Dog or a Cat, there isn't a \"greater\" of these values (one could argue with this point, but not mathematically).\n", |
113 | 113 | "\n",
|
114 | 114 | "A common type of data here is Gender. Options could include \"Female\", \"Male\" and \"Other\". They could be presented in an order, and could even get numbered labels for storing, but you can't compute the average. You *can* compute the ratio of the total (e.g. 53% Female, 46% Male, 1% Other), but it would make no sense to say the average customer is 53% Female. If anything, that sends a very different message!\n",
|
115 | 115 | "\n",
|
116 |
| - "A two-option nominal value is also called a dichtomous variable. This could be True/False, Correct/Incorrect, Married/Not Married, and lots of data data of the form X, not X." |
| 116 | + "A two-option nominal value is also called a dichotomous variable. This could be True/False, Correct/Incorrect, Married/Not Married, and lots of data data of the form X, not X." |
117 | 117 | ]
|
118 | 118 | },
|
119 | 119 | {
|
|
143 | 143 | "source": [
|
144 | 144 | "### Dealing with nominal and ordinal data types\n",
|
145 | 145 | "\n",
|
146 |
| - "In statistics and programming, the programs we write often assume data is at least Interval, and perhaps Ratio. We freely compute the mean of data (even if that doesn't make sense) and report the result. One of the key problems here is that we often use numbers to encode varaibles. For instance, in our gender question, we might assign the following numbers:\n", |
| 146 | + "In statistics and programming, the programs we write often assume data is at least Interval, and perhaps Ratio. We freely compute the mean of data (even if that doesn't make sense) and report the result. One of the key problems here is that we often use numbers to encode variables. For instance, in our gender question, we might assign the following numbers:\n", |
147 | 147 | "\n",
|
148 | 148 | "* Male: 0\n",
|
149 | 149 | "* Female: 1\n",
|
150 | 150 | "* Other: 2\n",
|
151 | 151 | "\n",
|
152 | 152 | "Therefore, our data looks like this: `gender = [0, 0, 1, 1, 1, 0, 2]`, indicating Male, Male, Female, Female, and so on. This means that technically we can compute `np.mean(gender)`, and that operation will work. Further, we could fit a OLS model on this data, and that will probably also show something.\n",
|
153 | 153 | "\n",
|
154 |
| - "To better encode the gender variable, a one-hot encoding is normally recommended. This expands the data into multiple varaibles of the form \"Is Male?\", \"Is Female?\" and \"Is Other?\". Only, and exactly, one of these three will be 1, and the others will be always 0. This turns this into a dichtomous nominal varaible, which can actually be used as a ratio!\n", |
| 154 | + "To better encode the gender variable, a one-hot encoding is normally recommended. This expands the data into multiple variables of the form \"Is Male?\", \"Is Female?\" and \"Is Other?\". Only, and exactly, one of these three will be 1, and the others will be always 0. This turns this into a dichotomous nominal variable, which can actually be used as a ratio!\n", |
155 | 155 | "\n",
|
156 | 156 | "\n",
|
157 | 157 | "<div class=\"alert alert-warning\">\n",
|
158 |
| - " Gender is <i>nearly</i> dichtomous for the general popuation - most people identify as either male or female. You can do things with dichtomous variables like compute the ratio from the mean, but if you do, you will run the risk of your results being meaningless in some samples.\n", |
| 158 | + " Gender is <i>nearly</i> dichotomous for the general population - most people identify as either male or female. You can do things with dichotomous variables like compute the ratio from the mean, but if you do, you will run the risk of your results being meaningless in some samples.\n", |
159 | 159 | "</div>\n",
|
160 | 160 | "\n",
|
161 | 161 | "\n",
|
|
177 | 177 | "1. Load some data that contains ordinal and nominal variables (e.g. Boston house prices). Convert the ordinal and nominal variables to encoded forms using the above scikit-learn classes."
|
178 | 178 | ]
|
179 | 179 | },
|
| 180 | + { |
| 181 | + "cell_type": "markdown", |
| 182 | + "metadata": {}, |
| 183 | + "source": [ |
| 184 | + "*For solutions see `solutions/ordinal_encoding.py`*" |
| 185 | + ] |
| 186 | + }, |
180 | 187 | {
|
181 | 188 | "cell_type": "markdown",
|
182 | 189 | "metadata": {},
|
|
214 | 221 | "\n",
|
215 | 222 | "To date, these modules have been a little slack with terminology relating to samples versus populations. We will address that here before we continue further.\n",
|
216 | 223 | "\n",
|
217 |
| - "If we remember our equation for our LInear Regression model, it has been presented so far like this:\n", |
| 224 | + "If we remember our equation for our Linear Regression model, it has been presented so far like this:\n", |
218 | 225 | "\n",
|
219 | 226 | "$Y = X \\beta + u$\n",
|
220 | 227 | "\n",
|
|
256 | 263 | "\n",
|
257 | 264 | "We will finish this module with a mid-point revision of some of the basics covered so far, but specifically with a view to prediction in the Linear Regression space.\n",
|
258 | 265 | "\n",
|
259 |
| - "The most basic, reasonable, prediction algorithm is simply to predict the mean of your sample of data. This is the best estimator if you have just one variable. i.e. if we have *just* the heights for lots of dogs, and want to do the best guess fo the next dog's height, we just use the mean. This is mathematically proven as the best estimate you can get - known as the *expected value*.\n", |
| 266 | + "The most basic, reasonable, prediction algorithm is simply to predict the mean of your sample of data. This is the best estimator if you have just one variable. i.e. if we have *just* the heights for lots of dogs, and want to do the best guess of the next dog's height, we just use the mean. This is mathematically proven as the best estimate you can get - known as the *expected value*.\n", |
260 | 267 | "\n",
|
261 | 268 | "Our model is therefore of the form:\n",
|
262 | 269 | "\n",
|
263 | 270 | "$\\hat{y} = \\bar{x} + u$\n",
|
264 | 271 | "\n",
|
265 |
| - "Where $\\bar{x}$ is the mean of $X$ and $u$ is the errors, the residuals. These residuals are the difference between the prediction and the actual value. You'll often hear this refered to as a measure of the \"Goodness of fit\".\n", |
| 272 | + "Where $\\bar{x}$ is the mean of $X$ and $u$ is the errors, the residuals. These residuals are the difference between the prediction and the actual value. You'll often hear this referred to as a measure of the \"Goodness of fit\".\n", |
266 | 273 | "\n",
|
267 | 274 | "The sum of squared error (SSE) is a commonly used metric in comparing two models to determine which is better. It is the sum of the squared of these residuals, therefore:\n",
|
268 | 275 | "\n",
|
269 | 276 | "$SSE = \\sum{u^2}$\n",
|
270 | 277 | "\n",
|
271 | 278 | "For a single set of values, the mean is the value that minimises the SSE in a single row of data. This is why, according to this model, the mean is the best predictor if we have no other data.\n",
|
272 | 279 | "\n",
|
273 |
| - "The goal of linear regression is to minimise the SSE when we use multiple independent variables to predict a dependent variable. This is only one evaluation metric - there are dozens of commonly used ones, and for financial data we might be more concerned with \"absolute profit\" or some other metric more closely tied to business outcomes. Use the evaluation metric that works for your application. SSE has some nice properties algorithmic properties, for instance, you can compute the gradient at all points, allowing for solving the OLS algorithm quickly to minimise SSE. Not all metrics are as nice.\n", |
| 280 | + "The goal of linear regression is to minimise the SSE when we use multiple independent variables to predict a dependent variable. This is only one evaluation metric - there are dozens of commonly used ones, and for financial data we might be more concerned with \"absolute profit\" or some other metric more closely tied to business outcomes. Use the evaluation metric that works for your application. SSE has some nice algorithmic properties, for instance, you can compute the gradient at all points, allowing for solving the OLS algorithm quickly to minimise SSE. Not all metrics are as nice.\n", |
274 | 281 | "\n",
|
275 | 282 | "When you fit any model to data, and want to evaluate how well it does in practice, you must fit your model on one set of data, and evaluate it on another set of data. This is known as a train/test split. You fit your data on the training data, and evaluate on the testing data. A common split is simply to split randomly 1/3 of the data for testing, and the remaining 2/3 for training. The exact numbers usually don't matter too much. If you are using a time series dataset, you'll need to split by time rather than randomly. Think always about how your model will be used in practice - for price prediction, you want to be predicting tomorrow's price, not some random day in the past. Therefore, your model needs to be evaluated when it's trying to predict data in the future.\n",
|
276 | 283 | "\n",
|
277 | 284 | "\n",
|
278 | 285 | "#### Exercise\n",
|
279 | 286 | "\n",
|
280 |
| - "An issue with train/test splits, is that you must, by definition, lose the learning power of 1/3 of your dataset (or whatever you put into the test split. To address this, use cross-validation.\n", |
| 287 | + "An issue with train/test splits, is that you must, by definition, lose the learning power of 1/3 of your dataset (or whatever you put into the test split). To address this, use cross-validation.\n", |
281 | 288 | "\n",
|
282 | 289 | "Review the documentation for scikit-learn's cross-validation functions at https://scikit-learn.org/stable/modules/cross_validation.html\n",
|
283 | 290 | "\n",
|
|
310 | 317 | "name": "python",
|
311 | 318 | "nbconvert_exporter": "python",
|
312 | 319 | "pygments_lexer": "ipython3",
|
313 |
| - "version": "3.6.8" |
| 320 | + "version": "3.7.2" |
314 | 321 | }
|
315 | 322 | },
|
316 | 323 | "nbformat": 4,
|
|