Simple Machine Learning - review (chapter 21, Didi Ooi) #112

Open
mycarta opened this issue Jun 13, 2020 · 0 comments

mycarta commented Jun 13, 2020

Overall a nicely written chapter. I like the style, the structure, and the objectives, which I think are met.
However, I have a few comments on the Machine Learning specifics; see below, organized by section. I may call on others to help out. Ultimately the chapter may need further work from the author.

1. Understand each variable independently
About determining normality: I recently had an in-depth discussion with a friend (a statistician) about this, because I was confused by contradictory recommendations in this regard. He assured me there are no distributional assumptions on the predictors, only on the dependent variable, so this needs to be clarified. See the sketch below.
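To make the distinction concrete, here is a minimal sketch on synthetic data (scipy assumed; not from the chapter) of checking normality on the dependent variable only:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a target variable; the point is that a normality
# check, if done at all, applies to the dependent variable (or, in
# regression, to the residuals), not to each predictor independently.
rng = np.random.default_rng(42)
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)

# Shapiro-Wilk test on the target only
stat, p_value = stats.shapiro(y)
print(f"raw target: W = {stat:.3f}, p = {p_value:.4f}")

# A log transform of a skewed target is often enough to address this
stat_log, p_log = stats.shapiro(np.log(y))
print(f"log target: W = {stat_log:.3f}, p = {p_log:.4f}")
```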

2. Feature engineering
All good

3. Understand bivariate relationship
All good

4. Exploit multivariate patterns
I would not use only PCA. I would consider suggesting multiple methods to explore multivariate relationships / variable importance, ideally a combination of model-based and non-model-based ones, and decide based on a majority vote (the variables most methods agree upon). A sketch of this idea follows below.
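As an illustration of the majority-vote idea, here is a minimal sketch on synthetic data with scikit-learn; the particular methods and thresholds are mine, not the chapter's:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression

# Toy data standing in for the chapter's features and target
X, y = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)

k = 3  # how many "important" variables each method gets to vote for

# Method 1 (model-based): random forest impurity importance
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
rf_top = set(np.argsort(rf.feature_importances_)[-k:])

# Method 2 (not model-based): mutual information with the target
mi_top = set(np.argsort(mutual_info_regression(X, y, random_state=0))[-k:])

# Method 3 (not model-based): absolute Pearson correlation with the target
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
corr_top = set(np.argsort(corr)[-k:])

# Majority vote: keep variables that at least two of the three methods flag
votes = {j: sum(j in s for s in (rf_top, mi_top, corr_top)) for j in range(X.shape[1])}
selected = [j for j, v in votes.items() if v >= 2]
print("Selected feature indices:", selected)
```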

5. Train your Machine Learning model
Here we have a recommendation for an 80/20 training / validation split. This needs to be clarified on two levels:

  1. The terminology. It is unclear to me what the author means by validation (for terminology I try to stick to Sebastian Raschka's; see the diagram below):

[Screenshot: Sebastian Raschka's diagram of train/validation/test splitting terminology]

  2. If the intended meaning is just an 80/20 train/test split, like in the first row of the diagram, then it may be OK, although 80/20 is seldom a good generic split. I could be wrong, but I have a sense the author may be referring to the second row, because she mentions training competitive models, in which case this approach would be incorrect. It certainly needs to be clarified. A sketch of the distinction is below.
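For reference, this is the kind of split I have in mind for each of the two rows (a minimal sketch with scikit-learn on toy data; the 60/20/20 proportions are only an example, not a recommendation for the chapter):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data in place of the chapter's dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] - 2 * X[:, 4] + rng.normal(scale=0.1, size=1000)

# First row of the diagram: a single train/test split, no model selection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Second row: when comparing competitive models, carve a validation set out
# of the training data and keep the test set untouched until the very end,
# giving roughly 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```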

6. Prediction!
All good
