diff --git a/.DS_Store b/.DS_Store
index a26c31cb..c0d98df4 100644
Binary files a/.DS_Store and b/.DS_Store differ
diff --git a/English version/README.md b/English version/README.md
new file mode 100644
index 00000000..16f0d5f1
--- /dev/null
+++ b/English version/README.md
@@ -0,0 +1,566 @@
+Translated with Google Translate (corrections welcome)
+
+# 1. Copyright statement
+Please respect the author's intellectual property rights. All rights are reserved, and piracy will be investigated. Forwarding this content without permission is strictly forbidden!
+Please help protect the results of this collective work and report any misuse.
+2018.6.27 TanJiyong
+
+# 2. Overview
+
+This project aims to bring together AI-related knowledge and pool contributors' ideas into a comprehensive collection of articles.
+
+
+# 3. Join and document specifications
+1. We are looking for friends, editors, and writers willing to keep improving this book. If you are interested in collaborating, help improve it and become a co-author.
+2. Every contributor who submits content will be credited in the text with their name and affiliation (e.g., Daxie - West Lake University).
+3. To make the content more complete and well thought out, you are welcome to fork the project and take part in writing it. When you modify an MD file (or send a direct message), please note your name and affiliation (e.g., Dayu - Stanford University). Once your changes are adopted, your information will be shown in the text. Thank you!
+4. We recommend the Typora Markdown reader: https://typora.io/
+
+Settings: under File->Preference, in the Syntax Support section, enable the following items:
+
+- Inline Math
+- Subscript
+- Superscript
+- Highlight
+- Diagrams
+
+Example:
+
+```markdown
+### 3.3.2 How to find the optimal value of the hyperparameter? (Contributor: Daxie - Stanford University)
+
+Machine learning algorithms always come with some hyperparameters that are hard to set, such as the weight decay coefficient or the Gaussian kernel width. The algorithm does not set these parameters itself; you have to choose their values, and the chosen values strongly affect the result. Common practices for setting hyperparameters are:
+
+1. Guess and check: pick parameters based on experience or intuition, and iterate.
+2. Grid search: let the computer try a set of values evenly spaced over a given range.
+3. Random search: let the computer pick a set of values at random.
+4. Bayesian optimization: tune the hyperparameters with Bayesian optimization. The difficulty is that the Bayesian optimization algorithm itself has parameters that need to be set.
+5. Local optimization from a good initial guess: this is the MITIE method, which uses the BOBYQA algorithm with a carefully chosen starting point. Since BOBYQA only looks for the nearest local optimum, the success of this method depends largely on having a good starting point. In the case of MITIE, we know a good starting point, but this is not a universal solution, because usually you won't know where a good starting point is. On the plus side, this approach is well suited to finding local optima. I will discuss this later.
+6. LIPO, a recent global optimization method: it has no parameters of its own and is proven to be better than random search.
+```
+
+# 4. Contributions and Project Overview
+
+Chapters already submitted in MD format: please see MarkDown
+
+
+# 5. More
+
+1. We are looking for friends, editors, and writers willing to keep improving this book. If you are interested in collaborating, help improve it and become a co-author.
+ All contributors who submit content will be credited in the article with their name and affiliation (e.g., Dalong - West Lake University).
+
+2. Contact: Please contact scutjy2015@163.com (the only official email); WeChat Tan:
+
+ (To join the group: once you have added to, improved, and submitted the MD version, it is easier to be admitted. Enjoy sharing knowledge and helping others.)
+
+ To join the "Deep Learning 500 Questions" WeChat group, please add one of these WeChat accounts: Client 1: HQJ199508212176; Client 2: Xuwumin1203; Client 3: tianyuzy
+
+3. Recommended Markdown reader: https://typora.io/ It is free and has good support for mathematical formulas.
+
+4. Note that some people are fraudulently posing as promoters of this project. Please let fellow contributors know!
+
+5. The MD version will be provided next so that everyone can edit it together. Stay tuned! Suggestions and additions are welcome.
+
+
+# 6. Contents
+
+**Chapter 1 Mathematical Foundation**
+
+1.1 The relationship between scalars, vectors, and tensors
+1.2 What is the difference between a tensor and a matrix?
+1.3 Matrix and vector multiplication results
+1.4 Summary of vector and matrix norms
+1.5 How to determine whether a matrix is positive definite?
+1.6 Derivative and partial derivative calculation
+1.7 What is the difference between derivatives and partial derivatives?
+1.8 Eigenvalue decomposition and eigenvectors
+1.9 What is the relationship between singular values and eigenvalues?
+1.10 Why should machine learning use probabilities?
+1.11 What is the difference between a variable and a random variable?
+1.12 Common probability distributions?
+1.13 Understanding conditional probability through an example
+1.14 What is the difference between joint probability and marginal probability?
+1.15 Chain rule of conditional probability
+1.16 Independence and conditional independence
+1.17 Summary of expectation, variance, covariance, and correlation coefficient
+
+**Chapter 2 Fundamentals of Machine Learning**
+
+2.1 Illustrations of common algorithms
+2.2 Supervised learning, unsupervised learning, semi-supervised learning, weakly supervised learning?
+2.3 What are the steps for supervised learning?
+2.4 Multi-instance learning?
+2.5 What is the difference between classification networks and regression?
+2.6 What is a neural network?
+2.7 Advantages and Disadvantages of Common Classification Algorithms?
+2.8 Is accuracy a good way to evaluate classification algorithms?
+2.9 How to evaluate a classification algorithm?
+2.10 What kind of classifier is the best?
+2.11 The relationship between big data and deep learning
+2.12 Understanding Local Optimization and Global Optimization
+2.13 Understanding Logistic Regression
+2.14 What is the difference between logistic regression and naive Bayes?
+2.15 Why do you need a cost function?
+2.16 How the cost function works
+2.17 Why is the cost function non-negative?
+2.18 Common cost functions?
+2.19 Why use cross entropy instead of the quadratic cost function?
+2.20 What is a loss function?
+2.21 Common loss functions
+2.22 Why does logistic regression use a logarithmic loss function?
+How does the logarithmic loss function measure loss?
+2.23 Why is gradient descent needed in machine learning?
+2.24 What are the disadvantages of the gradient descent method?
+2.25 An intuitive understanding of the gradient descent method?
+2.23 What is the description of the gradient descent algorithm?
+2.24 How to tune the gradient descent method?
+2.25 What is the difference between stochastic gradient descent and batch gradient descent?
+2.26 Performance Comparison of Various Gradient Descent Methods
+2.27 Illustrated derivative calculation on a computational graph?
+2.28 Summary of the Ideas behind Linear Discriminant Analysis (LDA)
+2.29 Core Ideas of LDA Illustrated
+2.30 Principles of the two-class LDA algorithm?
+2.30 LDA algorithm flow summary?
+2.31 What is the difference between LDA and PCA?
+2.32 LDA advantages and disadvantages?
+2.33 Summary of the Ideas behind Principal Component Analysis (PCA)
+2.34 Core Ideas of PCA Illustrated
+2.35 Derivation of the PCA algorithm
+2.36 Summary of PCA Algorithm Flow
+2.37 Main advantages and disadvantages of the PCA algorithm
+2.38 Necessity and purpose of dimensionality reduction
+2.39 What is the difference between KPCA and PCA?
+2.40 Model Evaluation
+2.40.1 Common methods for model evaluation?
+2.40.2 Empirical error and generalization error
+2.40.3 Under-fitting and over-fitting illustrated
+2.40.4 How to solve over-fitting and under-fitting?
+2.40.5 The main role of cross-validation?
+2.40.6 k-fold cross-validation?
+2.40.7 Confusion Matrix
+2.40.8 Error Rate and Accuracy
+2.40.9 Precision and recall
+2.40.10 ROC and AUC
+2.40.11 How to draw the ROC curve?
+2.40.12 How to calculate TPR, FPR?
+2.40.13 How to calculate AUC?
+2.40.14 Why use ROC and AUC to evaluate the classifier?
+2.40.15 Intuitive understanding of AUC
+2.40.16 Cost-sensitive error rate and cost curve
+2.40.17 What are the comparison test methods for models?
+2.40.18 Bias and variance
+2.40.19 Why use standard deviation?
+2.40.20 The idea of point estimation
+2.40.21 Criteria for a good point estimator?
+2.40.22 The connection between point estimation, interval estimation, and the central limit theorem?
+2.40.23 What causes class imbalance?
+2.40.24 Common solutions to the class imbalance problem
+2.41 Decision Tree
+2.41.1 Basic Principles of Decision Trees
+2.41.2 Three elements of the decision tree?
+2.41.3 Decision Tree Learning Basic Algorithm
+2.41.4 Advantages and Disadvantages of Decision Tree Algorithms
+2.40.5 The concept and understanding of entropy
+2.40.6 Understanding of Information Gain
+2.40.7 The role and strategy of pruning?
+2.41 Support Vector Machine
+2.41.1 What is a support vector machine?
+2.25.2 Problems solved by the support vector machine?
+2.25.2 Function of the kernel function?
+2.25.3 Dual Problem
+2.25.4 Understanding Support Vector Regression
+2.25.5 Understanding SVM (Kernel Function)
+2.25.6 What are the common kernel functions?
+2.25.6 Soft Margin and Regularization
+2.25.7 Main features and disadvantages of SVM?
+2.26 Bayesian
+2.26.1 Maximum Likelihood Estimation Illustrated
+2.26.2 What is the difference between a naive Bayes classifier and a general Bayesian classifier?
+2.26.4 Naive and semi-naive Bayesian classifiers
+2.26.5 Three typical structures of Bayesian networks
+2.26.6 What is the Bayesian error rate?
+2.26.7 What is the Bayesian optimal error rate?
+2.27 Problems solved by the EM algorithm and its implementation process
+2.28 Why is there a curse of dimensionality?
+2.29 How to avoid the curse of dimensionality
+2.30 What is the difference and connection between clustering and dimension reduction?
+2.31 Differences between GBDT and random forests
+2.32 Comparison of four clustering methods
+
+**Chapter 3 Fundamentals of Deep Learning**
+
+3.1 Basic Concepts
+3.1.1 Neural network composition?
+3.1.2 What are the common model structures of neural networks?
+3.1.3 How to choose a deep learning development platform?
+3.1.4 Why use deep representation?
+3.1.5 Why are deep neural networks difficult to train?
+3.1.6 What is the difference between deep learning and machine learning?
+3.2 Network Operations and Calculations
+3.2.1 Forward Propagation and Back Propagation?
+3.2.2 How to calculate the output of the neural network?
+3.2.3 How to calculate the convolutional neural network output value?
+3.2.4 How do I calculate the output value of the pooling layer?
+3.2.5 Understanding back propagation through an example
+3.3 Hyperparameters
+3.3.1 What is a hyperparameter?
+3.3.2 How to find the optimal value of the hyperparameter?
+3.3.3 General procedure for hyperparameter search?
+3.4 Activation function
+3.4.1 Why do I need a nonlinear activation function?
+3.4.2 Common Activation Functions and Images
+3.4.3 Derivative calculation of common activation functions?
+3.4.4 What are the properties of the activation function?
+3.4.5 How do I choose an activation function?
+3.4.6 Advantages of using the ReLU activation function?
+3.4.7 When can I use the linear activation function?
+3.4.8 How to understand that ReLU (<0) is a nonlinear activation function?
+3.4.9 How is the Softmax function applied to multi-class classification?
+3.5 Batch_Size
+3.5.1 Why do I need Batch_Size?
+3.5.2 Selection of Batch_Size Values
+3.5.3 What are the benefits of increasing Batch_Size within a reasonable range?
+3.5.4 What is the disadvantage of blindly increasing Batch_Size?
+3.5.5 What is the impact of Batch_Size on the training effect?
+3.6 Normalization
+3.6.1 What is the meaning of normalization?
+3.6.2 Why normalize?
+3.6.3 Why can normalization improve the solution speed?
+3.6.4 3D illustration without normalization
+3.6.5 What types of normalization are there?
+3.6.6 The effect of local response normalization
+3.6.7 Understanding the local response normalization formula
+3.6.8 What is Batch Normalization?
+3.6.9 Advantages of the Batch Normalization (BN) Algorithm
+3.6.10 Batch normalization (BN) algorithm flow
+3.6.11 Batch normalization and group normalization
+3.6.12 Weight Normalization and Batch Normalization
+3.7 Pre-training and fine-tuning
+3.7.1 Why can unsupervised pre-training help deep learning?
+3.7.2 What is model fine-tuning?
+3.7.3 Are the network parameters updated during fine-tuning?
+3.7.4 Three states of the fine-tuning model
+3.8 Weight and Bias Initialization
+3.8.1 All initialized to 0
+3.8.2 All initialized to the same value
+3.8.3 Initializing to a Small Random Number
+3.8.4 Calibrating the variance with 1/sqrt(n)
+3.8.5 Sparse Initialization
+3.8.6 Initializing the bias
+3.9 Softmax
+3.9.1 Softmax Definition and Function
+3.9.2 Softmax Derivation
+3.10 Understanding the principles and functions of One-Hot Encoding?
+3.11 What are the commonly used optimizers?
+3.12 Questions about Dropout
+3.12.1 Choice of dropout rate
+3.27 Questions about Padding
+
+**Chapter 4 Classic Network**
+
+4.1 LeNet5
+4.1.1 Model Structure
+4.1.2 Model Structure
+4.1.3 Model characteristics
+4.2 AlexNet
+4.2.1 Model structure
+4.2.2 Model Interpretation
+4.2.3 Model characteristics
+4.3 Visualization ZFNet-Deconvolution
+4.3.1 Basic ideas and processes
+4.3.2 Convolution and Deconvolution
+4.3.3 Convolution Visualization
+4.3.4 Comparison of ZFNet and AlexNet
+4.4 VGG
+4.1.1 Model Structure
+4.1.2 Model Features
+4.5 Network in Network
+4.5.1 Model Structure
+4.5.2 Model Innovation Points
+4.6 GoogLeNet
+4.6.1 Model Structure
+4.6.2 Inception Structure
+4.6.3 Model hierarchy
+4.7 Inception Series
+4.7.1 Inception v1
+4.7.2 Inception v2
+4.7.3 Inception v3
+4.7.4 Inception v4
+4.7.5 Inception-ResNet-v2
+4.8 ResNet and its variants
+4.8.1 Reviewing ResNet
+4.8.2 Residual block
+4.8.3 ResNet Architecture
+4.8.4 Variants of residual blocks
+4.8.5 ResNeXt
+4.8.6 Densely Connected CNN
+4.8.7 ResNet as a combination of small networks
+4.8.8 Features of Paths in ResNet
+4.9 Why are current CNN models all adapted from GoogLeNet, VGGNet, or AlexNet?
+
+**Chapter 5 Convolutional Neural Network (CNN)**
+
+5.1 Component layers of convolutional neural networks
+5.2 How does convolution detect edge information?
+5.2 Several basic definitions of convolution?
+5.2.1 Convolution kernel size
+5.2.2 Stride of the convolution kernel
+5.2.3 Edge padding
+5.2.4 Input and Output Channels
+5.3 Convolution network type classification?
+5.3.1 Ordinary Convolution
+5.3.2 Dilated Convolution
+5.3.3 Transposed Convolution
+5.3.4 Separable Convolution
+5.3 Schematic of 12 different types of 2D convolution?
+5.4 What is the difference between 2D convolution and 3D convolution?
+5.4.1 2D Convolution
+5.4.2 3D Convolution
+5.5 What are the pooling methods?
+5.5.1 General Pooling
+5.5.2 Overlapping Pooling
+5.5.3 Spatial Pyramid Pooling
+5.6 1x1 convolution?
+5.7 What is the difference between the convolutional layer and the pooling layer?
+5.8 Is a larger convolution kernel always better?
+5.9 Can each convolution use only one size of convolution kernel?
+5.10 How can I reduce the number of convolution parameters?
+5.11 Must convolution operations consider both channels and regions?
+5.12 What are the benefits of using wide convolution?
+5.12.1 Narrow Convolution and Wide Convolution
+5.12.2 Why use wide convolution?
+5.13 The depth of the convolutional layer output equals the number of which component?
+5.14 How do I get the depth of the convolutional layer output?
+5.15 Is the activation function usually placed after the operation of the convolutional neural network?
+5.16 How to understand that the max pooling layer shrinks its input somewhat?
+5.17 Understanding Image Convolution and Deconvolution
+5.17.1 Image Convolution
+5.17.2 Image Deconvolution
+5.18 Image Size Calculation after Different Convolutions?
+5.18.1 Type division
+5.18.2 Calculation formula
+5.19 Summary of the relationship between stride, padding size, and input and output?
+5.19.1 No zero padding, unit stride
+5.19.2 Zero padding, unit stride
+5.19.3 No padding, non-unit stride
+5.19.4 Zero padding, non-unit stride
+5.20 Understanding deconvolution and checkerboard effects
+5.20.1 Why does the checkerboard phenomenon appear?
+5.20.2 What methods can avoid the checkerboard effect?
+5.21 What is the main computational bottleneck of CNN?
+5.22 Rules of thumb for setting CNN parameters
+5.23 Summary of methods for improving generalization ability
+5.23.1 Main methods
+5.23.2 Experimental proof
+5.24 What are the connections and differences between CNN and CLP?
+5.24.1 Connections
+5.24.2 Differences
+5.25 How does CNN highlight commonality?
+5.25.1 Local connection
+5.25.2 Weight sharing
+5.25.3 Pooling Operations
+5.26 Similarities and differences between full convolution and Local-Conv
+5.27 Understanding the role of Local-Conv through an example
+5.28 Brief History of Convolutional Neural Networks
+
+**Chapter 6 Recurrent Neural Network (RNN)**
+
+6.1 What is the difference between RNNs and FNNs?
+6.2 Typical characteristics of RNNs?
+6.3 What can RNNs do?
+6.4 Typical applications of RNNs in NLP?
+6.5 What are the similarities and differences between RNN training and traditional ANN training?
+6.6 Common RNN Extensions and Improved Models
+6.6.1 Simple RNNs (SRNs)
+6.6.2 Bidirectional RNNs
+6.6.3 Deep (Bidirectional) RNNs
+6.6.4 Echo State Networks (ESNs)
+6.6.5 Gated Recurrent Unit Recurrent Neural Networks
+6.6.6 LSTM Networks
+6.6.7 Clockwork RNNs (CW-RNNs)
+
+**Chapter 7 Target Detection**
+
+7.1 Candidate-based target detector
+7.1.1 Sliding Window Detector
+7.1.2 Selective Search
+7.1.3 R-CNN
+7.1.4 Bounding Box Regressor
+7.1.5 Fast R-CNN
+7.1.6 ROI Pooling
+7.1.7 Faster R-CNN
+7.1.8 Region Proposal Network
+7.1.9 Performance of the R-CNN method
+7.2 Region-based fully convolutional network (R-FCN)
+7.3 Single-Shot Target Detector
+7.3.1 Single-shot detector
+7.3.2 Sliding window for prediction
+7.3.3 SSD
+7.4 YOLO Series
+7.4.1 Introduction to YOLOv1
+7.4.2 What are the advantages and disadvantages of the YOLOv1 model?
+7.4.3 YOLOv2
+7.4.4 YOLOv2 Improvement Strategy
+7.4.5 Training of YOLOv2
+7.4.6 YOLO9000
+7.4.7 YOLOv3
+7.4.8 YOLOv3 Improvements
+
+**Chapter 8 Image Segmentation**
+
+8.1 What are the disadvantages of traditional CNN-based segmentation methods?
+8.1 FCN
+8.1.1 What has the FCN changed?
+8.1.2 FCN network structure?
+8.1.3 Example of a fully convolutional network?
+8.1.4 Why is it difficult for CNN to classify pixels?
+8.1.5 How can fully connected layers and convolutional layers be converted into each other?
+8.1.6 Why can the input picture of the FCN be any size?
+8.1.7 What are the benefits of reshaping the weights of the fully connected layer into convolutional layer filters?
+8.1.8 Understanding deconvolution
+8.1.9 Skip structure
+8.1.10 Model Training
+8.1.11 FCN Disadvantages
+8.2 U-Net
+8.3 SegNet
+8.4 Dilated Convolutions
+8.4 RefineNet
+8.5 PSPNet
+8.6 DeepLab Series
+8.6.1 DeepLabv1
+8.6.2 DeepLabv2
+8.6.3 DeepLabv3
+8.6.4 DeepLabv3+
+8.7 Mask-R-CNN
+8.7.1 Schematic diagram of the network structure of Mask-RCNN
+8.7.2 RCNN pedestrian detection framework
+8.7.3 Mask-RCNN Technical Highlights
+8.8 Application of CNN in Image Segmentation Based on Weakly Supervised Learning
+8.8.1 Scribble annotations
+8.8.2 Image-level labels
+8.8.3 DeepLab+bounding box+image-level labels
+8.8.4 Unified framework
+
+**Chapter 9 Reinforcement Learning**
+
+9.1 Main features of reinforcement learning?
+9.2 Reinforcement Learning Application Examples
+9.3 Differences between reinforcement learning and supervised learning and unsupervised learning
+9.4 What are the main algorithms for reinforcement learning?
+9.5 Deep Transfer Reinforcement Learning Algorithm
+9.6 Hierarchical Deep Reinforcement Learning Algorithm
+9.7 Deep Memory Reinforcement Learning Algorithm
+9.8 Multi-agent deep reinforcement learning algorithm
+9.9 Summary of deep reinforcement learning algorithms
+
+**Chapter 10 Transfer Learning**
+
+10.1 What is transfer learning?
+10.2 What is multi-task learning?
+10.3 What is the significance of multi-task learning?
+10.4 What is end-to-end deep learning?
+10.5 End-to-end deep learning example?
+10.6 What are the challenges of end-to-end deep learning?
+10.7 End-to-end deep learning advantages and disadvantages?
+
+**Chapter 13 Optimization Algorithm**
+
+13.1 What is the difference between CPU and GPU?
+13.2 How to solve the problem of too few training samples
+13.3 What sample sets are not suitable for deep learning?
+13.4 Is it possible to find a better algorithm than the known algorithm?
+13.5 What is collinearity, and how is it related to the fit?
+13.6 How is the generalized linear model applied in deep learning?
+13.7 What causes vanishing gradients?
+13.8 What are the weight initialization methods?
+13.9 In heuristic optimization algorithms, how to avoid falling into a local optimal solution?
+13.10 How to improve the GD method in convex optimization to prevent falling into a local optimal solution
+13.11 Common loss functions?
+13.14 How to do feature selection?
+13.14.1 How to consider feature selection
+13.14.2 Classification of feature selection methods
+13.14.3 Feature selection purpose
+13.15 Vanishing and exploding gradients: causes and solutions
+13.15.1 Why use gradient update rules?
+13.15.2 What causes vanishing and exploding gradients?
+13.15.3 Solutions for vanishing and exploding gradients
+13.16 Why does deep learning not use second-order optimization?
+13.17 How to optimize your deep learning system?
+13.18 Why set a single numerical evaluation indicator?
+13.19 Satisficing and optimizing metrics
+13.20 How to divide the training/development/test set
+13.21 How to choose the development/test set sizes
+13.22 When should I change development/test sets and metrics?
+13.23 What is the significance of setting the evaluation indicators?
+13.24 What is avoidable bias?
+13.25 What is the Top-5 error rate?
+13.26 What is the human-level error rate?
+13.27 What is the relationship between avoidable bias and the various error rates?
+13.28 How to choose between avoidable bias and Bayesian error rate?
+13.29 How to reduce the variance?
+13.30 Best estimate of Bayesian error rate
+13.31 What are examples of machine learning surpassing individual human performance?
+13.32 How can you improve your model?
+13.33 Understanding Error Analysis
+13.34 Why is it worth the time to look at mislabeled data?
+13.35 What is the significance of quickly setting up the initial system?
+13.36 Why should I train and test on different splits?
+13.37 How to solve the data mismatch problem?
+13.38 Gradient checking considerations?
+13.39 What is stochastic gradient descent?
+13.40 What is batch gradient descent?
+13.41 What is mini-batch gradient descent?
+13.42 How to configure mini-batch gradient descent
+13.43 The problem of local optima
+13.44 Ideas for improving algorithm performance
+
+**Chapter 14 Hyperparameter Tuning**
+
+14.1 Tuning process
+14.2 What are the hyperparameters?
+14.3 How do I choose tuning values?
+14.4 Choosing the right range for hyperparameters
+14.5 How do I search for hyperparameters?
+
+**Chapter 15 Heterogeneous Computing, GPU and Framework Selection Guide**
+
+15.1 What is heterogeneous computing?
+15.2 What is a GPGPU?
+15.3 Introduction to GPU Architecture
+ 15.3.1 Why use a GPU?
+ 15.3.2 What is a CUDA core?
+ 15.3.3 What is the role of the tensor core in the new Turing architecture for deep learning?
+ 15.3.4 What is the connection between GPU memory architecture and application performance?
+15.4 CUDA framework
+ 15.4.1 Is it difficult to do CUDA programming?
+ 15.4.2 cuDNN
+15.5 GPU hardware environment configuration recommendation
+ 15.5.1 GPU Main Performance Indicators
+ 15.5.2 Purchase Proposal
+15.6 Software Environment Construction
+ 15.6.1 Operating System Selection?
+ 15.6.2 Install natively or use Docker?
+ 15.6.3 GPU Driver Issues
+15.7 Framework Selection
+ 15.7.1 Comparison of mainstream frameworks
+ 15.7.2 Framework details
+ 15.7.3 Which frameworks are friendly to the deployment environment?
+ 15.7.4 How to choose the framework of the mobile platform?
+15.8 Other
+ 15.8.1 Configuration of a Multi-GPU Environment
+ 15.8.2 Is distributed training possible?
+ 15.8.3 Can I train or deploy a model in a SPARK environment?
+ 15.8.4 How to further optimize performance?
+ 15.8.5 What is the difference between TPU and GPU?
+ 15.8.6 What is the impact of future quantum computing on AI technology such as deep learning?
+
+**References**
+
diff --git a/English version/ch01_MathematicalBasis/Chapter 1_MathematicalBasis.md b/English version/ch01_MathematicalBasis/Chapter 1_MathematicalBasis.md
new file mode 100644
index 00000000..9c760a81
--- /dev/null
+++ b/English version/ch01_MathematicalBasis/Chapter 1_MathematicalBasis.md
@@ -0,0 +1,523 @@
+[TOC]
+
+# Chapter 1 Mathematical Foundation
+
+## 1.1 The relationship between scalars, vectors, matrices, and tensors
+**Scalar**
+A scalar is a single number, unlike most other objects studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics and usually give them lowercase variable names.
+
+**Vector**
+A vector is an ordered array of numbers. Indexing into that order identifies each individual number. Vectors are usually given bold lowercase variable names, such as $\mathbf{x}$. Elements of a vector are written in italics with a subscript: the first element of the vector $\mathbf{x}$ is $x_1$, the second element is $x_2$, and so on. We also state the type of the elements (real, imaginary, etc.) stored in the vector.
+
+**Matrix**
+A matrix is a collection of objects that share the same features and dimensions, laid out as a two-dimensional data table. Each object is a row of the matrix, each feature is a column, and every feature takes a numerical value. Matrices are usually given bold uppercase variable names, such as $A$.
+
+**Tensor**
+In some cases we need arrays with more than two axes. In general, an array whose elements are arranged on a regular grid with any number of axes is called a tensor. We use $A$ to denote the tensor "A". The element of the tensor $A$ at coordinates $(i,j,k)$ is denoted $A_{(i,j,k)}$.
+
+**Relationship between the four**
+
+> A scalar is a zeroth-order tensor, a vector is a first-order tensor, and a matrix is a second-order tensor. Example:
+> A scalar tells you the length of a stick, but not where the stick is pointing.
+> A vector tells you both the length of the stick and whether it points forward or backward.
+> A tensor tells you the length of the stick, whether it points forward or backward, and how much it is deflected up/down and left/right.
+
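As a rough illustration of the "order" idea above, the sketch below stores each object as a nested Python list and counts how many indices are needed to reach a single number. The helper name `tensor_order` and the sample values are hypothetical, chosen only for this example, and the helper assumes regular, non-empty nesting.

```python
def tensor_order(x):
    """Count how many indices are needed to reach a single number in x.

    Assumes x is either a number or a regular, non-empty nested list.
    """
    order = 0
    while isinstance(x, list):
        order += 1
        x = x[0]  # descend one level along the first index
    return order


scalar = 3.5                                    # no index needed
vector = [1.0, 2.0, 3.0]                        # one index: vector[i]
matrix = [[1, 2, 3], [4, 5, 6]]                 # two indices: matrix[i][j]
tensor = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # three indices: tensor[i][j][k]

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor", tensor)]:
    print(name, "has order", tensor_order(value))
```

The nested-list representation is only one possible choice; array libraries track the same quantity as the number of dimensions of an array.
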
+## 1.2 What is the difference between tensor and matrix?
+- From an algebraic perspective, a matrix is a generalization of a vector. A vector can be seen as a one-dimensional "table" (its components are arranged in a single row, in order), a matrix is a two-dimensional "table" (its components are arranged by row and column), and an $n$th-order tensor is then an $n$-dimensional "table". The rigorous definition of a tensor is given in terms of linear maps.
+- Geometrically, a matrix is a true geometric quantity: it does not change under a coordinate transformation of the frame of reference. Vectors share this property.
+- A second-order tensor in three dimensions can be written in the form of a 3×3 matrix.
+- The number that represents a scalar and the three-component array that represents a vector can likewise be regarded as 1×1 and 1×3 matrices, respectively.
+
+## 1.3 Matrix and vector multiplication results
+Multiplying a matrix with $m$ rows and $n$ columns by a vector with $n$ rows gives a vector with $m$ rows. Each row of the matrix is treated as a row vector and multiplied with the vector (a dot product), which produces one entry of the result.
+
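The row-by-row rule can be sketched in a few lines of plain Python. The function name `matvec` and the sample matrix and vector are hypothetical, used only for illustration; the matrix is assumed to be stored as a list of rows.

```python
def matvec(A, x):
    """Multiply an m x n matrix A (a list of m rows) by a length-n vector x."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("every row of A must have as many entries as x")
    # Each row of A is dotted with x and yields one entry of the result.
    return [sum(a * b for a, b in zip(row, x)) for row in A]


A = [[1, 2, 3],
     [4, 5, 6]]      # m = 2 rows, n = 3 columns
x = [1, 0, -1]       # n = 3 entries

y = matvec(A, x)     # m = 2 entries
print(y)
```

The result has one entry per row of `A`, which matches the $m$-row vector described above.
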
+## 1.4 Summary of vector and matrix norms
+**Vector norm**
+Take the vector $\vec{a}=[-5, 6, 8, -10]$ as an example, and let $\vec{x}=(x_1,x_2,...,x_N)$ be an arbitrary vector. The different norms are computed as follows:
+
+- 1-norm of the vector: the sum of the absolute values of the elements of the vector. For the vector $\vec{a}$ above, the 1-norm is 29.
+
+$$
+\Vert\vec{x}\Vert_1=\sum_{i=1}^N\vert{x_i}\vert
+$$
+
+- 2-norm of the vector: the square root of the sum of the squares of the elements of the vector. For the vector $\vec{a}$ above, the 2-norm is 15.
+
+$$
+\Vert\vec{x}\Vert_2=\sqrt{\sum_{i=1}^N{\vert{x_i}\vert}^2}
+$$
+
+- Negative infinity norm of the vector: the smallest of the absolute values of the elements of the vector. For the vector $\vec{a}$ above, the negative infinity norm is 5.
+
+$$
+\Vert\vec{x}\Vert_{-\infty}=\min{|{x_i}|}
+$$
+
+- Positive infinity norm of the vector: the largest of the absolute values of the elements of the vector. For the vector $\vec{a}$ above, the positive infinity norm is 10.
+
+$$
+\Vert\vec{x}\Vert_{+\infty}=\max{|{x_i}|}
+$$
+
+- p-norm of the vector:
+
+$$
+L_p=\Vert\vec{x}\Vert_p=\sqrt[p]{\sum_{i=1}^{N}|{x_i}|^p}
+$$
+
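As a quick check of the numbers above, here is a minimal sketch that evaluates the vector norms of $\vec{a}$ using only the Python standard library. The helper name `p_norm` is hypothetical and assumes $p \ge 1$.

```python
import math


def p_norm(x, p):
    """p-norm of a vector: the p-th root of the sum of |x_i|**p (assumes p >= 1)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


a = [-5, 6, 8, -10]

norm_1 = sum(abs(v) for v in a)              # 5 + 6 + 8 + 10 = 29
norm_2 = math.sqrt(sum(v * v for v in a))    # sqrt(225) = 15.0
norm_neg_inf = min(abs(v) for v in a)        # smallest absolute value: 5
norm_pos_inf = max(abs(v) for v in a)        # largest absolute value: 10

print(norm_1, norm_2, norm_neg_inf, norm_pos_inf)

# The general p-norm reproduces the 1-norm and 2-norm for p = 1 and p = 2,
# and approaches the positive infinity norm as p grows.
print(p_norm(a, 1), p_norm(a, 2), p_norm(a, 50))
```
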
+**Matrix norm**
+
+Take the matrix $A=[-1, 2, -3; 4, -6, 6]$ as an example, and let $A_{m\times n}$ be an arbitrary matrix with elements $a_{ij}$.
+
+The norm of the matrix is defined as
+
+$$
+\Vert{A}\Vert_p :=\sup_{x\neq 0}\frac{\Vert{Ax}\Vert_p}{\Vert{x}\Vert_p}
+$$
+
+Choosing different vector norms gives the corresponding matrix norms.
+
+- **1-norm of the matrix (column norm)**: sum the absolute values of the elements in each column, then take the largest sum (the maximum column sum). For the matrix $A$ above, the column sums are $[5,8,9]$, so the 1-norm is 9.
+
+$$
+\Vert A\Vert_1=\max_{1\le j\le n}\sum_{i=1}^m|{a_{ij}}|
+$$
+
+- **2-norm of the matrix**: the square root of the largest eigenvalue of the matrix $A^TA$. For the matrix $A$ above, the result is 10.0623.
+
+$$
+\Vert A\Vert_2=\sqrt{\lambda_{max}(A^T A)}
+$$
+
+where $\lambda_{max}(A^T A)$ is the largest eigenvalue of $A^T A$ (all eigenvalues of $A^TA$ are non-negative).
+
+- **Infinity norm of the matrix (row norm)**: sum the absolute values of the elements in each row, then take the largest sum (maximum row sum). For the matrix $A$ above, the row sums are $[6;16]$, so the result is 16.
+
+$$
+\Vert A\Vert_{\infty}=\max_{1\le i \le m}\sum_{j=1}^n |{a_{ij}}|
+$$
+
+- **Nuclear norm of the matrix**: the sum of the singular values of the matrix (obtained from its SVD). It can be used for low-rank representation, because minimizing the nuclear norm is a convex surrogate for minimizing the rank of the matrix. For the matrix $A$ above, the result is 10.9287.
+
+- **Matrix L0 norm**: the number of non-zero elements of the matrix, usually used to express sparsity: the smaller the L0 norm, the more zero elements and the sparser the matrix. For the matrix $A$ above, the result is 6.
+- **Matrix L1 norm**: the sum of the absolute values of all elements of the matrix. It is the tightest convex approximation of the L0 norm, so it can also express sparsity. For the matrix $A$ above, the result is 22.
+- **F-norm of the matrix**: the square root of the sum of the squares of all elements of the matrix. It is also commonly called the L2 norm of the matrix (not to be confused with the 2-norm above). Its advantage is that it is a convex function that is easy to differentiate and compute. For the matrix $A$ above, the result is 10.0995.
+
+$$
+\Vert A\Vert_F=\sqrt{(\sum_{i=1}^m\sum_{j=1}^n{| a_{ij}|}^2)}
+$$
+
+- **Matrix L21 norm**: first take the F-norm of each column (equivalently, the 2-norm of each column vector), then take the L1 norm of the results (equivalently, the 1-norm of that vector). It can be seen as a norm between L1 and L2. For the matrix $A$ above, the result is 17.1559.
+- **Entrywise p-norm of the matrix**:
+
+$$
+\Vert A\Vert_p=\sqrt[p]{(\sum_{i=1}^m\sum_{j=1}^n{| a_{ij}|}^p)}
+$$
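+
+The singular-value-based results above can be checked by hand (a worked sketch added for illustration). Since $AA^T$ and $A^TA$ share the same non-zero eigenvalues, it is enough to work with the smaller $2\times 2$ matrix $AA^T$ for $A=[-1, 2, -3; 4, -6, 6]$:
+
+$$
+\begin{align*}
+AA^T &= \begin{bmatrix} 14 & -34 \\ -34 & 88 \end{bmatrix} \\
+\lambda^2 - 102\lambda + 76 = 0 &\implies \lambda = 51 \pm \sqrt{2525} \approx 101.2494,\ 0.7506 \\
+\sigma_1 = \sqrt{101.2494} &\approx 10.0623, \quad \sigma_2 = \sqrt{0.7506} \approx 0.8664 \\
+\Vert A\Vert_2 = \sigma_1 &\approx 10.0623 \\
+\sigma_1+\sigma_2 &\approx 10.9287 \\
+\Vert A\Vert_F = \sqrt{\sigma_1^2+\sigma_2^2} &= \sqrt{102} \approx 10.0995
+\end{align*}
+$$
+
+The sum $\sigma_1+\sigma_2$ is the sum of the singular values quoted above, and the F-norm agrees with summing the squared entries directly: $1+4+9+16+36+36=102$.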
+
+## 1.5 How to judge whether a matrix is positive definite?
+For a real symmetric matrix $A$ of order $n$, each of the following conditions is equivalent to $A$ being positive definite:
+
+- all leading principal minors are greater than 0;
+- there exists an invertible matrix $C$ such that $A=C^TC$;
+- the positive index of inertia equals $n$;
+- $A$ is congruent to the identity matrix $E$ (i.e., its canonical form is $E$);
+- all main diagonal elements of its standard form are positive;
+- all eigenvalues are positive;
+- $A$ is the Gram (metric) matrix of some basis.
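+
+A small worked example may help (the matrix below is chosen here for illustration and does not come from the original text). For the symmetric matrix
+
+$$
+A=\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}
+$$
+
+several of the criteria can be verified directly:
+
+$$
+\begin{align*}
+\text{leading principal minors:}\quad & 2 > 0, \quad \det(A) = 4-1 = 3 > 0 \\
+\text{eigenvalues:}\quad & (2-\lambda)^2-1=0 \implies \lambda = 1,\ 3 \\
+\text{factorization:}\quad & A = C^TC, \quad C=\begin{bmatrix} \sqrt{2} & -\frac{1}{\sqrt{2}} \\ 0 & \sqrt{\frac{3}{2}} \end{bmatrix}
+\end{align*}
+$$
+
+All three checks agree that this matrix is positive definite.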
+
+## 1.6 Derivatives and partial derivatives
+**Definition of the derivative**:
+
+The derivative is the limit of the ratio of the change in the function value to the change in the independent variable as the change in the independent variable tends to zero. Geometrically, it is the slope of the tangent line at that point. Physically, it is the (instantaneous) rate of change at that moment.
+
+
+*Note*: In a unary function only one independent variable changes, so there is only one direction of change; this is why a unary function has no partial derivatives. Physics distinguishes average speed from instantaneous speed. The average speed is
+
+$$
+v=\frac{s}{t}
+$$
+
+Where $v$ represents the average speed, $s$ represents the distance, and $t$ represents the time. This formula can be rewritten as
+
+$$
+\bar{v}=\frac{\Delta s}{\Delta t}=\frac{s(t_0+\Delta t)-s(t_0)}{\Delta t}
+$$
+
+where $\Delta s$ is the distance between two points and $\Delta t$ is the time taken to cover that distance. When $\Delta t$ tends to 0 ($\Delta t \to 0$), that is, when the time interval becomes very short, the average speed becomes the instantaneous speed at time $t_0$:
+
+$$
+v(t_0)=\lim_{\Delta t \to 0}{\bar{v}}=\lim_{\Delta t \to 0}{\frac{\Delta s}{\Delta t}}=\lim_{\Delta t \to 0}{\frac{s(t_0+\Delta t)-s(t_0)}{\Delta t}}
+$$
+
+In fact, the expression above is the derivative of the function $s$ with respect to time $t$ at $t=t_0$. In general, the derivative is defined as follows: if the limit of the average rate of change exists, i.e.,
+
+$$
+\lim_{\Delta x \to 0}{\frac{\Delta y}{\Delta x}}=\lim_{\Delta x \to 0}{\frac{f(x_0+\Delta x)-f(x_0 )}{\Delta x}}
+$$
+
+then this limit is called the derivative of the function $y=f(x)$ at the point $x_0$, denoted $f'(x_0)$, $y'\vert_{x=x_0}$, $\frac{dy}{dx}\vert_{x=x_0}$, or $\frac{df(x)}{dx}\vert_{x=x_0}$.
+
+In layman's terms, the derivative is the slope of the curve at a certain point.
+
+**Partial derivative**:
+
+Partial derivatives involve at least two independent variables. Taking two independent variables as an example, $z=f(x,y)$, going from the derivative to the partial derivative means going from a curve to a surface. At a point on a curve there is only one tangent line, but at a point on a surface there are infinitely many tangent lines. The partial derivative is the rate of change of a multivariate function along a coordinate axis.
+
+
+*Note*: Intuitively speaking, the partial derivative is the rate of change of the function along the positive direction of the coordinate axis at a certain point.
+
+Let the function $z=f(x,y)$ be defined in a neighborhood of the point $(x_0,y_0)$. When $y=y_0$, $z$ can be regarded as a unary function $f(x,y_0)$ of $x$. If this unary function is differentiable at $x=x_0$, then
+
+$$
+\lim_{\Delta x \to 0}{\frac{f(x_0+\Delta x,y_0)-f(x_0,y_0)}{\Delta x}}=A
+$$
+
+that is, the limit $A$ exists. Then $A$ is called the partial derivative of the function $z=f(x,y)$ with respect to $x$ at the point $(x_0,y_0)$, denoted $f_x(x_0,y_0)$, $\frac{\partial z}{\partial x}\vert_{y=y_0}^{x=x_0}$, $\frac{\partial f}{\partial x}\vert_{y=y_0}^{x=x_0}$, or $z_x\vert_{y=y_0}^{x=x_0}$.
+
+To compute a partial derivative, treat the other variable as a constant and differentiate as usual. For example, the partial derivative of $z=3x^2+xy$ with respect to $x$ is $z_x=6x+y$; here $y$ acts as a constant coefficient of $x$.
+
+Geometrically, the partial derivative $f_x(x_0,y_0)$ is the slope, at $x=x_0$, of the tangent to the curve in which the surface $z=f(x,y)$ intersects the plane $y=y_0$; likewise, $f_y(x_0,y_0)$ is the slope, at $y=y_0$, of the tangent to the curve of intersection with the plane $x=x_0$.
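+
+For the example $z=3x^2+xy$, the same result follows directly from the limit definition (a worked check added for illustration):
+
+$$
+\begin{align*}
+z_x &= \lim_{\Delta x \to 0}\frac{3(x+\Delta x)^2+(x+\Delta x)y-3x^2-xy}{\Delta x} \\
+&= \lim_{\Delta x \to 0}\frac{6x\Delta x+3(\Delta x)^2+y\Delta x}{\Delta x} \\
+&= \lim_{\Delta x \to 0}(6x+3\Delta x+y) = 6x+y
+\end{align*}
+$$
+
+Holding $x$ constant in the same way gives $z_y=x$.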
+
+## 1.7 What is the difference between the derivative and the partial derivative?
+There is no essential difference between the derivative and the partial derivative. Both are the limit (when it exists) of the ratio of the change in the function value to the change in an independent variable as that change tends to zero.
+
+> - For a unary function, one $y$ corresponds to one $x$, and there is only one derivative.
+> - For a binary function, one $z$ corresponds to one $x$ and one $y$, and there are two derivatives: the derivative of $z$ with respect to $x$ and the derivative of $z$ with respect to $y$. These are called partial derivatives.
+> - When computing a partial derivative with respect to one variable, treat the other variable as a constant and differentiate only with respect to the variable of interest. The computation then reduces to differentiating a unary function.
+
+## 1.8 Eigenvalue decomposition and eigenvectors
+- Eigenvalue decomposition yields the eigenvalues and eigenvectors of a matrix;
+
+- An eigenvalue indicates how important a feature is, and the corresponding eigenvector indicates what the feature is.
+
+If a vector $\vec{v}$ is an eigenvector of the square matrix $A$, it satisfies:
+
+$$
+A\vec{v} = \lambda \vec{v}
+$$
+
+where $\lambda$ is the eigenvalue corresponding to the eigenvector $\vec{v}$. Eigenvalue decomposition factors a (diagonalizable) matrix into the following form:
+
+$$
+A=Q\Sigma Q^{-1}
+$$
+
+where $Q$ is the matrix whose columns are the eigenvectors of $A$, and $\Sigma$ is a diagonal matrix whose diagonal elements are the eigenvalues, arranged from largest to smallest. The eigenvectors corresponding to these eigenvalues describe the directions in which the matrix acts (ordered from the dominant change to the minor changes). In other words, the information in the matrix $A$ can be represented by its eigenvalues and eigenvectors.
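+
+A small worked example (the matrix is chosen here for illustration and is not from the original text): for
+
+$$
+A=\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}
+$$
+
+the characteristic equation $(2-\lambda)^2-1=0$ gives the eigenvalues $3$ and $1$, with unit eigenvectors $\frac{1}{\sqrt{2}}(1,1)^T$ and $\frac{1}{\sqrt{2}}(1,-1)^T$. The decomposition is therefore
+
+$$
+A=\frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
+\begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}
+\frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
+$$
+
+Here the eigenvector matrix happens to be symmetric and orthogonal, so it equals its own inverse; multiplying the three factors recovers $A$.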
+
+## 1.9 What is the relationship between singular values and eigenvalues?
+How do singular values correspond to eigenvalues? Multiply the transpose of a matrix $A$ by $A$ to obtain $A^TA$, and compute its eigenvalues and eigenvectors, which satisfy:
+
+$$
+(A^TA)v_i = \lambda_i v_i
+$$
+
+Here the $v_i$ are the right singular vectors of $A$. In addition:
+
+$$
+\sigma_i = \sqrt{\lambda_i}, u_i=\frac{1}{\sigma_i}Av_i
+$$
+
+Here the $\sigma_i$ are the singular values and the $u_i$ are the left singular vectors. (The proof is omitted here.)
+The singular values $\sigma$ are similar to eigenvalues: they are also arranged from largest to smallest in the matrix $\Sigma$, and they often decay very quickly. In many cases, the largest 10% or even 1% of the singular values account for more than 99% of the sum of all singular values. In other words, we can approximate the matrix using only the $r$ largest singular values ($r$ much smaller than $m$ and $n$), which gives the partial (truncated) singular value decomposition:
+
+$$
+A_{m\times n}\approx U_{m \times r}\Sigma_{r\times r}V_{r \times n}^T
+$$
+
+Multiplying the three matrices on the right gives a matrix close to $A$. The closer $r$ is to the rank of $A$, the closer the product is to $A$.
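+
+To make the approximation quality concrete, here is a worked sketch using the $2\times 3$ example matrix $A=[-1, 2, -3; 4, -6, 6]$ from section 1.4, whose singular values are approximately $\sigma_1\approx 10.0623$ and $\sigma_2\approx 0.8664$ (derived from the 2-norm and singular value sum quoted there). Keeping only $r=1$ singular value gives an approximation $A_1$ with
+
+$$
+\begin{align*}
+\frac{\sigma_1}{\sigma_1+\sigma_2} &\approx \frac{10.0623}{10.9287} \approx 92.1\% \\
+\Vert A-A_1\Vert_F &= \sigma_2 \approx 0.8664 \\
+\frac{\Vert A-A_1\Vert_F}{\Vert A\Vert_F} &\approx \frac{0.8664}{10.0995} \approx 8.6\%
+\end{align*}
+$$
+
+The second line uses the standard fact that the Frobenius error of a truncated SVD is the square root of the sum of the squared discarded singular values.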
+
+## 1.10 Why does machine learning use probability?
+The probability of an event measures how likely the event is to occur. Although whether an event occurs in a single random trial is a matter of chance, random trials that can be repeated many times under the same conditions tend to show clear quantitative regularities.
+Machine learning must deal with uncertain quantities and, sometimes, with random quantities. Uncertainty and randomness can come from many sources, and probability theory is used to quantify them.
+Probability theory plays a central role in machine learning because the design of machine learning algorithms often relies on probabilistic assumptions about the data.
+
+> For example, Andrew Ng's machine learning course presents the naive Bayes assumption, which is an example of conditional independence. The learning algorithm uses the content of an email to decide whether it is spam. It assumes that, whether or not the message is spam, the probability that word x appears in the message is conditionally independent of word y. This assumption clearly loses generality, because some words almost always appear together. In practice, however, this simple assumption has little effect on the results and still lets us identify spam quickly.
+
+## 1.11 What is the difference between a variable and a random variable?
+**Random variable**
+
+A random variable is a real-valued function defined on the possible outcomes (all sample points) of a random phenomenon, i.e., a phenomenon that does not always produce the same result under given conditions. For example, the number of passengers waiting at a bus stop at a certain time and the number of calls received by a telephone exchange in a certain period are random variables.
+The essential difference between the uncertainty of a random variable and that of a fuzzy variable is that the outcome of a fuzzy variable remains uncertain even after it is measured; that is, it is ambiguous.
+
+**The difference between a variable and a random variable:**
+When the probability that a variable takes a given value is not 1, the variable is a random variable; when a random variable takes a value with probability 1, it becomes an ordinary variable.
+
+> For example:
+> If the variable $x$ takes the value 100 with probability 1, then $x=100$ is determined and will not change unless further operations are applied.
+> If $x$ takes the value 50 with probability 0.5 and the value 100 with probability 0.5, then its value changes from one observation to the next, so $x$ is a random variable that equals 50 or 100 with probability 0.5 (50%) each.
+
+## 1.12 What is the relationship between random variables and probability distributions?
+
+A random variable only describes the states that may occur; a probability distribution must be given along with it to specify the probability of each state. The way of describing how likely each possible state of a random variable, or of a set of random variables, is to occur is called the **probability distribution**.
+
+Random variables can be divided into discrete random variables and continuous random variables.
+
+The corresponding functions that describe their probability distributions are:
+
+Probability Mass Function (PMF): describes the probability distribution of a discrete random variable, usually denoted by an uppercase $P$.
+
+Probability Density Function (PDF): describes the probability distribution of a continuous random variable, usually denoted by a lowercase $p$.
+
+### 1.12.1 Discrete random variables and probability mass functions
+
+A PMF maps each state that a random variable can take to the probability that the random variable takes that state.
+
+- In general, $P(x)$ denotes the probability that $X=x$.
+- Sometimes, to avoid confusion, the name of the random variable is written explicitly: $P(X=x)$.
+- Sometimes we first define a random variable and then specify the probability distribution it follows: $X \sim P(X)$.
+
+A PMF can act on several random variables at once, giving a joint probability distribution: $P(X=x, Y=y)$ denotes the probability that $X=x$ and $Y=y$ hold simultaneously, and can be abbreviated as $P(x,y)$.
+
+If a function $P$ is a PMF of the random variable $X$, it must satisfy the following three conditions:
+
+- The domain of $P$ must be the set of all possible states of $X$.
+- $\forall x \in X$, $0 \leq P(x) \leq 1$.
+- $\sum_{x \in X} P(x)=1$. This property is called normalization.
+
+### 1.12.2 Continuous Random Variables and Probability Density Functions
+
+If a function $p$ is a PDF of the random variable $X$, it must satisfy the following conditions:
+
+- The domain of $p$ must be the set of all possible states of $X$.
+- $\forall x \in X, p(x) \geq 0$. Note that we do not require $p(x) \leq 1$, because $p(x)$ is not the probability of the state itself but a relative measure (density) of probability. Actual probabilities are obtained by integration.
+- $\int p(x)dx=1$: the density integrates to 1, so the total probability is still 1.
+
+Note: the PDF $p(x)$ does not directly give the probability of a particular state; it gives a density. The probability of landing in an infinitesimal region of width $\delta x$ is $p(x)\delta x$. We therefore cannot ask for the probability of a single state; what we can compute is the probability that $x$ falls in an interval $[a,b]$, which is $\int_{a}^{b}p(x)dx$.
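+
+A simple illustration of why $p(x)$ may exceed 1 (the uniform density below is an example chosen here, not taken from the original text): take the uniform distribution on $[0, \frac{1}{2}]$, with $p(x)=2$ on that interval and $p(x)=0$ elsewhere. Then
+
+$$
+\begin{align*}
+\int_{-\infty}^{+\infty}p(x)dx &= \int_{0}^{1/2}2\,dx = 1 \\
+P(0.1\leq x\leq 0.2) &= \int_{0.1}^{0.2}2\,dx = 0.2
+\end{align*}
+$$
+
+The density is 2 everywhere on the interval, yet every probability obtained by integration stays between 0 and 1.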
+
+## 1.13 Common probability distributions
+
+### 1.13.1 Bernoulli Distribution
+
+The **Bernoulli distribution** is the distribution of a single binary random variable. It is controlled by a single parameter $\phi \in [0,1]$, which gives the probability that the random variable equals 1. Its main properties are:
+$$
+\begin{align*}
+P(x=1) &= \phi \\
+P(x=0) &= 1-\phi \\
+P(x=x) &= \phi^x(1-\phi)^{1-x} \\
+\end{align*}
+$$
+Its expectation and variance are:
+$$
+\begin{align*}
+E_x[x] &= \phi \\
+Var_x(x) &= \phi{(1-\phi)}
+\end{align*}
+$$
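+
+Both results follow in one line from the definitions (a short derivation added for illustration; it uses the fact that $x^2=x$ when $x\in\{0,1\}$):
+
+$$
+\begin{align*}
+E_x[x] &= 0\cdot(1-\phi)+1\cdot\phi = \phi \\
+Var_x(x) &= E_x[x^2]-E_x[x]^2 = \phi-\phi^2 = \phi(1-\phi)
+\end{align*}
+$$
+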
+The **Multinoulli distribution**, also called the **categorical distribution**, is a distribution over a single discrete random variable with $k$ different states, where $k$ is finite. It is often used to represent the distribution over **object categories**. The Multinoulli distribution is parameterized by a vector $\vec{p}\in[0,1]^{k-1}$, where each component $p_i$ is the probability of the $i$-th state, and $p_k=1-\vec{1}^T\vec{p}$.
+
+**Scope of application**: the **Bernoulli distribution** is suitable for modeling **discrete** random variables.
+
+### 1.13.2 Gaussian distribution
+
+The Gaussian distribution is also called the normal distribution. Its probability density function is:
+$$
+N(x;\mu,\sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}\exp\left ( -\frac{1}{2\sigma^2}(x -\mu)^2 \right )
+$$
+where $\mu$ and $\sigma$ are the mean and the standard deviation, respectively ($\sigma^2$ is the variance). The $x$ coordinate of the central peak is given by $\mu$, and the width of the peak is controlled by $\sigma$. The maximum is attained at $x=\mu$, and the inflection points are at $x=\mu\pm\sigma$.
+
+In a normal distribution, the probabilities that a value falls within $\pm1\sigma$, $\pm2\sigma$, and $\pm3\sigma$ of the mean are about 68.3%, 95.5%, and 99.73%, respectively. These three numbers are worth remembering.
+
+Setting $\mu=0, \sigma=1$ reduces the Gaussian distribution to the standard normal distribution:
+$$
+N(x;0,1) = \sqrt{\frac{1}{2\pi}}\exp\left ( -\frac{1}{2}x^2 \right )
+$$
+To evaluate the probability density function efficiently, it can be parameterized by the precision $\beta$:
+$$
+N(x;\mu,\beta^{-1})=\sqrt{\frac{\beta}{2\pi}}\exp\left(-\frac{1}{2}\beta(x-\mu)^2\right)
+$$
+
+
+where the parameter $\beta=\frac{1}{\sigma^2}\in(0,\infty)$ controls the precision of the distribution.
+
+### 1.13.3 When should the normal distribution be used?
+
+Q: When should the normal distribution be used?
+A: When we have no prior knowledge about a distribution over the real numbers and do not know which form to choose, the normal distribution is a default choice that is rarely wrong. The reasons are as follows:
+
+1. The central limit theorem tells us that the sum of many independent random variables is approximately normally distributed. In practice, many complex systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into more structured parts.
+2. Among all probability distributions with the same variance, the normal distribution has the greatest uncertainty (maximum entropy). In other words, it is the distribution that adds the least prior knowledge to the model.
+
+Generalization of the normal distribution:
+The normal distribution can be generalized to the space $R^n$, where it is called the **multivariate normal distribution**. Its parameters are a mean vector $\vec\mu$ and a positive definite symmetric covariance matrix $\Sigma$:
+$$
+N(x;\vec\mu,\Sigma)=\sqrt{\frac{1}{(2\pi)^n\det(\Sigma)}}\exp\left(-\frac{1}{2}(\vec{x}-\vec{\mu})^T\Sigma^{-1}(\vec{x}-\vec{\mu})\right)
+$$
+To evaluate the density of the multivariate normal distribution efficiently, use the precision matrix:
+$$
+N(x;\vec{\mu},\vec\beta^{-1}) = \sqrt{\frac{\det(\vec\beta)}{(2\pi)^n}}\exp\left(-\frac{1}{2}(\vec{x}-\vec\mu)^T\vec\beta(\vec{x}-\vec\mu)\right)
+$$
+Here, $\vec\beta$ is the precision matrix.
+
+### 1.13.4 Exponential distribution
+
+In deep learning, we often want a probability distribution with a sharp point at $x=0$. The exponential distribution provides this; it is defined as follows:
+$$
+p(x;\lambda)=\lambda \mathbf{1}_{x\geq 0}\exp(-\lambda{x})
+$$
+The exponential distribution uses the indicator function $\mathbf{1}_{x\geq 0}$ to assign probability zero to all negative values of $x$.
+
+### 1.13.5 Laplace Distribution
+
+A closely related probability distribution is the Laplace distribution, which allows us to place the peak of the probability mass at an arbitrary point $\mu$:
+$$
+Laplace(x;\mu,\gamma)=\frac{1}{2\gamma}\exp\left(-\frac{|x-\mu|}{\gamma}\right)
+$$
+
+### 1.13.6 Dirac distribution and empirical distribution
+
+The Dirac distribution concentrates all the mass of a probability distribution at a single point. It is defined using the Dirac $\delta$ function (also known as the **unit impulse function**):
+$$
+p(x)=\delta(x-\mu), \quad \delta(x-\mu)=0 \text{ for } x\neq \mu
+$$
+
+$$
+\int_{a}^{b}\delta(x-\mu)dx = 1, a < \mu < b
+$$
+
+The Dirac distribution often appears as a component of the empirical distribution:
+$$
+\hat{p}(\vec{x})=\frac{1}{m}\sum_{i=1}^{m}\delta(\vec{x}-{\vec{x}}^{(i)})
+$$
+where the $m$ points $\vec{x}^{(1)},...,\vec{x}^{(m)}$ form the given data set. The **empirical distribution** assigns probability mass $\frac{1}{m}$ to each of these points.
+
+When we train a model on a training set, we can regard the empirical distribution obtained from this training set as specifying the distribution from which we sample.
+
+**Scope of application**: the Dirac $\delta$ function is suitable for defining the empirical distribution of **continuous** random variables.
+
+## 1.14 Understanding conditional probability through an example
+The conditional probability formula is as follows:
+
+$$
+P(A|B) = P(A\cap B) / P(B)
+$$
+
+Description: let $A$ and $B$ be events (subsets) of the same sample space $\Omega$. Given that an element drawn at random from $\Omega$ belongs to $B$, the probability that it also belongs to $A$ is defined as the conditional probability of $A$ given $B$.
+
+
+A Venn diagram makes it clear that, given that event $B$ has occurred, the probability that event $A$ occurs is $P(A\bigcap B)$ divided by $P(B)$.
+Example: A couple has two children. Given that one of them is a girl, what is the probability that the other is also a girl? (This question has come up in interviews and written tests.)
+**Enumeration method**: Knowing that one child is a girl, the sample space is {boy-girl, girl-girl, girl-boy}, so the probability that the other child is also a girl is 1/3.
+**Conditional probability method**: $P(\text{both girls}|\text{at least one girl})=P(\text{both girls})/P(\text{at least one girl})$. The couple has two children, so the sample space is {girl-girl, boy-girl, girl-boy, boy-boy}. Then $P(\text{both girls})=1/4$ and $P(\text{at least one girl})=1-P(\text{boy-boy})=3/4$, so the answer is $1/3$.
+A common mistake here is to treat boy-girl and girl-boy as the same case; in fact they are different cases (older brother with younger sister versus older sister with younger brother).
+
+## 1.15 What is the difference between joint probability and marginal probability?
+**The difference:**
+Joint probability: a probability such as $P(X=a, Y=b)$ that involves several conditions all holding at the same time. It is the probability that multiple random variables each satisfy their respective conditions in a multivariate probability distribution.
+Marginal probability: the probability that an event occurs regardless of other events. It is a probability such as $P(X=a)$ or $P(Y=b)$ that involves only a single random variable.
+
+**Relationship:**
+The marginal distributions can be obtained from the joint distribution, but the joint distribution cannot be recovered from the marginal distributions alone.
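+
+A small example shows both directions (the two distributions below are constructed here for illustration). For discrete variables, summing the joint distribution over the other variable gives the single-variable distribution:
+
+$$
+P(X=x)=\sum_{y}P(X=x,Y=y)
+$$
+
+Now take two binary variables $X, Y\in\{0,1\}$ and two different joint distributions $P_1$ and $P_2$:
+
+$$
+\begin{align*}
+P_1(0,0)&=P_1(1,1)=\tfrac{1}{2}, \quad P_1(0,1)=P_1(1,0)=0 \\
+P_2(0,0)&=P_2(0,1)=P_2(1,0)=P_2(1,1)=\tfrac{1}{4}
+\end{align*}
+$$
+
+In both cases $P(X=0)=P(X=1)=P(Y=0)=P(Y=1)=\frac{1}{2}$, so the single-variable distributions are identical even though the joint distributions differ. This is why the joint distribution cannot be reconstructed from them alone.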
+
+## 1.16 The chain rule of conditional probability
+The following multiplication formula follows directly from the definition of conditional probability.
+Multiplication formula: let $A, B$ be two events with $P(A) > 0$; then
+
+$$
+P(AB) = P(B|A)P(A)
+$$
+
+Generalization to three events:
+
+$$
+P(ABC)=P(C|AB)P(B|A)P(A)
+$$
+
+In general, it can be proved by induction that if $P(A_1A_2...A_n)>0$, then
+
+$$
+P(A_1A_2...A_n)=P(A_n|A_1A_2...A_{n-1})P(A_{n-1}|A_1A_2...A_{n-2})...P(A_2 |A_1)P(A_1)
+=P(A_1)\prod_{i=2}^{n}P(A_i|A_1A_2...A_{i-1})
+$$
+
+Any joint probability distribution over multiple random variables can be decomposed into a product of conditional probabilities, each over a single variable.
+
+## 1.17 Independence and conditional independence
+**Independence**
+Two random variables $x$ and $y$ are independent if their joint probability distribution can be expressed as a product of two factors, one containing only $x$ and the other containing only $y$.
+Conditioning sometimes makes dependent events independent, and sometimes makes independent events lose their independence.
+Example: if $P(XY)=P(X)P(Y)$, event $X$ is independent of event $Y$. Yet once $Z$ is given, it may happen that
+
+$$
+P(X,Y|Z) \not = P(X|Z)P(Y|Z)
+$$
+
+When events are independent, the joint probability equals the product of the individual probabilities. This is a very convenient mathematical property, but unfortunately unconditional independence is rare, because in most cases events influence each other.
+
+**Conditional independence**
+Given $Z$, $X$ and $Y$ are conditionally independent if and only if
+
+$$
+X\bot Y|Z \iff P(X,Y|Z) = P(X|Z)P(Y|Z)
+$$
+
+That is, $X$ and $Y$ are related only through $Z$, not directly.
+
+>**Example**: define the following events:
+>$X$: it will rain tomorrow;
+>$Y$: the ground is wet today;
+>$Z$: it rains today.
+>Whether event $Z$ holds affects both $X$ and $Y$. However, given that $Z$ holds, today's ground condition has no effect on whether it will rain tomorrow.
+
+## 1.18 Summary of Expectation, Variance, Covariance, Correlation Coefficient
+**Expectation**
+In probability theory and statistics, the mathematical expectation (or mean, also simply called the expectation) is the sum, over all possible outcomes of a trial, of each outcome multiplied by its probability. It reflects the average value of a random variable.
+
+- Linearity: $E(ax+by+c) = aE(x)+bE(y)+c$
+- General form: $E(\sum_{i=1}^{n}{a_ix_i}+c) = \sum_{i=1}^{n}{a_iE(x_i)}+c$
+- Expectation of a function: let $f(x)$ be a function of $x$; then the expectation of $f(x)$ is
+ - Discrete case: $E(f(x))=\sum_{k=1}^{n}{f(x_k)P(x_k)}$
+ - Continuous case: $E(f(x))=\int_{-\infty}^{+\infty}{f(x)p(x)dx}$
+
+> Note:
+>
+> - In general, the expectation of a function is not equal to the function of the expectation, i.e., $E(f(x)) \neq f(E(x))$.
+> - In general, the expectation of a product is not equal to the product of the expectations.
+> - If $X$ and $Y$ are independent, then $E(xy)=E(x)E(y)$.
+
+**Variance**
+
+Variance in probability theory measures how far a random variable deviates from its mathematical expectation (i.e., its mean). Variance is a special kind of expectation, defined as:
+
+$$
+Var(x) = E((x-E(x))^2)
+$$
+
+> Properties of variance:
+>
+> 1) $Var(x) = E(x^2) -E(x)^2$
+> 2) The variance of a constant is 0;
+> 3) Variance is not linear;
+> 4) If $X$ and $Y$ are independent, $Var(ax+by)=a^2Var(x)+b^2Var(y)$
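+
+The first property follows from the definition of variance and the linearity of expectation (a short derivation added for illustration; note that $E(x)$ is a constant):
+
+$$
+\begin{align*}
+Var(x) &= E((x-E(x))^2) \\
+&= E(x^2-2xE(x)+E(x)^2) \\
+&= E(x^2)-2E(x)E(x)+E(x)^2 \\
+&= E(x^2)-E(x)^2
+\end{align*}
+$$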
+
+**Covariance**
+Covariance measures the strength of the linear correlation between two variables together with the scale of those variables. The covariance of two random variables is defined as:
+
+$$
+Cov(x,y)=E((x-E(x))(y-E(y)))
+$$
+
+Variance is a special case of covariance: when $X=Y$, $Cov(x,y)=Var(x)=Var(y)$.
+
+> Properties of covariance:
+>
+> 1) Independent variables have covariance 0.
+> 2) Covariance formula:
+
+$$
+Cov(\sum_{i=1}^{m}{a_ix_i}, \sum_{j=1}^{m}{b_jy_j}) = \sum_{i=1}^{m} \sum_{j=1}^{m}{a_ib_jCov(x_i,y_j)}
+$$
+
+>
+> 3) Special case:
+
+$$
+Cov(a+bx, c+dy) = bdCov(x, y)
+$$
+
+**Correlation coefficient**
+The correlation coefficient measures the degree of linear correlation between variables. The correlation coefficient of two random variables is defined as:
+
+$$
+Corr(x,y) = \frac{Cov(x,y)}{\sqrt{Var(x)Var(y)}}
+$$
+
+> Properties of the correlation coefficient:
+> 1) Boundedness. The correlation coefficient takes values in $[-1,1]$ and can be regarded as a dimensionless covariance.
+> 2) The closer the value is to 1, the stronger the positive (linear) correlation between the two variables; the closer to -1, the stronger the negative correlation; when it is 0, the two variables have no linear correlation.
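+
+A correlation coefficient of 0 only rules out a linear relationship, not dependence in general. As an illustration (the variables below are constructed here and are not from the original text), let $x$ take the values $-1, 0, 1$ with probability $\frac{1}{3}$ each, and let $y=x^2$:
+
+$$
+\begin{align*}
+E(x) &= 0, \quad E(y)=E(x^2)=\tfrac{2}{3}, \quad E(xy)=E(x^3)=0 \\
+Cov(x,y) &= E(xy)-E(x)E(y) = 0-0\cdot\tfrac{2}{3} = 0 \\
+Corr(x,y) &= 0
+\end{align*}
+$$
+
+Here $y$ is completely determined by $x$, yet the correlation coefficient is 0 because the relationship is not linear.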
diff --git a/English version/ch01_MathematicalBasis/img/ch1/conditional_probability.jpg b/English version/ch01_MathematicalBasis/img/ch1/conditional_probability.jpg
new file mode 100644
index 00000000..549310d0
Binary files /dev/null and b/English version/ch01_MathematicalBasis/img/ch1/conditional_probability.jpg differ
diff --git a/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_1.png b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_1.png
new file mode 100644
index 00000000..308c16de
Binary files /dev/null and b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_1.png differ
diff --git a/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_2.png b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_2.png
new file mode 100644
index 00000000..19515432
Binary files /dev/null and b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_2.png differ
diff --git a/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_3.png b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_3.png
new file mode 100644
index 00000000..4303cd9d
Binary files /dev/null and b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_3.png differ
diff --git a/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_4.png b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_4.png
new file mode 100644
index 00000000..2533f214
Binary files /dev/null and b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_4.png differ
diff --git a/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_5.png b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_5.png
new file mode 100644
index 00000000..5c5e6544
Binary files /dev/null and b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_5.png differ
diff --git a/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_6.png b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_6.png
new file mode 100644
index 00000000..8946ebc5
Binary files /dev/null and b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_6.png differ
diff --git a/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_7.png b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_7.png
new file mode 100644
index 00000000..7b637f0b
Binary files /dev/null and b/English version/ch01_MathematicalBasis/img/ch1/prob_distribution_7.png differ
diff --git a/English version/ch01_MathematicalBasis/readme.md b/English version/ch01_MathematicalBasis/readme.md
new file mode 100644
index 00000000..3509a055
--- /dev/null
+++ b/English version/ch01_MathematicalBasis/readme.md
@@ -0,0 +1,14 @@
+###########################################################
+
+### Deep Learning 500 Questions - Chapter 1 Mathematical Foundation
+
+**Responsible person (in no particular order):**
+Harbin Institute of Technology doctoral student - Yuan Di
+Qiao Chenglei-Tongji University
+
+**Contributors (in no particular order):**
+Content contributors can add information
+
+Liu Yanchao - Southeast University
+Liu Yuande-Shanghai University of Technology (Content Revision)
+###########################################################
diff --git a/English version/ch02_MachineLearningFoundation/Chapter 2_TheBasisOfMachineLearning.md b/English version/ch02_MachineLearningFoundation/Chapter 2_TheBasisOfMachineLearning.md
new file mode 100644
index 00000000..370cb8c1
--- /dev/null
+++ b/English version/ch02_MachineLearningFoundation/Chapter 2_TheBasisOfMachineLearning.md
@@ -0,0 +1,2205 @@
+[TOC]
+
+
+
+# Chapter 2 Fundamentals of Machine Learning
+
+## 2.1 Understanding the essence of machine learning
+
+Machine Learning (ML), as the name suggests, means letting machines learn. Here, the machine is the computer, the physical carrier of the algorithm; you can also think of each algorithm as a machine with inputs and outputs. So what should the computer learn? Given a task and a way to measure performance on it, we design an algorithm that extracts the regularities contained in the data. This is called machine learning. If the data fed to the machine is labeled, it is supervised learning; if the data is unlabeled, it is unsupervised learning.
+
+## 2.2 Illustrations of common algorithms
+
+|Regression Algorithm|Clustering Algorithm|Regularization Method|
+|:-:|:-:|:-:|
+||||
+
+| Decision Tree Learning | Bayesian Methods | Kernel-Based Algorithms |
+|:-:|:-:|:-:|
+||||
+
+|Clustering Algorithm|Association Rule Learning|Artificial Neural Network|
+|:-:|:-:|:-:|
+||||
+
+|Deep Learning|Dimensionality Reduction Algorithm|Ensemble Algorithm|
+|:-:|:-:|:-:|
+||||
+
+## 2.3 Supervised learning, unsupervised learning, semi-supervised learning, weakly supervised learning?
+There are different ways to model a problem, depending on the type of data. According to the learning method and the input data, machine learning is mainly divided into the following four categories.
+
+**Supervised learning**:
+1. Supervised learning uses examples with known correct answers to train a model. Given data and its one-to-one corresponding labels, a prediction model is trained that maps input data to labels.
+2. Common application scenarios include classification and regression.
+3. Common supervised machine learning algorithms include Support Vector Machine (SVM), Naive Bayes, Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, AdaBoost, and Linear Discriminant Analysis (LDA). Deep learning is also mostly applied in the form of supervised learning.
+
+**Unsupervised learning**:
+
+1. In unsupervised learning, the data is not labeled; it applies when you have a data set but no labels. The learning model tries to infer some of the internal structure of the data.
+2. Common application scenarios include learning of association rules and clustering.
+3. Common algorithms include the Apriori algorithm and the k-Means algorithm.
+
+**Semi-supervised learning**:
+
+1. In this learning mode, part of the input data is labeled and part is not. Such a model can be used for prediction.
+2. Application scenarios include classification and regression. The algorithms are mostly extensions of commonly used supervised learning algorithms: they first model the labeled data and then, on that basis, predict the unlabeled data.
+3. Common algorithms include Graph Inference and Laplacian SVM.
+
+**Weakly supervised learning**:
+
+1. Weakly supervised learning can be thought of as learning from data whose label sets may be empty, contain a single element, or contain multiple elements (no label, one label, or multiple labels).
+2. The labels of the data set are unreliable. Unreliable here can mean incorrect labels, multiple labels, insufficient labels, partial labels, and so on.
+3. It is the process of training an algorithm on known data and its one-to-one weak labels so that input data is mapped to a stronger set of labels. The strength of a label refers to the amount of information it contains. For example, a classification label is weak relative to a segmentation label.
+4. For example, given a picture known to contain a balloon, you need to find the position of the balloon in the picture and the boundary between the balloon and the background. This is a problem of learning strong labels from known weak labels.
+
+In enterprise data applications, the most commonly used models are supervised and unsupervised learning. In image recognition, semi-supervised learning is a hot topic because there is a large amount of unlabeled data and only a small amount of labeled data.
+
+## 2.4 What are the steps of supervised learning?
+Supervised learning uses examples with known correct answers to train the network, with a clear label or result for each piece of training data. Imagine training a network to recognize photos of balloons in a photo library (which contains photos of balloons). Here are the steps we would take in this hypothetical scenario.
+
+**Step 1: Data set creation and classification**
+First, browse through your photos (the dataset), identify all the photos that contain balloons, and label them. Then split the photos into a training set and a validation set. The goal is to find a function, represented by a deep network, that takes any photo as input and outputs 1 when the photo contains a balloon and 0 otherwise.
+
+**Step 2: Data Augmentation**
+Collected and labeled raw data generally does not cover the target under every kind of disturbance. Because data quality is critical to the predictive power of a machine learning model, data augmentation is usually applied. For image data, augmentation typically includes rotation, translation, color transformation, cropping, affine transformation, and the like.
+
+**Step 3: Feature Engineering**
+In general, feature engineering includes feature extraction and feature selection. Common hand-crafted features include the Scale-Invariant Feature Transform (SIFT) and the Histogram of Oriented Gradients (HOG). Because hand-crafted features are heuristic and each is designed from a different starting point, they may conflict when combined. Feature selection is needed to make the combined features perform well and to keep the original data discriminative in the feature space. Since the success of deep learning, many people no longer pay attention to feature engineering itself, because the most commonly used Convolutional Neural Networks (CNNs) are themselves engines for feature extraction and selection. The different network structures, regularization methods, and normalization methods proposed by researchers are in effect feature engineering in the context of deep learning.
+
+**Step 4: Building predictive models and losses**
+Once the raw data is mapped into the feature space, we have a reasonable input. The next step is to build a suitable predictive model that produces an output for each input. To make the model's output agree with the label, we construct a loss function between the model prediction and the label. Common loss functions include cross entropy and mean squared error. Learning is the process of iterating with an optimization method so that the model moves step by step from its initial state to one with predictive power.
+
+**Step 5: Training**
+Select a suitable model and hyperparameters for initialization, such as the kernel function in a support vector machine or the penalty weight of the error term. After initialization, feed the prepared feature data into the model and use a suitable optimization method to keep reducing the gap between the output and the label. When the iteration reaches the stopping condition, the trained model is obtained. The most common optimization method is gradient descent and its variants, which requires the objective function to be differentiable with respect to the model parameters.
+
+**Step 6: Verification and Model Selection**
+After training on the training set, you need to test the model. Use the validation set to check whether the model can accurately pick out the photos with balloons.
+In this process, steps 2 and 3 are usually repeated while adjusting various things related to the model (hyperparameters), such as how many nodes there are, how many layers there are, which activation and loss functions are used, and how to train the weights effectively during backpropagation.
+
+**Step 7: Testing and Application**
+When you have an accurate model, you can deploy it in your application. You can publish the prediction function as an API (Application Programming Interface), and call the API from software to run inference and return the corresponding results.
+
+## 2.5 Multi-instance learning?
+Multiple Instance Learning (MIL): given bags that each contain multiple instances, together with the label of each bag, train an algorithm that maps a bag to its label. In some problems, the label of each instance inside the bag is also produced.
+For example, suppose a video consists of many frames, say 10,000, and we need to determine whether the video contains an object such as a balloon. Labeling every frame for the presence of a balloon is too time-consuming, so usually we only say whether the video as a whole contains a balloon, which gives multiple-instance data. Not every one of the 10,000 frames has a balloon: as long as one frame contains a balloon, we consider the bag (the video) to contain a balloon, and only when no frame contains a balloon does the bag contain none. Learning from such data whether a video (10,000 frames) contains a balloon is a multi-instance learning problem.
+
+## 2.6 What is a neural network?
+A neural network is a network that connects multiple neurons in accordance with certain rules. Different neural networks have different connection rules.
+For example, the rules of a Fully Connected (FC) neural network include:
+
+1. There are three kinds of layers: the input layer, the output layer, and the hidden layer.
+2. There is no connection between neurons in the same layer.
+3. Fully connected means that each neuron in layer N is connected to every neuron in layer N-1, and the outputs of the layer N-1 neurons are the inputs of the layer N neurons.
+4. Each connection has a weight.
+
+ **Neural Network Architecture**
+ The picture below shows a neural network, which consists of many layers. The input layer receives information, such as a picture of a cat. The output layer gives the computer's judgment on the input, namely whether or not it is a cat. The hidden layers pass on and process the input information.
+ 
+
+## 2.7 Understanding Local Optimization and Global Optimization
+
+A light-hearted look at local and global optimality
+
+> One day Plato asked his teacher Socrates what love is. Socrates told him to walk through a wheat field once and bring back the biggest ear of wheat, without turning back and picking only once. Plato came back empty-handed. His reason was that whenever he saw a good ear, he did not know whether it was the best one, so he passed it by in the hope of something better; by the time he reached the end, he found that what was left was not as good as what he had seen earlier, so he gave up. Socrates told him: "This is love." The story teaches us that, because of the uncertainty in life, the global optimal solution is hard to find, or may not exist at all. We should set some constraints and then find the optimal solution within that range, that is, the local optimal solution. Coming back with something is better than coming back empty-handed, even if what we gain is just an interesting experience.
+> Another day Plato asked what marriage is. Socrates told him to walk through the woods once and choose the best tree to make a Christmas tree, again without turning back and choosing only once. This time he came back tired, dragging a sapling that looked straight and green but a little sparse. His reason was that, having learned the lesson of last time, when he finally saw one that looked good and realized his time and strength were running out, he took it back whether or not it was the best. Socrates told him: "This is marriage."
+
+Optimization problems are generally divided into local optimization and global optimization.
+
+1. Local optimization looks for the minimum within a limited region of the function value space, while global optimization looks for the minimum over the entire function value space.
+2. A local minimum point of a function is a point whose function value is less than or equal to that of all nearby points, although it may be larger than the value at points farther away.
+3. A global minimum point is a feasible point whose function value is less than or equal to that of every other feasible point.
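+
+The distinction above can be seen in a minimal sketch. The function `f`, the learning rate, and the starting points below are hypothetical choices made only for illustration; plain gradient descent is used because it follows the local slope and therefore stops at whichever minimum lies in the basin where it starts.
+
+```python
+def f(x):
+    # A quartic with two minima: a local one near x = 1.13
+    # and the global one near x = -1.30.
+    return x**4 - 3 * x**2 + x
+
+
+def grad_f(x):
+    # Derivative of f.
+    return 4 * x**3 - 6 * x + 1
+
+
+def gradient_descent(x, lr=0.01, steps=2000):
+    # Plain gradient descent: it only sees the local slope,
+    # so it settles in whichever basin it starts in.
+    for _ in range(steps):
+        x -= lr * grad_f(x)
+    return x
+
+
+x_from_right = gradient_descent(2.0)   # ends in the local minimum
+x_from_left = gradient_descent(-2.0)   # ends in the global minimum
+print(x_from_right, f(x_from_right))
+print(x_from_left, f(x_from_left))
+```
+
+Both runs stop at a point where the gradient vanishes, but only the run started from the left reaches the smaller function value.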
+
+## 2.8 Classification algorithm
+
+Classification and regression are different ways of modeling the real world. A classification model treats the output as discrete; for example, living creatures in nature are divided into distinct categories. A regression model treats the output as continuous; for example, the change in a person's height is a continuous process rather than a discrete one.
+
+Therefore, whether to use a classification model or a regression model in practice depends on your analysis and understanding of the task (the real world).
+
+### 2.8.1 What are the advantages and disadvantages of common classification algorithms?
+
+| Algorithm | Advantages | Disadvantages |
+|:-|:-|:-|
+|Bayesian classification | 1) Few parameters need to be estimated, and it is insensitive to missing data. 2) It has a solid mathematical foundation and stable classification efficiency. |1) It assumes that the attributes are independent of each other, which is often not true (I like tomatoes and I like eggs, but I do not like scrambled eggs with tomatoes). 2) The prior probability must be known. 3) The classification decision has an inherent error rate. |
+|Decision tree |1) No domain knowledge or parameter assumptions are required. 2) Suitable for high-dimensional data. 3) Simple and easy to understand. 4) Can process a large amount of data in a short time and produce feasible, effective results. 5) Can handle both numerical and categorical attributes. |1) When the number of samples differs between categories, information gain is biased toward features with more values. 2) Prone to overfitting. 3) Ignores correlations between attributes. 4) Does not support online learning. |
+|SVM (support vector machine) | 1) Works for machine learning with small samples. 2) Improves generalization performance. 3) Can solve high-dimensional, nonlinear problems, and is still popular for very high-dimensional text classification. 4) Avoids the structure selection and local minimum problems of neural networks. |1) Sensitive to missing data. 2) High memory consumption and hard to interpret. 3) Running and tuning it is somewhat tedious. |
+|KNN (K-nearest neighbors)|1) Simple idea and mature theory; can be used for classification or regression. 2) Can be used for nonlinear classification. 3) Training time complexity is O(n). 4) High accuracy, no assumptions about the data, and insensitive to outliers. |1) The amount of computation is large. 2) Prone to misclassification when the classes are imbalanced. 3) Requires a lot of memory. 4) The output is not very interpretable. |
+|Logistic regression|1) Fast. 2) Simple and easy to understand; the weight of each feature can be read directly. 3) The model can easily be updated to absorb new data. 4) Provides a probabilistic framework, so the classification threshold can be adjusted dynamically. |Feature processing is complicated: normalization and considerable feature engineering are needed. |
+|Neural network|1) High classification accuracy. 2) Parallel processing capability. 3) Distributed storage and learning capabilities. 4) Strong robustness and not easily affected by noise. |1) Requires a large number of parameters (network topology, weights, thresholds). 2) The results are difficult to interpret. 3) Training time is long. |
+|AdaBoost|1) AdaBoost is a classifier with very high precision. 2) Various methods can be used to construct the sub-classifiers; the AdaBoost algorithm provides the framework. 3) With simple classifiers, the computed results are understandable, and weak classifiers are extremely simple to construct. 4) Simple, with no need for feature screening. 5) Not prone to overfitting. |Sensitive to outliers. |
+
+
+
+### 2.8.2 How to evaluate the classification algorithm?
+- **Several common terms**
+ Here are a few common model evaluation terms. Suppose our classification target has only two classes, positive and negative:
+ 1) True positives (TP): the number of instances correctly classified as positive, that is, instances that are actually positive and are classified as positive by the classifier;
+ 2) False positives (FP): the number of instances incorrectly classified as positive, that is, instances that are actually negative but are classified as positive by the classifier;
+ 3) False negatives (FN): the number of instances incorrectly classified as negative, that is, instances that are actually positive but are classified as negative by the classifier;
+ 4) True negatives (TN): the number of instances correctly classified as negative, that is, instances that are actually negative and are classified as negative by the classifier.
+
+
+
+The figure above is the confusion matrix of these four terms, with the following notes:
+1) P = TP + FN is the number of samples that are actually positive, and N = FP + TN is the number of samples that are actually negative.
+2) True and False describe whether the classifier's judgment is correct.
+3) Positive and Negative are the classifier's outputs. Let a positive example be 1 and a negative example be -1, i.e. positive = 1 and negative = -1, and use 1 for True and -1 for False. Then the actual class label = TF \* PN, where TF is True or False and PN is Positive or Negative.
+4) For example, the actual class of a true positive (TP) is 1 \* 1 = 1, a positive example; the actual class of a false positive (FP) is (-1) \* 1 = -1, a negative example; the actual class of a false negative (FN) is (-1) \* (-1) = 1, a positive example; and the actual class of a true negative (TN) is 1 \* (-1) = -1, a negative example.
+
+
+
+- **Evaluation Indicators**
+ 1) Accuracy
+ Accuracy is the most common evaluation metric: accuracy = (TP+TN)/(P+N), the proportion of correctly classified samples among all samples. Generally speaking, the higher the accuracy, the better the classifier.
+ 2) Error rate
+ The error rate is the opposite of accuracy and describes the proportion of samples the classifier gets wrong: error rate = (FP+FN)/(P+N). For any instance, being classified correctly and being classified incorrectly are mutually exclusive events, so accuracy = 1 - error rate.
+ 3) Sensitivity
+ Sensitivity = TP/P is the proportion of positive cases that are classified correctly; it measures the classifier's ability to identify positive examples.
+ 4) Specificity
+ Specificity = TN/N is the proportion of negative cases that are classified correctly; it measures the classifier's ability to identify negative examples.
+ 5) Precision
+ Precision = TP/(TP+FP) is a measure of exactness: the proportion of instances classified as positive that are actually positive.
+ 6) Recall
+ Recall is a measure of coverage: the proportion of actual positive examples that are classified as positive. Recall = TP/(TP+FN) = TP/P = sensitivity, so recall is the same as sensitivity.
+ 7) Other evaluation indicators
+ Calculation speed: the time required for classifier training and prediction;
+ Robustness: the ability to handle missing values and outliers;
+ Scalability: the ability to handle large data sets;
+ Interpretability: how understandable the classifier's prediction criteria are. For example, the rules generated by a decision tree are easy to understand, whereas the parameters of a neural network are not, so we have to treat it as a black box.
+ 8) Precision and recall reflect two aspects of a classifier's performance. Considering both together gives a new metric, the F1-score, also known as the comprehensive classification rate: $F1=\frac{2 \times precision \times recall}{precision + recall}$.
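+
+The indicators above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch, not a reference implementation: the helper names `confusion_counts` and `evaluate` and the toy labels are hypothetical, labels are assumed to be encoded as 1 (positive) and 0 (negative), and no guard is included for the case where a denominator is zero.
+
+```python
+def confusion_counts(y_true, y_pred):
+    # Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative).
+    pairs = list(zip(y_true, y_pred))
+    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
+    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
+    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
+    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
+    return tp, fp, fn, tn
+
+
+def evaluate(tp, fp, fn, tn):
+    p, n = tp + fn, fp + tn          # actual positives / actual negatives
+    precision = tp / (tp + fp)
+    recall = tp / p                  # identical to sensitivity
+    return {
+        "accuracy": (tp + tn) / (p + n),
+        "error_rate": (fp + fn) / (p + n),
+        "specificity": tn / n,
+        "precision": precision,
+        "recall": recall,
+        "f1": 2 * precision * recall / (precision + recall),
+    }
+
+
+# Hypothetical ground truth and predictions for ten samples.
+y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
+y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
+counts = confusion_counts(y_true, y_pred)
+scores = evaluate(*counts)
+print(counts, scores)
+```
+
+On this toy data there are 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, so accuracy is 0.8 while precision, recall, and F1 are all 0.75.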
+
+To combine the results over multiple categories and evaluate the overall performance of the system, micro-averaged F1 and macro-averaged F1 are often used.
+
+(1) The macro-averaged F1 and the micro-averaged F1 are global F1 indices obtained with two different averaging methods.
+
+(2) Macro-averaged F1 is computed by first calculating the F1 value of each category separately and then taking the arithmetic mean of these F1 values as the global index.
+
+(3) Micro-averaged F1 is computed by first accumulating the values of a, b, c, and d (the confusion-matrix counts) over all categories and then computing F1 from the totals.
+
+(4) It is easy to see from the two calculations that macro-averaged F1 treats each category equally, so its value is mainly affected by rare categories, while micro-averaged F1 treats each document in the document set equally, so its value is dominated by common categories.
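+
+The two averaging methods can be compared on a small example. The sketch below uses hypothetical per-category counts (one common category and one rare category) and a hypothetical helper `f1_from_counts`; it relies on the identity F1 = 2TP / (2TP + FP + FN), which is equivalent to the precision and recall form.
+
+```python
+def f1_from_counts(tp, fp, fn):
+    # F1 = 2*TP / (2*TP + FP + FN), equivalent to the precision/recall form.
+    return 2 * tp / (2 * tp + fp + fn)
+
+
+# Hypothetical per-category counts: (TP, FP, FN)
+per_class = {
+    "common": (90, 10, 10),
+    "rare": (1, 4, 4),
+}
+
+# Macro: compute F1 per category, then take the arithmetic mean.
+macro_f1 = sum(f1_from_counts(*c) for c in per_class.values()) / len(per_class)
+
+# Micro: sum the counts over all categories, then compute a single F1.
+total_tp = sum(c[0] for c in per_class.values())
+total_fp = sum(c[1] for c in per_class.values())
+total_fn = sum(c[2] for c in per_class.values())
+micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)
+
+print(macro_f1, micro_f1)
+```
+
+Here the per-category F1 values are 0.9 and 0.2, so the macro average is 0.55, pulled down by the rare category, while the micro average is about 0.87, dominated by the common category.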
+
+
+
+- **ROC curve and PR curve**
+
+The ROC curve (Receiver Operating Characteristic curve) is a performance evaluation curve with sensitivity (true positive rate) as the ordinate and 1 - specificity (false positive rate) as the abscissa. The ROC curves of different models on the same data set can be plotted in the same Cartesian coordinate system. The closer a ROC curve is to the upper left corner, the more reliable the corresponding model. A model can also be evaluated by the area under the ROC curve (Area Under Curve, AUC): the larger the AUC, the more reliable the model.
+
+
+
+Figure 2.7.3 ROC curve
+
+The PR curve (Precision-Recall curve) describes the relationship between precision and recall, with recall as the abscissa and precision as the ordinate. The area under this curve is the Average Precision (AP), an evaluation metric commonly used in object detection. The higher the AP, the better the model performance.
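+
+As an illustration of how a ROC curve and its AUC come about, the sketch below sweeps a decision threshold over classifier scores and integrates the resulting curve with the trapezoidal rule. The helper names `roc_points` and `auc` and the toy scores are hypothetical, and the sketch assumes that no two samples share the same score.
+
+```python
+def roc_points(y_true, scores):
+    # Sort samples by descending score and lower the threshold one sample
+    # at a time, recording (false positive rate, true positive rate).
+    ranked = sorted(zip(scores, y_true), reverse=True)
+    p = sum(y_true)
+    n = len(y_true) - p
+    tp = fp = 0
+    points = [(0.0, 0.0)]
+    for _, label in ranked:
+        if label == 1:
+            tp += 1
+        else:
+            fp += 1
+        points.append((fp / n, tp / p))
+    return points
+
+
+def auc(points):
+    # Trapezoidal area under the piecewise-linear curve.
+    return sum((x1 - x0) * (y0 + y1) / 2
+               for (x0, y0), (x1, y1) in zip(points, points[1:]))
+
+
+# Hypothetical labels (1 = positive) and classifier scores.
+y_true = [1, 1, 0, 1, 0, 0]
+scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
+points = roc_points(y_true, scores)
+print(points)
+print(auc(points))
+```
+
+For this toy data the AUC is 8/9: of the nine positive/negative pairs, eight are ranked in the right order by the scores.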
+
+### 2.8.3 Is the correct rate good for evaluating classification algorithms?
+
+Different algorithms have different characteristics and perform differently on different data sets, so the algorithm should be chosen according to the specific task. How to evaluate a classification algorithm likewise requires a task-specific analysis. Decision trees are mainly evaluated by accuracy, but can other algorithms be evaluated well by accuracy alone?
+The answer is no.
+Accuracy is indeed a very intuitive and useful metric, but sometimes high accuracy does not mean the algorithm is good. For example, consider earthquake prediction for a certain area, where the class attribute is 0 (no earthquake occurs) or 1 (an earthquake occurs). We all know that the probability of no earthquake is very high. A classifier that, without any thought, assigns every test sample to class 0 can reach an accuracy of 99%, but when an earthquake really comes it goes undetected, and the consequences are enormous. Obviously, this 99%-accurate classifier is not what we want. The main cause is that the data distribution is imbalanced: there are too few samples of class 1, so misclassifying all of class 1 still yields high accuracy while ignoring the very case the researchers care about most.
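+
+To put numbers on this, here is a worked example with hypothetical counts, chosen only to reproduce the 99% figure: 1,000 test samples, of which 10 are earthquakes. A classifier that always outputs 0 gives TP = 0, FN = 10, FP = 0, and TN = 990, so
+
+$$
+accuracy = \frac{TP+TN}{P+N} = \frac{0+990}{1000} = 99\%, \qquad recall = \frac{TP}{TP+FN} = \frac{0}{10} = 0\%
+$$
+
+Accuracy looks excellent, while recall shows that not a single earthquake is detected.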
+
+### 2.8.4 What kind of classifier is the best?
+For a given task, no specific classifier can satisfy or improve all of the metrics described above at once.
+If a classifier classified all instances correctly, every metric would already be optimal, but such classifiers usually do not exist. Take the earthquake prediction mentioned earlier: no classifier can predict earthquakes with 100% certainty, but in practice a certain number of false alarms can be tolerated. Suppose that in 1000 predictions the classifier predicts an earthquake 5 times, of which 1 is a real earthquake and the other 4 are false alarms. Compared with always predicting no earthquake, accuracy drops from 999/1000 = 99.9% to 996/1000 = 99.6%, while recall rises from 0/1 = 0% to 1/1 = 100%. In other words, although the classifier was wrong 4 times, it predicted the real earthquake before it happened and did not miss it. Such a classifier is actually more meaningful and is exactly what we want. In this case, the classifier's recall should be as high as possible, subject to maintaining a certain accuracy.
+
+## 2.9 Logistic Regression
+
+### 2.9.1 Regression division
+
+In the generalized linear model family, regression is divided according to the distribution of the dependent variable:
+
+1. If it is continuous, it is multiple linear regression.
+2. If it is a binomial distribution, it is a logistic regression.
+3. If it is a Poisson distribution, it is Poisson regression.
+4. If it is a negative binomial distribution, it is a negative binomial regression.
+5. The dependent variable of logistic regression can be binary or multi-class, but the binary case is more common and easier to interpret, so binary logistic regression is the most common in practice.
+
+### 2.9.2 Logistic regression applicability
+
+1. Probability prediction. When used for probability prediction, the results are comparable. For example, the model can predict the probability that a disease or a certain situation occurs under different values of the independent variables.
+2. Classification. This is similar to prediction and is also based on the model: it estimates how likely it is that a person has a certain disease or belongs to a certain situation. For classification it is only necessary to set a threshold; samples whose probability is above the threshold go to one class and those below it to the other.
+3. Finding risk factors, for example the risk factors for a disease.
+4. Linear problems only. Logistic regression can be used only when the relationship between the target and the features is linear. Two points deserve attention: first, when the model is known to be nonlinear, logistic regression is not applicable; second, when using logistic regression, select features that are linearly related to the target.
+5. The conditional independence assumption need not hold between features, but the contribution of each feature is computed independently.
+
+### 2.9.3 What is the difference between logistic regression and naive Bayes?
+1. Logistic regression is a discriminative model and Naive Bayes is a generative model, so all the differences between generative and discriminative models apply.
+2. Naive Bayes belongs to the Bayesian school and logistic regression to maximum likelihood, two different probabilistic philosophies.
+3. Naive Bayes requires the conditional independence assumption.
+4. Logistic regression requires the feature parameters to be linearly related.
+
+### 2.9.4 What is the difference between linear regression and logistic regression?
+
+(Contributor: Huang Qinjian - South China University of Technology)
+
+The output of a linear regression sample is a continuous value, $y\in (-\infty , +\infty )$, whereas in logistic regression $y\in \{0, 1\}$, that is, it can only take the values 0 and 1.
+
+There are also fundamental differences in the fit function:
+
+Linear regression: $f(x)=\theta ^{T}x=\theta _{1}x _{1}+\theta _{2}x _{2}+...+\theta _{n }x _{n}$
+
+Logistic regression: $f(x)=P(y=1|x;\theta )=g(\theta ^{T}x)$, where $g(z)=\frac{1}{1+e ^{-z}}$
+
+
+It can be seen that linear regression fits the output variable y directly with f(x), while logistic regression fits the probability that a sample belongs to class 1.
+
+So why fit the probability of class 1, and why is this a valid thing to fit?
+
+$\theta ^{T}x=0$ is the decision boundary between class 1 and class 0:
+
+When $\theta ^{T}x>0$, then $y>0.5$; as $\theta ^{T}x\rightarrow +\infty $, $y \rightarrow 1$, i.e. y is class 1;
+
+
+When $\theta ^{T}x<0$, then $y<0.5$; as $\theta ^{T}x\rightarrow -\infty $, $y \rightarrow 0$, i.e. y is class 0;
+
+This is where the difference shows. In linear regression, $\theta ^{T}x$ is the fitted function for the predicted value; in logistic regression, $\theta ^{T}x=0$ is the decision boundary.
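+
+The contrast can be made concrete with a minimal sketch. The parameter vector `theta` and the sample inputs are hypothetical, and the function names are introduced here only for illustration: the same $\theta ^{T}x$ is used as the prediction in linear regression and passed through $g(z)$ to obtain a class-1 probability in logistic regression.
+
+```python
+import math
+
+
+def linear_output(theta, x):
+    # theta^T x: in linear regression this is the prediction itself.
+    return sum(t * xi for t, xi in zip(theta, x))
+
+
+def sigmoid(z):
+    # g(z) = 1 / (1 + e^(-z))
+    return 1.0 / (1.0 + math.exp(-z))
+
+
+def logistic_probability(theta, x):
+    # g(theta^T x): the estimated probability that x belongs to class 1.
+    return sigmoid(linear_output(theta, x))
+
+
+def classify(theta, x, threshold=0.5):
+    # With threshold 0.5, probability > 0.5 is equivalent to theta^T x > 0.
+    return 1 if logistic_probability(theta, x) > threshold else 0
+
+
+theta = [2.0, -1.0]  # hypothetical parameters
+for x in ([1.0, 0.5], [0.5, 1.0], [0.0, 2.0]):
+    print(x, linear_output(theta, x), logistic_probability(theta, x), classify(theta, x))
+```
+
+The three inputs give $\theta ^{T}x$ equal to 1.5, 0, and -2: the first lies on the class-1 side of the boundary, the second exactly on the boundary (probability 0.5), and the third on the class-0 side.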
+
+| | Linear Regression | Logistic Regression |
+|:-------------:|:-------------:|:-----:|
+| Purpose | Prediction | Classification |
+| $y^{(i)}$ | Unknown | (0,1)|
+| Function | Fitting function | Predictive function |
+| Parameter Calculation | Least Squares | Maximum Likelihood Estimation |
+
+
+Explain in detail below:
+
+1. What is the relationship between the fitting function and the predictive function? Simply put, the predictive function is the fitting function passed through the logistic function, which makes $y^{(i)} \in (0,1)$;
+2. Can least squares and maximum likelihood estimation be substituted for each other? The answer is of course no. Look at what each relies on: maximum likelihood estimation finds the parameters that make the observed data most probable, so it naturally relies on probability, whereas least squares minimizes the squared error loss.
+
+## 2.10 Cost function
+
+### 2.10.1 Why do you need a cost function?
+
+1. To obtain the parameters of a logistic regression model, a cost function is needed; the parameters are obtained by minimizing the cost function during training.
+2. It serves as the objective function used to find the optimal solution.
+
+### 2.10.2 Principle of cost function
+In regression problems, the optimal solution is found through the cost function, and the squared error cost function is commonly used. Consider the following hypothesis function:
+
+$$
+h(x) = A + Bx
+$$
+
+Suppose the function has two parameters, $A$ and $B$. When the parameters change, the hypothesis function changes with them,
+as shown below:
+
+
+
+To fit the discrete points in the figure, we need to find the best possible $A$ and $B$ so that the line represents all the data as well as possible. Finding this optimum requires a cost function. Take the squared error cost function as an example, and assume the hypothesis is $h(x)=\theta_0x$.
+The main idea of the squared error cost function is to take the difference between each value given by the actual data and the corresponding value on the fitted line, which measures the gap between the fitted line and the actual data. In practice, to reduce the impact of individual extreme data points, a variance-like average of the squared differences is used, multiplied by one-half. This leads to the cost function:
+$$
+J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})^2
+$$
+
+**The optimal solution is the minimum of the cost function**, $\min J(\theta_0, \theta_1)$. With one parameter, the cost function can be visualized as a two-dimensional curve; with two parameters, it can be visualized as a three-dimensional surface. The more parameters, the more complicated the picture.
+With two parameters, the cost function is a three-dimensional surface:
+
+
+
+### 2.10.3 Why is the cost function non-negative?
+The objective function must have a lower bound. During optimization, if the algorithm keeps decreasing the objective function, then by the monotone convergence theorem (a monotone bounded sequence converges) the sequence of objective values converges.
+Any objective function with a lower bound would basically do; a non-negative cost function is simply more convenient.
+
+### 2.10.4 Common cost functions?
+1. **quadratic cost**:
+
+$$
+J = \frac{1}{2n}\sum_x\Vert y(x)-a^L(x)\Vert^2
+$$
+
+Where $J$ represents the cost function, $x$ represents the sample, $y$ represents the actual value, $a$ represents the output value, and $n$ represents the total number of samples. For a single sample, the quadratic cost function is:
+$$
+J = \frac{(y-a)^2}{2}
+$$
+If gradient descent is used to adjust the weight parameters, the gradients with respect to the weight $w$ and the bias $b$ are:
+$$
+\frac{\partial J}{\partial w}=(a-y)\sigma'(z)x\;,\quad
+\frac{\partial J}{\partial b}=(a-y)\sigma'(z)
+$$
+Where $z$ represents the input of the neuron and $\sigma$ represents the activation function. The gradients of the weight $w$ and the bias $b$ are proportional to the gradient of the activation function: the larger the gradient of the activation function, the faster $w$ and $b$ are adjusted and the faster training converges.
+
+*Note*: The activation function commonly used in neural networks is the sigmoid function. The curve of this function is as follows:
+
+
+
+Compare the two output values 0.88 and 0.98 shown in the figure:
+Assume the target is 1.0. The point 0.88 is farther from the target, its gradient is larger, and its weight adjustment is larger. The point 0.98 is closer to the target, its gradient is smaller, and its weight adjustment is smaller. This adjustment scheme is reasonable.
+Now assume the target is 0. The point 0.88 is closer to the target, yet its gradient is larger and its weight adjustment is larger. The point 0.98 is farther from the target, yet its gradient is smaller and its weight adjustment is smaller. This adjustment scheme is unreasonable.
+Cause: with the sigmoid function, the larger the initial cost (error), the slower the training.
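+
+The comparison of the two points can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a single sigmoid neuron and the single-sample quadratic cost $J = (y-a)^2/2$; the function names are hypothetical. It uses the identity $\sigma'(z) = a(1-a)$ for the sigmoid output $a$.
+
+```python
+import numpy as np
+
+def sigmoid_prime_from_output(a):
+    """Derivative of the sigmoid written through its output: sigma'(z) = a * (1 - a)."""
+    return a * (1.0 - a)
+
+def quadratic_grad_b(a, y):
+    """Gradient of J = (y - a)^2 / 2 with respect to the bias: (a - y) * sigma'(z)."""
+    return (a - y) * sigmoid_prime_from_output(a)
+
+outputs = np.array([0.88, 0.98])
+g_target1 = quadratic_grad_b(outputs, 1.0)   # target 1.0: 0.88 is the farther point
+g_target0 = quadratic_grad_b(outputs, 0.0)   # target 0.0: 0.98 is the farther point
+
+print("target 1.0:", np.abs(g_target1))
+print("target 0.0:", np.abs(g_target0))
+```
+
+In both cases the gradient at 0.88 has the larger magnitude, so with target 0 the point that is farther from the target (0.98) receives the smaller update.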
+
+2. **Cross-entropy cost function (cross-entropy)**:
+
+$$
+J = -\frac{1}{n}\sum_x[y\ln a + (1-y)\ln{(1-a)}]
+$$
+
+Where $J$ represents the cost function, $x$ represents the sample, $y$ represents the actual value, $a$ represents the output value, and $n$ represents the total number of samples.
+The gradients with respect to the weight $w$ and the bias $b$ are:
+$$
+\frac{\partial J}{\partial w_j}=\frac{1}{n}\sum_{x}x_j(\sigma{(z)}-y)\;,
+\frac{\partial J}{\partial b}=\frac{1}{n}\sum_{x}(\sigma{(z)}-y)
+$$
+
+The larger the error, the larger the gradient, the faster the weight $w$ and the bias $b$ are adjusted, and the faster the training.
+**The quadratic cost function is suitable when the output neuron is linear, and the cross-entropy cost function is suitable when the output neuron is a sigmoid function.**
+
+3. **Log-likelihood cost**:
+The log-likelihood function is commonly used as the cost function for softmax regression. A common practice in deep learning is to use softmax as the last layer, in which case the usual cost function is the log-likelihood cost.
+Combining the log-likelihood cost with softmax is very similar to combining cross-entropy with the sigmoid function. For binary classification, the log-likelihood cost reduces to the cross-entropy cost.
+In TensorFlow:
+- The cross-entropy function used with sigmoid: `tf.nn.sigmoid_cross_entropy_with_logits()`.
+- The cross-entropy function used with softmax: `tf.nn.softmax_cross_entropy_with_logits()`.
+
+In PyTorch:
+- The cross-entropy function used with sigmoid: `torch.nn.BCEWithLogitsLoss()`.
+- The cross-entropy function used with softmax: `torch.nn.CrossEntropyLoss()`.
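+
+The claim that the log-likelihood cost reduces to cross-entropy for two classes can be checked directly. The sketch below uses plain NumPy rather than the framework functions listed above; the logit value and all function names are hypothetical, and the two-class softmax is assumed to have logits $(0, z)$ so that its class-1 probability equals $\sigma(z)$.
+
+```python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+def softmax(logits):
+    shifted = logits - np.max(logits)      # subtract the max for numerical stability
+    exp = np.exp(shifted)
+    return exp / np.sum(exp)
+
+def binary_cross_entropy(y, a):
+    """Cross-entropy cost for one sample with label y in {0, 1} and output a."""
+    return -(y * np.log(a) + (1 - y) * np.log(1 - a))
+
+def log_likelihood_cost(label, probs):
+    """Negative log-probability assigned to the true class."""
+    return -np.log(probs[label])
+
+z = 0.7                                    # hypothetical logit of the sigmoid neuron
+probs = softmax(np.array([0.0, z]))        # class 0 has logit 0, class 1 has logit z
+for y in (0, 1):
+    print(y, binary_cross_entropy(y, sigmoid(z)), log_likelihood_cost(y, probs))
+```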
+
+### 2.10.5 Why use cross entropy instead of quadratic cost function
+1. **Why not use the quadratic cost function**
+As shown in the previous section, the partial derivatives with respect to the weight $w$ and the bias $b$ are $\frac{\partial J}{\partial w}=(a-y)\sigma'(z)x$ and $\frac{\partial J}{\partial b}=(a-y)\sigma'(z)$. Both are scaled by the derivative of the activation function, and the derivative of the sigmoid function is very small when the output is close to 0 or 1, which causes some instances to learn very slowly at the start of training.
+
+2. **Why use cross entropy**
+The gradients of the cross-entropy function with respect to the weight $w$ and the bias $b$ are:
+
+$$
+\frac{\partial J}{\partial w_j}=\frac{1}{n}\sum_{x}x_j(\sigma{(z)}-y)\;,
+\frac{\partial J}{\partial b}=\frac{1}{n}\sum_{x}(\sigma{(z)}-y)
+$$
+
+The formula shows that the speed of weight learning is governed by the error $\sigma{(z)}-y$: the larger the error, the faster the learning. This avoids the slow learning that the factor $\sigma'{(z)}$ causes in the quadratic cost function.
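+
+The difference in learning speed can be illustrated with a small simulation. This is a sketch under stated assumptions, not a benchmark: a single sigmoid neuron, one training sample with input 1 and target 0, a badly wrong starting point ($w=2$, $b=2$), and the same learning rate for both costs. All of these values and the function name `train_neuron` are hypothetical.
+
+```python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+def train_neuron(cost, w=2.0, b=2.0, x=1.0, y=0.0, lr=0.15, steps=100):
+    """Train one sigmoid neuron on one sample and return its final output."""
+    for _ in range(steps):
+        a = sigmoid(w * x + b)
+        if cost == "quadratic":
+            delta = (a - y) * a * (1.0 - a)   # (a - y) * sigma'(z)
+        else:                                  # "cross_entropy"
+            delta = a - y                      # sigma(z) - y
+        w -= lr * delta * x
+        b -= lr * delta
+    return sigmoid(w * x + b)
+
+out_quadratic = train_neuron("quadratic")
+out_cross_entropy = train_neuron("cross_entropy")
+print(out_quadratic, out_cross_entropy)
+```
+
+After the same number of steps, the neuron trained with cross-entropy is closer to the target 0 than the one trained with the quadratic cost, because its update is not scaled down by $\sigma'(z)$.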
+
+## 2.11 Loss function
+
+### 2.11.1 What is a loss function?
+
+The loss function, also called the error function, measures how well the algorithm performs. It is a non-negative real-valued function that measures the inconsistency between the model's predicted value and the true value, and is usually denoted $L(Y, f(x))$. The smaller the loss function, the better the robustness of the model. The loss function is the core part of the empirical risk function and an important part of the structural risk function.
+
+### 2.11.2 Common loss function
+Machine learning obtains the desired result by optimizing the objective function of the algorithm. In classification and regression problems, a loss function or a cost function is usually used as the objective function.
+The loss function evaluates how much the predicted value differs from the true value. In general, the better the loss function, the better the model performs.
+Loss functions can be divided into empirical risk loss functions and structural risk loss functions. An empirical risk loss function measures the difference between the predicted result and the actual result. A structural risk loss function adds a regularization term to the empirical risk loss function.
+The commonly used loss functions are described below:
+
+1. **0-1 loss function**
+If the predicted value is equal to the target value, the value is 0. If they are not equal, the value is 1.
+
+$$
+L(Y, f(x)) =
+\begin{cases}
+1,& Y\ne f(x)\\
+0, & Y = f(x)
+\end{cases}
+$$
+
+In practice, requiring exact equality is usually too strict, so the condition can be relaxed with a threshold $T$:
+
+$$
+L(Y, f(x)) =
+\begin{cases}
+1,& |Y-f(x)|\ge T\\
+0,& |Y-f(x)|< T
+\end{cases}
+$$
+
+2. **Absolute loss function**
+Similar to the 0-1 loss function, the absolute value loss function is expressed as:
+
+$$
+L(Y, f(x)) = |Y-f(x)|
+$$
+
+3. **squared loss function**
+
+$$
+L(Y, f(x)) = \sum_N{(Y-f(x))}^2
+$$
+
+This can be understood from the perspective of least squares and Euclidean distance. The principle of the least squares method is that the best-fit curve should minimize the sum of the squared distances from all points to the regression line.
+
+4. **Logarithmic loss function**
+
+$$
+L(Y, P(Y|X)) = -\log{P(Y|X)}
+$$
+
+Logistic regression uses the logarithmic loss function. Many people assume that logistic regression uses the squared loss, but it does not. Logistic regression assumes that the samples follow a Bernoulli distribution, derives the likelihood function of that distribution, and then takes the logarithm to find the extremum. The empirical risk function derived for logistic regression minimizes the negative log-likelihood, which, viewed as a loss function, is the logarithmic loss function.
+
+5. **Exponential loss function**
+The standard form of the exponential loss function is:
+
+$$
+L(Y, f(x)) = \exp(-Yf(x))
+$$
+
+For example, AdaBoost uses the exponential loss function as a loss function.
+
+6. **Hinge loss function**
+The standard form of the Hinge loss function is as follows:
+
+$$
+L(y) = \max(0, 1-ty)
+$$
+
+Where $y$ is the predicted value, in the range $(-1,1)$, and $t$ is the target value, which is $-1$ or $1$.
+
+In linear support vector machines, the optimization problem is equivalent to
+
+$$
+\underset{w,b}{\min}\sum_{i=1}^N \max(0, 1-y_i(wx_i+b))+\lambda\Vert w\Vert^2
+$$
+
+The above formula is similar to the following
+
+$$
+\frac{1}{N}\sum_{i=1}^{N}l(wx_i+b, y_i) + \Vert w\Vert^2
+$$
+
+Where $l(wx_i+b, y_i)$ is the hinge loss function and $\Vert w\Vert^2$ can be regarded as a regularization term.
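+
+The six loss functions above translate almost directly into code. The following is a minimal sketch, assuming NumPy arrays of targets and predictions; the function names and the sample values are hypothetical, and the hinge loss takes the target first to match $\max(0, 1-ty)$.
+
+```python
+import numpy as np
+
+def zero_one_loss(y, f):
+    return np.where(y != f, 1.0, 0.0)
+
+def relaxed_zero_one_loss(y, f, T):
+    return np.where(np.abs(y - f) >= T, 1.0, 0.0)
+
+def absolute_loss(y, f):
+    return np.abs(y - f)
+
+def squared_loss(y, f):
+    return np.sum((y - f) ** 2)             # summed over all samples
+
+def log_loss(p_true):
+    """p_true is the probability the model assigns to the correct label."""
+    return -np.log(p_true)
+
+def exponential_loss(y, f):
+    return np.exp(-y * f)
+
+def hinge_loss(t, y):
+    return np.maximum(0.0, 1.0 - t * y)     # t: target (-1 or 1), y: prediction
+
+y_true = np.array([1.0, -1.0, 1.0])
+y_pred = np.array([0.8, 0.3, -0.5])
+
+print("0-1:        ", zero_one_loss(y_true, y_pred))
+print("relaxed 0-1:", relaxed_zero_one_loss(y_true, y_pred, T=0.5))
+print("absolute:   ", absolute_loss(y_true, y_pred))
+print("squared:    ", squared_loss(y_true, y_pred))
+print("exponential:", exponential_loss(y_true, y_pred))
+print("hinge:      ", hinge_loss(y_true, y_pred))
+```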
+
+### 2.11.3 Why does Logistic Regression use a logarithmic loss function?
+Assume the logistic regression model is
+$$
+P(y=1|x;\theta)=\frac{1}{1+e^{-\theta^{T}x}}
+$$
+Assume that the output of the logistic regression model follows a Bernoulli distribution, whose probability mass function is
+$$
+P(X=n)=
+\begin{cases}
+1-p, & n=0\\
+p, & n=1
+\end{cases}
+$$
+Its likelihood function is
+$$
+L(\theta)=\prod_{i=1}^{m}
+P(y=1|x_i)^{y_i}P(y=0|x_i)^{1-y_i}
+$$
+The log-likelihood function is
+$$
+\ln L(\theta)=\sum_{i=1}^{m}[y_i\ln{P(y=1|x_i)}+(1-y_i)\ln{P(y=0|x_i)}]\\
+ =\sum_{i=1}^m[y_i\ln{P(y=1|x_i)}+(1-y_i)\ln(1-P(y=1|x_i))]
+$$
+The logarithmic loss on a single data point is defined as
+$$
+Cost(y,p(y|x))=-y\ln{p(y|x)}-(1-y)\ln(1-p(y|x))
+$$
+Then the loss over all samples is:
+$$
+Cost(y,p(y|x)) = -\sum_{i=1}^m[y_i\ln p(y_i|x_i)+(1-y_i)\ln(1-p(y_i|x_i))]
+$$
+Comparing the two shows that the logarithmic loss function is exactly the negative of the log-likelihood function used in maximum likelihood estimation, so minimizing one is equivalent to maximizing the other. This is why logistic regression uses the logarithmic loss function directly.
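+
+The equivalence can be confirmed numerically. The sketch below is an illustration only: the toy data, the parameter vector, and the function names are hypothetical, and $p$ stands for $P(y=1|x)$ as in the model above.
+
+```python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+def log_likelihood(theta, X, y):
+    """ln L(theta) = sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ], with p_i = P(y=1 | x_i)."""
+    p = sigmoid(X @ theta)
+    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
+
+def log_loss(theta, X, y):
+    """Sum over samples of -ln(probability assigned to the observed label)."""
+    p = sigmoid(X @ theta)
+    per_sample = np.where(y == 1, -np.log(p), -np.log(1 - p))
+    return np.sum(per_sample)
+
+# Hypothetical toy data: a bias column plus one feature.
+X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]])
+y = np.array([1, 0, 1, 0])
+theta = np.array([0.3, -0.7])
+
+print(log_loss(theta, X, y), -log_likelihood(theta, X, y))
+```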
+
+### 2.11.4 How does the logarithmic loss function measure loss?
+Example:
+To fit a Gaussian distribution, we need to determine its mean and standard deviation.
+How are these two parameters determined? Maximum likelihood estimation is a common method. Its goal is to find the parameter values under which the distribution maximizes the probability of observing the data.
+This requires the joint probability of all observed data points, which is simplified as follows:
+
+1. Assume that the probability of observing each data point is independent of the probability of the other data points.
+2. Take the natural logarithm.
+ Suppose the probability of observing a single data point $x_i(i=1,2,...,n)$ is:
+$$
+P(x_i;\mu,\sigma)=\frac{1}{\sigma \sqrt{2\pi}}\exp
+\left( - \frac{(x_i-\mu)^2}{2\sigma^2} \right)
+$$
+
+
+3. Its joint probability is
+ $$
+ P(x_1,x_2,...,x_n;\mu,\sigma)=\frac{1}{\sigma \sqrt{2\pi}}\exp
+ \left( - \frac{(x_1-\mu)^2}{2\sigma^2} \right) \\ \times
+ \frac{1}{\sigma \sqrt{2\pi}}\exp
+ \left( - \frac{(x_2-\mu)^2}{2\sigma^2} \right) \times ... \times
+ \frac{1}{\sigma \sqrt{2\pi}}\exp
+ \left( - \frac{(x_n-\mu)^2}{2\sigma^2} \right)
+ $$
+
+
+ Taking the natural logarithm of the above formula gives:
+ $$
+ \ln(P(x_1,x_2,...x_n;\mu,\sigma))=
+ \ln \left(\frac{1}{\sigma \sqrt{2\pi}} \right)
+ - \frac{(x_1-\mu)^2}{2\sigma^2} \\ +
+ \ln \left( \frac{1}{\sigma \sqrt{2\pi}} \right)
+ - \frac{(x_2-\mu)^2}{2\sigma^2} +...+
+ \ln \left( \frac{1}{\sigma \sqrt{2\pi}} \right)
+ - \frac{(x_n-\mu)^2}{2\sigma^2}
+ $$
+ Using the laws of logarithms, the above formula simplifies to:
+ $$
+ \ln(P(x_1,x_2,...x_n;\mu,\sigma))=-n\ln(\sigma)-\frac{n}{2} \ln(2\pi)\\
+ -\frac{1}{2\sigma^2}[(x_1-\mu)^2+(x_2-\mu)^2+...+(x_n-\mu)^2]
+ $$
+ Taking the partial derivative with respect to $\mu$:
+ $$
+ \frac{\partial\ln(P(x_1,x_2,...,x_n;\mu,\sigma))}{\partial\mu}=
+ \frac{1}{\sigma^2}[x_1+x_2+...+x_n-n\mu]
+ $$
+ The negative of this log-likelihood is the logarithmic loss function. The smaller the loss, the better, and the loss is smallest where the derivative above equals 0, which gives:
+ $$
+ \mu=\frac{x_1+x_2+...+x_n}{n}
+ $$
+ Similarly, $\sigma$ can be calculated.
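+
+The result can be checked on simulated data. This is a minimal sketch, assuming NumPy; the sample is hypothetical, and the estimate used for $\sigma$ (the square root of the mean squared deviation) is the standard closed-form maximizer, which the text does not derive.
+
+```python
+import numpy as np
+
+def gaussian_log_likelihood(x, mu, sigma):
+    """ln P(x_1..x_n; mu, sigma) = -n ln(sigma) - n/2 ln(2 pi) - sum((x - mu)^2) / (2 sigma^2)."""
+    n = len(x)
+    return (-n * np.log(sigma) - 0.5 * n * np.log(2 * np.pi)
+            - np.sum((x - mu) ** 2) / (2 * sigma ** 2))
+
+rng = np.random.default_rng(0)
+x = rng.normal(loc=5.0, scale=2.0, size=1000)     # hypothetical observations
+
+mu_hat = np.mean(x)                                # closed-form maximizer for mu
+sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))    # closed-form maximizer for sigma (assumed, see above)
+
+best = gaussian_log_likelihood(x, mu_hat, sigma_hat)
+print(mu_hat, sigma_hat, best)
+```
+
+Evaluating `gaussian_log_likelihood` at slightly perturbed values of `mu_hat` or `sigma_hat` gives a lower log-likelihood, that is, a higher logarithmic loss.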
+
+## 2.12 Gradient descent
+
+### 2.12.1 Why is gradient descent needed in machine learning?
+
+1. Gradient descent is an iterative method that can be used to solve least squares problems.
+2. When solving for the model parameters of a machine learning algorithm, that is, an unconstrained optimization problem, the main methods are gradient descent and least squares.
+3. When minimizing the loss function, gradient descent can be applied step by step to obtain the minimized loss function and the model parameter values.
+4. If the maximum of the loss function is needed instead, it can be found by iterating with the gradient ascent method. Gradient descent and gradient ascent can be converted into each other.
+5. In machine learning, gradient descent mainly comes in two forms: stochastic gradient descent and batch gradient descent.
+
+### 2.12.2 What are the disadvantages of the gradient descent method?
+1. Convergence slows down near the minimum.
+2. Line search may run into problems.
+3. The descent may follow a zigzag path.
+
+Points to note about the gradient:
+1. A gradient is a vector, that is, it has both a direction and a magnitude;
+2. The direction of the gradient is the direction of the maximum directional derivative;
+3. The magnitude of the gradient is the value of the maximum directional derivative.
+
+### 2.12.3 How can the gradient descent method be understood intuitively?
+Classical illustration of the gradient descent method:
+
+
+
+Intuitive example:
+> Suppose that we start somewhere on a mountain. Because the surroundings are unfamiliar, we do not know the way down and can only feel our way step by step. At each position, we compute the gradient at the current position and take one step along the negative direction of the gradient, that is, down the steepest slope at that point. We then compute the gradient at the new position and again step down the steepest slope. Repeating this step after step, we continue until we feel that we have reached the foot of the mountain. Of course, by proceeding this way we may not reach the foot of the mountain at all, and may instead end up in a local low point on the mountainside.
+
+As this explanation shows, gradient descent does not necessarily find the global optimal solution and may find only a local optimal solution. If the loss function is convex, however, the solution obtained by gradient descent must be the global optimal solution.
+
+**Core idea**:
+
+1. Initialize the parameters by randomly selecting any value within their range;
+2. Iterate:
+ a) compute the current gradient;
+ b) take one step in the steepest downhill direction;
+ c) update the variables to the new position;
+ d) decide whether to terminate; if not, return to a);
+3. Obtain the global optimal solution or a solution close to it.
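+
+The mountain analogy, including the risk of stopping at a local low point, can be reproduced on a one-dimensional function. The sketch below is illustrative only: the function $f(x)=x^4-3x^2+x$, the step size, the tolerance, and the two starting points are all hypothetical choices.
+
+```python
+def f(x):
+    """A hypothetical non-convex function with one global and one local minimum."""
+    return x ** 4 - 3 * x ** 2 + x
+
+def grad_f(x):
+    return 4 * x ** 3 - 6 * x + 1
+
+def gradient_descent(x0, lr=0.01, tol=1e-10, max_iters=10000):
+    x = x0
+    for _ in range(max_iters):
+        step = lr * grad_f(x)      # gradient at the current position, scaled by the step size
+        x -= step                  # move against the gradient
+        if abs(step) < tol:        # stop once the steps become negligible
+            break
+    return x
+
+x_left = gradient_descent(-2.0)    # starting on the left slope
+x_right = gradient_descent(2.0)    # starting on the right slope
+print(x_left, f(x_left))
+print(x_right, f(x_right))
+```
+
+The two runs stop at different stationary points, and the one reached from the right has the higher function value, so it is only a local minimum.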
+
+### 2.12.4 Gradient descent algorithm description
+1. Determine the hypothesis function and loss function of the optimization model.
+ For example, for linear regression, the hypothesis function is:
+$$
+ h_\theta(x_1,x_2,...,x_n)=\theta_0+\theta_1x_1+...+\theta_nx_n
+$$
+ Where $\theta_i(i=0,1,2,...,n)$ are the model parameters and $x_i(i=1,2,...,n)$ are the feature values of each sample.
+ For this hypothesis function, the loss function is:
+$$
+ J(\theta_0,\theta_1,...,\theta_n)=\frac{1}{2m}\sum^{m}_{j=0}(h_\theta (x^{(j)}_0
+ ,x^{(j)}_1,...,x^{(j)}_n)-y_j)^2
+$$
+
+2. Initialize the relevant parameters.
+ This mainly means initializing ${\theta}_i$, the algorithm's iteration step size ${\alpha}$, and the termination distance ${\zeta}$. They can be initialized empirically, for example ${\theta}$ to 0 and the step size ${\alpha}$ to 1, or they can be initialized randomly. The descent distance of the current step is recorded as ${\varphi}_i$.
+
+3. Iterative calculations.
+
+ 1) Calculate the gradient of the loss function at the current position. For ${\theta}_i$, the gradient is expressed as:
+
+$$
+\frac{\partial}{\partial \theta_i}J({\theta}_0,{\theta}_1,...,{\theta}_n)
+$$
+2) Calculate the descent distance at the current position.
+$$
+{\varphi}_i={\alpha} \frac{\partial}{\partial \theta_i}J({\theta}_0,{\theta}_1,...,{\theta}_n)
+$$
+3) Determine whether to terminate.
+Check whether the descent distance ${\varphi}_i$ of every ${\theta}_i$ is less than the termination distance ${\zeta}$. If all of them are less than ${\zeta}$, the algorithm terminates and the current values of ${\theta}_i$ are the final result; otherwise go to the next step.
+4) Update all ${\theta}_i$. The update expression is:
+$$
+{\theta}_i={\theta}_i-\alpha \frac{\partial}{\partial \theta_i}J({\theta}_0,{\theta}_1,...,{\theta}_n)
+$$
+5) After the update is completed, return to 1).
+
+**Example**. Take linear regression as an example.
+Suppose the samples are
+$$
+(x^{(0)}_1,x^{(0)}_2,...,x^{(0)}_n,y_0),(x^{(1)}_1,x^{(1 )}_2,...,x^{(1)}_n,y_1),...,
+(x^{(m)}_1,x^{(m)}_2,...,x^{(m)}_n,y_m)
+$$
+The loss function is
+$$
+J(\theta_0,\theta_1,...,\theta_n)=\frac{1}{2m}\sum^{m}_{j=0}(h_\theta (x^{(j)}_0
+,x^{(j)}_1,...,x^{(j)}_n)-y_j)^2
+$$
+In the calculation, the partial derivative of ${\theta}_i$ is calculated as follows:
+$$
+\frac{\partial}{\partial \theta_i}J({\theta}_0,{\theta}_1,...,{\theta}_n)=\frac{1}{m}\sum^{m }_{j=0}(h_\theta (x^{(j)}_0
+,x^{(j)}_1,...,x^{(j)}_n)-y_j)x^{(j)}_i
+$$
+In the formula above, $x^{(j)}_0=1$. The update expression for $\theta_i$ in step 4) is:
+$$
+\theta_i=\theta_i - \alpha \frac{1}{m} \sum^{m}_{j=0}(h_\theta (x^{(j)}_0
+,x^{(j)}_1,...,x^{(j)}_n)-y_j)x^{(j)}_i
+$$
+
+
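+This shows that the gradient direction at the current position is determined by all the samples. The factor $\frac{1}{m}$ is included to make the formula easier to understand; since $\alpha$ and $\frac{1}{m}$ are both constants, the product $\alpha \frac{1}{m}$ can be treated as a single constant.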
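+
+The linear regression example can be written out as a short program that follows steps 1) to 5). This is a sketch under stated assumptions: the data are hypothetical and noise-free, the step size 0.5 and termination distance are illustrative values, and the function name `batch_gradient_descent` is not from the original text.
+
+```python
+import numpy as np
+
+def batch_gradient_descent(X, y, alpha=0.5, zeta=1e-10, max_iters=100000):
+    """Minimize J(theta) = 1/(2m) * sum((X @ theta - y)^2) following steps 1) to 5)."""
+    m, n = X.shape
+    theta = np.zeros(n)                       # empirical initialization to 0
+    for _ in range(max_iters):
+        gradient = X.T @ (X @ theta - y) / m  # 1) gradient at the current position
+        phi = alpha * gradient                # 2) descent distance
+        if np.all(np.abs(phi) < zeta):        # 3) termination test
+            break
+        theta = theta - phi                   # 4) update, then 5) repeat
+    return theta
+
+# Hypothetical noise-free data generated from y = 1 + 2 * x.
+x = np.linspace(0.0, 1.0, 20)
+X = np.column_stack([np.ones_like(x), x])     # the column of ones plays the role of x_0 = 1
+y = 1.0 + 2.0 * x
+
+theta = batch_gradient_descent(X, y)
+print(theta)
+```
+
+Every update uses all rows of `X`, which is what makes the gradient direction depend on all the samples.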
+
+### 2.12.5 How to tune the gradient descent method?
+In practice, the parameters of the gradient descent method rarely reach their ideal values in one attempt. Tuning the method mainly involves the following aspects:
+1. **Choosing the algorithm's iteration step size $\alpha$.**
+ When the algorithm parameters are initialized, the step size is sometimes set to 1 based on experience. The actual value depends on the data sample. You can try several values from large to small and run the algorithm to observe the effect on the iterations. If the loss function keeps decreasing, the value is valid; otherwise the step size needs to be adjusted. If the step size is too large, the iterations move too fast and may miss the optimal solution. If the step size is too small, the iterations are slow and the algorithm runs for a long time.
+2. **Choosing the initial values of the parameters.**
+ Different initial values may lead to different minima, because gradient descent may find only a local minimum. If the loss function is convex, the minimum found must be the optimal solution. Because of the risk of local optima, the algorithm should be run several times with different initial values, keeping the initial values that give the smallest loss.
+3. **Standardization.**
+ Because samples differ, the value ranges of the features differ too, which slows down the iterations. To reduce this effect, the feature data can be standardized so that each feature has a new expectation of 0 and a new variance of 1, which shortens the running time of the algorithm.
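+
+Standardization itself is a two-line computation. The sketch below assumes NumPy and a feature matrix with one row per sample; the data and the function name `standardize` are hypothetical.
+
+```python
+import numpy as np
+
+def standardize(X):
+    """Rescale every feature column to zero mean and unit variance.
+
+    Assumes no column is constant (a zero standard deviation would divide by zero).
+    """
+    mean = X.mean(axis=0)
+    std = X.std(axis=0)
+    return (X - mean) / std, mean, std
+
+# Hypothetical features on very different scales.
+X = np.array([[2104.0, 3.0],
+              [1600.0, 3.0],
+              [2400.0, 4.0],
+              [1416.0, 2.0],
+              [3000.0, 4.0]])
+
+X_std, mean, std = standardize(X)
+print(X_std.mean(axis=0), X_std.var(axis=0))
+```
+
+The returned `mean` and `std` should be kept so that new samples can be transformed with the same values at prediction time.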
+
+### 2.12.7 What is the difference between stochastic gradient descent and batch gradient descent?
+Stochastic gradient descent and batch gradient descent are the two main gradient descent methods. Both add certain restrictions in order to speed up the computation.
+The two gradient descent methods are compared below.
+Assume that the hypothesis function is
+$$
+h_\theta (x_1,x_2,...,x_n) = \theta_0 + \theta_1 x_1 + ... + \theta_n x_n
+$$
+The loss function is
+$$
+J(\theta_0, \theta_1, ... , \theta_n) =
+\frac{1}{2m} \sum^{m}_{j=0}(h_\theta (x^{(j)}_0
+,x^{(j)}_1,...,x^{(j)}_n)-y_j)^2
+$$
+Where $m$ is the number of samples, $j$ indexes the samples, and $n$ is the number of features.
+
+1. **The batch gradient descent procedure is as follows:**
+
+a) Compute the gradient with respect to each $\theta_i$:
+$$
+\frac{\partial}{\partial \theta_i}J({\theta}_0,{\theta}_1,...,{\theta}_n)=\frac{1}{m}\sum^{m}_{j=0}(h_\theta (x^{(j)}_0
+,x^{(j)}_1,...,x^{(j)}_n)-y_j)x^{(j)}_i
+$$
+b) Since the goal is to minimize the risk function, update each $\theta_i$ in the negative direction of its gradient:
+$$
+\theta_i=\theta_i - \alpha \frac{1}{m} \sum^{m}_{j=0}(h_\theta (x^{(j)}_0
+,x^{(j)}_1,...,x^{(j)}_n)-y_j)x^{(j)}_i
+$$
+c) The equation above shows that, although this method reaches the global optimal solution, every iteration uses all the data in the training set. If the sample size is large, the iterations of this method are very slow.
+Stochastic gradient descent avoids this problem.
+
+2. **The stochastic gradient descent procedure is as follows:**
+a) Whereas batch gradient descent uses all the training samples, the loss function in stochastic gradient descent is defined at the granularity of each individual sample in the training set.
+The loss function can be written in the form
+$$
+J(\theta_0, \theta_1, ... , \theta_n) =
+\frac{1}{m} \sum^{m}_{j=0}\frac{1}{2}(y_j - h_\theta (x^{(j)}_0
+,x^{(j)}_1,...,x^{(j)}_n))^2 =
+\frac{1}{m} \sum^{m}_{j=0} cost(\theta,(x^{(j)},y_j))
+$$
+b) For each parameter $\theta_i$, update along the gradient direction of a single sample:
+$$
+\theta_i = \theta_i + \alpha (y_j - h_\theta (x^{(j)}_0, x^{(j)}_1, ... , x^{(j)}_n))x^{(j)}_i
+$$
+c) Stochastic gradient descent updates the parameters once for every sample.
+One problem with stochastic gradient descent is that it is much noisier than batch gradient descent, so not every iteration moves in the direction of the overall optimum.
+
+**Summary:**
+Stochastic gradient descent and batch gradient descent are two extremes. A simple comparison follows:
+
+| Method | Features |
+| :----------: | :----------------------------------------------------------- |
+| Batch gradient descent | a) Uses all the data for each gradient descent step. b) Training is slow when the sample size is large. |
+| Stochastic gradient descent | a) Uses a single sample for each gradient descent step. b) Training is fast. c) Only one sample determines the gradient direction, so the solution may not be optimal. d) Each iteration uses one sample, so the iteration direction changes a lot and the method cannot converge quickly to a local optimal solution. |
+
+The mini-batch gradient descent method described below combines the advantages of both methods.
+
+3. **The mini-batch gradient descent procedure is as follows:**
+For a data set with $m$ samples in total, select $n(1< n< m)$ of the samples for each iteration. The formula for updating each $\theta_i$ along the gradient direction is:
+$$
+\theta_i = \theta_i - \alpha \sum^{t+n-1}_{j=t}
+( h_\theta (x^{(j)}_{0}, x^{(j)}_{1}, ... , x^{(j)}_{n} ) - y_j ) x^{(j)}_{i}
+$$
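+
+The three variants differ only in how many samples enter each update, so one function with a batch-size argument can sketch all of them. This is an illustration under stated assumptions: the data are hypothetical and noise-free, the samples are shuffled every epoch, and the gradient is averaged over the batch (the mini-batch formula above sums instead, which only rescales the step size). The function name and hyperparameters are not from the original text.
+
+```python
+import numpy as np
+
+def gradient_descent(X, y, batch_size, alpha=0.5, epochs=1000, seed=0):
+    """Linear regression by gradient descent; batch_size selects the variant.
+
+    batch_size == len(y) gives batch GD, batch_size == 1 gives stochastic GD,
+    anything in between gives mini-batch GD.
+    """
+    rng = np.random.default_rng(seed)
+    m, n = X.shape
+    theta = np.zeros(n)
+    for _ in range(epochs):
+        order = rng.permutation(m)                    # visit the samples in random order
+        for start in range(0, m, batch_size):
+            idx = order[start:start + batch_size]
+            error = X[idx] @ theta - y[idx]           # h_theta(x) - y on the chosen samples
+            theta -= alpha * X[idx].T @ error / len(idx)
+    return theta
+
+x = np.linspace(0.0, 1.0, 50)
+X = np.column_stack([np.ones_like(x), x])
+y = 1.0 + 2.0 * x
+
+theta_batch = gradient_descent(X, y, batch_size=len(y))
+theta_sgd = gradient_descent(X, y, batch_size=1)
+theta_mini = gradient_descent(X, y, batch_size=10)
+print(theta_batch, theta_sgd, theta_mini)
+```
+
+On this noise-free toy problem all three variants reach the same parameters; with noisy data, the single-sample updates would fluctuate around the optimum instead of settling exactly.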
+
+### 2.12.8 Comparison of various gradient descent methods
+The table below briefly compares batch gradient descent (BGD), stochastic gradient descent (SGD), mini-batch gradient descent (Mini-batch GD), and online gradient descent (Online GD):
+
+| |BGD|SGD|Mini-batch GD|Online GD|
+|:-:|:-:|:-:|:-:|:-:|
+|Training set|Fixed|Fixed|Fixed|Updated in real time|
+|Samples per iteration|Whole training set|Single sample|Subset of training set|Depends on the specific algorithm|
+|Algorithm complexity|High|Low|Medium|Low|
+|Timeliness|Low|Medium|Medium|High|
+|Convergence|Stable|Unstable|Relatively stable|Unstable|
+
+BGD, SGD, and Mini-batch GD have been discussed above; Online GD is introduced here.
+
+Online GD differs from Mini-batch GD and SGD in that every training sample is used only once and then discarded. The advantage is that it can track the trend along which the final model changes.
+
+Online GD is widely used in the Internet field, for example in click-through rate (CTR) estimation models for search advertising, where users' click behavior changes over time. The ordinary BGD algorithm (updated, say, once a day) takes a long time because it must retrain on all historical data, and it cannot promptly reflect shifts in users' click behavior. The Online GD algorithm can adapt in real time as users' click behavior shifts.
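+
+The use-once-then-discard behavior can be sketched with a simulated data stream. Everything here is a hypothetical illustration, not a CTR model: the stream generator, the linear model, the step size, and the sudden change in the underlying relationship are assumptions chosen to show the model following a shift without retraining on old data.
+
+```python
+import numpy as np
+
+def online_update(theta, x, y, alpha=0.5):
+    """One online step on a single sample, which is then discarded."""
+    return theta - alpha * (x @ theta - y) * x
+
+def sample_stream(rng, true_theta, count):
+    """Hypothetical data source yielding (x, y) pairs with y = true_theta . [1, feature]."""
+    for _ in range(count):
+        x = np.array([1.0, rng.uniform(0.0, 1.0)])
+        yield x, x @ true_theta
+
+rng = np.random.default_rng(0)
+theta = np.zeros(2)
+
+for x, y in sample_stream(rng, np.array([1.0, 2.0]), 5000):
+    theta = online_update(theta, x, y)
+theta_before_shift = theta.copy()
+
+# The underlying relationship changes; the model keeps updating on new samples only.
+for x, y in sample_stream(rng, np.array([1.0, -1.0]), 5000):
+    theta = online_update(theta, x, y)
+theta_after_shift = theta.copy()
+
+print(theta_before_shift, theta_after_shift)
+```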
+
+## 2.13 How are derivatives calculated in a computational graph?
+Derivative calculation in a computational graph is backpropagation, which is derived using the chain rule and the differentiation of implicit functions.
+
+Suppose $z = f(u,v)$ has continuous partial derivatives at the point $(u,v)$, and $u$ and $v$ are functions of $t$ that are differentiable at $t$. Then the derivative of $z$ at $t$ can be found.
+
+According to the chain rule
+$$
+\frac{dz}{dt}=\frac{\partial z}{\partial u}\cdot\frac{du}{dt}+\frac{\partial z}{\partial v}
+\cdot\frac{dv}{dt}
+$$
+
+For ease of understanding, the following examples are given.
+Suppose $J$ is a function of $a$, $b$, and $c$, computed through the intermediate variables $u$ and $v$. The chain rule gives:
+$$
+\frac{dJ}{du}=\frac{dJ}{dv}\frac{dv}{du},\frac{dJ}{db}=\frac{dJ}{du}\frac{du}{db},\frac{dJ}{da}=\frac{dJ}{du}\frac{du}{da}
+$$
+
+The chain rule can be described in words: "For a composite function composed of two functions, the derivative equals the derivative of the outer function evaluated at the value of the inner function, multiplied by the derivative of the inner function."
+
+example:
+
+$$
+f(x)=x^2,g(x)=2x+1
+$$
+
+then
+
+$$
+{f[g(x)]}'=2[g(x)] \times g'(x)=2[2x+1] \times 2=8x+4
+$$
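+
+The same example can be traced as a tiny two-node computational graph, $x \rightarrow u = g(x) \rightarrow z = f(u)$. The sketch below is illustrative: the function names are hypothetical, and the central-difference check is included only as an independent confirmation. Expanding $2(2x+1) \times 2$ gives $8x+4$, which is what the code compares against.
+
+```python
+def g(x):
+    return 2 * x + 1
+
+def f(u):
+    return u ** 2
+
+def composite_derivative(x):
+    """Backward pass through the two-node graph x -> u = g(x) -> z = f(u)."""
+    u = g(x)               # forward pass: value of the inner function
+    dz_du = 2 * u          # derivative of the outer function, evaluated at u
+    du_dx = 2              # derivative of the inner function
+    return dz_du * du_dx   # chain rule: dz/dx = dz/du * du/dx
+
+def numerical_derivative(x, h=1e-6):
+    """Central difference, used only as an independent check."""
+    return (f(g(x + h)) - f(g(x - h))) / (2 * h)
+
+x = 3.0
+print(composite_derivative(x), 8 * x + 4, numerical_derivative(x))
+```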
+
+
+## 2.14 Linear Discriminant Analysis (LDA)
+
+### 2.14.1 Summary of LDA Thoughts
+
+Linear Discriminant Analysis (LDA) is a classic dimensionality reduction method. Unlike PCA, an unsupervised dimensionality reduction technique that ignores the class labels of the samples, LDA is a supervised dimensionality reduction technique: every sample in the data set has a class label.
+
+The idea behind LDA classification is briefly summarized as follows:
+1. Classification is more complicated in a multi-dimensional space. The LDA algorithm projects the data from the multi-dimensional space onto a straight line, converting d-dimensional data into one-dimensional data for processing.
+2. For the training data, project the multi-dimensional data onto a straight line so that the projected points of the same class are as close together as possible and the projected points of different classes are as far apart as possible.
+3. To classify a new sample, project it onto the same line and determine its class from the position of the projected point.
+
+In one sentence, the LDA idea is: "after projection, the within-class variance is minimized and the between-class variance is maximized".
+
+### 2.14.2 Graphical LDA Core Ideas
+Assume that there are two classes of data, red and blue, with two-dimensional features, as shown in the following figure. Our goal is to project the data onto one dimension so that the projected points of each class are as close together as possible and the two classes are as far apart as possible, that is, the distance between the centers of the red and blue data in the figure is as large as possible.
+
+
+
+The left and right images show two different projections.
+
+Left: the projection that places the farthest points of the different classes as far apart as possible.
+
+Right: the projection that places the data of the same class as close together as possible.
+
+As the figure shows, the red and blue data on the right are each concentrated in their own region, which can also be seen from the histograms of the data distribution. The projection on the right is therefore better than the one on the left, where the two classes clearly overlap in the middle.
+
+The example above uses two-dimensional data, so the projection is onto a straight line. If the original data is higher-dimensional, the projection is onto a low-dimensional hyperplane.
+
+### 2.14.3 Principles of the two-class LDA algorithm
+Input: data set $D=\{(x_1,y_1),(x_2,y_2),...,(x_m,y_m)\}$, where each sample $x_i$ is an n-dimensional vector, $y_i \in \{C_1, C_2, ..., C_k\}$, and $d$ is the target dimension after reduction. Define:
+
+$N_j(j=0,1)$ is the number of samples of class $j$;
+
+$X_j(j=0,1)$ is the set of samples of class $j$;
+
+$u_j(j=0,1)$ is the mean vector of the samples of class $j$;
+
+$\Sigma_j(j=0,1)$ is the covariance matrix of the samples of class $j$.
+
+Where
+$$
+u_j = \frac{1}{N_j} \sum_{x\in X_j}x\;(j=0,1),\quad
+\Sigma_j = \sum_{x\in X_j}(x-u_j)(x-u_j)^T\;(j=0,1)
+$$
+Suppose the projection line is given by the vector $w$. The projection of any sample $x_i$ onto the line $w$ is $w^Tx_i$, and the projections of the two class centers $u_0$ and $u_1$ onto the line $w$ are $w^Tu_0$ and $w^Tu_1$ respectively.
+
+The goal of LDA is to make the distance between the two class centers, $\| w^Tu_0 - w^Tu_1 \|^2_2$, as large as possible, and at the same time to make the covariances of the projected points of each class, $w^T \Sigma_0 w$ and $w^T \Sigma_1 w$, as small as possible, that is, to minimize $w^T \Sigma_0 w + w^T \Sigma_1 w$.
+Define the within-class scatter matrix (also called the intra-class divergence matrix)
+$$
+S_w = \Sigma_0 + \Sigma_1 =
+\sum_{x\in X_0}(x-u_0)(x-u_0)^T +
+\sum_{x\in X_1}(x-u_1)(x-u_1)^T
+$$
+and the between-class scatter matrix (inter-class divergence matrix) $S_b = (u_0 - u_1)(u_0 - u_1)^T$.
+
+According to this analysis, the optimization goal is
+$$
+\arg \max J(w) = \frac{\| w^Tu_0 - w^Tu_1 \|^2_2}{w^T \Sigma_0w + w^T \Sigma_1w} =
+\frac{w^T(u_0-u_1)(u_0-u_1)^Tw}{w^T(\Sigma_0 + \Sigma_1)w} =
+\frac{w^TS_bw}{w^TS_ww}
+$$
+By the properties of the generalized Rayleigh quotient, the maximum eigenvalue of the matrix $S^{-1}_{w} S_b$ is the maximum value of $J(w)$, and the eigenvector of $S^{-1}_{w} S_b$ corresponding to that largest eigenvalue is $w$.
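+
+The two-class derivation can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the two Gaussian point clouds are hypothetical data, and the function names are not from the original text. The direction is taken as the eigenvector of $S_w^{-1}S_b$ with the largest eigenvalue, as described above.
+
+```python
+import numpy as np
+
+def two_class_lda_direction(X0, X1):
+    """Return (w, S_w, S_b), where w is the top eigenvector of inv(S_w) @ S_b."""
+    u0, u1 = X0.mean(axis=0), X1.mean(axis=0)
+    S_w = (X0 - u0).T @ (X0 - u0) + (X1 - u1).T @ (X1 - u1)   # within-class scatter
+    diff = (u0 - u1).reshape(-1, 1)
+    S_b = diff @ diff.T                                        # between-class scatter
+    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
+    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
+    return w / np.linalg.norm(w), S_w, S_b
+
+def fisher_ratio(w, S_w, S_b):
+    """J(w) = (w^T S_b w) / (w^T S_w w)."""
+    return (w @ S_b @ w) / (w @ S_w @ w)
+
+rng = np.random.default_rng(0)
+X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.3], size=(100, 2))   # hypothetical class 0
+X1 = rng.normal(loc=[2.0, 1.0], scale=[1.0, 0.3], size=(100, 2))   # hypothetical class 1
+
+w, S_w, S_b = two_class_lda_direction(X0, X1)
+print(w, fisher_ratio(w, S_w, S_b))
+```
+
+Because $S_b$ has rank one in the two-class case, the resulting direction is parallel to $S_w^{-1}(u_0-u_1)$, which gives a cheaper way to compute it without an eigendecomposition.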
+
+### 2.14.4 Summary of the LDA algorithm flow
+The dimension reduction process of the LDA algorithm is as follows:
+
+Input: data set $D = \{ (x_1,y_1), (x_2,y_2), ... ,(x_m,y_m) \}$, where each sample $x_i$ is an n-dimensional vector, $y_i \in \{C_1, C_2, ..., C_k \}$, and $d$ is the target dimension after reduction.
+
+Output: the dimension-reduced data set $\overline{D}$.
+
+Steps:
+1. Calculate the within-class scatter matrix $S_w$.
+2. Calculate the between-class scatter matrix $S_b$.
+3. Calculate the matrix $S^{-1}_wS_b$.
+4. Calculate the largest $d$ eigenvalues of the matrix $S^{-1}_wS_b$.
+5. Calculate the $d$ eigenvectors corresponding to these $d$ eigenvalues, and use them to form the projection matrix $W$.
+6. Transform each sample in the sample set to obtain the new sample $p_i = W^Tx_i$.
+7. Output the new sample set $\overline{D} = \{ (p_1,y_1),(p_2,y_2),...,(p_m,y_m) \}$.
+
+### 2.14.5 What is the difference between LDA and PCA?
+
+|Similarities and differences|LDA|PCA|
+|:-:|:-|:-|
+|Similarities|1. Both can reduce the dimension of the data; 2. Both use matrix eigendecomposition for dimension reduction; 3. Both assume that the data follows a Gaussian distribution.||
+|Differences|Supervised dimensionality reduction method;|Unsupervised dimensionality reduction method;|
+||Can reduce to at most k-1 dimensions;|No limit on the number of dimensions after reduction;|
+||Can be used for dimensionality reduction and also for classification;|Only used for dimensionality reduction;|
+||Selects the projection direction with the best classification performance;|Selects the projection direction in which the sample points have the largest variance;|
+||Has a clearer purpose and better reflects the differences between samples;|Has a vaguer purpose;|
+
+### 2.14.6 What are the advantages and disadvantages of LDA?
+
+| Advantages and disadvantages | Brief description |
+|:-:|:-|
+|Advantages|1. Prior knowledge of the classes can be used; 2. As a supervised dimensionality reduction method that measures differences by class label, it has a clearer purpose than the vaguer PCA and better reflects the differences between samples.|
+|Disadvantages|1. LDA is not suitable for reducing the dimension of samples with a non-Gaussian distribution; 2. LDA can reduce to at most k-1 dimensions; 3. LDA reduces dimension poorly when the class information of the samples depends on the variance rather than the mean; 4. LDA may overfit the data.|
+
+## 2.15 Principal Component Analysis (PCA)
+
+### 2.15.1 Summary of Principal Component Analysis (PCA) Thoughts
+
+1. PCA projects high-dimensional data into a low-dimensional space by a linear transformation.
+2. Projection idea: find the projection that best represents the original data. Ideally, the dimensions dropped by PCA are only those that are noisy or redundant.
+3. De-redundancy: remove linearly dependent vectors that can be represented by other vectors, since the information they carry is redundant.
+4. Denoising: remove the eigenvectors corresponding to the smaller eigenvalues. The magnitude of an eigenvalue reflects how much the data varies along the direction of its eigenvector after the transformation. The larger the magnitude, the larger the differences between elements in that direction, and the more that direction should be preserved.
+5. Diagonalize the matrix, find the maximal linearly independent set, keep the larger eigenvalues and drop the smaller ones, form the projection matrix, and project the original sample matrix to obtain the new sample matrix after dimensionality reduction.
+6. The key to PCA is the covariance matrix.
+The covariance matrix represents both the correlation between different dimensions and the variance within each dimension.
+The covariance matrix measures the relationship between dimensions, not between samples.
+7. The reason for diagonalization is that all off-diagonal elements are 0 afterwards, which achieves the purpose of denoising. In the diagonalized covariance matrix, the smaller variances on the diagonal correspond to the dimensions that should be removed. So we keep only the dimensions that carry more energy (larger eigenvalues) and discard the rest as redundant.
+
+### 2.15.2 Graphical PCA Core Ideas
+PCA addresses the problem of training data with too many or overly cumbersome features. The core idea is to map $n$-dimensional features to $n'$ dimensions ($n' < n$). These $n'$ dimensions are the principal components: orthogonal features reconstructed to represent the original data.
+
+Suppose the data set consists of $m$ samples of dimension $n$, $(x^{(1)}, x^{(2)}, \cdots, x^{(m)})$. Take $n=2$ and suppose we need to reduce the dimension to $n'=1$, so we want a single dimension that represents the two original ones. The figure below shows two vector directions, $u_1$ and $u_2$. Which of them better represents the original data set?
+
+
+
+As the figure shows, $u_1$ is better than $u_2$. Why? There are two main criteria:
+1. The sample points are close enough to this line.
+2. The projections of the sample points onto this line are spread out as much as possible.
+
+If the target dimension is greater than one, the same criteria apply to a hyperplane:
+1. The sample points are close enough to the hyperplane.
+2. The projections of the sample points onto this hyperplane are spread out as much as possible.
+
+### 2.15.3 PCA algorithm reasoning
+The following derivation uses the minimum projection distance as the criterion.
+
+Suppose the data set consists of $m$ samples of dimension $n$, $(x^{(1)}, x^{(2)},...,x^{(m)})$, and that the data has been centered. After the projection transformation, the new coordinate system is $\{w_1, w_2,...,w_n\}$, where the $w_i$ form an orthonormal basis, i.e. $\| w_i \|_2 = 1$ and $w^T_iw_j = 0$ for $i \neq j$.
+
+After dimensionality reduction, the new coordinate system is $\{ w_1,w_2,...,w_{n'} \}$, where $n'$ is the target dimension. The projection of the sample point $x^{(i)}$ in the new coordinate system is $z^{(i)} = \left(z^{(i)}_1, z^{(i)}_2, ..., z^{(i)}_{n'} \right)$, where $z^{(i)}_j = w^T_j x^{(i)}$ is the coordinate of $x^{(i)}$ along the $j$-th dimension of the low-dimensional coordinate system.
+
+If $z^{(i)}$ is used to recover $x^{(i)}$, the recovered data is $\widehat{x}^{(i)} = \sum^{n'}_{j=1} z^{(i)}_j w_j = Wz^{(i)}$, where $W$ is the matrix whose columns are the orthonormal basis vectors.
+
+Considering the whole sample set, requiring the sample points to be close enough to this hyperplane means minimizing $\sum^m_{i=1} \| \hat{x}^{(i)} - x^{(i)} \|^2_2$. Expanding this expression gives:
+$$
+\begin{aligned}
+\sum^m_{i=1} \| \hat{x}^{(i)} - x^{(i)} \|^2_2
+&= \sum^m_{i=1} \| Wz^{(i)} - x^{(i)} \|^2_2 \\
+&= \sum^m_{i=1} \left( Wz^{(i)} \right)^T \left( Wz^{(i)} \right)
+- 2\sum^m_{i=1} \left( Wz^{(i)} \right)^T x^{(i)}
++ \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)} \\
+&= \sum^m_{i=1} \left( z^{(i)} \right)^T \left( z^{(i)} \right)
+- 2\sum^m_{i=1} \left( z^{(i)} \right)^T W^T x^{(i)}
++ \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)} \\
+&= - \sum^m_{i=1} \left( z^{(i)} \right)^T \left( z^{(i)} \right)
++ \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)} \\
+&= -tr \left( W^T \left( \sum^m_{i=1} x^{(i)} \left( x^{(i)} \right)^T \right)W \right)
++ \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)} \\
+&= -tr \left( W^TXX^TW \right)
++ \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)}
+\end{aligned}
+$$
+
+The derivation uses $\hat{x}^{(i)} = Wz^{(i)}$, the matrix transposition rule $(AB)^T = B^TA^T$, $W^TW = I$, $z^{(i)} = W^Tx^{(i)}$, and the properties of the matrix trace. The last two steps convert the algebraic sum into matrix form.
+Each column $w_j$ of $W$ is an orthonormal basis vector, $\sum^m_{i=1} x^{(i)} \left( x^{(i)} \right)^T = XX^T$ is the covariance matrix of the centered data set (up to a constant factor), and $\sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)}$ is a constant. Minimizing $\sum^m_{i=1} \| \hat{x}^{(i)} - x^{(i)} \|^2_2$ is therefore equivalent to
+$$
+\underset{W}{\arg\min} \; -tr \left( W^TXX^TW \right) \quad s.t. \; W^TW = I
+$$
+Using the Lagrangian function, we get
+$$
+J(W) = -tr(W^TXX^TW) + \lambda(W^TW - I)
+$$
+Taking the derivative with respect to $W$ and setting it to zero gives $-XX^TW + \lambda W = 0$, that is, $XX^TW = \lambda W$. So $W$ is a matrix formed by $n'$ eigenvectors of $XX^T$, and $\lambda$ contains the corresponding eigenvalues of $XX^T$. This $W$ is the matrix we want.
+For the original data, computing $z^{(i)} = W^Tx^{(i)}$ reduces the data set to the $n'$-dimensional data set with the minimum projection distance.
+
+The derivation based on the maximum projection variance is not repeated here. Interested readers can consult other references.
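+
+As a small illustration of the eigenvector result above, here is a worked two-dimensional example. The three sample points are hypothetical and chosen only to keep the arithmetic simple; $X$ is assumed to be the $n \times m$ matrix whose columns are the centered samples.
+
+$$
+\begin{aligned}
+X &= \begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \end{pmatrix}, \qquad
+XX^T = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix} \\
+\lambda_1 &= 4, \quad w_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
+\lambda_2 = 0, \quad w_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix} \\
+n' &= 1: \quad W = w_1, \quad z^{(i)} = W^T x^{(i)} \in \{ -\sqrt{2},\ 0,\ \sqrt{2} \} \\
+\sum^3_{i=1} \| Wz^{(i)} - x^{(i)} \|^2_2 &= -tr \left( W^TXX^TW \right) + \sum^3_{i=1} \left( x^{(i)} \right)^T x^{(i)} = -4 + 4 = 0
+\end{aligned}
+$$
+
+Choosing $W = w_2$ instead would give $-0 + 4 = 4$, so in this example the eigenvector with the largest eigenvalue gives the smallest projection distance.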
+
+### 2.15.4 Summary of PCA algorithm flow
+Input: the $n$-dimensional sample set $D = \left( x^{(1)},x^{(2)},...,x^{(m)} \right)$ and the target dimension $n'$.
+
+Output: the new sample set after dimensionality reduction, $D' = \left( z^{(1)}, z^{(2)},...,z^{(m)} \right)$.
+
+The main steps are as follows:
+1. Center all the samples: $x^{(i)} = x^{(i)} - \frac{1}{m} \sum^m_{j=1} x^{(j)}$.
+2. Calculate the covariance matrix of the samples, $XX^T$.
+3. Perform eigenvalue decomposition on the covariance matrix $XX^T$.
+4. Take the eigenvectors $\{ w_1,w_2,...,w_{n'} \}$ corresponding to the largest $n'$ eigenvalues.
+5. Normalize the eigenvectors to get the eigenvector matrix $W$.
+6. Transform each sample in the sample set: $z^{(i)} = W^T x^{(i)}$.
+7. Get the output $D' = \left( z^{(1)},z^{(2)},...,z^{(m)} \right)$.
+
+*Note*: Sometimes the target dimension is not specified; instead a threshold $k$ ($k \in (0,1]$) on the proportion of the principal components is given. If the $n$ eigenvalues are $\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_n$, then $n'$ is the smallest value satisfying $\sum^{n'}_{i=1} \lambda_i \geq k \times \sum^n_{i=1} \lambda_i$.
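+
+As a worked illustration of the threshold rule in the note above, assume the hypothetical eigenvalues $\lambda = (4, 2, 1, 0.5, 0.5)$ and a threshold $k = 0.9$ (both are made up for this example):
+
+$$
+\begin{aligned}
+\sum^5_{i=1} \lambda_i &= 8 \\
+n' = 1 &: \quad \frac{4}{8} = 0.5 < 0.9 \\
+n' = 2 &: \quad \frac{4 + 2}{8} = 0.75 < 0.9 \\
+n' = 3 &: \quad \frac{4 + 2 + 1}{8} = 0.875 < 0.9 \\
+n' = 4 &: \quad \frac{4 + 2 + 1 + 0.5}{8} = 0.9375 \geq 0.9
+\end{aligned}
+$$
+
+So the smallest dimension that meets the threshold here is $n' = 4$.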
+
+### 2.15.5 Main advantages and disadvantages of PCA algorithm
+| Advantages and Disadvantages | Brief Description |
+|:-:|:-|
+|Advantages|1. The amount of information is measured by variance alone, independent of factors outside the data set. 2. The principal components are orthogonal to each other, which eliminates interactions between the components of the original data. 3. The calculation is simple; the main operation is eigenvalue decomposition, which is easy to implement. |
+|Disadvantages|1. The meaning of each principal component dimension is somewhat ambiguous and is less interpretable than the original sample features. 2. Non-principal components with small variance may also contain important information about sample differences, so discarding them may affect subsequent data processing. |
+
+### 2.15.6 Necessity and purpose of dimensionality reduction
+**The need for dimensionality reduction**:
+1. Multicollinearity: the predictor variables are correlated with each other. Multicollinearity makes the solution space unstable, which can lead to incoherent results.
+2. High-dimensional space is inherently sparse. For a one-dimensional normal distribution, about 68% of the values fall within one standard deviation of the mean, whereas in ten-dimensional space only about 0.02% do.
+3. Too many variables make the discovered rules redundant.
+4. Analysis at the level of individual variables alone may miss potential links between variables. For example, several predictors may fall into a group that reflects only one aspect of the data.
+
+**The purpose of dimension reduction**:
+1. Reduce the number of predictors.
+2. Make sure these variables are independent of each other.
+3. Provide a framework for interpreting the results. Closely related features, especially important ones, show up more clearly in the data; with only two or three dimensions, the data is also easier to visualize.
+4. Data is easier to process and use at lower dimensions.
+5. Remove data noise.
+6. Reduce the computational overhead of the algorithm.
+
+### 2.15.7 What is the difference between KPCA and PCA?
+The PCA algorithm assumes that there is a linear hyperplane onto which the data can be projected. What if the data is not linear? In that case KPCA is needed. The data set is mapped from $n$ dimensions to a linearly separable higher dimension $N > n$, and then reduced from $N$ dimensions to a low dimension $n'$.
+
+### 2.16.3 Empirical error and generalization error
+
+Empirical error: also known as training error, the error of the model on the training set.
+
+Generalization error: The error of the model on the new sample set (test set) is called the "generalization error."
+
+### 2.16.4 Under-fitting, over-fitting
+Under-fitting and over-fitting look different on a plot depending on what the axes represent.
+1. **The horizontal axis is the number of training samples and the vertical axis is the error**
+
+
+
+As shown in the figure above, we can visually see the difference between under-fitting and over-fitting:
+
+Under-fitting: the model has a high error on both the training set and the test set; its bias is large.
+
+Over-fitting: the model has a low error on the training set and a high error on the test set; its variance is large.
+
+Normal fit: the model has relatively low bias and variance on both the training set and the test set.
+
+2. **The horizontal axis is the complexity of the model and the vertical axis is the error**
+
+
+
+The red line is the error on the test set, and the blue line is the error on the training set.
+
+Under-fitting: at point A the model has a high error on both the training set and the test set; its bias is large.
+
+Over-fitting: at point C the model has a low error on the training set and a high error on the test set; its variance is large.
+
+Normal fit: the model complexity is optimal at point B.
+
+3. **The horizontal axis is the regular term coefficient and the vertical axis is the error**
+
+
+
+The red line is the error on the test set, and the blue line is the error on the training set.
+
+Under-fitting: at point C the model has a high error on both the training set and the test set; its bias is large.
+
+Over-fitting: at point A the model has a low error on the training set and a high error on the test set; its variance is large. This usually happens when the model is too complex, for example when it has too many parameters, which weakens its predictive performance and makes it more sensitive to fluctuations in the data. Although the model can perform almost perfectly during training, essentially memorizing all the characteristics of the data, its performance on unseen data drops sharply, because such a model generalizes poorly.
+
+Normal fit: the model complexity is optimal at point B.
+
+### 2.16.5 How to solve over-fitting and under-fitting?
+**How to solve the under-fitting:**
+1. Add more features. Combined features, generalized features, correlation features, context features, and platform features are all important ways of adding features. Sometimes too few features cause the model to under-fit.
+2. Add polynomial features. For example, adding quadratic or cubic terms to a linear model makes it more expressive. The FM and FFM models, for instance, are linear models with second-order polynomial terms added to ensure a certain degree of fit.
+3. Increase the complexity of the model.
+4. Reduce the regularization coefficient. Regularization is meant to prevent over-fitting, so when the model is under-fitting, the regularization parameter should be reduced.
+
+**How to solve overfitting:**
+1. Re-clean the data. Impure data can cause over-fitting, in which case the data needs to be cleaned again.
+2. Increase the number of training samples.
+3. Reduce the complexity of the model.
+4. Increase the regularization coefficient.
+5. Use dropout. In layman's terms, dropout makes each neuron inactive with a certain probability during training.
+6. Use early stopping.
+7. Reduce the number of iterations.
+8. Increase the learning rate.
+9. Add noise data.
+10. For tree-structured models, prune the tree.
+
+Which of these remedies for under-fitting and over-fitting to use depends on the actual problem and the actual model.
+
+### 2.16.6 The main role of cross-validation
+Cross-validation evaluates the generalization error of a model and yields an approximation of it, in order to obtain a more robust and reliable model. When there are multiple models to choose from, we usually choose the one with the smallest "generalization error".
+
+There are many ways to cross-validate; the most common are leave-one-out cross-validation and k-fold cross-validation.
+
+### 2.16.7 Understanding k-fold cross validation
+1. Divide the data set of N samples into K parts, each containing N/K samples. Use one part as the test set and the other K-1 parts as the training set; there are K possible choices of test set.
+2. For each choice, train the model on the training set and test it on the test set to calculate the generalization error of the model.
+3. Repeat the cross-validation K times, so that each part is used for validation exactly once.
+4. Average the K generalization errors (or combine them in another way) to obtain a single final estimate of the model's generalization error.
+
+**Note**:
+1. Generally 2<=K<=10. The advantage of k-fold cross-validation is that randomly generated sub-samples are used repeatedly for both training and validation, and each sample is validated exactly once. 10-fold cross-validation is the most commonly used.
+2. The training set should contain enough samples, generally at least 50% of the total number of samples.
+3. The training set and test set must be sampled uniformly from the complete data set. The purpose of uniform sampling is to reduce the bias between the training set, the test set, and the original data set. When the number of samples is large enough, random sampling achieves the effect of uniform sampling.
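+
+The averaging in the steps above can be written out as follows. The data set size, the value of $K$, and the per-fold errors $E_1, ..., E_5$ are hypothetical numbers chosen only for illustration.
+
+$$
+\begin{aligned}
+N = 100,\ K = 5 &: \quad \frac{N}{K} = 20 \text{ test samples and } 80 \text{ training samples per fold} \\
+E_{cv} &= \frac{1}{K} \sum^{K}_{k=1} E_k = \frac{0.10 + 0.12 + 0.08 + 0.11 + 0.09}{5} = 0.10
+\end{aligned}
+$$
+
+With $K = 5$ each model is trained on 80% of the samples, which is consistent with the guideline that the training set should hold at least 50% of the data.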
+
+### 2.16.8 Confusion matrix
+The first layout of the confusion matrix (rows are actual labels, columns are predictions):
+
+| Actual \ Predicted | Predicted positive (1, P) | Predicted negative (0, N) |
+|:-:|:-|:-|
+| Actual label is 1 | TP (predicted 1, actually 1) | FN (predicted 0, actually 1) |
+| Actual label is 0 | FP (predicted 1, actually 0) | TN (predicted 0, actually 0) |
+
+The second layout of the confusion matrix (rows are predictions, columns are actual labels):
+
+| Predicted \ Actual | Actual label is 1 | Actual label is 0 |
+|:-:|:-|:-|
+| Predicted positive (1, P) | TP (predicted 1, actually 1) | FP (predicted 1, actually 0) |
+| Predicted negative (0, N) | FN (predicted 0, actually 1) | TN (predicted 0, actually 0) |
+
+In both layouts, T (true) or F (false) indicates whether the prediction is correct, and P (positive) or N (negative) indicates the predicted class.
+
+### 2.16.9 Error rate and accuracy
+1. Error rate: the proportion of misclassified samples among all samples.
+2. Accuracy: the proportion of correctly classified samples among all samples.
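+
+In terms of the confusion-matrix entries TP, FN, FP, and TN, these two definitions can be written as below. The counts TP = 40, FN = 10, FP = 5, TN = 45 are hypothetical and used only to make the arithmetic concrete.
+
+$$
+\begin{aligned}
+\text{Error rate} &= \frac{FP + FN}{TP + FN + FP + TN} = \frac{5 + 10}{100} = 0.15 \\
+\text{Accuracy} &= \frac{TP + TN}{TP + FN + FP + TN} = \frac{40 + 45}{100} = 0.85 = 1 - \text{Error rate}
+\end{aligned}
+$$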
+
+### 2.16.10 Precision and recall rate
+The results predicted by the algorithm are divided into four cases:
+1. True Positive (TP): predicted positive, actually positive
+2. True Negative (TN): predicted negative, actually negative
+3. False Positive (FP): predicted positive, actually negative
+4. False Negative (FN): predicted negative, actually positive
+
+Then:
+
+Precision = TP / (TP + FP)
+
+**Understanding**: Of the samples predicted to be positive, the proportion that are actually positive. Do not confuse this with accuracy (the proportion of all samples that are predicted correctly, whether as positive or as negative).
+For example, among all the patients we predicted to have malignant tumors, the percentage who actually have malignant tumors; the higher the better.
+
+Recall = TP / (TP + FN)
+
+**Understanding**: Of all the actual positives in the sample, the proportion that are correctly predicted as positive.
+For example, among all patients who actually have malignant tumors, the percentage successfully predicted to have malignant tumors; the higher the better.
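+
+A small worked example may help separate the two metrics. The counts below (TP = 40, FP = 5, FN = 10) are hypothetical:
+
+$$
+Precision = \frac{40}{40 + 5} \approx 0.889, \qquad Recall = \frac{40}{40 + 10} = 0.8
+$$
+
+In this example the model is right about most of the samples it flags as positive, but it still misses one in five of the actual positives.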
+
+### 2.16.11 ROC and AUC
+The full name of the ROC is "Receiver Operating Characteristic".
+
+The area under the ROC curve is the AUC (Area Under the Curve).
+
+AUC is used to measure the performance (generalization ability) of machine learning algorithms on binary classification problems.
+
+The ROC curve is obtained by applying a series of different thresholds to a continuous variable, calculating the true positive rate and false positive rate for each threshold, and then plotting the curve with the false positive rate as the abscissa and the true positive rate as the ordinate. The larger the area under the curve, the higher the diagnostic accuracy. On the ROC curve, the point closest to the upper left corner is the threshold with both a high true positive rate and a low false positive rate.
+
+For classifiers, or classification algorithms, the main evaluation metrics include precision, recall, and F-score. The figure below is an example of a ROC curve.
+
+
+
+The abscissa of the ROC curve is the false positive rate (FPR) and the ordinate is the true positive rate (TPR), where
+$$
+TPR = \frac{TP}{TP+FN} , \quad FPR = \frac{FP}{FP+TN}
+$$
+
+The following focuses on four points in the ROC graph.
+
+- The first point, (0,1): FPR=0 and TPR=1, which means FN (false negative)=0 and FP (false positive)=0. This is a perfect classifier that classifies all samples correctly.
+- The second point, (1,0): FPR=1 and TPR=0. This is the worst classifier, because it successfully avoids all the correct answers.
+- The third point, (0,0): FPR=TPR=0, that is, FP (false positive)=TP (true positive)=0. The classifier predicts all samples as negative.
+- The fourth point, (1,1): FPR=TPR=1. The classifier predicts all samples as positive.
+
+From this analysis, the closer the ROC curve is to the upper left corner, the better the classifier performs.
+
+The area covered by the ROC curve is called AUC (Area Under Curve), which makes it easier to judge the performance of the learner. The larger the AUC, the better the performance.
+
+### 2.16.12 How to draw ROC curve?
+The following figure is an example. There are 20 test samples in the figure. The “Class” column gives the true label of each test sample (p for positive, n for negative), and “Score” gives the probability that each test sample is a positive sample.
+
+Steps:
+1. Assume the probability of each sample being classified as positive has been obtained, and sort the samples by this probability.
+2. From high to low, use each “Score” value in turn as the threshold. When the probability that a test sample is positive is greater than or equal to the threshold, we consider it a positive sample; otherwise it is a negative sample. For example, for the fourth sample in the figure, the "Score" value is 0.6, so samples 1, 2, 3, and 4 are considered positive because their "Score" values are greater than or equal to 0.6, while the other samples are considered negative.
+3. Each threshold yields one pair of FPR and TPR values, which is one point on the ROC curve. In this way, a total of 20 pairs of FPR and TPR values are obtained.
+4. Plot the coordinate points obtained in step 3.
+
+
+
+### 2.16.13 How to calculate TPR, FPR?
+1. The data
+y_true = [0, 0, 1, 1]; scores = [0.1, 0.4, 0.35, 0.8];
+2. List the samples, sorted by score
+
+| Sample | Predicted probability of belonging to P (score) | Real category |
+| ---- | ---------------------- | -------- |
+| y[0] | 0.1 | N |
+| y[2] | 0.35 | P |
+| y[1] | 0.4 | N |
+| y[3] | 0.8 | P |
+
+3. Take each score value in turn as the cutoff point and calculate the TPR and FPR.
+When the cutoff point is 0.1:
+As long as score>=0.1, the predicted category is positive. Since the scores of all four samples are greater than or equal to 0.1, the predicted category of every sample is P.
+scores = [0.1, 0.4, 0.35, 0.8]; y_true = [0, 0, 1, 1]; y_pred = [1, 1, 1, 1];
+The confusion matrix is as follows:
+
+| Actual \ Predicted | Positive example | Negative example |
+| ------------- | ---- | ---- |
+| Positive example | TP=2 | FN=0 |
+| Negative example | FP=2 | TN=0 |
+
+Therefore:
+TPR = TP / (TP + FN) = 1; FPR = FP / (TN + FP) = 1;
+
+When the cutoff point is 0.35:
+scores = [0.1, 0.4, 0.35, 0.8]; y_true = [0, 0, 1, 1]; y_pred = [0, 1, 1, 1];
+The confusion matrix is as follows:
+
+| Actual \ Predicted | Positive example | Negative example |
+| ------------- | ---- | ---- |
+| Positive example | TP=2 | FN=0 |
+| Negative example | FP=1 | TN=1 |
+
+Therefore:
+TPR = TP / (TP + FN) = 1; FPR = FP / (TN + FP) = 0.5;
+
+When the cutoff point is 0.4:
+scores = [0.1, 0.4, 0.35, 0.8]; y_true = [0, 0, 1, 1]; y_pred = [0, 1, 0, 1];
+The confusion matrix is as follows:
+
+| Actual \ Predicted | Positive example | Negative example |
+| ------------- | ---- | ---- |
+| Positive example | TP=1 | FN=1 |
+| Negative example | FP=1 | TN=1 |
+
+Therefore:
+TPR = TP / (TP + FN) = 0.5; FPR = FP / (TN + FP) = 0.5;
+
+When the cutoff point is 0.8:
+scores = [0.1, 0.4, 0.35, 0.8]; y_true = [0, 0, 1, 1]; y_pred = [0, 0, 0, 1];
+
+The confusion matrix is as follows:
+
+| Actual \ Predicted | Positive example | Negative example |
+| ------------- | ---- | ---- |
+| Positive example | TP=1 | FN=1 |
+| Negative example | FP=0 | TN=2 |
+
+Therefore:
+TPR = TP / (TP + FN) = 0.5; FPR = FP / (TN + FP) = 0;
+
+4. Plot the resulting points with FPR on the horizontal axis and TPR on the vertical axis.
+
+### 2.16.14 How to calculate AUC?
+- Sort the coordinate points by their horizontal coordinate, FPR.
+- Calculate the horizontal distance $dx$ between the $i$-th and the $(i+1)$-th coordinate points.
+- Take the ordinate $y$ of the $i$-th (or $(i+1)$-th) coordinate point.
+- Calculate the area element $ds = y\,dx$.
+- Sum the area elements to get the AUC.
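+
+As a sketch of these steps, take the four (FPR, TPR) points computed in section 2.16.13, sorted by FPR (and by TPR where FPR is tied, which is an assumption of this example): $(0, 0.5)$, $(0.5, 0.5)$, $(0.5, 1)$, $(1, 1)$. Writing $x_i$ for the FPR and $y_i$ for the TPR of the $i$-th point, and using the ordinate of the $i$-th point for each step:
+
+$$
+AUC = \sum_i y_i \left( x_{i+1} - x_i \right) = 0.5 \times 0.5 + 0.5 \times 0 + 1 \times 0.5 = 0.75
+$$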
+
+### 2.16.15 Why use ROC and AUC to evaluate the classifier?
+There are many ways to evaluate a model. Why use ROC and AUC?
+Because the ROC curve has a very useful property: when the distribution of positive and negative samples in the test set changes, the ROC curve stays unchanged. In real data sets, class imbalance is common, that is, the numbers of positive and negative samples differ greatly, and the distribution of positive and negative samples in the test data may also change over time.
+
+### 2.16.17 Intuitive understanding of AUC
+The figure below shows the values of the three AUCs:
+
+
+
+AUC is a metric for evaluating binary classification models. It represents the probability that a positive example is ranked ahead of a negative example. Other metrics include precision, accuracy, and recall, but AUC is used more often than these three.
+In classification models, the prediction is generally expressed as a probability. To calculate the accuracy, a threshold is usually set manually to convert the probability into a category, and the choice of threshold greatly affects the calculated accuracy.
+Example:
+
+Now suppose that a trained binary classifier makes predictions for 10 samples (5 positive and 5 negative). The best possible result, sorted by score from high to low, is [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]: the 5 positive examples are all ranked ahead of the 5 negative examples, so the probability that a positive example is ahead of a negative example is 100%. To draw its ROC curve for these 10 samples, we need 10 points in addition to the origin, as follows:
+
+
+
+Traverse the predictions from left to right, in order of score. Starting from the origin, every time a 1 is encountered, move one minimum step in the positive direction of the y-axis, here 1/5=0.2; every time a 0 is encountered, move one minimum step in the positive direction of the x-axis, here also 0.2. It is easy to see that the AUC of the figure above equals 1, which confirms that the probability that a positive example is ahead of a negative example is indeed 100%.
+
+Assume that the prediction result sequence is [1, 1, 1, 1, 0, 1, 0, 0, 0, 0].
+
+
+
+The AUC of the figure above is 0.96, which equals the probability that a positive example is ranked ahead of a negative example: 0.8 × 1 + 0.2 × 0.8 = 0.96. The shaded area in the upper left corner is the probability that a negative example is ranked ahead of a positive example: 0.2 × 0.2 = 0.04.
+
+Assume that the prediction result sequence is [1, 1, 1, 0, 1, 0, 1, 0, 0, 0].
+
+
+
+The AUC of the figure above is 0.88, which equals the probability that a positive example is ranked ahead of a negative example: 0.6 × 1 + 0.2 × 0.8 + 0.2 × 0.6 = 0.88. The shaded area in the upper left corner is the probability that a negative example is ranked ahead of a positive example: 0.2 × 0.2 × 3 = 0.12.
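+
+The same numbers can be obtained by counting pairs directly, which is a common way to state the "positive ahead of negative" interpretation. In the sketch below, $m^+$ and $m^-$ (the numbers of positive and negative examples) and $s(\cdot)$ (the predicted score) are notation introduced here, and ties between scores are assumed not to occur.
+
+$$
+\begin{aligned}
+AUC &= \frac{1}{m^+ m^-} \sum_{x^+} \sum_{x^-} \mathbb{I}\left( s(x^+) > s(x^-) \right) \\
+[1, 1, 1, 1, 0, 1, 0, 0, 0, 0] &: \quad \frac{25 - 1}{25} = 0.96 \\
+[1, 1, 1, 0, 1, 0, 1, 0, 0, 0] &: \quad \frac{25 - 3}{25} = 0.88
+\end{aligned}
+$$
+
+In the sequence with AUC 0.96, one of the $5 \times 5 = 25$ positive-negative pairs is in the wrong order; in the sequence with AUC 0.88, three pairs are.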
+
+### 2.16.18 Cost-sensitive error rate and cost curve
+
+Different errors can have different costs. Taking binary classification as an example, set the cost matrix as follows:
+
+
+
+When the prediction is correct, the cost is 0. When it is incorrect, the cost is $Cost_{01}$ or $Cost_{10}$.
+
+$Cost_{10}$: the cost of predicting a positive example when the sample is actually a negative example.
+
+$Cost_{01}$: the cost of predicting a negative example when the sample is actually a positive example.
+
+**Cost-sensitive error rate** = the sum of the costs of all misclassified samples / the total number of samples.
+Its mathematical expression is:
+$$
+E(f;D;cost)=\frac{1}{m}\left( \sum_{x_{i} \in D^{+}}\mathbb{I}(f(x_i)\neq y_i)\times Cost_{01}+ \sum_{x_{i} \in D^{-}}\mathbb{I}(f(x_i)\neq y_i)\times Cost_{10}\right)
+$$
+$D^{+}$ and $D^{-}$ are the sets of positive and negative examples respectively, and $\mathbb{I}(\cdot)$ is the indicator function, equal to 1 when its argument is true and 0 otherwise.
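+
+As a worked illustration of the cost-sensitive error rate, assume the following hypothetical values: $m = 100$ samples, 10 positive examples misclassified, 5 negative examples misclassified, $Cost_{01} = 5$, and $Cost_{10} = 1$.
+
+$$
+\begin{aligned}
+E(f;D;cost) &= \frac{1}{100} \left( 10 \times 5 + 5 \times 1 \right) = 0.55 \\
+\text{plain error rate} &= \frac{10 + 5}{100} = 0.15
+\end{aligned}
+$$
+
+With $Cost_{01} = Cost_{10} = 1$ the cost-sensitive error rate reduces to the plain error rate; here the higher cost of missed positives dominates the result.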
+
+**Cost curve**:
+Under unequal costs, the ROC curve does not directly reflect the expected overall cost of the model, but the cost curve does.
+The horizontal axis of the cost curve is the positive-example probability cost, with values in [0,1]:
+$$
+P(+)Cost=\frac{p*Cost_{01}}{p*Cost_{01}+(1-p)*Cost_{10}}
+$$
+Where p is the probability that the sample is a positive example.
+
+The vertical axis of the cost curve is the normalized cost, with values in [0,1]:
+$$
+Cost_{norm}=\frac{FNR*p*Cost_{01}+FPR*(1-p)*Cost_{10}}{p*Cost_{01}+(1-p)*Cost_{10}}
+$$
+
+Here FPR is the false positive rate, and FNR=1-TPR is the false negative rate.
+
+Note: Each point on the ROC curve corresponds to a line segment on the cost plane.
+
+For example, for a point (FPR, TPR) on the ROC curve, calculate FNR=1-TPR and draw a line segment from (0, FPR) to (1, FNR) on the cost plane. The area under this line segment represents the expected overall cost under that condition. The area under the lower envelope of all such line segments is the expected overall cost of the learner under all conditions.
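+
+To see why each ROC point becomes a straight line segment, note that the normalized cost is linear in the positive-example probability cost, with FPR weighting the negative-example term and FNR the positive-example term. The numbers below ($p = 0.2$, $Cost_{01} = 5$, $Cost_{10} = 1$, $FNR = 0.2$, $FPR = 0.1$) are hypothetical and serve only as a check.
+
+$$
+\begin{aligned}
+Cost_{norm} &= FNR \times P(+)Cost + FPR \times \left( 1 - P(+)Cost \right) \\
+P(+)Cost &= \frac{0.2 \times 5}{0.2 \times 5 + 0.8 \times 1} = \frac{1}{1.8} \approx 0.556 \\
+Cost_{norm} &\approx 0.2 \times 0.556 + 0.1 \times 0.444 \approx 0.156
+\end{aligned}
+$$
+
+At $P(+)Cost = 0$ the expression equals FPR and at $P(+)Cost = 1$ it equals FNR, which are exactly the endpoints $(0, FPR)$ and $(1, FNR)$ of the segment.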
+
+
+
+### 2.16.19 What are the comparison test methods for the model?
+- Correctness analysis: model stability analysis, robustness analysis, convergence analysis, trend analysis, extreme value analysis, etc.
+- Validity analysis: error analysis, parameter sensitivity analysis, model comparison test, etc.
+- Usefulness analysis: key data solving, extreme points, inflection points, trend analysis, and dynamic simulation for data validation.
+- Efficiency analysis: time and space complexity analysis, compared with existing methods.
+
+### 2.16.21 Why use standard deviation?
+
+The variance formula is: $S^2_{N}=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\bar{x})^{2} $
+
+The standard deviation formula is: $S_{N}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}} $
+
+The sample standard deviation formula is: $S_{N}=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_{i}-\bar{x})^{ 2}} $
+
+Compared with the variance, using the standard deviation to represent the dispersion of data points has three benefits:
+1. The number expressing the dispersion has the same order of magnitude as the sample data points, which makes it easier to form an intuitive feel for the data.
+
+2. The number expressing the dispersion has the same unit as the sample data, which makes subsequent analysis easier.
+
+3. When the sample data roughly follows a normal distribution, the standard deviation allows convenient estimates: about 68% of the data points fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
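+
+As a small worked example, take the hypothetical data set $2, 4, 4, 4, 5, 5, 7, 9$ with $N = 8$ (the values are chosen only so that the arithmetic comes out evenly):
+
+$$
+\begin{aligned}
+\bar{x} &= \frac{40}{8} = 5, \qquad \sum_{i=1}^{N}(x_{i}-\bar{x})^{2} = 9+1+1+1+0+0+4+16 = 32 \\
+S^2_{N} &= \frac{32}{8} = 4, \qquad S_{N} = \sqrt{4} = 2 \\
+\text{sample standard deviation} &= \sqrt{\frac{32}{7}} \approx 2.14
+\end{aligned}
+$$
+
+The variance (4) is in squared units, while the standard deviation (2) is in the same unit and on the same scale as the data.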
+
+### 2.16.25 What causes the category imbalance?
+Class imbalance refers to the case where the number of training examples differs greatly between categories in a classification task.
+
+Cause:
+
+Classification learning algorithms usually assume that the number of training examples is roughly the same for every category. If the numbers differ greatly, the learning process is affected and the test results get worse. For example, suppose a binary classification problem has 998 negative examples and only 2 positive examples. A learning method only needs to return a classifier that always predicts negative to achieve 99.8% accuracy; however, such a classifier has no value because it never identifies a positive example.
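+
+The arithmetic behind this example can be written out as follows. Recall on the positive class is a standard metric that the text above does not name; it is added here only for illustration:
+$$
+\text{accuracy}=\frac{998+0}{998+2}=99.8\%,\qquad \text{recall}_{+}=\frac{0}{2}=0
+$$
+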
+### 2.16.26 Common solutions to the category imbalance problem
+To prevent category imbalance from affecting learning, the imbalance needs to be handled before the classification model is built. The main solutions are:
+
+1. Expand the data set
+
+Collect more data, especially data containing minority class samples; more data provides more information about the distribution.
+
+2. Undersample the majority class
+
+Reduce the number of majority class samples so that it is close to the number of minority class samples.
+Disadvantages: if majority class samples are discarded at random during undersampling, important information may be lost.
+Representative algorithm: EasyEnsemble. The idea is to use an ensemble learning mechanism that divides the majority class into several subsets for use by different learners. Each learner is effectively trained on undersampled data, but no important information is lost overall.
+
+3. Oversample the minority class
+
+Oversampling: resample the minority class to increase its number of samples.
+
+Representative algorithms: SMOTE and ADASYN.
+
+SMOTE: generates additional minority class samples by interpolating between minority class samples in the training set.
+
+Generation strategy for a new minority class sample: for each minority class sample a, randomly select a sample b from the nearest neighbors of a, then randomly select a point on the line segment between a and b as the newly synthesized minority class sample.
+
+ADASYN: weights the minority class samples according to how difficult they are to learn, and generates more synthetic data for the minority class samples that are harder to learn. This improves the data distribution by reducing the bias introduced by class imbalance and by adaptively shifting the classification decision boundary toward the difficult samples.
+
+4. Use new evaluation metrics
+
+If the current evaluation metric is not suitable, look for other, more convincing metrics. For example, accuracy is unsuitable and even misleading in classification tasks with imbalanced categories, so such tasks need more convincing metrics to evaluate the classifier.
+
+5. Choose a different algorithm
+
+Different algorithms suit different tasks and data, so several algorithms should be tried and compared.
+
+6. Cost-sensitive weighting of the data
+
+For example, when the task is to identify the minority class, the classifier can increase the weight of minority class samples and reduce the weight of majority class samples, so that it concentrates on the minority class.
+
+7. Look at the problem from a different angle
+
+For example, treat the minority class samples as outliers and recast the classification problem as outlier detection or change-trend detection. Outlier detection identifies rare events. Change-trend detection differs from outlier detection in that it identifies cases by detecting unusual trends of change.
+
+8. Refine and analyze the problem
+
+Analyze and dig into the problem, divide it into smaller problems, and see whether these smaller problems are easier to solve.
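+
+The SMOTE interpolation step described under oversampling can be sketched as a formula. The symbol $\lambda$ and the example points are assumptions introduced here for illustration:
+$$
+x_{new}=a+\lambda\,(b-a),\qquad \lambda\sim U(0,1)
+$$
+For instance, with $a=(1,2)$, $b=(3,6)$ and a draw of $\lambda=0.25$:
+$$
+x_{new}=(1,2)+0.25\cdot(2,4)=(1.5,\,3)
+$$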
+
+## 2.17 Decision Tree
+
+### 2.17.1 Basic Principles of Decision Trees
+A decision tree is a divide-and-conquer decision-making process. A difficult prediction problem is divided into two or more simpler subsets by the branch nodes of the tree, so that it is structurally split into different sub-problems. The process of splitting the data set by rules is called recursive partitioning. As the depth of the tree increases, the subsets at the branch nodes become smaller and smaller, and the number of questions that still need to be asked gradually decreases. When the depth of a branch node or the simplicity of the problem satisfies a certain stopping rule, the branch node stops splitting. This is the top-down cutoff threshold method; some decision trees also use a bottom-up pruning method.
+
+### 2.17.2 Three elements of the decision tree?
+The generation process of a decision tree is mainly divided into the following three parts:
+
+Feature selection: select one feature from the many features in the training data as the split criterion of the current node. There are many different quantitative criteria for selecting features, and they give rise to different decision tree algorithms.
+
+Decision tree generation: child nodes are generated recursively from top to bottom according to the selected feature evaluation criterion, and the tree stops growing when the data set can no longer be split. For tree structures, recursion is the easiest approach to understand.
+
+Pruning: decision trees overfit easily, so they generally need pruning to reduce the size of the tree and alleviate overfitting. There are two pruning techniques: pre-pruning and post-pruning.
+
+### 2.17.3 Decision Tree Learning Basic Algorithm
+
+
+
+### 2.17.4 Advantages and disadvantages of decision tree algorithms
+
+**Advantages of the decision tree algorithm**:
+
+1. The decision tree algorithm is easy to understand, and its mechanism is simple to explain.
+
+2. The decision tree algorithm can be used on small data sets.
+
+3. The time complexity of prediction is low: it is logarithmic in the number of data points used to train the tree.
+
+4. The decision tree algorithm can handle both numerical and categorical data, whereas many other algorithms can only analyze one type of variable.
+
+5. It can handle multi-output problems.
+
+6. It is not sensitive to missing values.
+
+7. It can handle irrelevant features.
+
+8. It is efficient: the tree only needs to be built once and can then be used repeatedly, and the number of comparisons per prediction does not exceed the depth of the tree.
+
+**Disadvantages of the decision tree algorithm**:
+
+1. Continuous-valued fields are hard to predict.
+
+2. It is prone to overfitting.
+
+3. When there are too many categories, the error may grow quickly.
+
+4. It does not perform well on data with strongly correlated features.
+
+5. When the number of samples differs between categories, the information gain is biased toward features that take more distinct values.
+
+### 2.17.5 The concept of entropy and understanding
+
+Entropy measures the uncertainty of a random variable.
+
+Definition: assume that the possible values of the random variable X are $x_{1}, x_{2},...,x_{n}$, and that each value $x_{i}$ is taken with probability $P(X=x_{i})=p_{i}, i=1, 2,..., n$. The entropy of the random variable is:
+$$
+H(X)=-\sum_{i=1}^{n}p_{i}\log_{2}p_{i}
+$$
+For a sample set, assume that the samples fall into $K$ categories and that the probability of category $k$ is $\frac{|C_{k}|}{|D|}$, where $|C_{k}|$ is the number of samples in category $k$ and $|D|$ is the total number of samples. The entropy of the sample set D is:
+$$
+H(D)=-\sum_{k=1}^{K}\frac{|C_{k}|}{|D|}\log_{2}\frac{|C_{k}|}{|D|}
+$$
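+
+As a worked example (the counts are made up for illustration), consider a sample set of 8 samples with 2 in one category and 6 in the other:
+$$
+H(D)=-\frac{2}{8}\log_{2}\frac{2}{8}-\frac{6}{8}\log_{2}\frac{6}{8}\approx 0.5+0.311=0.811
+$$
+A perfectly balanced set (4 and 4) gives the maximum $H(D)=1$, and a pure set (8 and 0) gives $H(D)=0$, using the convention $0\log_{2}0=0$.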
+
+### 2.17.6 Understanding of Information Gain
+Definition: the difference in entropy before and after the data set is partitioned by a feature.
+
+Entropy represents the uncertainty of the sample set: the larger the entropy, the greater the uncertainty. The difference between the entropy of the set before and after partitioning can therefore be used to measure how well the current feature partitions the sample set D.
+
+Assume that the entropy of the sample set D before partitioning is H(D). The data set D is partitioned by a feature A, and the entropy of the resulting subsets is computed as the conditional entropy $H(D|A)$.
+
+Then the information gain is:
+$$
+g(D,A)=H(D)-H(D|A)
+$$
+Note: when building a decision tree, we want the subsets to become pure as quickly as possible. Therefore, we always choose the feature that maximizes the information gain to partition the current data set D.
+
+Idea: partition the data set D by each feature in turn, compute the information gain of each partition, and select the largest one. The feature that yields the maximum information gain becomes the partitioning feature of the current node.
+
+In addition, here is some related knowledge about the information gain ratio:
+
+Information gain ratio = penalty parameter × information gain.
+
+Essence of the information gain ratio: the information gain is multiplied by a penalty parameter. When a feature takes many distinct values, the penalty parameter is small; when it takes few distinct values, the penalty parameter is large.
+
+Penalty parameter: the reciprocal of the entropy of data set D when feature A is treated as the random variable.
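+
+A worked example may help. It assumes the usual definitions $H(D|A)=\sum_{i}\frac{|D_{i}|}{|D|}H(D_{i})$ and $H_{A}(D)=-\sum_{i}\frac{|D_{i}|}{|D|}\log_{2}\frac{|D_{i}|}{|D|}$, where $D_{i}$ is the subset of $D$ taking the $i$-th value of feature A; these symbols and the counts below are introduced here for illustration. Let $D$ contain 5 positive and 5 negative samples, and let A split it into $D_{1}$ (4 positive, 0 negative) and $D_{2}$ (1 positive, 5 negative):
+$$
+H(D)=1,\qquad H(D_{1})=0,\qquad H(D_{2})=-\frac{1}{6}\log_{2}\frac{1}{6}-\frac{5}{6}\log_{2}\frac{5}{6}\approx 0.650
+$$
+$$
+H(D|A)=\frac{4}{10}\cdot 0+\frac{6}{10}\cdot 0.650\approx 0.390,\qquad g(D,A)\approx 1-0.390=0.610
+$$
+$$
+H_{A}(D)=-\frac{4}{10}\log_{2}\frac{4}{10}-\frac{6}{10}\log_{2}\frac{6}{10}\approx 0.971,\qquad \frac{g(D,A)}{H_{A}(D)}\approx\frac{0.610}{0.971}\approx 0.628
+$$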
+
+### 2.17.7 The role and strategy of pruning treatment?
+Pruning is the method that decision tree learning algorithms use to combat overfitting.
+
+In order to classify the training samples as accurately as possible, the decision tree algorithm repeats the node partitioning process over and over, which sometimes produces too many branches. Peculiarities of the training set are then treated as general properties of the data, which results in overfitting. Pruning removes some branches to reduce this risk.
+
+The basic pruning strategies are pre-pruning and post-pruning.
+
+Pre-pruning: during tree generation, the generalization performance is estimated before each node is split. If splitting the node does not improve generalization performance, the split is not made and the current node is marked as a leaf node.
+
+Post-pruning: after the full decision tree has been generated, the non-leaf nodes are examined from the bottom up. If replacing the subtree rooted at a node with a leaf node improves generalization performance, the subtree is replaced with a leaf node.
+
+## 2.18 Support Vector Machine
+
+### 2.18.1 What is a support vector machine?
+Support vectors: while solving the problem, one finds that the classifier can be determined from only part of the data. These data points are called support vectors.
+
+Support Vector Machine (SVM): a classifier that is determined by its support vectors.
+
+In a two-dimensional setting, points R, S, G and the other points close to the middle black line can be regarded as support vectors; they determine the specific parameters of the classifier, that is, of the black line.
+
+
+
+The support vector machine is a binary classification model. Its purpose is to find a hyperplane that separates the samples. The separation principle is to maximize the margin, and the problem is finally transformed into a convex quadratic programming problem. From simple to complex, the models are:
+
+When the training samples are linearly separable, learn a linearly separable support vector machine by hard margin maximization;
+
+When the training samples are approximately linearly separable, learn a linear support vector machine by soft margin maximization;
+
+When the training samples are not linearly separable, learn a nonlinear support vector machine by using the kernel trick together with soft margin maximization.
+
+### 2.18.2 What problems can the support vector machine solve?
+
+**Linear classification**
+
+In the training data, each sample has n attributes and a binary class label, so we can think of each sample as a point in an n-dimensional space. Our goal is to find an (n-1)-dimensional hyperplane that divides the data into two parts, with all the data in each part belonging to the same category.
+
+There are many such hyperplanes, and we want to find the best one. To do so, a constraint is added: the distance from the hyperplane to the nearest data point on each side must be as large as possible. The result is the maximum margin hyperplane, and the corresponding classifier is the maximum margin classifier.
+
+**Nonlinear classification**
+
+One advantage of SVM is its support for nonlinear classification. By combining the Lagrange multiplier method, the KKT conditions, and a kernel function, it can produce a nonlinear classifier.
+
+### 2.18.3 Features of the kernel function and its role?
+
+The purpose of introducing a kernel function is to map data that is not linearly separable in the original coordinate system into another space, so that the data becomes linearly separable in the new space as far as possible.
+
+The wide application of kernel methods is inseparable from their characteristics:
+
+1) Introducing a kernel function avoids the "curse of dimensionality" and greatly reduces the amount of computation. The dimension n of the input space has no effect on the kernel matrix, so kernel methods can handle high-dimensional input effectively.
+
+2) There is no need to know the form and parameters of the nonlinear transformation function Φ.
+
+3) The change of the form and parameters of the kernel function implicitly changes the mapping from the input space to the feature space, which in turn affects the properties of the feature space, and ultimately changes the performance of various kernel function methods.
+
+4) The kernel function method can be combined with different algorithms to form a variety of different kernel function-based methods, and the design of these two parts can be performed separately, and different kernel functions and algorithms can be selected for different applications.
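+
+Points 1) and 2) can be illustrated with a small worked example. The kernel $k(x,y)=(x^{t}y)^{2}$ on two-dimensional inputs and the example vectors are chosen here for illustration. Expanding the kernel shows that it equals an inner product under the explicit map $\Phi(x)=(x_{1}^{2},\sqrt{2}x_{1}x_{2},x_{2}^{2})$:
+$$
+(x^{t}y)^{2}=(x_{1}y_{1}+x_{2}y_{2})^{2}=x_{1}^{2}y_{1}^{2}+2x_{1}x_{2}y_{1}y_{2}+x_{2}^{2}y_{2}^{2}=\Phi(x)^{t}\Phi(y)
+$$
+With $x=(1,2)$ and $y=(3,1)$, the kernel gives $(1\cdot 3+2\cdot 1)^{2}=25$ directly, and the explicit map gives the same value, so the kernel never needs $\Phi$ itself:
+$$
+\Phi(x)^{t}\Phi(y)=(1,\,2\sqrt{2},\,4)\cdot(9,\,3\sqrt{2},\,1)=9+12+4=25
+$$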
+
+### 2.18.4 Why does SVM introduce dual problems?
+
+1. The dual problem turns the constraints of the original problem into equality constraints in the dual problem, and the dual problem is often easier to solve.
+
+2. The kernel function can be introduced naturally (the Lagrangian expression contains inner products, and the kernel function is also defined through an inner product).
+
+3. In optimization theory, the objective function f(x) can take many forms: if the objective function and the constraints are all linear functions of the variable x, the problem is called linear programming; if the objective function is quadratic and the constraints are linear, it is called quadratic programming; if the objective function or the constraints are nonlinear, it is called nonlinear programming. Every linear programming problem has a corresponding dual problem. The dual problem has very good properties, a few of which are listed here:
+
+a. The dual of the dual problem is the original problem;
+
+b. Whether or not the original problem is convex, the dual problem is a convex optimization problem;
+
+c. The dual problem gives a lower bound on the original problem;
+
+d. When certain conditions are met, the solutions of the original problem and the dual problem are completely equivalent.
+
+### 2.18.5 How to understand the dual problem in SVM
+
+In the hard-margin support vector machine, solving the problem can be transformed into a convex quadratic programming problem.
+
+Assume that the optimization goal is
+$$
+\begin{align}
+&\min_{\boldsymbol w, b}\frac{1}{2}||\boldsymbol w||^2\\
+&s.t. y_i(\boldsymbol w^T\boldsymbol x_i+b)\geq1, i=1,2,\cdots,m.\\
+\end{align} \tag{1}
+$$
+**step 1**. Transform the problem:
+$$
+\min_{\boldsymbol w, b} \max_{\alpha_i \geq 0} \left\{\frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i( 1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))\right\} \tag{2}
+$$
+The above formula is equivalent to the original problem: if the inequality constraints in (1) are satisfied, then every term $\alpha_i(1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))$ must be 0 when the max in (2) is taken, so (2) reduces to (1); if the inequality constraints in (1) are not satisfied, the max in (2) is infinite. Exchanging min and max gives the dual problem:
+$$
+\max_{\alpha_i \geq 0} \min_{\boldsymbol w, b} \left\{\frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i( 1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))\right\}
+$$
+The dual problem obtained after the exchange is not necessarily equal to the original problem: the optimal value of the above formula is less than or equal to the optimal value of the original problem.
+
+**step 2**. The question now is how to find the best lower bound for the optimal value of problem (1). Consider the system:
+$$
+{\frac 1 2}||\boldsymbol w||^2 < v\\
+1 - y_i(\boldsymbol w^T\boldsymbol x_i+b) \leq 0\tag{3}
+$$
+If system (3) has no solution, then v is a lower bound of problem (1). If (3) has a solution, then
+$$
+\forall \boldsymbol \alpha \geq 0 , \ \min_{\boldsymbol w, b} \left\{\frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i (1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))\right\} < v
+$$
+By the contrapositive: if
+$$
+\exists \boldsymbol \alpha \geq 0 , \ \min_{\boldsymbol w, b} \left\{\frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i (1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))\right\} \geq v
+$$
+then (3) has no solution, and therefore v is a lower bound of problem (1).
+
+To obtain the best lower bound, take the maximum:
+$$
+\max_{\alpha_i \geq 0} \min_{\boldsymbol w, b} \left\{\frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i( 1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))\right\}
+$$
+**step 3**. Let
+$$
+L(\boldsymbol w, b,\boldsymbol \alpha) = \frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i(1 - y_i(\boldsymbol w^T \boldsymbol x_i+b))
+$$
+Let $p^*$ be the minimum value of the original problem, attained at $\boldsymbol w^*, b^*$. Then for any $\boldsymbol \alpha \geq 0$:
+$$
+p^* = {\frac 1 2}||\boldsymbol w^*||^2 \geq L(\boldsymbol w^*, b^*,\boldsymbol \alpha) \geq \min_{\boldsymbol w, b} L( \boldsymbol w, b,\boldsymbol \alpha)
+$$
+So $\min_{\boldsymbol w, b} L(\boldsymbol w, b,\boldsymbol \alpha)$ is a lower bound of problem (1).
+
+Taking the maximum over $\boldsymbol \alpha$ then gives the best lower bound, that is,
+$$
+\max_{\alpha_i \geq 0} \min_{\boldsymbol w, b} L(\boldsymbol w, b,\boldsymbol \alpha)
+$$
+
+### 2.18.7 What are the common kernel functions?
+| Kernel function | Expression | Notes |
+| --- | --- | --- |
+| Linear kernel | $k(x,y)=x^{t}y$ | |
+| Polynomial kernel | $k(x,y)=(ax^{t}y+c)^{d}$ | $d\geq 1$ is the degree of the polynomial |
+| Exponential kernel | $k(x,y)=exp(-\frac{\left \|x-y \right \|}{2\sigma ^{2}})$ | $\sigma>0$ |
+| Gaussian kernel | $k(x,y)=exp(-\frac{\left \|x-y \right \|^{2}}{2\sigma ^{2}})$ | $\sigma$ is the bandwidth of the Gaussian kernel, $\sigma>0$ |
+| Laplacian kernel | $k(x,y)=exp(-\frac{\left \|x-y \right \|}{\sigma})$ | $\sigma>0$ |
+| ANOVA kernel | $k(x,y)=exp(-\sigma(x^{k}-y^{k})^{2})^{d}$ | |
+| Sigmoid kernel | $k(x,y)=tanh(ax^{t}y+c)$ | $tanh$ is the hyperbolic tangent function, $a>0,c<0$ |
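+
+To make the table concrete, the following evaluates three of the kernels on one pair of vectors. The vectors $x=(1,2)$, $y=(2,0)$ and the parameter values $a=1$, $c=1$, $d=2$, $\sigma=1$ are arbitrary choices for illustration:
+$$
+x^{t}y=1\cdot 2+2\cdot 0=2,\qquad (ax^{t}y+c)^{d}=(2+1)^{2}=9
+$$
+$$
+\|x-y\|^{2}=(1-2)^{2}+(2-0)^{2}=5,\qquad \exp\left(-\frac{5}{2\cdot 1^{2}}\right)=e^{-2.5}\approx 0.082
+$$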
+
+### 2.18.9 Main features of SVM?
+
+Features:
+
+(1) The theoretical basis of the SVM method is nonlinear mapping, and the SVM uses the inner product kernel function instead of the nonlinear mapping to the high dimensional space;
+
+(2) The goal of SVM is to obtain the optimal hyperplane for feature space partitioning, and the core of SVM method is to maximize the classification margin;
+
+(3) The support vector is the training result of the SVM. The support vector plays a decisive role in the SVM classification decision.
+
+(4) SVM is a novel small-sample learning method with a solid theoretical foundation. It largely avoids probability measures and the law of large numbers, and it also simplifies the usual classification and regression problems.
+
+(5) The final decision function of the SVM is determined by only a few support vectors. The computational complexity depends on the number of support vectors rather than on the dimension of the sample space, which in a sense avoids the "curse of dimensionality".
+
+(6) A few support vectors determine the final result. This not only helps us grasp the key samples and "cull" a large number of redundant samples, but also means that the method is simple and has good "robustness". This "robustness" is mainly reflected in:
+
+1 Adding or deleting non-support-vector samples has no effect on the model;
+
+2 The set of support vector samples has a certain robustness;
+
+3 In some successful applications, the SVM method is not sensitive to the choice of kernel.
+
+(7) The SVM learning problem can be expressed as a convex optimization problem, so known efficient algorithms can find the global minimum of the objective function. Other classification methods (such as rule-based classifiers and artificial neural networks) use greedy strategies to search the hypothesis space, which generally yields only locally optimal solutions.
+
+(8) The SVM controls the capacity of the model by maximizing the margin of the decision boundary. However, the user must still supply other parameters, such as the type of kernel function and the slack variables that are introduced.
+
+(9) SVM can get much better results than other algorithms on small training sets. The SVM optimization objective is to minimize the structural risk rather than the empirical risk, which avoids overfitting. Through the concept of margin, it obtains a structured description of the data distribution, which reduces the requirements on data size and data distribution and gives it excellent generalization ability.
+
+(10) It is a convex optimization problem, so a local optimum is necessarily the global optimum.
+
+### 2.18.9 Main disadvantages of SVM?
+
+(1) The SVM algorithm is difficult to apply to large-scale training sets
+
+The space consumption of the SVM comes mainly from storing the training samples and the kernel matrix. Because the SVM finds the support vectors by quadratic programming, and solving the quadratic program involves computations on a matrix of order m (m is the number of samples), storing and computing this matrix consumes a great deal of memory and computation time when m is large.
+
+If the amount of data is large, SVM training takes a long time. In spam classification, for example, an SVM classifier is usually not used; a simple naive Bayes classifier or a logistic regression model is used instead.
+
+(2) It is difficult to solve multi-class classification problems with SVM
+
+The classical support vector machine algorithm only handles binary classification, but practical applications generally require multi-class classification. This can be solved by combining multiple binary support vector machines, mainly through the one-versus-rest scheme, the one-versus-one scheme, and SVM decision trees. It can also be solved by constructing a combination of several classifiers: the main principle is to overcome the inherent shortcomings of SVM by combining it with the strengths of other algorithms, in order to improve classification accuracy on multi-class problems. For example, combining SVM with rough set theory forms a combined multi-class classifier with complementary advantages.
+
+(3) It is sensitive to missing data and to the choice of parameters and kernel function
+
+The performance of a support vector machine depends mainly on the choice of kernel function, so for a practical problem the question is how to choose a suitable kernel function for the actual data when constructing the SVM. At present, the more established kernel functions and their parameters are chosen manually, based on experience, which involves a degree of arbitrariness. The kernel function should have different forms and parameters in different problem domains, so domain knowledge should be brought in when choosing it, but there is still no good general method for kernel function selection.
+
+### 2.18.10 Similarities and differences between logistic regression and SVM
+
+Similarities:
+
+- Both LR and SVM are **classification** algorithms.
+- Both LR and SVM are **supervised learning** algorithms.
+- Both LR and SVM are **discriminative models**.
+- If the kernel function is not considered, both LR and SVM are **linear classification** algorithms, which means that their decision surfaces are linear.
+ Note: LR can also use a kernel function, but it usually does not (**the amount of computation is too large**).
+
+Differences:
+
+**1. LR uses log loss, SVM uses hinge loss.**
+Logistic regression loss function:
+$$
+J(\theta)=-\frac{1}{m}\left[\sum^m_{i=1}y^{(i)}logh_{\theta}(x^{(i)})+ ( 1-y^{(i)})log(1-h_{\theta}(x^{(i)}))\right]
+$$
+The objective function of the support vector machine:
+$$
+L(w,b,\alpha)=\frac{1}{2}||w||^2-\sum^n_{i=1}\alpha_i \left(y_i(w^Tx_i+b)-1 \right)
+$$
+Logistic regression is based on probability theory: the probability that a sample belongs to class 1 is represented by the sigmoid function, and the parameter values are then estimated by maximum likelihood estimation.
+
+The support vector machine is based on the principle of maximizing the geometric margin: the classification plane with the largest geometric margin is considered the optimal classification plane.
+
+**2. LR is sensitive to outliers, whereas SVM is not.**
+
+The support vector machine only considers the points near the local boundary, while logistic regression considers all points globally. The hyperplane found by the LR model tries to keep all points far away from it, whereas the hyperplane that the SVM looks for only keeps the points closest to the dividing line as far away as possible, that is, only the support vector samples.
+
+In a support vector machine, changing non-support-vector samples does not change the decision surface.
+
+In logistic regression, changing any sample can change the decision surface.
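+
+The difference in sensitivity can be seen from the two losses written as functions of the margin $z=y\,f(x)$. This form assumes labels $y\in\{-1,+1\}$ and a linear score $f(x)=w^{T}x+b$, a notation introduced here for illustration rather than taken from the formulas above:
+$$
+\ell_{hinge}(z)=\max(0,\,1-z),\qquad \ell_{log}(z)=\ln(1+e^{-z})
+$$
+For a correctly classified point far from the boundary, say $z=2$, the hinge loss is exactly $0$ while the log loss is $\ln(1+e^{-2})\approx 0.127$; for a misclassified point with $z=-1$ they are $2$ and $\ln(1+e)\approx 1.313$. Points with $z\geq 1$ therefore contribute nothing to the hinge loss, whereas every point contributes to the log loss.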
+
+**3. The computational complexity is different. For massive data, SVM is less efficient and LR is more efficient.**
+
+When the number of samples is small and the feature dimension is low, both SVM and LR run quickly, with SVM being somewhat faster, while the accuracy of LR is significantly higher than that of SVM. When the number of samples increases slightly, the SVM running time begins to grow, but its accuracy surpasses that of LR; although the SVM takes longer, the time is still within an acceptable range. When the amount of data grows to 20,000 samples and the feature dimension increases to 200, the running time of the SVM increases dramatically and far exceeds that of LR, while its accuracy is almost the same as that of LR. (The main reason is that a large number of non-support vectors take part in the computation, which burdens the SVM's quadratic programming problem.)
+
+**4. They handle nonlinear problems differently.** LR mainly relies on feature engineering and must use feature crosses and feature discretization. SVM can do the same, but it can also use a kernel (because only the support vectors take part in the kernel computation, the computational complexity is not high). Because a kernel function can be used, the SVM can be solved efficiently through the dual problem, whereas LR performs poorly when the dimension of the feature space is high.
+
+**5. The SVM loss function comes with its own regularization term** (the $\frac{1}{2}||w||^2$ in the loss function), which is why SVM is a structural risk minimization algorithm. LR, in contrast, requires a regularization term to be added to the loss function explicitly.
+
+6. SVM comes with **structural risk minimization**, whereas LR is **empirical risk minimization**.
+
+7. SVM uses a kernel function, whereas LR generally does not use a [kernel function](https://www.cnblogs.com/huangyc/p/9940487.html).
+
+## 2.19 Bayesian classifier
+### 2.19.1 Graphical Maximum Likelihood Estimation
+
+The principle of maximum likelihood estimation is illustrated by a picture, as shown in the following figure:
+
+
+
+Example: there are two boxes of the same shape. Box 1 contains 99 white balls and 1 black ball; box 2 contains 1 white ball and 99 black balls. In one experiment, a black ball is drawn. Which box was it drawn from?
+
+Based on experience, one would guess that the black ball was "most likely" drawn from box 2. This "most likely" is exactly what "maximum likelihood" means, and the idea is commonly called the "maximum likelihood principle".
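+
+In terms of probabilities, the two candidate explanations for the observed black ball compare as follows:
+$$
+P(\text{black}\mid\text{box 1})=\frac{1}{100}=0.01,\qquad P(\text{black}\mid\text{box 2})=\frac{99}{100}=0.99
+$$
+Box 2 makes the observation 99 times more probable, so it is the maximum likelihood choice.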
+
+### 2.19.2 Principle of Maximum Likelihood Estimation
+
+In summary, the purpose of maximum likelihood estimation is to use the known sample results to infer the parameter values that most likely (with maximum probability) led to those results.
+
+Maximum likelihood estimation is a statistical method based on the maximum likelihood principle. It provides a way to estimate the model parameters given the observed data, that is: "the model is determined and the parameters are unknown". The results of several experiments are observed, and the parameter value that maximizes the probability of the observed sample is taken as the estimate; this is called maximum likelihood estimation.
+
+Since the samples in the sample set are independent and identically distributed, the parameter vector θ can be estimated by considering only the sample set D of a single class. The known sample set is:
+$$
+D=\{x_{1}, x_{2},...,x_{N}\}
+$$
+Likelihood function: the joint probability density function $p(D|\theta )$ is called the likelihood function of θ with respect to $x_{1}, x_{2},...,x_{N}$.
+$$
+l(\theta )=p(D|\theta ) =p(x_{1},x_{2},...,x_{N}|\theta )=\prod_{i=1}^{N} p(x_{i}|\theta )
+$$
+If $\hat{\theta}$ is the value of θ in the parameter space that maximizes the likelihood function $l(\theta)$, then $\hat{\theta}$ is the "most likely" parameter value and is the maximum likelihood estimator of θ. It is a function of the sample set and is written as:
+$$
+\hat{\theta}=d(x_{1},x_{2},...,x_{N})=d(D)
+$$
+$\hat{\theta}(x_{1}, x_{2},...,x_{N})$ is called the maximum likelihood estimate.
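+
+As a worked example, assume a Bernoulli model (an assumption made here for illustration) in which each $x_{i}\in\{0,1\}$ and $p(x_{i}=1|\theta)=\theta$, and let $k$ be the number of samples equal to 1. Then
+$$
+l(\theta)=\theta^{k}(1-\theta)^{N-k},\qquad \ln l(\theta)=k\ln\theta+(N-k)\ln(1-\theta)
+$$
+Setting the derivative to zero gives the estimate:
+$$
+\frac{d\ln l(\theta)}{d\theta}=\frac{k}{\theta}-\frac{N-k}{1-\theta}=0\ \Rightarrow\ \hat{\theta}=\frac{k}{N}
+$$
+For example, $k=7$ samples equal to 1 out of $N=10$ give $\hat{\theta}=0.7$.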
+
+### 2.19.3 Basic Principles of Bayesian Classifier
+
+https://www.cnblogs.com/hxyue/p/5873566.html
+
+https://www.cnblogs.com/super-zhang-828/p/8082500.html
+
+Bayesian decision theory uses the **known relevant probabilities** and the **misclassification loss** to select the optimal class label.
+
+Suppose there are $N$ possible class labels, denoted $Y=\{c_1,c_2,...,c_N\}$. Which category does a sample $x$ belong to?
+
+The calculation steps are as follows:
+
+Step 1. Calculate the probability that the sample $x$ belongs to class $c_i$, i.e. $P(c_i|x)$;
+
+Step 2. Compare all $P(c_i|\boldsymbol{x})$ to obtain the best category for the sample $x$;
+
+Step 3. Compute $P(c_i|\boldsymbol{x})$ by substituting the category $c_i$ and the sample $x$ into Bayes' formula:
+$$
+P(c_i|\boldsymbol{x})=\frac{P(\boldsymbol{x}|c_i)P(c_i)}{P(\boldsymbol{x})}.
+$$
+In general, $P(c_i)$ is the prior probability, $P(\boldsymbol{x}|c_i)$ is the class-conditional probability, and $P(\boldsymbol{x})$ is the evidence factor used for normalization. $P(c_i)$ can be estimated as the proportion of training samples with category $c_i$. In addition, since we only need to find the largest $P(c_i|\boldsymbol{x})$ and $P(\boldsymbol{x})$ is the same for every category, it is not necessary to calculate $P(\boldsymbol{x})$.
+
+To solve for the conditional probability, different methods have been proposed based on different assumptions. The naive Bayes classifier and the semi-naive Bayes classifier are introduced below.
+
+### 2.19.4 Naive Bayes Classifier
+
+Suppose the sample $\boldsymbol{x}$ contains $d$ attributes, i.e. $\boldsymbol{x}=\{ x_1,x_2,...,x_d\}$. Then
+$$
+P(\boldsymbol{x}|c_i)=P(x_1,x_2,\cdots,x_d|c_i).
+$$
+This joint probability is difficult to estimate directly from a limited training set. Naive Bayes (NB) therefore adopts the "attribute conditional independence assumption": given the class, all attributes are assumed to be independent of each other. Then
+$$
+P(x_1,x_2,\cdots,x_d|c_i)=\prod_{j=1}^d P(x_j|c_i).
+$$
+In this case, the corresponding decision rule follows directly:
+$$
+H_{nb}(\boldsymbol{x})=\mathop{\arg \max}_{c_i\in Y} P(c_i)\prod_{j=1}^dP(x_j|c_i).
+$$
+**Solution of conditional probability $P(x_j|c_i) $**
+
+If $x_j$ is a categorical (discrete) attribute, $P(x_j|c_i)$ can be estimated by counting:
+$$
+P(x_j|c_i)=\frac{P(x_j,c_i)}{P(c_i)}\approx\frac{\#(x_j,c_i)}{\#(c_i)}.
+$$
+where $\#(x_j,c_i)$ is the number of training samples in which the value $x_j$ and the class $c_i$ occur together, and $\#(c_i)$ is the number of training samples of class $c_i$.
+
+If $x_j$ is a numeric attribute, we usually assume that the $j$-th attribute of all samples in class $c_i$ follows a normal distribution. We first estimate the mean $\mu$ and the variance $\sigma^2$ of this distribution, and then evaluate the probability density $P(x_j|c_i)$ of $x_j$ under it.
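+
+For reference, the density used for numeric attributes can be written out explicitly. This is the standard univariate Gaussian density; the symbols $\mu_{c_i,j}$ and $\sigma_{c_i,j}^2$ (the mean and variance of attribute $j$ within class $c_i$) are notation introduced here for illustration:
+$$
+p(x_j|c_i)=\frac{1}{\sqrt{2\pi}\,\sigma_{c_i,j}}\exp\left(-\frac{(x_j-\mu_{c_i,j})^2}{2\sigma_{c_i,j}^2}\right).
+$$
+Note that this is a density rather than a probability, so its value can exceed 1.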
+
+### 2.19.5 An example to understand the naive Bayes classifier
+
+The training set is as follows:
+
+https://www.cnblogs.com/super-zhang-828/p/8082500.html
+
+| No. | Color | Root | Knock sound | Texture | Umbilicus | Touch | Density | Sugar content | Good melon |
+| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
+| 1 | Green | Curled | Muffled | Clear | Sunken | Hard-smooth | 0.697 | 0.460 | Yes |
+| 2 | Black | Curled | Dull | Clear | Sunken | Hard-smooth | 0.774 | 0.376 | Yes |
+| 3 | Black | Curled | Muffled | Clear | Sunken | Hard-smooth | 0.634 | 0.264 | Yes |
+| 4 | Green | Curled | Dull | Clear | Sunken | Hard-smooth | 0.608 | 0.318 | Yes |
+| 5 | White | Curled | Muffled | Clear | Sunken | Hard-smooth | 0.556 | 0.215 | Yes |
+| 6 | Green | Slightly curled | Muffled | Clear | Slightly sunken | Soft-sticky | 0.403 | 0.237 | Yes |
+| 7 | Black | Slightly curled | Muffled | Slightly blurry | Slightly sunken | Soft-sticky | 0.481 | 0.149 | Yes |
+| 8 | Black | Slightly curled | Muffled | Clear | Slightly sunken | Hard-smooth | 0.437 | 0.211 | Yes |
+| 9 | Black | Slightly curled | Dull | Slightly blurry | Slightly sunken | Hard-smooth | 0.666 | 0.091 | No |
+| 10 | Green | Stiff | Crisp | Clear | Flat | Soft-sticky | 0.243 | 0.267 | No |
+| 11 | White | Stiff | Crisp | Blurry | Flat | Hard-smooth | 0.245 | 0.057 | No |
+| 12 | White | Curled | Muffled | Blurry | Flat | Soft-sticky | 0.343 | 0.099 | No |
+| 13 | Green | Slightly curled | Muffled | Slightly blurry | Sunken | Hard-smooth | 0.639 | 0.161 | No |
+| 14 | White | Slightly curled | Dull | Slightly blurry | Sunken | Hard-smooth | 0.657 | 0.198 | No |
+| 15 | Black | Slightly curled | Muffled | Clear | Slightly sunken | Soft-sticky | 0.360 | 0.370 | No |
+| 16 | White | Curled | Muffled | Blurry | Flat | Hard-smooth | 0.593 | 0.042 | No |
+| 17 | Green | Curled | Dull | Slightly blurry | Slightly sunken | Hard-smooth | 0.719 | 0.103 | No |
+
+The following test sample, "Test 1", is to be classified:
+
+| No. | Color | Root | Knock sound | Texture | Umbilicus | Touch | Density | Sugar content | Good melon |
+| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
+| Test 1 | Green | Curled | Muffled | Clear | Sunken | Hard-smooth | 0.697 | 0.460 | ? |
+
+First, estimate the class prior probability $P(c_i)$:
+$$
+\begin{align}
+&P(\text{good melon}=\text{yes}) = \frac{8}{17}\approx 0.471 \newline
+&P(\text{good melon}=\text{no}) = \frac{9}{17}\approx 0.529
+\end{align}
+$$
+Then, estimate the conditional probability for each attribute (here, for continuous attributes, assume they follow a normal distribution)
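+
+As a partial worked illustration of this step (assuming, consistently with the prior $8/17$ above and the original watermelon data set, that melons 1 to 8 are the good ones), the values below were recomputed from the training table and should be read as a sketch rather than as the source's own figures:
+$$
+\begin{align}
+&P_{\text{green}|\text{yes}}=\frac{3}{8}=0.375, \qquad P_{\text{green}|\text{no}}=\frac{3}{9}\approx 0.333 \newline
+&\mu_{\text{density}|\text{yes}}\approx 0.574, \qquad \sigma_{\text{density}|\text{yes}}\approx 0.129 \newline
+&p_{\text{density}:0.697|\text{yes}}=\frac{1}{\sqrt{2\pi}\cdot 0.129}\exp\left(-\frac{(0.697-0.574)^2}{2\cdot 0.129^2}\right)\approx 1.96
+\end{align}
+$$
+Here $\sigma$ is the sample standard deviation (dividing by $n-1$); the remaining attributes are handled in the same way.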
+
+
+
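+Then we have
+$$
+\begin{align}
+P(&\text{good melon}=\text{yes})\times P_{\text{green}|\text{yes}} \times P_{\text{curled}|\text{yes}} \times P_{\text{muffled}|\text{yes}} \times P_{\text{clear}|\text{yes}} \times P_{\text{sunken}|\text{yes}} \newline
+&\times P_{\text{hard-smooth}|\text{yes}} \times p_{\text{density}:0.697|\text{yes}} \times p_{\text{sugar}:0.460|\text{yes}} \approx 0.063 \newline\newline
+P(&\text{good melon}=\text{no})\times P_{\text{green}|\text{no}} \times P_{\text{curled}|\text{no}} \times P_{\text{muffled}|\text{no}} \times P_{\text{clear}|\text{no}} \times P_{\text{sunken}|\text{no}} \newline
+&\times P_{\text{hard-smooth}|\text{no}} \times p_{\text{density}:0.697|\text{no}} \times p_{\text{sugar}:0.460|\text{no}} \approx 6.80\times 10^{-5}
+\end{align}
+$$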
+
+Since $0.063>6.80\times 10^{-5}$, the naive Bayes classifier classifies the test sample as a good melon.
+
+### 2.19.6 Semi-naive Bayes Classifier
+
+Naive Bayes adopts the "attribute conditional independence assumption". The basic idea of the semi-naive Bayes classifier is to take into account the interdependence between some of the attributes. The **One-Dependent Estimator (ODE)** is one of the most commonly used strategies for semi-naive Bayes classifiers. As the name implies, one-dependence assumes that each attribute depends on at most one other attribute besides the class, i.e.
+$$
+P(x|c_i)=\prod_{j=1}^d P(x_j|c_i,{\rm pa}_j).
+$$
+where ${\rm pa}_j$ is the attribute on which $x_j$ depends, called the parent attribute of $x_j$. Assuming the parent attribute ${\rm pa}_j$ is known, $P(x_j|c_i,{\rm pa}_j)$ can be estimated with the following formula:
+$$
+P(x_j|c_i,{\rm pa}_j)=\frac{P(x_j,c_i,{\rm pa}_j)}{P(c_i,{\rm pa}_j)}.
+$$
+
+## 2.20 EM algorithm
+
+### 2.20.1 The basic idea of EM algorithm
+
+The Expectation-Maximization (EM) algorithm is an optimization algorithm that performs maximum likelihood estimation by iteration. It is usually used as an alternative to Newton's method for estimating the parameters of probabilistic models that contain hidden variables or missing data.
+
+The basic idea of the EM algorithm is to alternate between two steps:
+
+The first step is the expectation (E) step: using the current parameter estimates, compute the expectation of the log-likelihood with respect to the hidden variables;
+
+The second step is the maximization (M) step: maximize the expected log-likelihood found in the E step to compute new values of the parameters.
+
+The parameter estimates found in the M step are then used in the next E step, and the two steps alternate until convergence.
+
+### 2.20.2 EM algorithm derivation
+
+Given $m$ observed samples $x=(x^{(1)}, x^{(2)},...,x^{(m)})$, we want to find the model parameter $\theta$ that maximizes the log-likelihood of the model distribution:
+$$
+\theta = \arg \max \limits_{\theta}\sum\limits_{i=1}^m \log P(x^{(i)};\theta)
+$$
+If the observed data come with unobserved hidden data $z=(z^{(1)}, z^{(2)},...,z^{(m)})$, the log-likelihood to be maximized becomes:
+$$
+\theta = \arg \max \limits_{\theta}\sum\limits_{i=1}^m \log P(x^{(i)};\theta) = \arg \max \limits_{\theta}\sum\limits_{i=1}^m \log\sum\limits_{z^{(i)}}P(x^{(i)}, z^{(i)};\theta) \tag{a}
+$$
+Since $\theta$ cannot be found directly from the above formula, we bound it from below:
+$$
+\begin{align} \sum\limits_{i=1}^m \log\sum\limits_{z^{(i)}}P(x^{(i)}, z^{(i)};\theta) & = \sum\limits_{i=1}^m \log\sum\limits_{z^{(i)}}Q_i(z^{(i)})\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} \\ & \geq \sum\limits_{i=1}^m \sum\limits_{z^{(i)}}Q_i(z^{(i)})\log\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} \end{align} \tag{1}
+$$
+The above formula uses Jensen's inequality for the concave logarithm:
+$$
+\log\sum\limits_j\lambda_jy_j \geq \sum\limits_j\lambda_j\log y_j\;\;, \lambda_j \geq 0, \sum\limits_j\lambda_j =1
+$$
+and introduces a new, as yet unspecified distribution $Q_i(z^{(i)})$.
+
+For equality to hold in Jensen's inequality, the ratio must be constant:
+$$
+\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} =c, \quad c \text{ is a constant}
+$$
+Since $Q_i(z^{(i)})$ is a distribution, it satisfies
+$$
+\sum\limits_{z^{(i)}}Q_i(z^{(i)}) =1
+$$
+Combining the two, we get:
+$$
+Q_i(z^{(i)}) = \frac{P(x^{(i)}, z^{(i)};\theta)}{\sum\limits_{z^{(i)}}P(x^{(i)}, z^{(i)};\theta)} = \frac{P(x^{(i)}, z^{(i)};\theta)}{P(x^{(i)};\theta)} = P( z^{(i)}|x^{(i)};\theta)
+$$
+If $Q_i(z^{(i)}) = P( z^{(i)}|x^{(i)};\theta)$, then equation (1) gives a lower bound on the log-likelihood that includes the hidden data. If we maximize this lower bound, we are also pushing up the log-likelihood. That is, we need to maximize the following:
+$$
+\arg \max \limits_{\theta} \sum\limits_{i=1}^m \sum\limits_{z^{(i)}}Q_i(z^{(i)})\log\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})}
+$$
+With $Q_i$ held fixed, the term $-\sum\limits_{z^{(i)}}Q_i(z^{(i)})\log Q_i(z^{(i)})$ does not depend on $\theta$, so this simplifies to:
+$$
+\arg \max \limits_{\theta} \sum\limits_{i=1}^m \sum\limits_{z^{(i)}}Q_i(z^{(i)})\log{P(x^{(i)}, z^{(i)};\theta)}
+$$
+The above is the M step of the EM algorithm. $\sum\limits_{z^{(i)}}Q_i(z^{(i)})\log{P(x^{(i)}, z^{(i)};\theta)}$ can be understood as the expectation of $\log P(x^{(i)}, z^{(i)};\theta)$ under the conditional distribution $Q_i(z^{(i)})$, and computing it is the E step. This is the mathematical meaning of the E step and the M step in the EM algorithm.
+
+### 2.20.3 Graphical EM algorithm
+
+Formula (a) in the previous section contains hidden variables, so it is difficult to estimate the parameters directly. The EM algorithm instead iteratively maximizes a lower bound until convergence.
+
+
+
+The purple curve in the figure is our target model $p(x|\theta)$. This model is complex and has no easy analytical solution. To eliminate the influence of the hidden variable $z^{(i)}$, we can choose a model $r(x|\theta)$ that does not contain $z^{(i)}$ and satisfies $r(x|\theta) \leq p(x|\theta)$.
+
+The solution steps are as follows:
+
+(1) Select $\theta_1$ so that $r(x|\theta_1) = p(x|\theta_1)$, then maximize $r$ to obtain its maximum point $\theta_2$, which becomes the updated parameter.
+
+(2) Repeat the above process until convergence, always keeping $r \leq p$.
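+
+The reason this procedure never decreases the likelihood can be sketched in one chain of inequalities. Writing $\ell(\theta)$ for the log-likelihood and $B(\theta;\theta^{j})$ for the lower bound (1) built with $Q_i(z^{(i)})=P(z^{(i)}|x^{(i)};\theta^{j})$ (both symbols are shorthand introduced here, not the source's notation):
+$$
+\ell(\theta^{j+1}) \geq B(\theta^{j+1};\theta^{j}) \geq B(\theta^{j};\theta^{j}) = \ell(\theta^{j})
+$$
+The first inequality is Jensen's inequality, the second holds because $\theta^{j+1}$ maximizes $B(\cdot\,;\theta^{j})$, and the final equality holds because the bound is tight at $\theta^{j}$.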
+
+### 2.20.4 EM algorithm flow
+
+Input: observed data $x=(x^{(1)}, x^{(2)},...,x^{(m)})$, joint distribution $p(x,z;\theta)$, conditional distribution $p(z|x;\theta)$, maximum number of iterations $J$
+
+1) Randomly initialize the model parameter $\theta$ to $\theta^0$.
+
+2) $for \ j \ from \ 0 \ to \ J-1$:
+
+a) E step. Compute the conditional distribution of the hidden data and the expectation of the joint log-likelihood:
+$$
+Q_i(z^{(i)}) = P( z^{(i)}|x^{(i)};\theta^{j})
+$$
+
+$$
+L(\theta, \theta^{j}) = \sum\limits_{i=1}^m\sum\limits_{z^{(i)}}Q_i(z^{(i)})\log{P(x^{(i)}, z^{(i)};\theta)}
+$$
+
+b) M step. Maximize $L(\theta, \theta^{j})$ to obtain $\theta^{j+1}$:
+$$
+\theta^{j+1} = \arg \max \limits_{\theta}L(\theta, \theta^{j})
+$$
+c) If $\theta^{j+1}$ has converged, the algorithm ends. Otherwise, return to step a) for the next E step.
+
+Output: model parameter $\theta$.
+
+## 2.21 Dimensionality reduction and clustering
+
+### 2.21.1 An illustration of why the curse of dimensionality arises
+
+http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/
+
+Suppose the data set contains 10 photos, each showing either a triangle or a circle. We now design a classifier and train it so that it classifies other photos correctly (assuming the total number of triangles and circles is infinite). To keep things simple, we first use a single feature for classification:
+
+
+
+
+
+Figure 2.21.1.a
+
+As the figure above shows, with only one feature the triangles and circles are spread almost evenly along the line segment, and it is difficult to separate the 10 photos linearly. So what happens when we add a feature?
+
+
+
+Figure 2.21.1.b
+
+After adding a feature, we still cannot find a straight line that separates the triangles from the circles. So consider adding one more feature:
+
+
+
+Figure 2.21.1.c
+
+
+
+Figure 2.21.1.d
+
+At this point, we can find a plane that separates the triangles from the circles.
+
+Now compute the sample density for different numbers of features:
+
+(1) With one feature, assuming the feature space is a line segment of length 5, the sample density is 10/5=2.
+
+(2) With two features, the feature space size is 5\*5=25 and the sample density is 10/25=0.4.
+
+(3) With three features, the feature space size is 5\*5\*5=125 and the sample density is 10/125=0.08.
+
+By analogy, as the number of features keeps increasing, the samples become more and more sparse, which makes it easier to find a hyperplane that separates the training samples. As the number of features grows to infinity, the sample density approaches zero.
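+
+The three cases above follow one pattern. As a sketch, assuming the same 10 samples and a range of length 5 per feature, the density with $d$ features (the symbol $\rho(d)$ is introduced here for illustration) is
+$$
+\rho(d)=\frac{10}{5^{d}},\qquad \rho(1)=2,\quad \rho(2)=0.4,\quad \rho(3)=0.08,\quad \rho(4)=\frac{10}{625}=0.016
+$$
+so every additional feature divides the density by 5.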
+
+Let's look at what happens when the classification result in the high-dimensional space is projected back to a low-dimensional space.
+
+
+
+Figure 2.21.1.e
+
+The figure above shows the result of projecting the 3D feature space onto the 2D feature space. Although the training samples are linearly separable in the high-dimensional feature space, this is no longer true after projection to the low-dimensional space. In fact, adding features to make the data linearly separable in a high-dimensional space is equivalent to training a complex nonlinear classifier in the low-dimensional space. However, such a nonlinear classifier is too "smart": it learns a few special cases of the training set. When it is used on test samples that did not appear in the training set, the results are usually poor, which is the over-fitting problem.
+
+
+
+Figure 2.21.1.f
+
+The linear classifier with only two features shown in the figure above misclassifies some training samples, and its accuracy does not seem to be as high as in Figure 2.21.1.e. However, the generalization ability of the linear classifier with two features is stronger than that of the linear classifier with three features, because it learns an overall trend rather than special cases and can therefore better distinguish samples it has never seen. In other words, reducing the number of features helps avoid over-fitting and thereby the "curse of dimensionality".
+
+
+
+The "curse of dimensionality" can also be explained from another perspective. Assume there is only one feature with range 0 to 1, and that the feature value of every triangle and circle is unique. If we want the training samples to cover 20% of the feature range, we need 20% of the total number of triangles and circles. After adding a second feature, 45% of the total in each dimension ($0.45^2 \approx 0.2$) is needed to keep covering 20% of the feature space. After adding a third feature, 58% in each dimension ($0.58^3 \approx 0.2$) is required. As the number of features increases, more and more training samples are needed to cover 20% of the feature space. If there are not enough training samples, over-fitting may occur.
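+
+The percentages above come from one relation. As a sketch (the symbols $s$ and $d$ are introduced here): to cover a fraction $0.2$ of a $d$-dimensional unit feature space with a hypercube, the side length $s$ along each feature must satisfy $s^{d}=0.2$, so
+$$
+s=0.2^{1/d},\qquad 0.2^{1/2}\approx 0.45,\quad 0.2^{1/3}\approx 0.58,\quad 0.2^{1/10}\approx 0.85
+$$
+With 10 features, about 85% of each feature's range is already needed.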
+
+The example above shows that the more features there are, the sparser the training samples become, the less accurate the classifier's parameter estimates are, and the more likely over-fitting is. Another effect of the "curse of dimensionality" is that the sparsity of the training samples is not uniform: training samples near the center are sparser than those around it.
+
+
+
+Suppose there is a two-dimensional feature space, such as the rectangle shown in Figure 8, with an inscribed circle inside the rectangle. Because samples closer to the center of the circle are sparser, the samples located in the four corners of the rectangle are harder to classify than those inside the circle. As the dimension grows, the volume of the feature hypercube stays the same, but the volume of the inscribed hypersphere tends to zero. In a high-dimensional space, most of the training data therefore lies in the corners of the feature hypercube, and data scattered in the corners is harder to classify than data near the center.
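+
+The shrinking volume can be made concrete. As a sketch, assuming a unit hypercube (side length 1) with an inscribed hypersphere of radius $0.5$, the standard volume formula for a $d$-dimensional ball gives
+$$
+V(d)=\frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)}0.5^{d},\qquad V(2)=\frac{\pi}{4}\approx 0.785,\quad V(3)=\frac{\pi}{6}\approx 0.524,\quad V(10)\approx 0.0025
+$$
+while the hypercube volume stays at $1^{d}=1$. In 10 dimensions, more than 99% of the volume therefore lies outside the hypersphere, in the corners.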
+
+### 2.21.2 How to avoid the curse of dimensionality
+
+**To be improved!**
+
+Methods for mitigating the curse of dimensionality include:
+
+- Principal Component Analysis (PCA)
+- Linear Discriminant Analysis (LDA)
+- Singular value decomposition (SVD) for simplifying data
+- Laplacian eigenmaps
+- Lasso shrinkage
+- Wavelet analysis
+
+### 2.21.3 What is the difference and connection between clustering and dimensionality reduction?
+
+Clustering is used to find the distribution structure inherent in data. It can be a standalone process, as in anomaly detection, or a precursor to other learning tasks such as classification. Clustering is the standard example of unsupervised learning.
+
+1) In some recommendation systems, the type of a new user needs to be determined, but "user type" is not easy to define. In this case, the existing user data can be clustered first, each cluster can be defined as a class, and a classification model can then be trained on these classes to identify the type of a new user.
+
+
+
+2) Dimensionality reduction is an important way to alleviate the curse of dimensionality. It transforms the original high-dimensional attribute space into a low-dimensional "subspace" through some mathematical transformation. It rests on the assumption that although the observed data samples are high-dimensional, what is actually relevant to the learning task is a low-dimensional distribution. The data can therefore be described by its most important feature dimensions, which helps subsequent classification. An example is the Titanic survival problem on Kaggle: given a passenger's characteristics such as age, name, gender and fare, predict whether the passenger survives the shipwreck. This requires feature selection first, to identify the main features and make the learned model generalize better.
+
+Both clustering and dimensionality reduction can be used as preprocessing steps for classification and other tasks.
+
+
+
+Although both reduce the data, they act on different objects: clustering operates on the data points, while dimensionality reduction operates on the features of the data. Each also has a variety of implementations. K-means, hierarchical clustering and density-based clustering are commonly used for clustering; PCA, Isomap and LLE are commonly used for dimensionality reduction.
+
+
+
+
+
+### 2.21.4 Comparison of four clustering methods
+
+http://www.cnblogs.com/William_Fire/archive/2013/02/09/2909499.html
+
+
+
+Clustering divides a data set into different classes or clusters according to a certain criterion (such as a distance criterion), so that the similarity between data objects in the same cluster is as large as possible and the difference between data objects in different clusters is also as large as possible. That is, after clustering, data of the same type is gathered together as much as possible and data of different types is separated as much as possible.
+The main clustering algorithms can be divided into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. The following sections compare and analyze the clustering results of the k-means algorithm, the agglomerative hierarchical clustering algorithm, the neural-network-based SOM algorithm, and the fuzzy FCM algorithm on a common benchmark data set.
+
+### 2.21.5 k-means clustering algorithm
+
+K-means is one of the classical clustering algorithms among the partitioning methods. Because of its efficiency, it is widely used for clustering large-scale data, and many algorithms extend and improve on it.
+The k-means algorithm takes k as a parameter and divides n objects into k clusters so that similarity within a cluster is high and similarity between clusters is low. It proceeds as follows: first, k objects are selected at random, each initially representing the mean (center) of a cluster; each remaining object is assigned to the nearest cluster according to its distance from each cluster center; then the mean of each cluster is recomputed. This process is repeated until the criterion function converges. Usually the squared error criterion is used, defined as follows:
+
+$E=\sum_{i=1}^{k}\sum_{p\in C_{i}}|p-m_{i}|^{2} $
+
+Here $E$ is the sum of the squared errors of all objects in the database, $p$ is a point in space, and $m_i$ is the mean of cluster $C_i$ [9]. This objective function makes the resulting clusters as compact and as well separated as possible. The distance metric used is the Euclidean distance, although other distance metrics can be used.
+
+The flow of the k-means clustering algorithm is as follows:
+Input: a database containing n objects and the number of clusters k;
+Output: k clusters that minimize the squared error criterion.
+Steps:
+(1) Arbitrarily select k objects as the initial cluster centers;
+(2) Repeat;
+(3) (Re)assign each object to the most similar cluster, based on the mean of the objects in each cluster;
+(4) Update the cluster means, that is, compute the mean of the objects in each cluster;
+(5) Until the assignments no longer change.
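+
+As a small worked illustration of steps (1) to (5), consider a hypothetical one-dimensional data set $\{1, 2, 9, 10\}$ with $k=2$ and initial centers $m_1=1$, $m_2=2$ (all numbers here are made up for illustration):
+$$
+\begin{align}
+&\text{Iteration 1: } C_1=\{1\},\ C_2=\{2,9,10\} \ \Rightarrow\ m_1=1,\ m_2=7 \newline
+&\text{Iteration 2: } C_1=\{1,2\},\ C_2=\{9,10\} \ \Rightarrow\ m_1=1.5,\ m_2=9.5 \newline
+&\text{Iteration 3: } \text{assignments unchanged, stop; } E=4\times 0.5^{2}=1
+\end{align}
+$$
+The point $2$ first goes to $C_2$ only because it coincides with the initial center $m_2$; once $m_2$ moves to $7$ it is closer to $m_1$ and switches cluster.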
+
+### 2.21.6 Hierarchical Clustering Algorithm
+
+Depending on whether the hierarchical decomposition is bottom-up or top-down, hierarchical clustering algorithms are divided into agglomerative and divisive hierarchical clustering algorithms.
+The strategy of agglomerative hierarchical clustering is to first treat each object as a cluster and then merge clusters into larger and larger clusters until all objects are in one cluster or a termination condition is satisfied. Most hierarchical clustering methods are agglomerative and differ only in how the similarity between clusters is defined. The four widely used methods for measuring the distance between clusters are as follows:
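+
+The definitions themselves are missing from this text. As an assumption, the four measures intended are the standard textbook ones (minimum, maximum, mean and average distance), where $|p-p'|$ is the distance between two objects, $m_i$ is the mean of cluster $C_i$ and $n_i$ is the number of objects in $C_i$:
+$$
+\begin{align}
+&d_{\min}(C_i,C_j)=\min_{p\in C_i,\,p'\in C_j}|p-p'| \newline
+&d_{\max}(C_i,C_j)=\max_{p\in C_i,\,p'\in C_j}|p-p'| \newline
+&d_{\text{mean}}(C_i,C_j)=|m_i-m_j| \newline
+&d_{\text{avg}}(C_i,C_j)=\frac{1}{n_i n_j}\sum_{p\in C_i}\sum_{p'\in C_j}|p-p'|
+\end{align}
+$$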
+
+
+
+The flow of the agglomerative hierarchical clustering algorithm using the minimum distance is:
+
+(1) Treat each object as a class and compute the minimum distance between every pair of classes;
+(2) Merge the two classes with the smallest distance into one new class;
+(3) Recompute the distances between the new class and all other classes;
+(4) Repeat (2) and (3) until all classes are merged into one class.
+
+### 2.21.7 SOM clustering algorithm
+The SOM neural network [11] was proposed by the Finnish neural network expert Professor Kohonen. It assumes that the input objects have some topological structure or order, and it implements a dimensionality-reducing mapping from the input space (n-dimensional) to the output plane (2-dimensional). The mapping preserves topological features and has a strong theoretical connection with actual brain processing.
+
+The SOM network consists of an input layer and an output layer. The input layer corresponds to a high-dimensional input vector, and the output layer consists of a series of ordered nodes organized on a 2-dimensional grid. The input nodes and the output nodes are connected by weight vectors. During learning, the output layer unit with the shortest distance to the input, i.e. the winning unit, is found and updated. At the same time, the weights of the neighboring units are updated so that the output nodes preserve the topological features of the input vectors.
+
+Algorithm flow:
+
+(1) Initialize the network, assigning an initial value to the weight of each output layer node;
+(2) Randomly select an input vector from the input samples and find the weight vector with the smallest distance to it;
+(3) Define the winning unit and adjust the weights in its neighborhood to bring them closer to the input vector;
+(4) Provide new samples and continue training;
+(5) Shrink the neighborhood radius, reduce the learning rate, and repeat until they fall below the allowed value; then output the clustering result.
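+
+Steps (3) and (5) can be summarized by the usual SOM weight update. This is the commonly stated form, given here as a sketch; the symbols $\eta(t)$ (learning rate), $h_{c,j}(t)$ (neighborhood function centered on the winning unit $c$) and $w_j$ (weight vector of output node $j$) are notation introduced here:
+$$
+w_j(t+1)=w_j(t)+\eta(t)\,h_{c,j}(t)\,\bigl(x(t)-w_j(t)\bigr),\qquad c=\arg\min_{j}\|x(t)-w_j(t)\|
+$$
+Both $\eta(t)$ and the radius of $h_{c,j}(t)$ decrease over time, which is what step (5) describes.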
+
+### 2.21.8 FCM clustering algorithm
+
+In 1965, Professor Zadeh of the University of California, Berkeley, first proposed the concept of the "fuzzy set". After more than ten years of development, fuzzy set theory was gradually applied in various practical fields. To overcome the shortcomings of either-or (hard) classification, cluster analysis based on fuzzy set theory was introduced. Cluster analysis using fuzzy mathematics is called fuzzy cluster analysis [12].
+
+The FCM algorithm uses a degree of membership to determine the degree to which each data point belongs to each cluster. It is an improvement on traditional hard clustering algorithms.
+
+
+
+Algorithm flow:
+
+(1) Standardize the data matrix;
+(2) Build the fuzzy similarity matrix and initialize the membership matrix;
+(3) Iterate until the objective function converges to a minimum;
+(4) From the result of the iteration, determine the class of each data point using the final membership matrix, and output the final clustering result.
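+
+The objective function and the updates iterated in step (3) are commonly written as follows. This is the standard fuzzy c-means formulation, given as a sketch; the symbols $u_{ij}$ (membership of sample $x_i$ in cluster $j$), $c_j$ (cluster center), $k$ (number of clusters), $n$ (number of samples) and the fuzzifier $m>1$ are notation introduced here:
+$$
+\begin{align}
+&J=\sum_{i=1}^{n}\sum_{j=1}^{k}u_{ij}^{m}\|x_i-c_j\|^{2},\qquad \sum_{j=1}^{k}u_{ij}=1 \newline
+&u_{ij}=\frac{1}{\sum_{l=1}^{k}\left(\frac{\|x_i-c_j\|}{\|x_i-c_l\|}\right)^{\frac{2}{m-1}}},\qquad c_j=\frac{\sum_{i=1}^{n}u_{ij}^{m}x_i}{\sum_{i=1}^{n}u_{ij}^{m}}
+\end{align}
+$$
+Hard k-means corresponds to the limiting case in which every $u_{ij}$ is either 0 or 1.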
+
+### 2.21.9 Experiments with the four clustering algorithms
+
+
+**Test data**
+
+The experiments use the IRIS [13] data set from the UCI repository, which is widely used to test classification and clustering algorithms. The IRIS data set contains 150 samples taken from three different Iris species: setosa, versicolor and virginica. Each sample has 4 attributes: sepal length, sepal width, petal length and petal width, in cm. Running different clustering algorithms on the data set yields clustering results of different accuracy.
+
+**Description of test results**
+
+Based on the algorithm principles and flows described above, the algorithms were implemented in MATLAB, giving the clustering results shown in Table 1.
+
+
+
+As shown in Table 1, the four clustering algorithms are compared in three respects:
+
+(1) Number of misclustered samples: the total number of wrongly clustered samples, i.e. the sum of the misclustered samples over all classes;
+
+(2) Running time: the time taken by the whole clustering process, in seconds;
+
+(3) Average accuracy: suppose the original data set has k classes, $c_i$ denotes the i-th class, $n_i$ is the number of samples in $c_i$, and $m_i$ is the number of those samples that are clustered correctly. Then $m_i/n_i$ is the accuracy for the i-th class, and the average accuracy is:
+
+$avg=\frac{1}{k}\sum_{i=1}^{k}\frac{m_{i}}{n_{i}} $
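+
+As a quick check of the formula, take hypothetical counts for a data set with $k=3$ classes of $n_i=50$ samples each and $m_1=50$, $m_2=45$, $m_3=40$ correctly clustered samples (these numbers are made up for illustration and are not the results of Table 1):
+$$
+avg=\frac{1}{3}\left(\frac{50}{50}+\frac{45}{50}+\frac{40}{50}\right)=\frac{1+0.9+0.8}{3}=0.9
+$$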
+
+## 2.22 The difference between GBDT and random forest
+
+Similarities between GBDT and random forest:
+1. Both are composed of multiple trees.
+2. In both, the final result is determined by multiple trees together.
+
+Differences between GBDT and random forest:
+1. The trees in a random forest can be classification trees or regression trees, whereas GBDT consists only of regression trees.
+2. The trees in a random forest can be generated in parallel, whereas the trees in GBDT can only be generated serially.
+3. For the final output, random forest uses majority voting (or similar), whereas GBDT sums the outputs of all trees, possibly with weights.
+4. Random forest is not sensitive to outliers, whereas GBDT is very sensitive to them.
+5. Random forest treats all training samples equally, whereas GBDT is an ensemble of weight-based weak classifiers.
+6. Random forest improves performance by reducing model variance, whereas GBDT improves performance by reducing model bias.
+
+## 2.23 The relationship between big data and deep learning
+
+**Big data** is usually defined as data sets "beyond the capture, management, and processing capabilities of common software tools".
+**Machine learning** is concerned with how to build computer programs that improve automatically with experience.
+**Data mining** is the application of specific algorithms to extract patterns from data.
+In data mining, the focus is on applying the algorithm, not on the algorithm itself.
+
+**The relationship between machine learning and data mining** is as follows:
+Data mining is a process in which machine learning algorithms are used as tools to extract potentially valuable patterns from a data set.
+The relationship between big data and deep learning is summarized as follows:
+
+1. Deep learning is a behavior that imitates the brain. It can learn from many aspects of the mechanisms and behavior of the object being studied, imitating its patterns of behavior and thinking.
+2. Deep learning helps the development of big data. It can assist at every stage of big data technology, whether data analysis, mining or modeling; with deep learning, these tasks become achievable one by one.
+3. Deep learning changes the way problems are approached. Solving a problem step by step is often not the best approach. With deep learning, the whole pipeline, from processing the data to putting it into the data application platform, is optimized end to end for the final goal.
+4. Deep learning on big data requires a framework. Deep learning on big data starts from fundamentals and needs a framework or a system. In general, turning big data into practical value through in-depth analysis is the most direct relationship between deep learning and big data.
+
+
+
+
+## References
+Zhou Zhihua, *Machine Learning*, p. 74 (decision tree pseudocode)
+[Neural Networks and Deep Learning, Chapter 3](http://neuralnetworksanddeeplearning.com/chap3.html) (introducing the cross-entropy cost function)
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-1.png
new file mode 100644
index 00000000..38d1b589
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-10.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-10.png
new file mode 100644
index 00000000..ab9b095c
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-10.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-11.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-11.png
new file mode 100644
index 00000000..5ab57c87
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-11.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-12.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-12.png
new file mode 100644
index 00000000..38f9b1a6
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-12.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-13.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-13.png
new file mode 100644
index 00000000..d0317d33
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-13.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-14.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-14.png
new file mode 100644
index 00000000..53cb70ed
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-14.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-15.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-15.png
new file mode 100644
index 00000000..fbbd76af
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-15.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-16.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-16.png
new file mode 100644
index 00000000..614c3872
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-16.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-17.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-17.png
new file mode 100644
index 00000000..e00c9b63
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-17.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-18.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2-18.jpg
new file mode 100644
index 00000000..3438ca2e
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-18.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-19.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2-19.jpg
new file mode 100644
index 00000000..b5e4f7c7
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-19.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-2.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-2.png
new file mode 100644
index 00000000..8bbc66cd
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-2.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-20.gif b/English version/ch02_MachineLearningFoundation/img/ch2/2-20.gif
new file mode 100644
index 00000000..3b8fbfdb
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-20.gif differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-21.gif b/English version/ch02_MachineLearningFoundation/img/ch2/2-21.gif
new file mode 100644
index 00000000..2bca00e1
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-21.gif differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-22.gif b/English version/ch02_MachineLearningFoundation/img/ch2/2-22.gif
new file mode 100644
index 00000000..bc986b15
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-22.gif differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-4.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-4.png
new file mode 100644
index 00000000..72282893
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-4.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-5.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-5.png
new file mode 100644
index 00000000..40111f8f
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-5.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-6.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-6.png
new file mode 100644
index 00000000..f555a6b6
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-6.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-7.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-7.png
new file mode 100644
index 00000000..a472df88
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-7.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-8.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-8.png
new file mode 100644
index 00000000..be239e6f
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-8.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2-9.png b/English version/ch02_MachineLearningFoundation/img/ch2/2-9.png
new file mode 100644
index 00000000..31e457fc
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2-9.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/1.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/1.jpg
new file mode 100644
index 00000000..13d8e28d
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/1.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/10.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/10.jpg
new file mode 100644
index 00000000..ccd1eeda
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/10.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/12.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/12.jpg
new file mode 100644
index 00000000..d62b27bc
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/12.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/2.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/2.jpg
new file mode 100644
index 00000000..c158a94d
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/2.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/3.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/3.jpg
new file mode 100644
index 00000000..918b8bb2
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/3.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/4.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/4.png
new file mode 100644
index 00000000..ddc43d47
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/4.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/5.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/5.jpg
new file mode 100644
index 00000000..3f93149f
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/5.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/6.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/6.jpg
new file mode 100644
index 00000000..f91f6259
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/6.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/7.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/7.jpg
new file mode 100644
index 00000000..d4d1281f
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/7.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/8.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/8.jpg
new file mode 100644
index 00000000..6ccddac2
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/8.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.1/9.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/9.png
new file mode 100644
index 00000000..19201b56
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.1/9.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.16.17-1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.17-1.png
new file mode 100644
index 00000000..83915ea4
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.17-1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.16.17-2.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.17-2.png
new file mode 100644
index 00000000..ca5e2c41
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.17-2.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.16.17-3.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.17-3.png
new file mode 100644
index 00000000..511492ce
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.17-3.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.16.18.1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.18.1.png
new file mode 100644
index 00000000..23b150ed
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.18.1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.16.20.1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.20.1.png
new file mode 100644
index 00000000..acc95fb4
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.20.1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.16.4.1.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.4.1.jpg
new file mode 100644
index 00000000..51a16bcb
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.4.1.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.16.4.2.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.4.2.png
new file mode 100644
index 00000000..d74817f7
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.4.2.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.16.4.3.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.4.3.png
new file mode 100644
index 00000000..42e4cee4
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.16.4.3.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.16/1.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.16/1.jpg
new file mode 100644
index 00000000..64794617
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.16/1.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.16/2.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.16/2.jpg
new file mode 100644
index 00000000..e8e8f9ce
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.16/2.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.18/1.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.18/1.jpg
new file mode 100644
index 00000000..a14454c4
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.18/1.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.19.1.1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.19.1.1.png
new file mode 100644
index 00000000..3fd7fd8f
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.19.1.1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.19.5A.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.19.5A.jpg
new file mode 100644
index 00000000..a6c5feb7
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.19.5A.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.19.5B.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.19.5B.jpg
new file mode 100644
index 00000000..ad455db6
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.19.5B.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.19.5C.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.19.5C.png
new file mode 100644
index 00000000..5e499e4e
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.19.5C.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.2.09.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.09.png
new file mode 100644
index 00000000..94e91922
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.09.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.2.10.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.10.png
new file mode 100644
index 00000000..c8b9935b
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.10.png differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.1/11.jpg" b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.11.png
similarity index 58%
rename from "ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.1/11.jpg"
rename to English version/ch02_MachineLearningFoundation/img/ch2/2.2.11.png
index e615b4bc..750c2e34 100644
Binary files "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.1/11.jpg" and b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.11.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.2.12.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.12.png
new file mode 100644
index 00000000..9357b0a4
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.12.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.2.4.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.4.png
new file mode 100644
index 00000000..e1e166dc
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.4.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.2.8.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.8.png
new file mode 100644
index 00000000..234f4d18
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.2.8.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.20.1.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.20.1.jpg
new file mode 100644
index 00000000..efb4f6fc
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.20.1.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.1.png
new file mode 100644
index 00000000..b6205fa9
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.2.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.2.png
new file mode 100644
index 00000000..ef734236
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.2.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.3.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.3.png
new file mode 100644
index 00000000..246ec05e
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.3.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.4.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.4.png
new file mode 100644
index 00000000..70a1d786
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.4.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.5.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.5.png
new file mode 100644
index 00000000..61bd081e
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.5.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.6.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.6.png
new file mode 100644
index 00000000..620857bb
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.6.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.6a.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.6a.png
new file mode 100644
index 00000000..85755d61
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.6a.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.7.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.7.png
new file mode 100644
index 00000000..40fbf900
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.1.7.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.21.3.1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.3.1.png
new file mode 100644
index 00000000..1b7c8051
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.21.3.1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.25/1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.25/1.png
new file mode 100644
index 00000000..b736f4d8
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.25/1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.27/1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.27/1.png
new file mode 100644
index 00000000..27440a49
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.27/1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.27/2.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.27/2.png
new file mode 100644
index 00000000..fd3465ef
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.27/2.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.29/1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.29/1.png
new file mode 100644
index 00000000..a3c751d1
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.29/1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.34/1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.34/1.png
new file mode 100644
index 00000000..fff539b7
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.34/1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.40.10/1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.10/1.png
new file mode 100644
index 00000000..04e82ec7
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.10/1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.40.11/1.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.11/1.jpg
new file mode 100644
index 00000000..c928da5d
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.11/1.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.40.15/1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.15/1.png
new file mode 100644
index 00000000..2bf0c1d8
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.15/1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/1.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/1.jpg
new file mode 100644
index 00000000..51a16bcb
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/1.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/1.png
new file mode 100644
index 00000000..0a1fa4a3
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/2.20.1.jpg b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/2.20.1.jpg
new file mode 100644
index 00000000..05e9ebe2
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/2.20.1.jpg differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/2.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/2.png
new file mode 100644
index 00000000..d74817f7
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/2.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/3.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/3.png
new file mode 100644
index 00000000..42e4cee4
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.40.3/3.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.5.1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.5.1.png
new file mode 100644
index 00000000..3eb29bcd
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.5.1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.6/1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.6/1.png
new file mode 100644
index 00000000..94f75b0c
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.6/1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.7.3.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.7.3.png
new file mode 100644
index 00000000..98257154
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.7.3.png differ
diff --git a/English version/ch02_MachineLearningFoundation/img/ch2/2.9/1.png b/English version/ch02_MachineLearningFoundation/img/ch2/2.9/1.png
new file mode 100644
index 00000000..7a93ae57
Binary files /dev/null and b/English version/ch02_MachineLearningFoundation/img/ch2/2.9/1.png differ
diff --git a/English version/ch02_MachineLearningFoundation/readme.md b/English version/ch02_MachineLearningFoundation/readme.md
new file mode 100644
index 00000000..ed99a5f1
--- /dev/null
+++ b/English version/ch02_MachineLearningFoundation/readme.md
@@ -0,0 +1,16 @@
+###########################################################
+
+\### Deep Learning 500 Questions - First * Chapter xxx
+
+**Responsible person (in no particular order):**
+xxxGraduate student-xxx(xxx)
+xxxDoctoral student-xxx
+xxx-xxx
+
+**Contributors (in no particular order):**
+Content contributors can add information
+
+Liu Yanchao - Southeast University
+Liu Yuande-Shanghai University of Technology (Content Revision)
+
+###########################################################
\ No newline at end of file
diff --git a/English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md b/English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md
new file mode 100644
index 00000000..1fc4d5f1
--- /dev/null
+++ b/English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md
@@ -0,0 +1,1127 @@
+[TOC]
+
+# Chapter 3 Foundation of Deep Learning
+
+## 3.1 Basic Concepts
+
+### 3.1.1 What is a neural network composed of?
+
+There are many types of neural networks, the most important of which is the multilayer perceptron. To describe the neural network in detail, let's start with the simplest neural network.
+
+**Perceptron**
+
+The basic neuron model in the multilayer perceptron is called the perceptron; it was invented in 1957 by *Frank Rosenblatt*.
+
+The simple perceptron is shown below:
+
+
+
+Here $x_1$, $x_2$, $x_3$ are the inputs to the perceptron, and its output is:
+
+$$
+Output = \left\{
+\begin{aligned}
+0, \quad if \ \ \sum_i w_i x_i \leqslant threshold \\
+1, \quad if \ \ \sum_i w_i x_i > threshold
+\end{aligned}
+\right.
+$$
+
+Think of the perceptron as a weighted voting mechanism. For example, three judges give a singer scores of $4$, $1$, and $-3$, and the weights of the three scores are $1$, $3$, and $2$, respectively, so the singer's final score is $4 * 1 + 1 * 3 + (-3) * 2 = 1$. Under the rules of the contest, the $threshold$ is $3$, meaning that a singer advances only if the overall score is greater than $3$. In perceptron terms, the singer is eliminated because
+
+$$
+\sum_i w_i x_i < threshold=3, output = 0
+$$
+
+Replace $threshold$ with $-b$ and the output becomes:
+
+$$
+Output = \left\{
+\begin{aligned}
+0, \quad if \ \ w \cdot x + b \le 0 \\
+1, \quad if \ \ w \cdot x + b > 0
+\end{aligned}
+\right.
+$$
+
+With suitable $w$ and $b$, a single perceptron unit can implement a NAND gate, as shown below:
+
+
+
+When the inputs are $0$ and $1$, the weighted sum is $ 0 * (-2) + 1 * (-2) + 3 = 1 > 0$, so the perceptron outputs $1$.
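+
+As a quick check, the same weights $(-2, -2)$ and bias $3$ can be evaluated on all four input pairs. This is only a worked sketch, assuming the rule that the output is $1$ when $w \cdot x + b > 0$ and $0$ otherwise:
+
+$$
+\begin{aligned}
+(0, 0): \quad & 0 * (-2) + 0 * (-2) + 3 = 3 > 0 \ \Rightarrow \ 1 \\
+(0, 1): \quad & 0 * (-2) + 1 * (-2) + 3 = 1 > 0 \ \Rightarrow \ 1 \\
+(1, 0): \quad & 1 * (-2) + 0 * (-2) + 3 = 1 > 0 \ \Rightarrow \ 1 \\
+(1, 1): \quad & 1 * (-2) + 1 * (-2) + 3 = -1 \le 0 \ \Rightarrow \ 0
+\end{aligned}
+$$
+
+The output is $0$ only when both inputs are $1$, which matches the NAND truth table.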
+
+More complex perceptrons are composed of simple perceptron units:
+
+
+
+**Multilayer Perceptron**
+
+The multilayer perceptron generalizes the perceptron. Its most important feature is that it has multiple layers of neurons, so it is also called a deep neural network. Unlike a single perceptron, each neuron in layer $i$ of a multilayer perceptron is connected to every neuron in layer $i-1$.
+
+
+
+The output layer can have more than $1$ neuron. There can be a single hidden layer or multiple hidden layers. A neural network whose output layer has multiple neurons looks like the following:
+
+
+
+
+### 3.1.2 What are the common model structures of neural networks?
+
+The figure below contains most of the commonly used models:
+
+
+
+### 3.1.3 How to choose a deep learning development platform?
+
+Existing open-source deep learning platforms include Caffe, PyTorch, MXNet, CNTK, Theano, TensorFlow, Keras, fastai, and others. How do you choose the platform that suits you? Here are some criteria for reference.
+
+**Reference 1: How easy is it to integrate with existing programming platforms and skills**
+
+This mainly concerns the development experience and resources you have already accumulated, such as programming languages and the storage formats of existing datasets.
+
+**Reference 2: Closeness of integration with the related machine learning and data processing ecosystem**
+
+Deep learning research is inseparable from software packages for data processing, visualization, and statistical inference. Before modeling, is there a convenient data preprocessing tool? After modeling, are there convenient tools for visualization, statistical inference, and data analysis?
+
+**Reference 3: Requirements and support for data volume and hardware**
+
+The amount of data involved in deep learning differs across application scenarios, which means we must consider distributed computing and multi-GPU computing. For example, people working on computer image processing often need to split image files and computing tasks across multiple compute nodes. Deep learning platforms are all developing rapidly, and each platform's support for distributed computing and similar scenarios is also evolving.
+
+**Reference 4: The maturity of the deep learning platform**
+
+Maturity is a relatively subjective criterion. Factors to consider include how active the community is, how easy it is to communicate with the developers, and the momentum of current adoption.
+
+**Reference 5: How versatile is the platform?**
+
+Some platforms are specifically developed for deep learning research and applications. Some platforms have powerful optimizations for distributed computing, GPU and other architectures. Can you use these platforms/software to do other things? For example, some deep learning software can be used to solve quadratic optimization; some deep learning platforms are easily extended and used in reinforcement learning applications.
+
+### 3.1.4 Why use deep representation?
+
+1. A deep neural network is a feature-learning algorithm. Shallow layers learn low-level simple features, such as edges and textures, directly from the input data. Deeper layers build on the features already learned by the shallow layers to learn higher-level features, up to deep semantic information from the computer's perspective.
+2. A deep network has relatively few hidden units per layer but many hidden layers. For a shallow network to compute the same function, it may need exponentially more hidden units.
+
+### 3.1.5 Why are deep neural networks difficult to train?
+
+
+1. Vanishing Gradient
+    A vanishing gradient means that the gradient becomes smaller and smaller as it propagates backward through the hidden layers, so the earlier layers learn significantly more slowly than the later layers, and learning stalls unless the gradient becomes larger.
+
+    Vanishing gradients are influenced by many factors, such as the learning rate, the initialization of the network parameters, and the saturation of the activation function. In a deep neural network, the gradient computed at each neuron is passed back to the previous layer, so the gradient received by neurons in the shallower layers is affected by the gradients of all the layers after them. If the computed gradient values are very small, then as the number of layers increases, the gradient update information decays exponentially and the gradient vanishes. The figure below shows the learning speed of different hidden layers:
+
+
+
+2. Exploding Gradient
+    In deep networks or Recurrent Neural Networks (RNNs), gradients can accumulate during network updates and become very large, causing large updates to the network weights and making the network unstable. In extreme cases, the weight values overflow and become $NaN$, and can no longer be updated.
+
+3. Degeneracy of the weight matrices reduces the effective degrees of freedom of the model. Learning slows down along degenerate directions in the parameter space, which reduces the effective dimensionality of the model. The available degrees of freedom of the network contribute to the gradient norm during learning. As the number of multiplied matrices (i.e., the network depth) increases, the matrix product becomes more and more degenerate. In nonlinear networks with hard saturation boundaries (such as ReLU networks), the degeneracy worsens faster as the depth increases. A visualization of this process is shown in a 2014 paper by Duvenaud et al.:
+
+
+
+As the depth increases, the input space (shown in the upper left corner) is twisted into thinner and thinner filaments at each point in the input space, and only one direction orthogonal to the filament affects the response of the network. In this direction, the network is actually very sensitive to change.
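+
+To make the exponential decay in item 1 concrete, here is a minimal worked sketch. It assumes a chain of $L$ layers with one sigmoid neuron each, $a_l = \sigma(z_l)$, $z_l = w_l a_{l-1} + b_l$, and a final error $E$; these symbols are introduced only for illustration and do not come from the figures above. By the chain rule, the gradient with respect to the first bias is:
+
+$$
+\frac{\partial E}{\partial b_1} = \frac{\partial E}{\partial a_L} \, \sigma'(z_1) \prod_{l=2}^{L} w_l \, \sigma'(z_l)
+$$
+
+Since $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \le \frac{1}{4}$, if every weight satisfies $|w_l| \le 1$ then
+
+$$
+\left| \frac{\partial E}{\partial b_1} \right| \le \left( \frac{1}{4} \right)^{L} \left| \frac{\partial E}{\partial a_L} \right|
+$$
+
+For $L = 10$ the factor $(1/4)^{10}$ is about $9.5 \times 10^{-7}$, so the first layer barely learns. Conversely, if $|w_l \, \sigma'(z_l)| > 1$ in every layer, the same product grows exponentially with depth, which is the exploding gradient case in item 2.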
+
+### 3.1.6 What is the difference between deep learning and machine learning?
+
+Machine learning: uses knowledge from computer science, probability theory, statistics, and other fields to let a computer learn new knowledge from input data. The machine learning process trains a model on data to optimize an objective function.
+
+Deep learning: a special kind of machine learning with great power and flexibility. It learns to represent the world as a nested hierarchy of concepts, in which each concept is defined in relation to simpler concepts, and more abstract representations are computed from less abstract ones.
+
+Traditional machine learning requires hand-crafted features to purposefully extract target information, and relies heavily on task specificity and on expert experience in feature design. Deep learning can learn simple features from large amounts of data and gradually build up more complex, more abstract deep features from them, without relying on manual feature engineering. This is a major reason for the rise of deep learning in the era of big data.
+
+
+
+
+
+
+
+## 3.2 Network Operations and Calculations
+
+### 3.2.1 Forward Propagation and Back Propagation?
+
+There are two main types of neural network computation: forward propagation (FP) acts on the input of each layer and computes the output layer by layer; backward propagation (BP) acts on the output of the network and computes gradients from the deep layers to the shallow layers in order to update the network parameters.
+
+**Forward Propagation**
+
+
+
+Suppose the nodes $ i, j, k, ... $ of the previous layer are connected to node $ w $ of this layer. What is the value of node $ w $? The outputs of nodes $ i, j, k, ... $ are weighted by the corresponding connection weights and summed, an offset term is added (omitted from the figure for simplicity), and the result is passed through a nonlinear function (i.e., the activation function) such as $ ReLU $ or $ sigmoid $. The final result is the output of node $ w $ of this layer.
+
+Finally, through this method of layer by layer operation, the output layer results are obtained.
+
+**Backpropagation**
+
+
+
+Forward propagation produces a final result which, taking classification as an example, always has some error. How can the error be reduced? One widely used algorithm is gradient descent, which requires partial derivatives. The letters in the figure are used as an example below:
+
+Let the final error be $ E $ and let the activation function of the output layer be linear. Then the partial derivative of $ E $ with respect to the output node $ y_l $ is $ y_l - t_l $, where $ t_l $ is the true value. $ \frac{\partial y_l}{\partial z_l} $ is the derivative of the activation function mentioned above, and $ z_l $ is the weighted sum mentioned above, so the partial derivative of $ E $ with respect to $ z_l $ in this layer is $ \frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l} \frac{\partial y_l}{\partial z_l} $. The next layer is calculated in the same way, except that the way $ \frac{\partial E}{\partial y_k} $ is computed changes, and so on back to the input layer, where finally $ \frac{\partial E}{\partial x_i} = \sum_j \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_i} $ with $ \frac{\partial z_j}{\partial x_i} = w_{ij} $. The weights are then adjusted using these derivatives, and forward and back propagation are repeated until a better result is obtained.
+
+### 3.2.2 How to calculate the output of the neural network?
+
+
+
+As shown in the figure above, the input layer has three nodes, numbered 1, 2, and 3; the four nodes of the hidden layer are numbered 4, 5, 6, and 7; and the two nodes of the output layer are numbered 8 and 9. For example, node 4 of the hidden layer is connected to the three nodes 1, 2, and 3 of the input layer, and the weights on these connections are $ w_{41}, w_{42}, w_{43} $.
+
+In order to calculate the output value of node 4, we must first get the output values of all its upstream nodes (ie nodes 1, 2, 3). Nodes 1, 2, and 3 are nodes of the input layer, so their output value is the input vector itself. According to the corresponding relationship in the above picture, you can see that the output values of nodes 1, 2, and 3 are $ x_1, x_2, x_3 $, respectively.
+
+$$
+a_4 = \sigma(w^T \cdot a) = \sigma(w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + w_{4b})
+$$
+
+Where $ w_{4b} $ is the offset of node 4.
+
+Similarly, we can continue to calculate the output values of nodes 5, 6, and 7 $ a_5, a_6, a_7 $.
+
+Calculate the output value of node 8 of the output layer $ y_1 $:
+
+$$
+y_1 = \sigma(w^T \cdot a) = \sigma(w_{84}a_4 + w_{85}a_5 + w_{86}a_6 + w_{87}a_7 + w_{8b})
+$$
+
+Where $ w_{8b} $ is the offset of node 8.
+
+In the same way, we can calculate $ y_2 $. Once the output values of all the nodes in the output layer have been calculated, we have the output vector $ (y_1, y_2) $ of the neural network for the input vector $ (x_1, x_2, x_3) $. Note that the output vector has the same number of dimensions as the output layer has neurons.
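+
+The node-by-node calculation above can be sketched in plain Python. The weights, offsets, and input values here are made-up illustration values (the numbers from the figure are not given in the text), and `sigmoid` and `layer_forward` are hypothetical helper names.
+
+```python
+import math
+
+
+def sigmoid(z):
+    """Logistic function, playing the role of sigma in the formulas above."""
+    return 1.0 / (1.0 + math.exp(-z))
+
+
+def layer_forward(weights, offsets, inputs):
+    """Return sigma(w . a + w_b) for every node of one fully connected layer."""
+    return [
+        sigmoid(sum(w * a for w, a in zip(row, inputs)) + b)
+        for row, b in zip(weights, offsets)
+    ]
+
+
+# Assumed example values: 3 input nodes, 4 hidden nodes (4..7), 2 output nodes (8, 9).
+x = [0.5, -1.0, 2.0]
+w_hidden = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.5], [-0.3, 0.2, 0.1]]
+b_hidden = [0.1, 0.0, -0.1, 0.2]
+w_output = [[0.2, -0.4, 0.1, 0.3], [-0.1, 0.3, 0.2, 0.5]]
+b_output = [0.05, -0.05]
+
+a_hidden = layer_forward(w_hidden, b_hidden, x)  # a_4, a_5, a_6, a_7
+y = layer_forward(w_output, b_output, a_hidden)  # y_1, y_2
+print(a_hidden, y)
+```
+
+Each row of `w_hidden` holds the weights of one hidden node, so the first row corresponds to $ w_{41}, w_{42}, w_{43} $.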
+
+### 3.2.3 How to calculate the output value of convolutional neural network?
+
+Suppose a 5\*5 image is convolved with a 3\*3 filter to produce a 3\*3 Feature Map, as shown below:
+
+
+
+$ x_{i,j} $ denotes the element in row $ i $, column $ j $ of the image. $ w_{m,n} $ denotes the weight in row $ m $, column $ n $ of the filter. $ w_b $ denotes the offset of the filter. $ a_{i,j} $ denotes the element in row $ i $, column $ j $ of the feature map. $ f $ denotes the activation function; here the $ ReLU $ function is used as an example.
+
+The convolution calculation formula is as follows:
+
+$$
+a_{i,j} = f(\sum_{m=0}^2 \sum_{n=0}^2 w_{m,n} x_{i+m, j+n} + w_b )
+$$
+
+When the step size is $1$, the feature map element $ a_{0,0} $ is calculated as follows:
+
+$$
+\begin{aligned}
+a_{0,0} &= f(\sum_{m=0}^2 \sum_{n=0}^2 w_{m,n} x_{0+m, 0+n} + w_b ) \\
+&= relu(w_{0,0} x_{0,0} + w_{0,1} x_{0,1} + w_{0,2} x_{0,2} + w_{1,0} x_{1,0} + w_{1,1} x_{1,1} \\
+&\qquad + w_{1,2} x_{1,2} + w_{2,0} x_{2,0} + w_{2,1} x_{2,1} + w_{2,2} x_{2,2}) \\
+&= 1 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 \\
+&= 4
+\end{aligned}
+$$
+
+The calculation process is illustrated as follows:
+
+
+
+By analogy, all Feature Maps are calculated.
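+
+As a rough sketch, the whole Feature Map can be computed with two nested loops. The image and filter values below are assumptions chosen so that $ a_{0,0} $ reproduces the sum $ 1 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 = 4 $ above; they are not taken from the figure, the offset is assumed to be 0, and `conv2d` is a hypothetical helper for square inputs without padding.
+
+```python
+def relu(v):
+    return max(0.0, v)
+
+
+def conv2d(image, kernel, bias=0.0, stride=1):
+    """Slide a square filter over a square image (no padding), then apply ReLU."""
+    f = len(kernel)
+    out_size = (len(image) - f) // stride + 1
+    return [
+        [
+            relu(sum(kernel[m][n] * image[i * stride + m][j * stride + n]
+                     for m in range(f) for n in range(f)) + bias)
+            for j in range(out_size)
+        ]
+        for i in range(out_size)
+    ]
+
+
+# Assumed 5x5 image and 3x3 filter.
+image = [
+    [1, 1, 1, 0, 0],
+    [0, 1, 1, 1, 0],
+    [0, 0, 1, 1, 1],
+    [0, 0, 1, 1, 0],
+    [0, 1, 1, 0, 0],
+]
+kernel = [
+    [1, 0, 1],
+    [0, 1, 0],
+    [1, 0, 1],
+]
+
+feature_map = conv2d(image, kernel)                   # stride 1 gives a 3x3 map
+feature_map_s2 = conv2d(image, kernel, stride=2)      # stride 2 gives a 2x2 map
+print(feature_map)
+print(feature_map_s2)
+```
+
+With these assumed values the top-left element of `feature_map` is 4, matching the hand calculation.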
+
+
+
+When the stride is 2, the Feature Map is calculated as follows
+
+
+
+Note: Image size, stride, and the size of the Feature Map after convolution are related. They satisfy the following relationship:
+
+$$
+W_2 = (W_1 - F + 2P)/S + 1\\
+H_2 = (H_1 - F + 2P)/S + 1
+$$
+
+Where $ W_2 $ is the width of the Feature Map after convolution; $ W_1 $ is the width of the image before convolution; $ F $ is the width of the filter; $ P $ is the amount of Zero Padding, i.e., the number of rings of $0$ added around the original image (if $P$ is $1$, one ring of $0$ is added); $S$ is the stride; $ H_2 $ is the height of the Feature Map after convolution; and $ H_1 $ is the height of the image before convolution.
+
+Example: Suppose the image width is $ W_1 = 5 $, the filter width is $ F=3 $, Zero Padding is $ P=0 $, and the stride is $ S=2 $. Then
+
+$$
+\begin{aligned}
+W_2 &= (W_1 - F + 2P)/S + 1 \\
+&= (5-3+0)/2 + 1 \\
+&= 2
+\end{aligned}
+$$
+
+The Feature Map width is 2. Similarly, we can also calculate that the Feature Map height is also 2.
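+
+The size relationship is easy to check in code. This is a minimal sketch; `conv_output_size` is a hypothetical helper, and rejecting sizes that do not divide evenly is a design choice of this sketch rather than something the text prescribes.
+
+```python
+def conv_output_size(w1, f, p, s):
+    """Feature Map width (or height): W_2 = (W_1 - F + 2P) / S + 1."""
+    span = w1 - f + 2 * p
+    if span < 0 or span % s != 0:
+        raise ValueError("filter does not tile the padded input evenly")
+    return span // s + 1
+
+
+print(conv_output_size(5, 3, 0, 2))  # the worked example above
+print(conv_output_size(5, 3, 0, 1))  # the same image with stride 1
+```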
+
+If the image depth before convolution is $ D $, then the corresponding filter depth must also be $ D $. Convolution calculation formula with depth greater than 1:
+
+$$
+a_{i,j} = f(\sum_{d=0}^{D-1} \sum_{m=0}^{F-1} \sum_{n=0}^{F-1} w_{d,m,n} x_{d,i+m,j+n} + w_b)
+$$
+
+Where $D$ is the depth; $F$ is the size of the filter; $w_{d,m,n}$ denotes the weight in layer $d$, row $m$, column $n$ of the filter; $x_{d,i,j}$ denotes the element in layer $d$, row $i$, column $j$ of the input image; the other symbols have the same meanings as before and are not described again.
+
+A convolutional layer can have multiple filters. Each filter, convolved with the input, produces one Feature Map, so the depth (number) of Feature Maps after convolution equals the number of filters in the convolutional layer. The following illustration shows the calculation of a convolutional layer with two filters: a $7*7*3$ input is convolved with two $3*3*3$ filters (stride $2$) to give a $3*3*2$ output. The Zero Padding in the figure is $1$, i.e., one ring of $0$ is added around the input elements (by the size formula above, the $7*7$ size must already include this padding).
+
+
+
+The above is the calculation method of the convolutional layer. It embodies local connectivity and weight sharing: each neuron is connected only to some of the neurons in the previous layer (according to the convolution rule), and the filter weights are the same for all neurons in the previous layer. For a convolutional layer containing two $3 * 3 * 3 $ filters, the number of parameters is only $ (3 * 3 * 3+1) * 2 = 56 $, independent of the number of neurons in the previous layer. Compared to a fully connected neural network, the number of parameters is greatly reduced.
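+
+The parameter count can be reproduced with a short sketch. The fully connected comparison is an assumption added for illustration: it connects a flattened $7*7*3$ input to a flattened $3*3*2$ output, and both helper names are hypothetical.
+
+```python
+def conv_layer_params(filter_size, depth, num_filters):
+    """Each filter has filter_size * filter_size * depth weights plus one offset."""
+    return (filter_size * filter_size * depth + 1) * num_filters
+
+
+def dense_layer_params(num_inputs, num_outputs):
+    """One weight per input-output pair plus one offset per output."""
+    return (num_inputs + 1) * num_outputs
+
+
+print(conv_layer_params(3, 3, 2))                # 56, as computed above
+print(dense_layer_params(7 * 7 * 3, 3 * 3 * 2))  # fully connected layer between the same shapes
+```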
+
+### 3.2.4 How to calculate the output value of the Pooling layer?
+
+The main role of the Pooling layer is to downsample, further reducing the number of parameters by removing unimportant samples from the Feature Map. There are many ways to pool; the most common one is Max Pooling, which takes the maximum value in each n\*n window as the sampled value. The figure below shows 2\*2 max pooling:
+
+
+
+In addition to Max Pooling, Average Pooling is also commonly used; it takes the average of each window.
+For a Feature Map with a depth of $ D $, each layer is pooled independently, so the depth after Pooling is still $ D $.
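+
+Both kinds of pooling can be sketched with one helper that takes the reduction as an argument. The 4\*4 Feature Map below is an assumed example, not the one from the figure, and `pool2d` is a hypothetical name.
+
+```python
+def pool2d(feature_map, size, reduce_fn):
+    """Apply reduce_fn to each non-overlapping size x size window (stride = size)."""
+    rows, cols = len(feature_map), len(feature_map[0])
+    return [
+        [
+            reduce_fn([feature_map[i + m][j + n] for m in range(size) for n in range(size)])
+            for j in range(0, cols - size + 1, size)
+        ]
+        for i in range(0, rows - size + 1, size)
+    ]
+
+
+def mean(values):
+    return sum(values) / len(values)
+
+
+# Assumed 4x4 Feature Map.
+fmap = [[1, 1, 2, 4], [5, 6, 7, 8], [3, 2, 1, 0], [1, 2, 3, 4]]
+max_pooled = pool2d(fmap, 2, max)
+avg_pooled = pool2d(fmap, 2, mean)
+print(max_pooled, avg_pooled)
+```
+
+For a Feature Map of depth $ D $, the same function would simply be applied to each of the $ D $ layers separately.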
+
+### 3.2.5 Example Understanding Back Propagation
+
+A typical three-layer neural network is as follows:
+
+
+
+Where Layer $ L_1 $ is the input layer, Layer $ L_2 $ is the hidden layer, and Layer $ L_3 $ is the output layer.
+
+Assume the input dataset is $ D=\{x_1, x_2, ..., x_n\} $ and the output dataset is $ y_1, y_2, ..., y_n $.
+
+If the output is the same as the input, the model is an autoencoder. If the raw data is passed through a mapping, the output will differ from the input.
+
+Suppose you have the following network layer:
+
+
+
+The input layer contains neurons $ i_1, i_2 $, offset $ b_1 $; the hidden layer contains neurons $ h_1, h_2 $, offset $ b_2 $, and the output layer is $ o_1, o_2 $, $ W_i $ is the weight of the connection between the layers, and the activation function is the $sigmoid $ function. Take the initial value of the above parameters, as shown below:
+
+
+
+Where:
+
+- Input data: $ i_1=0.05, i_2=0.10 $
+- Output data: $ o_1=0.01, o_2=0.99 $
+- Initial weights: $ w_1=0.15, w_2=0.20, w_3=0.25, w_4=0.30, w_5=0.40, w_6=0.45, w_7=0.50, w_8=0.55 $
+- Target: given the input data $ i_1, i_2 $ ($0.05$ and $0.10$), make the output as close as possible to the original output $ o_1, o_2 $ ($0.01$ and $0.99$).
+
+**Forward Propagation**
+
+1. Input layer --> hidden layer
+
+Calculate the input weighted sum of neuron $ h_1 $:
+
+$$
+net_{h1} = w_1 * i_1 + w_2 * i_2 + b_1 * 1\\
+net_{h1} = 0.15 * 0.05 + 0.2 * 0.1 + 0.35 * 1 = 0.3775
+$$
+
+The output $ out_{h1} $ of neuron $ h_1 $ (the activation function used here is the sigmoid function):
+
+$$
+out_{h1} = \frac{1}{1 + e^{-net_{h1}}} = \frac{1}{1 + e^{-0.3775}} = 0.593269992
+$$
+
+Similarly, the output $ out_{h2} $ of neuron $ h_2 $ can be calculated:
+
+$$
+out_{h2} = 0.596884378
+$$
+
+
+2. Hidden layer --> output layer:
+
+Calculate the values of the output layer neurons $ o_1 $ and $ o_2 $:
+
+$$
+net_{o1} = w_5 * out_{h1} + w_6 * out_{h2} + b_2 * 1
+$$
+
+$$
+net_{o1} = 0.4 * 0.593269992 + 0.45 * 0.596884378 + 0.6 * 1 = 1.105905967
+$$
+
+$$
+out_{o1} = \frac{1}{1 + e^{-net_{o1}}} = \frac{1}{1 + e^{-1.105905967}} = 0.75136507
+$$
+
+This completes forward propagation. We get the output values $ [0.75136507 , 0.772928465] $, which are far from the target values $ [0.01 , 0.99] $. Now we backpropagate the error, update the weights, and recalculate the output.
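+
+The forward pass of this example can be reproduced with a few lines of Python. This is a sketch under one assumption that the text does not state explicitly: $ w_3, w_4 $ feed $ h_2 $ and $ w_7, w_8 $ feed $ o_2 $ (this wiring reproduces the values of $ out_{h2} $ and the second output quoted above). The offsets 0.35 and 0.6 are read off the $ net_{h1} $ and $ net_{o1} $ formulas.
+
+```python
+import math
+
+
+def sigmoid(z):
+    return 1.0 / (1.0 + math.exp(-z))
+
+
+i1, i2 = 0.05, 0.10
+w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
+w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
+b1, b2 = 0.35, 0.60  # offsets as they appear in the net_h1 and net_o1 formulas
+
+# Input layer --> hidden layer
+out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
+out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)  # assumed wiring: w3, w4 feed h2
+
+# Hidden layer --> output layer
+out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
+out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)  # assumed wiring: w7, w8 feed o2
+
+print(round(out_h1, 9), round(out_h2, 9))
+print(round(out_o1, 9), round(out_o2, 9))
+```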
+
+**Backpropagation**
+
+1. Calculate the total error
+
+Total error (squared error is used here):
+
+$$
+E_{total} = \sum \frac{1}{2}(target - output)^2
+$$
+
+There are two outputs, so the errors of $ o_1 $ and $ o_2 $ are calculated separately and the total error is their sum:
+
+$E_{o1} = \frac{1}{2}(target_{o1} - out_{o1})^2 = \frac{1}{2}(0.01 - 0.75136507)^2 = 0.274811083$.
+
+$E_{o2} = 0.023560026$.
+
+$E_{total} = E_{o1} + E_{o2} = 0.274811083 + 0.023560026 = 0.298371109$.
+
+
+2. Hidden layer --> output layer weight update:
+
+Taking the weight $ w_5 $ as an example, to find out how much influence $ w_5 $ has on the total error, take the partial derivative of the total error with respect to $ w_5 $ (chain rule):
+
+$$
+\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial out_{o1}} * \frac{\partial out_{o1}}{\partial net_{o1}} * \frac{\partial net_{o1}}{\partial w_5}
+$$
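+
+The three factors of the chain rule can be evaluated numerically. The sketch below assumes, as in the forward pass, that $ w_5 $ connects $ h_1 $ to $ o_1 $, so that only $ E_{o1} $ depends on $ w_5 $; the finite-difference check is an addition for illustration and not part of the original derivation.
+
+```python
+import math
+
+
+def sigmoid(z):
+    return 1.0 / (1.0 + math.exp(-z))
+
+
+# Values from the forward pass of the worked example.
+out_h1, out_h2 = 0.593269992, 0.596884378
+w5, w6, b2 = 0.40, 0.45, 0.60
+target_o1 = 0.01
+
+
+def error_o1(w5_value):
+    """E_o1 as a function of w5 (E_o2 does not depend on w5 under the assumed wiring)."""
+    out = sigmoid(w5_value * out_h1 + w6 * out_h2 + b2)
+    return 0.5 * (target_o1 - out) ** 2
+
+
+out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
+dE_dout = out_o1 - target_o1         # d E_total / d out_o1
+dout_dnet = out_o1 * (1.0 - out_o1)  # derivative of the sigmoid
+dnet_dw5 = out_h1                    # d net_o1 / d w5
+grad_w5 = dE_dout * dout_dnet * dnet_dw5
+
+# Central finite difference as an independent check of the chain rule.
+eps = 1e-6
+numeric_grad_w5 = (error_o1(w5 + eps) - error_o1(w5 - eps)) / (2 * eps)
+print(round(grad_w5, 9), round(numeric_grad_w5, 9))
+```
+
+Both values come out at roughly 0.0822, so the analytic product of the three partial derivatives agrees with the numerical estimate.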
+
+The following diagram shows more intuitively how the error propagates back:
+
+
+
+### 3.2.6 What is the significance of a "deeper" neural network?
+
+Premise: within a certain range.
+
+- With the same number of neurons, a deeper network structure has a larger capacity. Layered composition yields an exponentially large representation space, in which many more kinds of substructures can be combined, making it easier to learn and represent a variety of features.
+- More hidden layers mean more nested levels of the nonlinear transformation introduced by the activation function, so more complex mappings can be constructed.
+
+## 3.3 Hyperparameters
+
+### 3.3.1 What is a hyperparameter?
+
+**Hyperparameters**: Quantities such as the learning rate, the number of iterations of gradient descent, the number of hidden layers, the number of hidden layer units, and the choice of activation function all have to be set according to the actual situation. These settings in turn control the final values of the model parameters, so they are called hyperparameters.
+
+### 3.3.2 How to find the optimal value of the hyperparameter?
+
+Some hyperparameters are always hard to tune when using machine learning algorithms, for example the weight decay size, the Gaussian kernel width, and so on. These have to be set manually, and the values chosen have a large impact on the results. Common methods for setting hyperparameters are:
+
+1. Guess and check: select parameters based on experience or intuition, and keep iterating.
+
+2. Grid search: let the computer try a set of values evenly distributed within a certain range.
+
+3. Random search: let the computer pick a set of values at random.
+
+4. Bayesian optimization: optimize the hyperparameters with Bayesian optimization. The difficulty is that the Bayesian optimization algorithm itself has parameters that need to be set.
+
+5. The MITIE method: perform local optimization starting from a good initial guess. It uses the BOBYQA algorithm with a carefully chosen starting point. Since BOBYQA only looks for the nearest local optimum, the success of this method depends largely on having a good starting point. In the case of MITIE, a good starting point is known, but this is not a universal solution, because usually you won't know where a good starting point is. On the plus side, this approach is well suited to finding local optima. I will discuss this later.
+
+6. LIPO: a recently proposed global optimization method. It has no parameters and has been shown to do better than random search.
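+
+Grid search and random search (methods 2 and 3) can be sketched as follows. The function `validation_loss` is a hypothetical stand-in for training a model and scoring it on a validation set; its formula, the candidate values, and the search ranges are all assumptions made for illustration.
+
+```python
+import itertools
+import random
+
+
+def validation_loss(lr, decay):
+    """Hypothetical stand-in for 'train a model, then score it on the validation set'."""
+    return (lr - 0.03) ** 2 + (decay - 0.001) ** 2
+
+
+def grid_search(lr_values, decay_values):
+    """Try every combination of the listed values and keep the best one."""
+    return min(itertools.product(lr_values, decay_values),
+               key=lambda params: validation_loss(*params))
+
+
+def random_search(n_trials, rng):
+    """Draw n_trials random combinations from the assumed ranges and keep the best one."""
+    trials = [(rng.uniform(0.0, 0.1), rng.uniform(0.0, 0.01)) for _ in range(n_trials)]
+    return min(trials, key=lambda params: validation_loss(*params))
+
+
+best_grid = grid_search([0.001, 0.01, 0.1], [0.0, 0.001, 0.01])
+best_random = random_search(9, random.Random(0))
+print(best_grid, best_random)
+```
+
+Both searches spend nine evaluations here. Grid search can only return one of the listed combinations, while random search can land anywhere in the ranges.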
+
+### 3.3.3 What is the general process of hyperparameter search?
+
+The general process of hyperparameter search:
+1. Divide the data set into a training set, a validation set, and a test set.
+2. Optimize the model parameters based on the performance metrics of the model on the training set.
+3. Search the model's hyperparameters based on the model's performance metrics on the validation set.
+4. Alternate steps 2 and 3 iteratively until the parameters and hyperparameters of the model are finalized, then evaluate the model on the test set.
+
+The search process requires a search algorithm, generally one of: grid search, random search, heuristic intelligent search, or Bayesian search.
+
+## 3.4 Activation function
+
+### 3.4.1 Why do we need a nonlinear activation function?
+
+**Why do we need an activation function?**
+
+1. The activation function plays an important role in enabling a model to learn and represent very complex, nonlinear functions.
+2. The activation function introduces nonlinearity. Without an activation function, the output signal is just a linear function of the input. A linear function is a first-degree polynomial; its complexity is limited, and so is its ability to learn complex function mappings from data. Without an activation function, a neural network cannot learn and model complex kinds of data such as images, video, audio, and speech.
+3. The activation function can convert the current feature space to another space through a certain linear mapping, so that the data can be better classified.
+
+**Why does the activation function need to be nonlinear?**
+
+1. If every layer of the network is linear, their composition is still linear, so the whole network behaves just like a single linear classifier. This makes it impossible to approximate arbitrary nonlinear functions.
+2. A nonlinear activation function makes the network more powerful: it increases the network's ability to learn complex data and to represent arbitrary nonlinear mappings between input and output.
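+
+The first point can be checked directly: two stacked linear layers give the same result as one linear layer whose weight matrix is the product of the two. The matrices and the input below are assumed values chosen for illustration, and offsets are left out to keep the sketch short.
+
+```python
+def matvec(matrix, vector):
+    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]
+
+
+def matmul(a, b):
+    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
+            for i in range(len(a))]
+
+
+def relu(vector):
+    return [max(0, v) for v in vector]
+
+
+# Assumed weights and input.
+W1 = [[1, 2], [3, 4]]
+W2 = [[0, 1], [-1, 2]]
+x = [5, -6]
+
+two_linear_layers = matvec(W2, matvec(W1, x))
+one_linear_layer = matvec(matmul(W2, W1), x)
+with_relu = matvec(W2, relu(matvec(W1, x)))
+print(two_linear_layers, one_linear_layer, with_relu)
+```
+
+The first two results are identical, while inserting ReLU between the layers gives a different output that no single matrix `W2 W1` reproduces for all inputs.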
+
+### 3.4.2 Common activation functions and images
+
+1. sigmoid activation function
+
+ The function is defined as: $ f(x) = \frac{1}{1 + e^{-x}} $, and its range is $ (0,1) $.
+
+ The function image is as follows:
+
+
+
+2. tanh activation function
+
+ The function is defined as: $ f(x) = tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $, the value range is $ (- 1,1) $.
+
+ The function image is as follows:
+
+
+
+3. ReLU activation function
+
+ The function is defined as: $ f(x) = max(0, x) $, and its range is $ [0,+\infty) $.
+
+ The function image is as follows:
+
+
+
+4. Leaky ReLU activation function
+
+ The function is defined as: $ f(x) = \left\{
+ \begin{aligned}
+ ax, \quad x<0 \\
+ x, \quad x \geq 0
+ \end{aligned}
+ \right. $, and its range is $ (-\infty, +\infty) $.
+
+ The image is as follows ($ a = 0.5 $):
+
+
+
+5. SoftPlus activation function
+
+ The function is defined as: $ f(x) = ln( 1 + e^x) $, and its range is $ (0, +\infty) $.
+
+ The function image is as follows:
+
+
+
+6. softmax function
+
+ The function is defined as: $ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} $.
+
+ Softmax is mostly used for multi-class neural network output.
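+
+The six definitions above translate almost directly into code. This is a minimal sketch using only the standard library; the default slope `a=0.5` follows the plotted Leaky ReLU example, and subtracting the maximum inside `softmax` is an implementation detail added here for numerical stability.
+
+```python
+import math
+
+
+def sigmoid(x):
+    return 1.0 / (1.0 + math.exp(-x))
+
+
+def tanh(x):
+    return math.tanh(x)
+
+
+def relu(x):
+    return max(0.0, x)
+
+
+def leaky_relu(x, a=0.5):
+    """a is the slope for x < 0."""
+    return a * x if x < 0 else x
+
+
+def softplus(x):
+    return math.log(1.0 + math.exp(x))
+
+
+def softmax(z):
+    """Subtracting max(z) leaves the result unchanged and avoids overflow."""
+    shifted = [math.exp(v - max(z)) for v in z]
+    total = sum(shifted)
+    return [v / total for v in shifted]
+
+
+for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
+                 ("leaky_relu", leaky_relu), ("softplus", softplus)]:
+    print(name, [round(fn(x), 4) for x in (-2.0, 0.0, 2.0)])
+print("softmax", [round(p, 4) for p in softmax([1.0, 2.0, 3.0])])
+```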
+
+### 3.4.3 Derivative calculation of common activation functions?
+
+For common activation functions, the derivative is calculated as follows:
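+
+As a sketch, the standard closed-form derivatives of four of the functions from section 3.4.2 can be written down and checked against a numerical derivative. The dictionary `ACTIVATIONS` and the helper `numeric_derivative` are hypothetical names, and the check points are arbitrary (they avoid $ x = 0 $, where ReLU is not differentiable).
+
+```python
+import math
+
+
+def sigmoid(x):
+    return 1.0 / (1.0 + math.exp(-x))
+
+
+# (function, derivative) pairs; the derivatives are the standard closed forms.
+ACTIVATIONS = {
+    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
+    "tanh": (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
+    "relu": (lambda x: max(0.0, x), lambda x: 1.0 if x > 0 else 0.0),
+    "softplus": (lambda x: math.log(1.0 + math.exp(x)), sigmoid),
+}
+
+
+def numeric_derivative(fn, x, eps=1e-6):
+    """Central finite difference, used only to check the closed forms."""
+    return (fn(x + eps) - fn(x - eps)) / (2 * eps)
+
+
+max_gap = max(
+    abs(dfn(x) - numeric_derivative(fn, x))
+    for fn, dfn in ACTIVATIONS.values()
+    for x in (-1.5, 0.7, 2.0)
+)
+print(max_gap)
+```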
+
+
+
+### 3.4.4 What are the properties of the activation function?
+
+1. Nonlinearity: When the activation function is nonlinear, a two-layer neural network can approximate essentially any function. If the activation function is the identity, i.e., $ f(x)=x $, this property does not hold, and an MLP that uses the identity activation function is equivalent to a single-layer neural network;
+2. Differentiability: This property is required when the optimization method is gradient-based;
+3. Monotonicity: When the activation function is monotonic, the error surface of a single-layer network is guaranteed to be convex;
+4. $ f(x)≈x $: When the activation function satisfies this property near the origin, the neural network trains efficiently if the parameters are initialized to small random values; if it does not, the initial values must be set with care;
+5. Range of output values: When the output of the activation function is bounded, gradient-based optimization methods are more stable, because the feature representation is significantly affected by only a limited number of weights; when the output is unbounded, training is more efficient, but a smaller learning rate is generally required.
+
+### 3.4.5 How to choose an activation function?
+
+Choosing a suitable activation function is not easy, and there are many factors to consider. Usually, if you are not sure which activation function works better, you can try several, evaluate them on the validation set or test set, and use whichever performs best.
+
+The following are common choices:
+
+1. If the output is a 0/1 value (binary classification), use the sigmoid function for the output layer and the ReLU function for all other units.
+2. If you are not sure which activation function to use in the hidden layers, ReLU is the usual choice. The tanh activation function is sometimes used as well. Note that the derivative of ReLU is 0 when its input is negative.
+3. sigmoid activation function: rarely used except in the output layer of a binary classification problem.
+4. tanh activation function: tanh works very well and is suitable for almost all situations.
+5. ReLU activation function: the most commonly used default. If you are not sure which activation function to use, use ReLU or Leaky ReLU first, and then try other activation functions.
+6. If some neurons die, the Leaky ReLU function can be used.
+
+### 3.4.6 What are the advantages of using the ReLU activation function?
+
+1. Over most of the input range, the derivative (slope) of the ReLU activation function stays far from 0. In a program, ReLU is just an if-else statement, whereas the sigmoid function requires floating-point exponentiation. In practice, neural networks that use ReLU usually learn faster than those using sigmoid or tanh activation functions.
+2. The derivatives of the sigmoid and tanh functions are close to 0 in the positive and negative saturation regions, which causes vanishing gradients, whereas the gradients of ReLU and Leaky ReLU are constant for inputs greater than 0 and do not vanish there.
+3. Note that when the input of ReLU is in the negative half-axis, the gradient is 0 and the neuron is not trained at that point, which produces so-called sparsity; Leaky ReLU does not have this problem.
+
+### 3.4.7 When can I use the linear activation function?
+
+1. The output layer mostly uses a linear activation function.
+2. A linear activation function may occasionally be used in a hidden layer.
+3. Otherwise, linear activation functions are rarely used.
+
+### 3.4.8 How to understand that ReLU (for inputs < 0) is a nonlinear activation function?
+
+The Relu activation function image is as follows:
+
+
+
+From the image, it can be seen that it has the following characteristics:
+
+1. Unilateral inhibition;
+2. A relatively wide excitation boundary;
+3. Sparse activation;
+
+From the image, the ReLU function is a piecewise linear function that sets all negative values to 0 and leaves positive values unchanged, which is called unilateral inhibition.
+
+Because of this unilateral inhibition, the neurons in the neural network also have sparse activation.
+
+**Sparse activation**: From a signal point of view, neurons respond selectively to only a small part of the input signals at any one time, and a large number of signals are deliberately masked. This can improve learning accuracy and extract sparse features better and faster. When $ x<0 $, ReLU is hard saturated, and when $ x>0 $, there is no saturation problem. ReLU keeps the gradient from decaying when $ x>0 $, thus alleviating the vanishing gradient problem.
+
+### 3.4.9 How is the Softmax function applied to multi-class classification?
+
+Softmax is used in multi-class classification. It maps the outputs of multiple neurons into the $ (0,1) $ interval, so that they can be interpreted as probabilities and used for classification.
+
+Suppose we have an array $ V $, and $ V_i $ denotes the $ i $-th element of $ V $. Then the softmax value of this element is
+
+$$
+S_i = \frac{e^{V_i}}{\sum_j e^{V_j}}
+$$
+
+In the following figure, the neural network contains an input layer followed by two feature layers. Finally, the softmax layer gives the probability of each class. Here there are three classes, so we finally obtain the probabilities of $ y=0 $, $ y=1 $, and $ y=2 $.
+
+
+
+Continuing with the picture below, the three inputs pass through softmax to give the array $[0.05, 0.10, 0.85] $, which is what softmax does.
+
+
+
+The more visual mapping process is shown below:
+
+
+
+In this example the raw outputs are $3, 1, -3$. The softmax function maps them to values in $(0,1)$ whose sum is $1$ (satisfying the properties of a probability), so we can interpret them as probabilities. When we finally select the output node, we select the node with the highest probability (that is, the largest value) as our prediction.
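+
+The mapping of the raw outputs $3, 1, -3$ can be reproduced with a short sketch (`softmax` is written out here with the standard library only):
+
+```python
+import math
+
+
+def softmax(values):
+    """S_i = exp(V_i) / sum_j exp(V_j)."""
+    exps = [math.exp(v) for v in values]
+    total = sum(exps)
+    return [e / total for e in exps]
+
+
+probs = softmax([3, 1, -3])
+predicted = probs.index(max(probs))  # index of the most probable class
+print([round(p, 4) for p in probs], predicted)
+```
+
+The three probabilities are roughly 0.88, 0.12, and 0.002, so the first node is selected as the prediction.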
+
+### 3.4.10 Cross entropy cost function definition and its derivative derivation
+
+(**Contributors: Huang Qinjian - South China University of Technology**)
+
+
+The output of the neuron is $a = \varsigma(z)$, where $z=\sum w_{j}x_{j}+b $ is the weighted sum of the inputs. The cross entropy cost function is defined as:
+
+$C=-\frac{1}{n}\sum[ylna+(1-y)ln(1-a)]$
+
+Where $n$ is the total number of training samples, the sum runs over all training inputs $x$, and $y$ is the corresponding target output.
+
+It is not obvious that this expression solves the problem of slow learning. In fact, it is not even obvious that it can be regarded as a cost function! Before addressing slow learning, let's see why the cross entropy can be interpreted as a cost function.
+
+There are two reasons for considering cross entropy as a cost function.
+
+First, it is non-negative, $C > 0$. All the individual terms in the sum are negative, because the arguments of the logarithms lie in (0,1), and there is a minus sign in front of the sum, so the result is non-negative.
+
+Second, if the actual output of the neuron is close to the target value for all training inputs x, then the cross entropy will be close to zero.
+
+Suppose, for example, that y = 0 and a ≈ 0. This is the result we hope for. The first term in the formula vanishes because y = 0, and the second is −ln(1 − a) ≈ 0. The same holds when y = 1 and a ≈ 1. So the smaller the difference between the actual output and the target output, the lower the value of the cross entropy. (This assumes that the target output is either 0 or 1, which is the case in classification.)
+
+In summary, the cross entropy is non-negative and approaches zero when the neuron's output is close to the target. These are exactly the characteristics we want from a cost function. The quadratic cost function has them too, so cross entropy is a good choice. But the cross entropy cost function has one advantage over the quadratic cost function: it avoids the problem of slow learning. To see this, we calculate the partial derivative of the cross entropy function with respect to the weights. We substitute $a={\varsigma}(z)$ into the formula and apply the chain rule twice to get:
+
+$\begin{eqnarray}\frac{\partial C}{\partial w_{j}}&=&-\frac{1}{n}\sum \frac{\partial }{\partial w_{j}}[ylna+(1-y)ln(1-a)]\\&=&-\frac{1}{n}\sum \frac{\partial }{\partial a}[ylna+(1-y)ln(1-a)]*\frac{\partial a}{\partial w_{j}}\\&=&-\frac{1}{n}\sum (\frac{y}{a}-\frac{1-y}{1-a})*\frac{\partial a}{\partial w_{j}}\\&=&-\frac{1}{n}\sum (\frac{y}{\varsigma(z)}-\frac{1-y}{1-\varsigma(z)})\frac{\partial \varsigma(z)}{\partial w_{j}}\\&=&-\frac{1}{n}\sum (\frac{y}{\varsigma(z)}-\frac{1-y}{1-\varsigma(z)}){\varsigma}'(z)x_{j}\end{eqnarray}$
+
+According to the definition $\varsigma(z)=\frac{1}{1+e^{-z}}$, after some algebra we get ${\varsigma}'(z)=\varsigma(z)(1-\varsigma(z))$. After simplification, we obtain:
+
+$\frac{\partial C}{\partial w_{j}}=\frac{1}{n}\sum x_{j}({\varsigma}(z)-y)$
+
+This is a beautiful formula. It tells us that the speed at which the weights learn is controlled by $\varsigma(z)-y$, the error in the output: the greater the error, the faster the learning. This is what we intuitively expect. In particular, this cost function avoids the slow learning caused by the ${\varsigma}'(z)$ term in the analogous equation for the quadratic cost function. With cross entropy, the ${\varsigma}'(z)$ term cancels, so we no longer need to care whether it becomes small. This cancellation is the special property of cross entropy. In fact, it is nothing miraculous: as we will see later, cross entropy was chosen precisely to have this property.
+
+According to a similar approach, we can calculate the partial derivative of the bias. I will not give a detailed process here, you can easily verify:
+
+$\frac{\partial C}{\partial b}=\frac{1}{n}\sum ({\varsigma}(z)-y) $
+
+
+Again, this avoids the slow learning caused by the ${\varsigma}'(z)$ term in the analogous equation for the quadratic cost function.
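+
+The closed-form gradient can be checked numerically for a single sigmoid neuron. The data points, weights, and offset below are assumed toy values, and the function names are hypothetical; the sketch only compares $\frac{1}{n}\sum x_{j}(\varsigma(z)-y)$ with a finite-difference estimate of $\frac{\partial C}{\partial w_{j}}$.
+
+```python
+import math
+
+
+def sigmoid(z):
+    return 1.0 / (1.0 + math.exp(-z))
+
+
+def cost(w, b, xs, ys):
+    """Cross entropy C = -(1/n) * sum[y ln a + (1 - y) ln(1 - a)]."""
+    total = 0.0
+    for x, y in zip(xs, ys):
+        a = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
+        total += y * math.log(a) + (1 - y) * math.log(1 - a)
+    return -total / len(xs)
+
+
+def grad_w(w, b, xs, ys, j):
+    """Closed form dC/dw_j = (1/n) * sum x_j * (sigmoid(z) - y)."""
+    return sum(
+        x[j] * (sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y)
+        for x, y in zip(xs, ys)
+    ) / len(xs)
+
+
+# Assumed toy data and parameters.
+xs = [[0.5, 1.0], [-1.0, 2.0], [1.5, -0.5]]
+ys = [1, 0, 1]
+w, b, eps = [0.3, -0.2], 0.1, 1e-6
+
+analytic = grad_w(w, b, xs, ys, 0)
+numeric = (cost([w[0] + eps, w[1]], b, xs, ys) - cost([w[0] - eps, w[1]], b, xs, ys)) / (2 * eps)
+print(analytic, numeric)
+```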
+
+### 3.4.11 Why is Tanh faster than Sigmoid?
+
+** (Contributor: Huang Qinjian - South China University of Technology)**
+
+$tanh'(x)=1-tanh(x)^{2}\in (0,1] $
+
+$s'(x)=s(x)*(1-s(x))\in (0,\frac{1}{4}]$
+
+The two formulas above show that the vanishing gradient problem is milder for tanh(x) than for sigmoid, so tanh converges faster than sigmoid.
+
+## 3.5 Batch_Size
+
+### 3.5.1 Why do I need Batch_Size?
+
+The choice of batch size first of all determines the direction of descent.
+
+If the data set is small, the full data set can be used (Full Batch Learning). The benefits are:
+
+1. The direction determined by the full data set represents the sample population better and therefore points more accurately toward the extremum.
+2. Since the gradient values of different weights differ greatly, it is difficult to choose a global learning rate. Full Batch Learning can use Rprop, which updates each weight individually based only on the sign of its gradient.
+
+For larger data sets, using the full data set has these downsides:
+1. With the massive growth of data sets and limits on memory, it is becoming increasingly infeasible to load all of the data at once.
+2. When iterating in the Rprop manner, the gradient corrections cancel each other out because of the sampling differences between batches and cannot correct the weights. This later led to RMSProp as a compromise.
+
+### 3.5.2 Selection of Batch_Size value
+
+If only one sample is trained at a time, Batch_Size = 1. The error surface of a linear neuron under the mean square error cost function is a paraboloid with elliptical cross sections. For multi-layer neurons and nonlinear networks, the error surface is still approximately a paraboloid locally. In this case each update follows the gradient direction of a single sample, the updates pull in different directions, and it is difficult to achieve convergence.
+
+Since both Batch_Size = full data set and Batch_Size = 1 have their shortcomings, can a moderate Batch_Size be chosen?
+
+Yes: mini-batch learning can be used. If the data set is sufficiently large, the gradient computed from half of the data (or even much less) is almost the same as the gradient computed from all of the data.
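+
+The claim that a subset of the data gives almost the same gradient can be illustrated on synthetic data. Everything in this sketch is an assumption made for illustration: the one-parameter model $ y = w x $, the data generated as $ y = 3x $ plus noise, the sample sizes, and the random seed.
+
+```python
+import random
+
+
+def mse_gradient(w, samples):
+    """Gradient of the mean squared error of the model y = w * x with respect to w."""
+    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)
+
+
+rng = random.Random(0)
+
+# Assumed synthetic data: y = 3x plus a little Gaussian noise.
+data = []
+for _ in range(10000):
+    x = rng.uniform(-1.0, 1.0)
+    data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))
+
+full_grad = mse_gradient(0.0, data)                      # full data set
+half_grad = mse_gradient(0.0, rng.sample(data, 5000))    # half of the data
+small_grad = mse_gradient(0.0, rng.sample(data, 100))    # a small mini-batch
+single_grad = mse_gradient(0.0, rng.sample(data, 1))     # Batch_Size = 1
+print(full_grad, half_grad, small_grad, single_grad)
+```
+
+The gradient from half of the data stays close to the full gradient, while the single-sample gradient depends entirely on which sample happens to be drawn.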
+
+### 3.5.3 What are the benefits of increasing Batch_Size within a reasonable range?
+
+1. The memory utilization is improved, and the parallelization efficiency of large matrix multiplication is improved.
+2. The number of iterations required to complete an epoch (full data set) is reduced, and the processing speed for the same amount of data is further accelerated.
+3. Within a certain range, the larger the Batch_Size, the more accurate the determined direction of descent and the smaller the training oscillation.
+
+### 3.5.4 What is the disadvantage of blindly increasing Batch_Size?
+
+1. Memory utilization increases, but memory capacity may not be sufficient.
+2. The number of iterations required to complete an epoch (full data set) is reduced, but to achieve the same accuracy the time required increases greatly, so the parameters are corrected more slowly.
+3. Once Batch_Size has increased to a certain point, the direction of descent it determines hardly changes any more.
+
+### 3.5.5 What is the effect of adjusting Batch_Size on the training effect?
+
+1. If Batch_Size is too small, the model performs extremely badly (the error soars).
+2. As Batch_Size increases, the same amount of data is processed faster.
+3. As Batch_Size increases, more and more epochs are required to achieve the same accuracy.
+4. Because these two factors work against each other, there is a Batch_Size at which the training time is optimal.
+5. Because the final convergence falls into different local extrema, there is a Batch_Size at which the final convergence accuracy is optimal.
+
+## 3.6 Normalization
+
+### 3.6.1 What is the meaning of normalization?
+
+1. It unifies the statistical distribution of the samples. Normalizing to $ [0, 1] $ gives a statistical probability distribution; normalizing to $ [-1, +1] $ gives a statistical coordinate distribution.
+
+2. For both modeling and computation, the basic units of measurement should be the same. A neural network is trained and makes predictions based on the statistical probability of the samples, and the sigmoid function takes values between 0 and 1, as does the output of the last node of the network, so the sample outputs often need to be normalized as well.
+
+3. Normalization unifies the data into a statistical probability distribution on $ [0, 1] $. When the input signals of all samples are positive, the weights connected to the neurons of the first hidden layer can only increase or decrease together, which slows down learning.
+
+4. In addition, data sets often contain singular samples (outliers). They increase the training time and may prevent the network from converging. To avoid this, and to simplify data processing and speed up learning, the input signals can be normalized so that their mean over all samples is close to 0, or small compared with their standard deviation.
+
+### 3.6.2 Why do you want to normalize?
+
+1. It simplifies subsequent data processing and avoids some unnecessary numerical problems.
+2. It speeds up convergence when the program runs.
+3. It puts features on the same scale. Different features of the sample data are measured against different criteria, so the criteria need to be unified. This is an application-level requirement.
+4. It avoids neuron saturation. When the activation of a neuron saturates near 0 or 1, the gradient in these regions is almost zero, so during backpropagation the local gradient approaches 0, which effectively "kills" the gradient.
+5. It ensures that small values in the output data are not swamped by larger ones.
+
+### 3.6.3 Why can normalization improve the solution speed?
+
+
+
+The figure above shows the search for the optimal solution with and without normalized data (the circles can be read as contour lines). The left image shows the search process without normalization, and the right image shows the search process with normalization.
+
+When gradient descent is used without normalization, it is very likely to follow a zigzag path (moving perpendicular to the contour lines), so many iterations are needed to converge. In the right image, the two original features have been normalized, the contour lines are nearly circular, and gradient descent converges faster.
+
+Therefore, if a machine learning model is optimized with gradient descent, normalization is often necessary; otherwise it converges slowly or may not converge at all.
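+
+The effect can be reproduced with a small experiment. The sketch below is an illustration under stated assumptions: a two-feature least-squares problem with synthetic data, plain gradient descent, and a step size chosen as the inverse of the largest Hessian eigenvalue (the helper `gd_steps` and all constants are hypothetical). It counts the steps needed with raw and with standardized features.
+
+```python
+import numpy as np
+
+
+def gd_steps(X, y, tol=1e-6, max_steps=100_000):
+    """Run gradient descent on 0.5 * mean((Xw - y)^2); return the number of steps used."""
+    n = len(y)
+    hessian = X.T @ X / n
+    lr = 1.0 / np.linalg.eigvalsh(hessian).max()  # largest step size that is safely stable
+    w = np.zeros(X.shape[1])
+    for step in range(1, max_steps + 1):
+        grad = X.T @ (X @ w - y) / n
+        if np.linalg.norm(grad) < tol:
+            return step
+        w -= lr * grad
+    return max_steps
+
+
+rng = np.random.default_rng(0)
+# Two features on very different scales, as in the unnormalized (left) picture.
+X_raw = np.column_stack([rng.uniform(-10, 10, 500), rng.uniform(-100, 100, 500)])
+y = X_raw @ np.array([0.5, 0.05]) + rng.normal(0, 0.1, 500)
+
+# Standardize each feature to zero mean and unit standard deviation.
+X_norm = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
+y_centered = y - y.mean()
+
+steps_raw = gd_steps(X_raw, y)
+steps_norm = gd_steps(X_norm, y_centered)
+print(f"steps without normalization: {steps_raw}, with normalization: {steps_norm}")
+```
+
+With features on different scales the loss surface is elongated, so the step size that keeps the steep direction stable makes progress along the flat direction very slow.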
+
+### 3.6.4 3D illustration without normalization
+
+Example:
+
+Suppose $ w_1 $ ranges over $[-10, 10]$ while $ w_2 $ ranges over $[-100, 100]$, and the gradient advances by 1 unit each time. Each step then covers $1/20$ of the range in the $ w_1 $ direction but only $1/200$ of the range in the $ w_2 $ direction. In relative terms the step along $ w_2 $ is smaller, so $ w_1 $ "walks" faster than $ w_2 $ during the search.
+
+As a result, the search is biased toward the $ w_1 $ direction and traces an "L" shape or a zigzag.
+
+
+
+### 3.6.5 What types of normalization?
+
+1. Linear normalization
+
+$$
+X^{\prime} = \frac{x - \min(x)}{\max(x) - \min(x)}
+$$
+
+Scope of application: suitable when the values are relatively concentrated.
+
+Disadvantages: if max and min are unstable, the normalized result is unstable as well, which makes later use of the data unreliable.
+
+2. Standard deviation standardization
+
+$$
+X^{\prime} = \frac{x-\mu}{\sigma}
+$$
+
+Meaning: the processed data has mean 0 and standard deviation 1, where $ \mu $ is the mean of all sample data and $ \sigma $ is the standard deviation of all sample data. The result follows the standard normal distribution only if the original data is itself normally distributed.
+
+3. Nonlinear normalization
+
+Scope of application: often used when the data values differ widely, with some values very large and some very small. The original values are mapped through a mathematical function such as $ \log $, exponential, or tangent functions.
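+
+The three types above can be sketched in a few lines of NumPy. The function names are hypothetical, and the logarithmic variant is only one possible nonlinear mapping; it assumes all values are at least 1 so that the result stays in $ [0, 1] $.
+
+```python
+import numpy as np
+
+
+def min_max_normalize(x):
+    """Linear normalization: map the data to [0, 1]."""
+    return (x - x.min()) / (x.max() - x.min())
+
+
+def z_score_normalize(x):
+    """Standard deviation standardization: zero mean, unit standard deviation."""
+    return (x - x.mean()) / x.std()
+
+
+def log_normalize(x):
+    """One possible nonlinear normalization (assumes x >= 1): log scaling to [0, 1]."""
+    return np.log10(x) / np.log10(x.max())
+
+
+x = np.array([1.0, 10.0, 100.0, 1000.0])
+print(min_max_normalize(x))
+print(z_score_normalize(x))
+print(log_normalize(x))
+```
+
+On this widely spread example, linear normalization squeezes the three smallest values close to 0, while the logarithmic mapping spaces all four values evenly.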
+
+### 3.6.6 Local response normalization
+
+LRN is a technique for improving the accuracy of deep networks. It is generally applied after the activation and pooling operations.
+
+In AlexNet, the LRN layer was proposed to create competition among the activities of neighboring neurons: larger responses become relatively larger, and neurons with smaller responses are suppressed, which improves the generalization ability of the model.
+
+### 3.6.7 Understanding local response normalization
+
+Local response normalization imitates lateral inhibition in biology, where an active neuron suppresses its neighbors. The formula is as follows:
+
+$$
+b_{x,y}^i = a_{x,y}^i / \left( k + \alpha \sum_{j=\max(0, i-n/2)}^{\min(N-1, i+n/2)} (a_{x,y}^j)^2 \right)^\beta
+$$
+
+where:
+
+1) $ a $: the output of the convolutional layer (after the convolution and pooling operations), a four-dimensional array [batch, height, width, channel].
+
+- batch: number of batches (one image per batch).
+- height: the height of the image.
+- width: the width of the image.
+- channel: number of channels. It can be understood as the number of feature maps output for one image after the convolution operation, or the depth of the processed image.
+
+2) $ a_{x,y}^i $: one position $ [a,b,c,d] $ in this output, that is, the point at height $ b $ and width $ c $ in channel $ d $ of image $ a $.
+
+3) $ N $: the number of channels, as in the paper's formula.
+
+4) $ a $, $ n/2 $, and $ k $ correspond to the input, depth_radius, and bias arguments of the function. $ k, n, \alpha, \beta $ are all hyperparameters, generally set to $ k=2, n=5, \alpha=10^{-4}, \beta=0.75 $.
+
+5) $ \sum $: the summation runs along the channel direction, that is, along the channel dimension of $ a $. For each point, it adds the squares of the values at the same position in the $ n/2 $ preceding channels (no lower than channel $ 0 $) and the $ n/2 $ following channels (no higher than channel $ N-1 $), together with the point itself, for up to $ n+1 $ points in total. The function's English documentation also indicates that the input is treated as a set of three-dimensional arrays, one per channel, and the summation runs across them in the channel direction.
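+
+The formula can be written directly in NumPy. This is a minimal sketch rather than a framework implementation: the function name `local_response_norm` is hypothetical, and $ n/2 $ is assumed to mean integer division, so $ n=5 $ gives a window of two channels on each side.
+
+```python
+import numpy as np
+
+
+def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
+    """Apply LRN across channels to an array of shape [batch, height, width, channel]."""
+    num_channels = a.shape[-1]
+    half = n // 2  # assumption: n/2 in the formula is integer division
+    squared = a ** 2
+    b = np.empty_like(a, dtype=float)
+    for i in range(num_channels):
+        lo = max(0, i - half)
+        hi = min(num_channels - 1, i + half)
+        # Sum of squares over the neighboring channels at the same (x, y) position.
+        window_sum = squared[..., lo:hi + 1].sum(axis=-1)
+        b[..., i] = a[..., i] / (k + alpha * window_sum) ** beta
+    return b
+
+
+activations = np.ones((1, 2, 2, 8))
+normalized = local_response_norm(activations)
+print(normalized[0, 0, 0])
+```
+
+Channels at the border see fewer neighbors, so they are divided by a slightly smaller value than channels in the middle.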
+
+A simple diagram is as follows:
+
+
+
+### 3.6.8 What is Batch Normalization?
+
+In the past, only the input-layer data was normalized in neural network training; the intermediate layers were not. Although the input data is normalized, after it passes through operations such as $ \sigma(WX+b) $ (a matrix multiplication followed by a nonlinearity), its distribution is likely to change, and after many layers of a deep network the distribution changes more and more. If we also normalize in the middle of the network, can training be improved? The answer is yes.
+
+This method of normalizing the intermediate layers of a neural network to improve training is batch normalization (BN).
+
+### 3.6.9 Advantages of Batch Normalization (BN) Algorithm
+
+The advantages of the BN algorithm are:
+1. Fewer hyperparameters need to be chosen by hand. In some cases dropout and the L2 regularization term can be dropped, or a smaller L2 regularization coefficient can be used;
+2. The learning rate is less critical. A large initial learning rate can be used, and even with a smaller learning rate the algorithm still converges quickly;
+3. Local response normalization is no longer needed, because BN is itself a normalization layer (local response normalization was used in AlexNet);
+4. It breaks up the original data distribution, which alleviates over-fitting to some extent (it prevents the same sample from being selected repeatedly in each training batch; the literature reports that this can improve accuracy by 1%);
+5. It alleviates vanishing gradients, speeds up convergence, and improves training accuracy.
+
+### 3.6.10 Batch normalization (BN) algorithm flow
+
+The process of BN algorithm during training is given below.
+
+Input: the output of the previous layer $ X = \{x_1, x_2, ..., x_m\} $, learnable parameters $ \gamma, \beta $
+
+Algorithm flow:
+
+1. Calculate the mean of the output data of the previous layer
+
+$$
+\mu_{\beta} = \frac{1}{m} \sum_{i=1}^m(x_i)
+$$
+
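+Here $ m $ is the size of the training mini-batch. The subscript $ \beta $ on $ \mu_{\beta} $ and $ \sigma_{\beta}^2 $ marks a mini-batch statistic and is unrelated to the learnable parameter $ \beta $.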
+
+2. Calculate the variance of the output data of the previous layer
+
+$$
+\sigma_{\beta}^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_{\beta})^2
+$$
+
+3. Normalize to obtain
+
+$$
+\hat x_i = \frac{x_i - \mu_{\beta}}{\sqrt{\sigma_{\beta}^2 + \epsilon}}
+$$
+
+where $ \epsilon $ is a small value close to 0, added to avoid division by zero.
+
+4. Scale and shift (reconstruct) the normalized data:
+
+$$
+Y_i = \gamma \hat x_i + \beta
+$$
+
+where $ \gamma $ and $ \beta $ are learnable parameters.
+
+Note: the above is the training procedure. At inference time there is often only a single sample, so there is no batch mean $ \mu_{\beta} $ or batch variance $ \sigma_{\beta}^2 $. Instead, the mean is obtained by averaging the $ \mu_{\beta} $ values of all training batches, and the variance is the unbiased estimate computed from the $ \sigma_{\beta}^2 $ values of all training batches.
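+
+The four training steps and the inference-time statistics from the note can be sketched as follows. This is a minimal NumPy illustration, not a framework implementation: the function names, the value of `eps`, and the synthetic batches are assumptions, and the learnable parameters are simply fixed to $ \gamma = 1 $, $ \beta = 0 $.
+
+```python
+import numpy as np
+
+
+def batch_norm_train(x, gamma, beta, eps=1e-5):
+    """BN forward pass on one mini-batch of shape [m, features]."""
+    mu = x.mean(axis=0)                    # step 1: mini-batch mean
+    var = x.var(axis=0)                    # step 2: mini-batch variance
+    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize
+    return gamma * x_hat + beta, mu, var   # step 4: scale and shift
+
+
+def batch_norm_infer(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
+    """BN at inference time, using statistics collected during training."""
+    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
+    return gamma * x_hat + beta
+
+
+rng = np.random.default_rng(0)
+m = 32
+batches = [rng.normal(3.0, 2.0, size=(m, 4)) for _ in range(50)]
+gamma, beta = np.ones(4), np.zeros(4)
+
+stats = [batch_norm_train(batch, gamma, beta)[1:] for batch in batches]
+# Average the per-batch means; use the unbiased estimate m / (m - 1) * E[var] for the variance.
+pop_mean = np.mean([mu for mu, _ in stats], axis=0)
+pop_var = m / (m - 1) * np.mean([var for _, var in stats], axis=0)
+
+y_train, _, _ = batch_norm_train(batches[0], gamma, beta)
+y_single = batch_norm_infer(batches[0][:1], gamma, beta, pop_mean, pop_var)
+print(y_train.mean(axis=0), y_train.std(axis=0), y_single)
+```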
+
+### 3.6.11 Batch normalization and group normalization comparison
+
+| Name | Features |
+| :--- | :--- |
+| Batch Normalization (BN) | Allows various networks to be trained. However, normalizing along the batch dimension causes problems: when the batch becomes smaller, the batch statistics are estimated inaccurately and the BN error increases rapidly. When training large networks and transferring features to computer vision tasks (including detection, segmentation, and video), memory consumption forces small batches, which limits the use of BN. |
+| Group Normalization (GN) | GN divides the channels into groups and computes the mean and variance for normalization within each group. Its computation is independent of the batch size, and its accuracy is stable across a wide range of batch sizes. |
+| Comparison | On ResNet-50 trained on ImageNet with a batch size of 2, GN has an error rate 10.6% lower than BN; with typical batch sizes, GN is comparable to BN and outperforms other normalization variants. Moreover, GN transfers naturally from pre-training to fine-tuning. In object detection and segmentation on COCO and video classification on Kinetics, GN outperforms its BN-based counterparts, indicating that GN can effectively replace BN in a variety of tasks. |
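+
+To make the batch-size independence concrete, here is a minimal NumPy sketch of the GN normalization step. It assumes the [batch, height, width, channel] layout used earlier in this chapter; the function name is hypothetical and the learnable scale and shift parameters are left out for brevity.
+
+```python
+import numpy as np
+
+
+def group_norm(x, num_groups, eps=1e-5):
+    """Normalize x of shape [batch, height, width, channel] within channel groups."""
+    n, h, w, c = x.shape
+    assert c % num_groups == 0, "channels must be divisible by num_groups"
+    grouped = x.reshape(n, h, w, num_groups, c // num_groups)
+    # Statistics are computed per sample and per group, never across the batch axis.
+    mean = grouped.mean(axis=(1, 2, 4), keepdims=True)
+    var = grouped.var(axis=(1, 2, 4), keepdims=True)
+    normalized = (grouped - mean) / np.sqrt(var + eps)
+    return normalized.reshape(n, h, w, c)
+
+
+rng = np.random.default_rng(0)
+x = rng.normal(size=(2, 4, 4, 8))
+out = group_norm(x, num_groups=4)
+# The result for the first sample does not change when it is processed alone.
+print(np.allclose(group_norm(x[:1], num_groups=4), out[:1]))
+```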
+
+### 3.6.12 Comparison of Weight Normalization and Batch Normalization
+
+Both Weight Normalization and Batch Normalization are reparameterization methods, but they are applied in different ways.
+
+Weight Normalization normalizes the network weights $ W $, hence the name;
+
+Batch Normalization normalizes the input data to a layer of the network.
+
+Weight Normalization has the following three advantages over Batch Normalization:
+
+1. Weight Normalization accelerates the convergence of the network parameters by reparameterizing the weights $ W $, without introducing any dependency on the mini-batch, so it is applicable to RNN (LSTM) networks. Batch Normalization cannot be applied directly to RNNs because: 1) the sequences processed by an RNN have variable length; 2) an RNN computes step by step in time, so using Batch Normalization directly would require storing the mini-batch mean and variance at every time step, which is inefficient and memory-hungry.
+
+2. Batch Normalization computes the mean and variance from one mini-batch rather than from the entire training set, which introduces noise into the gradient computation. It is therefore not suitable for noise-sensitive models such as reinforcement learning and generative models (GAN, VAE). In contrast, Weight Normalization rewrites the weight $ W $ in terms of a scalar $ g $ and a vector $ v $, and the rewritten vector $ v $ is fixed, so Weight Normalization can be seen as introducing less noise than Batch Normalization.
+
+3. No additional storage is needed for the mini-batch mean and variance. The extra computation that Weight Normalization adds to the forward pass and to the backward gradient computation is also small, so it is faster than Batch Normalization. However, unlike Batch Normalization, Weight Normalization does not keep the output $ Y $ of each layer within a fixed range. Therefore, special attention should be paid to the choice of initial parameter values when using Weight Normalization.
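+
+The rewrite of $ W $ by a scalar $ g $ and a vector $ v $ mentioned in point 2 is usually written as $ w = g \cdot v / \|v\| $. The sketch below illustrates that formula for a single weight vector; the function name and the numbers are illustrative assumptions.
+
+```python
+import numpy as np
+
+
+def weight_norm(v, g):
+    """Reparameterize a weight vector as w = g * v / ||v||."""
+    return g * v / np.linalg.norm(v)
+
+
+v = np.array([3.0, 4.0])  # direction parameter
+g = 2.0                   # scale parameter
+w = weight_norm(v, g)
+
+# The norm of w equals g no matter how v is scaled, and no batch statistics are involved.
+print(w, np.linalg.norm(w), np.linalg.norm(weight_norm(10.0 * v, g)))
+```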
+
+### 3.6.13 When is Batch Normalization suitable?
+
+**(Contributor: Huang Qinjian - South China University of Technology)**
+
+In a CNN, BN should be applied before the nonlinear mapping. BN is worth trying when training converges slowly or cannot proceed at all because of problems such as exploding gradients. It can also be added in ordinary situations to speed up training and improve model accuracy.
+
+BN works best when each mini-batch is relatively large and the data distributions of the batches are close to each other. Shuffle the data well before training; otherwise the results will be much worse. In addition, because BN needs the first-order and second-order statistics of each mini-batch at run time, it is not suitable for dynamic network structures or RNNs.
+
+
+## 3.7 Pre-training and fine tuning
+
+### 3.7.1 Why can unsupervised pre-training help deep learning?
+Deep networks have the following problems:
+
+1. The deeper the network, the more training samples are needed. With supervised training, a large number of samples must be labeled; otherwise a small sample set easily leads to over-fitting. Deep networks have many features and therefore face the usual many-feature problems, such as the multi-sample problem, the regularization problem, and the feature selection problem.
+
+2. Optimizing the parameters of a multi-layer neural network is a high-order non-convex optimization problem, which often converges to a poor local solution;
+
+3. Gradient diffusion: the gradient computed by the BP algorithm shrinks significantly as it propagates toward the front of the network, so it contributes little to the parameters of the earlier layers and they are updated slowly.
+
+**Solution:**
+
+Greedy layer-wise training: unsupervised pre-training first trains the first hidden layer of the network, then the second, and so on. Finally, the trained parameter values are used as the initial values of the parameters of the whole network.
+
+After pre-training, a better local optimum can be obtained.
+
+### 3.7.2 What is model fine-tuning?
+
+Training a (possibly modified) network on your own data, starting from parameters that someone else has already trained, so that the parameters adapt to your data, is usually called fine-tuning.
+
+**Example of fine-tuning a model:**
+
+CNNs have made great progress in image recognition. If you want to apply a CNN to your own data set, you usually face one problem: the data set is not particularly large, generally no more than 10,000 images or even fewer, with only a few dozen or even around ten images per class. Training a network directly on this data is not feasible, because a key factor in the success of deep learning is a training set consisting of a large amount of labeled data. With only the data at hand, even a very good network architecture cannot reach high performance. Fine-tuning solves this problem well: we take a model trained on ImageNet (such as CaffeNet, VGGNet, or ResNet), fine-tune it, and apply it to our own data set.
+
+### 3.7.3 Is the network parameter updated when fine tuning?
+
+Answer: Yes, they are updated.
+
+1. Fine-tuning is equivalent to continuing training. The only difference from training from scratch is the initialization.
+2. Training from scratch initializes the parameters in the way specified by the network definition.
+3. Fine-tuning initializes the parameters from a parameter file you already have.
+
+### 3.7.4 Three states of the fine-tuning model
+
+1. State 1: Predict only, no training.
+Features: Relatively fast and simple. Very efficient for projects whose model is already trained and now only needs to label unknown data;
+
+2. State 2: Train, but only the final classification layer.
+Features: The final classes of the fine-tuned model already meet the requirements, and only the number of categories is reduced on that basis.
+
+3. State 3: Full training of the classification layer plus the preceding convolutional layers.
+Features: The difference from state 2 is small. State 3 is more time-consuming and needs GPU resources for training, but it is well suited to fine-tuning the model into exactly what you want, and its prediction accuracy is higher than in state 2.
+
+## 3.8 Weight and Bias Initialization
+
+### 3.8.1 All initialized to 0
+
+**Initialization trap**: everything is initialized to 0.
+
+**The reason for the trap**: We do not know the final value of each weight in a trained network, but with proper data normalization it is reasonable to assume that about half of the weights will be positive and the other half negative. It might therefore seem reasonable to initialize all weights to 0. However, if every neuron computes the same output, then every neuron also receives the same gradient during backpropagation and undergoes the same parameter update. More generally, if all weights are initialized to the same value, the network is symmetric and the neurons cannot become different from one another.
+
+**Visualized understanding**: Think of gradient descent as climbing, but you are standing in a straight valley with symmetric peaks on both sides. Because of the symmetry, the gradient at your position can only point along the valley, never toward either mountain; after you take a step, the situation is unchanged. As a result, you can only converge to an extremum within the valley and never reach the mountains.
+
+### 3.8.2 All initialized to the same value
+
+Initialization trap: everything is initialized to the same value.
+Take a three-layer network as an example:
+First look at the structure
+
+
+
+Its expression is:
+
+$$
+a_1^{(2)} = f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})
+$$
+
+$$
+a_2^{(2)} = f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})
+$$
+
+$$
+a_3^{(2)} = f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})
+$$
+
+$$
+h_{W,b}(x) = a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})
+$$
+
+If all the weights are the same, then from the second layer on, the input values of each layer are identical, that is, $ a_1 = a_2 = a_3 = \dots $. Since they are all the same, the layer behaves as if it had a single input. Why?
+
+In the backpropagation algorithm, the partial derivatives with respect to the weight and bias terms used in each iteration are computed as follows:
+
+$$
+\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y) = a_j^{(l)} \delta_i^{(l+1)}
+$$
+
+$$
+\frac{\partial}{\partial b_{i}^{(l)}} J(W,b;x,y) = \delta_i^{(l+1)}
+$$
+
+Calculation formula for $ \delta $
+
+$$
+\delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)} \right) f^{\prime}(z_i^{(l)})
+$$
+
+
+If the sigmoid function is used:
+
+$$
+f^{\prime}(z_i^{(l)}) = a_i^{(l)}(1-a_i^{(l)})
+$$
+
+Substituting into the two formulas above, we see that the partial derivatives are identical for all units in a layer. Every gradient descent iteration therefore applies identical updates, so the values stay identical iteration after iteration, and training ends with identical values (weights and intercepts).
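+
+This symmetry can be observed directly. The sketch below is an illustration under stated assumptions: a 3-3-1 sigmoid network with a squared-error loss trained on random toy data (the helper `train_hidden_weights`, the data, and the learning rate are hypothetical). It compares the first-layer weights after training from identical and from small random initial values.
+
+```python
+import numpy as np
+
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+
+def train_hidden_weights(W1, steps=100, lr=0.5):
+    """Train a 3-3-1 sigmoid network on toy data; return the final first-layer weights.
+
+    W1 has shape [inputs, hidden units], so each column belongs to one hidden unit.
+    """
+    rng = np.random.default_rng(0)
+    X = rng.normal(size=(20, 3))
+    y = (X.sum(axis=1, keepdims=True) > 0).astype(float)
+    W1 = W1.copy()
+    b1 = np.zeros(3)
+    W2 = np.full((3, 1), 0.5)
+    b2 = np.zeros(1)
+    for _ in range(steps):
+        a2 = sigmoid(X @ W1 + b1)                 # hidden activations
+        a3 = sigmoid(a2 @ W2 + b2)                # network output
+        delta3 = (a3 - y) * a3 * (1 - a3)         # output-layer error term
+        delta2 = (delta3 @ W2.T) * a2 * (1 - a2)  # hidden-layer error term
+        W2 -= lr * a2.T @ delta3 / len(X)
+        b2 -= lr * delta3.mean(axis=0)
+        W1 -= lr * X.T @ delta2 / len(X)
+        b1 -= lr * delta2.mean(axis=0)
+    return W1
+
+
+same_init = train_hidden_weights(np.full((3, 3), 0.5))
+random_init = train_hidden_weights(0.01 * np.random.default_rng(1).normal(size=(3, 3)))
+
+# With identical initial values, all hidden units (columns) are still identical after training.
+print(np.allclose(same_init, same_init[:, :1]))
+print(np.allclose(random_init, random_init[:, :1]))
+```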
+
+### 3.8.3 Initializing to a small random number
+
+Initializing the weights to very small random numbers is a common way to break the symmetry of the network. The idea is that the neurons are all random and unique at the start, so they compute different updates and integrate themselves as different parts of the network. An implementation for a weight matrix might look like `W = 0.01 * np.random.randn(D, H)`, where randn samples from a Gaussian distribution with zero mean and unit standard deviation. With this formulation, the weight vector of each neuron is initialized as a random vector sampled from a multidimensional Gaussian, so the neurons point in random directions in the input space. It is also possible to draw small numbers from a uniform distribution, but in practice this seems to have little effect on the final performance.
+
+Warning: smaller numbers do not necessarily work better. For example, if the weights of a layer are very small, backpropagation computes very small gradients on its data (because this gradient is proportional to the weights). This can greatly diminish the "gradient signal" as it flows backward through the network, which can become a concern in deep networks.
+
+### 3.8.4 Calibrating the variance with $ 1/\sqrt n $
+
+One problem with the suggestion above is that the variance of the output of a randomly initialized neuron grows with the number of inputs. It turns out that we can normalize the variance of each neuron's output to 1 by dividing its weight vector by the square root of its fan-in (the number of its inputs). That is, the recommended heuristic is to initialize the weight vector of each neuron as `w = np.random.randn(n) / np.sqrt(n)`, where n is the number of inputs. This ensures that all neurons in the network initially have approximately the same output distribution, and empirically it improves the rate of convergence.
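+
+The variance claim is easy to verify empirically. The following sketch assumes standard-normal inputs and a single linear neuron without a nonlinearity (the function name and the trial count are illustrative). It estimates the output variance with and without the $ 1/\sqrt n $ scaling.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+
+def output_variance(n, calibrated, trials=2000):
+    """Empirical variance of w.x for n inputs, over many random draws of w and x."""
+    x = rng.standard_normal((trials, n))
+    w = rng.standard_normal((trials, n))
+    if calibrated:
+        w = w / np.sqrt(n)  # the 1/sqrt(n) calibration
+    return np.var((w * x).sum(axis=1))
+
+
+for n in (10, 100, 1000):
+    print(n, output_variance(n, calibrated=False), output_variance(n, calibrated=True))
+```
+
+Without calibration the variance grows roughly in proportion to n; with calibration it stays close to 1 for every n.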
+
+### 3.8.5 Sparse Initialization
+
+Another way to address the uncalibrated variance problem is to set all weight matrices to zero but, to break the symmetry, randomly connect each neuron to a fixed number of neurons below it (with weights sampled from a small Gaussian as described above). A typical number of connections per neuron is as small as ten.
+
+### 3.8.6 Initializing the biases
+
+It is possible, and common, to initialize the biases to zero, because symmetry breaking is already provided by the small random weights. For ReLU nonlinearities, some people like to set all biases to a small constant such as 0.01, because this ensures that all ReLU units fire at the very beginning and can therefore obtain and propagate some gradient. However, it is not clear whether this gives a consistent improvement (in fact, some results indicate that it makes performance worse), so it is more common to simply initialize the biases to 0.
+
+## 3.9 Softmax
+
+### 3.9.1 Softmax Definition and Function
+
+Softmax is a function of the form:
+
+$$
+P(i) = \frac{\exp(\theta_i^T x)}{\sum_{k=1}^{K} \exp(\theta_k^T x)}
+$$
+
+where $ \theta_i $ and $ x $ are column vectors, and $ \theta_i^T x $ may be replaced by a function $ f_i(x) $ of $ x $.
+
+The softmax function keeps $ P(i) $ in the range $ [0,1] $. In regression and classification problems, $ \theta $ is usually the parameter to be found: the $ \theta_i $ that maximizes $ P(i) $ is taken as the best parameter.
+
+However, there are many ways to keep the range within $ [0,1] $, so why use the exponential function of $ e $? Consider the logistic function:
+
+$$
+P(i) = \frac{1}{1+exp(-\theta_i^T x)}
+$$
+
+This function drives $ P(i) $ toward 0 as its argument goes to negative infinity and toward 1 as its argument goes to positive infinity. Similarly, the softmax function uses the exponential function of $ e $ for the purpose of polarization: the result for a positive sample approaches 1 and the result for a negative sample approaches 0. This is convenient for multi-class problems (you can think of $ P(i) $ as the probability that a sample belongs to class $ i $). The softmax function can be regarded as a generalization of the logistic function.
+
+The softmax function maps its inputs, commonly called logits or logit scores, to values between 0 and 1 and normalizes the outputs so that they sum to 1. Its output is therefore a probability distribution over the classes, which makes it the usual output activation function for multi-class classification networks.
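+
+A minimal NumPy sketch of the softmax function follows. Subtracting the maximum logit before exponentiating is a common numerical-stability trick that is not part of the definition above; it leaves the result unchanged because the factor cancels between numerator and denominator.
+
+```python
+import numpy as np
+
+
+def softmax(logits):
+    """Map logits to probabilities that lie in [0, 1] and sum to 1."""
+    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # stability trick (added here)
+    exps = np.exp(shifted)
+    return exps / exps.sum(axis=-1, keepdims=True)
+
+
+probs = softmax(np.array([2.0, 1.0, 0.1]))
+print(probs, probs.sum())
+
+# Very large logits do not overflow thanks to the shift.
+print(softmax(np.array([1000.0, 1000.0])))
+```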
+
+### 3.9.2 Softmax Derivation
+
+## 3.10 Understanding the principle and function of One-Hot Encoding
+
+The origin of the problem
+
+In many **machine learning** tasks, features are not always continuous values, but may be categorical values.
+
+For example, consider the following three features:
+
+```
+["male", "female"]
+["from Europe", "from US", "from Asia"]
+["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]
+```
+
+If the above features are represented by numbers, processing is much more efficient. For example:
+
+```
+["male", "from US", "uses Internet Explorer"] is expressed as [0, 1, 3]
+["female", "from Asia", "uses Chrome"] is expressed as [1, 2, 1]
+```
+
+However, even after conversion to numbers, this data cannot be used directly in our classifier, because a classifier usually assumes that the data is continuous (so that distances can be computed) and ordered (whereas here 0 does not rank below or above 1). In the representation above, the numbers carry no order; they are assigned arbitrarily.
+
+**One-hot encoding**
+
+One possible solution to this problem is One-Hot Encoding. One-hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and at any time only one bit is valid.
+
+For example:
+
+```
+Natural binary code: 000, 001, 010, 011, 100, 101
+One-hot code: 000001, 000010, 000100, 001000, 010000, 100000
+```
+
+In other words, a feature with m possible values becomes m binary features after one-hot encoding (for example, a grade feature with the values good, medium, and poor becomes 100, 010, 001). These binary features are mutually exclusive, with only one active at a time, so the data becomes sparse.
+
+The main benefits of this are:
+
+1. It solves the problem that classifiers do not handle categorical data well;
+2. To a certain extent, it also expands the feature set.
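+
+Applied to the three example features above, one-hot encoding can be sketched as follows. The helper `one_hot_encode` is a hypothetical name written for this illustration; it concatenates one binary group per feature.
+
+```python
+def one_hot_encode(sample, categories):
+    """Encode each categorical value as a group of bits with exactly one bit set."""
+    encoded = []
+    for value, options in zip(sample, categories):
+        bits = [0] * len(options)
+        bits[options.index(value)] = 1  # only the bit of the observed value is set
+        encoded.extend(bits)
+    return encoded
+
+
+categories = [
+    ["male", "female"],
+    ["from Europe", "from US", "from Asia"],
+    ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"],
+]
+
+sample = ["male", "from US", "uses Internet Explorer"]
+encoded = one_hot_encode(sample, categories)
+print(encoded)  # 2 + 3 + 4 = 9 binary features
+```
+
+The three categorical features become nine binary features, which is the sparsity and feature expansion described above.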
+
+## 3.11 What are the commonly used optimizers?
+
+The optimizers provided by TensorFlow 1.x under `tf.train` are listed below:
+
+```
+Optimizer:
+tf.train.GradientDescentOptimizer
+tf.train.AdadeltaOptimizer
+tf.train.AdagradOptimizer
+tf.train.AdagradDAOptimizer
+tf.train.MomentumOptimizer
+tf.train.AdamOptimizer
+tf.train.FtrlOptimizer
+tf.train.ProximalGradientDescentOptimizer
+tf.train.ProximalAdagradOptimizer
+tf.train.RMSPropOptimizer
+```
+
+## 3.12 Dropout series questions
+
+### 3.12.1 Why do you want to regularize?
+Deep learning models may over-fit the data, that is, suffer from high variance. There are two remedies: regularization and more data. Getting more data is very reliable, but you may not always be able to collect enough training data, or doing so may be too costly. Regularization, by contrast, often helps to avoid over-fitting and to reduce the network's error.
+
+### 3.12.2 Why is regularization helpful in preventing overfitting?
+
+
+
+
+The left picture shows high bias, the right picture shows high variance, and the middle one is "just right".
+
+### 3.12.3 Understanding dropout regularization
+Dropout randomly removes units from the network. Why does this have such a strong regularizing effect?
+
+Intuition: a unit cannot rely on any single feature, because any of its inputs may be removed at random, so it is reluctant to put too much weight on any one input and instead spreads a little weight over each of the unit's four inputs. Spreading out the weights shrinks the squared norm of the weights, similar to the L2 regularization discussed earlier. Implementing dropout therefore compresses the weights and provides regularization that helps prevent over-fitting. Unlike plain L2 regularization, the decay that dropout applies differs from weight to weight, depending on the size of the activations being multiplied.
+
+### 3.12.4 Choice of dropout rate
+
+1. Cross-validation shows that a dropout rate of 0.5 works best for hidden nodes, because at 0.5 dropout randomly generates the largest number of network structures.
+2. Dropout can also be applied directly to the input as a way of adding noise. For the input layer, set the value closer to 1 (for example 0.8) so that the input does not change too much.
+3. Max-norm regularization of the parameters $ w $ during training is very useful for dropout training.
+4. The radius $ c $ of the max-norm ball is a hyperparameter that needs tuning; the validation set can be used for this.
+5. Dropout alone works well, but combining dropout with max-norm regularization, a large decaying learning rate, and high momentum works even better. For example, max-norm regularization prevents the parameters from blowing up because of the large learning rate.
+6. Pretraining also helps when training with dropout. When dropout is applied to pretrained parameters, multiply all of them by $ 1/p $.
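+
+For reference, a dropout forward pass can be sketched as below. This uses the "inverted dropout" variant, which is an assumption of this sketch: it rescales the kept activations by `1 / keep_prob` during training so that nothing has to be rescaled at test time. The function name and arguments are hypothetical.
+
+```python
+import numpy as np
+
+
+def dropout_forward(a, keep_prob, rng, training=True):
+    """Inverted dropout: keep each unit with probability keep_prob during training."""
+    if not training or keep_prob >= 1.0:
+        return a  # dropout is switched off
+    mask = rng.random(a.shape) < keep_prob
+    # Dividing by keep_prob keeps the expected activation unchanged.
+    return a * mask / keep_prob
+
+
+rng = np.random.default_rng(0)
+activations = np.ones((1000, 100))
+dropped = dropout_forward(activations, keep_prob=0.8, rng=rng)
+print(dropped.mean())          # close to 1.0 on average
+print((dropped == 0).mean())   # close to 0.2 of the units are zeroed
+```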
+
+### 3.12.5 What are the disadvantages of dropout?
+
+One major drawback of dropout is that the cost function J is no longer well defined. Each iteration randomly removes some nodes, so it is hard to check how gradient descent is performing. Normally a well-defined cost function J decreases after every iteration, but the cost function being optimized under dropout is not well defined, or at least hard to compute, so we lose the debugging tool of plotting J against the number of iterations. I usually turn dropout off by setting keep-prob to 1 and run the code to make sure that J decreases monotonically. Then I turn dropout back on and hope that no bug was introduced in the dropout code. Other methods can also be tried; although we have no statistics on how well they perform, they can be used together with dropout.
+
+## 3.13 What data augmentation methods are commonly used in deep learning?
+
+**(Contributor: Huang Qinjian - South China University of Technology)**
+
+- Color Jittering: color augmentation that changes image brightness, saturation, and contrast (this interpretation of color jittering may not be exact);
+
+- PCA Jittering: first compute the mean and standard deviation of each of the three RGB channels, then compute the covariance matrix over the entire training set, perform eigendecomposition, and use the resulting eigenvectors and eigenvalues for PCA Jittering;
+
+- Random Scale: scale transformation;
+
+- Random Crop: crops and scales the image using random image interpolation methods, including the Scale Jittering method (used by the VGG and ResNet models) or scale and aspect ratio augmentation;
+
+- Horizontal/Vertical Flip: horizontal/vertical flip;
+
+- Shift: translation transformation;
+
+- Rotation/Reflection: rotation/affine transformation;
+
+- Noise: Gaussian noise, blurring;
+
+- Label Shuffle: augmentation for class-imbalanced data;
+
+## 3.14 How to understand Internal Covariate Shift?
+
+**(Contributor: Huang Qinjian - South China University of Technology)**
+
+Why are deep neural network models hard to train? One important reason is that a deep network stacks many layers, and each layer's parameter update changes the distribution of the input data seen by the layers above it. As these changes accumulate layer by layer, the input distribution of the upper layers changes very sharply, so the upper layers must constantly re-adapt to parameter updates in the lower layers. To train the model well, we therefore have to set the learning rate, the initial weights, and the parameter update strategy very carefully.
+
+Google summarizes this phenomenon as Internal Covariate Shift, referred to as ICS. What is ICS?
+
+A classic assumption in statistical machine learning is that "the data distributions of the source domain and the target domain are consistent." If they are inconsistent, new machine learning problems arise, such as transfer learning and domain adaptation. Covariate shift is one branch of the inconsistent-distribution setting: the conditional probabilities of the source space and the target space are the same, but their marginal probabilities differ.
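+
+In symbols (a sketch using notation introduced here: $P_s$ and $P_t$ denote the source and target distributions, $X$ the input, and $Y$ the label), covariate shift can be written as:
+
+$$
+P_s(Y \mid X) = P_t(Y \mid X), \qquad P_s(X) \neq P_t(X)
+$$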
+
+Indeed, because the output of each layer of a neural network has passed through that layer's operations, its distribution clearly differs from the distribution of the layer's input, and the difference grows as the network gets deeper. However, the sample labels that these signals "indicate" remain unchanged, which matches the definition of covariate shift. Because the analysis concerns the signals between layers, the shift is called "internal".
+
+**What problems does ICS cause?**
+
+In short, the input data of each neuron is no longer "independent and identically distributed" (i.i.d.).
+
+First, the upper layers need to constantly adapt to the new input distribution, which slows down learning.
+
+Second, the inputs coming from the lower layers may drift toward larger or smaller values, pushing the upper layers into the saturation region so that learning stops prematurely.
+
+Third, the update of each layer will affect other layers, so the parameter update strategy of each layer needs to be as cautious as possible.
+
+## 3.15 When do you use local-conv? When do you use full convolution?
+
+**(Contributor: Liang Zhicheng - Meizu Technology)**
+
+1. When the data set has a globally consistent local feature distribution, that is, when there is a strong correlation between local features, full convolution is suitable.
+
+2. When different regions have different feature distributions, local-Conv is suitable.
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-1.png
new file mode 100644
index 00000000..5b065bc9
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-10.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-10.jpg
new file mode 100644
index 00000000..526dfa3b
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-10.jpg differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-11.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-11.jpg
new file mode 100644
index 00000000..d671beb2
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-11.jpg differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-12.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-12.png
new file mode 100644
index 00000000..af6b11a4
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-12.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-13.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-13.png
new file mode 100644
index 00000000..6d8799c6
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-13.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-14.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-14.png
new file mode 100644
index 00000000..124af841
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-14.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-15.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-15.png
new file mode 100644
index 00000000..a32940e4
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-15.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-16.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-16.png
new file mode 100644
index 00000000..856802fc
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-16.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-17.gif b/English version/ch03_DeepLearningFoundation/img/ch3/3-17.gif
new file mode 100644
index 00000000..ba665c65
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-17.gif differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-18.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-18.png
new file mode 100644
index 00000000..c17ecad8
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-18.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-19.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-19.png
new file mode 100644
index 00000000..cd20e854
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-19.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-2.png
new file mode 100644
index 00000000..3dbddcb4
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-2.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-20.gif b/English version/ch03_DeepLearningFoundation/img/ch3/3-20.gif
new file mode 100644
index 00000000..308e93d3
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-20.gif differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-21.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-21.png
new file mode 100644
index 00000000..c4403e78
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-21.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-22.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-22.png
new file mode 100644
index 00000000..05be6684
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-22.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-23.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-23.png
new file mode 100644
index 00000000..2e4c9b21
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-23.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-24.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-24.png
new file mode 100644
index 00000000..fbec6c04
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-24.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-25.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-25.png
new file mode 100644
index 00000000..cd2f64ce
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-25.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-26.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-26.png
new file mode 100644
index 00000000..9d67734e
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-26.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-27.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-27.png
new file mode 100644
index 00000000..9678e96a
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-27.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-28.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-28.png
new file mode 100644
index 00000000..ea2a6f96
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-28.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-29.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-29.png
new file mode 100644
index 00000000..1d589b63
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-29.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-3.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-3.png
new file mode 100644
index 00000000..c7178b54
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-3.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-30.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-30.png
new file mode 100644
index 00000000..0296411e
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-30.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-31.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-31.png
new file mode 100644
index 00000000..4c757ad3
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-31.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-32.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-32.png
new file mode 100644
index 00000000..ea2a6f96
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-32.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-33.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-33.png
new file mode 100644
index 00000000..9798bf81
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-33.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-34.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-34.png
new file mode 100644
index 00000000..71edbcc1
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-34.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-35.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-35.jpg
new file mode 100644
index 00000000..5bb87c99
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-35.jpg differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-36.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-36.png
new file mode 100644
index 00000000..5b90a4e3
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-36.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-37.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-37.png
new file mode 100644
index 00000000..087ae7b7
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-37.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-38.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-38.png
new file mode 100644
index 00000000..731c159e
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-38.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-39.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-39.jpg
new file mode 100644
index 00000000..50fe309f
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-39.jpg differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-4.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-4.png
new file mode 100644
index 00000000..a6f322cd
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-4.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-40.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-40.png
new file mode 100644
index 00000000..2303937c
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-40.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-41.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-41.png
new file mode 100644
index 00000000..a16e2651
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-41.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-5.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-5.png
new file mode 100644
index 00000000..d094c2c9
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-5.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-6.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-6.png
new file mode 100644
index 00000000..10ac6d41
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-6.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-7.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-7.jpg
new file mode 100644
index 00000000..3cdfc2d1
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-7.jpg differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-8.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-8.png
new file mode 100644
index 00000000..dd13a29c
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-8.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-9.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-9.jpg
new file mode 100644
index 00000000..49d4e00f
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-9.jpg differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.5.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.5.png
new file mode 100644
index 00000000..dde211ce
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.5.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.6.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.6.png
new file mode 100644
index 00000000..1bd76971
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.6.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.1.6.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.6.1.png
new file mode 100644
index 00000000..e0ce77d9
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.6.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.1.png
new file mode 100644
index 00000000..34f2d8b6
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.2.png
new file mode 100644
index 00000000..ad3d4bd6
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.2.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.1.png
new file mode 100644
index 00000000..160cf461
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.2.png
new file mode 100644
index 00000000..fec7b5b1
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.2.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.2.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.2.1.png
new file mode 100644
index 00000000..02dbca08
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.2.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.1.png
new file mode 100644
index 00000000..1c4dacd9
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.2.png
new file mode 100644
index 00000000..09f4d9e2
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.2.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.4.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.4.png
new file mode 100644
index 00000000..03f1c6a9
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.4.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.5.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.5.png
new file mode 100644
index 00000000..56443fd2
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.5.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.6.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.6.png
new file mode 100644
index 00000000..bda9c5cb
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.6.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.4.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.4.1.png
new file mode 100644
index 00000000..f91e8ab4
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.4.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.1.png
new file mode 100644
index 00000000..1236163e
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.2.png
new file mode 100644
index 00000000..c36f3bfd
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.2.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.3.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.3.png
new file mode 100644
index 00000000..d53c8ea8
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.3.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.4.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.4.png
new file mode 100644
index 00000000..f00cca44
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.4.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.1.png
new file mode 100644
index 00000000..3e126ba7
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.2.png
new file mode 100644
index 00000000..ea9b0c34
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.2.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.3.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.3.png
new file mode 100644
index 00000000..fe1d7779
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.3.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.6.3.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.6.3.1.png
new file mode 100644
index 00000000..d091389e
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.6.3.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.6.7.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.6.7.1.png
new file mode 100644
index 00000000..4b98c00e
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.6.7.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.8.2.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.8.2.1.png
new file mode 100644
index 00000000..1236163e
Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.8.2.1.png differ
diff --git a/English version/ch03_DeepLearningFoundation/readme.md b/English version/ch03_DeepLearningFoundation/readme.md
new file mode 100644
index 00000000..26ad0b97
--- /dev/null
+++ b/English version/ch03_DeepLearningFoundation/readme.md
@@ -0,0 +1,14 @@
+###########################################################
+
+\### Deep Learning 500 Questions - First * Chapter xxx
+
+**Responsible person (in no particular order):**
+Xxx graduate student-xxx(xxx)
+Xxx doctoral student-xxx
+Xxx-xxx
+
+
+**Contributors (in no particular order):**
+Content contributors can add information
+
+###########################################################
\ No newline at end of file
diff --git a/English version/ch04_ClassicNetwork/ChapterIV_ClassicNetwork.md b/English version/ch04_ClassicNetwork/ChapterIV_ClassicNetwork.md
new file mode 100644
index 00000000..f5c39d45
--- /dev/null
+++ b/English version/ch04_ClassicNetwork/ChapterIV_ClassicNetwork.md
@@ -0,0 +1,293 @@
+[TOC]
+
+# Chapter 4 Classic Network
+## 4.1 LeNet-5
+
+### 4.1.1 Introduction to the model
+
+LeNet-5 is a Convolutional Neural Network (CNN) $^{[1]}$ proposed by $LeCun$ for recognizing handwritten digits and machine-printed characters. The name comes from the author, $LeCun$, and 5 is the code name of the research effort; the earlier LeNet-4 and LeNet-1 are little known. LeNet-5 shows that the correlation between pixel features in an image can be extracted by convolution with shared parameters, and its combination of convolution, downsampling (pooling), and nonlinear mapping is the basis of most current deep image recognition networks.
+
+### 4.1.2 Model structure
+
+
+
+Figure 4.1 LeNet-5 network structure
+
+As shown in Figure 4.1, LeNet-5 consists of 7 layers (not counting the input layer): 2 convolutional layers, 2 downsampling layers, and 3 fully connected layers. The parameter configuration of the network is shown in Table 4.1. For the downsampling layers and the fully connected layers, the kernel size column gives the sampling range and the size of the connection matrix, respectively. For example, "$5\times5\times1/1,6$" in the kernel size column denotes 6 convolution kernels of size $5\times5\times1$ applied with a stride of $1$.
+
+Table 4.1 LeNet-5 Network Parameter Configuration
+
+| Network Layer | Input Size | Kernel Size | Output Size | Trainable Parameters |
+| :---: | :---: | :---: | :---: | :---: |
+| Convolutional layer $C_1$ | $32\times32\times1$ | $5\times5\times1/1,6$ | $28\times28\times6$ | $(5\times5\times1+1)\times6$ |
+| Downsampling layer $S_2$ | $28\times28\times6$ | $2\times2/2$ | $14\times14\times6$ | $(1+1)\times6$ $^*$ |
+| Convolutional layer $C_3$ | $14\times14\times6$ | $5\times5\times6/1,16$ | $10\times10\times16$ | $1516$ $^*$ |
+| Downsampling layer $S_4$ | $10\times10\times16$ | $2\times2/2$ | $5\times5\times16$ | $(1+1)\times16$ |
+| Convolutional layer $C_5$ $^*$ | $5\times5\times16$ | $5\times5\times16/1,120$ | $1\times1\times120$ | $(5\times5\times16+1)\times120$ |
+| Fully connected layer $F_6$ | $1\times1\times120$ | $120\times84$ | $1\times1\times84$ | $(120+1)\times84$ |
+| Output layer | $1\times1\times84$ | $84\times10$ | $1\times1\times10$ | $(84+1)\times10$ |
+
+> $^*$ In LeNet, the downsampling operation is similar to pooling, but the sampled result is multiplied by a coefficient and an offset term is added, so the number of parameters of the downsampling layer is $(1+1)\times6$ instead of zero.
+>
+> $^*$ The $C_3$ convolutional layer is not connected to all feature maps in $S_2$; it uses the sparse connection pattern shown in Figure 4.2. Of the 16 output feature maps, 6 are computed from three adjacent feature maps, 6 from four adjacent feature maps, 3 from four non-adjacent feature maps, and 1 from all six feature maps. The number of parameters is therefore $6\times(25\times3+1)+6\times(25\times4+1)+3\times(25\times4+1)+1\times(25\times6+1)=1516$. The original paper gives two reasons for this connection scheme: it keeps the number of connections from growing too large (computing power was limited at the time), and forcing different combinations of feature maps makes the resulting feature maps learn different feature patterns.
+
+
+
+Figure 4.2 Sparse connection between feature maps between $S_2$ and $C_3$
+
+> $^*$ The $C_5$ convolutional layer is drawn as a fully connected layer in Figure 4.1. The original paper explains that a convolution is actually used here, but because the $5\times5$ convolution reduces the output to exactly $1\times1$, the result looks very similar to a full connection.
+
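+The sizes and parameter counts in Table 4.1 can be checked by hand. As a sketch, assuming no padding and using symbols introduced here ($W$ for the input width, $K$ for the kernel width, $S$ for the stride), the output width of a convolutional or downsampling layer is:
+
+$$
+W_{out} = \frac{W - K}{S} + 1
+$$
+
+For example, $C_1$ gives $(32-5)/1+1=28$, $S_2$ gives $(28-2)/2+1=14$, $C_3$ gives $(14-5)/1+1=10$, $S_4$ gives $(10-2)/2+1=5$, and $C_5$ gives $(5-5)/1+1=1$. Summing the last column of the table gives the total number of trainable parameters:
+
+$$
+156 + 12 + 1516 + 32 + 48120 + 10164 + 850 = 60850
+$$
+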
+### 4.1.3 Model Features
+- The convolutional network uses a three-layer sequence combination: convolution, downsampling (pooling), and non-linear mapping (the most important feature of LeNet-5, which forms the basis of the current deep convolutional network)
+- Extract spatial features using convolution
+- Downsampling using the spatial average of the feature maps
+- Non-linear mapping using $tanh$ or $sigmoid$
+- Multilayer perceptron (MLP) as the final classifier
+- Sparse connection matrix between layers to avoid huge computational overhead
+
+## 4.2 AlexNet
+
+### 4.2.1 Introduction to the model
+
+AlexNet, proposed by $Alex$ $Krizhevsky$, was the first deep convolutional neural network applied to image classification. It won first place in the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) image classification competition with a top-5 test error rate of 15.3% $^{[2]}$. AlexNet runs on GPUs instead of CPUs, which allows a more complex model to be trained in an acceptable amount of time. Its success demonstrated the effectiveness of deep convolutional neural networks as complex models, made CNNs popular in computer vision, and directly or indirectly triggered the wave of deep learning.
+
+### 4.2.2 Model structure
+
+
+
+Figure 4.3 AlexNet network structure
+
+As shown in Figure 4.3, AlexNet consists of 8 layers, not counting the downsampling (pooling) layers and Local Response Normalization (LRN): the first 5 are convolutional layers and the remaining 3 are fully connected layers. The network is split into an upper and a lower branch, corresponding to the computation on the two GPUs. Except for a few layers where the GPUs exchange data (the $C_3$ convolutional layer and the $F_{6-8}$ fully connected layers), each GPU computes its results separately. The output of the last fully connected layer is fed to $softmax$, which yields the probabilities of the 1000 image classification labels. Apart from the parallel GPU design, the AlexNet structure is very similar to that of LeNet. The parameter configuration of the network is shown in Table 4.2.
+
+Table 4.2 AlexNet Network Parameter Configuration
+
+| Network Layer | Input Size | Kernel Size | Output Size | Trainable Parameters |
+| :---: | :---: | :---: | :---: | :---: |
+| Convolutional layer $C_1$ $^*$ | $224\times224\times3$ | $11\times11\times3/4,48(\times2_{GPU})$ | $55\times55\times48(\times2_{GPU})$ | $(11\times11\times3+1)\times48\times2$ |
+| Downsampling layer $S_{max}$ $^*$ | $55\times55\times48(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $27\times27\times48(\times2_{GPU})$ | 0 |
+| Convolutional layer $C_2$ | $27\times27\times48(\times2_{GPU})$ | $5\times5\times48/1,128(\times2_{GPU})$ | $27\times27\times128(\times2_{GPU})$ | $(5\times5\times48+1)\times128\times2$ |
+| Downsampling layer $S_{max}$ | $27\times27\times128(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $13\times13\times128(\times2_{GPU})$ | 0 |
+| Convolutional layer $C_3$ $^*$ | $13\times13\times128\times2_{GPU}$ | $3\times3\times256/1,192(\times2_{GPU})$ | $13\times13\times192(\times2_{GPU})$ | $(3\times3\times256+1)\times192\times2$ |
+| Convolutional layer $C_4$ | $13\times13\times192(\times2_{GPU})$ | $3\times3\times192/1,192(\times2_{GPU})$ | $13\times13\times192(\times2_{GPU})$ | $(3\times3\times192+1)\times192\times2$ |
+| Convolutional layer $C_5$ | $13\times13\times192(\times2_{GPU})$ | $3\times3\times192/1,128(\times2_{GPU})$ | $13\times13\times128(\times2_{GPU})$ | $(3\times3\times192+1)\times128\times2$ |
+| Downsampling layer $S_{max}$ | $13\times13\times128(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $6\times6\times128(\times2_{GPU})$ | 0 |
+| Fully connected layer $F_6$ $^*$ | $6\times6\times128\times2_{GPU}$ | $9216\times2048(\times2_{GPU})$ | $1\times1\times2048(\times2_{GPU})$ | $(9216+1)\times2048\times2$ |
+| Fully connected layer $F_7$ | $1\times1\times2048\times2_{GPU}$ | $4096\times2048(\times2_{GPU})$ | $1\times1\times2048(\times2_{GPU})$ | $(4096+1)\times2048\times2$ |
+| Fully connected layer $F_8$ | $1\times1\times2048\times2_{GPU}$ | $4096\times1000$ | $1\times1\times1000$ | $(4096+1)\times1000$ |
+
+> $^*$ The convolutional layer $C_1$ takes a $224\times224\times3$ input image and convolves it on each of the two GPUs with $11\times11\times3$ kernels at a stride of 4, producing two separate $55\times55\times48$ outputs.
+>
+> $^*$ The downsampling layer $S_{max}$ is actually a max pooling operation nested in the convolutional layer; it is listed separately to distinguish it from the convolutional layers that have no max pooling. The $C_{1-2}$ convolutional layers also contain an LRN operation (in the original paper it is applied after the ReLU activation and before the pooling operation), which normalizes adjacent feature points.
+>
+> $^*$ The input of the convolutional layer $C_3$ differs from that of the other convolutional layers. $13\times13\times128\times2_{GPU}$ means that the outputs of the previous layer on both GPUs are gathered as the input, so the channel dimension of the convolution kernel is 256.
+>
+> $^*$ The input of the fully connected layers $F_{6-8}$ is handled like that of $C_3$: it combines the outputs of both GPU branches.
+
+
+### 4.2.3 Model Features
+- All convolutional layers use ReLU as the nonlinear mapping function, which makes the model converge faster.
+- Training the model on multiple GPUs not only speeds up training but also allows more data to be used.
+- Local features are normalized with LRN in combination with the ReLU activation, which effectively reduces the error rate.
+- Overlapping max pooling is used, i.e. the pooling window size $z$ is larger than the stride $s$, $z>s$ (e.g. the kernel size of $S_{max}$ is $3\times3/2$), which avoids the blurring effect of average pooling.
+- Random dropout selectively ignores individual neurons during training to avoid overfitting of the model.
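
The overlapping pooling rule $z>s$ can be illustrated on a one-dimensional row. The following is a minimal sketch, assuming no padding; the helper name `max_pool_1d` and the sample row are made up for illustration. It also applies the same size rule to the $55$, $27$ and $13$ pixel feature maps that AlexNet pools with a $3\times3/2$ window.

```python
def max_pool_1d(values, window, stride):
    """Max-pool a 1-D sequence with the given window size and stride (no padding)."""
    n_out = (len(values) - window) // stride + 1
    return [max(values[i * stride : i * stride + window]) for i in range(n_out)]


row = [1, 3, 2, 5, 4, 0, 6]

# Overlapping pooling as in AlexNet: window z = 3 > stride s = 2,
# so neighbouring windows share one element.
overlapping = max_pool_1d(row, window=3, stride=2)

# Non-overlapping pooling for comparison: z = s = 2.
non_overlapping = max_pool_1d(row, window=2, stride=2)

# The same size rule applied to the spatial sizes pooled by S_max.
sizes = [(n - 3) // 2 + 1 for n in (55, 27, 13)]

print(overlapping, non_overlapping, sizes)
```

With $z=3$ and $s=2$ every interior element can contribute to two neighbouring windows, which is what "overlapping" refers to.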
+
+
+
+## 4.3 ZFNet
+### 4.3.1 Introduction to the model
+
+ZFNet is a large convolutional network by Matthew D. Zeiler and Rob Fergus that builds on AlexNet. It is usually credited with winning the 2013 ILSVRC image classification competition with an error rate of 11.19%. Strictly speaking, the original ZFNet entry was not the champion: it ranked 8th with a 13.51% error rate. The actual champion was the Clarifai team, but Zeiler was the CEO of the startup behind Clarifai, and Clarifai's network made only relatively small changes to ZFNet, so ZFNet is generally considered to have won the championship $^{[3-4]}$. ZFNet is essentially a fine-tuned AlexNet. Its paper visualizes the output features of each layer by means of deconvolution, which further explains why convolution operations are significant in large networks.
+
+### 4.3.2 Model structure
+
+
+
+Figure 4.4 ZFNet network structure diagram (original structure diagram and AlexNet style structure diagram)
+
+As shown in Figure 4.4, ZFNet is similar to AlexNet: it is a convolutional neural network with 8 layers, consisting of 5 convolutional layers and 3 fully connected layers. The biggest difference between the two architectures is that ZFNet replaces the $11\times11\times3/4$ convolution kernel of AlexNet's first layer with a $7\times7\times3/2$ kernel. As Figure 4.5 shows, the feature maps output by the first layer of ZFNet contain more mid-frequency information than those of AlexNet, whose first-layer feature maps carry mostly low-frequency or high-frequency information. The lack of mid-frequency features means that the features of the subsequent layer, as shown in Figure 4.5(c), are not detailed enough. The root cause of this problem is that the convolution kernel and stride used by AlexNet in the first layer are too large.
+
+
+
+
+
+Figure 4.5 (a) Feature maps output by the first layer of ZFNet (b) Feature maps output by the first layer of AlexNet (c) Feature maps output by the second layer of AlexNet (d) Feature maps output by the second layer of ZFNet
+
+Table 4.3 ZFNet network parameter configuration
+
+| Network Layer | Input Size | Kernel Size | Output Size | Number of Trainable Parameters |
+| :-------------------: | :----------------------------------: | :--------------------------------------: | :----------------------------------: | :-------------------------------------: |
+| Convolutional Layer $C_1$ $^*$ | $224\times224\times3$ | $7\times7\times3/2,96$ | $110\times110\times96$ | $(7\times7\times3+1)\times96$ |
+| Downsampling layer $S_{max}$ | $110\times110\times96$ | $3\times3/2$ | $55\times55\times96$ | 0 |
+| Convolutional Layer $C_2$ $^*$ | $55\times55\times96$ | $5\times5\times96/2,256$ | $26\times26\times256$ | $(5\times5\times96+1)\times256$ |
+| Downsampling layer $S_{max}$ | $26\times26\times256$ | $3\times3/2$ | $13\times13\times256$ | 0 |
+| Convolutional Layer $C_3$ | $13\times13\times256$ | $3\times3\times256/1,384$ | $13\times13\times384$ | $(3\times3\times256+1)\times384$ |
+| Convolutional layer $C_4$ | $13\times13\times384$ | $3\times3\times384/1,384$ | $13\times13\times384$ | $(3\times3\times384+1)\times384$ |
+| Convolutional layer $C_5$ | $13\times13\times384$ | $3\times3\times384/1,256$ | $13\times13\times256$ | $(3\times3\times384+1)\times256$ |
+| Downsampling layer $S_{max}$ | $13\times13\times256$ | $3\times3/2$ | $6\times6\times256$ | 0 |
+| Fully connected layer $F_6$ | $6\times6\times256$ | $9216\times4096$ | $1\times1\times4096$ | $(9216+1)\times4096$ |
+| Fully connected layer $F_7$ | $1\times1\times4096$ | $4096\times4096$ | $1\times1\times4096$ | $(4096+1)\times4096$ |
+| Fully connected layer $F_8$ | $1\times1\times4096$ | $4096\times1000$ | $1\times1\times1000$ | $(4096+1)\times1000$ |
+> Convolutional layer $C_1$ differs from $C_1$ in AlexNet: it uses a $7\times7\times3/2$ convolution kernel instead of $11\times11\times3/4$, so that the first-layer output contains more mid-frequency features. This gives subsequent network layers a more diverse set of features to choose from and helps capture finer details.
+>
+> Convolutional layer $C_2$ uses a convolution kernel with stride 2, which differs from the stride of $C_2$ in AlexNet, so the output dimensions are different.
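
The spatial sizes in Table 4.3 can be checked with the usual output-size formula. The sketch below is illustrative only: the table does not state padding or rounding, so the `pad` and `ceil_mode` values are assumptions chosen so that the computed sizes agree with the table, and the helper `out_size` is a made-up name.

```python
import math


def out_size(n, kernel, stride, pad=0, ceil_mode=False):
    """Spatial output size of a convolution or pooling layer."""
    span = (n + 2 * pad - kernel) / stride
    return (math.ceil(span) if ceil_mode else math.floor(span)) + 1


# (name, kernel, stride, pad, ceil_mode); pad and ceil_mode are assumptions
# picked so that the sizes agree with Table 4.3.
layers = [
    ("C1", 7, 2, 1, False),
    ("S_max", 3, 2, 0, True),
    ("C2", 5, 2, 0, False),
    ("S_max", 3, 2, 0, True),
    ("C3", 3, 1, 1, False),
    ("C4", 3, 1, 1, False),
    ("C5", 3, 1, 1, False),
    ("S_max", 3, 2, 0, True),
]

size = 224
trace = []
for name, kernel, stride, pad, ceil_mode in layers:
    size = out_size(size, kernel, stride, pad, ceil_mode)
    trace.append(size)
    print(f"{name}: {size} x {size}")
```

The stride of 2 in $C_1$ and $C_2$ is what produces the $110$ and $26$ pixel outputs that distinguish ZFNet from AlexNet in the table.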
+
+### 4.3.3 Model Features
+
+ZFNet and AlexNet are almost identical in structure, so the features listed here are, more precisely, the contributions of the visualization technique in the original ZFNet paper.
+
+- The visualization technique reveals the input stimuli that excite individual feature maps at each layer of the model.
+- It allows the evolution of features to be observed during the training phase and potential problems with the model to be diagnosed.
+- It uses a multi-layer deconvolutional network to project feature activations back to the input pixel space.
+- It supports a sensitivity analysis of the classifier output: occluding parts of the input image reveals which parts are important for classification.
+- It provides a non-parametric view of invariance, showing which patterns from the training set activate a given feature map. The result is not just a crop of the input image but a top-down projection that exposes the structures within each patch that stimulate a particular feature map.
+- It relies on deconvolution operations, which approximately reverse the convolution operations, to map features back onto pixels.
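
The sensitivity analysis mentioned in the list above can be sketched as an occlusion experiment: cover one patch of the image at a time and record how much the class score drops. This is a minimal sketch under stated assumptions, not the procedure from the paper; `score_fn` is a hypothetical stand-in for a trained classifier, and the patch size and fill value are arbitrary.

```python
def occlusion_sensitivity(image, score_fn, patch=2, fill=0.0):
    """Return a map of score drops when each patch of `image` is occluded.

    image    -- 2-D list of floats (a single-channel image)
    score_fn -- hypothetical callable mapping an image to a class score
    """
    height, width = len(image), len(image[0])
    base = score_fn(image)
    drops = []
    for top in range(0, height - patch + 1, patch):
        row = []
        for left in range(0, width - patch + 1, patch):
            occluded = [list(r) for r in image]  # copy so the input is untouched
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    occluded[y][x] = fill
            row.append(base - score_fn(occluded))
        drops.append(row)
    return drops


# Toy stand-in for a classifier: it only "looks at" the top-left 2x2 corner.
def toy_score(img):
    return img[0][0] + img[0][1] + img[1][0] + img[1][1]


image = [[1.0] * 4 for _ in range(4)]
sensitivity = occlusion_sensitivity(image, toy_score)
print(sensitivity)
```

Large values in the returned map mark the regions the score depends on; here only the top-left patch matters, by construction of `toy_score`.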
+
+## 4.4 Network in Network
+
+### 4.4.1 Introduction to the model
+Network In Network (NIN) was proposed by Min Lin et al. and achieved the best results at the time on the CIFAR-10 and CIFAR-100 classification tasks. Its network structure is a stack of three multilayer perceptrons, hence the name NIN $^{[5]}$. NIN examines the convolution kernel design in convolutional neural networks from a new perspective and replaces the linear mapping part of plain convolution with a subnetwork structure. This form of network structure inspired more complex convolutional neural networks; the Inception structure of GoogLeNet, introduced in Section 4.6, is derived from this idea.
+
+### 4.4.2 Model Structure
+
+
+Figure 4.6 NIN network structure
+
+NIN consists of three multilayer perceptron convolutional layers (MLPConv layers). Each MLPConv layer is composed of several local fully connected layers and nonlinear activation functions, which replace the linear convolution kernel used in a traditional convolutional layer. During inference, the multilayer perceptron computes local features of the input feature map, and the weights applied to the local feature map of each window are shared, which is exactly the same as in a convolution operation. The biggest difference is that the multilayer perceptron applies a nonlinear mapping to local features, whereas the traditional convolution is linear. NIN's network parameter configuration is shown in Table 4.4 (the original paper does not give the network parameters; the parameters in the table were compiled by the editors from the network structure diagram and the CIFAR-100 data set, using $3\times3$ convolution as an example).
+
+Table 4.4 NIN network parameter configuration (derived from the NIN structure in the original paper and the CIFAR-100 data)
+
+| Network Layer | Input Size | Kernel Size | Output Size | Number of Parameters |
+|:------:|:-------:|:------:|:--------:|:-------:|
+| Local fully connected layer $L_{11}$ $^*$ | $32\times32\times3$ | $(3\times3)\times16/1$ | $30\times30\times16$ | $(3\times3\times3+1)\times16$ |
+| Fully connected layer $L_{12}$ $^*$ | $30\times30\times16$ | $16\times16$ | $30\times30\times16$ | $((16+1)\times16)$ |
+| Local fully connected layer $L_{21}$ | $30\times30\times16$ | $(3\times3)\times64/1$ | $28\times28\times64$ | $(3\times3\times16+1)\times64$ |
+| Fully connected layer $L_{22}$ | $28\times28\times64$ | $64\times64$ | $28\times28\times64$ | $((64+1)\times64)$ |
+| Local fully connected layer $L_{31}$ | $28\times28\times64$ | $(3\times3)\times100/1$ | $26\times26\times100$ | $(3\times3\times64+1)\times100$ |
+| Fully connected layer $L_{32}$ | $26\times26\times100$ | $100\times100$ | $26\times26\times100$ | $((100+1)\times100)$ |
+| Global average pooling $GAP$ $^*$ | $26\times26\times100$ | $26\times26\times100/1$ | $1\times1\times100$ | $0$ |
+> The local fully connected layer $L_{11}$ is actually a windowed fully connected operation on the original input image, so the output feature size after windowing is $30\times30$ ($\frac{32-3_k+1}{1_{stride}}=30$).
+>
+> The fully connected layer $L_{12}$ is a fully connected operation that immediately follows $L_{11}$. Its input is the activated local response feature of each window, so it only needs to connect the nodes of $L_{11}$ and $L_{12}$. Each local fully connected layer together with the fully connected layer that follows it constitutes a multilayer perceptron convolutional layer (MLPConv), which replaces the convolution operation.
+>
+> The global average pooling layer $GAP$ (Global Average Pooling) performs a global average pooling operation on each feature map output by $L_{32}$ and directly yields one value per category, which effectively reduces the number of parameters.
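
The fully connected layers $L_{12}$, $L_{22}$ and $L_{32}$ apply one shared set of weights at every spatial position, followed by a nonlinearity. The sketch below shows that idea on nested Python lists; the function names, the tiny feature map and the choice of ReLU as the activation are illustrative assumptions, not taken from the original paper.

```python
def relu(x):
    return x if x > 0 else 0.0


def pointwise_fc(feature_map, weights, biases):
    """Apply the same fully connected layer (+ ReLU) at every spatial position.

    feature_map -- list [H][W][C_in]
    weights     -- list [C_out][C_in], shared by all positions
    biases      -- list [C_out]
    """
    return [
        [
            [
                relu(sum(w * v for w, v in zip(w_row, pixel)) + b)
                for w_row, b in zip(weights, biases)
            ]
            for pixel in row
        ]
        for row in feature_map
    ]


def fc_params(c_in, c_out):
    """Weights plus biases of one shared fully connected layer."""
    return (c_in + 1) * c_out


# Tiny 1x2 feature map with 2 input channels and 2 output channels.
fmap = [[[1.0, 2.0], [3.0, -1.0]]]
weights = [[1.0, 1.0], [1.0, -2.0]]
biases = [0.0, 0.5]
out = pointwise_fc(fmap, weights, biases)
print(out)

# Parameter counts of L12, L22 and L32 in Table 4.4.
param_counts = [fc_params(c, c) for c in (16, 64, 100)]
print(param_counts)
```

Because the weights do not depend on the position, the parameter count is independent of the feature map size, which matches the $((C+1)\times C)$ entries in the table.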
+
+### 4.4.3 Model Features
+
+- Using a multilayer perceptron structure instead of the convolution filtering operation not only effectively reduces the over-parameterization caused by too many convolution kernels, but also improves the abstraction ability of the model by introducing nonlinear mapping.
+- Using global average pooling instead of the last fully connected layer effectively reduces the number of parameters (the layer has no trainable parameters). Because the pooling uses the information of the entire feature map, it is more robust to spatial transformations, and the resulting output can be used directly as the confidence of the corresponding category.
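
Global average pooling itself is a one-line reduction: each feature map is collapsed to its mean, giving one value per category without any trainable parameters. A minimal sketch on nested lists follows; the three toy $2\times2$ maps are made up for illustration.

```python
def global_average_pooling(feature_maps):
    """Reduce each H x W feature map to its mean; input is a list [C][H][W]."""
    pooled = []
    for fmap in feature_maps:
        total = sum(sum(row) for row in fmap)
        count = len(fmap) * len(fmap[0])
        pooled.append(total / count)
    return pooled


# Three toy 2x2 feature maps standing in for three class maps.
maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
scores = global_average_pooling(maps)
predicted = scores.index(max(scores))
print(scores, predicted)
```

In a full network these per-map averages would typically be passed to a softmax to obtain class confidences.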
+
+## 4.5 VGGNet
+
+### 4.5.1 Introduction to the model
+
+VGGNet is a deep convolutional network structure proposed by the Visual Geometry Group (VGG) of Oxford University; the network name is taken from the group's abbreviation. It was the runner-up of the 2014 ILSVRC classification task with a 7.32% error rate (the champion, GoogLeNet, achieved 6.65%) and won first place in the localization task with a 25.32% error rate (GoogLeNet's error rate was 26.44%) $^{[6]}$. VGGNet was the first to reduce the error rate of image classification to less than 10%, and the idea of the $3\times3$ convolution kernel used in the network became the basis of many later models. The paper was published at the 2015 International Conference on Learning Representations (ICLR) and has since been cited more than 14,000 times.
+
+### 4.5.2 Model structure
+
+
+
+Figure 4.7 VGG16 network structure
+
+In the original paper, VGGNet comes in six versions: VGG11, VGG11-LRN, VGG13, VGG16-1, VGG16-3 and VGG19. The numeric suffix indicates the number of layers in the network. VGG11-LRN is VGG11 with LRN applied in the first layer; VGG16-1 means that the last convolutional layer in each of the last three convolutional blocks uses a $1\times1$ kernel, while VGG16-3 uses $3\times3$ kernels there. The VGG16 introduced in this section is VGG16-3. The VGG16 in Figure 4.7 embodies the core idea of VGGNet: use combinations of $3\times3$ convolutions instead of large convolutions (two stacked $3\times3$ convolutions have the same receptive field as one $5\times5$ convolution). The network parameter settings are shown in Table 4.5.
+
+Table 4.5 VGG16 network parameter configuration
+
+| Network Layer | Input Size | Kernel Size | Output Size | Number of Parameters |
+| :--------------------: | :-------------------: | :---------------------: | :--------------------: | :-----------------------------: |
+| Convolutional Layer $C_{11}$ | $224\times224\times3$ | $3\times3\times64/1$ | $224\times224\times64$ | $(3\times3\times3+1)\times64$ |
+| Convolutional Layer $C_{12}$ | $224\times224\times64$ | $3\times3\times64/1$ | $224\times224\times64$ | $(3\times3\times64+1)\times64$ |
+| Downsampling layer $S_{max1}$ | $224\times224\times64$ | $2\times2/2$ | $112\times112\times64$ | $0$ |
+| Convolutional layer $C_{21}$ | $112\times112\times64$ | $3\times3\times128/1$ | $112\times112\times128$ | $(3\times3\times64+1)\times128$ |
+| Convolutional Layer $C_{22}$ | $112\times112\times128$ | $3\times3\times128/1$ | $112\times112\times128$ | $(3\times3\times128+1)\times128$ |
+| Downsampling layer $S_{max2}$ | $112\times112\times128$ | $2\times2/2$ | $56\times56\times128$ | $0$ |
+| Convolutional layer $C_{31}$ | $56\times56\times128$ | $3\times3\times256/1$ | $56\times56\times256$ | $(3\times3\times128+1)\times256$ |
+| Convolutional layer $C_{32}$ | $56\times56\times256$ | $3\times3\times256/1$ | $56\times56\times256$ | $(3\times3\times256+1)\times256$ |
+| Convolutional layer $C_{33}$ | $56\times56\times256$ | $3\times3\times256/1$ | $56\times56\times256$ | $(3\times3\times256+1)\times256$ |
+| Downsampling layer $S_{max3}$ | $56\times56\times256$ | $2\times2/2$ | $28\times28\times256$ | $0$ |
+| Convolutional Layer $C_{41}$ | $28\times28\times256$ | $3\times3\times512/1$ | $28\times28\times512$ | $(3\times3\times256+1)\times512$ |
+| Convolutional Layer $C_{42}$ | $28\times28\times512$ | $3\times3\times512/1$ | $28\times28\times512$ | $(3\times3\times512+1)\times512$ |
+| Convolutional Layer $C_{43}$ | $28\times28\times512$ | $3\times3\times512/1$ | $28\times28\times512$ | $(3\times3\times512+1)\times512$ |
+| Downsampling layer $S_{max4}$ | $28\times28\times512$ | $2\times2/2$ | $14\times14\times512$ | $0$ |
+| Convolutional Layer $C_{51}$ | $14\times14\times512$ | $3\times3\times512/1$ | $14\times14\times512$ | $(3\times3\times512+1)\times512$ |
+| Convolutional Layer $C_{52}$ | $14\times14\times512$ | $3\times3\times512/1$ | $14\times14\times512$ | $(3\times3\times512+1)\times512$ |
+| Convolutional Layer $C_{53}$ | $14\times14\times512$ | $3\times3\times512/1$ | $14\times14\times512$ | $(3\times3\times512+1)\times512$ |
+| Downsampling layer $S_{max5}$ | $14\times14\times512$ | $2\times2/2$ | $7\times7\times512$ | $0$ |
+| Fully connected layer $FC_{1}$ | $7\times7\times512$ | $(7\times7\times512)\times4096$ | $1\times4096$ | $(7\times7\times512+1)\times4096$ |
+| Fully connected layer $FC_{2}$ | $1\times4096$ | $4096\times4096$ | $1\times4096$ | $(4096+1)\times4096$ |
+| Fully connected layer $FC_{3}$ | $1\times4096$ | $4096\times1000$ | $1\times1000$ | $(4096+1)\times1000$ |
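
The last column of Table 4.5 can be summed to get the total number of trainable parameters. The sketch below simply evaluates the table's own formulas, $(k\times k\times C_{in}+1)\times C_{out}$ for convolutions and $(N_{in}+1)\times N_{out}$ for fully connected layers; the helper names are made up for illustration.

```python
def conv_params(kernel, c_in, c_out):
    """(k * k * C_in + 1) * C_out, the formula used in the table."""
    return (kernel * kernel * c_in + 1) * c_out


def fc_params(n_in, n_out):
    """(N_in + 1) * N_out, weights plus one bias per output unit."""
    return (n_in + 1) * n_out


# (input channels, output channels) of the 13 convolutional layers of VGG16.
conv_channels = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
fc_sizes = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(3, c_in, c_out) for c_in, c_out in conv_channels)
fc_total = sum(fc_params(n_in, n_out) for n_in, n_out in fc_sizes)
print(conv_total, fc_total, conv_total + fc_total)
```

The sum shows that the three fully connected layers hold most of the parameters, with $FC_1$ alone accounting for over 100 million.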
+
+### 4.5.3 Model Features
+
+- The entire network uses the same convolution kernel size, $3\times3$, and the same max pooling size, $2\times2$.
+- The $1\times1$ convolution mainly acts as a linear transformation: the number of input channels equals the number of output channels, so no dimensionality reduction occurs.
+- Two stacked $3\times3$ convolutional layers are equivalent to one $5\times5$ convolutional layer, with a receptive field of $5\times5$. Similarly, three stacked $3\times3$ convolutions are equivalent to one $7\times7$ convolutional layer. Stacking in this way reduces the number of network parameters, and the additional activation functions give the network a stronger ability to learn features.
+- VGGNet uses a trick in training: it first trains the shallow, simple network VGG11 and then reuses the weights of VGG11 to initialize VGG13. Repeating this train-and-initialize procedure up to VGG19 makes training converge faster.
+- Multi-scale transformations are used during training to augment the original data, making the model less prone to overfitting.
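
The claim about stacked $3\times3$ convolutions can be checked numerically. For stride-1 layers the receptive field grows by $k-1$ per layer, and with $C$ input and $C$ output channels each layer has $k^2C^2$ weights. The sketch below assumes stride 1, ignores biases, and uses an illustrative channel count of 64 that is not taken from the text.

```python
def stacked_receptive_field(num_layers, kernel):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def stacked_weights(num_layers, kernel, channels):
    """Weight count (biases ignored) when input and output both have `channels`."""
    return num_layers * kernel * kernel * channels * channels


channels = 64  # illustrative channel count, not taken from the text
for num_layers, big_kernel in ((2, 5), (3, 7)):
    rf = stacked_receptive_field(num_layers, 3)
    small = stacked_weights(num_layers, 3, channels)
    large = stacked_weights(1, big_kernel, channels)
    print(f"{num_layers} x 3x3: receptive field {rf}, {small} weights "
          f"vs {large} for one {big_kernel}x{big_kernel}")
```

In general the comparison is $18C^2$ against $25C^2$ for two layers and $27C^2$ against $49C^2$ for three, so the saving grows with the size of the kernel being replaced.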
+
+## 4.6 GoogLeNet
+### 4.6.1 Introduction to the model
+
+As the winner of the 2014 ILSVRC classification task, GoogLeNet beat VGGNet and other models with an error rate of 6.65%, a large improvement in classification accuracy over the champions of the previous two years, ZFNet and AlexNet. As the name **GoogLe**Net suggests, this network structure was designed by Google engineers, and the name Goog**LeNet** is also a tribute to LeNet $^{[1]}$. The core of GoogLeNet is its internal subnetwork structure, Inception, which is inspired by NIN and has gone through four iterations (Inception$_{v1-4}$).
+
+
+Figure 4.8 Inception performance comparison chart
+
+### 4.6.2 Model Structure
+
+
+Figure 4.9 GoogLeNet network structure
+
+As shown in Figure 4.9, GoogLeNet extends the width of the network in addition to the depth pursued by earlier convolutional neural network structures. The entire network is built from many block subnetworks, and each such subnetwork is an Inception structure. Figure 4.10 shows the four versions of Inception: $Inception_{v1}$ uses different convolution kernels in the same layer and merges the convolution results; $Inception_{v2}$ stacks combinations of different convolution kernels and merges the convolution results; $Inception_{v3}$ explores deeper combinations on the basis of $v_2$; and $Inception_{v4}$ is more complex than the previous versions, nesting subnetworks inside the subnetwork.
+
+
+
+
+
+Figure 4.10 Inception$_{v1-4}$ structure diagram
+
+Table 4.6 Inception$_{v1}$ Network Parameter Configuration in GoogLeNet
+
+| Network Layer | Input Size | Kernel Size | Output Size | Number of Parameters |
+| :--------------------: | :-------------------: | :---------------------: | :--------------------: | :-----------------------------: |
+| Convolutional layer $C_{11}$ | $H\times{W}\times{C_1}$ | $1\times1\times{C_2}/2$ | $\frac{H}{2}\times\frac {W}{2}\times{C_2}$ | $(1\times1\times{C_1}+1)\times{C_2}$ |
+| Convolutional layer $C_{21}$ | $H\times{W}\times{C_2}$ | $1\times1\times{C_2}/2$ | $\frac{H}{2}\times\frac {W}{2}\times{C_2}$ | $(1\times1\times{C_2}+1)\times{C_2}$ |
+| Convolutional Layer $C_{22}$ | $H\times{W}\times{C_2}$ | $3\times3\times{C_2}/1$ | $H\times{W}\times{C_2}/ 1$ | $(3\times3\times{C_2}+1)\times{C_2}$ |
+| Convolutional layer $C_{31}$ | $H\times{W}\times{C_1}$ | $1\times1\times{C_2}/2$ | $\frac{H}{2}\times\frac {W}{2}\times{C_2}$ | $(1\times1\times{C_1}+1)\times{C_2}$ |
+| Convolutional Layer $C_{32}$ | $H\times{W}\times{C_2}$ | $5\times5\times{C_2}/1$ | $H\times{W}\times{C_2}/ 1$ | $(5\times5\times{C_2}+1)\times{C_2}$ |
+| Downsampling layer $S_{41}$ | $H\times{W}\times{C_1}$ | $3\times3/2$ | $\frac{H}{2}\times\frac{W}{2 }\times{C_2}$ | $0$ |
+| Convolutional Layer $C_{42}$ | $\frac{H}{2}\times\frac{W}{2}\times{C_2}$ | $1\times1\times{C_2}/1$ | $ \frac{H}{2}\times\frac{W}{2}\times{C_2}$ | $(1\times1\times{C_2}+1)\times{C_2}$ |
+| Merge layer $M$ | $\frac{H}{2}\times\frac{W}{2}\times{C_2}(\times4)$ | Stitching | $\frac{H}{2}\times \frac{W}{2}\times({C_2}\times4)$ | $0$ |
+
+### 4.6.3 Model Features
+
+- Convolution kernels of different sizes correspond to receptive fields of different sizes, and the final concatenation fuses features at different scales.
+- The kernel sizes 1, 3 and 5 are chosen mainly for ease of alignment. With stride=1, setting pad=0, 1 and 2 respectively yields features with the same spatial dimensions after convolution, so they can be concatenated directly.
+- The deeper a layer sits in the network, the more abstract its features and the larger the receptive field each feature covers, so the proportion of 3x3 and 5x5 convolutions increases with the number of layers. However, a 5x5 convolution kernel is still computationally expensive. For this reason, the paper borrows from NIN and uses 1x1 convolution kernels for dimensionality reduction.
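
Both points above, the padding alignment and the saving from $1\times1$ reduction, can be checked with a few lines of arithmetic. The sketch below is illustrative: the $28\times28\times192$ input, the 32 filters of size $5\times5$ and the reduction to 16 channels are assumed example sizes, and the helper names are made up.

```python
def conv_out(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution."""
    return (n + 2 * pad - kernel) // stride + 1


def conv_multiplies(h, w, kernel, c_in, c_out):
    """Multiplications of a stride-1, size-preserving convolution."""
    return h * w * c_out * kernel * kernel * c_in


# Kernel sizes 1, 3, 5 with pad 0, 1, 2 keep the spatial size unchanged.
n = 28
aligned = [conv_out(n, k, stride=1, pad=(k - 1) // 2) for k in (1, 3, 5)]

# Illustrative sizes (assumptions): 28x28x192 input, 32 filters of 5x5,
# with an optional 1x1 reduction to 16 channels in front.
direct = conv_multiplies(28, 28, 5, 192, 32)
reduced = conv_multiplies(28, 28, 1, 192, 16) + conv_multiplies(28, 28, 5, 16, 32)
print(aligned, direct, reduced)
```

With these example sizes, inserting the $1\times1$ reduction cuts the multiplication count of the $5\times5$ branch by roughly a factor of ten.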
+
+## 4.7 Why are current CNN models based on GoogLeNet, VGGNet or AlexNet?
+
+- Evaluation and comparison: to make results more convincing, published work is compared against a standard baseline and improves on it. Common tasks such as detection and segmentation typically use VGG or ResNet101 as the backbone network.
+- Limited time and energy: under research and work pressure, time and energy only allow exploration within a limited scope.
+- Model innovation is difficult: improving a base model requires a great deal of experimentation, accumulated experience and strong inspiration, and the return on that investment is likely to be small.
+- Resource limitations: creating a new model requires a lot of time and computing resources, which is often not feasible for schools and small business teams.
+- In actual application scenarios, however, there are in fact a large number of non-standard model configurations.
+
+## Related Literature
+
+[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, November 1998.
+
+[2] A. Krizhevsky, I. Sutskever and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. *Advances in Neural Information Processing Systems 25*. Curran Associates, Inc. 1097–1105.
+
+[3] LSVRC-2013. http://www.image-net.org/challenges/LSVRC/2013/results.php
+
+[4] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. *European Conference on Computer Vision*.
+
+[5] M. Lin, Q. Chen, and S. Yan. Network in network. *Computing Research Repository*, abs/1312.4400, 2013.
+
+[6] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. *International Conference on Learning Representations*, 2015.
+
+[7] Bharath Raj. [a-simple-guide-to-the-versions-of-the-inception-network](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202), 2018.
+
+[8] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. [Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/pdf/1602.07261.pdf), 2016.
+
+[9] Sik-Ho Tsang. [review-inception-v4-evolved-from-googlenet-merged-with-resnet-idea-image-classification](https://towardsdatascience.com/review-inception-v4-evolved-from-googlenet-merged-with-resnet-idea-image-classification-5e8c339d18bc), 2018.
+
+[10] Zbigniew Wojna, Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens. [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567v3.pdf), 2015.
+
+[11] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. [Going deeper with convolutions](https://arxiv.org/pdf/1409.4842v1.pdf), 2014.
diff --git a/English version/ch04_ClassicNetwork/img/ch4/LeNet-5.jpg b/English version/ch04_ClassicNetwork/img/ch4/LeNet-5.jpg
new file mode 100644
index 00000000..fb65aedd
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/LeNet-5.jpg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/alexnet.png b/English version/ch04_ClassicNetwork/img/ch4/alexnet.png
new file mode 100644
index 00000000..1a954ca5
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/alexnet.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/featureMap.jpg b/English version/ch04_ClassicNetwork/img/ch4/featureMap.jpg
new file mode 100644
index 00000000..2e077ef9
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/featureMap.jpg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image10.png b/English version/ch04_ClassicNetwork/img/ch4/image10.png
new file mode 100644
index 00000000..ef7ed01d
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image10.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image11.GIF b/English version/ch04_ClassicNetwork/img/ch4/image11.GIF
new file mode 100644
index 00000000..ab72b16f
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image11.GIF differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image12.png b/English version/ch04_ClassicNetwork/img/ch4/image12.png
new file mode 100644
index 00000000..a8428209
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image12.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image13.png b/English version/ch04_ClassicNetwork/img/ch4/image13.png
new file mode 100644
index 00000000..56c7a125
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image13.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image14.png b/English version/ch04_ClassicNetwork/img/ch4/image14.png
new file mode 100644
index 00000000..0f52f559
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image14.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image15.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image15.jpeg
new file mode 100644
index 00000000..2ea99b40
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image15.jpeg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image18.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image18.jpeg
new file mode 100644
index 00000000..ba94f6c7
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image18.jpeg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image19.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image19.jpeg
new file mode 100644
index 00000000..920f056f
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image19.jpeg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image2.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image2.jpeg
new file mode 100644
index 00000000..284e36e2
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image2.jpeg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image20.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image20.jpeg
new file mode 100644
index 00000000..48f8eb45
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image20.jpeg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image21.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image21.jpeg
new file mode 100644
index 00000000..6afa65c4
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image21.jpeg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image22.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image22.jpeg
new file mode 100644
index 00000000..4a46634e
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image22.jpeg differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image23.jpeg" b/English version/ch04_ClassicNetwork/img/ch4/image23.jpeg
similarity index 100%
rename from "ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image23.jpeg"
rename to English version/ch04_ClassicNetwork/img/ch4/image23.jpeg
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image24.png b/English version/ch04_ClassicNetwork/img/ch4/image24.png
new file mode 100644
index 00000000..538640e7
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image24.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image25.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image25.jpeg
new file mode 100644
index 00000000..536e6551
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image25.jpeg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image26.png b/English version/ch04_ClassicNetwork/img/ch4/image26.png
new file mode 100644
index 00000000..51c7d189
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image26.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image27.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image27.jpeg
new file mode 100644
index 00000000..6e526ee1
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image27.jpeg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image28.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image28.jpeg
new file mode 100644
index 00000000..931a1d64
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image28.jpeg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image3.png b/English version/ch04_ClassicNetwork/img/ch4/image3.png
new file mode 100644
index 00000000..aeedc2de
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image3.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image31.png b/English version/ch04_ClassicNetwork/img/ch4/image31.png
new file mode 100644
index 00000000..5d238865
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image31.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image32.png b/English version/ch04_ClassicNetwork/img/ch4/image32.png
new file mode 100644
index 00000000..7b463dae
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image32.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image33.png b/English version/ch04_ClassicNetwork/img/ch4/image33.png
new file mode 100644
index 00000000..43d7908f
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image33.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image34.png b/English version/ch04_ClassicNetwork/img/ch4/image34.png
new file mode 100644
index 00000000..95016085
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image34.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image35.png b/English version/ch04_ClassicNetwork/img/ch4/image35.png
new file mode 100644
index 00000000..083675ca
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image35.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image36.png b/English version/ch04_ClassicNetwork/img/ch4/image36.png
new file mode 100644
index 00000000..404390cf
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image36.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image37.png b/English version/ch04_ClassicNetwork/img/ch4/image37.png
new file mode 100644
index 00000000..6f163002
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image37.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image38.png b/English version/ch04_ClassicNetwork/img/ch4/image38.png
new file mode 100644
index 00000000..2bed01bc
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image38.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image39.png b/English version/ch04_ClassicNetwork/img/ch4/image39.png
new file mode 100644
index 00000000..66c49a23
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image39.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image4.png b/English version/ch04_ClassicNetwork/img/ch4/image4.png
new file mode 100644
index 00000000..37193778
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image4.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image40.png b/English version/ch04_ClassicNetwork/img/ch4/image40.png
new file mode 100644
index 00000000..8643e43e
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image40.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image41.png b/English version/ch04_ClassicNetwork/img/ch4/image41.png
new file mode 100644
index 00000000..32bf30da
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image41.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image42.png b/English version/ch04_ClassicNetwork/img/ch4/image42.png
new file mode 100644
index 00000000..233128ec
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image42.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image43.png b/English version/ch04_ClassicNetwork/img/ch4/image43.png
new file mode 100644
index 00000000..43713358
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image43.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image44.png b/English version/ch04_ClassicNetwork/img/ch4/image44.png
new file mode 100644
index 00000000..c1da2658
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image44.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image45.png b/English version/ch04_ClassicNetwork/img/ch4/image45.png
new file mode 100644
index 00000000..f5257d08
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image45.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image46.png b/English version/ch04_ClassicNetwork/img/ch4/image46.png
new file mode 100644
index 00000000..3dbb5118
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image46.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image47.png b/English version/ch04_ClassicNetwork/img/ch4/image47.png
new file mode 100644
index 00000000..adcebabd
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image47.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image48.png b/English version/ch04_ClassicNetwork/img/ch4/image48.png
new file mode 100644
index 00000000..97801f04
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image48.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image49.png b/English version/ch04_ClassicNetwork/img/ch4/image49.png
new file mode 100644
index 00000000..883bc7b3
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image49.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image5.png b/English version/ch04_ClassicNetwork/img/ch4/image5.png
new file mode 100644
index 00000000..6c518d67
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image5.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image50.png b/English version/ch04_ClassicNetwork/img/ch4/image50.png
new file mode 100644
index 00000000..e265b45a
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image50.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image51.png b/English version/ch04_ClassicNetwork/img/ch4/image51.png
new file mode 100644
index 00000000..34133a06
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image51.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image52.png b/English version/ch04_ClassicNetwork/img/ch4/image52.png
new file mode 100644
index 00000000..a6ed65ba
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image52.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image53.png b/English version/ch04_ClassicNetwork/img/ch4/image53.png
new file mode 100644
index 00000000..ec92de40
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image53.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image54.png b/English version/ch04_ClassicNetwork/img/ch4/image54.png
new file mode 100644
index 00000000..3558da80
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image54.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image55.png b/English version/ch04_ClassicNetwork/img/ch4/image55.png
new file mode 100644
index 00000000..963436a2
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image55.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image56.png b/English version/ch04_ClassicNetwork/img/ch4/image56.png
new file mode 100644
index 00000000..c07aed22
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image56.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image6.png b/English version/ch04_ClassicNetwork/img/ch4/image6.png
new file mode 100644
index 00000000..f1509b8b
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image6.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image60.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image60.jpeg
new file mode 100644
index 00000000..5f45a372
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image60.jpeg differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image61.png b/English version/ch04_ClassicNetwork/img/ch4/image61.png
new file mode 100644
index 00000000..dd7b1283
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image61.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image62.png b/English version/ch04_ClassicNetwork/img/ch4/image62.png
new file mode 100644
index 00000000..221fb4bd
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image62.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image63.png b/English version/ch04_ClassicNetwork/img/ch4/image63.png
new file mode 100644
index 00000000..38167b0c
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image63.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image7.png b/English version/ch04_ClassicNetwork/img/ch4/image7.png
new file mode 100644
index 00000000..0c5fe01f
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image7.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image8.png b/English version/ch04_ClassicNetwork/img/ch4/image8.png
new file mode 100644
index 00000000..20a6a5d2
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image8.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/image9.png b/English version/ch04_ClassicNetwork/img/ch4/image9.png
new file mode 100644
index 00000000..b2fc2e73
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image9.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/vgg16.png b/English version/ch04_ClassicNetwork/img/ch4/vgg16.png
new file mode 100644
index 00000000..3dca57d1
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/vgg16.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer1.png b/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer1.png
new file mode 100644
index 00000000..7fb93f7f
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer1.png differ
diff --git a/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer2.png b/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer2.png
new file mode 100644
index 00000000..c2e2f6c7
Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer2.png differ
diff --git a/English version/ch04_ClassicNetwork/readme.md b/English version/ch04_ClassicNetwork/readme.md
new file mode 100644
index 00000000..5072a519
--- /dev/null
+++ b/English version/ch04_ClassicNetwork/readme.md
@@ -0,0 +1,12 @@
+###########################################################
+
+### Deep Learning 500 Questions - Chapter 4 Classic Network
+
+**Maintainers (in no particular order):**
+Huang Qinjian, graduate student, South China University of Technology (wechat: HQJ199508212176, email: csqjhuang@mail.scut.edu.cn)
+
+
+**Contributors (in no particular order):**
+Content contributors may add their own details here
+
+###########################################################
diff --git a/README.md b/README.md
index f75c2afe..877e0ba1 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,8 @@
+
+# 英文版本
+
+[English version](https://github.com/scutan90/DeepLearning-500-questions/tree/master/English%20version)
+
# 1. 版权声明
请尊重作者的知识产权,版权所有,翻版必究。 未经许可,严禁转发内容!
请大家一起维护自己的劳动成果,进行监督。 未经许可, 严禁转发内容!
@@ -12,7 +17,25 @@
1、寻求有愿意继续完善的朋友、编辑、写手;如有意合作,完善出书(成为共同作者)。
2、所有提交内容的贡献者,将会在文中体现贡献者个人信息(例: 大佬-西湖大学)
3、为了让内容更充实完善,集思广益,欢迎Fork该项目并参与编写。请在修改MD文件的同时(或直接留言)备注自己的姓名-单位(大佬-斯坦福大学),一经采纳,会在原文中显示贡献者的信息,谢谢!
-4、推荐使用typora-Markdown阅读器:https://typora.io/
+4、推荐使用typora-Markdown阅读器:https://typora.io/
+
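+Settings:
+File -> Preferences
+
+- Markdown extended syntax
+ - Inline math
+ - Subscript
+ - Superscript
+ - Highlight
+ - Diagrams
+
+
+Check all of these.
+
+- Math formulas
+ - Auto-numbering
+
+Check this as well.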
例子:
@@ -37,18 +60,13 @@
# 5. 更多
1. 寻求有愿意继续完善的朋友、编辑、写手; 如有意合作,完善出书(成为共同作者)。
- 所有提交内容的贡献者,将会在文中体现贡献者个人信息(大佬-西湖大学)。
+ 所有提交内容的贡献者,将会在文中体现贡献者个人信息(大佬-西湖大学)。
2. 联系方式 : 请联系scutjy2015@163.com (唯一官方邮箱); 微信Tan:
(进群先在MD版本增加、改善、提交内容后,更易进群,享受分享知识帮助他人。)
- 进群请加微信 委托人1:HQJ199508212176 委托人2:Xuwumin1203 委托人3:tianyuzy
-
-
-
- 
-
+ 
3. Markdown阅读器推荐:https://typora.io/ 免费且对于数学公式显示支持的比较好。
@@ -57,7 +75,13 @@
5. 接下来,将提供MD版本,大家一起编辑完善,敬请期待!希望踊跃提建议,补充修改内容!
-# 6. 目录
+# 6. 友情链接
+
+[FlyAI-AI竞赛平台](https://www.flyai.com/)
+
+
+
+# 7. 目录
**第一章 数学基础 1**
@@ -548,7 +572,7 @@
15.8.3 可以在SPARK环境里训练或者部署模型吗?
15.8.4 怎么进一步优化性能?
15.8.5 TPU和GPU的区别?
- 15.8.6 未来量子计算对于深度学习等AI技术的影像?
+ 15.8.6 未来量子计算对于深度学习等AI技术的影响?
**参考文献 366**
diff --git a/WechatIMG3.jpeg b/WechatIMG3.jpeg
deleted file mode 100644
index 8a693040..00000000
Binary files a/WechatIMG3.jpeg and /dev/null differ
diff --git "a/ch01_\346\225\260\345\255\246\345\237\272\347\241\200/readme.md" "b/ch01_\346\225\260\345\255\246\345\237\272\347\241\200/readme.md"
index 5564579d..4397ac7b 100644
--- "a/ch01_\346\225\260\345\255\246\345\237\272\347\241\200/readme.md"
+++ "b/ch01_\346\225\260\345\255\246\345\237\272\347\241\200/readme.md"
@@ -10,6 +10,6 @@
**贡献者(排名不分先后):**
内容贡献者可自加信息
-刘彦超-东南大学
-
+刘彦超-东南大学
+刘元德-上海理工大学(内容修订)
###########################################################
diff --git "a/ch01_\346\225\260\345\255\246\345\237\272\347\241\200/\347\254\254\344\270\200\347\253\240_\346\225\260\345\255\246\345\237\272\347\241\200.md" "b/ch01_\346\225\260\345\255\246\345\237\272\347\241\200/\347\254\254\344\270\200\347\253\240_\346\225\260\345\255\246\345\237\272\347\241\200.md"
index 27f26799..78a48323 100644
--- "a/ch01_\346\225\260\345\255\246\345\237\272\347\241\200/\347\254\254\344\270\200\347\253\240_\346\225\260\345\255\246\345\237\272\347\241\200.md"
+++ "b/ch01_\346\225\260\345\255\246\345\237\272\347\241\200/\347\254\254\344\270\200\347\253\240_\346\225\260\345\255\246\345\237\272\347\241\200.md"
@@ -2,19 +2,9 @@
# 第一章 数学基础
-> Markdown Revision 1; --update 2018/10/30 13:00
-
-> Date: 2018/10/25 -- 2018/10/30 -- 2018/11/01
-
-> Editor: 谈继勇 &乔成磊-同济大学 & 哈工大博士生-袁笛
-
-> Contact: scutjy2015@163.com & qchl0318@163.com & dyuanhit@gmail.com
-
-
-
## 1.1 标量、向量、矩阵、张量之间的联系
-**标量(scalar)**
-一个标量表示一个单独的数,它不同于线性代数中研究的其他大部分对象(通常是多个数的数组)。我们用斜体表示标量。标量通常被赋予小写的变量名称。
+**标量(scalar)**
+一个标量表示一个单独的数,它不同于线性代数中研究的其他大部分对象(通常是多个数的数组)。我们用斜体表示标量。标量通常被赋予小写的变量名称。
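**Vector**
A vector is an ordered array of numbers; the index into that order identifies each individual entry. Vectors are usually given bold lowercase names such as $\boldsymbol{x}$, and their elements are written in italics with a subscript: the first element of $\boldsymbol{x}$ is $x_1$, the second is $x_2$, and so on. We also state the type of the elements stored in the vector (real, imaginary, and so on).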
@@ -36,7 +26,7 @@
- 从代数角度讲, 矩阵它是向量的推广。向量可以看成一维的“表格”(即分量按照顺序排成一排), 矩阵是二维的“表格”(分量按照纵横位置排列), 那么$n$阶张量就是所谓的$n$维的“表格”。 张量的严格定义是利用线性映射来描述。
- 从几何角度讲, 矩阵是一个真正的几何量,也就是说,它是一个不随参照系的坐标变换而变化的东西。向量也具有这种特性。
- 张量可以用3×3矩阵形式来表达。
-- 表示标量的数和表示矢量的三维数组也可分别看作1×1,1×3的矩阵。
+- 表示标量的数和表示向量的三维数组也可分别看作1×1,1×3的矩阵。
## 1.3 矩阵和向量相乘结果
一个$m$行$n$列的矩阵和$n$行向量相乘,最后得到就是一个$m$行的向量。运算法则就是矩阵中的每一行数据看成一个行向量与该向量作点乘。
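As a minimal sketch of the rule above (illustrative only; the helper name `matvec` and the sample values are assumptions, not part of the original text), each entry of the result is the dot product of one matrix row with the vector, so an $m\times n$ matrix times a length-$n$ vector yields a length-$m$ vector:

```python
def matvec(matrix, vector):
    """Multiply an m x n matrix (given as a list of rows) by a length-n vector."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("every row must have as many entries as the vector")
    # Each output entry is the dot product of one matrix row with the vector.
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


A = [[1, 2, 3],
     [4, 5, 6]]      # m = 2 rows, n = 3 columns
x = [1, 0, -1]       # n = 3 entries
y = matvec(A, x)     # m = 2 entries
print(y)
```

The shape check makes the dimension requirement explicit: the product is only defined when the number of columns equals the length of the vector.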
@@ -63,21 +53,21 @@ $$
\Vert\vec{x}\Vert_{-\infty}=\min{|{x_i}|}
$$
-- 向量的正无穷范数:向量的所有元素的绝对值中最大的:上述向量$\vec{a}$的负无穷范数结果就是:10。
+- 向量的正无穷范数:向量的所有元素的绝对值中最大的:上述向量$\vec{a}$的正无穷范数结果就是:10。
$$
\Vert\vec{x}\Vert_{+\infty}=\max{|{x_i}|}
$$
-- 向量的p范数:向量元素绝对值的p次方和的1/p次幂。
-
+- 向量的p范数:
+
$$
L_p=\Vert\vec{x}\Vert_p=\sqrt[p]{\sum_{i=1}^{N}|{x_i}|^p}
$$
**矩阵的范数**
-定义一个矩阵$A=[-1, 2, -3; 4, -6, 6]$。 任意矩阵定义为:$A_{m\times n}$,其元素为 $a_{ij}$。
+定义一个矩阵$A=[-1, 2, -3; 4, -6, 6]$。 任意矩阵定义为:$A_{m\times n}$,其元素为 $a_{ij}$。
矩阵的范数定义为
@@ -85,32 +75,30 @@ $$
\Vert{A}\Vert_p :=\sup_{x\neq 0}\frac{\Vert{Ax}\Vert_p}{\Vert{x}\Vert_p}
$$
-当向量取不同范数时, 相应得到了不同的矩阵范数。
+当向量取不同范数时, 相应得到了不同的矩阵范数。
- **矩阵的1范数(列范数)**:矩阵的每一列上的元素绝对值先求和,再从中取个最大的,(列和最大),上述矩阵$A$的1范数先得到$[5,8,9]$,再取最大的最终结果就是:9。
-
$$
\Vert A\Vert_1=\max_{1\le j\le n}\sum_{i=1}^m|{a_{ij}}|
$$
- **矩阵的2范数**:矩阵$A^TA$的最大特征值开平方根,上述矩阵$A$的2范数得到的最终结果是:10.0623。
-
+
$$
\Vert A\Vert_2=\sqrt{\lambda_{max}(A^T A)}
$$
-其中, $\lambda_{max}(A^T A)$ 为 $A^T A$ 的特征值绝对值的最大值。
-- **矩阵的无穷范数(行范数)**:矩阵的每一行上的元素绝对值先求和,再从中取个最大的,(行和最大),上述矩阵$A$的1范数先得到$[6;16]$,再取最大的最终结果就是:16。
-
+其中, $\lambda_{max}(A^T A)$ 为 $A^T A$ 的特征值绝对值的最大值。
+- **矩阵的无穷范数(行范数)**:矩阵的每一行上的元素绝对值先求和,再从中取个最大的,(行和最大),上述矩阵$A$的行范数先得到$[6;16]$,再取最大的最终结果就是:16。
$$
-\Vert A\Vert_{\infty}=\max_{1\le i \le n}\sum_{j=1}^n |{a_{ij}}|
+\Vert A\Vert_{\infty}=\max_{1\le i \le m}\sum_{j=1}^n |{a_{ij}}|
$$
- **矩阵的核范数**:矩阵的奇异值(将矩阵svd分解)之和,这个范数可以用来低秩表示(因为最小化核范数,相当于最小化矩阵的秩——低秩),上述矩阵A最终结果就是:10.9287。
- **矩阵的L0范数**:矩阵的非0元素的个数,通常用它来表示稀疏,L0范数越小0元素越多,也就越稀疏,上述矩阵$A$最终结果就是:6。
- **矩阵的L1范数**:矩阵中的每个元素绝对值之和,它是L0范数的最优凸近似,因此它也可以表示稀疏,上述矩阵$A$最终结果就是:22。
-- **矩阵的F范数**:矩阵的各个元素平方之和再开平方根,它通常也叫做矩阵的L2范数,它的优点在它是一个凸函数,可以求导求解,易于计算,上述矩阵A最终结果就是:10.0995。
+- **矩阵的F范数**:矩阵的各个元素平方之和再开平方根,它通常也叫做矩阵的L2范数,它的优点在于它是一个凸函数,可以求导求解,易于计算,上述矩阵A最终结果就是:10.0995。
$$
\Vert A\Vert_F=\sqrt{(\sum_{i=1}^m\sum_{j=1}^n{| a_{ij}|}^2)}
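The values quoted above for $A=[-1, 2, -3; 4, -6, 6]$ can be checked numerically. The sketch below uses only the Python standard library; the variable names are illustrative assumptions. The 2-norm and the nuclear norm are obtained from the eigenvalues of the $2\times 2$ matrix $AA^T$, which has the same nonzero eigenvalues as $A^TA$.

```python
import math

A = [[-1, 2, -3],
     [4, -6, 6]]

# Column sums and row sums of absolute values.
col_sums = [sum(abs(row[j]) for row in A) for j in range(len(A[0]))]
row_sums = [sum(abs(a) for a in row) for row in A]

norm_1 = max(col_sums)                                    # largest column sum
norm_inf = max(row_sums)                                  # largest row sum
norm_l0 = sum(1 for row in A for a in row if a != 0)      # count of nonzeros
norm_l1 = sum(abs(a) for row in A for a in row)           # sum of |a_ij|
norm_fro = math.sqrt(sum(a * a for row in A for a in row))

# Entries of the symmetric 2x2 matrix A A^T.
g11 = sum(a * a for a in A[0])
g22 = sum(a * a for a in A[1])
g12 = sum(a * b for a, b in zip(A[0], A[1]))

# Closed-form eigenvalues of a symmetric 2x2 matrix.
trace = g11 + g22
det = g11 * g22 - g12 * g12
disc = math.sqrt(trace * trace - 4 * det)
eigenvalues = [(trace + disc) / 2, (trace - disc) / 2]

singular_values = [math.sqrt(e) for e in eigenvalues]
norm_2 = singular_values[0]          # largest singular value
norm_nuclear = sum(singular_values)  # sum of singular values

print(col_sums, norm_1)
print(row_sums, norm_inf)
print(norm_l0, norm_l1)
print(round(norm_2, 4), round(norm_nuclear, 4), round(norm_fro, 4))
```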
@@ -135,7 +123,7 @@ $$
## 1.6 导数偏导计算
**导数定义**:
-导数代表了在自变量变化趋于无穷小的时候,函数值的变化与自变量的变化的比值。几何意义是这个点的切线。物理意义是该时刻的(瞬时)变化率。
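+The derivative is the limit of the ratio between the change in the function value and the change in the independent variable as the latter tends to zero. Geometrically it is the slope of the tangent line at the point; physically it is the instantaneous rate of change at that moment.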
*注意*:在一元函数中,只有一个自变量变动,也就是说只存在一个方向的变化率,这也就是为什么一元函数没有偏导数的原因。在物理学中有平均速度和瞬时速度之说。平均速度有
@@ -144,49 +132,49 @@ $$
v=\frac{s}{t}
$$
-其中$v$表示平均速度,$s$表示路程,$t$表示时间。这个公式可以改写为
+其中$v$表示平均速度,$s$表示路程,$t$表示时间。这个公式可以改写为
$$
\bar{v}=\frac{\Delta s}{\Delta t}=\frac{s(t_0+\Delta t)-s(t_0)}{\Delta t}
$$
-其中$\Delta s$表示两点之间的距离,而$\Delta t$表示走过这段距离需要花费的时间。当$\Delta t$趋向于0($\Delta t \to 0$)时,也就是时间变得很短时,平均速度也就变成了在$t_0$时刻的瞬时速度,表示成如下形式:
+其中$\Delta s$表示两点之间的距离,而$\Delta t$表示走过这段距离需要花费的时间。当$\Delta t$趋向于0($\Delta t \to 0$)时,也就是时间变得很短时,平均速度也就变成了在$t_0$时刻的瞬时速度,表示成如下形式:
$$
v(t_0)=\lim_{\Delta t \to 0}{\bar{v}}=\lim_{\Delta t \to 0}{\frac{\Delta s}{\Delta t}}=\lim_{\Delta t \to 0}{\frac{s(t_0+\Delta t)-s(t_0)}{\Delta t}}
$$
-实际上,上式表示的是路程$s$关于时间$t$的函数在$t=t_0$处的导数。一般的,这样定义导数:如果平均变化率的极限存在,即有
+实际上,上式表示的是路程$s$关于时间$t$的函数在$t=t_0$处的导数。一般的,这样定义导数:如果平均变化率的极限存在,即有
$$
\lim_{\Delta x \to 0}{\frac{\Delta y}{\Delta x}}=\lim_{\Delta x \to 0}{\frac{f(x_0+\Delta x)-f(x_0)}{\Delta x}}
$$
-则称此极限为函数 $y=f(x)$ 在点 $x_0$ 处的导数。记作 $f'(x_0)$ 或 $y'\vert_{x=x_0}$ 或 $\frac{dy}{dx}\vert_{x=x_0}$ 或 $\frac{df(x)}{dx}\vert_{x=x_0}$。
+则称此极限为函数 $y=f(x)$ 在点 $x_0$ 处的导数。记作 $f'(x_0)$ 或 $y'\vert_{x=x_0}$ 或 $\frac{dy}{dx}\vert_{x=x_0}$ 或 $\frac{df(x)}{dx}\vert_{x=x_0}$。
-通俗地说,导数就是曲线在某一点切线的斜率。
+通俗地说,导数就是曲线在某一点切线的斜率。
**偏导数**:
-既然谈到偏导数,那就至少涉及到两个自变量。以两个自变量为例,z=f(x,y),从导数到偏导数,也就是从曲线来到了曲面。曲线上的一点,其切线只有一条。但是曲面上的一点,切线有无数条。而偏导数就是指多元函数沿着坐标轴的变化率。
-
+既然谈到偏导数,那就至少涉及到两个自变量。以两个自变量为例,$z=f(x,y)$,从导数到偏导数,也就是从曲线来到了曲面。曲线上的一点,其切线只有一条。但是曲面上的一点,切线有无数条。而偏导数就是指多元函数沿着坐标轴的变化率。
+
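*Note*: intuitively, a partial derivative is the rate of change of the function at a point along the positive direction of a coordinate axis.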
-设函数$z=f(x,y)$在点$(x_0,y_0)$的领域内有定义,当$y=y_0$时,$z$可以看作关于$x$的一元函数$f(x,y_0)$,若该一元函数在$x=x_0$处可导,即有
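+Suppose the function $z=f(x,y)$ is defined in a neighborhood of the point $(x_0,y_0)$. With $y=y_0$ held fixed, $z$ becomes a function $f(x,y_0)$ of $x$ alone. If this one-variable function is differentiable at $x=x_0$, that is, if the following limit exists: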
$$
\lim_{\Delta x \to 0}{\frac{f(x_0+\Delta x,y_0)-f(x_0,y_0)}{\Delta x}}=A
$$
-函数的极限$A$存在。那么称$A$为函数$z=f(x,y)$在点$(x_0,y_0)$处关于自变量$x$的偏导数,记作$f_x(x_0,y_0)$或$\frac{\partial z}{\partial x}\vert_{y=y_0}^{x=x_0}$或$\frac{\partial f}{\partial x}\vert_{y=y_0}^{x=x_0}$或$z_x\vert_{y=y_0}^{x=x_0}$。
+函数的极限$A$存在。那么称$A$为函数$z=f(x,y)$在点$(x_0,y_0)$处关于自变量$x$的偏导数,记作$f_x(x_0,y_0)$或$\frac{\partial z}{\partial x}\vert_{y=y_0}^{x=x_0}$或$\frac{\partial f}{\partial x}\vert_{y=y_0}^{x=x_0}$或$z_x\vert_{y=y_0}^{x=x_0}$。
-偏导数在求解时可以将另外一个变量看做常数,利用普通的求导方式求解,比如$z=3x^2+xy$关于$x$的偏导数就为$z_x=6x+y$,这个时候$y$相当于$x$的系数。
+偏导数在求解时可以将另外一个变量看做常数,利用普通的求导方式求解,比如$z=3x^2+xy$关于$x$的偏导数就为$z_x=6x+y$,这个时候$y$相当于$x$的系数。
-某点$(x_0,y_0)$处的偏导数的几何意义为曲面$z=f(x,y)$与面$x=x_0$或面$y=y_0$交线在$y=y_0$或$x=x_0$处切线的斜率。
+某点$(x_0,y_0)$处的偏导数的几何意义为曲面$z=f(x,y)$与面$x=x_0$或面$y=y_0$交线在$y=y_0$或$x=x_0$处切线的斜率。
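The worked example $z=3x^2+xy$ can be checked numerically. The sketch below approximates each partial derivative by holding the other variable fixed and taking a central difference (a symmetric variant of the limit in the definition, chosen here for accuracy). The function names and the evaluation point are assumptions made for illustration.

```python
def f(x, y):
    return 3 * x**2 + x * y


def partial_x(func, x, y, h=1e-6):
    # Hold y fixed and vary only x.
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    # Hold x fixed and vary only y.
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x0, y0 = 2.0, 5.0
dz_dx = partial_x(f, x0, y0)   # analytic value: 6*x0 + y0
dz_dy = partial_y(f, x0, y0)   # analytic value: x0
print(dz_dx, 6 * x0 + y0)
print(dz_dy, x0)
```

The numerical values agree with $z_x=6x+y$ and $z_y=x$ up to floating-point error.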
## 1.7 导数和偏导数有什么区别?
-导数和偏导没有本质区别,如果极限存在,都是当自变量的变化量趋于0时,函数值的变化量与自变量变化量比值的极限。
+导数和偏导没有本质区别,如果极限存在,都是当自变量的变化量趋于0时,函数值的变化量与自变量变化量比值的极限。
> - 一元函数,一个$y$对应一个$x$,导数只有一个。
> - 二元函数,一个$z$对应一个$x$和一个$y$,有两个导数:一个是$z$对$x$的导数,一个是$z$对$y$的导数,称之为偏导。
@@ -204,37 +192,36 @@ A\nu = \lambda \nu
$$
$\lambda$为特征向量$\vec{v}$对应的特征值。特征值分解是将一个矩阵分解为如下形式:
-
+
$$
-A=Q\Sigma Q^{-1}
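+A=Q\Sigma Q^{-1}
$$
-其中,$Q$是这个矩阵$A$的特征向量组成的矩阵,$\Sigma$是一个对角矩阵,每一个对角线元素就是一个特征值,里面的特征值是由大到小排列的,这些特征值所对应的特征向量就是描述这个矩阵变化方向(从主要的变化到次要的变化排列)。也就是说矩阵$A$的信息可以由其特征值和特征向量表示。
+Here $Q$ is the matrix whose columns are the eigenvectors of $A$, and $\Sigma$ is a diagonal matrix whose diagonal entries are the eigenvalues, conventionally sorted from largest to smallest. The corresponding eigenvectors describe the directions along which the matrix acts, ordered from the most significant to the least significant. In other words, the information in $A$ can be expressed through its eigenvalues and eigenvectors.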
## 1.9 奇异值与特征值有什么关系?
-那么奇异值和特征值是怎么对应起来的呢?我们将一个矩阵$A$的转置乘以$A$,并对$AA^T$求特征值,则有下面的形式:
+那么奇异值和特征值是怎么对应起来的呢?我们将一个矩阵$A$的转置乘以$A$,并对$A^TA$求特征值,则有下面的形式:
$$
(A^TA)V = \lambda V
$$
-这里$V$就是上面的右奇异向量,另外还有:
+这里$V$就是上面的右奇异向量,另外还有:
$$
\sigma_i = \sqrt{\lambda_i}, u_i=\frac{1}{\sigma_i}Av_i
$$
-这里的$\sigma$就是奇异值,$u$就是上面说的左奇异向量。【证明那个哥们也没给】
-奇异值$\sigma$跟特征值类似,在矩阵$\Sigma$中也是从大到小排列,而且$\sigma$的减少特别的快,在很多情况下,前10%甚至1%的奇异值的和就占了全部的奇异值之和的99%以上了。也就是说,我们也可以用前$r$($r$远小于$m、n$)个的奇异值来近似描述矩阵,即部分奇异值分解:
-
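+Here $\sigma$ denotes the singular values and $u$ the left singular vectors mentioned above (no proof is given here).
+Like the eigenvalues, the singular values $\sigma$ are arranged in the matrix $\Sigma$ from largest to smallest, and they often decay very quickly: in many cases the largest 10% or even 1% of the singular values account for more than 99% of the sum of all singular values. We can therefore approximate the matrix using only the $r$ largest singular values, with $r$ much smaller than $m$ and $n$. This is the partial (truncated) singular value decomposition: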
$$
-A_{m\times n}\approx U_{m \times r}\Sigma_{r\times r}V_{r \times n}^T
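+A_{m\times n}\approx U_{m \times r}\Sigma_{r\times r}V_{r \times n}^T
$$
The product of the three matrices on the right is a matrix close to $A$; the closer $r$ is to the rank of $A$ (which is at most $\min(m,n)$), the closer the product is to $A$.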
## 1.10 机器学习为什么要使用概率?
-事件的概率是衡量该事件发生的可能性的量度。虽然在一次随机试验中某个事件的发生是带有偶然性的,但那些可在相同条件下大量重复的随机试验却往往呈现出明显的数量规律。
+事件的概率是衡量该事件发生的可能性的量度。虽然在一次随机试验中某个事件的发生是带有偶然性的,但那些可在相同条件下大量重复的随机试验却往往呈现出明显的数量规律。
机器学习除了处理不确定量,也需处理随机量。不确定性和随机性可能来自多个方面,使用概率论来量化不确定性。
概率论在机器学习中扮演着一个核心角色,因为机器学习算法的设计通常依赖于对数据的概率假设。
@@ -243,7 +230,7 @@ $$
## 1.11 变量与随机变量有什么区别?
**随机变量**(random variable)
-表示随机现象(在一定条件下,并不总是出现相同结果的现象称为随机现象)中各种结果的实值函数(一切可能的样本点)。例如某一时间内公共汽车站等车乘客人数,电话交换台在一定时间内收到的呼叫次数等,都是随机变量的实例。
+表示随机现象(在一定条件下,并不总是出现相同结果的现象称为随机现象)中各种结果的实值函数(一切可能的样本点)。例如某一时间内公共汽车站等车乘客人数,电话交换台在一定时间内收到的呼叫次数等,都是随机变量的实例。
随机变量与模糊变量的不确定性的本质差别在于,后者的测定结果仍具有不确定性,即模糊性。
**变量与随机变量的区别:**
@@ -253,33 +240,161 @@ $$
> 当变量$x$值为100的概率为1的话,那么$x=100$就是确定了的,不会再有变化,除非有进一步运算.
> 当变量$x$的值为100的概率不为1,比如为50的概率是0.5,为100的概率是0.5,那么这个变量就是会随不同条件而变化的,是随机变量,取到50或者100的概率都是0.5,即50%。
-## 1.12 常见概率分布
-(https://wenku.baidu.com/view/6418b0206d85ec3a87c24028915f804d2b168707)
-
-
-
-
-
-
-
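+## 1.12 How are random variables and probability distributions related?
+
+A random variable only describes the states that can occur; a probability distribution must accompany it to specify how likely each state is. A **probability distribution** is the description of how likely each possible state of a random variable, or of a set of random variables, is.
+
+Random variables are either discrete or continuous.
+
+The functions that describe their distributions are, respectively:
+
+Probability mass function (PMF): describes the distribution of a discrete random variable, usually denoted by an uppercase $P$.
+
+Probability density function (PDF): describes the distribution of a continuous random variable, usually denoted by a lowercase $p$.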
+
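+### 1.12.1 Discrete random variables and the probability mass function
+
+A PMF maps each state that a random variable can take to the probability of the variable taking that state.
+
+- In general, $P(x)$ denotes the probability that $X=x$.
+- To avoid ambiguity, the random variable is sometimes named explicitly: $P(X=x)$.
+- Sometimes a random variable is introduced first and its distribution is specified afterwards, written $X \sim P(X)$.
+
+A PMF can act on several random variables at once, giving a joint probability distribution: $P(X=x,Y=y)$ is the probability that $X=x$ and $Y=y$ hold simultaneously, abbreviated $P(x,y)$.
+
+A function $P$ is a PMF of the random variable $X$ only if it satisfies the following three conditions:
+
+- The domain of $P$ is the set of all possible states of $X$.
+- $\forall x\in X$, $0 \leq P(x) \leq 1$.
+- $\sum_{x\in X} P(x)=1$. This property is called being normalized.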
+
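+### 1.12.2 Continuous random variables and the probability density function
+
+A function $p$ is a PDF of $X$ only if it satisfies the following conditions:
+
+- The domain of $p$ is the set of all possible states of $X$.
+- $\forall x\in X$, $p(x)\geq 0$. Note that $p(x)\leq 1$ is not required: $p(x)$ is not the probability of the state itself but a relative likelihood (a density). Actual probabilities are obtained by integration.
+- $\int p(x)dx=1$: the density integrates to a total probability of 1.
+
+Note: the PDF $p(x)$ does not give the probability of a specific state directly. It gives a density: the probability of landing in an infinitesimal region of volume $\delta x$ is $p(x)\delta x$. The probability of a single exact state therefore cannot be read off from $p(x)$; what can be computed is the probability that $x$ falls in an interval $[a,b]$, namely $\int_{a}^{b}p(x)dx$.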
+
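+## 1.13 Common probability distributions
+
+### 1.13.1 Bernoulli distribution
+
+The **Bernoulli distribution** is the distribution of a single binary random variable. It is controlled by a single parameter $\phi\in[0,1]$, which gives the probability that the variable equals 1. Its main properties are:
+$$
+\begin{align*}
+P(\mathrm{x}=1) &= \phi \\
+P(\mathrm{x}=0) &= 1-\phi \\
+P(\mathrm{x}=x) &= \phi^x(1-\phi)^{1-x} \\
+\end{align*}
+$$
+Its expectation and variance are:
+$$
+\begin{align*}
+E_\mathrm{x}[\mathrm{x}] &= \phi \\
+Var_\mathrm{x}(\mathrm{x}) &= \phi{(1-\phi)}
+\end{align*}
+$$
+The **Multinoulli distribution**, also called the **categorical distribution**, is the distribution of a single discrete random variable with $k$ states, where $k$ is finite; it is often used to represent **distributions over object categories**. It is parameterized by a vector $\vec{p}\in[0,1]^{k-1}$, where $p_i$ is the probability of the $i$-th state and the last state has probability $p_k=1-\vec{1}^T\vec{p}$.
+
+**Scope**: the Bernoulli and Multinoulli distributions are suited to modeling **discrete** random variables.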
+
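+### 1.13.2 Gaussian distribution
+
+The Gaussian distribution is also called the normal distribution. Its probability density function is:
+$$
+N(x;\mu,\sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}exp\left ( -\frac{1}{2\sigma^2}(x-\mu)^2 \right )
+$$
+Here $\mu$ is the mean and $\sigma$ is the standard deviation (so $\sigma^2$ is the variance). The peak is centered at $x=\mu$, where the density attains its maximum; the width of the peak is controlled by $\sigma$; and the inflection points are at $x=\mu\pm\sigma$.
+
+For a normal distribution, the probabilities of falling within $\pm1\sigma$, $\pm2\sigma$ and $\pm3\sigma$ of the mean are about 68.3%, 95.5% and 99.73% respectively; these three numbers are worth remembering.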
-## 1.13 举例理解条件概率
-条件概率公式如下:
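+Setting $\mu=0,\sigma=1$ reduces the Gaussian to the standard normal distribution:
+$$
+N(x;0,1) = \sqrt{\frac{1}{2\pi}}exp\left ( -\frac{1}{2}x^2 \right )
+$$
+A parameterization that makes the density cheaper to evaluate uses the precision $\beta$:
+$$
+N(x;\mu,\beta^{-1})=\sqrt{\frac{\beta}{2\pi}}exp\left(-\frac{1}{2}\beta(x-\mu)^2\right)
+$$
+
+
+Here $\beta=\frac{1}{\sigma^2}$, and the parameter $\beta\in(0,\infty)$ controls the precision of the distribution.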
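The 68.3%, 95.5% and 99.73% figures quoted above follow from the normal cumulative distribution function: for any $\mu$ and $\sigma$, $P(|x-\mu|\le k\sigma)=\mathrm{erf}(k/\sqrt{2})$. A short check with the Python standard library (the function name `prob_within` is an assumption for illustration):

```python
import math


def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 2))
```

Because the result depends only on $k$, the same three percentages hold for every normal distribution regardless of its mean and variance.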
+
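+### 1.13.3 When should the normal distribution be used?
+
+Q: When should the normal distribution be used?
+A: When there is no prior knowledge about a distribution over the real numbers and no obvious form to choose, the normal distribution is a reasonable default, for two reasons:
+
+1. The central limit theorem states that the sum of many independent random variables is approximately normally distributed. In practice, many complex systems can be modeled successfully as normally distributed noise, even when the system can be decomposed into parts with more structured behavior.
+2. Among all probability distributions over the real numbers with the same variance, the normal distribution has the greatest uncertainty (maximum entropy). In other words, it is the distribution that builds the least prior knowledge into the model.
+
+Generalization of the normal distribution:
+The normal distribution generalizes to $R^n$, where it is called the **multivariate normal distribution**. Its parameters are a mean vector $\vec\mu$ and a symmetric positive definite matrix $\Sigma$:
+$$
+N(x;\vec\mu,\Sigma)=\sqrt{\frac{1}{(2\pi)^ndet(\Sigma)}}exp\left(-\frac{1}{2}(\vec{x}-\vec{\mu})^T\Sigma^{-1}(\vec{x}-\vec{\mu})\right)
+$$
+To evaluate the multivariate density efficiently, use a precision matrix instead:
+$$
+N(x;\vec{\mu},\vec\beta^{-1}) = \sqrt{\frac{det(\vec\beta)}{(2\pi)^n}}exp\left(-\frac{1}{2}(\vec{x}-\vec\mu)^T\vec\beta(\vec{x}-\vec\mu)\right)
+$$
+Here $\vec\beta$ is the precision matrix.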
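+### 1.13.4 Exponential distribution
+
+In deep learning, the exponential distribution is used when a distribution with a boundary (a sharp point) at $x=0$ is wanted. It is defined as: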
$$
-P(A/B) = P(A\cap B) / P(B)
+p(x;\lambda)=\lambda I_{x\geq 0}exp(-\lambda{x})
$$
+The indicator function $I_{x\geq 0}$ assigns probability zero to all negative values of $x$.
+
+### 1.13.5 Laplace distribution
-说明:在同一个样本空间$\Omega$中的事件或者子集$A$与$B$,如果随机从$\Omega$中选出的一个元素属于$B$,那么下一个随机选择的元素属于$A$ 的概率就定义为在$B$的前提下$A$的条件概率。
+A closely related distribution is the Laplace distribution, which lets us place a sharp peak of probability mass at an arbitrary point $\mu$:
+$$
+Laplace(x;\mu,\gamma)=\frac{1}{2\gamma}exp\left(-\frac{|x-\mu|}{\gamma}\right)
+$$
+
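+### 1.13.6 Dirac distribution and empirical distribution
+
+The Dirac distribution concentrates all of the probability mass at a single point. It is defined through the Dirac $\delta$ function (also called the **unit impulse function**):
+$$
+p(x)=\delta(x-\mu), \quad \delta(x-\mu)=0 \text{ for } x\neq \mu
+$$
+
+$$
+\int_{a}^{b}\delta(x-\mu)dx = 1, a < \mu < b
+$$
+
+The Dirac distribution often appears as a component of the empirical distribution:
+$$
+\hat{p}(\vec{x})=\frac{1}{m}\sum_{i=1}^{m}\delta(\vec{x}-{\vec{x}}^{(i)})
+$$
+where the $m$ points $\vec{x}^{(1)},...,\vec{x}^{(m)}$ form the given data set; the **empirical distribution** puts probability mass $\frac{1}{m}$ on each of these points.
+
+When we train a model on a training set, the empirical distribution defined by that set can be regarded as the distribution we **sample from**.
+
+**Scope**: the Dirac $\delta$ function is suited to defining the empirical distribution of **continuous** random variables.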
+
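+## 1.14 Understanding conditional probability through an example
+The conditional probability formula is:
+
+$$
+P(A|B) = P(A\cap B) / P(B)
+$$
+
+Explanation: let $A$ and $B$ be events (subsets) of the same sample space $\Omega$. If an element drawn at random from $\Omega$ is known to belong to $B$, the probability that it also belongs to $A$ is defined as the conditional probability of $A$ given $B$.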

-根据文氏图,可以很清楚地看到在事件B发生的情况下,事件A发生的概率就是$P(A\bigcap B)$除以$P(B)$。
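+From the Venn diagram it is clear that, given that event $B$ has occurred, the probability of event $A$ is $P(A\bigcap B)$ divided by $P(B)$.
Example: a couple has two children, and at least one of them is known to be a girl. What is the probability that the other child is also a girl? (A common interview and written-test question.)
**Enumeration**: knowing that at least one child is a girl, the sample space is {boy-girl, girl-girl, girl-boy}, so the probability that the other child is also a girl is 1/3.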
-**条件概率法**:$P(女|女)=P(女女)/P(女)$,夫妻有两个小孩,那么它的样本空间为女女,男女,女男,男男,则$P(女女)$为1/4,$P(女)= 1-P(男男)=3/4$,所以最后$1/3$。
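+**Conditional probability**: $P(\text{both girls}|\text{at least one girl})=P(\text{both girls})/P(\text{at least one girl})$. With two children the sample space is {girl-girl, boy-girl, girl-boy, boy-boy}, so $P(\text{both girls})=1/4$ and $P(\text{at least one girl})=1-P(\text{both boys})=3/4$, which gives $\frac{1/4}{3/4}=1/3$.
A common mistake is to treat boy-girl and girl-boy as the same outcome; they are distinct, just as an older sister with a younger brother differs from an older brother with a younger sister.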
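The two-children example can be verified by brute-force enumeration. This is a minimal sketch, assuming each child is independently a boy or a girl with equal probability; the labels `"B"`/`"G"` and the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Ordered outcomes (older child, younger child): BB, BG, GB, GG.
outcomes = list(product("BG", repeat=2))
p_each = Fraction(1, len(outcomes))  # all four outcomes are equally likely

# P(at least one girl) and P(both girls), summed over matching outcomes.
p_at_least_one_girl = sum(p_each for o in outcomes if "G" in o)
p_both_girls = sum(p_each for o in outcomes if o == ("G", "G"))

# Conditional probability P(both girls | at least one girl).
p_conditional = p_both_girls / p_at_least_one_girl
print(p_at_least_one_girl, p_both_girls, p_conditional)
```

Keeping the outcomes ordered is what makes boy-girl and girl-boy count separately, which is exactly the point of the example.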
-## 1.14 联合概率与边缘概率联系区别?
+## 1.15 联合概率与边缘概率联系区别?
**区别:**
联合概率:联合概率指类似于$P(X=a,Y=b)$这样,包含多个条件,且所有条件同时成立的概率。联合概率是指在多元的概率分布中多个随机变量分别满足各自条件的概率。
边缘概率:边缘概率是某个事件发生的概率,而与其它事件无关。边缘概率指类似于$P(X=a)$,$P(Y=b)$这样,仅与单个随机变量有关的概率。
@@ -287,30 +402,29 @@ $$
**联系:**
联合分布可求边缘分布,但若只知道边缘分布,无法求得联合分布。
-## 1.15 条件概率的链式法则
-由条件概率的定义,可直接得出下面的乘法公式:
-乘法公式 设$A, B$是两个事件,并且$P(A) > 0$, 则有
-
+## 1.16 条件概率的链式法则
+由条件概率的定义,可直接得出下面的乘法公式:
+乘法公式 设$A, B$是两个事件,并且$P(A) > 0$, 则有
$$
P(AB) = P(B|A)P(A)
$$
-推广
+推广
$$
P(ABC)=P(C|AB)P(B|A)P(A)
$$
-一般地,用归纳法可证:若$P(A_1A_2...A_n)>0$,则有
+一般地,用归纳法可证:若$P(A_1A_2...A_n)>0$,则有
$$
P(A_1A_2...A_n)=P(A_n|A_1A_2...A_{n-1})P(A_{n-1}|A_1A_2...A_{n-2})...P(A_2|A_1)P(A_1)
=P(A_1)\prod_{i=2}^{n}P(A_i|A_1A_2...A_{i-1})
$$
-任何多维随机变量联合概率分布,都可以分解成只有一个变量的条件概率相乘形式。
+任何多维随机变量联合概率分布,都可以分解成只有一个变量的条件概率相乘形式。
-## 1.16 独立性和条件独立性
+## 1.17 独立性和条件独立性
**独立性**
两个随机变量$x$和$y$,概率分布表示成两个因子乘积形式,一个因子只包含$x$,另一个因子只包含$y$,两个随机变量相互独立(independent)。
条件有时为不独立的事件之间带来独立,有时也会把本来独立的事件,因为此条件的存在,而失去独立性。
@@ -320,7 +434,7 @@ $$
P(X,Y|Z) \not = P(X|Z)P(Y|Z)
$$
-事件独立时,联合概率等于概率的乘积。这是一个非常好的数学性质,然而不幸的是,无条件的独立是十分稀少的,因为大部分情况下,事件之间都是互相影响的。
+事件独立时,联合概率等于概率的乘积。这是一个非常好的数学性质,然而不幸的是,无条件的独立是十分稀少的,因为大部分情况下,事件之间都是互相影响的。
**条件独立性**
给定$Z$的情况下,$X$和$Y$条件独立,当且仅当
@@ -329,7 +443,7 @@ $$
X\bot Y|Z \iff P(X,Y|Z) = P(X|Z)P(Y|Z)
$$
-$X$和$Y$的关系依赖于$Z$,而不是直接产生。
+$X$和$Y$的关系依赖于$Z$,而不是直接产生。
>**举例**定义如下事件:
>$X$:明天下雨;
@@ -337,24 +451,24 @@ $$
>$Z$:今天是否下雨;
>$Z$事件的成立,对$X$和$Y$均有影响,然而,在$Z$事件成立的前提下,今天的地面情况对明天是否下雨没有影响。
-## 1.17 期望、方差、协方差、相关系数总结
+## 1.18 期望、方差、协方差、相关系数总结
**期望**
在概率论和统计学中,数学期望(或均值,亦简称期望)是试验中每次可能结果的概率乘以其结果的总和。它反映随机变量平均取值的大小。
- 线性运算: $E(ax+by+c) = aE(x)+bE(y)+c$
-- 推广形式: $E(\sum_{k=1}^{n}{a_kx_k+c}) = \sum_{k=1}^{n}{a_kE(x_k)+c}$
+- Generalized form: $E(\sum_{k=1}^{n}{a_kx_k}+c) = \sum_{k=1}^{n}{a_kE(x_k)}+c$
- 函数期望:设$f(x)$为$x$的函数,则$f(x)$的期望为
- 离散函数: $E(f(x))=\sum_{k=1}^{n}{f(x_k)P(x_k)}$
- 连续函数: $E(f(x))=\int_{-\infty}^{+\infty}{f(x)p(x)dx}$
> 注意:
>
-> - 函数的期望不等于期望的函数,即$E(f(x))=f(E(x))$
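+> - The expectation of a function is in general not equal to the function of the expectation, i.e. $E(f(x))\neq f(E(x))$. When $f$ is convex, Jensen's inequality gives $E(f(x))\geqslant f(E(x))$.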
> - 一般情况下,乘积的期望不等于期望的乘积。
> - 如果$X$和$Y$相互独立,则$E(xy)=E(x)E(y)$。
**方差**
-概率论中方差用来度量随机变量和其数学期望(即均值)之间的偏离程度。方差是一种特殊的期望。定义为:
+概率论中方差用来度量随机变量和其数学期望(即均值)之间的偏离程度。方差是一种特殊的期望。定义为:
$$
Var(x) = E((x-E(x))^2)
@@ -374,7 +488,7 @@ $$
Cov(x,y)=E((x-E(x))(y-E(y)))
$$
-方差是一种特殊的协方差。当$X=Y$时,$Cov(x,y)=Var(x)=Var(y)$。
+方差是一种特殊的协方差。当$X=Y$时,$Cov(x,y)=Var(x)=Var(y)$。
> 协方差性质:
>
@@ -400,5 +514,17 @@ Corr(x,y) = \frac{Cov(x,y)}{\sqrt{Var(x)Var(y)}}
$$
> 相关系数的性质:
-> 1)有界性。相关系数的取值范围是 ,可以看成无量纲的协方差。
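+> 1) Boundedness: the correlation coefficient takes values in [-1,1] and can be viewed as a dimensionless covariance.
> 2) The closer the value is to 1, the stronger the positive (linear) correlation between the two variables; the closer to -1, the stronger the negative correlation. A value of 0 means the two variables are linearly uncorrelated, which does not by itself imply independence.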
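The definitions in this section can be checked on a small discrete joint distribution. The sketch below is illustrative only: the joint PMF values and the helper name `expect` are assumptions chosen so the arithmetic is easy to follow, and exact fractions are used to avoid rounding.

```python
import math
from fractions import Fraction as F

# Hypothetical joint PMF of two binary variables: P(X=x, Y=y).
joint = {(0, 0): F(2, 5), (0, 1): F(1, 10),
         (1, 0): F(1, 10), (1, 1): F(2, 5)}


def expect(g):
    """E[g(x, y)]: sum of g over all states, weighted by probability."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - ex) ** 2)
var_y = expect(lambda x, y: (y - ey) ** 2)
cov = expect(lambda x, y: (x - ex) * (y - ey))
corr = float(cov) / math.sqrt(float(var_x) * float(var_y))

# Linearity of expectation: E(2x + 3y + 1) = 2E(x) + 3E(y) + 1.
lhs = expect(lambda x, y: 2 * x + 3 * y + 1)
rhs = 2 * ex + 3 * ey + 1

print(ex, ey, var_x, var_y, cov, corr)
print(lhs, rhs)
```

In this example the covariance is positive and the correlation lies strictly between 0 and 1, so $E(xy)\neq E(x)E(y)$ and the two variables are not independent.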
+
+
+
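+## References
+
+[1] Ian Goodfellow, Yoshua Bengio, Aaron... Deep Learning (Chinese edition) [M]. Posts & Telecom Press, 2017.
+
+[2] Zhou Zhihua. Machine Learning [M]. Tsinghua University Press, 2016.
+
+[3] Department of Mathematics, Tongji University. Advanced Mathematics (7th ed.) [M]. Higher Education Press, 2014.
+
+[4] Sheng Zhou, Xie Shiqian, Pan Chengyi, et al. Probability Theory and Mathematical Statistics (4th ed.) [M]. Higher Education Press, 2008.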
\ No newline at end of file
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2-3.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2-3.png"
deleted file mode 100644
index 2916190b..00000000
Binary files "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2-3.png" and /dev/null differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.1/5.jpg" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.1/5.jpg"
index 3629e729..3f93149f 100644
Binary files "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.1/5.jpg" and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.1/5.jpg" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.17-1.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.17-1.png"
new file mode 100644
index 00000000..83915ea4
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.17-1.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.17-2.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.17-2.png"
new file mode 100644
index 00000000..ca5e2c41
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.17-2.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.17-3.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.17-3.png"
new file mode 100644
index 00000000..511492ce
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.17-3.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.18.1.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.18.1.png"
new file mode 100644
index 00000000..23b150ed
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.18.1.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.20.1.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.20.1.png"
new file mode 100644
index 00000000..acc95fb4
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.20.1.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.4.1.jpg" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.4.1.jpg"
new file mode 100644
index 00000000..51a16bcb
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.4.1.jpg" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.4.2.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.4.2.png"
new file mode 100644
index 00000000..d74817f7
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.4.2.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.4.3.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.4.3.png"
new file mode 100644
index 00000000..42e4cee4
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.16.4.3.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.1.1.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.1.1.png"
new file mode 100644
index 00000000..3fd7fd8f
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.1.1.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.5A.jpg" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.5A.jpg"
new file mode 100644
index 00000000..a6c5feb7
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.5A.jpg" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.5B.jpg" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.5B.jpg"
new file mode 100644
index 00000000..ad455db6
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.5B.jpg" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.5C.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.5C.png"
new file mode 100644
index 00000000..5e499e4e
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.19.5C.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.09.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.09.png"
new file mode 100644
index 00000000..94e91922
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.09.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.10.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.10.png"
new file mode 100644
index 00000000..c8b9935b
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.10.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.11.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.11.png"
new file mode 100644
index 00000000..750c2e34
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.11.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.12.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.12.png"
new file mode 100644
index 00000000..9357b0a4
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.12.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.4.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.4.png"
new file mode 100644
index 00000000..e1e166dc
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.4.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.8.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.8.png"
new file mode 100644
index 00000000..234f4d18
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.2.8.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.20.1.jpg" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.20.1.jpg"
new file mode 100644
index 00000000..efb4f6fc
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.20.1.jpg" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.1.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.1.png"
new file mode 100644
index 00000000..b6205fa9
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.1.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.2.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.2.png"
new file mode 100644
index 00000000..ef734236
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.2.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.3.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.3.png"
new file mode 100644
index 00000000..246ec05e
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.3.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.4.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.4.png"
new file mode 100644
index 00000000..70a1d786
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.4.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.5.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.5.png"
new file mode 100644
index 00000000..61bd081e
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.5.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.6.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.6.png"
new file mode 100644
index 00000000..620857bb
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.6.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.6a.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.6a.png"
new file mode 100644
index 00000000..85755d61
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.6a.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.7.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.7.png"
new file mode 100644
index 00000000..40fbf900
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.1.7.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.3.1.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.3.1.png"
new file mode 100644
index 00000000..1b7c8051
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.21.3.1.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/1.jpg" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/1.jpg"
new file mode 100644
index 00000000..51a16bcb
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/1.jpg" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/2.20.1.jpg" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/2.20.1.jpg"
new file mode 100644
index 00000000..05e9ebe2
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/2.20.1.jpg" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/2.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/2.png"
index e5c69de4..d74817f7 100644
Binary files "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/2.png" and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/2.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/3.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/3.png"
index ef3d80a6..42e4cee4 100644
Binary files "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/3.png" and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.40.3/3.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.5.1.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.5.1.png"
new file mode 100644
index 00000000..3eb29bcd
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.5.1.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.7.3.png" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.7.3.png"
new file mode 100644
index 00000000..98257154
Binary files /dev/null and "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/img/ch2/2.7.3.png" differ
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/modify_log.txt" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/modify_log.txt"
index f0784f11..c5e0e582 100644
--- "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/modify_log.txt"
+++ "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/modify_log.txt"
@@ -18,7 +18,3 @@ modify_log---->用来记录修改日志
3. 修改modify内容
4. 修改章节内容,图片路径等
-<----qjhuang-2019-3-15---->
-1. 修改2.4错别字
-2. 修改3错别字
-
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/readme.md" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/readme.md"
index db45ad57..548d000c 100644
--- "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/readme.md"
+++ "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/readme.md"
@@ -1,16 +1,17 @@
###########################################################
-### 深度学习500问-第 * 章 xxx
+### 深度学习500问-第 2 章 xxx
**负责人(排名不分先后):**
xxx研究生-xxx(xxx)
xxx博士生-xxx
xxx-xxx
+刘元德-上海理工大学
**贡献者(排名不分先后):**
内容贡献者可自加信息
-刘彦超-东南大学
-
-###########################################################
\ No newline at end of file
+刘彦超-东南大学
+刘元德-上海理工大学(内容修订)
+###########################################################
diff --git "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/\347\254\254\344\272\214\347\253\240_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200.md" "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/\347\254\254\344\272\214\347\253\240_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200.md"
index 918c9e80..b4aabb60 100644
--- "a/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/\347\254\254\344\272\214\347\253\240_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200.md"
+++ "b/ch02_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/\347\254\254\344\272\214\347\253\240_\346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200.md"
@@ -1,41 +1,38 @@
-
[TOC]
# 第二章 机器学习基础
-## 2.1 各种常见算法图示
+## 2.1 大话理解机器学习本质
-|回归算法|基于实例的算法|正则化方法|
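+	Machine learning (ML), as the name suggests, means letting a machine learn. Here the machine is the computer, the physical carrier on which algorithms run; you can also regard an algorithm itself as a machine with inputs and outputs. What exactly should the computer learn? Given a task and a way to measure performance on it, we design an algorithm that extracts the regularities contained in the data; this is machine learning. If the data fed to the machine is labeled, it is called supervised learning; if the data is unlabeled, it is unsupervised learning.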
+
+## 2.2 各种常见算法图示
+
+|回归算法|聚类算法|正则化方法|
|:-:|:-:|:-:|
||||
|决策树学习|贝叶斯方法|基于核的算法|
|:-:|:-:|:-:|
-||||
+||||
|聚类算法|关联规则学习|人工神经网络|
|:-:|:-:|:-:|
-||||
+||||
|深度学习|降低维度算法|集成算法|
|:-:|:-:|:-:|
-||||
+||||
-## 2.2 监督学习、非监督学习、半监督学习、弱监督学习?
-根据数据类型的不同,对一个问题的建模有不同的方式。依据不同的学习方式和输入数据,机器学习主要分为以下四种学习方式。
+## 2.3 监督学习、非监督学习、半监督学习、弱监督学习?
+ 根据数据类型的不同,对一个问题的建模有不同的方式。依据不同的学习方式和输入数据,机器学习主要分为以下四种学习方式。
**监督学习**:
-1. 监督学习是使用已知正确答案的示例来训练网络。已知数据和其一一对应的标签,训练一个智能算法,将输入数据映射到标签的过程。
+1. 监督学习是使用已知正确答案的示例来训练网络。已知数据和其一一对应的标签,训练一个预测模型,将输入数据映射到标签的过程。
2. 监督式学习的常见应用场景如分类问题和回归问题。
-3. 常见算法有逻辑回归(Logistic Regression)和反向传递神经网络(Back Propagation Neural Network)
+3. Common supervised machine learning algorithms include Support Vector Machine (SVM), Naive Bayes, Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, AdaBoost, and Linear Discriminant Analysis (LDA). Most deep learning is also carried out in a supervised manner.
**非监督式学习**:
1. 在非监督式学习中,数据并不被特别标识,适用于你具有数据集但无标签的情况。学习模型是为了推断出数据的一些内在结构。
@@ -51,170 +48,206 @@
1. 弱监督学习可以看做是有多个标记的数据集合,次集合可以是空集,单个元素,或包含多种情况(没有标记,有一个标记,和有多个标记)的多个元素。
2. 数据集的标签是不可靠的,这里的不可靠可以是标记不正确,多种标记,标记不充分,局部标记等。
3. 已知数据和其一一对应的弱标签,训练一个智能算法,将输入数据映射到一组更强的标签的过程。标签的强弱指的是标签蕴含的信息量的多少,比如相对于分割的标签来说,分类的标签就是弱标签。
-4. 举例,告诉一张包含气球的图片,需要得出气球在图片中的位置及气球和背景的分割线,这就是已知弱标签学习强标签的问题。
+4. 举例,给出一张包含气球的图片,需要得出气球在图片中的位置及气球和背景的分割线,这就是已知弱标签学习强标签的问题。
-在企业数据应用的场景下, 人们最常用的可能就是监督式学习和非监督式学习的模型。 在图像识别等领域,由于存在大量的非标识的数据和少量的可标识数据, 目前半监督式学习是一个很热的话题。
+ 在企业数据应用的场景下, 人们最常用的可能就是监督式学习和非监督式学习的模型。 在图像识别等领域,由于存在大量的非标识的数据和少量的可标识数据, 目前半监督式学习是一个很热的话题。
+
+## 2.4 监督学习有哪些步骤
+ 监督学习是使用已知正确答案的示例来训练网络,每组训练数据有一个明确的标识或结果。想象一下,我们可以训练一个网络,让其从照片库中(其中包含气球的照片)识别出气球的照片。以下就是我们在这个假设场景中所要采取的步骤。
-## 2.3 监督学习有哪些步骤
-**监督式学习**:
-监督学习是使用已知正确答案的示例来训练网络。每组训练数据有一个明确的标识或结果,想象一下,我们可以训练一个网络,让其从照片库中(其中包含气球的照片)识别出气球的照片。以下就是我们在这个假设场景中所要采取的步骤。
**步骤1:数据集的创建和分类**
-首先,浏览你的照片(数据集),确定所有包含气球的照片,并对其进行标注。然后,将所有照片分为训练集和验证集。目标就是在深度网络中找一函数,这个函数输入是任意一张照片,当照片中包含气球时,输出1,否则输出0。
-**步骤2:训练**
-选择合适的模型,模型可通过以下激活函数对每张照片进行预测。既然我们已经知道哪些是包含气球的图片,那么我们就可以告诉模型它的预测是对还是错。然后我们会将这些信息反馈(feed back)给网络。
-该算法使用的这种反馈,就是一个量化“真实答案与模型预测有多少偏差”的函数的结果。这个函数被称为成本函数(cost function),也称为目标函数(objective function),效用函数(utility function)或适应度函数(fitness function)。然后,该函数的结果用于修改一个称为反向传播(backpropagation)过程中节点之间的连接强度和偏差。
-我们会为每个图片都重复一遍此操作,而在每种情况下,算法都在尽量最小化成本函数。
-其实,我们有多种数学技术可以用来验证这个模型是正确还是错误的,但我们常用的是一个非常常见的方法,我们称之为梯度下降(gradient descent)。
-**步骤3:验证**
-当处理完训练集所有照片,接着要去测试该模型。利用验证集来来验证训练有素的模型是否可以准确地挑选出含有气球在内的照片。
-在此过程中,通常会通过调整和模型相关的各种事物(超参数)来重复步骤2和3,诸如里面有多少个节点,有多少层,哪些数学函数用于决定节点是否亮起,如何在反向传播阶段积极有效地训练权值等等。
-**步骤4:测试及应用**
-当有了一个准确的模型,就可以将该模型部署到你的应用程序中。你可以将模型定义为API调用,并且你可以从软件中调用该方法,从而进行推理并给出相应的结果。
-
-## 2.4 多实例学习?
-多实例学习(multiple instance learning) :已知包含多个数据的数据包和数据包的标签,训练智能算法,将数据包映射到标签的过程,在有的问题中也同时给出包内每个数据的标签。
-比如说一段视频由很多张图组成,假如10000张,那么我们要判断视频里是否包含某一物体,比如气球。单张标注每一帧是否有气球太耗时,通常人们看一遍说这个视频里是否有气球,就得到了多实例学习的数据。10000帧的数据不是每一个都有气球出现,只要有一帧有气球,那么我们就认为这个数据包是有气球的。只有当所有的视频帧都没有气球,才是没有气球的。从这里面学习哪一段视频(10000张)是否有气球出现就是多实例学习的问题。
-
-## 2.5 分类网络和回归的区别?
-2.3小节介绍了包含气球照片的数据集整理。当照片中包含气球时,输出1,否则输出0。此步骤通常称为分类任务(categorization task)。在这种情况下,我们进行的通常是一个结果为yes or no的训练。
-但事实上,监督学习也可以用于输出一组值,而不仅仅是0或1。例如,我们可以训练一个网络,用它来输出一张图片上有气球的概率,那么在这种情况下,输出值就是0到1之间的任意值。这些任务我们称之为回归。
+ 首先,浏览你的照片(数据集),确定所有包含气球的照片,并对其进行标注。然后,将所有照片分为训练集和验证集。目标就是在深度网络中找一函数,这个函数输入是任意一张照片,当照片中包含气球时,输出1,否则输出0。
+
+**步骤2:数据增强(Data Augmentation)**
+ 当原始数据搜集和标注完毕,一般搜集的数据并不一定包含目标在各种扰动下的信息。数据的好坏对于机器学习模型的预测能力至关重要,因此一般会进行数据增强。对于图像数据来说,数据增强一般包括,图像旋转,平移,颜色变换,裁剪,仿射变换等。
+
+**步骤3:特征工程(Feature Engineering)**
+ 一般来讲,特征工程包含特征提取和特征选择。常见的手工特征(Hand-Crafted Feature)有尺度不变特征变换(Scale-Invariant Feature Transform, SIFT),方向梯度直方图(Histogram of Oriented Gradient, HOG)等。由于手工特征是启发式的,其算法设计背后的出发点不同,将这些特征组合在一起的时候有可能会产生冲突,如何将组合特征的效能发挥出来,使原始数据在特征空间中的判别性最大化,就需要用到特征选择的方法。在深度学习方法大获成功之后,人们很大一部分不再关注特征工程本身。因为,最常用到的卷积神经网络(Convolutional Neural Networks, CNNs)本身就是一种特征提取和选择的引擎。研究者提出的不同的网络结构、正则化、归一化方法实际上就是深度学习背景下的特征工程。
+
+**步骤4:构建预测模型和损失**
+ 将原始数据映射到特征空间之后,也就意味着我们得到了比较合理的输入。下一步就是构建合适的预测模型得到对应输入的输出。而如何保证模型的输出和输入标签的一致性,就需要构建模型预测和标签之间的损失函数,常见的损失函数(Loss Function)有交叉熵、均方差等。通过优化方法不断迭代,使模型从最初的初始化状态一步步变化为有预测能力的模型的过程,实际上就是学习的过程。
+
+**步骤5:训练**
+ 选择合适的模型和超参数进行初始化,其中超参数比如支持向量机中核函数、误差项惩罚权重等。当模型初始化参数设定好后,将制作好的特征数据输入到模型,通过合适的优化方法不断缩小输出与标签之间的差距,当迭代过程到了截止条件,就可以得到训练好的模型。优化方法最常见的就是梯度下降法及其变种,使用梯度下降法的前提是优化目标函数对于模型是可导的。
+
+**步骤6:验证和模型选择**
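+	After training on the training-set images, the model needs to be evaluated. Use the validation set to check whether the model can accurately pick out the photos that contain balloons.
+	In this process, steps 5 and 6 are usually repeated while adjusting various things related to the model (the hyperparameters), such as how many nodes and layers it has, which activation and loss functions are used, and how to train the weights effectively during backpropagation.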
+
+**步骤7:测试及应用**
+ 当有了一个准确的模型,就可以将该模型部署到你的应用程序中。你可以将预测功能发布为API(Application Programming Interface, 应用程序编程接口)调用,并且你可以从软件中调用该API,从而进行推理并给出相应的结果。
+
+## 2.5 多实例学习?
+ 多实例学习(Multiple Instance Learning, MIL) :已知包含多个数据的数据包和数据包的标签,训练智能算法,将数据包映射到标签的过程,在有的问题中也同时给出包内每个数据的标签。
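+	For example, a video consists of many frames, say 10000, and we want to decide whether the video contains a certain object, such as a balloon. Labeling every single frame is too time-consuming, so usually a person watches the video once and states whether a balloon appears in it, which yields multiple-instance learning data. Not every one of the 10000 frames contains a balloon: as long as one frame does, the bag is considered positive, and only when no frame contains a balloon is the bag negative. Learning from such data whether a given video (10000 frames) contains a balloon is a multiple-instance learning problem.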
## 2.6 什么是神经网络?
-神经网络就是按照一定规则将多个神经元连接起来的网络。不同的神经网络,具有不同的连接规则。
-例如全连接(full connected, FC)神经网络,它的规则包括:
+ 神经网络就是按照一定规则将多个神经元连接起来的网络。不同的神经网络,具有不同的连接规则。
+For example, a fully connected (FC) neural network follows these rules:
+
1. 有三种层:输入层,输出层,隐藏层。
2. 同一层的神经元之间没有连接。
-3. full connected的含义:第 N 层的每个神经元和第 N-1 层的所有神经元相连,第 N-1 层神经元的输出就是第 N 层神经元的输入。
+3. fully connected的含义:第 N 层的每个神经元和第 N-1 层的所有神经元相连,第 N-1 层神经元的输出就是第 N 层神经元的输入。
4. 每个连接都有一个权值。
-**神经网络架构**
-下面这张图就是一个神经网络系统,它由很多层组成。输入层负责接收信息,比如一只猫的图片。输出层是计算机对这个输入信息的判断结果,它是不是猫。隐藏层就是对输入信息的传递和加工处理。
-
+
+ **神经网络架构**
+ 下面这张图就是一个神经网络系统,它由很多层组成。输入层负责接收信息,比如一只猫的图片。输出层是计算机对这个输入信息的判断结果,它是不是猫。隐藏层就是对输入信息的传递和加工处理。
+ 
## 2.7 理解局部最优与全局最优
-笑谈局部最优和全局最优
+ 笑谈局部最优和全局最优
-> 柏拉图有一天问老师苏格拉底什么是爱情?苏格拉底叫他到麦田走一次,摘一颗最大的麦穗回来,不许回头,只可摘一次。柏拉图空着手出来了,他的理由是,看见不错的,却不知道是不是最好的,一次次侥幸,走到尽头时,才发现还不如前面的,于是放弃。苏格拉底告诉他:“这就是爱情。”这故事让我们明白了一个道理,因为生命的一些不确定性,所以全局最优解是很难寻找到的,或者说根本就不存在,我们应该设置一些限定条件,然后在这个范围内寻找最优解,也就是局部最优解——有所斩获总比空手而归强,哪怕这种斩获只是一次有趣的经历。
-> 柏拉图有一天又问什么是婚姻?苏格拉底叫他到彬树林走一次,选一棵最好的树做圣诞树,也是不许回头,只许选一次。这次他一身疲惫地拖了一棵看起来直挺、翠绿,却有点稀疏的杉树回来,他的理由是,有了上回的教训,好不容易看见一棵看似不错的,又发现时间、体力已经快不够用了,也不管是不是最好的,就拿回来了。苏格拉底告诉他:“这就是婚姻。
+> 柏拉图有一天问老师苏格拉底什么是爱情?苏格拉底叫他到麦田走一次,摘一颗最大的麦穗回来,不许回头,只可摘一次。柏拉图空着手出来了,他的理由是,看见不错的,却不知道是不是最好的,一次次侥幸,走到尽头时,才发现还不如前面的,于是放弃。苏格拉底告诉他:“这就是爱情。”这故事让我们明白了一个道理,因为生命的一些不确定性,所以全局最优解是很难寻找到的,或者说根本就不存在,我们应该设置一些限定条件,然后在这个范围内寻找最优解,也就是局部最优解——有所斩获总比空手而归强,哪怕这种斩获只是一次有趣的经历。
+> 柏拉图有一天又问什么是婚姻?苏格拉底叫他到树林走一次,选一棵最好的树做圣诞树,也是不许回头,只许选一次。这次他一身疲惫地拖了一棵看起来直挺、翠绿,却有点稀疏的杉树回来,他的理由是,有了上回的教训,好不容易看见一棵看似不错的,又发现时间、体力已经快不够用了,也不管是不是最好的,就拿回来了。苏格拉底告诉他:“这就是婚姻。”
-优化问题一般分为局部最优和全局最优。
+ 优化问题一般分为局部最优和全局最优。
1. 局部最优,就是在函数值空间的一个有限区域内寻找最小值;而全局最优,是在函数值空间整个区域寻找最小值问题。
-2. 函数局部最小点是那种它的函数值小于或等于附近点的点。但是有可能大于较远距离的点。
+2. 函数局部最小点是它的函数值小于或等于附近点的点。但是有可能大于较远距离的点。
3. 全局最小点是那种它的函数值小于或等于所有的可行点。
## 2.8 分类算法
+ 分类算法和回归算法是对真实世界不同建模的方法。分类模型是认为模型的输出是离散的,例如大自然的生物被划分为不同的种类,是离散的。回归模型的输出是连续的,例如人的身高变化过程是一个连续过程,而不是离散的。
+
+ 因此,在实际建模过程时,采用分类模型还是回归模型,取决于你对任务(真实世界)的分析和理解。
+
### 2.8.1 常用分类算法的优缺点?
|算法|优点|缺点|
|:-|:-|:-|
-|Bayes 贝叶斯分类法|1)所需估计的参数少,对于缺失数据不敏感。2)有着坚实的数学基础,以及稳定的分类效率。|1)假设属性之间相互独立,这往往并不成立。(喜欢吃番茄、鸡蛋,却不喜欢吃番茄炒蛋)。2)需要知道先验概率。3)分类决策存在错误率。|
-|Decision Tree决策树|1)不需要任何领域知识或参数假设。2)适合高维数据。3)简单易于理解。4)短时间内处理大量数据,得到可行且效果较好的结果。5)能够同时处理数据型和常规性属性。|1)对于各类别样本数量不一致数据,信息增益偏向于那些具有更多数值的特征。2)易于过拟合。3)忽略属性之间的相关性。4)不支持在线学习。|
-|SVM支持向量机|1)可以解决小样本下机器学习的问题。2)提高泛化性能。3)可以解决高维、非线性问题。超高维文本分类仍受欢迎。4)避免神经网络结构选择和局部极小的问题。|1)对缺失数据敏感。2)内存消耗大,难以解释。3)运行和调差略烦人。|
-|KNN K近邻|1)思想简单,理论成熟,既可以用来做分类也可以用来做回归; 2)可用于非线性分类; 3)训练时间复杂度为O(n); 4)准确度高,对数据没有假设,对outlier不敏感;|1)计算量太大2)对于样本分类不均衡的问题,会产生误判。3)需要大量的内存。4)输出的可解释性不强。|
-|Logistic Regression逻辑回归|1)速度快。2)简单易于理解,直接看到各个特征的权重。3)能容易地更新模型吸收新的数据。4)如果想要一个概率框架,动态调整分类阀值。|特征处理复杂。需要归一化和较多的特征工程。|
-|Neural Network 神经网络|1)分类准确率高。2)并行处理能力强。3)分布式存储和学习能力强。4)鲁棒性较强,不易受噪声影响。|1)需要大量参数(网络拓扑、阀值、阈值)。2)结果难以解释。3)训练时间过长。|
-|Adaboosting|1)adaboost是一种有很高精度的分类器。2)可以使用各种方法构建子分类器,Adaboost算法提供的是框架。3)当使用简单分类器时,计算出的结果是可以理解的。而且弱分类器构造极其简单。4)简单,不用做特征筛选。5)不用担心overfitting。|对outlier比较敏感|
-
-### 2.8.2 正确率能很好的评估分类算法吗?
-不同算法有不同特点,在不同数据集上有不同的表现效果,根据特定的任务选择不同的算法。如何评价分类算法的好坏,要做具体任务具体分析。对于决策树,主要用正确率去评估,但是其他算法,只用正确率能很好的评估吗?
-答案是否定的。
-正确率确实是一个很直观很好的评价指标,但是有时候正确率高并不能完全代表一个算法就好。比如对某个地区进行地震预测,地震分类属性分为0:不发生地震、1发生地震。我们都知道,不发生的概率是极大的,对于分类器而言,如果分类器不加思考,对每一个测试样例的类别都划分为0,达到99%的正确率,但是,问题来了,如果真的发生地震时,这个分类器毫无察觉,那带来的后果将是巨大的。很显然,99%正确率的分类器并不是我们想要的。出现这种现象的原因主要是数据分布不均衡,类别为1的数据太少,错分了类别1但达到了很高的正确率缺忽视了研究者本身最为关注的情况。
-
-### 2.8.3 分类算法的评估方法?
-1. **几个常用的术语**
-这里首先介绍几个*常见*的 模型评价术语,现在假设我们的分类目标只有两类,计为正例(positive)和负例(negative)分别是:
- 1) True positives(TP): 被正确地划分为正例的个数,即实际为正例且被分类器划分为正例的实例数(样本数);
- 2) False positives(FP): 被错误地划分为正例的个数,即实际为负例但被分类器划分为正例的实例数;
- 3) False negatives(FN):被错误地划分为负例的个数,即实际为正例但被分类器划分为负例的实例数;
- 4) True negatives(TN): 被正确地划分为负例的个数,即实际为负例且被分类器划分为负例的实例数。
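+|Bayes (Bayesian classification)|1) Few parameters to estimate; insensitive to missing data. 2) Solid mathematical foundation and stable classification efficiency.|1) Assumes the attributes are mutually independent, which often does not hold (someone may like tomatoes and eggs but dislike scrambled eggs with tomatoes). 2) Requires prior probabilities. 3) The classification decision has an error rate.|
+|Decision Tree|1) Requires no domain knowledge or parameter assumptions. 2) Suitable for high-dimensional data. 3) Simple and easy to understand. 4) Processes large amounts of data in a short time and gives feasible, reasonably good results. 5) Handles both numerical and categorical attributes.|1) For data with unequal numbers of samples per class, information gain is biased toward features with more distinct values. 2) Prone to overfitting. 3) Ignores correlations between attributes. 4) Does not support online learning.|
+|SVM (Support Vector Machine)|1) Can solve machine learning problems with small samples. 2) Improves generalization performance. 3) Handles high-dimensional and nonlinear problems; still popular for very high-dimensional text classification. 4) Avoids the network-structure selection and local-minimum problems of neural networks.|1) Sensitive to missing data. 2) High memory consumption; hard to interpret. 3) Running it and tuning its parameters is somewhat tedious.|
+|KNN (K-Nearest Neighbors)|1) Simple idea and mature theory; usable for both classification and regression. 2) Usable for nonlinear classification. 3) Training time complexity is O(n). 4) High accuracy, no assumptions about the data, insensitive to outliers.|1) Computationally expensive. 2) Misclassifies when the classes are imbalanced. 3) Requires a large amount of memory. 4) Output is not very interpretable.|
+|Logistic Regression|1) Fast. 2) Simple and easy to understand; the weight of each feature is directly visible. 3) The model is easy to update with new data. 4) Provides a probabilistic framework, so the classification threshold can be adjusted dynamically.|Feature processing is complex; requires normalization and considerable feature engineering.|
+|Neural Network|1) High classification accuracy. 2) Strong parallel processing capability. 3) Strong distributed storage and learning capability. 4) Fairly robust and not easily affected by noise.|1) Requires many parameters (network topology, weights, thresholds). 2) Results are hard to interpret. 3) Training takes a long time.|
+|AdaBoost|1) A classifier with very high accuracy. 2) Sub-classifiers can be built by various methods; AdaBoost provides the framework. 3) With simple base classifiers the result is interpretable, and weak classifiers are very easy to construct. 4) Simple; no feature selection needed. 5) Less prone to overfitting in many practical settings.|Sensitive to outliers.|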
+
+
+
+### 2.8.2 分类算法的评估方法?
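+- **Common terms**
+		We first introduce several common model-evaluation terms. Assume the classification target has only two classes, denoted positive and negative:
+			1) True positives (TP): the number of instances correctly classified as positive, i.e. instances that are actually positive and are classified as positive;
+			2) False positives (FP): the number of instances wrongly classified as positive, i.e. instances that are actually negative but are classified as positive;
+			3) False negatives (FN): the number of instances wrongly classified as negative, i.e. instances that are actually positive but are classified as negative;
+			4) True negatives (TN): the number of instances correctly classified as negative, i.e. instances that are actually negative and are classified as negative.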

-上图是这四个术语的混淆矩阵。
-1)P=TP+FN表示实际为正例的样本个数。
-2)True、False描述的是分类器是否判断正确。
-3)Positive、Negative是分类器的分类结果,如果正例计为1、负例计为-1,即positive=1、negative=-1。用1表示True,-1表示False,那么实际的类标=TF\*PN,TF为true或false,PN为positive或negative。
-4)例如True positives(TP)的实际类标=1\*1=1为正例,False positives(FP)的实际类标=(-1)\*1=-1为负例,False negatives(FN)的实际类标=(-1)\*(-1)=1为正例,True negatives(TN)的实际类标=1\*(-1)=-1为负例。
-
-2. **评价指标**
- 1) 正确率(accuracy)
- 正确率是我们最常见的评价指标,accuracy = (TP+TN)/(P+N),正确率是被分对的样本数在所有样本数中的占比,通常来说,正确率越高,分类器越好。
- 2) 错误率(error rate)
- 错误率则与正确率相反,描述被分类器错分的比例,error rate = (FP+FN)/(P+N),对某一个实例来说,分对与分错是互斥事件,所以accuracy =1 - error rate。
- 3) 灵敏度(sensitive)
- sensitive = TP/P,表示的是所有正例中被分对的比例,衡量了分类器对正例的识别能力。
- 4) 特效度(specificity)
- specificity = TN/N,表示的是所有负例中被分对的比例,衡量了分类器对负例的识别能力。
- 5) 精度(precision)
- 精度是精确性的度量,表示被分为正例的示例中实际为正例的比例,precision=TP/(TP+FP)。
- 6) 召回率(recall)
- 召回率是覆盖面的度量,度量有多个正例被分为正例,recall=TP/(TP+FN)=TP/P=sensitive,可以看到召回率与灵敏度是一样的。
- 7) 其他评价指标
- 计算速度:分类器训练和预测需要的时间;
- 鲁棒性:处理缺失值和异常值的能力;
- 可扩展性:处理大数据集的能力;
- 可解释性:分类器的预测标准的可理解性,像决策树产生的规则就是很容易理解的,而神经网络的一堆参数就不好理解,我们只好把它看成一个黑盒子。
- 8) 查准率和查全率反映了分类器分类性能的两个方面。如果综合考虑查准率与查全率,可以得到新的评价指标F1测试值,也称为综合分类率:$F1=\frac{2 \times precision \times recall}{precision + recall}$
- 为了综合多个类别的分类情况,评测系统整体性能,经常采用的还有微平均F1(micro-averaging)和宏平均F1(macro-averaging )两种指标。宏平均F1与微平均F1是以两种不同的平均方式求的全局的F1指标。其中宏平均F1的计算方法先对每个类别单独计算F1值,再取这些F1值的算术平均值作为全局指标。而微平均F1的计算方法是先累加计算各个类别的a、b、c、d的值,再由这些值求出F1值。由两种平均F1的计算方式不难看出,宏平均F1平等对待每一个类别,所以它的值主要受到稀有类别的影响,而微平均F1平等考虑文档集中的每一个文档,所以它的值受到常见类别的影响比较大。
- **ROC曲线和PR曲线**
-
-References
-[1] 李航. 统计学习方法[M]. 北京:清华大学出版社,2012.
+上图是这四个术语的混淆矩阵,做以下说明:
+ 1)P=TP+FN表示实际为正例的样本个数。
+ 2)True、False描述的是分类器是否判断正确。
+ 3)Positive、Negative是分类器的分类结果,如果正例计为1、负例计为-1,即positive=1、negative=-1。用1表示True,-1表示False,那么实际的类标=TF\*PN,TF为true或false,PN为positive或negative。
+ 4)例如True positives(TP)的实际类标=1\*1=1为正例,False positives(FP)的实际类标=(-1)\*1=-1为负例,False negatives(FN)的实际类标=(-1)\*(-1)=1为正例,True negatives(TN)的实际类标=1\*(-1)=-1为负例。
+
+
+
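+- **Evaluation metrics**
+		1) Accuracy
+		Accuracy is the most common metric: accuracy = (TP+TN)/(P+N), where N = FP+TN is the number of samples that are actually negative. It is the fraction of all samples that are classified correctly; in general, the higher the accuracy, the better the classifier.
+		2) Error rate
+		The error rate is the opposite of accuracy and describes the fraction misclassified: error rate = (FP+FN)/(P+N). For a given instance, being classified correctly and being classified incorrectly are mutually exclusive, so accuracy = 1 - error rate.
+		3) Sensitivity
+		sensitivity = TP/P is the fraction of all positives that are classified correctly; it measures the classifier's ability to recognize positives.
+		4) Specificity
+		specificity = TN/N is the fraction of all negatives that are classified correctly; it measures the classifier's ability to recognize negatives.
+		5) Precision
+		precision = TP/(TP+FP). Precision measures exactness: the fraction of instances classified as positive that are actually positive.
+		6) Recall
+		Recall measures coverage, that is, how many of the positives are classified as positive: recall = TP/(TP+FN) = TP/P = sensitivity, so recall and sensitivity are the same quantity.
+		7) Other metrics
+		Computation speed: the time the classifier needs for training and prediction;
+		Robustness: the ability to handle missing values and outliers;
+		Scalability: the ability to handle large datasets;
+		Interpretability: how understandable the classifier's decision criteria are. The rules produced by a decision tree are easy to understand, whereas the many parameters of a neural network are not, so we have to treat it as a black box.
+		8) Precision and recall reflect two aspects of classification performance. Considering them together gives a new metric, the F1-score, also called the comprehensive classification rate: $F1=\frac{2 \times precision \times recall}{precision + recall}$.
+
+	To combine the results over multiple classes and assess overall system performance, two further metrics are often used: micro-averaged F1 (micro-averaging) and macro-averaged F1 (macro-averaging).
+
+	 (1) Macro-averaged F1 and micro-averaged F1 are global F1 scores obtained by two different ways of averaging.
+
+	 (2) Macro-averaged F1 first computes the F1 value of each class separately and then takes the arithmetic mean of these values as the global score.
+
+	 (3) Micro-averaged F1 first sums the TP, FP, FN, and TN counts over all classes and then computes F1 from the totals.
+
+	 (4) From the two definitions it follows that macro-averaged F1 treats every class equally, so its value is influenced mainly by rare classes, while micro-averaged F1 treats every sample (document) in the collection equally, so its value is influenced more by common classes.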
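+
+As a worked illustration of the formulas above (the counts are hypothetical, chosen only for this example), take a binary classifier with $TP=8$, $FP=2$, $FN=4$:
+$$
+precision=\frac{8}{8+2}=0.8,\quad recall=\frac{8}{8+4}\approx 0.667,\quad F1=\frac{2\times 0.8\times 0.667}{0.8+0.667}\approx 0.727
+$$
+For the two averages, assume two classes with hypothetical counts: class A has $TP=90$, $FP=10$, $FN=10$ and class B has $TP=1$, $FP=1$, $FN=3$. Then $F1_A=0.9$ and $F1_B\approx 0.333$. Using the equivalent form $F1=\frac{2TP}{2TP+FP+FN}$ on the summed counts for the micro average:
+$$
+F1_{macro}=\frac{0.9+0.333}{2}\approx 0.617,\quad F1_{micro}=\frac{2\times 91}{2\times 91+11+13}\approx 0.883
+$$
+The rare class B pulls the macro average down sharply, while the micro average stays close to the score of the common class A.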
+
+
+
+- **ROC曲线和PR曲线**
+
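+	ROC is short for Receiver Operating Characteristic. The ROC curve is a performance-evaluation curve plotted with sensitivity (true positive rate) on the vertical axis and 1 minus specificity (false positive rate) on the horizontal axis. The ROC curves of different models on the same dataset can be drawn in the same Cartesian coordinate system; the closer a curve is to the upper-left corner, the more reliable the corresponding model. A model can also be evaluated by the area under its ROC curve (Area Under Curve, AUC): the larger the AUC, the more reliable the model.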
+
+
+
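+	Figure 2.8.2.1 ROC curve
+
+	PR curve is short for Precision-Recall curve. It describes the relationship between precision and recall, with recall on the horizontal axis and precision on the vertical axis. The area under this curve corresponds to Average Precision (AP), an evaluation metric commonly used in object detection. The higher the AP, the better the model performs.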
+
+### 2.8.3 正确率能很好的评估分类算法吗?
+
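+	Different algorithms have different characteristics and perform differently on different datasets, so the algorithm should be chosen for the specific task. How to judge a classification algorithm likewise depends on the task. Decision trees are mainly evaluated by accuracy, but is accuracy alone a good evaluation for other algorithms?
+	The answer is no.
+	Accuracy is indeed an intuitive and useful metric, but high accuracy does not always mean an algorithm is good. Consider earthquake prediction for some region, with class 0 meaning no earthquake and class 1 meaning an earthquake. The probability of no earthquake is overwhelmingly large, so a classifier that blindly assigns class 0 to every test sample can reach 99% accuracy. The problem is that when an earthquake really occurs, this classifier does not notice it at all, and the consequences can be enormous. Clearly a classifier with 99% accuracy of this kind is not what we want. The main cause is class imbalance: class 1 has too few samples, so the classifier misclassifies class 1 entirely yet still achieves high accuracy, while ignoring exactly the case the researcher cares about most.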
### 2.8.4 什么样的分类器是最好的?
-对某一个任务,某个具体的分类器不可能同时满足或提高所有上面介绍的指标。
-如果一个分类器能正确分对所有的实例,那么各项指标都已经达到最优,但这样的分类器往往不存在。比如之前说的地震预测,既然不能百分百预测地震的发生,但实际情况中能容忍一定程度的误报。假设在1000次预测中,共有5次预测发生了地震,真实情况中有一次发生了地震,其他4次则为误报。正确率由原来的999/1000=99.9下降为996/10000=99.6。召回率由0/1=0%上升为1/1=100%。对此解释为,虽然预测失误了4次,但真的地震发生前,分类器能预测对,没有错过,这样的分类器实际意义更为重大,正是我们想要的。在这种情况下,在一定正确率前提下,要求分类器的召回率尽量高。
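+	For a given task, no particular classifier can satisfy or improve all of the metrics above at once.
+	If a classifier classified every instance correctly, every metric would be optimal, but such a classifier rarely exists. Take the earthquake example again: although earthquakes cannot be predicted with 100% certainty, a certain level of false alarms is tolerable in practice. Suppose that in 1000 predictions the classifier predicts an earthquake 5 times, of which one really occurs and the other 4 are false alarms. Accuracy drops from 999/1000 = 99.9% to 996/1000 = 99.6%, while recall rises from 0/1 = 0% to 1/1 = 100%. In other words, although 4 predictions were wrong, the classifier did not miss the real earthquake, which makes it far more useful in practice and exactly what we want. In such cases we require recall to be as high as possible, subject to an acceptable accuracy.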
## 2.9 逻辑回归
-### 2.9.1 理解逻辑回归
+### 2.9.1 回归划分
-**回归划分**:
广义线性模型家族里,依据因变量不同,可以有如下划分:
-1. 如果是连续的,就是多重线性回归;
-2. 如果是二项分布,就是Logistic回归;
-3. 如果是Poisson分布,就是Poisson回归;
+
+1. 如果是连续的,就是多重线性回归。
+2. 如果是二项分布,就是逻辑回归。
+3. 如果是泊松(Poisson)分布,就是泊松回归。
4. 如果是负二项分布,就是负二项回归。
-Logistic回归的因变量可以是二分类的,也可以是多分类的,但是二分类的更为常用,也更加容易解释。所以实际中最常用的就是二分类的Logistic回归。
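+
+The dependent variable in logistic regression can be binary or multiclass, but the binary case is more common and easier to interpret, so binary logistic regression is the one most used in practice.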
+
+### 2.9.2 逻辑回归适用性
-**Logistic回归的适用性**:
-1. 用于概率预测。用于可能性预测时,得到的结果有可比性。比如根据模型进而预测在不同的自变量情况下,发生某病或某种情况的概率有多大;
+1. 用于概率预测。用于可能性预测时,得到的结果有可比性。比如根据模型进而预测在不同的自变量情况下,发生某病或某种情况的概率有多大。
2. 用于分类。实际上跟预测有些类似,也是根据模型,判断某人属于某病或属于某种情况的概率有多大,也就是看一下这个人有多大的可能性是属于某病。进行分类时,仅需要设定一个阈值即可,可能性高于阈值是一类,低于阈值是另一类。
3. 寻找危险因素。寻找某一疾病的危险因素等。
4. 仅能用于线性问题。只有当目标和特征是线性关系时,才能用逻辑回归。在应用逻辑回归时注意两点:一是当知道模型是非线性时,不适用逻辑回归;二是当使用逻辑回归时,应注意选择和目标为线性关系的特征。
5. 各特征之间不需要满足条件独立假设,但各个特征的贡献独立计算。
-### 2.9.2 逻辑回归与朴素贝叶斯有什么区别?
+### 2.9.3 逻辑回归与朴素贝叶斯有什么区别?
1. 逻辑回归是判别模型, 朴素贝叶斯是生成模型,所以生成和判别的所有区别它们都有。
2. 朴素贝叶斯属于贝叶斯,逻辑回归是最大似然,两种概率哲学间的区别。
-3. 朴素贝叶斯需要独立假设。
+3. 朴素贝叶斯需要条件独立假设。
4. 逻辑回归需要求特征参数间是线性的。
-### 2.9.3线性回归与逻辑回归的区别?(贡献者:黄钦建-华南理工大学)
+### 2.9.4 线性回归与逻辑回归的区别?
+
+(贡献者:黄钦建-华南理工大学)
-线性回归的样本的输出,都是连续值,$ y\in (-\infty ,+\infty )$,而逻辑回归中$y\in (0,1)$,只能取0和1。
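+	The outputs of linear regression samples are continuous values, $ y\in (-\infty ,+\infty )$, whereas in logistic regression $y\in \{0,1\}$, i.e. $y$ can only take the values 0 and 1.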
-对于拟合函数也有本质上的差别:
+ 对于拟合函数也有本质上的差别:
-线性回归:$f(x)=\theta ^{T}x=\theta _{1}x _{1}+\theta _{2}x _{2}+...+\theta _{n}x _{n}$
+ 线性回归:$f(x)=\theta ^{T}x=\theta _{1}x _{1}+\theta _{2}x _{2}+...+\theta _{n}x _{n}$
-逻辑回归:$f(x)=P(y=1|x;\theta )=g(\theta ^{T}x)$,其中,$g(z)=\frac{1}{1+e^{-z}}$
+ 逻辑回归:$f(x)=P(y=1|x;\theta )=g(\theta ^{T}x)$,其中,$g(z)=\frac{1}{1+e^{-z}}$
-可以看出,线性回归的拟合函数,是对f(x)的输出变量y的拟合,而逻辑回归的拟合函数是对为1类的样本的概率的拟合。
+ 可以看出,线性回归的拟合函数,是对f(x)的输出变量y的拟合,而逻辑回归的拟合函数是对为1类样本的概率的拟合。
-那么,为什么要以1类样本的概率进行拟合呢,为什么可以这样拟合呢?
+ 那么,为什么要以1类样本的概率进行拟合呢,为什么可以这样拟合呢?
-$\theta ^{T}x=0$就相当于是1类和0类的决策边界:
+ $\theta ^{T}x=0$就相当于是1类和0类的决策边界:
-当$\theta ^{T}x>0$,则y>0.5;若$\theta ^{T}x\rightarrow +\infty $,则$y \rightarrow 1 $,即y为1类;
+ 当$\theta ^{T}x>0$,则y>0.5;若$\theta ^{T}x\rightarrow +\infty $,则$y \rightarrow 1 $,即y为1类;
-当$\theta ^{T}x<0$,则y<0.5;若$\theta ^{T}x\rightarrow -\infty $,则$y \rightarrow 0 $,即y为0类;
+ 当$\theta ^{T}x<0$,则y<0.5;若$\theta ^{T}x\rightarrow -\infty $,则$y \rightarrow 0 $,即y为0类;
-这个时候就能看出区别来了,在线性回归中$\theta ^{T}x$为预测值的拟合函数;而在逻辑回归中$\theta ^{T}x$为决策边界。
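+The difference is now clear: in linear regression $\theta ^{T}x$ is the fitted function for the predicted value, whereas in logistic regression $\theta ^{T}x=0$ defines the decision boundary.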
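+
+To make the boundary concrete, here are a few values of $g(z)$, evaluated directly from the formula $g(z)=\frac{1}{1+e^{-z}}$ given above (the inputs 0, 2 and -2 are arbitrary illustrative choices):
+$$
+g(0)=\frac{1}{1+e^{0}}=0.5,\quad g(2)=\frac{1}{1+e^{-2}}\approx 0.881,\quad g(-2)=\frac{1}{1+e^{2}}\approx 0.119
+$$
+So a sample with $\theta ^{T}x=2$ is assigned to class 1 with a predicted probability of about 0.881, one with $\theta ^{T}x=-2$ is assigned to class 0, and a sample with $\theta ^{T}x=0$ lies exactly on the boundary.
+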
| | 线性回归 | 逻辑回归 |
|:-------------:|:-------------:|:-----:|
@@ -226,27 +259,9 @@ $\theta ^{T}x=0$就相当于是1类和0类的决策边界:
下面具体解释一下:
-1. 拟合函数和预测函数什么关系呢?其实就是将拟合函数做了一个逻辑函数的转换,转换后使得$y^{(i)} \in (0,1)$;
+1. 拟合函数和预测函数什么关系呢?简单来说就是将拟合函数做了一个逻辑函数的转换,转换后使得$y^{(i)} \in (0,1)$;
2. 最小二乘和最大似然估计可以相互替代吗?回答当然是不行了。我们来看看两者依仗的原理:最大似然估计是计算使得数据出现的可能性最大的参数,依仗的自然是Probability。而最小二乘是计算误差损失。
-### 2.9.4 Factorization Machines(FM)模型原理
-1.FM旨在解决稀疏数据的特征组合问题,某些特征经过关联之后,就会与label之间的相关性就会提高,例如设备id与ip地址之间的特征交叉就会更好的与label之间有相关性.
-2.FM为二阶多项式模型
-
-• 假设有D维特征,𝑥 , ... , 𝑥 ,若采用线性模型,则
-$y = w_{0} +\sum_{j = 1}^{D} w_{i}x_{j}$
-• 若考虑二阶特征组合,得到模型
-$y = w_{0} +\sum_{j = 1}^{D} w_{i}x_{j} + \sum_{i = 1}^{D}\sum_{j = i + 1}^{D}w_{ij}x_{i}x_{j}$
-– 组合特征的参数一共有D(D-1)/2个,任意两个参数都是独立的
-– 数据稀疏使得二次项参数的训练很困难:
-. 每个样本都需要大量非0的$x_{j}$和$x_{i}$样本
-. 训练样本不足会导致$w_{ij}$不准确
-FM采用类似model-based协同过滤中的矩阵分解方式对二次 多项式的系数进行有效表示:
-$y = w_{0} +\sum_{j = 1}^{D} w_{i}x_{j} + \sum_{i = 1}^{D}\sum_{j = i + 1}^{D}x_{i}x_{j}$
-– FM为进一步对隐含向量只取K维
-从而$ = \sum_{k = 1}^{K} v_{i,k}v_{j,k}$
-– 二项式参数之前的D(D-1)/2变成了KD个 大大降低了计算量.
-
## 2.10 代价函数
### 2.10.1 为什么需要代价函数?
@@ -255,114 +270,114 @@ $y = w_{0} +\sum_{j = 1}^{D} w_{i}x_{j} + \sum_{i = 1}^{D}\sum_{j = i + 1}^{D} 由上图,假如最开始,我们在一座大山上的某处位置,因为到处都是陌生的,不知道下山的路,所以只能摸索着根据直觉,走一步算一步,在此过程中,每走到一个位置的时候,都会求解当前位置的梯度,沿着梯度的负方向,也就是当前最陡峭的位置向下走一步,然后继续求解当前位置梯度,向这一步所在位置沿着最陡峭最易下山的位置走一步。不断循环求梯度,就这样一步步的走下去,一直走到我们觉得已经到了山脚。当然这样走下去,有可能我们不能走到山脚,而是到了某一个局部的山峰低处。
-由此,从上面的解释可以看出,梯度下降不一定能够找到全局的最优解,有可能是一个局部最优解。当然,如果损失函数是凸函数,梯度下降法得到的解就一定是全局最优解。
+ 形象化举例,由上图,假如最开始,我们在一座大山上的某处位置,因为到处都是陌生的,不知道下山的路,所以只能摸索着根据直觉,走一步算一步,在此过程中,每走到一个位置的时候,都会求解当前位置的梯度,沿着梯度的负方向,也就是当前最陡峭的位置向下走一步,然后继续求解当前位置梯度,向这一步所在位置沿着最陡峭最易下山的位置走一步。不断循环求梯度,就这样一步步地走下去,一直走到我们觉得已经到了山脚。当然这样走下去,有可能我们不能走到山脚,而是到了某一个局部的山势低处。
+ 由此,从上面的解释可以看出,梯度下降不一定能够找到全局的最优解,有可能是一个局部的最优解。当然,如果损失函数是凸函数,梯度下降法得到的解就一定是全局最优解。
+
+**核心思想归纳**:
-核心思想归纳:
1. 初始化参数,随机选取取值范围内的任意数;
2. 迭代操作:
- a) 计算当前梯度;
- b)修改新的变量;
- c)计算朝最陡的下坡方向走一步;
- d)判断是否需要终止,如否,返回a);
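+	a) Compute the gradient at the current position;
+	b) Determine a step along the negative gradient, the direction of steepest descent;
+	c) Update the variables by that step;
+	d) Check whether to stop; if not, return to a);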
3. 得到全局最优解或者接近全局最优解。
-### 2.12.4 梯度下降法算法描述?
+### 2.12.4 梯度下降法算法描述
1. 确定优化模型的假设函数及损失函数。
-举例,对于线性回归,假设函数为:
-TODO
-其中,TODO分别为模型参数、每个样本的特征值。
-对于假设函数,损失函数为:
-TODO
-2. 相关参数初始化。
-主要初始化TODO、算法迭代步长TODO、终止距离TODO。初始化时可以根据经验初始化,即TODO初始化为0,步长TODO初始化为1。当前步长记为TODO。当然,也可随机初始化。
-3. 迭代计算。
+ 举例,对于线性回归,假设函数为:
+$$
+ h_\theta(x_1,x_2,...,x_n)=\theta_0+\theta_1x_1+...+\theta_nx_n
+$$
+ 其中,$\theta_i,x_i(i=0,1,2,...,n)$分别为模型参数、每个样本的特征值。
+ 对于假设函数,损失函数为:
+$$
+ J(\theta_0,\theta_1,...,\theta_n)=\frac{1}{2m}\sum^{m}_{j=0}(h_\theta (x^{(j)}_0
+ ,x^{(j)}_1,...,x^{(j)}_n)-y_j)^2
+$$
-1) 计算当前位置时损失函数的梯度,对TODO,其梯度表示为:TODO
+2. 相关参数初始化。
+ 主要初始化${\theta}_i$、算法迭代步长${\alpha} $、终止距离${\zeta} $。初始化时可以根据经验初始化,即${\theta} $初始化为0,步长${\alpha} $初始化为1。当前步长记为${\varphi}_i $。当然,也可随机初始化。
-2) 计算当前位置下降的距离。TODO
+3. 迭代计算。
-3) 判断是否终止。
-确定是否所有TODO梯度下降的距离TODO都小于终止距离TODO,如果都小于TODO,则算法终止,当然的值即为最终结果,否则进入下一步。
-4) 更新所有的TODO,更新后的表达式为:TODO
-5) 更新完毕后转入1)。
+ 1)计算当前位置时损失函数的梯度,对${\theta}_i $,其梯度表示为:
-**举例**。以线性回归为例。
-假设样本是
-TODO
-损失函数为
-TODO
-在计算中,TODO的偏导数计算如下:
-TODO
-令上式 。4)中TODO的更新表达式为:
- TODO
-由此,可看出,当前位置的梯度方向由所有样本决定,上式中TODO的目的是为了便于理解。
+$$
+\frac{\partial}{\partial \theta_i}J({\theta}_0,{\theta}_1,...,{\theta}_n)=\frac{1}{m}\sum^{m}_{j=1}(h_\theta (x^{(j)}_0,x^{(j)}_1,...,x^{(j)}_n)-y_j)x^{(j)}_i
+$$
+ 2)计算当前位置下降的距离。
+$$
+{\varphi}_i={\alpha} \frac{\partial}{\partial \theta_i}J({\theta}_0,{\theta}_1,...,{\theta}_n)
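+$$
+ 3) Check for termination.
+ Check whether the descent distance ${\varphi}_i$ of every ${\theta}_i$ is smaller than the termination distance ${\zeta}$. If all of them are, the algorithm stops and the current values of ${\theta}_i$ are the final result; otherwise go to the next step.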
+ 4)更新所有的${\theta}_i$,更新后的表达式为:
+$$
+{\theta}_i={\theta}_i-\alpha \frac{\partial}{\partial \theta_i}J({\theta}_0,{\theta}_1,...,{\theta}_n)
+$$
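+$$
+\theta_i=\theta_i - \alpha \frac{1}{m} \sum^{m}_{j=1}(h_\theta (x^{(j)}_0,x^{(j)}_1,...,x^{(j)}_n)-y_j)x^{(j)}_i
+$$
+ 5) With $x^{(j)}_0=1$ in the expression above, return to 1) once all parameters have been updated.
+ This shows that the gradient direction at the current position is determined by all samples. The factor $\frac{1}{m}$ (and hence $\alpha \frac{1}{m}$) only averages the per-sample contributions and is there to make the expression easier to interpret.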
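The five steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names `predict` and `batch_gradient_descent`, the default values of `alpha`, `zeta` and `max_iters`, and the toy data are all assumptions made for the example.

```python
def predict(theta, x):
    """Hypothesis h_theta(x); theta[0] is the intercept (x_0 = 1)."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def batch_gradient_descent(xs, ys, alpha=0.1, zeta=1e-8, max_iters=100000):
    m, n = len(xs), len(xs[0])
    theta = [0.0] * (n + 1)                      # step 2: initialise theta to 0
    for _ in range(max_iters):
        errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
        # step 3.1: gradient (1/m) * sum_j (h(x_j) - y_j) * x_ji, with x_j0 = 1
        grad = [sum(errors) / m]
        for i in range(n):
            grad.append(sum(e * x[i] for e, x in zip(errors, xs)) / m)
        # step 3.2: descent distance phi_i = alpha * gradient_i
        steps = [alpha * g for g in grad]
        # step 3.3: stop when every descent distance is below zeta
        if all(abs(s) < zeta for s in steps):
            break
        # step 3.4: update all parameters simultaneously
        theta = [t - s for t, s in zip(theta, steps)]
    return theta


# Toy data generated from y = 1 + 2x (assumed for illustration).
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
print(batch_gradient_descent(xs, ys))
```

On this toy data the returned parameters should be close to the intercept 1 and slope 2 used to generate it.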
### 2.12.5 如何对梯度下降法进行调优?
实际使用梯度下降法时,各项参数指标不能一步就达到理想状态,对梯度下降法调优主要体现在以下几个方面:
1. **算法迭代步长$\alpha$选择。**
-在算法参数初始化时,有时根据经验将步长 初始化为1。实际取值取决于数据样本。可以从大到小,多取一些值,分别运行算法看迭代效果,如果损失函数在变小,则取值有效。如果取值无效,说明要增大步长。但步长太大,有时会导致迭代速度过快,错过最优解。步长太小,迭代速度慢,算法运行时间长。
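+ When initialising the algorithm, the step size is sometimes set to 1 by rule of thumb, but a suitable value depends on the data. In practice, try several values from large to small and run the algorithm with each one: if the loss function keeps decreasing, the value is usable; if the loss does not decrease (it oscillates or diverges), the step size is usually too large and should be reduced. Too large a step makes the iteration overshoot and miss the optimum; too small a step makes convergence slow and the run time long.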
2. **参数的初始值选择。**
-初始值不同,获得的最小值也有可能不同,梯度下降有可能得到的是局部最小值。如果损失函数是凸函数,则一定是最优解。由于有局部最优解的风险,需要多次用不同初始值运行算法,关键损失函数的最小值,选择损失函数最小化的初值。
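+ Different initial values can lead to different minima, so gradient descent may return only a local minimum. If the loss function is convex, the minimum found is the global optimum. Because of the risk of local optima, run the algorithm several times with different initial values, compare the resulting minimum values of the loss function, and keep the initial value that gives the smallest loss.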
3. **标准化处理。**
-由于样本不同,特征取值范围也不同,导致迭代速度慢。为了减少特征取值的影响,可对特征数据标准化,使新期望为0,新方差为1,可节省算法运行时间。
+ 由于样本不同,特征取值范围也不同,导致迭代速度慢。为了减少特征取值的影响,可对特征数据标准化,使新期望为0,新方差为1,可节省算法运行时间。
-### 2.12.7 随机梯度和批量梯度区别?
-随机梯度下降和批量梯度下降是两种主要梯度下降法,其目的是增加某些限制来加速运算求解。
-引入随机梯度下降法与mini-batch梯度下降法是为了应对大数据量的计算而实现一种快速的求解。
+### 2.12.6 随机梯度和批量梯度区别?
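+ Stochastic gradient descent (SGD) and batch gradient descent (BGD) are the two main variants of gradient descent; they place different restrictions on how the gradient is computed in order to speed up the solution.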
下面通过介绍两种梯度下降法的求解思路,对其进行比较。
-假设函数为
-TODO
-损失函数为
-TODO
-其中,TODO为样本个数,TODO为参数个数。
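+The hypothesis function is:
+$$
+h_\theta (x_0,x_1,...,x_n) = \theta_0 x_0 + \theta_1 x_1 + ... + \theta_n x_n
+$$
+The loss function is:
+$$
+J(\theta_0, \theta_1, ... , \theta_n) =
+ \frac{1}{2m} \sum^{m}_{j=1}(h_\theta (x^{j}_0
+ ,x^{j}_1,...,x^{j}_n)-y^j)^2
+$$
+where $m$ is the number of samples, $j$ indexes the samples, and $n$ is the number of features, so there are $n+1$ parameters $\theta_0,...,\theta_n$ (with $x_0 = 1$).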
1、 **批量梯度下降的求解思路如下:**
-
-a) 得到每个TODO对应的梯度:
-TODO
-
-b) 由于是求最小化风险函数,所以按每个参数TODO的梯度负方向更新TODO:
-TODO
-
-c) 从上式可以注意到,它得到的虽然是一个全局最优解,但每迭代一步,都要用到训练集所有的数据,如果样本数据 很大,这种方法迭代速度就很慢。
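+a) Compute the gradient with respect to each $ \theta_i $:
+$$
+\frac{\partial}{\partial \theta_i}J({\theta}_0,{\theta}_1,...,{\theta}_n)=\frac{1}{m}\sum^{m}_{j=1}(h_\theta (x^{j}_0
+ ,x^{j}_1,...,x^{j}_n)-y^j)x^{j}_i
+$$
+b) Since the goal is to minimise the risk function, update each parameter $ \theta_i $ along the negative gradient direction with step size $\alpha$:
+$$
+\theta_i=\theta_i - \alpha \frac{1}{m} \sum^{m}_{j=1}(h_\theta (x^{j}_0
+ ,x^{j}_1,...,x^{j}_n)-y^j)x^{j}_i
+$$
+c) The update above moves towards a minimum of the loss over the whole training set (the global optimum when the loss is convex), but every iteration uses all the training data, so when the number of samples is large each iteration is slow.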
相比而言,随机梯度下降可避免这种问题。
2、**随机梯度下降的求解思路如下:**
a) 相比批量梯度下降对应所有的训练样本,随机梯度下降法中损失函数对应的是训练集中每个样本的粒度。
损失函数可以写成如下这种形式,
- TODO
-
-b)对每个参数TODO按梯度方向更新 :
- TODO
-
+$$
+J(\theta_0, \theta_1, ... , \theta_n) =
+ \frac{1}{m} \sum^{m}_{j=1} \frac{1}{2}(y^j - h_\theta (x^{j}_0
+ ,x^{j}_1,...,x^{j}_n))^2 =
+ \frac{1}{m} \sum^{m}_{j=1} cost(\theta,(x^j,y^j))
+$$
+b) Update each parameter $ \theta_i $ along the negative gradient of the cost of a single sample $j$, with step size $\alpha$:
+$$
+\theta_i = \theta_i + \alpha (y^j - h_\theta (x^{j}_0, x^{j}_1, ... ,x^{j}_n)) x^{j}_i
+$$
c) 随机梯度下降是通过每个样本来迭代更新一次。
随机梯度下降伴随的一个问题是噪音较批量梯度下降要多,使得随机梯度下降并不是每次迭代都向着整体最优化方向。
**小结:**
随机梯度下降法、批量梯度下降法相对来说都比较极端,简单对比如下:
-批量梯度下降:
-a)采用所有数据来梯度下降。
-b) 批量梯度下降法在样本量很大的时候,训练速度慢。
-随机梯度下降:
-a) 随机梯度下降用一个样本来梯度下降。
-b) 训练速度很快。
-c) 随机梯度下降法仅仅用一个样本决定梯度方向,导致解有可能不是最优。
-d) 收敛速度来说,随机梯度下降法一次迭代一个样本,导致迭代方向变化很大,不能很快的收敛到局部最优解。
+| 方法 | 特点 |
+| :----------: | :----------------------------------------------------------- |
+| 批量梯度下降 | a)采用所有数据来梯度下降。 b)批量梯度下降法在样本量很大的时候,训练速度慢。 |
+| 随机梯度下降 | a)随机梯度下降用一个样本来梯度下降。 b)训练速度很快。 c)随机梯度下降法仅仅用一个样本决定梯度方向,导致解有可能不是全局最优。 d)收敛速度来说,随机梯度下降法一次迭代一个样本,导致迭代方向变化很大,不能很快的收敛到局部最优解。 |
下面介绍能结合两种方法优点的小批量梯度下降法。
-3、 **小批量(mini-batch)梯度下降的求解思路如下**
+3、 **小批量(Mini-Batch)梯度下降的求解思路如下**
对于总数为$m$个样本的数据,根据样本的数据,选取其中的$n(1< n< m)$个子样本来迭代。其参数$\theta$按梯度方向更新$\theta_i$公式如下:
-TODO
+$$
+\theta_i = \theta_i - \alpha \sum^{t+n-1}_{j=t}
+ ( h_\theta (x^{j}_{0}, x^{j}_{1}, ... , x^{j}_{n} ) - y^j ) x^{j}_{i}
+$$
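The three variants differ only in how many samples feed each update, which the following sketch tries to make concrete. It is an illustration under stated assumptions: the function name `minibatch_gd`, its default hyperparameters, and the choice to average the gradient over the batch (the formula above uses a plain sum, which only rescales the step size) are not from the original text.

```python
import random


def minibatch_gd(xs, ys, alpha=0.05, batch_size=2, epochs=1000, seed=0):
    """Linear regression by mini-batch gradient descent.

    batch_size=1 gives SGD; batch_size=len(xs) gives batch gradient descent.
    """
    rng = random.Random(seed)
    n = len(xs[0])
    theta = [0.0] * (n + 1)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                      # visit samples in random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad = [0.0] * (n + 1)
            for j in batch:
                x = [1.0] + list(xs[j])           # prepend x_0 = 1
                err = sum(t * xi for t, xi in zip(theta, x)) - ys[j]
                for i in range(n + 1):
                    grad[i] += err * x[i] / len(batch)   # averaged over the batch
            theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy data generated from y = 1 + 2x (assumed for illustration).
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
for b in (1, 2, 4):
    print(b, minibatch_gd(xs, ys, batch_size=b))
```

Because the toy data are noise-free, all three settings should end up near the same parameters; on real data the smaller batch sizes would give noisier trajectories.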
-### 2.12.8 各种梯度下降法性能比较
-下表简单对比随机梯度下降(SGD)、批量梯度下降(BGD)、小批量梯度下降(mini-batch GD)、和online GD的区别,主要区别在于如何选取训练数据:
+### 2.12.7 各种梯度下降法性能比较
+ 下表简单对比随机梯度下降(SGD)、批量梯度下降(BGD)、小批量梯度下降(Mini-batch GD)、和Online GD的区别:
-|BGD|SGD|GD|Mini-batch GD|Online GD|
+||BGD|SGD|Mini-batch GD|Online GD|
|:-:|:-:|:-:|:-:|:-:|
|训练集|固定|固定|固定|实时更新|
|单次迭代样本数|整个训练集|单个样本|训练集的子集|根据具体算法定|
@@ -605,148 +700,149 @@ TODO
|时效性|低|一般|一般|高|
|收敛性|稳定|不稳定|较稳定|不稳定|
-BGD、SGD、Mini-batch GD,前面均已讨论过,这里介绍一下Online GD。
+BGD、SGD、Mini-batch GD,前面均已讨论过,这里介绍一下Online GD。
-Online GD于mini-batch GD/SGD的区别在于,所有训练数据只用一次,然后丢弃。这样做的优点在于可预测最终模型的变化趋势。
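+ Online GD differs from Mini-batch GD/SGD in that every training sample is used only once and then discarded. The advantage of doing so is that the trend in how the final model changes can be predicted.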
-Online GD在互联网领域用的较多,比如搜索广告的点击率(CTR)预估模型,网民的点击行为会随着时间改变。用普通的BGD算法(每天更新一次)一方面耗时较长(需要对所有历史数据重新训练);另一方面,无法及时反馈用户的点击行为迁移。而Online GD算法可以实时的依据网民的点击行为进行迁移。
+ Online GD在互联网领域用的较多,比如搜索广告的点击率(CTR)预估模型,网民的点击行为会随着时间改变。用普通的BGD算法(每天更新一次)一方面耗时较长(需要对所有历史数据重新训练);另一方面,无法及时反馈用户的点击行为迁移。而Online GD算法可以实时的依据网民的点击行为进行迁移。
-## 2.13 计算图的导数计算图解?
+## 2.13 计算图的导数计算?
计算图导数计算是反向传播,利用链式法则和隐式函数求导。
- 假设TODO在点TODO处偏导连续,TODO是关于TODO的函数,在TODO点可导,求TODO在TODO点的导数。
+ 假设 $z = f(u,v)$ 在点 $(u,v)$ 处偏导连续,$(u,v)$是关于 $t$ 的函数,在 $t$ 点可导,求 $z$ 在 $t$ 点的导数。
根据链式法则有
-TODO
-
- 为了便于理解,下面举例说明。
-假设$f(x)$是关于a,b,c的函数。链式求导法则如下:
-
$$
-\frac{dJ}{du}=\frac{dJ}{dv}\frac{dv}{du},\frac{dJ}{db}=\frac{dJ}{du}\frac{du}{db},\frac{dJ}{da}=\frac{dJ}{du}\frac{du}{da}
+\frac{dz}{dt}=\frac{\partial z}{\partial u}\cdot\frac{du}{dt}+\frac{\partial z}{\partial v}\cdot\frac{dv}{dt}
$$
-链式法则用文字描述:“由两个函数凑起来的复合函数,其导数等于里边函数代入外边函数的值之导数,乘以里边函数的导数。
-
-例:
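+ The chain rule in words: "the derivative of a composite of two functions equals the derivative of the outer function evaluated at the inner function, multiplied by the derivative of the inner function."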
+ 为了便于理解,下面举例说明:
$$
f(x)=x^2,g(x)=2x+1
$$
-则
-
+ 则:
$$
-{f[g(x)]}'=2[g(x)]*g'(x)=2[2x+1]*2=8x+1
+{f[g(x)]}'=2[g(x)] \times g'(x)=2[2x+1] \times 2=8x+4
$$
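The worked example can be checked numerically. The sketch below compares the chain-rule result with a central finite difference; the helper names and the step `h` are assumptions chosen for the illustration.

```python
def f(u):
    return u ** 2


def g(x):
    return 2 * x + 1


def chain_rule_derivative(x):
    # f'(g(x)) * g'(x) = 2 * g(x) * 2 = 8x + 4
    return 2 * g(x) * 2


def numerical_derivative(func, x, h=1e-6):
    # central finite difference approximation of func'(x)
    return (func(x + h) - func(x - h)) / (2 * h)


x = 3.0
print(chain_rule_derivative(x))                       # analytic value, 8x + 4
print(numerical_derivative(lambda t: f(g(t)), x))     # numerical approximation
```

Both printed values should agree up to floating-point error, matching $8x+4$ evaluated at the chosen point.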
## 2.14 线性判别分析(LDA)
-### 2.14.1 线性判别分析(LDA)思想总结
+### 2.14.1 LDA思想总结
-线性判别分析(Linear Discriminant Analysis,LDA)是一种经典的降维方法。
+ 线性判别分析(Linear Discriminant Analysis,LDA)是一种经典的降维方法。和主成分分析PCA不考虑样本类别输出的无监督降维技术不同,LDA是一种监督学习的降维技术,数据集的每个样本有类别输出。
-和PCA不考虑样本类别输出的无监督降维技术不同,LDA是一种监督学习的降维技术,数据集的每个样本有类别输出。
+LDA分类思想简单总结如下:
+1. 多维空间中,数据处理分类问题较为复杂,LDA算法将多维空间中的数据投影到一条直线上,将d维数据转化成1维数据进行处理。
+2. 对于训练数据,设法将多维数据投影到一条直线上,同类数据的投影点尽可能接近,异类数据点尽可能远离。
+3. 对数据进行分类时,将其投影到同样的这条直线上,再根据投影点的位置来确定样本的类别。
-LDA分类思想简单总结如下:
-1. 多维空间中,数据处理分类问题较为复杂,LDA算法将多维空间中的数据投影到一条直线上,将d维数据转化成1维数据进行处理。
-2. 对于训练数据,设法将多维数据投影到一条直线上,同类数据的投影点尽可能接近,异类数据点尽可能远离。
-3. 对数据进行分类时,将其投影到同样的这条直线上,再根据投影点的位置来确定样本的类别。
如果用一句话概括LDA思想,即“投影后类内方差最小,类间方差最大”。
### 2.14.2 图解LDA核心思想
-假设有红、蓝两类数据,这些数据特征均为二维,如下图所示。我们的目标是将这些数据投影到一维,让每一类相近的数据的投影点尽可能接近,不同类别数据尽可能远,即图中红色和蓝色数据中心之间的距离尽可能大。
+ 假设有红、蓝两类数据,这些数据特征均为二维,如下图所示。我们的目标是将这些数据投影到一维,让每一类相近的数据的投影点尽可能接近,不同类别数据尽可能远,即图中红色和蓝色数据中心之间的距离尽可能大。

左图和右图是两种不同的投影方式。
-左图思路:让不同类别的平均点距离最远的投影方式。
+ 左图思路:让不同类别的平均点距离最远的投影方式。
-右图思路:让同类别的数据挨得最近的投影方式。
+ 右图思路:让同类别的数据挨得最近的投影方式。
-从上图直观看出,右图红色数据和蓝色数据在各自的区域来说相对集中,根据数据分布直方图也可看出,所以右图的投影效果好于左图,左图中间直方图部分有明显交集。
+ 从上图直观看出,右图红色数据和蓝色数据在各自的区域来说相对集中,根据数据分布直方图也可看出,所以右图的投影效果好于左图,左图中间直方图部分有明显交集。
-以上例子是基于数据是二维的,分类后的投影是一条直线。如果原始数据是多维的,则投影后的分类面是一低维的超平面。
+ 以上例子是基于数据是二维的,分类后的投影是一条直线。如果原始数据是多维的,则投影后的分类面是一低维的超平面。
### 2.14.3 二类LDA算法原理?
-输入:数据集TODO,其中样本TODO是n维向量,TODO,TODO降维后的目标维度TODO。定义
+ Input: data set $D=\{(\boldsymbol x_1,\boldsymbol y_1),(\boldsymbol x_2,\boldsymbol y_2),...,(\boldsymbol x_m,\boldsymbol y_m)\}$, where each sample $\boldsymbol x_i $ is an n-dimensional vector, $\boldsymbol y_i \in \{0, 1\}$, and $d$ is the target dimension after reduction. Define
-TODO为第TODO类样本个数;
+ $N_j(j=0,1)$ 为第 $j$ 类样本个数;
-TODO为第TODO类样本的集合;
+ $X_j(j=0,1)$ 为第 $j$ 类样本的集合;
-TODO为第TODO类样本的均值向量;
+ $u_j(j=0,1)$ 为第 $j$ 类样本的均值向量;
-TODO为第TODO类样本的协方差矩阵。
+ $\sum_j(j=0,1)$ 为第 $j$ 类样本的协方差矩阵。
-其中TODO,TODO。
-
-假设投影直线是向量TODO,对任意样本TODO,它在直线TODO上的投影为TODO,两个类别的中心点TODO在直线TODO的投影分别为TODO、TODO。
-
-LDA的目标是让两类别的数据中心间的距离TODO尽量大,与此同时,希望同类样本投影点的协方差TODO、TODO尽量小,最小化TODO。
-定义
-类内散度矩阵TODO
-
-类间散度矩阵TODO
+ 其中
+$$
+u_j = \frac{1}{N_j} \sum_{\boldsymbol x\in X_j}\boldsymbol x(j=0,1),
+\sum_j = \sum_{\boldsymbol x\in X_j}(\boldsymbol x-u_j)(\boldsymbol x-u_j)^T(j=0,1)
+$$
+ Suppose the projection line is given by the vector $\boldsymbol w$. For any sample $\boldsymbol x_i$, its projection onto the line $\boldsymbol w$ is $\boldsymbol w^T\boldsymbol x_i$, and the projections of the two class centres $u_0$, $u_1 $ onto the line are $\boldsymbol w^Tu_0$ and $\boldsymbol w^Tu_1$.
-据上分析,优化目标为TODO
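+ The goal of LDA is to make the distance between the projected class centres, $\| \boldsymbol w^Tu_0 - \boldsymbol w^Tu_1 \|^2_2$, as large as possible, while keeping the covariance of the projected points within each class, $\boldsymbol w^T \sum_0 \boldsymbol w$ and $\boldsymbol w^T \sum_1 \boldsymbol w$, as small as possible, i.e. minimising $\boldsymbol w^T \sum_0 \boldsymbol w + \boldsymbol w^T \sum_1 \boldsymbol w$.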
+ 定义
+ 类内散度矩阵
+$$
+S_w = \sum_0 + \sum_1 =
+ \sum_{\boldsymbol x\in X_0}(\boldsymbol x-u_0)(\boldsymbol x-u_0)^T +
+ \sum_{\boldsymbol x\in X_1}(\boldsymbol x-u_1)(\boldsymbol x-u_1)^T
+$$
+ 类间散度矩阵 $S_b = (u_0 - u_1)(u_0 - u_1)^T$
-根据广义瑞利商的性质,矩阵TODO的最大特征值为TODO的最大值,矩阵TODO的最大特征值对应的特征向量即为TODO。
+ 据上分析,优化目标为
+$$
+\mathop{\arg\max}_\boldsymbol w J(\boldsymbol w) = \frac{\| \boldsymbol w^Tu_0 - \boldsymbol w^Tu_1 \|^2_2}{\boldsymbol w^T \sum_0\boldsymbol w + \boldsymbol w^T \sum_1\boldsymbol w} =
+\frac{\boldsymbol w^T(u_0-u_1)(u_0-u_1)^T\boldsymbol w}{\boldsymbol w^T(\sum_0 + \sum_1)\boldsymbol w} =
+\frac{\boldsymbol w^TS_b\boldsymbol w}{\boldsymbol w^TS_w\boldsymbol w}
+$$
+ 根据广义瑞利商的性质,矩阵 $S^{-1}_{w} S_b$ 的最大特征值为 $J(\boldsymbol w)$ 的最大值,矩阵 $S^{-1}_{w} S_b$ 的最大特征值对应的特征向量即为 $\boldsymbol w$。
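For two classes, $S_b \boldsymbol w$ always points along $u_0 - u_1$, so the eigenvector above is proportional to $S_w^{-1}(u_0 - u_1)$. The sketch below uses that closed form for two-dimensional data; the function names, the restriction to 2D (so the $2 \times 2$ inverse can be written by hand), and the toy points are assumptions made for illustration.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, mu):
    """Within-class scatter: sum of (x - mu)(x - mu)^T over the class."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def lda_direction(class0, class1):
    u0, u1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, u0), scatter(class1, u1)
    sw = [[s0[a][b] + s1[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [u0[0] - u1[0], u0[1] - u1[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / norm, w[1] / norm]        # unit-length projection direction


def project(points, w):
    return [w[0] * p[0] + w[1] * p[1] for p in points]


# Toy two-class data (assumed for illustration).
class0 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class1 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]]
w = lda_direction(class0, class1)
print(w, project(class0, w), project(class1, w))
```

After projection the two classes should occupy disjoint intervals on the line, which is the "small within-class variance, large between-class distance" goal stated above.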
### 2.14.4 LDA算法流程总结?
LDA算法降维流程如下:
-输入:数据集TODO,其中样本TODO是n维向量,TODO,降维后的目标维度TODO。
+ Input: data set $D = \{ (x_1,y_1),(x_2,y_2), ... ,(x_m,y_m) \}$, where each sample $x_i $ is an n-dimensional vector, $y_i \in \{C_1, C_2, ..., C_k\}$, and $d$ is the target dimension after reduction.
-输出:降维后的数据集TODO。
+ 输出:降维后的数据集 $\overline{D} $ 。
步骤:
-1. 计算类内散度矩阵 。
-2. 计算类间散度矩阵 。
-3. 计算矩阵 。
-4. 计算矩阵 的最大的d个特征值。
-5. 计算d个特征值对应的d个特征向量,记投影矩阵为 。
-6. 转化样本集的每个样本,得到新样本 。
-7. 输出新样本集
+1. 计算类内散度矩阵 $S_w$。
+2. 计算类间散度矩阵 $S_b$ 。
+3. 计算矩阵 $S^{-1}_wS_b$ 。
+4. 计算矩阵 $S^{-1}_wS_b$ 的最大的 d 个特征值。
+5. 计算 d 个特征值对应的 d 个特征向量,记投影矩阵为 W 。
+6. Transform every sample in the set to obtain the new sample $p_i = W^Tx_i$.
+7. 输出新样本集 $\overline{D} = \{ (p_1,y_1),(p_2,y_2),...,(p_m,y_m) \}$
### 2.14.5 LDA和PCA区别?
|异同点|LDA|PCA|
|:-:|:-|:-|
-|相同点|1. 两者均可以对数据进行降维;2. 两者在降维时均使用了矩阵特征分解的思想;3. 两者都假设数据符合高斯分布;|
-|不同点|有监督的降维方法|无监督的降维方法|
-||降维最多降到k-1维|降维多少没有限制|
-||可以用于降维,还可以用于分类|只用于降维|
-||选择分类性能最好的投影方向|选择样本点投影具有最大方差的方向|
-||更明确,更能反映样本间差异|目的较为模糊|
+|相同点|1. 两者均可以对数据进行降维; 2. 两者在降维时均使用了矩阵特征分解的思想; 3. 两者都假设数据符合高斯分布;||
+|不同点|有监督的降维方法;|无监督的降维方法;|
+||降维最多降到k-1维;|降维多少没有限制;|
+||可以用于降维,还可以用于分类;|只用于降维;|
+||选择分类性能最好的投影方向;|选择样本点投影具有最大方差的方向;|
+||更明确,更能反映样本间差异;|目的较为模糊;|
### 2.14.6 LDA优缺点?
|优缺点|简要说明|
|:-:|:-|
-|优点|1. 可以使用类别的先验知识;2. 以标签,类别衡量差异性的有监督降维方式,相对于PCA的模糊性,其目的更明确,更能反映样本间的差异;|
-|缺点|1. LDA不适合对非高斯分布样本进行降维;2. LDA降维最多降到k-1维;3. LDA在样本分类信息依赖方差而不是均值时,降维效果不好;4. LDA可能过度拟合数据。|
+|优点|1. 可以使用类别的先验知识; 2. 以标签、类别衡量差异性的有监督降维方式,相对于PCA的模糊性,其目的更明确,更能反映样本间的差异;|
+|缺点|1. LDA不适合对非高斯分布样本进行降维; 2. LDA降维最多降到分类数k-1维; 3. LDA在样本分类信息依赖方差而不是均值时,降维效果不好; 4. LDA可能过度拟合数据。|
## 2.15 主成分分析(PCA)
### 2.15.1 主成分分析(PCA)思想总结
-
1. PCA就是将高维的数据通过线性变换投影到低维空间上去。
2. 投影思想:找出最能够代表原始数据的投影方法。被PCA降掉的那些维度只能是那些噪声或是冗余的数据。
3. 去冗余:去除可以被其他向量代表的线性相关向量,这部分信息量是多余的。
4. 去噪声,去除较小特征值对应的特征向量,特征值的大小反映了变换后在特征向量方向上变换的幅度,幅度越大,说明这个方向上的元素差异也越大,要保留。
5. 对角化矩阵,寻找极大线性无关组,保留较大的特征值,去除较小特征值,组成一个投影矩阵,对原始样本矩阵进行投影,得到降维后的新样本矩阵。
-6. 完成PCA的关键是——协方差矩阵。
-协方差矩阵,能同时表现不同维度间的相关性以及各个维度上的方差。
-协方差矩阵度量的是维度与维度之间的关系,而非样本与样本之间。
+6. 完成PCA的关键是——协方差矩阵。协方差矩阵,能同时表现不同维度间的相关性以及各个维度上的方差。协方差矩阵度量的是维度与维度之间的关系,而非样本与样本之间。
7. 之所以对角化,因为对角化之后非对角上的元素都是0,达到去噪声的目的。对角化后的协方差矩阵,对角线上较小的新方差对应的就是那些该去掉的维度。所以我们只取那些含有较大能量(特征值)的维度,其余的就舍掉,即去冗余。
### 2.15.2 图解PCA核心思想
-PCA可解决训练数据中存在数据特征过多或特征累赘的问题。核心思想是将m维特征映射到n维(n < m),这n维形成主元,是重构出来最能代表原始数据的正交特征。
+ PCA可解决训练数据中存在数据特征过多或特征累赘的问题。核心思想是将m维特征映射到n维(n < m),这n维形成主元,是重构出来最能代表原始数据的正交特征。
-假设数据集是m个n维,$(x^{(1)}, x^{(2)}, \cdots, x^{(m)})$。如果n=2,需要降维到$n'=1$,现在想找到某一维度方向代表这两个维度的数据。下图有$u_1, u_2$两个向量方向,但是哪个向量才是我们所想要的,可以更好代表原始数据集的呢?
+ 假设数据集是m个n维,$(\boldsymbol x^{(1)}, \boldsymbol x^{(2)}, \cdots, \boldsymbol x^{(m)})$。如果$n=2$,需要降维到$n'=1$,现在想找到某一维度方向代表这两个维度的数据。下图有$u_1, u_2$两个向量方向,但是哪个向量才是我们所想要的,可以更好代表原始数据集的呢?

@@ -761,38 +857,58 @@ PCA可解决训练数据中存在数据特征过多或特征累赘的问题。
### 2.15.3 PCA算法推理
下面以基于最小投影距离为评价指标推理:
-假设数据集是m个n维,TODO,且数据进行了中心化。经过投影变换得到新坐标为TODO,其中TODO是标准正交基,即TODO,TODO。经过降维后,新坐标为TODO,其中TODO是降维后的目标维数。样本点TODO在新坐标系下的投影为TODO,其中TODO是TODO在低维坐标系里第j维的坐标。如果用TODO去恢复TODO,则得到的恢复数据为TODO,其中TODO为标准正交基组成的矩阵。
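+ Suppose the data set consists of m samples of dimension n, $(x^{(1)}, x^{(2)},...,x^{(m)})$, and that the data have been centred. After the projection the new coordinate system is $\{w_1,w_2,...,w_n\}$, where the $w_i$ form an orthonormal basis, i.e. $\| w_i \|_2 = 1$ and $w^T_iw_j = 0$ for $i \neq j$.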
-考虑到整个样本集,样本点到这个超平面的距离足够近,目标变为最小化TODO。对此式进行推理,可得:
-TODO
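+ After dimensionality reduction the new coordinate system is $\{ w_1,w_2,...,w_{n'} \}$, where $n'$ is the target dimension. The projection of a sample $x^{(i)}$ in the new coordinate system is $z^{(i)} = \left(z^{(i)}_1, z^{(i)}_2, ..., z^{(i)}_{n'} \right)$, where $z^{(i)}_j = w^T_j x^{(i)}$ is the j-th coordinate of $x^{(i)} $ in the low-dimensional coordinate system.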
-在推导过程中,分别用到了TODO,矩阵转置公式TODO,TODO,TODO以及矩阵的迹,最后两步是将代数和转为矩阵形式。
-由于TODO的每一个向量TODO是标准正交基,TODO是数据集的协方差矩阵,TODO是一个常量。最小化TODO又可等价于
+ If $z^{(i)} $ is used to reconstruct $x^{(i)} $, the reconstructed data is $\widehat{x}^{(i)} = \sum^{n'}_{j=1} z^{(i)}_j w_j = Wz^{(i)}$, where $W$ is the matrix whose columns are the orthonormal basis vectors.
-TODO
+ 考虑到整个样本集,样本点到这个超平面的距离足够近,目标变为最小化 $\sum^m_{i=1} \| \hat{x}^{(i)} - x^{(i)} \|^2_2$ 。对此式进行推理,可得:
+$$
+\sum^m_{i=1} \| \hat{x}^{(i)} - x^{(i)} \|^2_2 =
+ \sum^m_{i=1} \| Wz^{(i)} - x^{(i)} \|^2_2 \\
+ = \sum^m_{i=1} \left( Wz^{(i)} \right)^T \left( Wz^{(i)} \right)
+ - 2\sum^m_{i=1} \left( Wz^{(i)} \right)^T x^{(i)}
+ + \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)} \\
+ = \sum^m_{i=1} \left( z^{(i)} \right)^T \left( z^{(i)} \right)
+ - 2\sum^m_{i=1} \left( z^{(i)} \right)^T W^T x^{(i)}
+ + \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)} \\
+ = \sum^m_{i=1} \left( z^{(i)} \right)^T \left( z^{(i)} \right)
+ - 2\sum^m_{i=1} \left( z^{(i)} \right)^T \left( z^{(i)} \right)
+ + \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)} \\
+ = - \sum^m_{i=1} \left( z^{(i)} \right)^T \left( z^{(i)} \right)
+ + \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)} \\
+ = -tr \left( W^T \left( \sum^m_{i=1} x^{(i)} \left( x^{(i)} \right)^T \right)W \right)
+ + \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)} \\
+ = -tr \left( W^TXX^TW \right)
+ + \sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)}
+$$
+ The derivation uses $\hat{x}^{(i)} = Wz^{(i)}$, the transpose identity $(AB)^T = B^TA^T$, $W^TW = I$, $z^{(i)} = W^Tx^{(i)}$ and the matrix trace; the last two steps rewrite the sums in matrix form.
+ Since every column $w_j$ of $W$ belongs to an orthonormal basis, $\sum^m_{i=1} x^{(i)} \left( x^{(i)} \right)^T$ is the covariance matrix of the data set (up to a constant factor), and $\sum^m_{i=1} \left( x^{(i)} \right)^T x^{(i)} $ is a constant, minimising $\sum^m_{i=1} \| \hat{x}^{(i)} - x^{(i)} \|^2_2$ is equivalent to
+$$
+\mathop{\arg\min}_W \; -tr \left( W^TXX^TW \right) \quad s.t. \; W^TW = I
+$$
利用拉格朗日函数可得到
-TODO
-
-对TODO求导,可得TODO,也即TODO。 是TODO个特征向量组成的矩阵, 为TODO的特征值。TODO即为我们想要的矩阵。
-对于原始数据,只需要TODO,就可把原始数据集降维到最小投影距离的TODO维数据集。
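+$$
+J(W) = -tr \left( W^TXX^TW + \lambda(W^TW - I) \right)
+$$
+ Differentiating with respect to $W$ gives $-XX^TW + \lambda W = 0 $, i.e. $ XX^TW = \lambda W $. Hence $W$ is the matrix formed by $ n' $ eigenvectors of $ XX^T $, and $\lambda$ collects the corresponding eigenvalues of $ XX^T $. To minimise the projection distance, take the eigenvectors belonging to the $n'$ largest eigenvalues; this $W$ is the matrix we want.
+ For the original data, computing $z^{(i)} = W^Tx^{(i)}$ for every sample reduces the data set to the $n'$-dimensional data set with minimal projection distance.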
-基于最大投影方差的推导,这里就不再赘述,有兴趣的同仁可自行查阅资料。
+ 基于最大投影方差的推导,这里就不再赘述,有兴趣的同仁可自行查阅资料。
### 2.15.4 PCA算法流程总结
-输入:TODO维样本集TODO,目标降维的维数TODO。
+输入:$n$ 维样本集 $D = \left( x^{(1)},x^{(2)},...,x^{(m)} \right)$ ,目标降维的维数 $n'$ 。
-输出:降维后的新样本集TODO。
+输出:降维后的新样本集 $D' = \left( z^{(1)},z^{(2)},...,z^{(m)} \right)$ 。
主要步骤如下:
-1. 对所有的样本进行中心化,TODO。
-2. 计算样本的协方差矩阵TODO。
-3. 对协方差矩阵TODO进行特征值分解。
-4. 取出最大的TODO个特征值对应的特征向量TODO。
-5. 标准化特征向量,得到特征向量矩阵TODO。
-6. 转化样本集中的每个样本TODO。
-7. 得到输出矩阵TODO。
-*注*:在降维时,有时不明确目标维数,而是指定降维到的主成分比重阈值TODO。假设TODO个特征值为TODO,则TODO可从TODO得到。
+1. 对所有的样本进行中心化,$ x^{(i)} = x^{(i)} - \frac{1}{m} \sum^m_{j=1} x^{(j)} $ 。
+2. 计算样本的协方差矩阵 $XX^T$ 。
+3. 对协方差矩阵 $XX^T$ 进行特征值分解。
+4. 取出最大的 $n' $ 个特征值对应的特征向量 $\{ w_1,w_2,...,w_{n'} \}$ 。
+5. 标准化特征向量,得到特征向量矩阵 $W$ 。
+6. 转化样本集中的每个样本 $z^{(i)} = W^T x^{(i)}$ 。
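+7. Obtain the output matrix $D' = \left( z^{(1)},z^{(2)},...,z^{(m)} \right)$.
+*Note*: sometimes the target dimension is not given explicitly; instead a threshold $k(k \in (0,1])$ on the proportion of variance kept by the principal components is specified. If the $n$ eigenvalues are $\lambda_1 \geqslant \lambda_2 \geqslant ... \geqslant \lambda_n$, then $n'$ is the smallest value satisfying $\sum^{n'}_{i=1} \lambda_i \geqslant k \times \sum^n_{i=1} \lambda_i $.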
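The steps above can be sketched without any numerical library for the case $n' = 1$. This is only an illustration: it replaces a full eigendecomposition with power iteration (which finds the eigenvector of the largest eigenvalue, provided the starting vector is not orthogonal to it), and the function name, iteration count and toy data are assumptions.

```python
def pca_first_component(samples, iterations=100):
    m, n = len(samples), len(samples[0])
    # Step 1: centre the samples.
    means = [sum(s[d] for s in samples) / m for d in range(n)]
    centred = [[s[d] - means[d] for d in range(n)] for s in samples]
    # Step 2: scatter matrix XX^T (covariance up to a constant factor).
    cov = [[sum(c[a] * c[b] for c in centred) for b in range(n)]
           for a in range(n)]
    # Steps 3-5: power iteration for the top unit eigenvector.
    w = [1.0 / n ** 0.5] * n
    for _ in range(iterations):
        v = [sum(cov[a][b] * w[b] for b in range(n)) for a in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        w = [x / norm for x in v]
    # Largest eigenvalue via the Rayleigh quotient w^T C w.
    eigenvalue = sum(w[a] * cov[a][b] * w[b]
                     for a in range(n) for b in range(n))
    # Share of total variance kept (compare with the threshold k in the note).
    ratio = eigenvalue / sum(cov[d][d] for d in range(n))
    # Steps 6-7: project every centred sample, z = w^T x.
    z = [sum(w[d] * c[d] for d in range(n)) for c in centred]
    return w, z, ratio


# Toy 2D data lying close to the line y = x (assumed for illustration).
data = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2], [5.0, 4.8]]
w, z, ratio = pca_first_component(data)
print(w, z, ratio)
```

For data like this the direction should be close to the diagonal and the retained-variance ratio close to 1, so a single component would satisfy most thresholds $k$.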
### 2.15.5 PCA算法主要优缺点
|优缺点|简要说明|
@@ -802,83 +918,100 @@ TODO
### 2.15.6 降维的必要性及目的
**降维的必要性**:
-1. 多重共线性--预测变量之间相互关联。多重共线性会导致解空间的不稳定,从而可能导致结果的不连贯。
-2. 高维空间本身具有稀疏性。一维正态分布有68%的值落于正负标准差之间,而在十维空间上只有0.02%。
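+1. Multicollinearity: predictor variables are correlated with one another. Multicollinearity makes the solution space unstable, which can lead to incoherent results.
+2. High-dimensional space is inherently sparse. For a one-dimensional normal distribution about 68% of values fall within one standard deviation of the mean, whereas in ten dimensions (with independent components) only about 2% ($0.68^{10} \approx 0.02$) fall within one standard deviation on every axis.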
3. 过多的变量,对查找规律造成冗余麻烦。
4. 仅在变量层面上分析可能会忽略变量之间的潜在联系。例如几个预测变量可能落入仅反映数据某一方面特征的一个组内。
**降维的目的**:
1. 减少预测变量的个数。
2. 确保这些变量是相互独立的。
-3. 提供一个框架来解释结果。关特征,特别是重要特征更能在数据中明确的显示出来;如果只有两维或者三维的话,更便于可视化展示。
+3. 提供一个框架来解释结果。相关特征,特别是重要特征更能在数据中明确的显示出来;如果只有两维或者三维的话,更便于可视化展示。
4. 数据在低维下更容易处理、更容易使用。
5. 去除数据噪声。
6. 降低算法运算开销。
### 2.15.7 KPCA与PCA的区别?
-应用PCA算法的前提是假设存在一个线性的超平面,进而投影。那如果数据不是线性的呢?该怎么办?这时候就需要KPCA,数据集从TODO维映射到线性可分的高维TODO,然后再从TODO维降维到一个低维度TODO。
+ 应用PCA算法前提是假设存在一个线性超平面,进而投影。那如果数据不是线性的呢?该怎么办?这时候就需要KPCA,数据集从 $n$ 维映射到线性可分的高维 $N >n$,然后再从 $N$ 维降维到一个低维度 $n'(n' 训练误差大,测试误差小 → Bias大
->
-> 训练误差小,测试误差大→ Variance大 → 降VC维
->
-> 训练误差大,测试误差大→ 升VC维
+> 
+>
### 2.16.3 经验误差与泛化误差
-误差(error):一般地,我们把学习器的实际预测输出与样本的真是输出之间的差异称为“误差”
-经验误差(empirical error):也叫训练误差(training error)。模型在训练集上的误差。
+经验误差(empirical error):也叫训练误差(training error),模型在训练集上的误差。
泛化误差(generalization error):模型在新样本集(测试集)上的误差称为“泛化误差”。
@@ -886,40 +1019,44 @@ Bias(偏差),Error(误差),和Variance(方差)
根据不同的坐标方式,欠拟合与过拟合图解不同。
1. **横轴为训练样本数量,纵轴为误差**
-
+
如上图所示,我们可以直观看出欠拟合和过拟合的区别:
-模型欠拟合:在训练集以及测试集上同时具有较高的误差,此时模型的偏差较大;
+ 模型欠拟合:在训练集以及测试集上同时具有较高的误差,此时模型的偏差较大;
-模型过拟合:在训练集上具有较低的误差,在测试集上具有较高的误差,此时模型的方差较大。
+ 模型过拟合:在训练集上具有较低的误差,在测试集上具有较高的误差,此时模型的方差较大。
-模型正常:在训练集以及测试集上,同时具有相对较低的偏差以及方差。
+ 模型正常:在训练集以及测试集上,同时具有相对较低的偏差以及方差。
2. **横轴为模型复杂程度,纵轴为误差**
-
+
+
+ 红线为测试集上的Error,蓝线为训练集上的Error
-模型欠拟合:模型在点A处,在训练集以及测试集上同时具有较高的误差,此时模型的偏差较大。
+ 模型欠拟合:模型在点A处,在训练集以及测试集上同时具有较高的误差,此时模型的偏差较大。
-模型过拟合:模型在点C处,在训练集上具有较低的误差,在测试集上具有较高的误差,此时模型的方差较大。
+ 模型过拟合:模型在点C处,在训练集上具有较低的误差,在测试集上具有较高的误差,此时模型的方差较大。
-模型正常:模型复杂程度控制在点B处为最优。
+ 模型正常:模型复杂程度控制在点B处为最优。
3. **横轴为正则项系数,纵轴为误差**
-
+
-模型欠拟合:模型在点C处,在训练集以及测试集上同时具有较高的误差,此时模型的偏差较大。
+ 红线为测试集上的Error,蓝线为训练集上的Error
-模型过拟合:模型在点A处,在训练集上具有较低的误差,在测试集上具有较高的误差,此时模型的方差较大。 它通常发生在模型过于复杂的情况下,如参数过多等,会使得模型的预测性能变弱,并且增加数据的波动性。虽然模型在训练时的效果可以表现的很完美,基本上记住了数据的全部特点,但这种模型在未知数据的表现能力会大减折扣,因为简单的模型泛化能力通常都是很弱的。
+ 模型欠拟合:模型在点C处,在训练集以及测试集上同时具有较高的误差,此时模型的偏差较大。
-模型正常:模型复杂程度控制在点B处为最优。
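+ Overfitting: at point A the model has a low error on the training set and a high error on the test set, so its variance is large. This usually happens when the model is too complex, for example when it has too many parameters; it weakens predictive performance and makes the model more sensitive to fluctuations in the data. The model can look perfect during training because it has essentially memorised every characteristic of the training data, but its performance on unseen data drops sharply, because an overly complex model usually generalises poorly.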
+
+ 模型正常:模型复杂程度控制在点B处为最优。
### 2.16.5 如何解决过拟合与欠拟合?
**如何解决欠拟合:**
1. 添加其他特征项。组合、泛化、相关性、上下文特征、平台特征等特征是特征添加的重要手段,有时候特征项不够会导致模型欠拟合。
-2. 添加多项式特征。例如将线性模型添加二次项或三次项使模型泛化能力更强。例如,FM模型、FFM模型,其实就是线性模型,增加了二阶多项式,保证了模型一定的拟合程度。
+2. 添加多项式特征。例如将线性模型添加二次项或三次项使模型泛化能力更强。例如,FM(Factorization Machine)模型、FFM(Field-aware Factorization Machine)模型,其实就是线性模型,增加了二阶多项式,保证了模型一定的拟合程度。
3. 可以增加模型的复杂程度。
4. 减小正则化系数。正则化的目的是用来防止过拟合的,但是现在模型出现了欠拟合,则需要减少正则化参数。
@@ -929,28 +1066,28 @@ Bias(偏差),Error(误差),和Variance(方差)
3. 降低模型复杂程度。
4. 增大正则项系数。
5. 采用dropout方法,dropout方法,通俗的讲就是在训练的时候让神经元以一定的概率不工作。
-6. early stoping。
+6. early stopping。
7. 减少迭代次数。
8. 增大学习率。
9. 添加噪声数据。
10. 树结构中,可以对树进行剪枝。
+11. 减少特征项。
欠拟合和过拟合这些方法,需要根据实际问题,实际模型,进行选择。
-### 2.16.6 交叉验证的主要作用?
-为了得到更为稳健可靠的模型,对模型的泛化误差进行评估,得到模型泛化误差的近似值。当有多个模型可以选择时,我们通常选择“泛化误差”最小的模型。
+### 2.16.6 交叉验证的主要作用
+ 为了得到更为稳健可靠的模型,对模型的泛化误差进行评估,得到模型泛化误差的近似值。当有多个模型可以选择时,我们通常选择“泛化误差”最小的模型。
-交叉验证的方法有许多种,但是最常用的是:留一交叉验证、k折交叉验证
+ 交叉验证的方法有许多种,但是最常用的是:留一交叉验证、k折交叉验证。
-### 2.16.7 k折交叉验证?
+### 2.16.7 理解k折交叉验证
1. 将含有N个样本的数据集,分成K份,每份含有N/K个样本。选择其中1份作为测试集,另外K-1份作为训练集,测试集就有K种情况。
-2. 在每种情况中,用训练集训练模型,用测试集测试模型,计算模型的泛化误差。
+2. 在每种情况中,用训练集训练模型,用测试集测试模型,计算模型的泛化误差。
3. 交叉验证重复K次,每份验证一次,平均K次的结果或者使用其它结合方式,最终得到一个单一估测,得到模型最终的泛化误差。
-4. 将K种情况下,模型的泛化误差取均值,得到模型最终的泛化误差。
-**注**:
-1. 一般2<=K<=10。 k折交叉验证的优势在于,同时重复运用随机产生的子样本进行训练和验证,每次的结果验证一次,10折交叉验证是最常用的。
-2. 训练集中样本数量要足够多,一般至少大于总样本数的50%。
-3. 训练集和测试集必须从完整的数据集中均匀取样。均匀取样的目的是希望减少训练集、测试集与原数据集之间的偏差。当样本数量足够多时,通过随机取样,便可以实现均匀取样的效果。
+4. 将K种情况下,模型的泛化误差取均值,得到模型最终的泛化误差。
+5. 一般$2\leqslant K \leqslant10$。 k折交叉验证的优势在于,同时重复运用随机产生的子样本进行训练和验证,每次的结果验证一次,10折交叉验证是最常用的。
+6. 训练集中样本数量要足够多,一般至少大于总样本数的50%。
+7. 训练集和测试集必须从完整的数据集中均匀取样。均匀取样的目的是希望减少训练集、测试集与原数据集之间的偏差。当样本数量足够多时,通过随机取样,便可以实现均匀取样的效果。
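The procedure above can be sketched with the standard library alone. The names `k_fold_indices` and `cross_validate`, the shuffling seed, and the `evaluate(train, test)` callback (which stands in for "train a model and return its error") are assumptions introduced for this illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random, roughly uniform split
    folds = [indices[i::k] for i in range(k)]     # k folds of about n/k samples
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, k, evaluate):
    """Average the error returned by evaluate(train, test) over the k folds."""
    errors = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return sum(errors) / len(errors)


for train, test in k_fold_indices(10, 5):
    print(sorted(train), sorted(test))
```

Each sample appears in exactly one test fold, and the averaged result is the single estimate of the generalization error described in steps 3 and 4.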
### 2.16.8 混淆矩阵
第一种混淆矩阵:
@@ -991,332 +1128,283 @@ Bias(偏差),Error(误差),和Variance(方差)
例,在所有实际上有恶性肿瘤的病人中,成功预测有恶性肿瘤的病人的百分比,越高越好。
### 2.16.11 ROC与AUC
-ROC全称是“受试者工作特征”(Receiver Operating Characteristic)。
+ ROC全称是“受试者工作特征”(Receiver Operating Characteristic)。
-ROC曲线的面积就是AUC(Area Under the Curve)。
+ AUC (Area Under Curve) is the area under the ROC curve.
-AUC用于衡量“二分类问题”机器学习算法性能(泛化能力)。
+ AUC用于衡量“二分类问题”机器学习算法性能(泛化能力)。
-ROC曲线,通过将连续变量设定出多个不同的临界值,从而计算出一系列真正率和假正率,再以假正率为纵坐标、真正率为横坐标绘制成曲线,曲线下面积越大,诊断准确性越高。在ROC曲线上,最靠近坐标图左上方的点为假正率和真正率均较高的临界值。
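+ An ROC curve is drawn by setting many different thresholds on a continuous score, computing the true positive rate and false positive rate for each threshold, and plotting the false positive rate on the horizontal axis against the true positive rate on the vertical axis. The larger the area under the curve, the more accurate the classifier. On the ROC curve, the point closest to the top-left corner corresponds to the threshold with a low false positive rate and a high true positive rate.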
-对于分类器,或者说分类算法,评价指标主要有precision,recall,F-score。下图是一个ROC曲线的示例。
+ 对于分类器,或者说分类算法,评价指标主要有Precision,Recall,F-score。下图是一个ROC曲线的示例。

-ROC曲线的横坐标为false positive rate(FPR),纵坐标为true positive rate(TPR)。其中
-TODO, TODO,
-下面着重介绍ROC曲线图中的四个点和一条线。
-第一个点,(0,1),即FPR=0, TPR=1,这意味着FN(false negative)=0,并且FP(false positive)=0。意味着这是一个完美的分类器,它将所有的样本都正确分类。
-第二个点,(1,0),即FPR=1,TPR=0,意味着这是一个最糟糕的分类器,因为它成功避开了所有的正确答案。
-第三个点,(0,0),即FPR=TPR=0,即FP(false positive)=TP(true positive)=0,可以发现该分类器预测所有的样本都为负样本(negative)。
-第四个点,(1,1),即FPR=TPR=1,分类器实际上预测所有的样本都为正样本。
-经过以上分析,ROC曲线越接近左上角,该分类器的性能越好。
+ROC曲线的横坐标为False Positive Rate(FPR),纵坐标为True Positive Rate(TPR)。其中
+$$
+TPR = \frac{TP}{TP+FN} ,FPR = \frac{FP}{FP+TN}
+$$
+
+ 下面着重介绍ROC曲线图中的四个点和一条线。
+ 第一个点(0,1),即FPR=0, TPR=1,这意味着FN(False Negative)=0,并且FP(False Positive)=0。意味着这是一个完美的分类器,它将所有的样本都正确分类。
+ 第二个点(1,0),即FPR=1,TPR=0,意味着这是一个最糟糕的分类器,因为它成功避开了所有的正确答案。
+ 第三个点(0,0),即FPR=TPR=0,即FP(False Positive)=TP(True Positive)=0,可以发现该分类器预测所有的样本都为负样本(Negative)。
+ 第四个点(1,1),即FPR=TPR=1,分类器实际上预测所有的样本都为正样本。
+ 经过以上分析,ROC曲线越接近左上角,该分类器的性能越好。
+
+ ROC曲线所覆盖的面积称为AUC(Area Under Curve),可以更直观的判断学习器的性能,AUC越大则性能越好。
-ROC曲线所覆盖的面积称为AUC(Area Under Curve),可以更直观的判断学习器的性能,AUC越大则性能越好。
### 2.16.12 如何画ROC曲线?
-http://blog.csdn.net/zdy0_2004/article/details/44948511
-下图是一个示例,图中共有20个测试样本,“Class”一栏表示每个测试样本真正的标签(p表示正样本,n表示负样本),“Score”表示每个测试样本属于正样本的概率。
+ 下图是一个示例,图中共有20个测试样本,“Class”一栏表示每个测试样本真正的标签(p表示正样本,n表示负样本),“Score”表示每个测试样本属于正样本的概率。
步骤:
-1、假设已经得出一系列样本被划分为正类的概率,按照大小排序。
-2、从高到低,依次将“Score”值作为阈值threshold,当测试样本属于正样本的概率大于或等于这个threshold时,我们认为它为正样本,否则为负样本。 举例来说,对于图中的第4个样本,其“Score”值为0.6,那么样本1,2,3,4都被认为是正样本,因为它们的“Score”值都大于等于0.6,而其他样本则都认为是负样本。
-3、每次选取一个不同的threshold,得到一组FPR和TPR,即ROC曲线上的一点。以此共得到20组FPR和TPR的值。其中FPR和TPR简单理解如下:
-4、根据3)中的每个坐标点点,画图。
+ 1、假设已经得出一系列样本被划分为正类的概率,按照大小排序。
+ 2、从高到低,依次将“Score”值作为阈值threshold,当测试样本属于正样本的概率大于或等于这个threshold时,我们认为它为正样本,否则为负样本。举例来说,对于图中的第4个样本,其“Score”值为0.6,那么样本1,2,3,4都被认为是正样本,因为它们的“Score”值都大于等于0.6,而其他样本则都认为是负样本。
+ 3、每次选取一个不同的threshold,得到一组FPR和TPR,即ROC曲线上的一点。以此共得到20组FPR和TPR的值。
+ 4、根据3、中的每个坐标点,画图。

### 2.16.13 如何计算TPR,FPR?
1、分析数据
-y_true = [0, 0, 1, 1];
-scores = [0.1, 0.4, 0.35, 0.8];
+y_true = [0, 0, 1, 1];scores = [0.1, 0.4, 0.35, 0.8];
2、列表
-样本 预测属于P的概率(score) 真实类别
-y[0] 0.1 N
-y[2] 0.35 P
-y[1] 0.4 N
-y[3] 0.8 P
+
+| 样本 | 预测属于P的概率(score) | 真实类别 |
+| ---- | ---------------------- | -------- |
+| y[0] | 0.1 | N |
+| y[1] | 0.4 | N |
+| y[2] | 0.35 | P |
+| y[3] | 0.8 | P |
+
3、将截断点依次取为score值,计算TPR和FPR。
当截断点为0.1时:
说明只要score>=0.1,它的预测类别就是正例。 因为4个样本的score都大于等于0.1,所以,所有样本的预测类别都为P。
-scores = [0.1, 0.4, 0.35, 0.8];
-y_true = [0, 0, 1, 1];
-y_pred = [1, 1, 1, 1];
+scores = [0.1, 0.4, 0.35, 0.8];y_true = [0, 0, 1, 1];y_pred = [1, 1, 1, 1];
正例与反例信息如下:
-真实值 预测值
- 正例 反例
-正例 TP=2 FN=0
-反例 FP=2 TN=0
+
+| | 正例 | 反例 |
+| ------------- | ---- | ---- |
+| **正例** | TP=2 | FN=0 |
+| **反例** | FP=2 | TN=0 |
+
由此可得:
-TPR = TP/(TP+FN) = 1;
-FPR = FP/(TN+FP) = 1;
+TPR = TP/(TP+FN) = 1; FPR = FP/(TN+FP) = 1;
当截断点为0.35时:
-scores = [0.1, 0.4, 0.35, 0.8]
-y_true = [0, 0, 1, 1]
-y_pred = [0, 1, 1, 1]
+scores = [0.1, 0.4, 0.35, 0.8];y_true = [0, 0, 1, 1];y_pred = [0, 1, 1, 1];
正例与反例信息如下:
-真实值 预测值
- 正例 反例
-正例 TP=2 FN=0
-反例 FP=1 TN=1
+
+| | 正例 | 反例 |
+| ------------- | ---- | ---- |
+| **正例** | TP=2 | FN=0 |
+| **反例** | FP=1 | TN=1 |
+
由此可得:
-TPR = TP/(TP+FN) = 1;
-FPR = FP/(TN+FP) = 0.5;
+TPR = TP/(TP+FN) = 1; FPR = FP/(TN+FP) = 0.5;
当截断点为0.4时:
-scores = [0.1, 0.4, 0.35, 0.8];
-y_true = [0, 0, 1, 1];
-y_pred = [0, 1, 0, 1];
+scores = [0.1, 0.4, 0.35, 0.8];y_true = [0, 0, 1, 1];y_pred = [0, 1, 0, 1];
正例与反例信息如下:
-真实值 预测值
- 正例 反例
-正例 TP=1 FN=1
-反例 FP=1 TN=1
+
+| | 正例 | 反例 |
+| ------------- | ---- | ---- |
+| **正例** | TP=1 | FN=1 |
+| **反例** | FP=1 | TN=1 |
+
由此可得:
-TPR = TP/(TP+FN) = 0.5;
-FPR = FP/(TN+FP) = 0.5;
+TPR = TP/(TP+FN) = 0.5; FPR = FP/(TN+FP) = 0.5;
当截断点为0.8时:
-scores = [0.1, 0.4, 0.35, 0.8];
-y_true = [0, 0, 1, 1];
-y_pred = [0, 0, 0, 1];
+scores = [0.1, 0.4, 0.35, 0.8];y_true = [0, 0, 1, 1];y_pred = [0, 0, 0, 1];
+
正例与反例信息如下:
-真实值 预测值
- 正例 反例
-正例 TP=1 FN=1
-反例 FP=0 TN=2
+
+| | 正例 | 反例 |
+| ------------- | ---- | ---- |
+| **正例** | TP=1 | FN=1 |
+| **反例** | FP=0 | TN=2 |
+
由此可得:
-TPR = TP/(TP+FN) = 0.5;
-FPR = FP/(TN+FP) = 0;
+TPR = TP/(TP+FN) = 0.5; FPR = FP/(TN+FP) = 0;
+
4、根据TPR、FPR值,以FPR为横轴,TPR为纵轴画图。
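The four cut-off cases worked through above can be reproduced with a few lines of Python. The function name `tpr_fpr` is an assumption for this sketch; the data are the `y_true` and `scores` values from step 1.

```python
def tpr_fpr(y_true, scores, threshold):
    """Return (TPR, FPR) when score >= threshold is predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for threshold in sorted(scores):
    print(threshold, tpr_fpr(y_true, scores, threshold))
```

The printed pairs correspond to the (TPR, FPR) values derived by hand for the cut-offs 0.1, 0.35, 0.4 and 0.8.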
-### 2.16.14 如何计算Auc?
-a.将坐标点按照横着FPR排序
-b.计算第i个坐标点和第i+1个坐标点的间距 dx;
-c.获取第i(或者i+1)个坐标点的纵坐标y;
-d.计算面积微元ds = ydx;
-e.对面积微元进行累加,得到AUC。
+### 2.16.14 如何计算AUC?
+- 将坐标点按照横坐标FPR排序 。
+- 计算第$i$个坐标点和第$i+1$个坐标点的间距$dx$ 。
+- 获取第$i$或者$i+1$个坐标点的纵坐标y。
+- 计算面积微元$ds=ydx$。
+- 对面积微元进行累加,得到AUC。
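The accumulation of area elements can be sketched as below. Two assumptions are made for the illustration: the trapezoid rule is used for each element (for a step-shaped ROC curve it gives the same result as the rectangle $ds = y\,dx$), and the point $(0,0)$, corresponding to a threshold above every score, is added to the (FPR, TPR) points from the example in 2.16.13.

```python
def auc_from_points(points):
    """Area under a curve given as (FPR, TPR) points."""
    pts = sorted(points)                     # sort by FPR, then by TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        dx = x1 - x0                         # spacing between neighbouring points
        area += dx * (y0 + y1) / 2           # area element for this interval
    return area


roc_points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc_from_points(roc_points))
```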
### 2.16.15 Why use ROC and AUC to evaluate a classifier?
-模型有很多评估方法,为什么还要使用ROC和AUC呢?
-因为ROC曲线有个很好的特性:当测试集中的正负样本的分布变换的时候,ROC曲线能够保持不变。在实际的数据集中经常会出现样本类不平衡,即正负样本比例差距较大,而且测试数据中的正负样本也可能随着时间变化。
+ 模型有很多评估方法,为什么还要使用ROC和AUC呢?
+ 因为ROC曲线有个很好的特性:当测试集中的正负样本的分布变换的时候,ROC曲线能够保持不变。在实际的数据集中经常会出现样本类不平衡,即正负样本比例差距较大,而且测试数据中的正负样本也可能随着时间变化。
-### 2.16.17 直观理解AUC
-http://blog.csdn.net/cherrylvlei/article/details/52958720
-AUC是ROC右下方的曲线面积。下图展现了三种AUC的值:
+### 2.16.16 直观理解AUC
+ 下图展现了三种AUC的值:

-AUC是衡量二分类模型优劣的一种评价指标,表示正例排在负例前面的概率。其他评价指标有精确度、准确率、召回率,而AUC比这三者更为常用。
-因为一般在分类模型中,预测结果都是以概率的形式表现,如果要计算准确率,通常都会手动设置一个阈值来将对应的概率转化成类别,这个阈值也就很大程度上影响了模型准确率的计算。
-我们不妨举一个极端的例子:一个二类分类问题一共10个样本,其中9个样本为正例,1个样本为负例,在全部判正的情况下准确率将高达90%,而这并不是我们希望的结果,尤其是在这个负例样本得分还是最高的情况下,模型的性能本应极差,从准确率上看却适得其反。而AUC能很好描述模型整体性能的高低。这种情况下,模型的AUC值将等于0(当然,通过取反可以解决小于50%的情况,不过这是另一回事了)。
+ AUC是衡量二分类模型优劣的一种评价指标,表示正例排在负例前面的概率。其他评价指标有精确度、准确率、召回率,而AUC比这三者更为常用。
+ 一般在分类模型中,预测结果都是以概率的形式表现,如果要计算准确率,通常都会手动设置一个阈值来将对应的概率转化成类别,这个阈值也就很大程度上影响了模型准确率的计算。
+ 举例:
+ 现在假设有一个训练好的二分类器对10个正负样本(正例5个,负例5个)预测,得分按高到低排序得到的最好预测结果为[1, 1, 1, 1, 1, 0, 0, 0, 0, 0],即5个正例均排在5个负例前面,正例排在负例前面的概率为100%。然后绘制其ROC曲线,由于是10个样本,除去原点我们需要描10个点,如下:
-### 2.16.18 代价敏感错误率与代价曲线
+
-http://blog.csdn.net/cug_lzt/article/details/78295140
+ 描点方式按照样本预测结果的得分高低从左至右开始遍历。从原点开始,每遇到1便向y轴正方向移动y轴最小步长1个单位,这里是1/5=0.2;每遇到0则向x轴正方向移动x轴最小步长1个单位,这里也是0.2。不难看出,上图的AUC等于1,印证了正例排在负例前面的概率的确为100%。
-不同的错误会产生不同代价。
-以二分法为例,设置代价矩阵如下:
+ 假设预测结果序列为[1, 1, 1, 1, 0, 1, 0, 0, 0, 0]。
-
+
-当判断正确的时候,值为0,不正确的时候,分别为$Cost_{01}$和$Cost_{10}$ 。
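+ The AUC of the figure above is 0.96, which equals the probability that a positive example is ranked ahead of a negative example, 0.8 × 1 + 0.2 × 0.8 = 0.96; the shaded area in the top-left corner is the probability that a negative example is ranked ahead of a positive one, 0.2 × 0.2 = 0.04.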
-$Cost_{10}$:表示实际为反例但预测成正例的代价。
+ 假设预测结果序列为[1, 1, 1, 0, 1, 0, 1, 0, 0, 0]。
-$Cost_{01}$:表示实际为正例但是预测为反例的代价。
+
-**代价敏感错误率**:
-$\frac{样本中由模型得到的错误值与代价乘积之和}{总样本}$
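+ The AUC of the figure above is 0.88, which equals the probability that a positive example is ranked ahead of a negative example, 0.6 × 1 + 0.2 × 0.8 + 0.2 × 0.6 = 0.88; the shaded area in the top-left corner is the probability that a negative example is ranked ahead of a positive one, 0.2 × 0.2 × 3 = 0.12.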
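The "probability that a positive is ranked ahead of a negative" reading of AUC can be computed directly by counting pairs. In this sketch the function name is an assumption, the input is the list of true labels already sorted by predicted score from high to low, and tied scores are not handled.

```python
def auc_by_ranking(labels_sorted_by_score):
    """Fraction of (positive, negative) pairs with the positive ranked first."""
    positives_seen = 0
    correct_pairs = 0
    for label in labels_sorted_by_score:
        if label == 1:
            positives_seen += 1
        else:
            # every positive seen so far is ranked ahead of this negative
            correct_pairs += positives_seen
    n_pos = positives_seen
    n_neg = len(labels_sorted_by_score) - n_pos
    return correct_pairs / (n_pos * n_neg)


for ranking in ([1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
                [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
                [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]):
    print(ranking, auc_by_ranking(ranking))
```

The three rankings are the ones discussed above, so the results can be compared with the areas read off the figures.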
-其数学表达式为:
+### 2.16.17 代价敏感错误率与代价曲线
-
+不同的错误会产生不同代价。以二分法为例,设置代价矩阵如下:
-$D^{+}、D^{-}$分别代表样例集 的正例子集和反例子集。
+
-代价曲线:
-在均等代价时,ROC曲线不能直接反应出模型的期望总体代价,而代价曲线可以。
-代价曲线横轴为[0,1]的正例函数代价:
+当判断正确的时候,值为0,不正确的时候,分别为$Cost_{01}$和$Cost_{10}$ 。
-$P(+)Cost=\frac{p*Cost_{01}}{p*Cost_{01}+(1-p)*Cost_{10}}$
+$Cost_{10}$:表示实际为反例但预测成正例的代价。
+
+$Cost_{01}$:表示实际为正例但是预测为反例的代价。
+
+**代价敏感错误率**=样本中由模型得到的错误值与代价乘积之和 / 总样本。
+其数学表达式为:
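+$$
+E(f;D;cost)=\frac{1}{m}\left( \sum_{x_{i} \in D^{+}}\mathbb{I}({f(x_i)\neq y_i})\times Cost_{01}+ \sum_{x_{i} \in D^{-}}\mathbb{I}({f(x_i)\neq y_i})\times Cost_{10}\right)
+$$
+$D^{+}$ and $D^{-}$ are the positive and negative subsets of the sample set $D$, $m$ is the total number of samples, $f(x_i)$ is the predicted label, $y_i$ is the true label, and $\mathbb{I}(\cdot)$ is the indicator function (1 when its argument is true, 0 otherwise).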
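+**Cost curve**:
+ Under unequal costs, the ROC curve cannot directly reflect the expected total cost of a model, whereas the cost curve can.
+The horizontal axis of the cost curve is the probability cost of the positive class, with range [0,1]: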
+$$
+P(+)Cost=\frac{p*Cost_{01}}{p*Cost_{01}+(1-p)*Cost_{10}}
+$$
其中p是样本为正例的概率。
The vertical axis of the cost curve is the normalised cost, with range [0,1]:
-$Cost_{norm}=\frac{FNR*p*Cost_{01}+FNR*(1-p)*Cost_{10}}{p*Cost_{01}+(1-p)*Cost_{10}}$
-
+$$
+Cost_{norm}=\frac{FNR*p*Cost_{01}+FPR*(1-p)*Cost_{10}}{p*Cost_{01}+(1-p)*Cost_{10}}
+$$
-其中FPR为假正例率,FNR=1-TPR为假反利率。
+其中FPR为假阳率,FNR=1-TPR为假阴率。
注:ROC每个点,对应代价平面上一条线。
例如,ROC上(TPR,FPR),计算出FNR=1-TPR,在代价平面上绘制一条从(0,FPR)到(1,FNR)的线段,面积则为该条件下期望的总体代价。所有线段下界面积,所有条件下学习器的期望总体代价。
-
+
-### 2.16.19 模型有哪些比较检验方法
-http://wenwen.sogou.com/z/q721171854.htm
+### 2.16.18 模型有哪些比较检验方法
正确性分析:模型稳定性分析,稳健性分析,收敛性分析,变化趋势分析,极值分析等。
有效性分析:误差分析,参数敏感性分析,模型对比检验等。
有用性分析:关键数据求解,极值点,拐点,变化趋势分析,用数据验证动态模拟等。
高效性分析:时空复杂度分析与现有进行比较等。
-### 2.16.20 偏差与方差
-http://blog.csdn.net/zhihua_oba/article/details/78684257
-
-方差公式为:
-
-$S_{N}^{2}=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}$
-
-泛化误差可分解为偏差、方差与噪声之和,即
-generalization error=bias+variance+noise。
-
-噪声:描述了在当前任务上任何学习算法所能达到的期望泛化误差的下界,即刻画了学习问题本身的难度。
-假定期望噪声为零,则泛化误差可分解为偏差、方差之和,即
-generalization error=bias+variance。
-
-偏差(bias):描述的是预测值(估计值)的期望与真实值之间的差距。偏差越大,越偏离真实数据,如下图第二行所示。
+### 2.16.19 为什么使用标准差?
-方差(variance):描述的是预测值的变化范围,离散程度,也就是离其期望值的距离。方差越大,数据的分布越分散,模型的稳定程度越差。如果模型在训练集上拟合效果比较优秀,但是在测试集上拟合效果比较差劣,则方差较大,说明模型的稳定程度较差,出现这种现象可能是由于模型对训练集过拟合造成的。 如下图右列所示。
+方差公式为:$S^2_{N}=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}$
-
+标准差公式为:$S_{N}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}$
-简单的总结一下:
-偏差大,会造成模型欠拟合;
-方差大,会造成模型过拟合。
-
-### 2.16.21为什么使用标准差?
-
-标准差公式为:$S_{N}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}$
-
-样本标准差公式为:$S_{N}=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}$
+样本标准差公式为:$S_{N}=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}$
与方差相比,使用标准差来表示数据点的离散程度有3个好处:
1、表示离散程度的数字与样本数据点的数量级一致,更适合对数据样本形成感性认知。
2、表示离散程度的数字单位与样本数据的单位一致,更方便做后续的分析运算。
-3、在样本数据大致符合正态分布的情况下,标准差具有方便估算的特性:66.7%的数据点落在平均值前后1个标准差的范围内、95%的数据点落在平均值前后2个标准差的范围内,而99%的数据点将会落在平均值前后3个标准差的范围内。
-### 2.16.22点估计思想
-点估计:用实际样本的一个指标来估计总体的一个指标的一种估计方法。
-
-点估计举例:比如说,我们想要了解中国人的平均身高,那么在大街上随便找了一个人,通过测量这个人的身高来估计中国人的平均身高水平;或者在淘宝上买东西的时候随便一次买到假货就说淘宝上都是假货等;这些都属于点估计。
-
-点估计主要思想:在样本数据中得到一个指标,通过这个指标来估计总体指标;比如我们用样本均数来估计总体均数,样本均数就是我们要找到的指标。
-
-### 2.16.23 点估计优良性原则?
-获取样本均数指标相对来说比较简单,但是并不是总体的所有指标都很容易在样本中得到,比如说总体的标准差用样本的哪个指标来估计呢?
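+3、When the sample data roughly follow a normal distribution, the standard deviation gives convenient estimates: about 68% of the data points fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.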
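The three formulas above map onto functions in Python's standard `statistics` module: `pvariance` and `pstdev` divide by $N$, while `stdev` divides by $N-1$. The data values below are made up for the illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]                     # illustrative values, mean 5

population_variance = statistics.pvariance(data)    # (1/N) * sum (x - mean)^2
population_std = statistics.pstdev(data)            # square root of the above
sample_std = statistics.stdev(data)                 # uses 1/(N-1) instead of 1/N

print(population_variance, population_std, sample_std)
```

Note that the standard deviation is in the same unit and order of magnitude as the data, whereas the variance is in squared units, which is the first two advantages listed above.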
-优良性准则有两大类:一类是小样本准则,即在样本大小固定时的优良性准则;另一类是大样本准则,即在样本大小趋于无穷时的优良性准则。最重要的小样本优良性准则是无偏性及与此相关的一致最小方差无偏计。
-
-样本中用来估计总体的指标要符合以下规则:
-
-1.首先必须是无偏统计量。
-所谓无偏性,即数学期望等于总体相应的统计量的样本估计量。
-
-2.最小方差准则
-针对总体样本的无偏估计量不唯一的情况,需选用其他准则,例如最小方差准则。如果一个统计量具有最小方差,也就是说所有的样本点与此统计量的离差平方和最小,则这个统计量被称为最小平方无偏估计量。
-最大概率准则
-
-4、缺一交叉准则
-在非参数回归中好像用的是缺一交叉准则
-
-要明白一个原则:计算样本的任何分布、均数、标准差都是没有任何意义的,如果样本的这种计算不能反映总体的某种特性。
-
-### 2.16.24 点估计、区间估计、中心极限定理之间的联系?
-https://www.zhihu.com/question/21871331#answer-4090464
-点估计:是用样本统计量来估计总体参数,因为样本统计量为数轴上某一点值,估计的结果也以一个点的数值表示,所以称为点估计。
-
-区间估计:通过从总体中抽取的样本,根据一定的正确度与精确度的要求,构造出适当的区间,以作为总体的分布参数(或参数的函数)的真值所在范围的估计。
-中心极限定理:设从均值为、方差为;(有限)的任意一个总体中抽取样本量为n的样本,当n充分大时,样本均值的抽样分布近似服从均值为、方差为的正态分布。
-
-三者之间联系:
-
-1、中心极限定理是推断统计的理论基础,推断统计包括参数估计和假设检验,其中参数估计包括点估计和区间估计,所以说,中心极限定理也是点估计和区间估计的理论基础。
-
-2、参数估计有两种方法:点估计和区间估计,区间估计包含了点估计。
-
-相同点:都是基于一个样本作出;
-
-不同点:点估计只提供单一的估计值,而区间估计基于点估计还提供误差界限,给出了置信区间,受置信度的影响。
-
-### 2.16.25 类别不平衡产生原因?
-类别不平衡(class-imbalance)是指分类任务中不同类别的训练样例数目差别很大的情况。
+### 2.16.20 类别不平衡产生原因?
+ 类别不平衡(class-imbalance)是指分类任务中不同类别的训练样例数目差别很大的情况。
产生原因:
-通常分类学习算法都会假设不同类别的训练样例数目基本相同。如果不同类别的训练样例数目差别很大,则会影响学习结果,测试结果变差。例如二分类问题中有998个反例,正例有2个,那学习方法只需返回一个永远将新样本预测为反例的分类器,就能达到99.8%的精度;然而这样的分类器没有价值。
-### 2.16.26 常见的类别不平衡问题解决方法
-http://blog.csdn.net/u013829973/article/details/77675147
-
+ 分类学习算法通常都会假设不同类别的训练样例数目基本相同。如果不同类别的训练样例数目差别很大,则会影响学习结果,测试结果变差。例如二分类问题中有998个反例,正例有2个,那学习方法只需返回一个永远将新样本预测为反例的分类器,就能达到99.8%的精度;然而这样的分类器没有价值。
+### 2.16.21 常见的类别不平衡问题解决方法
防止类别不平衡对学习造成的影响,在构建分类模型之前,需要对分类不平衡性问题进行处理。主要解决方法有:
1、扩大数据集
-增加包含小类样本数据的数据,更多的数据能得到更多的分布信息。
+ 增加包含小类样本数据的数据,更多的数据能得到更多的分布信息。
2、对大类数据欠采样
-减少大类数据样本个数,使与小样本个数接近。
-缺点:欠采样操作时若随机丢弃大类样本,可能会丢失重要信息。
-代表算法:EasyEnsemble。利用集成学习机制,将大类划分为若干个集合供不同的学习器使用。相当于对每个学习器都进行了欠采样,但在全局来看却不会丢失重要信息。
+ 减少大类数据样本个数,使与小样本个数接近。
+ 缺点:欠采样操作时若随机丢弃大类样本,可能会丢失重要信息。
+ 代表算法:EasyEnsemble。其思想是利用集成学习机制,将大类划分为若干个集合供不同的学习器使用。相当于对每个学习器都进行欠采样,但对于全局则不会丢失重要信息。
3、对小类数据过采样
-过采样:对小类的数据样本进行采样来增加小类的数据样本个数。
+ 过采样:对小类的数据样本进行采样来增加小类的数据样本个数。
-代表算法:SMOTE和ADASYN。
+ 代表算法:SMOTE和ADASYN。
-SMOTE:通过对训练集中的小类数据进行插值来产生额外的小类样本数据。
+ SMOTE:通过对训练集中的小类数据进行插值来产生额外的小类样本数据。
-新的少数类样本产生的策略:对每个少数类样本a,在a的最近邻中随机选一个样本b,然后在a、b之间的连线上随机选一点作为新合成的少数类样本。
-ADASYN:根据学习难度的不同,对不同的少数类别的样本使用加权分布,对于难以学习的少数类的样本,产生更多的综合数据。 通过减少类不平衡引入的偏差和将分类决策边界自适应地转移到困难的样本两种手段,改善了数据分布。
+ 新的少数类样本产生的策略:对每个少数类样本a,在a的最近邻中随机选一个样本b,然后在a、b之间的连线上随机选一点作为新合成的少数类样本。
+ ADASYN:根据学习难度的不同,对不同的少数类别的样本使用加权分布,对于难以学习的少数类的样本,产生更多的综合数据。 通过减少类不平衡引入的偏差和将分类决策边界自适应地转移到困难的样本两种手段,改善了数据分布。
4、使用新评价指标
-如果当前评价指标不适用,则应寻找其他具有说服力的评价指标。比如准确度这个评价指标在类别不均衡的分类任务中并不适用,甚至进行误导。因此在类别不均衡分类任务中,需要使用更有说服力的评价指标来对分类器进行评价。
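+If the current metric is not suitable, look for a more convincing one. Accuracy, for example, is not suitable for class-imbalanced tasks and can even be misleading, so classifiers in such tasks should be evaluated with more informative metrics such as precision, recall, or F1.
5. Try a different algorithm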
-不同的算法适用于不同的任务与数据,应该使用不同的算法进行比较。
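+Different algorithms suit different tasks and data, so several algorithms should be tried and compared.
6. Cost-sensitive weighting of the data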
-例如当分类任务是识别小类,那么可以对分类器的小类样本数据增加权值,降低大类样本的权值,从而使得分类器将重点集中在小类样本身上。
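+For example, when the task is to identify the minority class, increase the weight of minority-class samples and decrease the weight of majority-class samples so that the classifier concentrates on the minority class.
7. Reframe the problem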
-例如在分类问题时,把小类的样本作为异常点,将问题转化为异常点检测或变化趋势检测问题。 异常点检测即是对那些罕见事件进行识别。变化趋势检测区别于异常点检测在于其通过检测不寻常的变化趋势来识别。
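+For example, treat the minority-class samples as outliers and recast the classification problem as anomaly detection or change-trend detection. Anomaly detection identifies rare events; change-trend detection differs in that it identifies them by detecting unusual trends.
8. Break the problem down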
-对问题进行分析与挖掘,将问题划分成多个更小的问题,看这些小问题是否更容易解决。
+Analyze the problem and split it into several smaller problems, then check whether those smaller problems are easier to solve.
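The SMOTE-style interpolation described in item 3 can be sketched as follows. This is an illustrative simplification, not the reference SMOTE implementation: the function name `smote_like_samples`, the parameter `k`, and the sample points are all assumptions made for the example.

```python
import math
import random

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample a and one of its k nearest minority-class neighbors b."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbors of a among the other minority samples
        neighbors = sorted(
            (p for p in minority if p is not a),
            key=lambda p: math.dist(a, p),
        )[:k]
        b = rng.choice(neighbors)
        t = rng.random()  # random position on the segment from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Hypothetical 2-D minority-class samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Every synthetic point lies on a segment between two existing minority samples, so the new data stays inside the region the minority class already occupies.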
## 2.17 Decision trees
### 2.17.1 Basic principle of decision trees
-决策树是一种分而治之(Divide and Conquer)的决策过程。一个困难的预测问题, 通过树的分支节点, 被划分成两个或多个较为简单的子集,从结构上划分为不同的子问题。将依规则分割数据集的过程不断递归下去(Recursive Partitioning)。随着树的深度不断增加,分支节点的子集越来越小,所需要提的问题数也逐渐简化。当分支节点的深度或者问题的简单程度满足一定的停止规则(Stopping Rule)时, 该分支节点会停止劈分,此为自上而下的停止阈值(Cutoff Threshold)法;有些决策树也使用自下而上的剪枝(Pruning)法。
+
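+A decision tree is a divide-and-conquer decision process. A difficult prediction problem is split, at the branch nodes of the tree, into two or more simpler subsets, which structurally become separate subproblems. The rule-based splitting of the dataset is applied recursively (recursive partitioning). As the tree grows deeper, the subsets at the branch nodes become smaller and the questions that need to be asked become simpler. When the depth of a branch node or the simplicity of its subproblem satisfies a stopping rule, the node stops splitting; this is the top-down cutoff-threshold approach. Some decision trees also use bottom-up pruning.
### 2.17.2 The three elements of a decision tree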
-一棵决策树的生成过程主要分为以下3个部分:
-特征选择:从训练数据中众多的特征中选择一个特征作为当前节点的分裂标准,如何选择特征有着很多不同量化评估标准标准,从而衍生出不同的决策树算法。
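+Building a decision tree consists of three main parts:
+
+1. Feature selection: choose one of the many features in the training data as the splitting criterion for the current node. There are many different quantitative criteria for choosing the feature, which gives rise to different decision-tree algorithms.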
-决策树生成:根据选择的特征评估标准,从上至下递归地生成子节点,直到数据集不可分则停止决策树停止生长。树结构来说,递归结构是最容易理解的方式。
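+2. Tree generation: following the chosen feature-evaluation criterion, generate child nodes recursively from the top down until the dataset can no longer be split, at which point the tree stops growing. For tree structures, recursion is the most natural formulation.
+
+3. Pruning: decision trees overfit easily, so pruning is generally needed to reduce the size of the tree and mitigate overfitting. There are two pruning techniques: pre-pruning and post-pruning.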
-剪枝:决策树容易过拟合,一般来需要剪枝,缩小树结构规模、缓解过拟合。剪枝技术有预剪枝和后剪枝两种。
### 2.17.3 Basic decision-tree learning algorithm

### 2.17.4 Advantages and disadvantages of decision trees
-决策树算法的优点:
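+**Advantages of decision trees**:
-1、理解和解释起来简单,决策树模型易想象。
+1. Decision trees are easy to understand, and their mechanism is simple to explain.
-2、相比于其他算法需要大量数据集而已,决策树算法要求的数据集不大。
+2. Decision trees can be used on small datasets.
3. Prediction is cheap: for a reasonably balanced tree, the cost of classifying one sample is logarithmic in the number of data points used to train the tree.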
@@ -1330,7 +1418,7 @@ ADASYN:根据学习难度的不同,对不同的少数类别的样本使用
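8. Efficient: a decision tree is built once and reused many times, and each prediction needs at most as many comparisons as the depth of the tree.
-决策树算法的缺点:
+**Disadvantages of decision trees**:
1. Continuous-valued fields are relatively hard to predict.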
@@ -1338,345 +1426,623 @@ ADASYN:根据学习难度的不同,对不同的少数类别的样本使用
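3. When there are too many classes, errors may increase quickly.
-4、信息缺失时处理起来比较困难,忽略了数据集中属性之间的相关性。
+4. They do not perform well on data whose features are strongly correlated.
-5、在处理特征关联性比较强的数据时表现得不是太好。
+5. When the number of samples differs between classes, information gain is biased toward features that take more distinct values.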
-6、对于各类别样本数量不一致的数据,在决策树当中,信息增益的结果偏向于那些具有更多数值的特征。
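+### 2.17.5 Entropy: concept and interpretation
-### 2.17.5熵的概念以及理解
+Entropy measures the uncertainty of a random variable.
+Definition: suppose the random variable $X$ can take the values $x_{1},x_{2},...,x_{n}$, and each value $x_{i}$ has probability $P(X=x_{i})=p_{i},i=1,2,...,n$. The entropy of the random variable is:
+$$
+H(X)=-\sum_{i=1}^{n}p_{i}\log_{2}p_{i}
+$$
+For a sample set $D$ with $K$ classes, the probability of class $k$ is estimated as $\frac{|C_{k}|}{|D|}$, where $|C_{k}|$ is the number of samples of class $k$ and $|D|$ is the total number of samples. The entropy of the sample set $D$ is:
+$$
+H(D)=-\sum_{k=1}^{K}\frac{|C_{k}|}{|D|}\log_{2}\frac{|C_{k}|}{|D|}
+$$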
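The empirical entropy $H(D)$ is straightforward to compute from class counts. The sketch below is a minimal illustration; the function name `entropy` and the label lists are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) in bits of a list of class labels."""
    total = len(labels)
    # p * log2(1 / p) is the same as -p * log2(p), written to avoid a -0.0 result
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )

balanced = entropy(["yes"] * 5 + ["no"] * 5)   # two equally likely classes
pure = entropy(["yes"] * 10)                   # a single class
skewed = entropy(["yes"] * 9 + ["no"] * 1)     # nearly pure

print(balanced, pure, skewed)
```

A pure set has zero entropy, two equally likely classes give one bit, and a skewed set falls in between.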
-熵:度量随机变量的不确定性。
+### 2.17.6 Understanding information gain
-定义:假设随机变量X的可能取值有$x_{1},x_{2},...,x_{n}$,对于每一个可能的取值$x_{i}$,其概率为$P(X=x_{i})=p_{i},i=1,2...,n$。随机变量的熵为:
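+Definition: the difference between the entropy before and after the dataset is split on a feature.
+Entropy expresses the uncertainty of a sample set: the larger the entropy, the more uncertain the samples. The difference in entropy before and after a split can therefore be used to measure how well a given feature splits the sample set $D$. Let the entropy of $D$ before the split be $H(D)$. Split $D$ on some feature $A$ and compute the conditional entropy of the resulting subsets, $H(D|A)$, which is the average of the subset entropies weighted by subset size.
+The information gain is then:
+$$
+g(D,A)=H(D)-H(D|A)
+$$
+*Note:* when building a decision tree we want the sets to reach purer subsets as quickly as possible, so we always split the current dataset $D$ on the feature with the largest information gain.
+Idea: compute the information gain of splitting $D$ on every candidate feature and pick the largest; the feature that produced it becomes the splitting feature of the current node.
+The information gain ratio is closely related:
+$\text{information gain ratio}=\text{penalty parameter}\times\text{information gain}$
+In essence, the information gain ratio multiplies the information gain by a penalty parameter. When a feature takes many distinct values the penalty parameter is small; when it takes few distinct values the penalty parameter is large.
+Penalty parameter: the reciprocal of the entropy of dataset $D$ with feature $A$ treated as the random variable, that is, $1/H_{A}(D)$ with $H_{A}(D)=-\sum_{v}\frac{|D_{v}|}{|D|}\log_{2}\frac{|D_{v}|}{|D|}$, where $D_{v}$ is the subset of $D$ on which $A$ takes the value $v$.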
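Information gain and the gain ratio can be computed directly from these definitions. The following is a minimal sketch for categorical features; the function names and the toy `outlook` / `windy` data are hypothetical and chosen only so that one feature separates the labels much better than the other.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Empirical entropy in bits of a list of discrete values."""
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in Counter(values).values())

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A) for one categorical feature A."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    # H(D|A): subset entropies weighted by subset size
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def gain_ratio(feature_values, labels):
    """Information gain times the penalty 1 / H_A(D).
    Assumes the feature takes at least two distinct values, so H_A(D) > 0."""
    return information_gain(feature_values, labels) / entropy(feature_values)

# Hypothetical toy data.
labels  = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "sunny", "rainy", "rainy", "sunny", "rainy"]
windy   = ["t", "f", "t", "f", "f", "t"]

gain_outlook = information_gain(outlook, labels)
gain_windy = information_gain(windy, labels)
print(gain_outlook, gain_windy)
print(gain_ratio(outlook, labels), gain_ratio(windy, labels))
```

Here `outlook` splits the labels perfectly, so its gain equals $H(D)$, while `windy` barely reduces the entropy; a tree builder following the rule above would split on `outlook`.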
-$H(X)=-\sum_{i=1}^{n}p_{i}log_{2}p_{i}$
+### 2.17.7 Purpose and strategies of pruning
+Pruning is how decision-tree learning algorithms deal with overfitting.
-对于样本集合 ,假设样本有k个类别,每个类别的概率为$\frac{|C_{k}|}{|D|}$,其中 ${|C_{k}|}{|D|}$为类别为k的样本个数,$|D|$为样本总数。样本集合D的熵为:
-$H(D)=-\sum_{k=1}^{k}\frac{|C_{k}|}{|D|}log_{2}\frac{|C_{k}|}{|D|}$
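+To classify the training samples as correctly as possible, a decision-tree algorithm keeps splitting nodes, which can produce too many branches: peculiarities of the training set are then treated as general properties, and the tree overfits. Pruning removes some branches to reduce this risk.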
-### 2.17.6 信息增益的理解
-定义:以某特征划分数据集前后的熵的差值。
-熵可以表示样本集合的不确定性,熵越大,样本的不确定性就越大。因此可以使用划分前后集合熵的差值来衡量使用当前特征对于样本集合D划分效果的好坏。
-假设划分前样本集合D的熵为H(D)。使用某个特征A划分数据集D,计算划分后的数据子集的熵为H(D|A)。
+The two basic pruning strategies are pre-pruning and post-pruning.
-则信息增益为:
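+Pre-pruning: while the tree is being generated, estimate before each split whether splitting the node would improve generalization performance; if not, stop splitting and mark the current node as a leaf.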
-$g(D,A)=H(D)-H(D|A)$
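+Post-pruning: after the full tree has been generated, examine the non-leaf nodes from the bottom up; if replacing the subtree rooted at a node with a leaf improves generalization performance, make that replacement.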
-注:在决策树构建的过程中我们总是希望集合往最快到达纯度更高的子集合方向发展,因此我们总是选择使得信息增益最大的特征来划分当前数据集D。
+## 2.18 支持向量机
-思想:计算所有特征划分数据集D,得到多个特征划分数据集D的信息增益,从这些信息增益中选择最大的,因而当前结点的划分特征便是使信息增益最大的划分所使用的特征。
+### 2.18.1 什么是支持向量机
+ 支持向量:在求解的过程中,会发现只根据部分数据就可以确定分类器,这些数据称为支持向量。
-另外这里提一下信息增益比相关知识:
+ 支持向量机(Support Vector Machine,SVM):其含义是通过支持向量运算的分类器。
-信息增益比=惩罚参数X信息增益。
+ 在一个二维环境中,其中点R,S,G点和其它靠近中间黑线的点可以看作为支持向量,它们可以决定分类器,即黑线的具体参数。
-信息增益比本质:在信息增益的基础之上乘上一个惩罚参数。特征个数较多时,惩罚参数较小;特征个数较少时,惩罚参数较大。
+
-惩罚参数:数据集D以特征A作为随机变量的熵的倒数。
+ 支持向量机是一种二分类模型,它的目的是寻找一个超平面来对样本进行分割,分割的原则是边界最大化,最终转化为一个凸二次规划问题来求解。由简至繁的模型包括:
-### 2.17.7 剪枝处理的作用及策略?
-剪枝处理是决策树学习算法用来解决过拟合的一种办法。
+ 当训练样本线性可分时,通过硬边界(hard margin)最大化,学习一个线性可分支持向量机;
-在决策树算法中,为了尽可能正确分类训练样本, 节点划分过程不断重复, 有时候会造成决策树分支过多,以至于将训练样本集自身特点当作泛化特点, 而导致过拟合。 因此可以采用剪枝处理来去掉一些分支来降低过拟合的风险。
+ 当训练样本近似线性可分时,通过软边界(soft margin)最大化,学习一个线性支持向量机;
-剪枝的基本策略有预剪枝(prepruning)和后剪枝(postprunint)。
+ 当训练样本线性不可分时,通过核技巧和软边界最大化,学习一个非线性支持向量机;
-预剪枝:在决策树生成过程中,在每个节点划分前先估计其划分后的泛化性能, 如果不能提升,则停止划分,将当前节点标记为叶结点。
+### 2.18.2 支持向量机能解决哪些问题?
-后剪枝:生成决策树以后,再自下而上对非叶结点进行考察, 若将此节点标记为叶结点可以带来泛化性能提升,则修改之。
+**线性分类**
-## 2.18 支持向量机
+ 在训练数据中,每个数据都有n个的属性和一个二分类类别标志,我们可以认为这些数据在一个n维空间里。我们的目标是找到一个n-1维的超平面,这个超平面可以将数据分成两部分,每部分数据都属于同一个类别。
-### 2.18.1 什么是支持向量机
-SVM - Support Vector Machine。支持向量机,其含义是通过支持向量运算的分类器。其中“机”的意思是机器,可以理解为分类器。
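+There are many such hyperplanes, so we look for the best one by adding a constraint: the distance from the hyperplane to the nearest data point on each side must be as large as possible. The result is called the maximum-margin hyperplane, and the corresponding classifier is the maximum-margin classifier.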
-什么是支持向量呢?在求解的过程中,会发现只根据部分数据就可以确定分类器,这些数据称为支持向量。
+**非线性分类**
-见下图,在一个二维环境中,其中点R,S,G点和其它靠近中间黑线的点可以看作为支持向量,它们可以决定分类器,也就是黑线的具体参数。
+ SVM的一个优势是支持非线性分类。它结合使用拉格朗日乘子法(Lagrange Multiplier)和KKT(Karush Kuhn Tucker)条件,以及核函数可以生成非线性分类器。
-
+### 2.18.3 核函数特点及其作用?
+
+ 引入核函数目的:把原坐标系里线性不可分的数据用核函数Kernel投影到另一个空间,尽量使得数据在新的空间里线性可分。
+ 核函数方法的广泛应用,与其特点是分不开的:
-### 2.18.2 支持向量机解决的问题?
-https://www.cnblogs.com/steven-yang/p/5658362.html
-解决的问题:
+1)核函数的引入避免了“维数灾难”,大大减小了计算量。而输入空间的维数n对核函数矩阵无影响。因此,核函数方法可以有效处理高维输入。
-线性分类
+2)无需知道非线性变换函数Φ的形式和参数。
-在训练数据中,每个数据都有n个的属性和一个二类类别标志,我们可以认为这些数据在一个n维空间里。我们的目标是找到一个n-1维的超平面(hyperplane),这个超平面可以将数据分成两部分,每部分数据都属于同一个类别。
+3)核函数的形式和参数的变化会隐式地改变从输入空间到特征空间的映射,进而对特征空间的性质产生影响,最终改变各种核函数方法的性能。
-其实这样的超平面有很多,我们要找到一个最佳的。因此,增加一个约束条件:这个超平面到每边最近数据点的距离是最大的。也成为最大间隔超平面(maximum-margin hyperplane)。这个分类器也成为最大间隔分类器(maximum-margin classifier)。
+4)核函数方法可以和不同的算法相结合,形成多种不同的基于核函数技术的方法,且这两部分的设计可以单独进行,并可以为不同的应用选择不同的核函数和算法。
-支持向量机是一个二类分类器。
+### 2.18.4 SVM为什么引入对偶问题?
-非线性分类
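+1. The dual problem turns the inequality constraints of the primal problem into simpler constraints on the multipliers (an equality constraint together with $\alpha_i\geqslant 0$), so the dual is often easier to solve.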
-SVM的一个优势是支持非线性分类。它结合使用拉格朗日乘子法和KKT条件,以及核函数可以产生非线性分类器。
+2,可以很自然的引用核函数(拉格朗日表达式里面有内积,而核函数也是通过内积进行映射的)。
-分类器1 - 线性分类器
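+3. In optimization theory the objective function f(x) takes several forms: if the objective and the constraints are all linear functions of the variable x, the problem is a linear program; if the objective is quadratic and the constraints are linear, it is a quadratic program; if the objective or any constraint is nonlinear, it is a nonlinear program. Every linear program has a corresponding dual problem, and dual problems have very good properties, for example: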
-是一个线性函数,可以用于线性分类。一个优势是不需要样本数据。
+ a, 对偶问题的对偶是原问题;
-classifier 1:
-f(x)=xwT+b(1)
-(1)f(x)=xwT+b
+ b, 无论原始问题是否是凸的,对偶问题都是凸优化问题;
-ww 和 bb 是训练数据后产生的值。
+ c, 对偶问题可以给出原始问题一个下界;
+ d, 当满足一定条件时,原始问题与对偶问题的解是完全等价的。
-分类器2 - 非线性分类器
+### 2.18.5 Understanding the dual problem in SVMs
-支持线性分类和非线性分类。需要部分样本数据(支持向量),也就是$\alpha_i \ne 0$ 的数据。
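+For the hard-margin support vector machine, finding the classifier can be cast as a convex quadratic programming problem.
+Suppose the optimization objective is
+$$
+\begin{aligned}
+&\min_{\boldsymbol w, b}\frac{1}{2}||\boldsymbol w||^2\\
+&s.t.\ y_i(\boldsymbol w^T\boldsymbol x_i+b)\geqslant 1, i=1,2,\cdots,m.\\
+\end{aligned} \tag{1}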
$$
-w=∑ni=1αiyixiw=∑i=1nαiyixi
+**Step 1**. Reformulate the problem:
$$
+\min_{\boldsymbol w, b} \max_{\alpha_i \geqslant 0} \left\{\frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i(1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))\right\} \tag{2}
+$$
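+The expression above is equivalent to the original problem: if the inequality constraints in (1) are satisfied, the maximum in (2) is attained with every term $\alpha_i(1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))$ equal to 0, which recovers (1); if a constraint in (1) is violated, the maximum in (2) is infinite. Swapping min and max gives the dual problem: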
+$$
+\max_{\alpha_i \geqslant 0} \min_{\boldsymbol w, b} \left\{\frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i(1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))\right\}
+$$
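+In general the dual problem obtained by this swap is not identical to the primal problem: its optimal value is less than or equal to the optimal value of the primal (weak duality).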
-classifier 2:
+**Step 2**. How do we find the best lower bound on the optimal value of problem (1)? Consider the system
+$$
+\frac{1}{2}||\boldsymbol w||^2 < v\\
+1 - y_i(\boldsymbol w^T\boldsymbol x_i+b) \leqslant 0\tag{3}
+$$
+If system (3) has no solution, then $v$ is a lower bound for problem (1). If (3) has a solution, then
+$$
+\forall \boldsymbol \alpha \geqslant 0 , \ \min_{\boldsymbol w, b} \left\{\frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i(1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))\right\} < v
+$$
+By the contrapositive, if
+$$
+\exists \boldsymbol \alpha \geqslant 0 , \ \min_{\boldsymbol w, b} \left\{\frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i(1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))\right\} \geqslant v
+$$
+then (3) has no solution.
-f(x)=∑ni=1αiyiK(xi,x)+bherexi : training data iyi : label value of training data iαi : Lagrange multiplier of training data iK(x1,x2)=exp(−∥x1−x2∥22σ2) : kernel function(2)
-(2)f(x)=∑i=1nαiyiK(xi,x)+bherexi : training data iyi : label value of training data iαi : Lagrange multiplier of training data iK(x1,x2)=exp(−‖x1−x2‖22σ2) : kernel function
+In that case $v$ is a lower bound for problem (1).
-αα, σσ 和 bb 是训练数据后产生的值。
-可以通过调节σσ来匹配维度的大小,σσ越大,维度越低。
+To obtain a good lower bound, take the maximum over $\boldsymbol\alpha$:
+$$
+\max_{\alpha_i \geqslant 0} \min_{\boldsymbol w, b} \left\{\frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i(1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))\right\}
+$$
+**Step 3**. Let
+$$
+L(\boldsymbol w, b,\boldsymbol \alpha) = \frac{1}{2}||\boldsymbol w||^2 + \sum_{i=1}^m\alpha_i(1 - y_i(\boldsymbol w^T\boldsymbol x_i+b))
+$$
+Let $p^*$ be the minimum of the primal problem, attained at $\boldsymbol w^*, b^*$. Then for any $\boldsymbol \alpha \geqslant 0$:
+$$
+p^* = \frac{1}{2}||\boldsymbol w^*||^2 \geqslant L(\boldsymbol w^*, b^*,\boldsymbol \alpha) \geqslant \min_{\boldsymbol w, b} L(\boldsymbol w, b,\boldsymbol \alpha)
+$$
+so $\min_{\boldsymbol w, b} L(\boldsymbol w, b,\boldsymbol \alpha)$ is a lower bound for problem (1).
-### 2.18.3 核函数作用?
+Taking the maximum over $\boldsymbol\alpha$ then gives the best such lower bound:
+$$
+\max_{\alpha_i \geqslant 0} \min_{\boldsymbol w, b} L(\boldsymbol w, b,\boldsymbol \alpha)
+$$
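The max-min problem can be carried one step further. The following is the standard hard-margin SVM algebra, added here as an illustration rather than taken from the original text; it uses the Lagrangian $L$ defined in step 3, with the multipliers written $\alpha_i$. Setting the partial derivatives of $L$ with respect to $\boldsymbol w$ and $b$ to zero and substituting back gives:

$$
\begin{aligned}
\frac{\partial L}{\partial \boldsymbol w}=0 &\;\Rightarrow\; \boldsymbol w=\sum_{i=1}^m \alpha_i y_i \boldsymbol x_i\\
\frac{\partial L}{\partial b}=0 &\;\Rightarrow\; \sum_{i=1}^m \alpha_i y_i=0\\
\min_{\boldsymbol w, b} L(\boldsymbol w, b,\boldsymbol \alpha) &= \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_j y_i y_j \boldsymbol x_i^T\boldsymbol x_j
\end{aligned}
$$

The dual problem is therefore to maximize the last expression over $\boldsymbol\alpha$ subject to $\sum_{i=1}^m\alpha_i y_i=0$ and $\alpha_i\geqslant 0$. The data enter only through the inner products $\boldsymbol x_i^T\boldsymbol x_j$, which is what allows a kernel function to be substituted for them.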
-核函数目的:把原坐标系里线性不可分的数据用Kernel投影到另一个空间,尽量使得数据在新的空间里线性可分。
+### 2.18.7 Common kernel functions
+| Kernel | Expression | Notes |
+| ---------------------------- | ------------------------------------------------------------ | ----------------------------------- |
+| Linear kernel | $k(x,y)=x^{t}y+c$ | |
+| Polynomial kernel | $k(x,y)=(ax^{t}y+c)^{d}$ | $d\geqslant1$ is the degree of the polynomial |
+| Exponential kernel | $k(x,y)=exp(-\frac{\left \|x-y \right \|}{2\sigma ^{2}})$ | $\sigma>0$ |
+| Gaussian kernel | $k(x,y)=exp(-\frac{\left \|x-y \right \|^{2}}{2\sigma ^{2}})$ | $\sigma$ is the bandwidth of the Gaussian kernel, $\sigma>0$ |
+| Laplacian kernel | $k(x,y)=exp(-\frac{\left \|x-y \right \|}{\sigma})$ | $\sigma>0$ |
+| ANOVA kernel | $k(x,y)=\sum_{k=1}^{n}exp(-\sigma(x^{k}-y^{k})^{2})^{d}$ | $x^{k}$ is the $k$-th component of $x$ |
+| Sigmoid kernel | $k(x,y)=tanh(ax^{t}y+c)$ | $tanh$ is the hyperbolic tangent, $a>0,c<0$ |
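A few of the kernels in the table can be written out directly. This is a minimal pure-Python sketch: the function names, default parameter values, and test vectors are assumptions made for illustration. The last part checks that a degree-2 polynomial kernel with $a=1, c=0$ equals an ordinary inner product after an explicit feature map `phi`, which is the sense in which a kernel replaces a mapping into a higher-dimensional space.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y, c=0.0):
    return dot(x, y) + c

def polynomial_kernel(x, y, a=1.0, c=1.0, d=2):
    return (a * dot(x, y) + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def laplacian_kernel(x, y, sigma=1.0):
    return math.exp(-math.dist(x, y) / sigma)

def sigmoid_kernel(x, y, a=1.0, c=-1.0):
    return math.tanh(a * dot(x, y) + c)

def phi(x):
    """Explicit feature map for the 2-D kernel (x . y) ** 2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (2.0, 0.0)
k_implicit = polynomial_kernel(x, y, a=1.0, c=0.0, d=2)  # computed in 2-D
k_explicit = dot(phi(x), phi(y))                          # computed in 3-D

print(k_implicit, k_explicit)
print(gaussian_kernel(x, y), laplacian_kernel(x, y), sigmoid_kernel(x, y))
```

The two values on the first printed line agree, even though `k_implicit` never constructs the three-dimensional features.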
+
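+### 2.18.9 Main characteristics of SVMs
+
+Characteristics:
+
+(1) The theoretical basis of the SVM is nonlinear mapping; the SVM uses an inner-product kernel function in place of an explicit nonlinear mapping into a high-dimensional space.
+(2) The goal of the SVM is the optimal hyperplane that partitions the feature space; the core of the method is maximizing the classification margin.
+(3) The support vectors are the result of SVM training, and they are what determines the classification decision.
+(4) The SVM is a learning method with a solid theoretical foundation that suits small samples. It largely avoids probability measures and the law of large numbers, and it simplifies the usual classification and regression problems.
+(5) The final decision function is determined by only a small number of support vectors, so the computational complexity depends on the number of support vectors rather than on the dimension of the sample space, which in a sense avoids the "curse of dimensionality".
+(6) Because a few support vectors determine the final result, the method helps us focus on the key samples and "discard" a large number of redundant ones; it also makes the algorithm simple and fairly robust. This robustness shows up in three ways:
+① adding or removing non-support-vector samples has no effect on the model;
+② the set of support vectors is itself fairly robust;
+③ in some successful applications, the SVM is not sensitive to the choice of kernel.
+(7) The SVM learning problem can be expressed as a convex optimization problem, so any local optimum is also a global optimum and known efficient algorithms can find the global minimum of the objective. Other classification methods (such as rule-based classifiers and artificial neural networks) use greedy strategies to search the hypothesis space and generally reach only a local optimum.
+(8) The SVM controls model capacity by maximizing the margin of the decision boundary. Even so, the user must supply other parameters, such as the type of kernel function and the penalty on the slack variables.
+(9) On small training sets the SVM can often achieve better results than other algorithms. Its optimization objective is to minimize structural risk rather than empirical risk, which mitigates overfitting; through the notion of margin it obtains a structural description of the data distribution, which lowers the requirements on data size and distribution and gives good generalization.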
+
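+### 2.18.10 Main drawbacks of SVMs
+
+(1) SVMs are hard to apply to large-scale training sets
+The memory cost of an SVM comes mainly from storing the training samples and the kernel matrix. The SVM finds its support vectors by quadratic programming, which involves computations on an m-by-m matrix (m is the number of samples); when m is large, storing and computing this matrix consumes a great deal of memory and time.
+With very large datasets SVM training becomes slow. Spam classification, for example, is often done with a simple naive Bayes classifier or a logistic regression model rather than with an SVM.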
-核函数方法的广泛应用,与其特点是分不开的:
+(2) Multi-class problems are awkward for SVMs
-1)核函数的引入避免了“维数灾难”,大大减小了计算量。而输入空间的维数n对核函数矩阵无影响,因此,核函数方法可以有效处理高维输入。
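+The classical SVM algorithm only handles binary classification, whereas practical applications usually involve multiple classes. One solution is to combine several binary SVMs, mainly through one-vs-rest combinations, one-vs-one combinations, or SVM decision trees. Another is to build an ensemble of several classifiers; the idea is to offset the inherent weaknesses of the SVM with the strengths of other algorithms and so improve accuracy on multi-class problems, for example by combining it with rough set theory to form a complementary multi-class classifier.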
-2)无需知道非线性变换函数Φ的形式和参数.
+(3) Sensitive to missing data and to the choice of parameters and kernel function
-3)核函数的形式和参数的变化会隐式地改变从输入空间到特征空间的映射,进而对特征空间的性质产生影响,最终改变各种核函数方法的性能。
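+The performance of an SVM depends mainly on the choice of kernel function, so for a practical problem the key question is how to choose a kernel that suits the actual data model. At present, kernel functions and their parameters are chosen manually, from experience, which is somewhat arbitrary. Different problem domains call for kernels of different forms and parameters, so domain knowledge should be brought into the choice, but there is still no good general method for selecting a kernel.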
-4)核函数方法可以和不同的算法相结合,形成多种不同的基于核函数技术的方法,且这两部分的设计可以单独进行,并可以为不同的应用选择不同的核函数和算法。
+### 2.18.11 逻辑回归与SVM的异同
-### 2.18.4 对偶问题
-### 2.18.5 理解支持向量回归
-http://blog.csdn.net/liyaohhh/article/details/51077082
+相同点:
-### 2.18.6 理解SVM(核函数)
-http://blog.csdn.net/Love_wanling/article/details/69390047
+- LR和SVM都是**分类**算法。
+- LR和SVM都是**监督学习**算法。
+- LR和SVM都是**判别模型**。
+- 如果不考虑核函数,LR和SVM都是**线性分类**算法,也就是说他们的分类决策面都是线性的。
+ 说明:LR也是可以用核函数的.但LR通常不采用核函数的方法。(**计算量太大**)
-### 2.18.7 常见的核函数有哪些?
-http://blog.csdn.net/Love_wanling/article/details/69390047
+不同点:
-本文将遇到的核函数进行收集整理,分享给大家。
-http://blog.csdn.net/wsj998689aa/article/details/47027365
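+**1. LR uses the log loss; SVM uses the hinge loss.**
+Loss function of logistic regression:
+$$
+J(\theta)=-\frac{1}{m}\sum^m_{i=1}\left[y^{(i)}\log h_{\theta}(x^{(i)})+ (1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]
+$$
+Objective (Lagrangian) of the support vector machine:
+$$
+L(w,b,\alpha)=\frac{1}{2}||w||^2-\sum^n_{i=1}\alpha_i \left( y_i(w^Tx_i+b)-1\right)
+$$
+Logistic regression is based on probability theory: it assumes that the probability of a sample being in class 1 can be expressed with the sigmoid function, and then estimates the parameters by **maximum likelihood estimation**.
+The support vector machine is based on the geometric principle of **margin maximization**: the separating surface with the largest geometric margin is taken to be the optimal one.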
-1.Linear Kernel
-线性核是最简单的核函数,核函数的数学公式如下:
+2、**LR对异常值敏感,SVM对异常值不敏感**。
-$k(x,y)=xy$
+ 支持向量机只考虑局部的边界线附近的点,而逻辑回归考虑全局。LR模型找到的那个超平面,是尽量让所有点都远离他,而SVM寻找的那个超平面,是只让最靠近中间分割线的那些点尽量远离,即只用到那些支持向量的样本。
+ 支持向量机改变非支持向量样本并不会引起决策面的变化。
+ 逻辑回归中改变任何样本都会引起决策面的变化。
-如果我们将线性核函数应用在KPCA中,我们会发现,推导之后和原始PCA算法一模一样,很多童鞋借此说“kernel is shit!!!”,这是不对的,这只是线性核函数偶尔会出现等价的形式罢了。
+3、**计算复杂度不同。对于海量数据,SVM的效率较低,LR效率比较高**
-2.Polynomial Kernel
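+When there are few samples and the feature dimension is low, both SVM and LR run quickly, with SVM somewhat faster, while LR tends to be the more accurate of the two. As the sample size grows a little, SVM running time starts to increase but its accuracy overtakes that of LR; the running time is longer yet still acceptable. When the data grows to around 20000 samples and 200 feature dimensions, SVM running time increases sharply and far exceeds that of LR, while its accuracy is about the same as LR's. (The main reason is that the many non-support-vector samples still take part in the computation, which enlarges the SVM's quadratic programming problem.)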
-多项式核实一种非标准核函数,它非常适合于正交归一化后的数据,其具体形式如下:
+4、**对非线性问题的处理方式不同**
-$k(x,y)=(ax^{t}y+c)^{d}$
+ LR主要靠特征构造,必须组合交叉特征,特征离散化。SVM也可以这样,还可以通过核函数kernel(因为只有支持向量参与核计算,计算复杂度不高)。由于可以利用核函数,SVM则可以通过对偶求解高效处理。LR则在特征空间维度很高时,表现较差。
-这个核函数是比较好用的,就是参数比较多,但是还算稳定。
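+5. **The SVM objective has regularization built in.**
+The $\frac{1}{2}||w||^2$ term in its objective acts as a regularizer, which is why the SVM is described as a structural-risk-minimization algorithm. LR, by contrast, needs a regularization term added to its loss function explicitly.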
-3.Gaussian Kernel
+6、SVM自带**结构风险最小化**,LR则是**经验风险最小化**。
-这里说一种经典的鲁棒径向基核,即高斯核函数,鲁棒径向基核对于数据中的噪音有着较好的抗干扰能力,其参数决定了函数作用范围,超过了这个范围,数据的作用就“基本消失”。高斯核函数是这一族核函数的优秀代表,也是必须尝试的核函数,其数学形式如下:
+7、SVM会用核函数而LR一般不用核函数。
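The difference between the two losses in point 1, and the outlier behavior in point 2, can be seen by evaluating both as functions of the margin $y(w^Tx+b)$ with $y\in\{-1,+1\}$. This is a minimal sketch: the function names and the list of margins are assumptions made for the example, and the log loss is written in its margin form, which is equivalent to the cross-entropy formula with labels in $\{0,1\}$.

```python
import math

def hinge_loss(margin):
    """SVM loss for one sample; margin = y * (w.x + b) with y in {-1, +1}."""
    return max(0.0, 1.0 - margin)

def log_loss(margin):
    """Logistic-regression loss for one sample, written in the same margin form."""
    return math.log(1.0 + math.exp(-margin))

for margin in (-2.0, 0.0, 0.5, 1.0, 3.0, 10.0):
    print(f"margin={margin:5.1f}  hinge={hinge_loss(margin):.4f}  log={log_loss(margin):.4f}")
```

Once a sample is beyond the margin (margin of at least 1) its hinge loss is exactly zero, so it has no influence on the SVM solution, whereas its log loss is small but never zero, so every sample still contributes to the LR solution.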
-$k(x,y)=exp(-\frac{\left \| x-y \right \|^{2}}{2\sigma ^{2}})$
+## 2.19 贝叶斯分类器
+### 2.19.1 图解极大似然估计
-虽然被广泛使用,但是这个核函数的性能对参数十分敏感,以至于有一大把的文献专门对这种核函数展开研究,同样,高斯核函数也有了很多的变种,如指数核,拉普拉斯核等。
+极大似然估计的原理,用一张图片来说明,如下图所示:
-4.Exponential Kernel
+
-指数核函数就是高斯核函数的变种,它仅仅是将向量之间的L2距离调整为L1距离,这样改动会对参数的依赖性降低,但是适用范围相对狭窄。其数学形式如下:
+ 例:有两个外形完全相同的箱子,1号箱有99只白球,1只黑球;2号箱有1只白球,99只黑球。在一次实验中,取出的是黑球,请问是从哪个箱子中取出的?
-$k(x,y)=exp(-\frac{\left \| x-y \right \|}{2\sigma ^{2}})$
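+Intuitively, most people would guess that the black ball most likely came from box 2. "Most likely" here carries the meaning of "maximum likelihood", and this way of reasoning is known as the maximum likelihood principle.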
-5.Laplacian Kernel
+### 2.19.2 极大似然估计原理
-拉普拉斯核完全等价于指数核,唯一的区别在于前者对参数的敏感性降低,也是一种径向基核函数。
+ 总结起来,最大似然估计的目的就是:利用已知的样本结果,反推最有可能(最大概率)导致这样结果的参数值。
-$k(x,y)=exp(-\frac{\left \| x-y \right \|}{\sigma })$
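+Maximum likelihood estimation is a statistical method built on the maximum likelihood principle. It provides a way to estimate model parameters from observed data, in the setting "the model is fixed, the parameters are unknown": after a number of trials, we take the parameter value under which the observed results have the highest probability, and this value is the maximum likelihood estimate.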
-6.ANOVA Kernel
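+Because the samples in the sample set are independent and identically distributed, we can consider a single class's sample set $D$ and use it to estimate the parameter vector $\vec\theta$. Denote the known sample set by:
+$$
+D=\{\vec x_{1},\vec x_{2},...,\vec x_{n}\}
+$$
+Likelihood function: the joint probability density $p(D|\vec\theta )$, viewed as a function of $\vec\theta$, is called the likelihood function of $\vec\theta$ with respect to $\vec x_{1},\vec x_{2},...,\vec x_{n}$.
+$$
+l(\vec\theta )=p(D|\vec\theta ) =p(\vec x_{1},\vec x_{2},...,\vec x_{n}|\vec\theta )=\prod_{i=1}^{n}p(\vec x_{i}|\vec \theta )
+$$
+If $\hat{\vec\theta}$ is the value of $\vec\theta$ in the parameter space that maximizes the likelihood function $l(\vec\theta)$, then $\hat{\vec\theta}$ is the "most likely" parameter value and is the maximum likelihood estimator of $\vec\theta$. It is a function of the sample set, written:
+$$
+\hat{\vec\theta}=d(D)= \mathop {\arg \max}_{\vec\theta} l(\vec\theta )
+$$
+$\hat{\vec\theta}(\vec x_{1},\vec x_{2},...,\vec x_{n})$ is called the maximum likelihood estimate.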
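For a concrete instance of the arg max, consider drawing balls with replacement from a box with an unknown fraction $p$ of black balls. The sketch below is illustrative only: the observations are hypothetical, and a simple grid search stands in for the analytical maximization.

```python
import math

def log_likelihood(p, observations):
    """Log-likelihood of i.i.d. Bernoulli(p) observations (1 = black ball)."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in observations)

# Hypothetical draws: 7 black balls and 3 white balls.
observations = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

# Evaluate the log-likelihood on a grid of candidate parameters and keep the best.
candidates = [i / 100 for i in range(1, 100)]
p_hat = max(candidates, key=lambda p: log_likelihood(p, observations))

sample_fraction = sum(observations) / len(observations)
print(p_hat, sample_fraction)
```

The grid maximizer coincides with the observed fraction of black balls, which is the closed-form maximum likelihood estimate for a Bernoulli parameter.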
-ANOVA 核也属于径向基核函数一族,其适用于多维回归问题,数学形式如下:
+### 2.19.3 贝叶斯分类器基本原理
-$k(x,y)=exp(-\sigma(x^{k}-y^{k})^{2})^{d}$
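+Bayesian decision theory selects the optimal class label by minimizing the expected **misclassification loss**, under the assumption that **all relevant probabilities are known**.
+Suppose there are $N$ possible class labels, $Y=\{c_1,c_2,...,c_N\}$. Which class does a sample $\boldsymbol{x}$ belong to?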
-7.Sigmoid Kernel
+计算步骤如下:
-Sigmoid 核来源于神经网络,现在已经大量应用于深度学习,是当今机器学习的宠儿,它是S型的,所以被用作于“激活函数”。关于这个函数的性质可以说好几篇文献,大家可以随便找一篇深度学习的文章看看。
+step 1. 算出样本$\boldsymbol{x}$属于第i个类的概率,即$P(c_i|x)$;
-$k(x,y)=tanh(ax^{t}y+c)$
+step 2. 通过比较所有的$P(c_i|\boldsymbol{x})$,得到样本$\boldsymbol{x}$所属的最佳类别。
-8.Rational Quadratic Kernel
-二次有理核完完全全是作为高斯核的替代品出现,如果你觉得高斯核函数很耗时,那么不妨尝试一下这个核函数,顺便说一下,这个核函数作用域虽广,但是对参数十分敏感,慎用!!!!
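+step 3. Substitute the class $c_i$ and the sample $\boldsymbol{x}$ into Bayes' formula to obtain:
+$$
+P(c_i|\boldsymbol{x})=\frac{P(\boldsymbol{x}|c_i)P(c_i)}{P(\boldsymbol{x})}.
+$$
+Here $P(c_i)$ is the prior probability, $P(\boldsymbol{x}|c_i)$ is the class-conditional probability (likelihood), and $P(\boldsymbol{x})$ is the evidence factor used for normalization. $P(c_i)$ can be estimated by the proportion of training samples whose class is $c_i$. Moreover, since we only need the class with the largest $P(c_i|\boldsymbol{x})$ and $P(\boldsymbol{x})$ is the same for every class, $P(\boldsymbol{x})$ does not need to be computed.
+To compute the class-conditional probability, different methods have been proposed under different assumptions; the naive Bayes classifier and the semi-naive Bayes classifier are introduced below.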
-$k(x,y)=1-\frac{\left \| x-y \right \|^{2}}{\left \| x-y \right \|^{2}+c}$
+### 2.19.4 朴素贝叶斯分类器
-### 2.18.8 软间隔与正则化
-### 2.18.9 SVM主要特点及缺点?
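+Suppose the sample $\boldsymbol{x}$ has $d$ attributes, that is, $\boldsymbol{x}=\{ x_1,x_2,...,x_d\}$. Then:
+$$
+P(\boldsymbol{x}|c_i)=P(x_1,x_2,\cdots,x_d|c_i)
+$$
+This joint probability is hard to estimate directly from a finite training set. Naive Bayes (NB) therefore adopts the attribute conditional independence assumption: given the class, all attributes are assumed to be mutually independent. Then:
+$$
+P(x_1,x_2,\cdots,x_d|c_i)=\prod_{j=1}^d P(x_j|c_i)
+$$
+The decision rule follows directly:
+$$
+h_{nb}(\boldsymbol{x})=\mathop{\arg \max}_{c_i\in Y} P(c_i)\prod_{j=1}^dP(x_j|c_i)
+$$
+**Estimating the conditional probability $P(x_j|c_i)$**
-http://www.elecfans.com/emb/fpga/20171118582139_2.html
+If $x_j$ is a categorical (discrete) attribute, $P(x_j|c_i)$ can be estimated by counting:
+$$
+P(x_j|c_i)=\frac{P(x_j,c_i)}{P(c_i)}\approx\frac{\#(x_j,c_i)}{\#(c_i)}
+$$
+where $\#(x_j,c_i)$ is the number of training samples in which $x_j$ and $c_{i}$ occur together, and $\#(c_i)$ is the number of training samples of class $c_i$.
+
+If $x_j$ is a numerical attribute, we usually assume that the $j$-th attribute of all samples in class $c_{i}$ follows a normal distribution. We first estimate the mean $\mu$ and the variance $\sigma^2$ of this distribution, and then evaluate its probability density at $x_j$ to obtain $P(x_j|c_i)$.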
+
+### 2.19.5 举例理解朴素贝叶斯分类器
+
+使用经典的西瓜训练集如下:
+
+| 编号 | 色泽 | 根蒂 | 敲声 | 纹理 | 脐部 | 触感 | 密度 | 含糖率 | 好瓜 |
+| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :---: | :----: | :--: |
+| 1 | 青绿 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.697 | 0.460 | 是 |
+| 2 | 乌黑 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 0.774 | 0.376 | 是 |
+| 3 | 乌黑 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.634 | 0.264 | 是 |
+| 4 | 青绿 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 0.608 | 0.318 | 是 |
+| 5 | 浅白 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.556 | 0.215 | 是 |
+| 6 | 青绿 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 0.403 | 0.237 | 是 |
+| 7 | 乌黑 | 稍蜷 | 浊响 | 稍糊 | 稍凹 | 软粘 | 0.481 | 0.149 | 是 |
+| 8 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 硬滑 | 0.437 | 0.211 | 是 |
+| 9 | 乌黑 | 稍蜷 | 沉闷 | 稍糊 | 稍凹 | 硬滑 | 0.666 | 0.091 | 否 |
+| 10 | 青绿 | 硬挺 | 清脆 | 清晰 | 平坦 | 软粘 | 0.243 | 0.267 | 否 |
+| 11 | 浅白 | 硬挺 | 清脆 | 模糊 | 平坦 | 硬滑 | 0.245 | 0.057 | 否 |
+| 12 | 浅白 | 蜷缩 | 浊响 | 模糊 | 平坦 | 软粘 | 0.343 | 0.099 | 否 |
+| 13 | 青绿 | 稍蜷 | 浊响 | 稍糊 | 凹陷 | 硬滑 | 0.639 | 0.161 | 否 |
+| 14 | 浅白 | 稍蜷 | 沉闷 | 稍糊 | 凹陷 | 硬滑 | 0.657 | 0.198 | 否 |
+| 15 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 0.360 | 0.370 | 否 |
+| 16 | 浅白 | 蜷缩 | 浊响 | 模糊 | 平坦 | 硬滑 | 0.593 | 0.042 | 否 |
+| 17 | 青绿 | 蜷缩 | 沉闷 | 稍糊 | 稍凹 | 硬滑 | 0.719 | 0.103 | 否 |
+
+Classify the following test example "测1":
+
+| 编号 | 色泽 | 根蒂 | 敲声 | 纹理 | 脐部 | 触感 | 密度 | 含糖率 | 好瓜 |
+| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :---: | :----: | :--: |
+| 测1 | 青绿 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.697 | 0.460 | ? |
+
+First, estimate the class prior probabilities $P(c_i)$:
+$$
+\begin{align}
+&P(好瓜=是)=\frac{8}{17}=0.471 \newline
+&P(好瓜=否)=\frac{9}{17}=0.529
+\end{align}
+$$
+然后,为每个属性估计条件概率(这里,对于连续属性,假定它们服从正态分布)
+$$
+P_{青绿|是}=P(色泽=青绿|好瓜=是)=\frac{3}{8}=0.375
+$$
-3.3.2.1 SVM有如下主要几个特点:
+$$
+P_{青绿|否}=P(色泽=青绿|好瓜=否)=\frac{3}{9}\approx0.333
+$$
-(1)非线性映射是SVM方法的理论基础,SVM利用内积核函数代替向高维空间的非线性映射;
-(2)对特征空间划分的最优超平面是SVM的目标,最大化分类边际的思想是SVM方法的核心;
-(3)支持向量是SVM的训练结果,在SVM分类决策中起决定作用的是支持向量。
-(4)SVM 是一种有坚实理论基础的新颖的小样本学习方法。它基本上不涉及概率测度及大数定律等,因此不同于现有的统计方法。从本质上看,它避开了从归纳到演绎的传统过程,实现了高效的从训练样本到预报样本的“转导推理”,大大简化了通常的分类和回归等问题。
-(5)SVM 的最终决策函数只由少数的支持向量所确定,计算的复杂性取决于支持向量的数目,而不是样本空间的维数,这在某种意义上避免了“维数灾难”。
-(6)少数支持向量决定了最终结果,这不但可以帮助我们抓住关键样本、“剔除”大量冗余样本,而且注定了该方法不但算法简单,而且具有较好的“鲁棒”性。这种“鲁棒”性主要体现在:
-①增、删非支持向量样本对模型没有影响;
-②支持向量样本集具有一定的鲁棒性;
-③有些成功的应用中,SVM 方法对核的选取不敏感
+$$
+P_{蜷缩|是}=P(根蒂=蜷缩|好瓜=是)=\frac{5}{8}=0.625
+$$
-3.3.2.2 SVM的两个不足:
-(1) SVM算法对大规模训练样本难以实施
-由 于SVM是借助二次规划来求解支持向量,而求解二次规划将涉及m阶矩阵的计算(m为样本的个数),当m数目很大时该矩阵的存储和计算将耗费大量的机器内存 和运算时间。针对以上问题的主要改进有有J.Platt的SMO算法、T.Joachims的SVM、C.J.C.Burges等的PCGC、张学工的 CSVM以及O.L.Mangasarian等的SOR算法。
-(2) 用SVM解决多分类问题存在困难
-经典的支持向量机算法只给出了二类分类的算法,而在数据挖掘的实际应用中,一般要解决多类的分类问题。可以通过多个二类支持向量机的组合来解决。主要有一对多组合模式、一对一组合模式和SVM决策树;再就是通过构造多个分类器的组合来解决。主要原理是克服SVM固有的缺点,结合其他算法的优势,解决多类问题的分类精度。如:与粗集理论结合,形成一种优势互补的多类问题的组合分类器。
+$$
+P_{蜷缩|否}=P(根蒂=蜷缩|好瓜=否)=\frac{3}{9}\approx0.333
+$$
-## 2.19 贝叶斯
-### 2.19.1 图解极大似然估计
+$$
+P_{浊响|是}=P(敲声=浊响|好瓜=是)=\frac{6}{8}=0.750
+$$
-极大似然估计 http://blog.csdn.net/zengxiantao1994/article/details/72787849
+$$
+P_{浊响|否}=P(敲声=浊响|好瓜=否)=\frac{4}{9}\approx 0.444
+$$
-极大似然估计的原理,用一张图片来说明,如下图所示:
+$$
+P_{清晰|是}=P(纹理=清晰|好瓜=是)=\frac{7}{8}= 0.875
+$$
-
+$$
+P_{清晰|否}=P(纹理=清晰|好瓜=否)=\frac{2}{9}\approx 0.222
+$$
-总结起来,最大似然估计的目的就是:利用已知的样本结果,反推最有可能(最大概率)导致这样结果的参数值。
+$$
+P_{凹陷|是}=P(脐部=凹陷|好瓜=是)=\frac{5}{8}= 0.625
+$$
-原理:极大似然估计是建立在极大似然原理的基础上的一个统计方法,是概率论在统计学中的应用。极大似然估计提供了一种给定观察数据来评估模型参数的方法,即:“模型已定,参数未知”。通过若干次试验,观察其结果,利用试验结果得到某个参数值能够使样本出现的概率为最大,则称为极大似然估计。
+$$
+P_{凹陷|否}=P(脐部=凹陷|好瓜=否)=\frac{2}{9} \approx 0.222
+$$
-由于样本集中的样本都是独立同分布,可以只考虑一类样本集D,来估计参数向量θ。记已知的样本集为:
+$$
+P_{硬滑|是}=P(触感=硬滑|好瓜=是)=\frac{6}{8}= 0.750
+$$
-$D=x_{1},x_{2},...,x_{n}$
+$$
+P_{硬滑|否}=P(触感=硬滑|好瓜=否)=\frac{6}{9} \approx 0.667
+$$
-似然函数(linkehood function):联合概率密度函数$P(D|\theta )$称为相对于$x_{1},x_{2},...,x_{n}$的θ的似然函数。
+$$
+\begin{aligned}
+\rho_{密度:0.697|是}&=\rho(密度=0.697|好瓜=是)\\&=\frac{1}{\sqrt{2 \pi}\times0.129}exp\left( -\frac{(0.697-0.574)^2}{2\times0.129^2}\right) \approx 1.959
+\end{aligned}
+$$
-$l(\theta )=p(D|\theta ) =p(x_{1},x_{2},...,x_{N}|\theta )=\prod_{i=1}^{N}p(x_{i}|\theta )$
+$$
+\begin{aligned}
+\rho_{密度:0.697|否}&=\rho(密度=0.697|好瓜=否)\\&=\frac{1}{\sqrt{2 \pi}\times0.195}exp\left( -\frac{(0.697-0.496)^2}{2\times0.195^2}\right) \approx 1.203
+\end{aligned}
+$$
-如果$\hat{\theta}$是参数空间中能使似然函数$l(\theta)$最大的θ值,则$\hat{\theta}$应该是“最可能”的参数值,那么$\hat{\theta}$就是θ的极大似然估计量。它是样本集的函数,记作:
+$$
+\begin{aligned}
+\rho_{含糖:0.460|是}&=\rho(含糖率=0.460|好瓜=是)\\&=\frac{1}{\sqrt{2 \pi}\times0.101}exp\left( -\frac{(0.460-0.279)^2}{2\times0.101^2}\right) \approx 0.788
+\end{aligned}
+$$
-$\hat{\theta}=d(x_{1},x_{2},...,x_{N})=d(D)$
+$$
+\begin{aligned}
+\rho_{含糖:0.460|否}&=\rho(含糖率=0.460|好瓜=否)\\&=\frac{1}{\sqrt{2 \pi}\times0.108}exp\left( -\frac{(0.460-0.154)^2}{2\times0.108^2}\right) \approx 0.066
+\end{aligned}
+$$
-$\hat{\theta}(x_{1},x_{2},...,x_{N})$称为极大似然函数估计值。
+Therefore
+$$
+\begin{align}
+P(&好瓜=是)\times P_{青绿|是} \times P_{蜷缩|是} \times P_{浊响|是} \times P_{清晰|是} \times P_{凹陷|是}\newline
+&\times P_{硬滑|是} \times \rho_{密度:0.697|是} \times \rho_{含糖:0.460|是} \approx 0.052 \newline\newline
+P(&好瓜=否)\times P_{青绿|否} \times P_{蜷缩|否} \times P_{浊响|否} \times P_{清晰|否} \times P_{凹陷|否}\newline
+&\times P_{硬滑|否} \times \rho_{密度:0.697|否} \times \rho_{含糖:0.460|否} \approx 6.80\times 10^{-5}
+\end{align}
+$$
-### 2.19.2 朴素贝叶斯分类器和一般的贝叶斯分类器有什么区别?
-### 2.19.3 朴素与半朴素贝叶斯分类器
-### 2.19.4 贝叶斯网三种典型结构
-### 2.19.5 什么是贝叶斯错误率
-### 2.19.6 什么是贝叶斯最优错误率
-## 2.20 EM算法解决问题及实现流程
+Since $0.052>6.80\times 10^{-5}$, the naive Bayes classifier labels the test sample "测1" as a good melon (好瓜=是).
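The whole calculation can be reproduced by recounting from the training table. This is a minimal sketch with several assumptions: the rows are copied from the table above, the names `rows`, `test_sample`, `gaussian_pdf`, and `score` are made up for the example, the standard deviation is the sample standard deviation (dividing by n-1), and no smoothing is applied to the counts. Because it recounts from the table and does not round intermediate values, its numbers can differ slightly from the hand-computed ones.

```python
import math
from statistics import mean, stdev

# Columns: 色泽, 根蒂, 敲声, 纹理, 脐部, 触感, 密度, 含糖率, 好瓜
rows = [
    ("青绿", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", 0.697, 0.460, "是"),
    ("乌黑", "蜷缩", "沉闷", "清晰", "凹陷", "硬滑", 0.774, 0.376, "是"),
    ("乌黑", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", 0.634, 0.264, "是"),
    ("青绿", "蜷缩", "沉闷", "清晰", "凹陷", "硬滑", 0.608, 0.318, "是"),
    ("浅白", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", 0.556, 0.215, "是"),
    ("青绿", "稍蜷", "浊响", "清晰", "稍凹", "软粘", 0.403, 0.237, "是"),
    ("乌黑", "稍蜷", "浊响", "稍糊", "稍凹", "软粘", 0.481, 0.149, "是"),
    ("乌黑", "稍蜷", "浊响", "清晰", "稍凹", "硬滑", 0.437, 0.211, "是"),
    ("乌黑", "稍蜷", "沉闷", "稍糊", "稍凹", "硬滑", 0.666, 0.091, "否"),
    ("青绿", "硬挺", "清脆", "清晰", "平坦", "软粘", 0.243, 0.267, "否"),
    ("浅白", "硬挺", "清脆", "模糊", "平坦", "硬滑", 0.245, 0.057, "否"),
    ("浅白", "蜷缩", "浊响", "模糊", "平坦", "软粘", 0.343, 0.099, "否"),
    ("青绿", "稍蜷", "浊响", "稍糊", "凹陷", "硬滑", 0.639, 0.161, "否"),
    ("浅白", "稍蜷", "沉闷", "稍糊", "凹陷", "硬滑", 0.657, 0.198, "否"),
    ("乌黑", "稍蜷", "浊响", "清晰", "稍凹", "软粘", 0.360, 0.370, "否"),
    ("浅白", "蜷缩", "浊响", "模糊", "平坦", "硬滑", 0.593, 0.042, "否"),
    ("青绿", "蜷缩", "沉闷", "稍糊", "稍凹", "硬滑", 0.719, 0.103, "否"),
]
test_sample = ("青绿", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", 0.697, 0.460)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def score(label):
    """Unnormalized posterior P(c) * prod_j P(x_j | c) for the test sample."""
    subset = [r for r in rows if r[-1] == label]
    result = len(subset) / len(rows)              # class prior
    for j in range(6):                            # categorical attributes: counting
        result *= sum(r[j] == test_sample[j] for r in subset) / len(subset)
    for j in (6, 7):                              # continuous attributes: Gaussian density
        values = [r[j] for r in subset]
        result *= gaussian_pdf(test_sample[j], mean(values), stdev(values))
    return result

score_good, score_bad = score("是"), score("否")
print(score_good, score_bad)
print("prediction:", "是" if score_good > score_bad else "否")
```

The score for 好瓜=是 is larger by roughly three orders of magnitude, so the sample is classified as a good melon.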
-1.EM算法要解决的问题
+### 2.19.6 半朴素贝叶斯分类器
- 我们经常会从样本观察数据中,找出样本的模型参数。 最常用的方法就是极大化模型分布的对数似然函数。
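+Naive Bayes adopts the attribute conditional independence assumption. The basic idea of the semi-naive Bayes classifier is to take some of the dependencies between attributes into account. The **One-Dependence Estimator** (ODE) is the most common strategy for semi-naive Bayes classifiers. As the name suggests, it assumes that each attribute depends on at most one other attribute besides the class, that is:
+$$
+P(\boldsymbol{x}|c_i)=\prod_{j=1}^d P(x_j|c_i,{\rm pa}_j)
+$$
+where ${\rm pa}_j$ is the attribute that $x_j$ depends on, called the parent attribute of $x_j$. If the parent attribute ${\rm pa}_j$ is known, $P(x_j|c_i,{\rm pa}_j)$ can be estimated as
+$$
+P(x_j|c_i,{\rm pa}_j)=\frac{P(x_j,c_i,{\rm pa}_j)}{P(c_i,{\rm pa}_j)}
+$$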
+
+## 2.20 EM算法
+
+### 2.20.1 EM算法基本思想
-但是在一些情况下,我们得到的观察数据有未观察到的隐含数据,此时我们未知的有隐含数据和模型参数,因而无法直接用极大化对数似然函数得到模型分布的参数。怎么办呢?这就是EM算法可以派上用场的地方了。
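+The expectation-maximization (EM) algorithm is a class of optimization algorithms that perform maximum likelihood estimation iteratively. It is often used as an alternative to Newton's method for estimating the parameters of probabilistic models that contain latent variables or missing data.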
-EM算法解决这个的思路是使用启发式的迭代方法,既然我们无法直接求出模型分布参数,那么我们可以先猜想隐含数据(EM算法的E步),接着基于观察数据和猜测的隐含数据一起来极大化对数似然,求解我们的模型参数(EM算法的M步)。由于我们之前的隐藏数据是猜测的,所以此时得到的模型参数一般还不是我们想要的结果。不过没关系,我们基于当前得到的模型参数,继续猜测隐含数据(EM算法的E步),然后继续极大化对数似然,求解我们的模型参数(EM算法的M步)。以此类推,不断的迭代下去,直到模型分布参数基本无变化,算法收敛,找到合适的模型参数。
+The basic idea is to alternate between two steps:
-从上面的描述可以看出,EM算法是迭代求解最大值的算法,同时算法在每一次迭代时分为两步,E步和M步。一轮轮迭代更新隐含数据和模型分布参数,直到收敛,即得到我们需要的模型参数。
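+The first step computes an expectation (E): using the current parameter estimate, compute the posterior distribution of the latent variables and, from it, the expected complete-data log-likelihood.
+The second step maximizes (M): find the parameter values that maximize the expected log-likelihood obtained in the E step.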
-一个最直观了解EM算法思路的是K-Means算法,见之前写的K-Means聚类算法原理。
+The parameter estimate found in the M step is used in the next E step, and the two steps alternate until convergence.
+### 2.20.2 Derivation of the EM algorithm
+
+Given $m$ observed samples $x=(x^{(1)},x^{(2)},...,x^{(m)})$, we want to find the model parameters $\theta$ that maximize the log-likelihood of the model distribution:
+$$
+\theta = \mathop{\arg\max}_\theta\sum\limits_{i=1}^m logP(x^{(i)};\theta)
+$$
+If the observations come with unobserved latent data $z=(z^{(1)},z^{(2)},...z^{(m)})$, the log-likelihood to maximize becomes:
+$$
+\theta =\mathop{\arg\max}_\theta\sum\limits_{i=1}^m logP(x^{(i)};\theta) = \mathop{\arg\max}_\theta\sum\limits_{i=1}^m log\sum\limits_{z^{(i)}}P(x^{(i)}, z^{(i)};\theta) \tag{a}
+$$
+This expression cannot be maximized over $\theta$ directly, so we bound it from below:
+$$
+\begin{align} \sum\limits_{i=1}^m log\sum\limits_{z^{(i)}}P(x^{(i)}, z^{(i)};\theta) & = \sum\limits_{i=1}^m log\sum\limits_{z^{(i)}}Q_i(z^{(i)})\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} \\ & \geqslant \sum\limits_{i=1}^m \sum\limits_{z^{(i)}}Q_i(z^{(i)})log\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} \end{align} \tag{1}
+$$
+The inequality is Jensen's inequality for the concave function $log$:
+$$
+log\sum\limits_j\lambda_jy_j \geqslant \sum\limits_j\lambda_jlogy_j\;\;, \lambda_j \geqslant 0, \sum\limits_j\lambda_j =1
+$$
+and $Q_i(z^{(i)})$ is a newly introduced, as yet unspecified distribution over $z^{(i)}$.
-在K-Means聚类时,每个聚类簇的质心是隐含数据。我们会假设KK个初始化质心,即EM算法的E步;然后计算得到每个样本最近的质心,并把样本聚类到最近的这个质心,即EM算法的M步。重复这个E步和M步,直到质心不再变化为止,这样就完成了K-Means聚类。
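+For Jensen's inequality to hold with equality, the ratio inside the logarithm must be constant:
+$$
+\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} =c, \quad c \text{ a constant}
+$$
+Since $Q_i(z^{(i)})$ is a distribution, it satisfies
+$$
+\sum\limits_{z^{(i)}}Q_i(z^{(i)}) =1
+$$
+Together these give:
+$$
+Q_i(z^{(i)}) = \frac{P(x^{(i)}, z^{(i)};\theta)}{\sum\limits_{z^{(i)}}P(x^{(i)}, z^{(i)};\theta)} = \frac{P(x^{(i)}, z^{(i)};\theta)}{P(x^{(i)};\theta)} = P( z^{(i)}|x^{(i)};\theta)
+$$
+With $Q_i(z^{(i)}) = P( z^{(i)}|x^{(i)};\theta)$, expression (1) is a lower bound on the log-likelihood with latent data that is tight at the current $\theta$. Maximizing this lower bound therefore also pushes the log-likelihood up. That is, we maximize:
+$$
+\mathop{\arg\max}_\theta \sum\limits_{i=1}^m \sum\limits_{z^{(i)}}Q_i(z^{(i)})log\frac{P(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})}
+$$
+Dropping the term that does not depend on $\theta$ gives:
+$$
+\mathop{\arg\max}_\theta \sum\limits_{i=1}^m \sum\limits_{z^{(i)}}Q_i(z^{(i)})log{P(x^{(i)}, z^{(i)};\theta)}
+$$
+This maximization is the M step of the EM algorithm. The quantity $\sum\limits_{z^{(i)}}Q_i(z^{(i)})log{P(x^{(i)}, z^{(i)};\theta)}$ is the expectation of $logP(x^{(i)}, z^{(i)};\theta)$ with respect to the conditional distribution $Q_i(z^{(i)})$, and computing it is the E step. These are the precise mathematical meanings of the E step and the M step.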
+### 2.20.3 图解EM算法
-当然,K-Means算法是比较简单的,实际中的问题往往没有这么简单。上面对EM算法的描述还很粗糙,我们需要用数学的语言精准描述。
+ 考虑上一节中的(a)式,表达式中存在隐变量,直接找到参数估计比较困难,通过EM算法迭代求解下界的最大值到收敛为止。
-2.EM算法流程
+
-现在我们总结下EM算法的流程。
+ 图片中的紫色部分是我们的目标模型$p(x|\theta)$,该模型复杂,难以求解析解,为了消除隐变量$z^{(i)}$的影响,我们可以选择一个不包含$z^{(i)}$的模型$r(x|\theta)$,使其满足条件$r(x|\theta) \leqslant p(x|\theta) $。
-输入:观察数据x=(x(1),x(2),...x(m))x=(x(1),x(2),...x(m)),联合分布p(x,z|θ)p(x,z|θ), 条件分布p(z|x,θ)p(z|x,θ), 最大迭代次数JJ。
+求解步骤如下:
-1) 随机初始化模型参数θθ的初值θ0θ0。
+(1)选取$\theta_1$,使得$r(x|\theta_1) = p(x|\theta_1)$,然后对此时的$r$求取最大值,得到极值点$\theta_2$,实现参数的更新。
-2) for j from 1 to J开始EM算法迭代:
+(2)重复以上过程到收敛为止,在更新过程中始终满足$r \leqslant p $.
-a) E步:计算联合分布的条件概率期望:
-Qi(z(i))=P(z(i)|x(i),θj))Qi(z(i))=P(z(i)|x(i),θj))
-L(θ,θj)=∑i=1m∑z(i)Qi(z(i))logP(x(i),z(i)|θ)L(θ,θj)=∑i=1m∑z(i)Qi(z(i))logP(x(i),z(i)|θ)
+### 2.20.4 EM算法流程
-b) M步:极大化L(θ,θj)L(θ,θj),得到θj+1θj+1:
-θj+1=argmaxθL(θ,θj)θj+1=argmaxθL(θ,θj)
+Input: observed data $x=(x^{(1)},x^{(2)},...x^{(m)})$, joint distribution $p(x,z ;\theta)$, conditional distribution $p(z|x; \theta)$, maximum number of iterations $J$
-c) 如果θj+1θj+1已收敛,则算法结束。否则继续回到步骤a)进行E步迭代。
+1) Randomly initialize the model parameters $\theta$ to an initial value $\theta^0$.
-输出:模型参数θθ。
+2) For $j = 0, 1, \dots, J-1$:
+
+a) E step. Compute the conditional expectation of the joint distribution:
+$$
+Q_i(z^{(i)}) = P( z^{(i)}|x^{(i)}; \theta^{j})
+$$
+
+$$
+L(\theta, \theta^{j}) = \sum\limits_{i=1}^m\sum\limits_{z^{(i)}}P( z^{(i)}|x^{(i)}; \theta^{j})log{P(x^{(i)}, z^{(i)};\theta)}
+$$
+
+b) M step. Maximize $L(\theta, \theta^{j})$ to obtain $\theta^{j+1}$:
+$$
+\theta^{j+1} = \mathop{\arg\max}_\theta L(\theta, \theta^{j})
+$$
+c) If $\theta^{j+1}$ has converged, stop; otherwise return to step a) for the next E step.
+
+Output: the model parameters $\theta$.
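The procedure can be sketched on a small latent-variable problem. The example below is an assumption made for illustration and does not come from the text: two coins with unknown head probabilities, where each trial uses one coin chosen with equal probability (the hidden variable) and only the number of heads in 10 tosses is observed. The function name, data, and starting values are all hypothetical.

```python
import math

def em_two_coins(heads, n_tosses, theta_a, theta_b, max_iter=50, tol=1e-8):
    """EM for a mixture of two coins; returns both biases and the
    observed-data log-likelihood (up to a constant) at each iteration."""
    history = []
    for _ in range(max_iter):
        # E step: posterior probability that each trial used coin A
        resp_a = []
        log_lik = 0.0
        for h in heads:
            like_a = theta_a ** h * (1 - theta_a) ** (n_tosses - h)
            like_b = theta_b ** h * (1 - theta_b) ** (n_tosses - h)
            resp_a.append(like_a / (like_a + like_b))
            log_lik += math.log(0.5 * like_a + 0.5 * like_b)
        history.append(log_lik)
        # M step: responsibility-weighted maximum-likelihood update
        new_a = sum(r * h for r, h in zip(resp_a, heads)) / (n_tosses * sum(resp_a))
        new_b = sum((1 - r) * h for r, h in zip(resp_a, heads)) / (
            n_tosses * sum(1 - r for r in resp_a)
        )
        converged = abs(new_a - theta_a) < tol and abs(new_b - theta_b) < tol
        theta_a, theta_b = new_a, new_b
        if converged:
            break
    return theta_a, theta_b, history

# Hypothetical data: heads observed in five trials of 10 tosses each.
heads = [5, 9, 8, 4, 7]
theta_a, theta_b, history = em_two_coins(heads, 10, theta_a=0.6, theta_b=0.5)
print(theta_a, theta_b)
```

The recorded log-likelihood never decreases from one iteration to the next, which reflects the lower-bound argument in the derivation: each M step raises a bound that is tight at the current parameters.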
## 2.21 降维和聚类
-### 2.21.1 为什么会产生维数灾难?
+### 2.21.1 图解为什么会产生“维数灾难”?
+
+ 假如数据集包含10张照片,照片中包含三角形和圆两种形状。现在来设计一个分类器进行训练,让这个分类器对其他的照片进行正确分类(假设三角形和圆的总数是无限大),简单的,我们用一个特征进行分类:
-http://blog.csdn.net/chenjianbo88/article/details/52382943
-假设地球上猫和狗的数量是无限的。由于有限的时间和计算能力,我们仅仅选取了10张照片作为训练样本。我们的目的是基于这10张照片来训练一个线性分类器,使得这个线性分类器可以对剩余的猫或狗的照片进行正确分类。我们从只用一个特征来辨别猫和狗开始:
-
+
-从图2可以看到,如果仅仅只有一个特征的话,猫和狗几乎是均匀分布在这条线段上,很难将10张照片线性分类。那么,增加一个特征后的情况会怎么样:
+ 图2.21.1.a
-
+ 从上图可看到,如果仅仅只有一个特征进行分类,三角形和圆几乎是均匀分布在这条线段上,很难将10张照片线性分类。那么,增加一个特征后的情况会怎么样:
+
+
+
+ 图2.21.1.b
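After adding one feature we still cannot find a straight line that separates the triangles from the circles, so we consider adding one more feature: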
-
+
+
+ 图2.21.1.c
-此时,我们终于找到了一个平面将猫和狗分开。需要注意的是,只有一个特征时,假设特征空间是长度为5的线段,则样本密度是10/5=2。有两个特征时,特征空间大小是5*5=25,样本密度是10/25=0.4。有三个特征时,特征空间大小是5*5*5=125,样本密度是10/125=0.08。如果继续增加特征数量,样本密度会更加稀疏,也就更容易找到一个超平面将训练样本分开。因为随着特征数量趋向于无限大,样本密度非常稀疏,训练样本被分错的可能性趋向于零。当我们将高维空间的分类结果映射到低维空间时,一个严重的问题出现了:
+
-
+ 图2.21.1.d
-从图5可以看到将三维特征空间映射到二维特征空间后的结果。尽管在高维特征空间时训练样本线性可分,但是映射到低维空间后,结果正好相反。事实上,增加特征数量使得高维空间线性可分,相当于在低维空间内训练一个复杂的非线性分类器。不过,这个非线性分类器太过“聪明”,仅仅学到了一些特例。如果将其用来辨别那些未曾出现在训练样本中的测试样本时,通常结果不太理想。这其实就是我们在机器学习中学过的过拟合问题。
+ 此时,可以找到一个平面将三角形和圆分开。
-
+Now compute the sample density for different numbers of features:
-尽管图6所示的只采用2个特征的线性分类器分错了一些训练样本,准确率似乎没有图4的高,但是,采用2个特征的线性分类器的泛化能力比采用3个特征的线性分类器要强。因为,采用2个特征的线性分类器学习到的不只是特例,而是一个整体趋势,对于那些未曾出现过的样本也可以比较好地辨别开来。换句话说,通过减少特征数量,可以避免出现过拟合问题,从而避免“维数灾难”。
+(1) With one feature, suppose the feature space is a line segment of length 5; the sample density is $10 \div 5 = 2$.
-
+(2) With two features, the size of the feature space is $ 5\times5 = 25$ and the sample density is $10 \div 25 = 0.4$.
-图7从另一个角度诠释了“维数灾难”。假设只有一个特征时,特征的值域是0到1,每一只猫和狗的特征值都是唯一的。如果我们希望训练样本覆盖特征值值域的20%,那么就需要猫和狗总数的20%。我们增加一个特征后,为了继续覆盖特征值值域的20%就需要猫和狗总数的45%(0.45^2=0.2)。继续增加一个特征后,需要猫和狗总数的58%(0.58^3=0.2)。随着特征数量的增加,为了覆盖特征值值域的20%,就需要更多的训练样本。如果没有足够的训练样本,就可能会出现过拟合问题。
+(3) With three features, the size of the feature space is $ 5\times5\times5 = 125$ and the sample density is $10 \div 125 = 0.08$.
-
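+By the same reasoning, adding more features makes the samples ever sparser, and the sparser they are, the easier it is to find a hyperplane that separates the training samples. As the number of features grows without bound, the sample density approaches zero.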
-通过上述例子,我们可以看到特征数量越多,训练样本就会越稀疏,分类器的参数估计就会越不准确,更加容易出现过拟合问题。“维数灾难”的另一个影响是训练样本的稀疏性并不是均匀分布的。处于中心位置的训练样本比四周的训练样本更加稀疏。
+ 下面看一下将高维空间的分类结果映射到低维空间时,会出现什么情况?
-假设有一个二维特征空间,如图8所示的矩形,在矩形内部有一个内切的圆形。由于越接近圆心的样本越稀疏,因此,相比于圆形内的样本,那些位于矩形四角的样本更加难以分类。那么,随着特征数量的增加,圆形的面积会不会变化呢?这里我们假设超立方体(hypercube)的边长d=1,那么计算半径为0.5的超球面(hypersphere)的体积(volume)的公式为:
-$V(d)=\frac{\pi ^{\frac{d}{2}}}{\Gamma (\frac{d}{2}+1)}0.5^{d}$
+
-
+ 图2.21.1.e
-从图9可以看出随着特征数量的增加,超球面的体积逐渐减小直至趋向于零,然而超立方体的体积却不变。这个结果有点出乎意料,但部分说明了分类问题中的“维数灾难”:在高维特征空间中,大多数的训练样本位于超立方体的角落。
+ 上图是将三维特征空间映射到二维特征空间后的结果。尽管在高维特征空间时训练样本线性可分,但是映射到低维空间后,结果正好相反。事实上,增加特征数量使得高维空间线性可分,相当于在低维空间内训练一个复杂的非线性分类器。不过,这个非线性分类器太过“聪明”,仅仅学到了一些特例。如果将其用来辨别那些未曾出现在训练样本中的测试样本时,通常结果不太理想,会造成过拟合问题。
-
+
- 图10显示了不同维度下,样本的分布情况。在8维特征空间中,共有2^8=256个角落,而98%的样本分布在这些角落。随着维度的不断增加,公式2将趋向于0,其中dist_max和dist_min分别表示样本到中心的最大与最小距离。
+ 图2.21.1.f
-
+ 上图所示的只采用2个特征的线性分类器分错了一些训练样本,准确率似乎没有图2.21.1.e的高,但是,采用2个特征的线性分类器的泛化能力比采用3个特征的线性分类器要强。因为,采用2个特征的线性分类器学习到的不只是特例,而是一个整体趋势,对于那些未曾出现过的样本也可以比较好地辨别开来。换句话说,通过减少特征数量,可以避免出现过拟合问题,从而避免“维数灾难”。
-因此,在高维特征空间中对于样本距离的度量失去意义。由于分类器基本都依赖于如Euclidean距离,Manhattan距离等,所以在特征数量过大时,分类器的性能就会出现下降。
+
-所以,我们如何避免“维数灾难”?图1显示了分类器的性能随着特征个数的变化不断增加,过了某一个值后,性能不升反降。这里的某一个值到底是多少呢?目前,还没有方法来确定分类问题中的这个阈值是多少,这依赖于训练样本的数量,决策边界的复杂性以及分类器的类型。理论上,如果训练样本的数量无限大,那么就不会存在“维数灾难”,我们可以采用任意多的特征来训练分类器。事实上,训练样本的数量是有限的,所以不应该采用过多的特征。此外,那些需要精确的非线性决策边界的分类器,比如neural network,knn,decision trees等的泛化能力往往并不是很好,更容易发生过拟合问题。因此,在设计这些分类器时应当慎重考虑特征的数量。相反,那些泛化能力较好的分类器,比如naive Bayesian,linear classifier等,可以适当增加特征的数量。
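+The figure above illustrates the curse of dimensionality from another angle. Suppose there is a single feature with range 0 to 1, and every triangle and circle has a unique feature value. If we want the training samples to cover 20% of the feature range, we need 20% of all triangles and circles. After adding a second feature, covering 20% of the feature space requires 45% of all triangles and circles ($0.452^2\approx0.2$). With a third feature it requires 58% ($0.583^3\approx0.2$). As the number of features increases, more and more training samples are needed to cover 20% of the feature space, and without enough training samples overfitting may occur.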
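Both calculations, the sample density for a fixed number of samples and the per-feature fraction needed to cover 20% of the space, follow from one-line formulas. The sketch below is illustrative; the function names are assumptions, and the values 10, 5, and 0.2 are the ones used in the text.

```python
def sample_density(n_samples, side_length, n_features):
    """Samples per unit volume when the feature space is a cube of the given side."""
    return n_samples / side_length ** n_features

def per_feature_fraction(target_fraction, n_features):
    """Fraction of each feature's range needed to cover target_fraction of the space."""
    return target_fraction ** (1 / n_features)

for d in (1, 2, 3, 10):
    print(
        f"d={d:2d}  density={sample_density(10, 5, d):.7f}  "
        f"fraction per feature={per_feature_fraction(0.2, d):.3f}"
    )
```

With ten features the density of the same 10 samples is already about one sample per million unit cells, and covering 20% of the space requires roughly 85% of the range of every feature.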
-如果给定了N个特征,我们该如何从中选出M个最优的特征?最简单粗暴的方法是尝试所有特征的组合,从中挑出M个最优的特征。事实上,这是非常花时间的,或者说不可行的。其实,已经有许多特征选择算法(feature selection algorithms)来帮助我们确定特征的数量以及选择特征。此外,还有许多特征抽取方法(feature extraction methods),比如PCA等。交叉验证(cross-validation)也常常被用于检测与避免过拟合问题。
+ 通过上述例子,我们可以看到特征数量越多,训练样本就会越稀疏,分类器的参数估计就会越不准确,更加容易出现过拟合问题。“维数灾难”的另一个影响是训练样本的稀疏性并不是均匀分布的。处于中心位置的训练样本比四周的训练样本更加稀疏。
-参考资料:
-[1] Vincent Spruyt. The Curse of Dimensionality in classification. Computer vision for dummies. 2014. [Link]
+
+
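+Suppose we have a two-dimensional feature space, the square in the figure above, with a circle inscribed in it. Samples near the center of the circle are sparser, so samples in the corners of the square are harder to classify than samples inside the circle. As the dimension grows, the volume of the feature hypercube (with unit side length) stays constant while the volume of the inscribed hypersphere tends to 0, so in a high-dimensional space most training data lies in the corners of the hypercube, and data scattered in the corners is harder to classify than data near the center.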
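The shrinking hypersphere can be checked numerically with the standard formula for the volume of a $d$-dimensional ball of radius $r$, $V(d)=\frac{\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}r^{d}$, here with $r=0.5$ so that the ball is inscribed in the unit hypercube. The function name is an assumption made for this sketch.

```python
import math

def inscribed_ball_volume(d, radius=0.5):
    """Volume of a d-dimensional ball of the given radius."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d

# The unit hypercube always has volume 1, so this is also the fraction
# of the cube that lies inside the inscribed ball.
for d in (1, 2, 3, 5, 8, 10, 20):
    print(f"d={d:2d}  ball volume / cube volume = {inscribed_ball_volume(d):.8f}")
```

In two dimensions the circle fills about 79% of the square; by twenty dimensions the ball occupies a negligible fraction of the cube, so almost all of the volume, and hence almost all uniformly spread samples, sit in the corners.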
### 2.21.2 怎样避免维数灾难
+**有待完善!!!**
+
解决维度灾难问题:
主成分分析法PCA,线性判别法LDA
@@ -1687,122 +2053,149 @@ Lassio缩减系数法、小波分析法、
### 2.21.3 聚类和降维有什么区别与联系?
-聚类用于找寻数据内在的分布结构,既可以作为一个单独的过程,比如异常检测等等。也可作为分类等其他学习任务的前驱过程。聚类是标准的无监督学习。
+ 聚类用于找寻数据内在的分布结构,既可以作为一个单独的过程,比如异常检测等等。也可作为分类等其他学习任务的前驱过程。聚类是标准的无监督学习。
-1) 在一些推荐系统中需确定新用户的类型,但定义“用户类型”却可能不太容易,此时往往可先对原油的用户数据进行聚类,根据聚类结果将每个簇定义为一个类,然后再基于这些类训练分类模型,用于判别新用户的类型。
+ 1)在一些推荐系统中需确定新用户的类型,但定义“用户类型”却可能不太容易,此时往往可先对原有的用户数据进行聚类,根据聚类结果将每个簇定义为一个类,然后再基于这些类训练分类模型,用于判别新用户的类型。
-
+
-2)而降维则是为了缓解维数灾难的一个重要方法,就是通过某种数学变换将原始高维属性空间转变为一个低维“子空间”。其基于的假设就是,虽然人们平时观测到的数据样本虽然是高维的,但是实际上真正与学习任务相关的是个低维度的分布。从而通过最主要的几个特征维度就可以实现对数据的描述,对于后续的分类很有帮助。比如对于Kaggle上的泰坦尼克号生还问题。通过给定一个人的许多特征如年龄、姓名、性别、票价等,来判断其是否能在海难中生还。这就需要首先进行特征筛选,从而能够找出主要的特征,让学习到的模型有更好的泛化性。
+ 2)而降维则是为了缓解维数灾难的一个重要方法,就是通过某种数学变换将原始高维属性空间转变为一个低维“子空间”。其基于的假设就是,虽然人们平时观测到的数据样本虽然是高维的,但是实际上真正与学习任务相关的是个低维度的分布。从而通过最主要的几个特征维度就可以实现对数据的描述,对于后续的分类很有帮助。比如对于Kaggle(数据分析竞赛平台之一)上的泰坦尼克号生还问题。通过给定一个乘客的许多特征如年龄、姓名、性别、票价等,来判断其是否能在海难中生还。这就需要首先进行特征筛选,从而能够找出主要的特征,让学习到的模型有更好的泛化性。
-聚类和降维都可以作为分类等问题的预处理步骤。
+ 聚类和降维都可以作为分类等问题的预处理步骤。

-但是他们虽然都能实现对数据的约减。但是二者适用的对象不同,聚类针对的是数据点,而降维则是对于数据的特征。另外它们着很多种实现方法。聚类中常用的有K-means、层次聚类、基于密度的聚类等;降维中常用的则PCA、Isomap、LLE等。
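+Both techniques reduce the data, but they act on different things: clustering groups the data points (samples), whereas dimensionality reduction acts on the features. Each also has many implementations. Common clustering methods include K-means, hierarchical clustering, and density-based clustering; common dimensionality-reduction methods include PCA, Isomap, and LLE.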
+
+### 2.21.4 有哪些聚类算法优劣衡量标准
-### 2.21.4 四种聚类方法之比较
+不同聚类算法有不同的优劣和不同的适用条件。可从以下方面进行衡量判断:
+ 1、算法的处理能力:处理大的数据集的能力,即算法复杂度;处理数据噪声的能力;处理任意形状,包括有间隙的嵌套的数据的能力;
+ 2、算法是否需要预设条件:是否需要预先知道聚类个数,是否需要用户给出领域知识;
-http://www.cnblogs.com/William_Fire/archive/2013/02/09/2909499.html
+ 3、算法的数据输入属性:算法处理的结果与数据输入的顺序是否相关,也就是说算法是否独立于数据输入顺序;算法处理有很多属性数据的能力,也就是对数据维数是否敏感,对数据的类型有无要求。
- 聚类分析是一种重要的人类行为,早在孩提时代,一个人就通过不断改进下意识中的聚类模式来学会如何区分猫狗、动物植物。目前在许多领域都得到了广泛的研究和成功的应用,如用于模式识别、数据分析、图像处理、市场研究、客户分割、Web文档分类等[1]。
+### 2.21.5 聚类和分类有什么区别?
-聚类就是按照某个特定标准(如距离准则)把一个数据集分割成不同的类或簇,使得同一个簇内的数据对象的相似性尽可能大,同时不在同一个簇中的数据对象的差异性也尽可能地大。即聚类后同一类的数据尽可能聚集到一起,不同数据尽量分离。
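+**Clustering**
+Clustering, put simply, means grouping similar things together. When clustering, we do not care what each group actually is; the goal is only to bring similar items together. A clustering algorithm usually only needs to know how to compute similarity before it can start working, so it generally does not need labeled training data. In machine learning it is a form of unsupervised learning.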
-聚类技术[2]正在蓬勃发展,对此有贡献的研究领域包括数据挖掘、统计学、机器学习、空间数据库技术、生物学以及市场营销等。各种聚类方法也被不断提出和改进,而不同的方法适合于不同类型的数据,因此对各种聚类方法、聚类效果的比较成为值得研究的课题。
+**Classification**
-1 聚类算法的分类
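+A classifier usually has to be told that "this item belongs to such-and-such class". In general, a classifier learns from the training set it is given and thereby acquires the ability to classify unseen data. In machine learning this is supervised learning.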
-目前,有大量的聚类算法[3]。而对于具体应用,聚类算法的选择取决于数据的类型、聚类的目的。如果聚类分析被用作描述或探查的工具,可以对同样的数据尝试多种算法,以发现数据可能揭示的结果。
-
-主要的聚类算法可以划分为如下几类:划分方法、层次方法、基于密度的方法、基于网格的方法以及基于模型的方法[4-6]。
+### 2.21.6 不同聚类算法特点性能比较
-每一类中都存在着得到广泛应用的算法,例如:划分方法中的k-means[7]聚类算法、层次方法中的凝聚型层次聚类算法[8]、基于模型方法中的神经网络[9]聚类算法等。
- 目前,聚类问题的研究不仅仅局限于上述的硬聚类,即每一个数据只能被归为一类,模糊聚类[10]也是聚类分析中研究较为广泛的一个分支。模糊聚类通过隶 属函数来确定每个数据隶属于各个簇的程度,而不是将一个数据对象硬性地归类到某一簇中。目前已有很多关于模糊聚类的算法被提出,如著名的FCM算法等。
- 本文主要对k-means聚类算法、凝聚型层次聚类算法、神经网络聚类算法之SOM,以及模糊聚类的FCM算法通过通用测试数据集进行聚类效果的比较和分析。
+| 算法名称 | 可伸缩性 | 适合的数据类型 | 高维性 | 异常数据抗干扰性 | 聚类形状 | 算法效率 |
+| :----------: | :------: | :------------: | :----: | :--------------: | :------: | :------: |
+| WAVECLUSTER | 很高 | 数值型 | 很高 | 较高 | 任意形状 | 很高 |
+| ROCK | 很高 | 混合型 | 很高 | 很高 | 任意形状 | 一般 |
+| BIRCH | 较高 | 数值型 | 较低 | 较低 | 球形 | 很高 |
+| CURE | 较高 | 数值型 | 一般 | 很高 | 任意形状 | 较高 |
+| K-PROTOTYPES | 一般 | 混合型 | 较低 | 较低 | 任意形状 | 一般 |
+| DENCLUE | 较低 | 数值型 | 较高 | 一般 | 任意形状 | 较高 |
+| OPTIGRID | 一般 | 数值型 | 较高 | 一般 | 任意形状 | 一般 |
+| CLIQUE | 较高 | 数值型 | 较高 | 较高 | 任意形状 | 较低 |
+| DBSCAN | 一般 | 数值型 | 较低 | 较高 | 任意形状 | 一般 |
+| CLARANS | 较低 | 数值型 | 较低 | 较高 | 球形 | 较低 |
-2 四种常用聚类算法研究
+### 2.21.7 四种常用聚类方法之比较
-2.1 k-means聚类算法
+ 聚类就是按照某个特定标准把一个数据集分割成不同的类或簇,使得同一个簇内的数据对象的相似性尽可能大,同时不在同一个簇中的数据对象的差异性也尽可能地大。即聚类后同一类的数据尽可能聚集到一起,不同类数据尽量分离。
+ 主要的聚类算法可以划分为如下几类:划分方法、层次方法、基于密度的方法、基于网格的方法以及基于模型的方法。下面主要对k-means聚类算法、凝聚型层次聚类算法、神经网络聚类算法之SOM,以及模糊聚类的FCM算法通过通用测试数据集进行聚类效果的比较和分析。
- k-means是划分方法中较经典的聚类算法之一。由于该算法的效率高,所以在对大规模数据进行聚类时被广泛应用。目前,许多算法均围绕着该算法进行扩展和改进。
- k-means算法以k为参数,把n个对象分成k个簇,使簇内具有较高的相似度,而簇间的相似度较低。k-means算法的处理过程如下:首先,随机地 选择k个对象,每个对象初始地代表了一个簇的平均值或中心;对剩余的每个对象,根据其与各簇中心的距离,将它赋给最近的簇;然后重新计算每个簇的平均值。 这个过程不断重复,直到准则函数收敛。通常,采用平方误差准则,其定义如下:
+### 2.21.8 k-means聚类算法
- $E=\sum_{i=1}^{k}\sum_{p\subset C}|p-m_{i}|^{2}$
+k-means是划分方法中较经典的聚类算法之一。由于该算法的效率高,所以在对大规模数据进行聚类时被广泛应用。目前,许多算法均围绕着该算法进行扩展和改进。
+k-means算法以k为参数,把n个对象分成k个簇,使簇内具有较高的相似度,而簇间的相似度较低。k-means算法的处理过程如下:首先,随机地 选择k个对象,每个对象初始地代表了一个簇的平均值或中心;对剩余的每个对象,根据其与各簇中心的距离,将它赋给最近的簇;然后重新计算每个簇的平均值。 这个过程不断重复,直到准则函数收敛。通常,采用平方误差准则,其定义如下:
+$$
+E=\sum_{i=1}^{k}\sum_{p\in C_i}\left\|p-m_i\right\|^2
+$$
+ 这里E是数据中所有对象的平方误差的总和,p是空间中的点,$m_i$是簇$C_i$的平均值[9]。该目标函数使生成的簇尽可能紧凑独立,使用的距离度量是欧几里得距离,当然也可以用其他距离度量。
- 这里E是数据库中所有对象的平方误差的总和,p是空间中的点,mi是簇Ci的平均值[9]。该目标函数使生成的簇尽可能紧凑独立,使用的距离度量是欧几里得距离,当然也可以用其他距离度量。k-means聚类算法的算法流程如下:
- 输入:包含n个对象的数据库和簇的数目k;
- 输出:k个簇,使平方误差准则最小。
+**算法流程**:
+ 输入:包含n个对象的数据和簇的数目k;
+ 输出:n个对象到k个簇,使平方误差准则最小。
步骤:
(1) 任意选择k个对象作为初始的簇中心;
- (2) repeat;
- (3) 根据簇中对象的平均值,将每个对象(重新)赋予最类似的簇;
- (4) 更新簇的平均值,即计算每个簇中对象的平均值;
- (5) until不再发生变化。
+ (2) 根据簇中对象的平均值,将每个对象(重新)赋予最类似的簇;
+ (3) 更新簇的平均值,即计算每个簇中对象的平均值;
+ (4) 重复步骤(2)、(3)直到簇中心不再变化;
+
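The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the fixed random seed, and the rule of keeping the old center when a cluster becomes empty are all assumptions made for the example. It uses squared Euclidean distance, so the returned `sse` is the criterion $E$ defined earlier.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centers, sse) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step (1): pick k objects as initial centers
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]  # step (2): assign
        new_centers = []
        for i in range(k):  # step (3): recompute each cluster mean
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # assumption: keep an empty cluster's center
        if new_centers == centers:  # step (4): stop when centers no longer move
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse


if __name__ == "__main__":
    demo = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(demo, 2))
```

Because the initial centers are chosen at random, different seeds can lead to different local minima on less cleanly separated data; running the algorithm several times and keeping the lowest `sse` is a common remedy.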
+### 2.21.9 层次聚类算法
-2.2 层次聚类算法
根据层次分解的顺序是自底向上的还是自上向下的,层次聚类算法分为凝聚的层次聚类算法和分裂的层次聚类算法。
- 凝聚型层次聚类的策略是先将每个对象作为一个簇,然后合并这些原子簇为越来越大的簇,直到所有对象都在一个簇中,或者某个终结条件被满足。绝大多数层次聚类属于凝聚型层次聚类,它们只是在簇间相似度的定义上有所不同。四种广泛采用的簇间距离度量方法如下:
+ 凝聚型层次聚类的策略是先将每个对象作为一个簇,然后合并这些原子簇为越来越大的簇,直到所有对象都在一个簇中,或者某个终结条件被满足。绝大多数层次聚类属于凝聚型层次聚类,它们只是在簇间相似度的定义上有所不同。
-
+**算法流程**:
-这里给出采用最小距离的凝聚层次聚类算法流程:
+注:以采用最小距离的凝聚层次聚类算法为例:
(1) 将每个对象看作一类,计算两两之间的最小距离;
(2) 将距离最小的两个类合并成一个新类;
(3) 重新计算新类与所有类之间的距离;
(4) 重复(2)、(3),直到所有类最后合并成一类。
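A direct, unoptimized sketch of the minimum-distance (single-linkage) procedure above is given below. The function names and the optional `n_clusters` stopping condition are assumptions made for this illustration; instead of maintaining a distance matrix, the sketch simply recomputes cluster distances at every iteration, which is easy to read but slow for large data sets.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(points, n_clusters=1):
    """Agglomerative clustering with minimum (single-linkage) distance.

    Returns (clusters, merges); clusters hold indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # step (1): one cluster per object
    merges = []

    def cluster_distance(c1, c2):
        return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # step (2): find the pair of clusters with the smallest distance
        d, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        # step (3): distances to the new cluster are recomputed on the next pass
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges  # step (4): loop ends at the requested cluster count


if __name__ == "__main__":
    demo = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
    print(single_linkage(demo, n_clusters=2))
```

The `merges` list records the order and distance of each merge, which is the information a dendrogram would display.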
-### 2.21.5 SOM聚类算法
-SOM神经网络[11]是由芬兰神经网络专家Kohonen教授提出的,该算法假设在输入对象中存在一些拓扑结构或顺序,可以实现从输入空间(n维)到输出平面(2维)的降维映射,其映射具有拓扑特征保持性质,与实际的大脑处理有很强的理论联系。
+### 2.21.10 SOM聚类算法
+ SOM神经网络[11]是由芬兰神经网络专家Kohonen教授提出的,该算法假设在输入对象中存在一些拓扑结构或顺序,可以实现从输入空间(n维)到输出平面(2维)的降维映射,其映射具有拓扑特征保持性质,与实际的大脑处理有很强的理论联系。
-SOM网络包含输入层和输出层。输入层对应一个高维的输入向量,输出层由一系列组织在2维网格上的有序节点构成,输入节点与输出节点通过权重向量连接。 学习过程中,找到与之距离最短的输出层单元,即获胜单元,对其更新。同时,将邻近区域的权值更新,使输出节点保持输入向量的拓扑特征。
+ SOM网络包含输入层和输出层。输入层对应一个高维的输入向量,输出层由一系列组织在2维网格上的有序节点构成,输入节点与输出节点通过权重向量连接。 学习过程中,找到与之距离最短的输出层单元,即获胜单元,对其更新。同时,将邻近区域的权值更新,使输出节点保持输入向量的拓扑特征。
-算法流程:
+**算法流程**:
-(1) 网络初始化,对输出层每个节点权重赋初值;
-(2) 将输入样本中随机选取输入向量,找到与输入向量距离最小的权重向量;
-(3) 定义获胜单元,在获胜单元的邻近区域调整权重使其向输入向量靠拢;
-(4) 提供新样本、进行训练;
-(5) 收缩邻域半径、减小学习率、重复,直到小于允许值,输出聚类结果。
+ (1) 网络初始化,对输出层每个节点权重赋初值;
+ (2) 从输入样本中随机选取输入向量并且归一化,找到与输入向量距离最小的权重向量;
+ (3) 定义获胜单元,在获胜单元的邻近区域调整权重使其向输入向量靠拢;
+ (4) 提供新样本、进行训练;
+ (5) 收缩邻域半径、减小学习率、重复,直到小于允许值,输出聚类结果。
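The SOM procedure above can be sketched for a one-dimensional output grid as follows. Several details are assumptions made for this example and are not specified in the text: the linear decay of the learning rate and neighborhood radius, the Gaussian neighborhood function, the fixed number of epochs used as the stopping rule, and the requirement that the input data is already scaled to the range 0 to 1.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def best_matching_unit(x, weights):
    """Index of the output node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: distance(x, weights[j]))


def train_som(data, n_nodes, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a 1-D SOM; returns one weight vector per output node."""
    rng = random.Random(seed)
    dim = len(data[0])
    # step (1): initialize the weights of every output node
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    radius0 = radius0 or n_nodes / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # step (5): shrink the learning rate
        radius = max(radius0 * (1 - frac), 0.5)  # step (5): shrink the neighborhood
        for x in rng.sample(data, len(data)):    # steps (2) and (4): present samples
            winner = best_matching_unit(x, weights)
            for j in range(n_nodes):             # step (3): update the neighborhood
                grid_d = abs(j - winner)
                if grid_d <= radius:
                    h = math.exp(-grid_d ** 2 / (2 * radius ** 2))
                    weights[j] = [w + lr * h * (xi - w) for w, xi in zip(weights[j], x)]
    return weights


if __name__ == "__main__":
    demo = [(0.1, 0.1), (0.15, 0.1), (0.9, 0.9), (0.85, 0.95)]
    trained = train_som(demo, n_nodes=4)
    print([best_matching_unit(x, trained) for x in demo])
```

After training, each sample is assigned to the cluster of its best matching unit; a two-dimensional grid works the same way with a two-dimensional grid distance.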
-### 2.21.6 FCM聚类算法
+### 2.21.11 FCM聚类算法
-1965年美国加州大学柏克莱分校的扎德教授第一次提出了‘集合’的概念。经过十多年的发展,模糊集合理论渐渐被应用到各个实际应用方面。为克服非此即彼的分类缺点,出现了以模糊集合论为数学基础的聚类分析。用模糊数学的方法进行聚类分析,就是模糊聚类分析[12]。
-
-FCM算法是一种以隶属度来确定每个数据点属于某个聚类程度的算法。该聚类算法是传统硬聚类算法的一种改进。
-
-
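+In 1965, Professor Zadeh of the University of California, Berkeley, introduced the concept of the fuzzy set. After more than a decade of development, fuzzy set theory was gradually applied in practice. To overcome the either-or limitation of hard classification, cluster analysis built on fuzzy set theory emerged. Cluster analysis carried out with the methods of fuzzy mathematics is called fuzzy cluster analysis [12].
+FCM (fuzzy c-means) is an algorithm that uses degrees of membership to express how strongly each data point belongs to each cluster. It is an improvement on traditional hard clustering.
+Let the data set be $X=\{x_1,x_2,\dots,x_n\}$. Its fuzzy $c$-partition can be represented by a fuzzy matrix $U=[u_{ij}]$, where the element $u_{ij}$ is the degree of membership of the $j$-th data point ($j=1,2,\dots,n$) in the $i$-th class ($i=1,2,\dots,c$). The $u_{ij}$ satisfy the following conditions: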
+$$
+\left\{
+\begin{array}{ll}
+\sum_{i=1}^{c} u_{ij}=1, & \forall j \\
+u_{ij}\in[0,1], & \forall i,j \\
+\sum_{j=1}^{n} u_{ij}>0, & \forall i
+\end{array}
+\right.
+$$
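+The most widely used clustering criterion minimizes the weighted within-class sum of squared errors:
+$$
+\min J_m(U,V)=\sum_{j=1}^{n}\sum_{i=1}^{c}u_{ij}^{m}\,d_{ij}^{2}(x_j,v_i)
+$$
+where $V$ is the set of cluster centers, $m$ is the weighting exponent, and $d_{ij}(x_j,v_i)=\|v_i-x_j\|$.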
-算法流程:
+**算法流程**:
(1) 标准化数据矩阵;
(2) 建立模糊相似矩阵,初始化隶属矩阵;
(3) 算法开始迭代,直到目标函数收敛到极小值;
(4) 根据迭代结果,由最后的隶属矩阵确定数据所属的类,显示最后的聚类结果。
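The iteration in step (3) is not spelled out above. A common choice, assumed here, is the standard alternating update for $J_m$: centers $v_i = \frac{\sum_j u_{ij}^m x_j}{\sum_j u_{ij}^m}$ and memberships $u_{ij} = \left(\sum_{k=1}^{c} (d_{ij}/d_{kj})^{2/(m-1)}\right)^{-1}$. The sketch below follows that scheme; the function name, the random initialization of the membership matrix, the tolerance, and the assumption that the data has already been standardized (step (1)) are choices made for this illustration.

```python
import random


def fcm(data, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means on a list of equal-length tuples.

    Returns (u, centers, labels), where u[i][j] is the membership of
    point j in cluster i.
    """
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Step (2): random membership matrix; each column is scaled to sum to 1.
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        col = sum(u[i][j] for i in range(c))
        for i in range(c):
            u[i][j] /= col
    previous = float("inf")
    for _ in range(max_iter):  # step (3): iterate until J_m stops changing
        centers = []
        for i in range(c):
            weights = [u[i][j] ** m for j in range(n)]
            total = sum(weights)
            centers.append(tuple(
                sum(w * x[k] for w, x in zip(weights, data)) / total
                for k in range(dim)
            ))
        # Squared distances d_ij^2 between every center i and point j.
        d2 = [[sum((a - b) ** 2 for a, b in zip(x, v)) for x in data] for v in centers]
        for j in range(n):
            hits = [i for i in range(c) if d2[i][j] == 0.0]
            if hits:  # the point sits exactly on one or more centers
                for i in range(c):
                    u[i][j] = 1.0 / len(hits) if i in hits else 0.0
            else:
                inv = [d2[i][j] ** (-1.0 / (m - 1.0)) for i in range(c)]
                s = sum(inv)
                for i in range(c):
                    u[i][j] = inv[i] / s
        objective = sum(u[i][j] ** m * d2[i][j] for i in range(c) for j in range(n))
        if abs(previous - objective) < tol:
            break
        previous = objective
    # Step (4): assign each point to the cluster with the largest membership.
    labels = [max(range(c), key=lambda i: u[i][j]) for j in range(n)]
    return u, centers, labels


if __name__ == "__main__":
    demo = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
    print(fcm(demo, c=2)[2])
```

Unlike k-means, the returned matrix `u` keeps the soft memberships, so borderline points can be inspected before they are forced into a single class.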
-3 四种聚类算法试验
+### 2.21.12 四种聚类算法试验
-3.1 试验数据
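+The experiment uses the IRIS data set from the UCI repository, a widely used benchmark for classification and clustering algorithms. IRIS contains 150 samples taken from the flowers of three iris species (setosa, versicolor, and virginica). Each sample has 4 attributes: sepal length, sepal width, petal length, and petal width, all in cm. Running different clustering algorithms on the data set gives clustering results of different accuracy. Based on the principles and procedures described above, the following preliminary results were obtained.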
-实验中,选取专门用于测试分类、聚类算法的国际通用的UCI数据库中的IRIS[13]数据集,IRIS数据集包含150个样本数据,分别取自三种不同 的莺尾属植物setosa、versicolor和virginica的花朵样本,每个数据含有4个属性,即萼片长度、萼片宽度、花瓣长度,单位为cm。 在数据集上执行不同的聚类算法,可以得到不同精度的聚类结果。
+| Clustering method | Misclustered samples | Running time (s) | Average accuracy (%) |
+| -------- | ---------- | ---------- | ---------------- |
+| K-means | 17 | 0.146001 | 89 |
+| Hierarchical clustering | 51 | 0.128744 | 66 |
+| SOM | 22 | 5.267283 | 86 |
+| FCM | 12 | 0.470417 | 92 |
-3.2 试验结果说明
-文中基于前面所述各算法原理及算法流程,用matlab进行编程运算,得到表1所示聚类结果。
-
-
-
-如表1所示,对于四种聚类算法,按三方面进行比较:
-
-(1)聚错样本数:总的聚错的样本数,即各类中聚错的样本数的和;
-
-(2)运行时间:即聚类整个 过程所耗费的时间,单位为s;
-
-(3)平均准确度:设原数据集有k个类,用ci表示第i类,ni为ci中样本的个数,mi为聚类正确的个数,则mi/ni为 第i类中的精度,则平均精度为:
+**注**:
-$avg=\frac{1}{k}\sum_{i=1}^{k}\frac{m_{i}}{n_{i}}$
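+(1) Misclustered samples: the total number of wrongly clustered samples, i.e. the sum of the wrongly clustered samples over all classes;
+(2) Running time: the time taken by the whole clustering process, in seconds;
+(3) Average accuracy: suppose the original data set has $k$ classes, $c_i$ denotes the $i$-th class, $n_i$ is the number of samples in $c_i$, and $m_i$ is the number of those samples that are clustered correctly. Then $m_i/n_i$ is the accuracy for class $i$, and the average accuracy is $avg=\frac{1}{k}\sum_{i=1}^{k}\frac{m_{i}}{n_{i}}$.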
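The average accuracy in note (3) is a plain mean of per-class ratios. The sketch below shows the computation; the per-class counts in the demo are hypothetical and were chosen only so that they add up to 17 misclustered samples out of 150, they are not the actual per-class results of the experiment.

```python
def average_accuracy(correct_per_class, size_per_class):
    """Mean over classes of m_i / n_i, as in the formula in note (3)."""
    k = len(size_per_class)
    return sum(m / n for m, n in zip(correct_per_class, size_per_class)) / k


if __name__ == "__main__":
    # Hypothetical split: three classes of 50 samples, 17 errors in total.
    print(round(100 * average_accuracy([50, 48, 35], [50, 50, 50])))
```

When all classes have the same size, as in IRIS, this value equals the overall fraction of correctly clustered samples.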
## 2.22 GBDT和随机森林的区别
@@ -1818,9 +2211,49 @@ GBDT和随机森林的不同点:
5、随机森林对训练集一视同仁,GBDT是基于权值的弱分类器的集成
6、随机森林是通过减少模型方差提高性能,GBDT是通过减少模型偏差提高性能
-## 2.23 大数据与深度学习之间的关系
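+## 2.23 One-Hot Encoding: principle and purpose
+
+Where the problem comes from:
+
+In many **machine learning** tasks, features are not always continuous values; they may be categorical.
+
+For example, consider the following three features: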
+
+```
+["male", "female"] ["from Europe", "from US", "from Asia"]
+["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]
+```
+
+Representing these features as numbers is far more efficient. For example:
+
+```
+["male", "from US", "uses Internet Explorer"] is encoded as [0, 1, 3]
+["female", "from Asia", "uses Chrome"] is encoded as [1, 2, 1]
+```
-大数据**通常被定义为“超出常用软件工具捕获,管理和处理能力”的数据集。
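+However, even after conversion to numbers, this data cannot be fed directly to a classifier. Classifiers usually assume that the data is continuous (so that distances between values are meaningful) and ordered, whereas here 0 does not rank above or below 1. With the encoding above, the numbers carry no order; they were assigned arbitrarily.
+
+**One-hot encoding**
+
+One possible solution is one-hot encoding, also called one-bit-effective encoding. It uses an N-bit state register to encode N states: each state has its own register bit, and at any time exactly one bit is active.
+
+For example:
+
+```
+Natural binary codes: 000,001,010,011,100,101
+One-hot codes: 000001,000010,000100,001000,010000,100000
+```
+
+Put another way, a feature with m possible values becomes m binary features after one-hot encoding (for instance, a grade feature with values good, medium, and poor becomes 100, 010, 001). These features are mutually exclusive, with only one active at a time, so the data becomes sparse.
+
+The main benefits are:
+
+1. It solves the problem that classifiers handle categorical attributes poorly;
+2. To some extent it also expands the feature set.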
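As a concrete illustration, the sketch below one-hot encodes the three categorical features from the example at the start of this section. The helper names `one_hot` and `encode_sample` are made up for this example; in practice a library encoder would normally be used instead.

```python
FEATURES = [
    ["male", "female"],
    ["from Europe", "from US", "from Asia"],
    ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"],
]


def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


def encode_sample(sample, feature_categories):
    """Concatenate the one-hot codes of every feature of one sample."""
    encoded = []
    for value, categories in zip(sample, feature_categories):
        encoded.extend(one_hot(value, categories))
    return encoded


if __name__ == "__main__":
    print(encode_sample(["male", "from US", "uses Internet Explorer"], FEATURES))
```

The three features with 2, 3, and 4 possible values become 9 binary features, each sample having exactly three of them set to 1.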
+
+## 2.24 大数据与深度学习之间的关系
+
+**大数据**通常被定义为“超出常用软件工具捕获,管理和处理能力”的数据集。
**机器学习**关心的问题是如何构建计算机程序使用经验自动改进。
**数据挖掘**是从数据中提取模式的特定算法的应用。
在数据挖掘中,重点在于算法的应用,而不是算法本身。
@@ -1831,12 +2264,31 @@ GBDT和随机森林的不同点:
1. 深度学习是一种模拟大脑的行为。可以从所学习对象的机制以及行为等等很多相关联的方面进行学习,模仿类型行为以及思维。
2. 深度学习对于大数据的发展有帮助。深度学习对于大数据技术开发的每一个阶段均有帮助,不管是数据的分析还是挖掘还是建模,只有深度学习,这些工作才会有可能一一得到实现。
-3. 深度学习转变了解决问题的思维。很多时候发现问题到解决问题,走一步看一步不是一个主要的解决问题的方式了,在深度学习的基础上,要求我们从开始到最后都要基于哦那个一个目标,为了需要优化的那个最终目的去进行处理数据以及将数据放入到数据应用平台上去。
-4. 大数据的深度学习需要一个框架。在大数据方面的深度学习都是从基础的角度出发的,深度学习需要一个框架或者一个系统总而言之,将你的大数据通过深度分析变为现实这就是深度学习和大数据的最直接关系。
+3. 深度学习转变了解决问题的思维。很多时候发现问题到解决问题,走一步看一步不是一个主要的解决问题的方式了,在深度学习的基础上,要求我们从开始到最后都要基于一个目标,为了需要优化的那个最终目标去进行处理数据以及将数据放入到数据应用平台上去,这就是端到端(End to End)。
+4. 大数据的深度学习需要一个框架。在大数据方面的深度学习都是从基础的角度出发的,深度学习需要一个框架或者一个系统。总而言之,将你的大数据通过深度分析变为现实,这就是深度学习和大数据的最直接关系。
+
+
+
+## 参考文献
+[1] Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT press, 2016.
+[2] 周志华. 机器学习[M].清华大学出版社, 2016.
+[3] Michael A. Nielsen. "Neural Networks and Deep Learning", Determination Press, 2015.
+[4] Suryansh S. Gradient Descent: All You Need to Know, 2018.
+[5] 刘建平. 梯度下降小结,EM算法的推导, 2018
+[6] 杨小兵.聚类分析中若干关键技术的研究[D]. 杭州:浙江大学, 2005.
+[7] Xu R, Wunsch D. Survey of clustering algorithms[J]. IEEE Transactions on Neural Networks, 2005, 16(3): 645-678.
+[8] YI Hong, SAM K. Learning assignment order of instances for the constrained k-means clustering algorithm[J].IEEE Transactions on Systems, Man, and Cybernetics, Part B:Cybernetics,2009,39 (2):568-574.
+[9] 贺玲, 吴玲达, 蔡益朝.数据挖掘中的聚类算法综述[J].计算机应用研究, 2007, 24(1):10-13.
+[10] 孙吉贵, 刘杰, 赵连宇.聚类算法研究[J].软件学报, 2008, 19(1):48-61.
+[11] 孔英会, 苑津莎, 张铁峰等.基于数据流管理技术的配变负荷分类方法研究.中国国际供电会议, CICED2006.
+[12] 马晓艳, 唐雁.层次聚类算法研究[J].计算机科学, 2008, 34(7):34-36.
+[13] FISHER R A. Iris Plants Database https://www.ics.uci.edu/vmlearn/MLRepository.html, Authorized license.
+[14] Quinlan J R. Induction of decision trees[J]. Machine learning, 1986, 1(1): 81-106.
+[15] Breiman L. Random forests[J]. Machine learning, 2001, 45(1): 5-32.
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.1.1.5.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.1.1.5.png"
new file mode 100644
index 00000000..dde211ce
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.1.1.5.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.1.1.6.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.1.1.6.png"
new file mode 100644
index 00000000..1bd76971
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.1.1.6.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.1.6.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.1.6.1.png"
new file mode 100644
index 00000000..e0ce77d9
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.1.6.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.12.2.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.12.2.1.png"
new file mode 100644
index 00000000..34f2d8b6
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.12.2.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.12.2.2.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.12.2.2.png"
new file mode 100644
index 00000000..ad3d4bd6
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.12.2.2.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.1.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.1.1.png"
new file mode 100644
index 00000000..160cf461
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.1.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.1.2.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.1.2.png"
new file mode 100644
index 00000000..fec7b5b1
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.1.2.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.2.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.2.1.png"
new file mode 100644
index 00000000..02dbca08
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.2.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.1.png"
new file mode 100644
index 00000000..1c4dacd9
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.2.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.2.png"
new file mode 100644
index 00000000..09f4d9e2
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.2.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.4.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.4.png"
new file mode 100644
index 00000000..03f1c6a9
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.4.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.5.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.5.png"
new file mode 100644
index 00000000..56443fd2
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.5.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.6.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.6.png"
new file mode 100644
index 00000000..bda9c5cb
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.3.6.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.4.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.4.1.png"
new file mode 100644
index 00000000..f91e8ab4
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.4.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.1.png"
new file mode 100644
index 00000000..1236163e
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.2.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.2.png"
new file mode 100644
index 00000000..c36f3bfd
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.2.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.3.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.3.png"
new file mode 100644
index 00000000..d53c8ea8
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.3.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.4.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.4.png"
new file mode 100644
index 00000000..f00cca44
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.2.5.4.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.4.9.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.4.9.1.png"
new file mode 100644
index 00000000..3e126ba7
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.4.9.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.4.9.2.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.4.9.2.png"
new file mode 100644
index 00000000..ea9b0c34
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.4.9.2.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.4.9.3.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.4.9.3.png"
new file mode 100644
index 00000000..fe1d7779
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.4.9.3.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.6.3.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.6.3.1.png"
new file mode 100644
index 00000000..d091389e
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.6.3.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.6.7.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.6.7.1.png"
new file mode 100644
index 00000000..4b98c00e
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.6.7.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.8.2.1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.8.2.1.png"
new file mode 100644
index 00000000..1236163e
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/3.8.2.1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate1.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate1.png"
new file mode 100644
index 00000000..3f7efadc
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate1.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate2.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate2.png"
new file mode 100644
index 00000000..f507418d
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate2.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate3.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate3.png"
new file mode 100644
index 00000000..579d32ed
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate3.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate4.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate4.png"
new file mode 100644
index 00000000..e66fba8e
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate4.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate5.png" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate5.png"
new file mode 100644
index 00000000..5835a5ea
Binary files /dev/null and "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/img/ch3/learnrate5.png" differ
diff --git "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/\347\254\254\344\270\211\347\253\240_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200.md" "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/\347\254\254\344\270\211\347\253\240_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200.md"
index 9f38cd51..4e3ec832 100644
--- "a/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/\347\254\254\344\270\211\347\253\240_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200.md"
+++ "b/ch03_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200/\347\254\254\344\270\211\347\253\240_\346\267\261\345\272\246\345\255\246\344\271\240\345\237\272\347\241\200.md"
@@ -6,140 +6,116 @@
### 3.1.1 神经网络组成?
-为了描述神经网络,我们先从最简单的神经网络说起。
+神经网络类型众多,其中最为重要的是多层感知机。为了详细地描述神经网络,我们先从最简单的神经网络说起。
**感知机**
+多层感知机中的特征神经元模型称为感知机,由*Frank Rosenblatt*于1957年发明。
+
简单的感知机如下图所示:

-其输出为:
+其中$x_1$,$x_2$,$x_3$为感知机的输入,其输出为:
$$
output = \left\{
\begin{aligned}
-0, \quad if \sum_i w_i x_i \le threshold \\
-1, \quad if \sum_i w_i x_i > threshold
+0, \quad if \ \ \sum_i w_i x_i \leqslant threshold \\
+1, \quad if \ \ \sum_i w_i x_i > threshold
\end{aligned}
\right.
$$
-假如把感知机想象成一个加权投票机制,比如 3 位评委给一个歌手打分,打分分别为 4 分、1 分、-3 分,这 3 位评分的权重分别是 1、3、2,则该歌手最终得分为 $ 4 * 1 + 1 * 3 + (-3) * 2 = 1 $。按照比赛规则,选取的 threshold 为 3,说明只有歌手的综合评分大于 3 时,才可顺利晋级。对照感知机,该选手被淘汰,因为
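+Think of the perceptron as a weighted voting mechanism. Suppose 3 judges score a singer $4$, $1$, and $-3$, and the judges' weights are $1$, $3$, and $2$. The singer's final score is $4 \times 1 + 1 \times 3 + (-3) \times 2 = 1$. Under the competition rules the $threshold$ is $3$, meaning a singer advances only if the combined score is greater than $3$. In perceptron terms, this singer is eliminated, because: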
$$
\sum_i w_i x_i < threshold=3, output = 0
$$
-用 $ -b $ 代替 threshold。输出变为:
+用 $-b$ 代替 $threshold$,输出变为:
$$
output = \left\{
\begin{aligned}
-0, \quad if w \cdot x + b \le 0 \\
-1, \quad if w \cdot x + b > 0
+0, \quad if \ \ \boldsymbol{w} \cdot \boldsymbol{x} + b \leqslant 0 \\
+1, \quad if \ \ \boldsymbol{w} \cdot \boldsymbol{x} + b > 0
\end{aligned}
\right.
$$
-设置合适的 $ x $ 和 $ b $,一个简单的感知机单元的与非门表示如下:
+With suitable choices of $\boldsymbol{w}$ and $b$, a single perceptron unit can implement a NAND gate, as shown below:

-当输入为 $ 0,1 $ 时,感知机输出为 $ 0 * (-2) + 1 * (-2) + 3 = 1 $。
+For the input $(0, 1)$, the weighted sum is $0 \times (-2) + 1 \times (-2) + 3 = 1 > 0$, so the perceptron outputs $1$.
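The perceptron rule and the NAND example can be checked with a few lines of Python. The weights $(-2, -2)$ and bias $3$ are the ones used in the calculation above; the function names are chosen for this sketch only.

```python
def perceptron(x, w, b):
    """Output 1 if w . x + b > 0, otherwise 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0


def nand(x1, x2):
    """NAND gate built from one perceptron with weights (-2, -2) and bias 3."""
    return perceptron((x1, x2), (-2, -2), 3)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", nand(a, b))
    # The singer example: scores (4, 1, -3), weights (1, 3, 2), threshold 3 (b = -3).
    print(perceptron((4, 1, -3), (1, 3, 2), -3))
```

Only the input $(1, 1)$ gives a non-positive weighted sum ($-2 - 2 + 3 = -1$), so it is the only input mapped to 0, which is exactly the NAND truth table.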
复杂一些的感知机由简单的感知机单元组合而成:

-**Sigmoid单元**
-
-感知机单元的输出只有 0 和 1,实际情况中,更多的输出类别不止 0 和 1,而是 $ [0, 1] $ 上的概率值,这时候就需要 sigmoid 函数把任意实数映射到 $ [0, 1] $ 上。
-
-神经元的输入
-
-$$
-z = \sum_i w_i x_i + b
-$$
-
-假设神经元的输出采用 sigmoid 激活函数
-
-$$
-\sigma(z) = \frac{1}{1+e^{-z}}
-$$
-
-sigmoid 激活函数图像如下图所示:
-
-
-
-全连接神经网络
-即第 $ i $ 层的每个神经元和第 $ i-1 $ 层的每个神经元都有连接。
+**多层感知机**
-
+多层感知机由感知机推广而来,最主要的特点是有多个神经元层,因此也叫深度神经网络。相比于单独的感知机,多层感知机的第 $ i $ 层的每个神经元和第 $ i-1 $ 层的每个神经元都有连接。
-输出层可以不止有 1 个神经元。隐藏层可以只有 1 层,也可以有多层。输出层为多个神经元的神经网络例如下图:
+
-
+输出层可以不止有$ 1$ 个神经元。隐藏层可以只有$ 1$ 层,也可以有多层。输出层为多个神经元的神经网络例如下图所示:
+
-### 3.1.2神经网络有哪些常用模型结构?
-答案来源:[25张图让你读懂神经网络架构](https://blog.csdn.net/nicholas_liu2017/article/details/73694666)
+### 3.1.2 神经网络有哪些常用模型结构?
下图包含了大部分常用的模型:

-### 3.1.3如何选择深度学习开发平台?
+### 3.1.3 如何选择深度学习开发平台?
-现有的深度学习开源平台主要有 Caffe, Torch, MXNet, CNTK, Theano, TensorFlow, Keras 等。那如何选择一个适合自己的平台呢,下面列出一些衡量做参考。
+Current open-source deep learning platforms include Caffe, PyTorch, MXNet, CNTK, Theano, TensorFlow, Keras, and fastai. How do you choose the one that suits you? The criteria below can serve as a reference.
**参考1:与现有编程平台、技能整合的难易程度**
-主要是前期积累的开发经验和资源,比如编程语言,前期数据集存储格式等。
+ 主要是前期积累的开发经验和资源,比如编程语言,前期数据集存储格式等。
**参考2: 与相关机器学习、数据处理生态整合的紧密程度**
-深度学习研究离不开各种数据处理、可视化、统计推断等软件包。考虑建模之前,是否具有方便的数据预处理工具?建模之后,是否具有方便的工具进行可视化、统计推断、数据分析?
+ 深度学习研究离不开各种数据处理、可视化、统计推断等软件包。考虑建模之前,是否具有方便的数据预处理工具?建模之后,是否具有方便的工具进行可视化、统计推断、数据分析。
**参考3:对数据量及硬件的要求和支持**
-深度学习在不同应用场景的数据量是不一样的,这也就导致我们可能需要考虑分布式计算、多 GPU 计算的问题。例如,对计算机图像处理研究的人员往往需要将图像文件和计算任务分部到多台计算机节点上进行执行。当下每个深度学习平台都在快速发展,每个平台对分布式计算等场景的支持也在不断演进。
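+The amount of data differs between deep learning applications, so distributed computing and multi-GPU computing may need to be considered. For example, researchers working on image processing often need to distribute image files and computing tasks across multiple machines. Every deep learning platform is evolving quickly, and each platform's support for scenarios such as distributed computing keeps changing.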
**参考4:深度学习平台的成熟程度**
-成熟程度的考量是一个比较主观的考量因素,这些因素可包括:社区的活跃程度;是否容易和开发人员进行交流;当前应用的势头。
+ 成熟程度的考量是一个比较主观的考量因素,这些因素可包括:社区的活跃程度;是否容易和开发人员进行交流;当前应用的势头。
**参考5:平台利用是否多样性?**
-有些平台是专门为深度学习研究和应用进行开发的,有些平台对分布式计算、GPU 等构架都有强大的优化,能否用这些平台/软件做其他事情?比如有些深度学习软件是可以用来求解二次型优化;有些深度学习平台很容易被扩展,被运用在强化学习的应用中。
+ 有些平台是专门为深度学习研究和应用进行开发的,有些平台对分布式计算、GPU 等构架都有强大的优化,能否用这些平台/软件做其他事情?比如有些深度学习软件是可以用来求解二次型优化;有些深度学习平台很容易被扩展,被运用在强化学习的应用中。
-### 3.1.4为什么使用深层表示?
-
-1. 深度神经网络的多层隐藏层中,前几层能学习一些低层次的简单特征,后几层能把前面简单的特征结合起来,去学习更加复杂的东西。比如刚开始检测到的是边缘信息,而后检测更为细节的信息。
+### 3.1.4 为什么使用深层表示?
+1. 深度神经网络是一种特征递进式的学习算法,浅层的神经元直接从输入数据中学习一些低层次的简单特征,例如边缘、纹理等。而深层的特征则基于已学习到的浅层特征继续学习更高级的特征,从计算机的角度学习深层的语义信息。
2. 深层的网络隐藏单元数量相对较少,隐藏层数目较多,如果浅层的网络想要达到同样的计算结果则需要指数级增长的单元数量才能达到。
-### 3.1.5为什么深层神经网络难以训练?
-
-答案来源:
-
-[为什么深层神经网络难以训练](https://blog.csdn.net/BinChasing/article/details/50300069)
-
-[为什么很难训练深度神经网络](http://mini.eastday.com/mobile/180116023302833.html)
+### 3.1.5 为什么深层神经网络难以训练?
1. 梯度消失
- 梯度消失是指通过隐藏层从后向前看,梯度会变的越来越小,说明前面层的学习会显著慢于后面层的学习,所以学习会卡住,除非梯度变大。下图是不同隐含层的学习速率;
+ 梯度消失是指通过隐藏层从后向前看,梯度会变的越来越小,说明前面层的学习会显著慢于后面层的学习,所以学习会卡住,除非梯度变大。
+
+ 梯度消失的原因受到多种因素影响,例如学习率的大小,网络参数的初始化,激活函数的边缘效应等。在深层神经网络中,每一个神经元计算得到的梯度都会传递给前一层,较浅层的神经元接收到的梯度受到之前所有层梯度的影响。如果计算得到的梯度值非常小,随着层数增多,求出的梯度更新信息将会以指数形式衰减,就会发生梯度消失。下图是不同隐含层的学习速率:

2. 梯度爆炸
- 又称exploding gradient problem,在深度网络或循环神经网络(RNN)等网络结构中,梯度可在网络更新的过程中不断累积,变成非常大的梯度,导致网络权重值的大幅更新,使得网络不稳定;在极端情况下,权重值甚至会溢出,变为NaN值,再也无法更新。 具体可参考文献:[A Gentle Introduction to Exploding Gradients in Neural Networks](https://machinelearningmastery.com/exploding-gradients-in-neural-networks/)
+ 在深度网络或循环神经网络(Recurrent Neural Network, RNN)等网络结构中,梯度可在网络更新的过程中不断累积,变成非常大的梯度,导致网络权重值的大幅更新,使得网络不稳定;在极端情况下,权重值甚至会溢出,变为$NaN$值,再也无法更新。
-3. 权重矩阵的退化导致模型的有效自由度减少。参数空间中学习的退化速度减慢,导致减少了模型的有效维数,网络的可用自由度对学习中梯度范数的贡献不均衡,随着相乘矩阵的数量(即网络深度)的增加,矩阵的乘积变得越来越退化;
+3. 权重矩阵的退化导致模型的有效自由度减少。
-在有硬饱和边界的非线性网络中(例如 ReLU 网络),随着深度增加,退化过程会变得越来越快。Duvenaud 等人 2014 年的论文里展示了关于该退化过程的可视化:
+ 参数空间中学习的退化速度减慢,导致减少了模型的有效维数,网络的可用自由度对学习中梯度范数的贡献不均衡,随着相乘矩阵的数量(即网络深度)的增加,矩阵的乘积变得越来越退化。在有硬饱和边界的非线性网络中(例如 ReLU 网络),随着深度增加,退化过程会变得越来越快。Duvenaud等人2014年的论文里展示了关于该退化过程的可视化:

@@ -147,75 +123,73 @@ sigmoid 激活函数图像如下图所示:
### 3.1.6 深度学习和机器学习有什么不同?
-机器学习:利用计算机、概率论、统计学等知识,输入数据,让计算机学会新知识。机器学习的过程,就是通过训练数据寻找目标函数。
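+**Machine learning**: uses computing, probability theory, statistics, and related knowledge to let a computer learn from input data. The machine learning process consists of using training data to optimize an objective function.
+
+**Deep learning**: a particular kind of machine learning with great power and flexibility. It learns to represent the world as a nested hierarchy of concepts, in which each concept is defined in terms of simpler ones and more abstract representations are computed from less abstract ones.
+
+Traditional machine learning requires hand-crafted features that are designed to extract the target information, so it depends heavily on the specifics of the task and on the expertise of whoever designs the features. Deep learning can instead learn simple features from large amounts of data and build progressively more complex and abstract features on top of them, without manual feature engineering. This is one major reason for its popularity in the era of big data.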
-深度学习是机器学习的一种,现在深度学习比较火爆。在传统机器学习中,手工设计特征对学习效果很重要,但是特征工程非常繁琐。而深度学习能够从大数据中自动学习特征,这也是深度学习在大数据时代受欢迎的一大原因。
-
+
+

## 3.2 网络操作与计算
-### 3.2.1前向传播与反向传播?
-
-答案来源:[神经网络中前向传播和反向传播解析](https://blog.csdn.net/lhanchao/article/details/51419150)
+### 3.2.1 前向传播与反向传播?
-在神经网络的计算中,主要由前向传播(foward propagation,FP)和反向传播(backward propagation,BP)。
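+A neural network involves two main computations. Forward propagation (FP) acts on the input of each layer and computes the output layer by layer. Backward propagation (BP) acts on the network output and computes gradients that are used to update the parameters from the deepest layers back to the shallowest.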
**前向传播**
-
+
-假设上一层结点 $ i,j,k,... $ 等一些结点与本层的结点 $ w $ 有连接,那么结点 $ w $ 的值怎么算呢?就是通过上一层的 $ i,j,k,... $ 等结点以及对应的连接权值进行加权和运算,最终结果再加上一个偏置项(图中为了简单省略了),最后在通过一个非线性函数(即激活函数),如 ReLu,sigmoid 等函数,最后得到的结果就是本层结点 $ w $ 的输出。
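+Suppose nodes $ i,j,k,... $ in the previous layer are connected to node $ w $ in the current layer. How is the value of node $ w $ computed? Take the weighted sum of the outputs of nodes $ i,j,k,... $ using the corresponding connection weights, add a bias term (omitted from the figure for simplicity), and pass the result through a nonlinear function (the activation function), such as ReLU or sigmoid. The result is the output of node $ w $.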
最终不断的通过这种方法一层层的运算,得到输出层结果。
**反向传播**
-
+
由于我们前向传播最终得到的结果,以分类为例,最终总是有误差的,那么怎么减少误差呢,当前应用广泛的一个算法就是梯度下降算法,但是求梯度就要求偏导数,下面以图中字母为例讲解一下:
-设最终中误差为 $ E $,对于输出那么 $ E $ 对于输出节点 $ y_l $ 的偏导数是 $ y_l - t_l $,其中 $ t_l $ 是真实值,$ \frac{\partial y_l}{\partial z_l} $ 是指上面提到的激活函数,$ z_l $ 是上面提到的加权和,那么这一层的 $ E $ 对于 $ z_l $ 的偏导数为 $ \frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l} \frac{\partial y_l}{\partial z_l} $。同理,下一层也是这么计算,只不过 $ \frac{\partial E}{\partial y_k} $ 计算方法变了,一直反向传播到输入层,最后有 $ \frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_j} $,且 $ \frac{\partial z_j}{\partial x_i} = w_i j $。然后调整这些过程中的权值,再不断进行前向传播和反向传播的过程,最终得到一个比较好的结果;
-
-### 3.2.2如何计算神经网络的输出?
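+Let the final error be $ E $, taken here to be the squared error $ E = \frac{1}{2}\sum_l (y_l - t_l)^2 $, where $ t_l $ is the true value. Then the partial derivative of $ E $ with respect to output node $ y_l $ is $ y_l - t_l $. With $ z_l $ the weighted sum described above, $ \frac{\partial y_l}{\partial z_l} $ is the derivative of the activation function (equal to 1 for the linear output activation assumed here), so for this layer $ \frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l} \frac{\partial y_l}{\partial z_l} $. The preceding layer is handled the same way, except that $ \frac{\partial E}{\partial y_k} $ is now obtained by summing the contributions of all the nodes that $ y_k $ feeds into. Propagating back to the input layer gives $ \frac{\partial E}{\partial x_i} = \sum_j \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_i} $ with $ \frac{\partial z_j}{\partial x_i} = w_{ij} $. The weights are then adjusted using these gradients, and forward and backward propagation are repeated until a good result is reached.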
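The chain rule used in backpropagation can be verified numerically on a single neuron. The sketch below makes several assumptions for illustration: a sigmoid activation (so that the factor $\frac{\partial y}{\partial z} = y(1-y)$ is not trivially 1), a squared-error loss $E = \frac{1}{2}(y - t)^2$, and arbitrary example values for the weights and input. It computes the gradient with respect to the weights, where $\frac{\partial z}{\partial w_i} = x_i$, and compares it with a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Output y = sigmoid(w . x + b) of a single neuron."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, x, t):
    """Squared error E = 0.5 * (y - t)^2."""
    return 0.5 * (forward(w, b, x) - t) ** 2


def analytic_gradient(w, b, x, t):
    """dE/dw_i = dE/dy * dy/dz * dz/dw_i, via the chain rule."""
    y = forward(w, b, x)
    dE_dy = y - t
    dy_dz = y * (1.0 - y)  # derivative of the sigmoid
    dE_dz = dE_dy * dy_dz
    return [dE_dz * xi for xi in x]


def numeric_gradient(w, b, x, t, eps=1e-6):
    """Central finite-difference estimate of dE/dw_i."""
    grads = []
    for i in range(len(w)):
        plus, minus = list(w), list(w)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss(plus, b, x, t) - loss(minus, b, x, t)) / (2 * eps))
    return grads


if __name__ == "__main__":
    w, b, x, t = [0.5, -0.3], 0.1, [1.0, 2.0], 1.0
    print(analytic_gradient(w, b, x, t))
    print(numeric_gradient(w, b, x, t))
```

A gradient check of this kind is a common way to catch mistakes in a hand-written backward pass before training.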
-答案来源:[零基础入门深度学习(3) - 神经网络和反向传播算法](https://www.zybuluo.com/hanbingtao/note/476663)
+### 3.2.2 如何计算神经网络的输出?
-
+
-如上图,输入层有三个节点,我们将其依次编号为 1、2、3;隐藏层的 4 个节点,编号依次为 4、5、6、7;最后输出层的两个节点编号为 8、9。比如,隐藏层的节点 4,它和输入层的三个节点 1、2、3 之间都有连接,其连接上的权重分别为是 $ w_{41}, w_{42}, w_{43} $。
+如上图,输入层有三个节点,我们将其依次编号为 1、2、3;隐藏层的 4 个节点,编号依次为 4、5、6、7;最后输出层的两个节点编号为 8、9。比如,隐藏层的节点 4,它和输入层的三个节点 1、2、3 之间都有连接,其连接上的权重分别为是 $ w_{41}, w_{42}, w_{43} $。
为了计算节点 4 的输出值,我们必须先得到其所有上游节点(也就是节点 1、2、3)的输出值。节点 1、2、3 是输入层的节点,所以,他们的输出值就是输入向量本身。按照上图画出的对应关系,可以看到节点 1、2、3 的输出值分别是 $ x_1, x_2, x_3 $。
$$
-a_4 = \sigma(w^T \cdot a) = \sigma(w_{41}x_4 + w_{42}x_2 + w_{43}x_3 + w_{4b})
+a_4 = \sigma(w^T \cdot a) = \sigma(w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + w_{4b})
$$
-其中 $ w_{4b} $ 是节点 4 的偏置项
+where $ w_{4b} $ is the bias of node 4.
In the same way we can compute the outputs $ a_5, a_6, a_7 $ of nodes 5, 6, 7.
-计算输出层的节点 8 的输出值 $ y_1 $:
+The output $ y_1 $ of output-layer node 8 is:
$$
y_1 = \sigma(w^T \cdot a) = \sigma(w_{84}a_4 + w_{85}a_5 + w_{86}a_6 + w_{87}a_7 + w_{8b})
$$
-其中 $ w_{8b} $ 是节点 8 的偏置项。
+where $ w_{8b} $ is the bias of node 8.
-同理,我们还可以计算出 $ y_2 $。这样输出层所有节点的输出值计算完毕,我们就得到了在输入向量 $ x_1, x_2, x_3, x_4 $ 时,神经网络的输出向量 $ y_1, y_2 $ 。这里我们也看到,输出向量的维度和输出层神经元个数相同。
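+$ y_2 $ is computed the same way. Once every output-layer node has been evaluated, we have the network's output vector $ y_1, y_2 $ for the input vector $ x_1, x_2, x_3 $. Note that the dimension of the output vector equals the number of neurons in the output layer.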
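The layer-by-layer computation described here can be sketched in a few lines of Python. This is a minimal illustration, not the original author's code: the weights, biases and inputs are made-up values for a 3-4-2 network, and `dense_forward` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    """Logistic activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_forward(inputs, weights, biases):
    """Output of one fully connected layer.

    weights[j][i] is the weight from input i to node j; biases[j] is node j's bias.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical values for a 3-4-2 network (not taken from the figure).
x = [1.0, 0.5, -1.0]
w_hidden = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2], [0.5, 0.5, 0.5], [-0.3, 0.1, 0.0]]
b_hidden = [0.0, 0.1, -0.1, 0.2]
w_output = [[0.2, -0.4, 0.1, 0.3], [0.7, 0.0, -0.2, 0.1]]
b_output = [0.05, -0.05]

a_hidden = dense_forward(x, w_hidden, b_hidden)  # a_4 .. a_7
y = dense_forward(a_hidden, w_output, b_output)  # y_1, y_2
print(y)
```

The output list has one entry per output-layer neuron, which is the dimension property noted above.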
-### 3.2.3如何计算卷积神经网络输出值?
-
-答案来源:[零基础入门深度学习(4) - 卷积神经网络](https://www.zybuluo.com/hanbingtao/note/485480)
+### 3.2.3 How is the output of a convolutional neural network computed?
Suppose we have a 5\*5 image and convolve it with a 3\*3 filter to obtain a 3\*3 Feature Map, as shown below:
-
+
-$ x_{i,j} $ 表示图像第 $ i $ 行第 $ j $ 列元素。$ w_{m,n} $ 表示 filter 第 $ m $ 行第 $ n $ 列权重。 $ w_b $ 表示 filter 的偏置项。 表示 feature map 第 $ i $ 行第 $ j $ 列元素。 $ f $ 表示激活函数,这里以 relu 函数为例。
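+$ x_{i,j} $ denotes the element in row $ i $, column $ j $ of the image. $ w_{m,n} $ denotes the weight in row $ m $, column $ n $ of the filter. $ w_b $ denotes the bias of the filter. $ a_{i,j} $ denotes the element in row $ i $, column $ j $ of the feature map. $ f $ denotes the activation function, here ReLU.
The convolution is computed as: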
@@ -223,47 +197,40 @@ $$
a_{i,j} = f(\sum_{m=0}^2 \sum_{n=0}^2 w_{m,n} x_{i+m, j+n} + w_b )
$$
-当步长为 1 时,计算 feature map 元素 $ a_{0,0} $ 如下:
+With a stride of $1$, the feature map element $ a_{0,0} $ is computed as:
$$
a_{0,0} = f(\sum_{m=0}^2 \sum_{n=0}^2 w_{m,n} x_{0+m, 0+n} + w_b )
-
-= relu(w_{0,0} x_{0,0} + w_{0,1} x_{0,1} + w_{0,2} x_{0,2} + w_{1,0} x_{1,0} + w_{1,1} x_{1,1} + w_{1,2} x_{1,2} + w_{2,0} x_{2,0} + w_{2,1} x_{2,1} + w_{2,2} x_{2,2}) \\
-
+= relu(w_{0,0} x_{0,0} + w_{0,1} x_{0,1} + w_{0,2} x_{0,2} + w_{1,0} x_{1,0} + \\w_{1,1} x_{1,1} + w_{1,2} x_{1,2} + w_{2,0} x_{2,0} + w_{2,1} x_{2,1} + w_{2,2} x_{2,2}) \\
= 1 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 \\
-
= 4
$$
-结果如下:
-
-
-
The computation is illustrated below:
-
+
The rest of the Feature Map is computed in the same way.
-
+
With a stride of 2, the Feature Map is computed as follows:
-
+
Note: the image size, the stride and the size of the resulting Feature Map are related as follows:
$$
-W_2 = (W_1 - F + 2P)/S + 1
+W_2 = (W_1 - F + 2P)/S + 1\\
H_2 = (H_1 - F + 2P)/S + 1
$$
-其中 $ W_2 $, 是卷积后 Feature Map 的宽度;$ W_1 $ 是卷积前图像的宽度;$ F $ 是 filter 的宽度;$ P $ 是 Zero Padding 数量,Zero Padding 是指在原始图像周围补几圈 0,如果 P 的值是 1,那么就补 1 圈 0;S 是步幅;$ H_2 $ 卷积后 Feature Map 的高度;$ H_1 $ 是卷积前图像的宽度。
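+where $ W_2 $ is the width of the Feature Map after convolution; $ W_1 $ is the width of the image before convolution; $ F $ is the width of the filter; $ P $ is the amount of Zero Padding, i.e. the number of rings of $0$s added around the original image (if $P$ is $1$, one ring of $0$s is added); $S$ is the stride; $ H_2 $ is the height of the Feature Map after convolution; and $ H_1 $ is the height of the image before convolution.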
-举例:假设图像宽度 $ W_1 = 5 $,filter 宽度 $ F=3 $,Zero Padding $ P=0 $,步幅 $ S=2 $,$ Z $ 则
+Example: suppose the image width is $ W_1 = 5 $, the filter width $ F=3 $, Zero Padding $ P=0 $ and stride $ S=2 $. Then
$$
W_2 = (W_1 - F + 2P)/S + 1
@@ -273,68 +240,66 @@ W_2 = (W_1 - F + 2P)/S + 1
= 2
$$
-说明 Feature Map 宽度是2。同样,我们也可以计算出 Feature Map 高度也是 2。
+So the Feature Map width is 2. In the same way, the Feature Map height is also 2.
-如果卷积前的图像深度为 $ D $,那么相应的 filter 的深度也必须为 $ D $。深度大于 1 的卷积计算公式:
+If the image has depth $ D $ before convolution, the filter must also have depth $ D $. For depth greater than 1 the convolution is:
$$
a_{i,j} = f(\sum_{d=0}^{D-1} \sum_{m=0}^{F-1} \sum_{n=0}^{F-1} w_{d,m,n} x_{d,i+m,j+n} + w_b)
$$
-其中,$ D $ 是深度;$ F $ 是 filter 的大小;$ w_{d,m,n} $ 表示 filter 的第 $ d $ 层第 $ m $ 行第 $ n $ 列权重;$ a_{d,i,j} $ 表示 feature map 的第 $ d $ 层第 $ i $ 行第 $ j $ 列像素;其它的符号含义前面相同,不再赘述。
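+where $ D $ is the depth; $ F $ is the size of the filter; $ w_{d,m,n} $ is the weight in layer $ d $, row $ m $, column $ n $ of the filter; $ x_{d,i,j} $ is the pixel in layer $ d $, row $ i $, column $ j $ of the input image; the other symbols have the same meaning as before.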
-每个卷积层可以有多个 filter。每个 filter 和原始图像进行卷积后,都可以得到一个 Feature Map。卷积后 Feature Map 的深度(个数)和卷积层的 filter 个数是相同的。下面的图示显示了包含两个 filter 的卷积层的计算。7\*7\*3 输入,经过两个 3\*3\*3 filter 的卷积(步幅为 2),得到了 3\*3\*2 的输出。图中的 Zero padding 是 1,也就是在输入元素的周围补了一圈 0。Zero padding 对于图像边缘部分的特征提取是很有帮助的。
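+A convolutional layer can have several filters. Convolving each filter with the input produces one Feature Map, so the depth (number) of Feature Maps after convolution equals the number of filters in the layer. The figure below shows a convolutional layer with two filters: a $7 \times 7 \times 3$ input convolved with two $3 \times 3 \times 3$ filters (stride $2$) gives a $3 \times 3 \times 2$ output. The Zero padding in the figure is $1$, i.e. one ring of $0$s is added around the input.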
-
+
-以上就是卷积层的计算方法。这里面体现了局部连接和权值共享:每层神经元只和上一层部分神经元相连(卷积计算规则),且 filter 的权值对于上一层所有神经元都是一样的。对于包含两个 $ 3 * 3 * 3 $ 的 fitler 的卷积层来说,其参数数量仅有 $ (3 * 3 * 3+1) * 2 = 56 $ 个,且参数数量与上一层神经元个数无关。与全连接神经网络相比,其参数数量大大减少了。
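+That is how a convolutional layer is computed. It exhibits local connectivity and weight sharing: each neuron connects only to part of the previous layer (the convolution rule), and the filter weights are the same at every position of the previous layer. A convolutional layer with two $ 3 * 3 * 3 $ filters has only $ (3 * 3 * 3+1) * 2 = 56 $ parameters, and this number does not depend on the number of neurons in the previous layer. Compared with a fully connected network, the parameter count is greatly reduced.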
### 3.2.4 How is the output of a Pooling layer computed?
-Pooling 层主要的作用是下采样,通过去掉 Feature Map 中不重要的样本,进一步减少参数数量。Pooling 的方法很多,最常用的是 Max Pooling。Max Pooling 实际上就是在 n\*n 的样本中取最大值,作为采样后的样本值。下图是 2\*2 max pooling:
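+The main role of a Pooling layer is downsampling: by discarding the less important samples in the Feature Map it further reduces the number of parameters needed in the layers that follow. There are many pooling methods; the most common is Max Pooling, which takes the maximum of each n\*n window as the sampled value. The figure below shows 2\*2 max pooling: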
-
+
-除了 Max Pooing 之外,常用的还有 Mean Pooling ——取各样本的平均值。
-对于深度为 $ D $ 的 Feature Map,各层独立做 Pooling,因此 Pooling 后的深度仍然为 $ D $。
+Besides Max Pooling, Average Pooling, which takes the mean of each window, is also common.
+For a Feature Map of depth $ D $, each layer is pooled independently, so the depth after pooling is still $ D $.
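A minimal sketch of 2\*2 max pooling on one layer of a Feature Map follows. The helper name `max_pool2d` and the 4\*4 input values are assumptions made for illustration only.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling over size x size windows of a 2-D list."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 4x4 feature map.
fm = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(fm)
print(pooled)
```

For a Feature Map of depth $ D $ the same function would simply be applied to each layer in turn.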
### 3.2.5 Understanding backpropagation through an example
-答案来源:[一文弄懂神经网络中的反向传播法——BackPropagation](http://www.cnblogs.com/charlotte77/p/5629865.html)
+A typical three-layer neural network looks like this:
-一个典型的三层神经网络如下所示:
+
-
+Here Layer $ L_1 $ is the input layer, Layer $ L_2 $ is the hidden layer and Layer $ L_3 $ is the output layer.
-其中 Layer $ L_1 $ 是输入层,Layer $ L_2 $ 是隐含层,Layer $ L_3 $ 是输出层。
+Suppose the input data set is $ D=\{x_1, x_2, ..., x_n\} $ and the output data set is $ y_1, y_2, ..., y_n $.
-假设输入数据集为 $ D={x_1, x_2, ..., x_n} $,输出数据集为 $ y_1, y_2, ..., y_n $。
-
-如果输入和输出是一样,即为自编码模型。如果原始数据经过映射,会得到不同与输入的输出。
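+If the output is the same as the input, the model is an autoencoder. If the raw data passes through a mapping, the output differs from the input.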
Suppose we have the following network:
-
+
-输入层包含神经元 $ i_1, i_2 $,偏置 $ b_1 $;隐含层包含神经元 $ h_1, h_2 $,偏置 $ b_2 $,输出层为 $ o_1, o_2 $,$ w_i $ 为层与层之间连接的权重,激活函数为 sigmoid 函数。对以上参数取初始值,如下图所示:
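+The input layer contains neurons $ i_1, i_2 $ and bias $ b_1 $; the hidden layer contains neurons $ h_1, h_2 $ and bias $ b_2 $; the output layer is $ o_1, o_2 $; $ w_i $ are the weights between layers; and the activation function is the sigmoid function. The parameters are given initial values as shown below: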
-
+
where:
- input data $ i1=0.05, i2 = 0.10 $;
- output data $ o1=0.01, o2=0.99 $;
- initial weights $ w1=0.15, w2=0.20, w3=0.25,w4=0.30, w5=0.40, w6=0.45, w7=0.50, w8=0.55 $
-- 目标:给出输入数据 $ i1,i2 $ (0.05和0.10),使输出尽可能与原始输出 $ o1,o2 $,(0.01和0.99)接近。
+- Goal: given the inputs $ i1,i2 $ ($0.05$ and $0.10$), make the outputs as close as possible to the target outputs $ o1,o2 $ ($0.01$ and $0.99$).
**Forward propagation**
1. Input layer --> hidden layer
-计算神经元 $ h1 $ 的输入加权和:
+Compute the weighted input sum of neuron $ h1 $:
$$
-net_{h1} = w_1 * i_1 + w_2 * i_2 + b_1 * 1
+net_{h1} = w_1 * i_1 + w_2 * i_2 + b_1 * 1\\
net_{h1} = 0.15 * 0.05 + 0.2 * 0.1 + 0.35 * 1 = 0.3775
$$
@@ -354,23 +319,27 @@ $$
2. Hidden layer --> output layer:
-计算输出层神经元 $ o1 $ 和 $ o2 $ 的值:
+Compute the values of output neurons $ o1 $ and $ o2 $:
$$
net_{o1} = w_5 * out_{h1} + w_6 * out_{h2} + b_2 * 1
+$$
+$$
net_{o1} = 0.4 * 0.593269992 + 0.45 * 0.596884378 + 0.6 * 1 = 1.105905967
+$$
+$$
out_{o1} = \frac{1}{1 + e^{-net_{o1}}} = \frac{1}{1 + e^{-1.105905967}} = 0.75136507
$$
-这样前向传播的过程就结束了,我们得到输出值为 $ [0.75136079 , 0.772928465] $,与实际值 $ [0.01 , 0.99] $ 相差还很远,现在我们对误差进行反向传播,更新权值,重新计算输出。
+This completes forward propagation. The output is $ [0.75136507 , 0.772928465] $, still far from the target $ [0.01 , 0.99] $. We now backpropagate the error, update the weights and recompute the output.
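The forward pass of this example is easy to reproduce numerically. The sketch below uses the inputs and weights listed above and the biases $ b_1 = 0.35 $, $ b_2 = 0.60 $ that appear in the weighted-sum formulas; it assumes, as is usual for this example, that $ w_3, w_4 $ feed $ h_2 $ and $ w_7, w_8 $ feed $ o_2 $, since that wiring is only shown in the figure.

```python
import math


def sigmoid(z):
    """Logistic activation used by every neuron in the example."""
    return 1.0 / (1.0 + math.exp(-z))


i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# Input layer --> hidden layer
net_h1 = w1 * i1 + w2 * i2 + b1
net_h2 = w3 * i1 + w4 * i2 + b1  # assumed wiring: w3, w4 feed h2
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)

# Hidden layer --> output layer
net_o1 = w5 * out_h1 + w6 * out_h2 + b2
net_o2 = w7 * out_h1 + w8 * out_h2 + b2  # assumed wiring: w7, w8 feed o2
out_o1, out_o2 = sigmoid(net_o1), sigmoid(net_o2)

print(net_h1, out_h1, out_h2)
print(net_o1, out_o1, out_o2)
```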
**Backpropagation**
-1. 计算总误差
+1. Compute the total error
-总误差:(square error)
+Total error (squared error is used here):
$$
E_{total} = \sum \frac{1}{2}(target - output)^2
@@ -379,16 +348,15 @@ $$
There are two outputs, so the errors of $ o1 $ and $ o2 $ are computed separately and the total error is their sum:
$E_{o1} = \frac{1}{2}(target_{o1} - out_{o1})^2
-= \frac{1}{2}(0.01 - 0.75136507)^2 = 0.274811083$
+= \frac{1}{2}(0.01 - 0.75136507)^2 = 0.274811083$.
-$E_{o2} = 0.023560026$
+$E_{o2} = 0.023560026$.
-$E_{total} = E_{o1} + E_{o2} = 0.274811083 + 0.023560026 = 0.298371109$
+$E_{total} = E_{o1} + E_{o2} = 0.274811083 + 0.023560026 = 0.298371109$.
+2. Weight update from hidden layer --> output layer:
-2. 隐含层 --> 输出层的权值更新:
-
-以权重参数 $ w5 $ 为例,如果我们想知道 $ w5 $ 对整体误差产生了多少影响,可以用整体误差对 $ w5 $ 求偏导求出:(链式法则)
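+Take weight $ w5 $ as an example. To find out how much $ w5 $ contributes to the overall error, take the partial derivative of the total error with respect to $ w5 $ (chain rule):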
$$
\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial out_{o1}} * \frac{\partial out_{o1}}{\partial net_{o1}} * \frac{\partial net_{o1}}{\partial w5}
@@ -396,7 +364,7 @@ $$
The figure below shows more intuitively how the error propagates backwards:
-
+
### 3.2.6 What is the point of making a neural network "deeper"?
@@ -409,18 +377,31 @@ $$
### 3.3.1 What is a hyperparameter?
-超参数:比如算法中的 learning rate (学习率)、iterations (梯度下降法循环的数量)、(隐藏层数目)、(隐藏层单元数目)、choice of activation function(激活函数的选择)都需要根据实际情况来设置,这些数字实际上控制了最后的参数和的值,所以它们被称作超参数。
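+**Hyperparameter**: in machine learning, a hyperparameter is a parameter whose value is set before the learning process begins, rather than one obtained through training. Hyperparameters usually need to be tuned: choosing a good set of them for the learner improves its performance.
+
+Hyperparameters typically:
+
+1. Define higher-level properties of the model, such as its complexity or capacity to learn.
+2. Cannot be learned directly from the data in the standard training process and must be defined in advance.
+3. Can be chosen by setting different values, training different models and keeping the values that test best.
+
+Concretely, the learning rate, the number of gradient-descent iterations, the number of hidden layers, the number of hidden units per layer and the choice of activation function all have to be set for the problem at hand. These settings in effect control the final values of the parameters $W$ and $b$, which is why they are called hyperparameters.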
### 3.3.2 How to find the optimal hyperparameter values?
-在使用机器学习算法时,总有一些难搞的超参数。例如权重衰减大小,高斯核宽度等等。算法不会设置这些参数,而是需要你去设置它们的值。设置的值对结果产生较大影响。常见设置超参数的做法有:
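+Machine learning algorithms always come with some hyperparameters that are hard to tune, such as the weight decay strength or the Gaussian kernel width. They have to be set by hand, and the chosen values strongly affect the result. Common ways of setting hyperparameters are: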
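1. Guess and check: pick parameters from experience or intuition and keep iterating.
+
2. Grid search: let the computer try a set of values spread evenly over a range.
+
3. Random search: let the computer pick a set of values at random.
+
4. Bayesian optimization: tune the hyperparameters with Bayesian optimization; the difficulty is that the Bayesian optimization algorithm itself has many parameters.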
-5. 在良好初始猜测的前提下进行局部优化:这就是 MITIE 的方法,它使用 BOBYQA 算法,并有一个精心选择的起始点。由于 BOBYQA 只寻找最近的局部最优解,所以这个方法是否成功很大程度上取决于是否有一个好的起点。在 MITIE 的情况下,我们知道一个好的起点,但这不是一个普遍的解决方案,因为通常你不会知道好的起点在哪里。从好的方面来说,这种方法非常适合寻找局部最优解。稍后我会再讨论这一点。
-6. 最新提出的 LIPO 的全局优化方法。这个方法没有参数,而且经验证比随机搜索方法好。
+
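+5. The MITIE approach: local optimization from a good initial guess. It uses the BOBYQA algorithm with a carefully chosen starting point. Because BOBYQA only finds the nearest local optimum, success depends heavily on having a good starting point. MITIE happens to know one, but this is not a general solution, since usually you do not know where a good starting point is. On the positive side, the method is very well suited to finding local optima.
+
+6. The recently proposed LIPO global optimization method. It has no parameters of its own and has been shown empirically to outperform random search.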
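Grid search and random search from the list above can be sketched as follows. Everything here is illustrative: `validation_loss` is a hypothetical stand-in for "train a model and measure its validation loss", and the search ranges are arbitrary.

```python
import itertools
import random


def validation_loss(learning_rate, weight_decay):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (learning_rate - 0.1) ** 2 + (weight_decay - 0.01) ** 2


def grid_search(lr_values, wd_values):
    """Evaluate every combination on a fixed grid and keep the best one."""
    return min(itertools.product(lr_values, wd_values),
               key=lambda p: validation_loss(*p))


def random_search(n_trials, seed=0):
    """Sample each hyperparameter log-uniformly in [1e-4, 1] and keep the best trial."""
    rng = random.Random(seed)
    trials = [(10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-4, 0))
              for _ in range(n_trials)]
    return min(trials, key=lambda p: validation_loss(*p))


best_grid = grid_search([0.001, 0.01, 0.1, 1.0], [0.0001, 0.001, 0.01, 0.1])
best_random = random_search(n_trials=16)
print(best_grid, best_random)
```

Both searches use the same budget of 16 evaluations here; which one does better in practice depends on how many of the hyperparameters actually matter.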
### 3.3.3 What is the general procedure for hyperparameter search?
@@ -504,7 +485,11 @@ $$
The derivatives of common activation functions are:
-
+| Function | Expression | Derivative | Notes |
+| --------------- | -------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| Sigmoid | $f(x)=\frac{1}{1+e^{-x}}$ | $f^{'}(x)=\frac{1}{1+e^{-x}}\left( 1- \frac{1}{1+e^{-x}} \right)=f(x)(1-f(x))$ | At $x=10$ or $x=-10$, $f^{'}(x) \approx 0$; at $x=0$, $f^{'}(x)=0.25$ |
+| Tanh | $f(x)=\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$ | $f^{'}(x)=1-(\tanh(x))^2$ | At $x=10$ or $x=-10$, $f^{'}(x) \approx 0$; at $x=0$, $f^{'}(x)=1$ |
+| ReLU | $f(x)=\max(0,x)$ | $f^{'}(x)=\begin{cases} 0,x<0 \\ 1,x>0 \\ undefined,x=0\end{cases}$ | At $x=0$ the derivative is usually taken to be either 0 or 1 |
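The standard derivatives of these three activations can be checked against a finite-difference estimate. This is only a sanity-check sketch; the helper names are made up, and returning 0 for the ReLU derivative at $x=0$ is one of the two usual conventions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


def relu(x):
    return max(0.0, x)


def d_relu(x):
    # The derivative at x = 0 is undefined; 0 is used here by convention.
    return 1.0 if x > 0 else 0.0


def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2 * h)


for name, f, df in [("sigmoid", sigmoid, d_sigmoid),
                    ("tanh", math.tanh, d_tanh),
                    ("relu", relu, d_relu)]:
    print(name, df(0.5), numeric_derivative(f, 0.5))
```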
### 3.4.4 What properties do activation functions have?
@@ -516,7 +501,7 @@ $$
### 3.4.5 How to choose an activation function?
-选择一个适合的激活函数并不容易,需要考虑很多因素,通常的做法是,如果不确定哪一个激活函数效果更好,可以把它们都试试,然后在验证集或者测试集上进行评价。然后看哪一种表现的更好,就去使用它。
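+Choosing a suitable activation function is not easy and many factors come into play. If you are unsure which one works better, the usual practice is to try them all, evaluate them on a validation or test set, and use whichever performs best.
Common choices are as follows: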
@@ -533,7 +518,7 @@ $$
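2. The derivatives of sigmoid and tanh approach 0 in both the positive and negative saturation regions, which causes vanishing gradients, whereas the derivative of ReLU and Leaky ReLU is constant for inputs greater than 0, so vanishing gradients do not occur there.
3. Note that on the negative half-axis ReLU has zero gradient, so the neuron is not trained there, which produces so-called sparsity; Leaky ReLU does not have this problem.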
-### 3.4.7什么时候可以用线性激活函数?
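+### 3.4.7 When can a linear activation function be used?
1. In the output layer, where linear activations are often used.
2. Hidden layers may occasionally use linear activations.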
@@ -548,16 +533,36 @@ Relu 激活函数图像如下:
From the plot, ReLU has the following characteristics:
1. One-sided inhibition;
+
2. A relatively wide excitation boundary;
+
3. Sparse activation;
-ReLU 函数从图像上看,是一个分段线性函数,把所有的负值都变为 0,而正值不变,这样就成为单侧抑制。
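+Seen from its plot, ReLU is a piecewise linear function that sets all negative values to 0 and leaves positive values unchanged; this is what one-sided inhibition means.
+
+It is this one-sided inhibition that gives the neurons in the network sparse activation.
+
+**Sparse activation**: from a signal point of view, a neuron responds selectively to only a small part of the input at any one time and a large part of the signal is deliberately masked, which can improve learning accuracy and extract sparse features better and faster. For $ x<0 $ ReLU saturates hard, while for $ x>0 $ there is no saturation. ReLU keeps the gradient from decaying for $ x>0 $, which mitigates the vanishing gradient problem.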
+
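+### 3.4.9 Softmax: definition and purpose
+
+Softmax is a function of the form:
+$$
+P(i) = \frac{exp(\theta_i^T x)}{\sum_{k=1}^{K} exp(\theta_k^T x)}
+$$
+where $ \theta_i $ and $ x $ are column vectors, and $ \theta_i^T x $ may be replaced by a function $ f_i(x) $ of $ x $.
+
+The softmax function keeps $ P(i) $ within $ [0,1] $. In regression and classification problems $ \theta $ is usually the parameter to be estimated, and the $ \theta_i $ that maximizes $ P(i) $ is taken as the best parameter.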
-因为有了这单侧抑制,才使得神经网络中的神经元也具有了稀疏激活性。
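+However, there are many ways to squeeze values into $ [0,1] $, so why use the exponential form with base $ e $? Consider the logistic function:
+$$
+P(i) = \frac{1}{1+exp(-\theta_i^T x)}
+$$
+This function drives $ P(i) $ towards 0 as its argument goes to negative infinity and towards 1 as it goes to positive infinity. Softmax uses the exponential for the same polarizing effect: results for positive samples approach 1 and results for negative samples approach 0. This is convenient for multi-class problems ($ P(i) $ can be read as the probability that the sample belongs to class $ i $). Softmax can be seen as a generalization of the logistic function.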
-**稀疏激活性**:从信号方面来看,即神经元同时只对输入信号的少部分选择性响应,大量信号被刻意的屏蔽了,这样可以提高学习的精度,更好更快地提取稀疏特征。当 $ x<0 $ 时,ReLU 硬饱和,而当 $ x>0 $ 时,则不存在饱和问题。ReLU 能够在 $ x>0 $ 时保持梯度不衰减,从而缓解梯度消失问题。
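+The softmax function takes its inputs, usually called logits or logit scores, maps each to a value between 0 and 1 and normalizes the outputs to sum to 1. Its output can therefore be read as a categorical probability distribution, which makes it the usual output activation for networks that predict multi-class problems.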
-### 3.4.9 Softmax 函数如何应用于多分类?
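+### 3.4.10 How is Softmax applied to multi-class classification?
In multi-class classification, softmax maps the outputs of several neurons into the interval $ (0,1) $ so that they can be interpreted as probabilities and used to pick a class.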
@@ -569,22 +574,24 @@ $$
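In the figure below, the network has an input layer followed by two feature layers, and a final softmax layer gives the probability of each outcome. With three classes, it yields the probabilities of $ y=0, y=1, y=2 $.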
-
+
-继续看下面的图,三个输入通过 softmax 后得到一个数组 $ [0.05 , 0.10 , 0.85] $,这就是 soft 的功能。
+In the next figure, the three inputs pass through softmax to give the array $ [0.05 , 0.10 , 0.85] $; this is what softmax does.
-
+
A more vivid picture of the mapping is shown below:
-
+
- softmax 直白来说就是将原来输出是 $ 3,1,-3 $ 通过 softmax 函数一作用,就映射成为 $ (0,1) $ 的值,而这些值的累和为 $ 1 $(满足概率的性质),那么我们就可以将它理解成概率,在最后选取输出结点的时候,我们就可以选取概率最大(也就是值对应最大的)结点,作为我们的预测目标!
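+Put plainly, softmax takes raw outputs such as $ 3,1,-3 $ and maps them to values in $ (0,1) $ that sum to $ 1 $ (so they satisfy the properties of a probability distribution). We can therefore read them as probabilities and, when choosing the output node, pick the one with the highest probability (the largest value) as the prediction.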
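As a small sketch, softmax applied to the raw outputs $ 3, 1, -3 $ looks like this. Subtracting the maximum before exponentiating is a common numerical-stability trick added here; it does not change the result.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtracting the max leaves the result unchanged."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([3.0, 1.0, -3.0])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

The three probabilities sum to 1 and the first entry is by far the largest, so class 0 is the prediction.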
-### 3.4.10 交叉熵代价函数定义及其求导推导。(贡献者:黄钦建-华南理工大学)
+### 3.4.11 Cross-entropy cost function: definition and derivative
+(**Contributor: 黄钦建, South China University of Technology**)
- 神经元的输出就是 a = σ(z),其中$z=\sum w_{j}i_{j}+b$是输⼊的带权和。
+
+The output of the neuron is $a = \sigma(z)$, where $z=\sum w_{j}x_{j}+b$ is the weighted sum of the inputs.
$C=-\frac{1}{n}\sum[ylna+(1-y)ln(1-a)]$
@@ -612,14 +619,18 @@ $\frac{\partial C}{\partial w_{j}}=\frac{1}{n}\sum x_{j}({\varsigma}(z)-y)$
In a similar way we can compute the partial derivative with respect to the bias. The details are omitted here, but you can easily verify that:
-$\frac{\partial C}{\partial b}=\frac{1}{n}\sum ({\varsigma}(z)-y)$
+$\frac{\partial C}{\partial b}=\frac{1}{n}\sum (\sigma(z)-y)$
Once again, this avoids the learning slowdown caused by the $\sigma'(z)$ term in the quadratic cost function.
-### 3.4.11 为什么Tanh收敛速度比Sigmoid快?(贡献者:黄钦建-华南理工大学)
+### 3.4.12 Why does Tanh converge faster than Sigmoid?
+
+**(Contributor: 黄钦建, South China University of Technology)**
-$tanh^{,}(x)=1-tanh(x)^{2}\in (0,1)$
+First look at the derivatives of the two functions:
+
+$\tanh'(x)=1-\tanh(x)^{2}\in (0,1]$
$s'(x)=s(x)(1-s(x))\in (0,\frac{1}{4}]$
@@ -642,11 +653,11 @@ Batch的选择,首先决定的是下降的方向。
### 3.5.2 Choosing the Batch_Size value
-假如每次只训练一个样本,即 Batch_Size = 1。线性神经元在均方误差代价函数的错误面是一个抛物面,横截面是椭圆。对于多层神经元、非线性网络,在局部依然近似是抛物面。此时,每次修正方向以各自样本的梯度方向修正,横冲直撞各自为政,难以达到收敛。
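+Suppose only one sample is used per update, i.e. Batch_Size = 1. For a linear neuron, the error surface of the mean-squared-error cost is a paraboloid with elliptical cross-sections; for multi-layer, nonlinear networks it is still approximately a paraboloid locally. In this setting each update follows the gradient of its own sample, so the updates pull in different directions and convergence is hard to reach.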
-既然 Batch_Size 为全数据集或者Batch_Size = 1都有各自缺点,可不可以选择一个适中的Batch_Size值呢?
+Since using the whole data set as the batch and using Batch_Size = 1 both have drawbacks, can we choose a moderate Batch_Size?
-此时,可采用批梯度下降法(Mini-batches Learning)。因为如果数据集足够充分,那么用一半(甚至少得多)的数据训练算出来的梯度与用全部数据训练出来的梯度是几乎一样的。
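+Yes: use mini-batch gradient descent (Mini-batches Learning). If the data set is large enough, the gradient computed from half of the data (or even far less) is almost the same as the gradient computed from all of it.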
### 3.5.3 Within a reasonable range, what are the benefits of increasing Batch_Size?
@@ -662,47 +673,49 @@ Batch的选择,首先决定的是下降的方向。
### 3.5.5 How does tuning Batch_Size actually affect training?
-1. Batch_Size 太小,模型表现效果极其糟糕(error飙升)。
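+1. If Batch_Size is too small, the model performs extremely poorly (the error soars).
2. As Batch_Size increases, the same amount of data is processed faster.
3. As Batch_Size increases, more and more epochs are needed to reach the same accuracy.
4. Because these two effects pull in opposite directions, there is some Batch_Size at which training time is optimal.
5. Because the final converged accuracy depends on which local extremum is reached, there is some Batch_Size at which the final accuracy is optimal.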
-### 3.5.6 受限于客观条件无法给足够的Batch Size怎么办?
-
-在极小的情况下(低于十),建议使用[Group Norm](https://arxiv.org/abs/1803.08494)。
-
## 3.6 Normalization
### 3.6.1 What does normalization mean?
-归一化的具体作用是归纳统一样本的统计分布性。归一化在 $ 0-1$ 之间是统计的概率分布,归一化在$ -1--+1$ 之间是统计的坐标分布。归一化有同一、统一和合一的意思。无论是为了建模还是为了计算,首先基本度量单位要同一,神经网络是以样本在事件中的统计分别几率来进行训练(概率计算)和预测的,且 sigmoid 函数的取值是 0 到 1 之间的,网络最后一个节点的输出也是如此,所以经常要对样本的输出归一化处理。归一化是统一在 $ 0-1 $ 之间的统计概率分布,当所有样本的输入信号都为正值时,与第一隐含层神经元相连的权值只能同时增加或减小,从而导致学习速度很慢。另外在数据中常存在奇异样本数据,奇异样本数据存在所引起的网络训练时间增加,并可能引起网络无法收敛。为了避免出现这种情况及后面数据处理的方便,加快网络学习速度,可以对输入信号进行归一化,使得所有样本的输入信号其均值接近于 0 或与其均方差相比很小。
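+1. It brings the statistical distribution of the samples onto a common scale. Normalizing to $ [0, 1] $ gives values that can be read as probabilities; normalizing to $ [-1, +1] $ gives values that can be read as coordinates.
+
+2. Whether for modelling or for computation, the basic unit of measurement must first be made the same. A neural network is trained (probability computation) and makes predictions based on the statistical probabilities of samples within events, and the sigmoid function takes values between 0 and 1, as does the output of the network's last node, so the sample outputs often need to be normalized.
+
+3. When the input signals of all samples are positive, the weights connected to the first hidden layer's neurons can only increase or decrease together, which makes learning slow.
+
+4. In addition, data often contain outlier samples, which lengthen training and may prevent the network from converging. To avoid this, to make later data processing easier and to speed up learning, the input signals can be normalized so that their mean over all samples is close to 0 or small compared with their standard deviation.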
### 3.6.2 Why normalize?
1. To make later data processing easier; normalization does avoid some unnecessary numerical problems.
-2. 为了程序运行时收敛加快。 下面图解。
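+2. To make the program converge faster when it runs.
3. Common scale. Features are measured against different standards, so they need to be brought onto a common, unit-free scale with a unified standard. This is a requirement at the application level.
4. To avoid neuron saturation. When a neuron's activation is close to 0 or 1 it saturates; in these regions the gradient is almost 0, so during backpropagation the local gradient is close to 0, which effectively "kills" the gradient.
5. To keep small values in the output data from being swamped.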
### 3.6.3 Why does normalization speed up finding the optimum?
-
+
-上图是代表数据是否均一化的最优解寻解过程(圆圈可以理解为等高线)。左图表示未经归一化操作的寻解过程,右图表示经过归一化后的寻解过程。
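+The figure above shows the search for the optimum with and without normalized data (the circles can be read as contour lines). The left plot shows the search without normalization, the right plot the search after normalization.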
-当使用梯度下降法寻求最优解时,很有可能走“之字型”路线(垂直等高线走),从而导致需要迭代很多次才能收敛;而右图对两个原始特征进行了归一化,其对应的等高线显得很圆,在梯度下降进行求解时能较快的收敛。
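+When gradient descent is used, the unnormalized case is very likely to follow a zigzag path (moving perpendicular to the contour lines), so many iterations are needed to converge. In the right plot the two original features have been normalized, the contours are much rounder, and gradient descent converges faster.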
-因此如果机器学习模型使用梯度下降法求最优解时,归一化往往非常有必要,否则很难收敛甚至不能收敛。
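+So when a machine learning model is optimized with gradient descent, normalization is often essential; without it convergence is hard or may not happen at all.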
### 3.6.4 A 3D illustration of the unnormalized case
Example:
-假设 $ w1 $ 的范围在 $ [-10, 10] $,而 $ w2 $ 的范围在 $ [-100, 100] $,梯度每次都前进 1 单位,那么在 $ w1 $ 方向上每次相当于前进了 $ 1/20 $,而在 $ w2 $ 上只相当于 $ 1/200 $!某种意义上来说,在 $ w2 $ 上前进的步长更小一些,而 $ w1 $ 在搜索过程中会比 $ w2 $ “走”得更快。
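+Suppose $ w1 $ ranges over $ [-10, 10] $ and $ w2 $ over $ [-100, 100] $, and every gradient step advances by 1 unit. Then each step covers $ 1/20 $ of the range in the $ w1 $ direction but only $ 1/200 $ in the $ w2 $ direction. In a sense the step is smaller along $ w2 $, and the search "moves" faster along $ w1 $ than along $ w2 $.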
-这样会导致,在搜索过程中更偏向于 $ w1 $ 的方向。走出了“L”形状,或者成为“之”字形。
+As a result the search is biased towards the $ w1 $ direction and traces out an "L" shape or a zigzag.

@@ -714,9 +727,9 @@ $$
x^{\prime} = \frac{x-min(x)}{max(x) - min(x)}
$$
-适用范围:比较适用在数值比较集中的情况。
+Applicable range: best suited to data whose values are fairly concentrated.
-缺点:如果 max 和 min 不稳定,很容易使得归一化结果不稳定,使得后续使用效果也不稳定。
+Drawback: if max and min are unstable, the normalized result is unstable too, and so is everything built on it.
2. Standard-deviation standardization
@@ -724,30 +737,28 @@ $$
x^{\prime} = \frac{x-\mu}{\sigma}
$$
-含义:经过处理的数据符合标准正态分布,即均值为 0,标准差为 1 其中 $ \mu $ 为所有样本数据的均值,$ \sigma $ 为所有样本数据的标准差。
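+Meaning: the processed data have mean 0 and standard deviation 1 (they follow a standard normal distribution only if the original data were normally distributed), where $ \mu $ is the mean of all sample data and $ \sigma $ is their standard deviation.
3. Nonlinear normalization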
-适用范围:经常用在数据分化比较大的场景,有些数值很大,有些很小。通过一些数学函数,将原始值进行映射。该方法包括 $ log $、指数,正切等。
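+Applicable range: often used when the data are widely spread, with some values very large and others very small. The raw values are mapped through a mathematical function such as $ \log $, an exponential or the tangent.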
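The three kinds of normalization can be sketched with plain Python lists. The function names are hypothetical, the population standard deviation is assumed for standardization, and `log_normalize` is just one possible nonlinear mapping (it assumes all values are at least 1 and the maximum is greater than 1).

```python
import math


def min_max_normalize(values):
    """Linear rescaling to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score standardization: (x - mean) / std, using the population std."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("zero standard deviation; standardization is undefined")
    return [(v - mean) / std for v in values]


def log_normalize(values):
    """One nonlinear option: log10(x) / log10(max), for values >= 1 and max > 1."""
    top = math.log10(max(values))
    return [math.log10(v) / top for v in values]


data = [1.0, 10.0, 100.0, 1000.0]  # widely spread, hypothetical data
print(min_max_normalize(data))
print(standardize(data))
print(log_normalize(data))
```

On this widely spread data, min-max scaling pushes the three smallest values close to 0, while the logarithmic mapping spaces them evenly.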
### 3.6.6 What does local response normalization do?
-LRN 是一种提高深度学习准确度的技术方法。LRN 一般是在激活、池化函数后的一种方法。
-
-在 ALexNet 中,提出了 LRN 层,对局部神经元的活动创建竞争机制,使其中响应比较大对值变得相对更大,并抑制其他反馈较小的神经元,增强了模型的泛化能力。
+LRN is a technique for improving the accuracy of deep learning models. It is generally applied after the activation and pooling functions.
-### 3.6.7理解局部响应归一化公式
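+AlexNet introduced the LRN layer, which creates competition among the activities of neighbouring neurons: relatively large responses become relatively larger while neurons with smaller responses are suppressed, which improves the model's generalization.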
-答案来源:[深度学习的局部响应归一化LRN(Local Response Normalization)理解](https://blog.csdn.net/yangdashi888/article/details/77918311)
+### 3.6.7 Understanding local response normalization
-局部响应归一化原理是仿造生物学上活跃的神经元对相邻神经元的抑制现象(侧抑制),根据论文其公式如下:
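+Local response normalization imitates the biological phenomenon in which an active neuron inhibits its neighbours (lateral inhibition). Its formula is: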
$$
b_{x,y}^i = a_{x,y}^i / (k + \alpha \sum_{j=max(0, i-n/2)}^{min(N-1, i+n/2)}(a_{x,y}^j)^2 )^\beta
$$
where:
-1) $ a $:表示卷积层(包括卷积操作和池化操作)后的输出结果,是一个四维数组[batch,height,width,channel]。
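+1) $ a $: the output of the convolutional layer (including the convolution and pooling operations), a four-dimensional array [batch,height,width,channel].
- batch: the batch index (each entry in the batch is one image).
- height: image height.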
@@ -760,17 +771,17 @@ $$
4) $ a $, $ n/2 $ and $ k $ correspond to input, depth_radius and bias in the function. $ k, n, \alpha, \beta $ are all hyperparameters, typically set to $ k=2, n=5, \alpha=10^{-4}, \beta=0.75 $.
-5) $ \sum $:$ \sum $ 叠加的方向是沿着通道方向的,即每个点值的平方和是沿着 $ a $ 中的第 3 维 channel 方向的,也就是一个点同方向的前面 $ n/2 $ 个通道(最小为第 $ 0 $ 个通道)和后 $ n/2 $ 个通道(最大为第 $ d-1 $ 个通道)的点的平方和(共 $ n+1 $ 个点)。而函数的英文注解中也说明了把 input 当成是 $ d $ 个 3 维的矩阵,说白了就是把 input 的通道数当作 3 维矩阵的个数,叠加的方向也是在通道方向。
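+5) $ \sum $: the summation runs along the channel direction, i.e. the squared values at each point are summed along the channel axis of $ a $ (axis 3 counting from 0): the point itself together with the same point in the preceding $ n/2 $ channels (down to channel $ 0 $) and the following $ n/2 $ channels (up to channel $ d-1 $), $ n+1 $ points in total. The function's English documentation likewise treats input as $ d $ three-dimensional arrays; put simply, the number of channels of input is the number of 3-D arrays, and the summation runs across them.
A simple illustration: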
-
+
### 3.6.8 What is Batch Normalization?
-以前在神经网络训练中,只是对输入层数据进行归一化处理,却没有在中间层进行归一化处理。要知道,虽然我们对输入数据进行了归一化处理,但是输入数据经过 $ \sigma(WX+b) $ 这样的矩阵乘法以及非线性运算之后,其数据分布很可能被改变,而随着深度网络的多层运算之后,数据分布的变化将越来越大。如果我们能在网络的中间也进行归一化处理,是否对网络的训练起到改进作用呢?答案是肯定的。
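+Previously, only the input-layer data were normalized when training neural networks; the intermediate layers were not. But even with normalized inputs, after matrix multiplication and a nonlinearity of the form $ \sigma(WX+b) $ the data distribution is likely to change, and the change grows as the data pass through the many layers of a deep network. Would normalizing in the middle of the network as well improve training? The answer is yes.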
-这种在神经网络中间层也进行归一化处理,使训练效果更好的方法,就是批归一化Batch Normalization(BN)。
+Normalizing the intermediate layers of a network in this way to improve training is Batch Normalization (BN).
### 3.6.9 Advantages of the Batch Normalization (BN) algorithm
@@ -795,7 +806,7 @@ $$
\mu_{\beta} = \frac{1}{m} \sum_{i=1}^m(x_i)
$$
-其中,$ m $ 是此次训练样本 batch 的大小。
+where $ m $ is the size of the current training batch.
2. Compute the standard deviation of the previous layer's output
@@ -821,19 +832,23 @@ $$
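Note: the above is the BN procedure during training. At inference time there is often only a single input sample, so there is no batch mean $ \mu_{\beta} $ or variance $ \sigma_{\beta}^2 $. Instead, the mean is taken as the average of $ \mu_{\beta} $ over all training batches, and the variance as the unbiased estimate computed from the per-batch $ \sigma_{\beta}^2 $.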
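A minimal sketch of BN for a single feature is given below. It assumes the usual form of the algorithm (batch mean, batch variance, normalization, then a learnable scale and shift); the names `gamma`, `beta` and `eps` and their default values are assumptions of this sketch rather than symbols defined in the text above.

```python
import math


def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature over a mini-batch (training mode)."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized], mean, var


def batch_norm_infer(x, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference mode: use statistics collected during training."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta


out, mean, var = batch_norm_train([1.0, 2.0, 3.0, 4.0])
single = batch_norm_infer(2.5, running_mean=mean, running_var=var)
print(out, mean, var, single)
```

In a real network the running statistics would be accumulated over many batches; here the statistics of one batch stand in for them.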
-### 3.6.11 批归一化和群组归一化
+### 3.6.11 Batch Normalization versus Group Normalization
-批量归一化(Batch Normalization,以下简称 BN)是深度学习发展中的一项里程碑式技术,可让各种网络并行训练。但是,批量维度进行归一化会带来一些问题——批量统计估算不准确导致批量变小时,BN 的误差会迅速增加。在训练大型网络和将特征转移到计算机视觉任务中(包括检测、分割和视频),内存消耗限制了只能使用小批量的 BN。
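+| Name | Characteristics |
+| ---------------------------------------------- | :----------------------------------------------------------- |
+| Batch Normalization (BN) | Enables a wide variety of networks to be trained. However, normalizing along the batch dimension has a drawback: batch statistics become inaccurate as the batch gets smaller, so BN's error rises rapidly for small batches. When training large networks or transferring features to computer vision tasks (detection, segmentation, video), memory limits force BN to use small batches. |
+| Group Normalization (GN) | GN divides the channels into groups and computes the normalization mean and variance within each group. Its computation is independent of batch size, and its accuracy is stable across a wide range of batch sizes. |
+| Comparison | For ResNet-50 trained on ImageNet, GN with a batch size of 2 has a 10.6% lower error than BN; with typical batch sizes GN is comparable to BN and better than other normalization variants. GN also transfers naturally from pre-training to fine-tuning. On COCO object detection and segmentation and on Kinetics video classification GN outperforms its BN-based counterparts, indicating that GN can effectively replace BN in a variety of tasks. |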
-何恺明团队在[群组归一化(Group Normalization)](http://tech.ifeng.com/a/20180324/44918599_0.shtml) 中提出群组归一化 Group Normalization (简称 GN) 作为 BN 的替代方案。
+### 3.6.12 Weight Normalization versus Batch Normalization
-GN 将通道分成组,并在每组内计算归一化的均值和方差。GN 的计算与批量大小无关,并且其准确度在各种批量大小下都很稳定。在 ImageNet 上训练的 ResNet-50上,GN 使用批量大小为 2 时的错误率比 BN 的错误率低 10.6% ;当使用典型的批量时,GN 与 BN 相当,并且优于其他标归一化变体。而且,GN 可以自然地从预训练迁移到微调。在进行 COCO 中的目标检测和分割以及 Kinetics 中的视频分类比赛中,GN 可以胜过其竞争对手,表明 GN 可以在各种任务中有效地取代强大的 BN。
+Weight Normalization and Batch Normalization are both reparameterization methods; they differ only in how they do it.
-### 3.6.12 Weight Normalization和Batch Normalization
+Weight Normalization normalizes the network weights $ W $, hence the name;
-答案来源:[Weight Normalization 相比batch Normalization 有什么优点呢?](https://www.zhihu.com/question/55132852/answer/171250929)
+Batch Normalization normalizes the input data of a given layer.
-Weight Normalization 和 Batch Normalization 都属于参数重写(Reparameterization)的方法,只是采用的方式不同,Weight Normalization 是对网络权值$ W $ 进行 normalization,因此也称为 Weight Normalization;Batch Normalization 是对网络某一层输入数据进行 normalization。Weight Normalization相比Batch Normalization有以下三点优势:
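+Compared with Batch Normalization, Weight Normalization has three advantages:
1. Weight Normalization speeds up convergence by reparameterizing the weights W of the network and introduces no dependence on the mini-batch, so it is applicable to RNNs (LSTMs). Batch Normalization cannot be applied directly to RNNs because 1) the sequences an RNN processes have variable length, and 2) an RNN computes per time step, so using Batch Normalization directly would require storing the mini-batch mean and variance at every time step, which is inefficient and memory-hungry.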
@@ -841,18 +856,18 @@ Weight Normalization 和 Batch Normalization 都属于参数重写(Reparameter
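3. No extra storage is needed for mini-batch means and variances, and the extra cost Weight Normalization adds to the forward pass and the backward gradient computation is small, so it is faster than normalizing with Batch Normalization. However, Weight Normalization does not have Batch Normalization's effect of keeping each layer's output Y within a fixed range, so the initial parameter values must be chosen with particular care when using it.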
-### 3.6.13 Batch Normalization在什么时候用比较合适?(贡献者:黄钦建-华南理工大学)
+### 3.6.13 When is Batch Normalization appropriate?
+
+**(Contributor: 黄钦建, South China University of Technology)**
-在CNN中,BN应作用在非线性映射前。在神经网络训练时遇到收敛速度很慢,或梯度爆炸等无法训练的状况时可以尝试BN来解决。另外,在一般使用情况下也可以加入BN来加快训练速度,提高模型精度。
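+In a CNN, BN should be applied before the nonlinearity. When training converges very slowly, or gradients explode and training becomes impossible, BN is worth trying. It can also be added in ordinary cases to speed up training and improve accuracy.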
-BN比较适用的场景是:每个mini-batch比较大,数据分布比较接近。在进行训练之前,要做好充分的shuffle,否则效果会差很多。另外,由于BN需要在运行过程中统计每个mini-batch的一阶统计量和二阶统计量,因此不适用于动态的网络结构和RNN网络。
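+BN works best when each mini-batch is fairly large and the data distributions of the mini-batches are similar. The data should be thoroughly shuffled before training, otherwise results are much worse. Also, because BN has to collect first- and second-order statistics of every mini-batch at run time, it is not suitable for dynamic network structures or RNNs.
## 3.7 Pre-training and fine tuning
### 3.7.1 Why does unsupervised pre-training help deep learning?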
-答案来源:[为什么无监督的预训练可以帮助深度学习](http://blog.csdn.net/Richard_More/article/details/52334272?locationNum=3&fps=1)
-
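Problems with deep networks:
1. The deeper the network, the more training samples it needs. Supervised training then requires a large number of labelled samples; otherwise a small sample easily leads to overfitting. Deep networks have many features, and the resulting issues mainly concern sample size, regularization and feature selection.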
@@ -863,21 +878,21 @@ BN比较适用的场景是:每个mini-batch比较大,数据分布比较接
**Solution:**
-逐层贪婪训练,无监督预训练(unsupervised pre-training)即训练网络的第一个隐藏层,再训练第二个…最后用这些训练好的网络参数值作为整体网络参数的初始值。
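+Greedy layer-wise training: unsupervised pre-training trains the first hidden layer of the network, then the second, and so on, and finally uses the trained parameter values as the initial values for the whole network.
Pre-training eventually leads to a fairly good local optimum.
### 3.7.2 What is fine tuning?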
-用别人的参数、修改后的网络和自己的数据进行训练,使得参数适应自己的数据,这样一个过程,通常称之为微调(fine tuning).
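+Training with someone else's parameters, a modified network and your own data, so that the parameters adapt to your data, is usually called fine tuning.
**An example of fine tuning:**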
-我们知道,CNN 在图像识别这一领域取得了巨大的进步。如果想将 CNN 应用到我们自己的数据集上,这时通常就会面临一个问题:通常我们的 dataset 都不会特别大,一般不会超过 1 万张,甚至更少,每一类图片只有几十或者十几张。这时候,直接应用这些数据训练一个网络的想法就不可行了,因为深度学习成功的一个关键性因素就是大量带标签数据组成的训练集。如果只利用手头上这点数据,即使我们利用非常好的网络结构,也达不到很高的 performance。这时候,fine-tuning 的思想就可以很好解决我们的问题:我们通过对 ImageNet 上训练出来的模型(如CaffeNet,VGGNet,ResNet) 进行微调,然后应用到我们自己的数据集上。
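+CNNs have made enormous progress in image recognition. When we want to apply a CNN to our own data set, we usually face a problem: our dataset is typically not very large, generally no more than 10,000 images and often fewer, with perhaps only a few dozen or a dozen images per class. Training a network directly on such data is not feasible, because a key factor in the success of deep learning is a training set made up of a large amount of labelled data. With only this little data, even a very good architecture will not reach high performance. Fine-tuning solves this nicely: we take a model trained on ImageNet (such as CaffeNet, VGGNet or ResNet), fine-tune it, and apply it to our own data set.
### 3.7.3 Are network parameters updated during fine tuning?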
-会更新。
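+Answer: yes.
1. Fine tuning amounts to continuing training; the difference from training from scratch lies in the initialization.
2. Training from scratch initializes the network in the way specified by the network definition.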
@@ -906,24 +921,34 @@ BN比较适用的场景是:每个mini-batch比较大,数据分布比较接
### 3.8.2 Initializing everything to the same value
-偏差初始化陷阱: 都初始化为一样的值。
-以一个三层网络为例:
+An initialization pitfall: setting everything to the same value.
+Take a three-layer network as an example:
First, its structure:
-
+
Its expressions are:
$$
a_1^{(2)} = f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})
+$$
+$$
a_2^{(2)} = f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})
+$$
+$$
a_3^{(2)} = f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})
+$$
+$$
h_{W,b}(x) = a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})
$$
+
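If every weight is the same, then in a multi-layer network the inputs to every layer from the second layer onward are identical, i.e. $ a_1=a_2=a_3=\dots $. Since they are all the same, the layer behaves like a single unit. Why?
In backpropagation (see the discussion above if this is unclear), the partial derivatives used to update the weights and biases are: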
@@ -934,7 +959,7 @@ $$
\frac{\partial}{\partial b_{i}^{(l)}} J(W,b;x,y) = \delta_i^{(l+1)}
$$
-$ \delta $ 的计算公式
+The formula for $ \delta $ is:
$$
\delta_i^{(l)} = (\sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)} ) f^{\prime}(z_i^{(l)})
@@ -951,105 +976,117 @@ $$
### 3.8.3 Initializing to small random numbers
-将权重初始化为很小的数字是一个普遍的打破网络对称性的解决办法。这个想法是,神经元在一开始都是随机的、独一无二的,所以它们会计算出不同的更新,并将自己整合到整个网络的各个部分。一个权重矩阵的实现可能看起来像 $ W=0.01∗np.random.randn(D,H) $,其中 randn 是从均值为 0 的单位标准高斯分布进行取样。通过这个公式(函数),每个神经元的权重向量初始化为一个从多维高斯分布取样的随机向量,所以神经元在输入空间中指向随机的方向(so the neurons point in random direction in the input space). 应该是指输入空间对于随机方向有影响)。其实也可以从均匀分布中来随机选取小数,但是在实际操作中看起来似乎对最后的表现并没有太大的影响。
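+Initializing the weights to very small numbers is a common way to break the symmetry of the network. The idea is that the neurons all start out random and unique, so they compute different updates and integrate themselves into the network as different parts of it. An implementation of a weight matrix might look like $ W=0.01∗np.random.randn(D,H) $, where randn samples from a zero-mean, unit-standard-deviation Gaussian. With this formula each neuron's weight vector is initialized as a random vector drawn from a multidimensional Gaussian, so the neurons point in random directions in the input space. Small numbers can also be drawn from a uniform distribution, but in practice this seems to make little difference to the final performance.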
-备注:警告:并不是数字越小就会表现的越好。比如,如果一个神经网络层的权重非常小,那么在反向传播算法就会计算出很小的梯度(因为梯度 gradient 是与权重成正比的)。在网络不断的反向传播过程中将极大地减少“梯度信号”,并可能成为深层网络的一个需要注意的问题。
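+Note: smaller is not always better. If the weights of a layer are very small, backpropagation computes very small gradients for the signal flowing back through that layer (this gradient is proportional to the weights). This greatly diminishes the "gradient signal" as it propagates back through the network and can become a concern for deep networks.
### 3.8.4 Calibrating the variance with $ 1/\sqrt n $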
-上述建议的一个问题是,随机初始化神经元的输出的分布有一个随输入量增加而变化的方差。结果证明,我们可以通过将其权重向量按其输入的平方根(即输入的数量)进行缩放,从而将每个神经元的输出的方差标准化到 1。也就是说推荐的启发式方法 (heuristic) 是将每个神经元的权重向量按下面的方法进行初始化: $ w=np.random.randn(n)/\sqrt n $,其中 n 表示输入的数量。这保证了网络中所有的神经元最初的输出分布大致相同,并在经验上提高了收敛速度。
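+One problem with the suggestion above is that the variance of the output of a randomly initialized neuron grows with the number of inputs. It turns out that each neuron's output variance can be normalized to 1 by scaling its weight vector by the square root of its number of inputs. The recommended heuristic is therefore to initialize each neuron's weight vector as $ w=np.random.randn(n)/\sqrt n $, where n is the number of inputs. This ensures that all neurons in the network initially have roughly the same output distribution and empirically improves the rate of convergence.
### 3.8.5 Sparse Initialization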
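The effect of the $ 1/\sqrt n $ scaling can be checked empirically. The sketch below uses only the standard library in place of `np.random.randn`; the helper names, the layer sizes and the sample count are arbitrary choices, and unit-Gaussian inputs are assumed.

```python
import math
import random


def init_small_random(n_in, n_out, scale=0.01, seed=0):
    """W = scale * randn(n_in, n_out): small Gaussian weights to break symmetry."""
    rng = random.Random(seed)
    return [[scale * rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]


def init_scaled(n_in, n_out, seed=0):
    """W = randn(n_in, n_out) / sqrt(n_in): keeps the pre-activation variance near 1."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / math.sqrt(n_in) for _ in range(n_out)]
            for _ in range(n_in)]


def pre_activation_variance(weights, n_samples=1000, seed=1):
    """Empirical variance of z = sum_i w_i x_i for the first neuron, unit-Gaussian x."""
    rng = random.Random(seed)
    n_in = len(weights)
    zs = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
        zs.append(sum(weights[i][0] * x[i] for i in range(n_in)))
    mean = sum(zs) / n_samples
    return sum((z - mean) ** 2 for z in zs) / n_samples


var_small = pre_activation_variance(init_small_random(256, 4))
var_scaled = pre_activation_variance(init_scaled(256, 4))
print(var_small, var_scaled)
```

With the fixed 0.01 scale the variance depends on the number of inputs, whereas the scaled version stays close to 1 whatever `n_in` is.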
-另一种解决未校准方差问题的方法是把所有的权重矩阵都设为零,但是为了打破对称性,每个神经元都是随机连接地(从如上面所介绍的一个小的高斯分布中抽取权重)到它下面的一个固定数量的神经元。一个典型的神经元连接的数目可能是小到 10 个。
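+Another way to address the uncalibrated variance problem is to set all weight matrices to zero but, to break symmetry, connect each neuron randomly (with weights drawn from a small Gaussian as above) to a fixed number of neurons below it. A typical number of connections per neuron may be as small as 10.
### 3.8.6 Initializing the biases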
-将偏差初始化为零是可能的,也是很常见的,因为非对称性破坏是由权重的小随机数导致的。因为 ReLU 具有非线性特点,所以有些人喜欢使用将所有的偏差设定为小的常数值如 0.01,因为这样可以确保所有的 ReLU 单元在最开始就激活触发(fire)并因此能够获得和传播一些梯度值。然而,这是否能够提供持续的改善还不太清楚(实际上一些结果表明这样做反而使得性能更加糟糕),所以更通常的做法是简单地将偏差初始化为 0.
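+It is possible and common to initialize the biases to zero, since symmetry breaking is provided by the small random weights. For ReLU nonlinearities, some people like to set all biases to a small constant such as 0.01, because this ensures that all ReLU units fire at the start and therefore obtain and propagate some gradient. However, it is not clear that this gives a consistent improvement (some results even suggest it makes performance worse), so it is more common to simply initialize the biases to 0.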
-## 3.9 Softmax
+## 3.9 Learning rate
-### 3.9.1 Softmax 定义及作用
+### 3.9.1 The role of the learning rate
-Softmax 是一种形如下式的函数:
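+In machine learning, supervised learning defines a model and estimates its optimal parameters from the training data. Gradient descent is a widely used parameter optimization algorithm for minimizing model error. It estimates the model's parameters over many iterations, reducing the cost function at each step. The learning rate controls how fast the model learns during these iterations.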
-$$
-P(i) = \frac{exp(\theta_i^T x)}{\sum_{k=1}^{K} exp(\theta_i^T x)}
-$$
-
-其中,$ \theta_i $ 和 $ x $ 是列向量,$ \theta_i^T x $ 可能被换成函数关于 $ x $ 的函数 $ f_i(x) $
-
-通过 softmax 函数,可以使得 $ P(i) $ 的范围在 $ [0,1] $ 之间。在回归和分类问题中,通常 $ \theta $ 是待求参数,通过寻找使得 $ P(i) $ 最大的 $ \theta_i $ 作为最佳参数。
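+In plain gradient descent a single fixed learning rate is given, and every update in the whole optimization uses the same step size. With learning-rate decay, the learning rate is large early on, so the steps are long and descent is fast; later in the optimization the learning rate is gradually reduced, shortening the steps, which helps the algorithm converge and get closer to the optimum. How to update the learning rate has therefore become a focus of research.
+Common learning-rate decay schedules used in model optimization are: piecewise constant decay, polynomial decay, exponential decay, natural exponential decay, cosine decay, linear cosine decay and noisy linear cosine decay.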
-但是,使得范围在 $ [0,1] $ 之间的方法有很多,为啥要在前面加上以 $ e $ 的幂函数的形式呢?参考 logistic 函数:
+### 3.9.2 Common parameters of learning-rate decay
-$$
-P(i) = \frac{1}{1+exp(-\theta_i^T x)}
-$$
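+| Parameter | Description |
+| ----------------- | -------------------------------------------------- |
+| learning_rate | Initial learning rate |
+| global_step | Global step used in the decay computation; non-negative; drives the decay exponent step by step |
+| decay_steps | Number of decay steps; must be positive; determines the decay period |
+| decay_rate | Decay rate |
+| end_learning_rate | The minimum final learning rate |
+| cycle | Whether the learning rate rises again after it has decayed |
+| alpha | Minimum learning rate, as a fraction of learning_rate |
+| num_periods | Number of periods in the cosine part of the decay |
+| initial_variance | Initial variance of the noise |
+| variance_decay | Decay of the noise variance |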
-这个函数的作用就是使得 $ P(i) $ 在负无穷到 0 的区间趋向于 0, 在 0 到正无穷的区间趋向 1,。同样 softmax 函数加入了 $ e $ 的幂函数正是为了两极化:正样本的结果将趋近于 1,而负样本的结果趋近于 0。这样为多类别提供了方便(可以把 $ P(i) $ 看做是样本属于类别的概率)。可以说,Softmax 函数是 logistic 函数的一种泛化。
+### 3.9.3 Piecewise constant decay
-softmax 函数可以把它的输入,通常被称为 logits 或者 logit scores,处理成 0 到 1 之间,并且能够把输出归一化到和为 1。这意味着 softmax 函数与分类的概率分布等价。它是一个网络预测多酚类问题的最佳输出激活函数。
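+Piecewise constant decay requires intervals of training steps to be defined in advance, with a different constant learning rate for each interval. The learning rate usually starts larger and becomes smaller and smaller. The interval length should be set according to the amount of data: the more samples, the shorter the intervals. The figure below shows the learning rate under piecewise constant decay; the horizontal axis is the training step and the vertical axis the learning rate.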
-### 3.9.2 Softmax 推导
+
-## 3.10 理解 One Hot Encodeing 原理及作用?
+### 3.9.4 Exponential decay
-问题由来
+The learning rate is updated by exponential decay, so that it depends exponentially on the training step. The update rule is:
+$$
+decayed{\_}learning{\_}rate =learning{\_}rate*decay{\_}rate^{\frac{global{\_step}}{decay{\_}steps}}
+$$
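+This schedule is simple and direct and converges quickly; it is the most commonly used learning-rate decay. In the figure below, the green curve is exponential decay of the learning rate with the training step, and the red curve is piecewise constant decay, which keeps the learning rate fixed within each training interval.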
-在很多**机器学习**任务中,特征并不总是连续值,而有可能是分类值。
+
-例如,考虑一下的三个特征:
+### 3.9.5 Natural exponential decay
-```
-["male", "female"] ["from Europe", "from US", "from Asia"]
-["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]
-```
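+It is similar to exponential decay, except that the base of the decay is $e$. For typical decay rates it therefore decays faster, and it is generally used for networks that are relatively easy to train so that they converge sooner. The update rule is:
+$$
+decayed{\_}learning{\_}rate =learning{\_}rate*e^{-decay{\_}rate*\frac{global{\_}step}{decay{\_}steps}}
+$$
+The figure below compares piecewise constant decay, exponential decay and natural exponential decay. The red staircase curve is piecewise constant decay, the blue curve is exponential decay and the green curve is natural exponential decay. The learning rate clearly decays faster under natural exponential decay than under ordinary exponential decay, which helps faster convergence.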
-如果将上述特征用数字表示,效率会高很多。例如:
+
-```
-["male", "from US", "uses Internet Explorer"] 表示为 [0, 1, 3]
-["female", "from Asia", "uses Chrome"] 表示为 [1, 2, 1]
-```
+### 3.9.6 Polynomial decay
-但是,即使转化为数字表示后,上述数据也不能直接用在我们的分类器中。因为,分类器往往默认数据数据是连续的(可以计算距离?),并且是有序的(而上面这个 0 并不是说比 1 要高级)。但是,按照我们上述的表示,数字并不是有序的,而是随机分配的。
+Polynomial decay is given an initial learning rate and a minimum learning rate, and decays the learning rate from the initial value to the minimum according to the rule below.
+$$
+global{\_}step=min(global{\_}step,decay{\_}steps)
+$$
-**独热编码**
+$$
+decayed{\_}learning{\_}rate =(learning{\_}rate-end{\_}learning{\_}rate)* \left( 1-\frac{global{\_step}}{decay{\_}steps}\right)^{power} \\
+ +end{\_}learning{\_}rate
+$$
-为了解决上述问题,其中一种可能的解决方法是采用独热编码(One-Hot Encoding)。独热编码即 One-Hot 编码,又称一位有效编码,其方法是使用N位状态寄存器来对 N 个状态进行编码,每个状态都由他独立的寄存器位,并且在任意时候,其中只有一位有效。
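+Note that there are two modes. In the first, once the minimum learning rate is reached it is used for the rest of training. In the second, the learning rate is raised again: decay_steps is replaced by the first multiple of decay_steps that is at least global_step, as in the formula below. This prevents the network from getting stuck oscillating around a local minimum late in training because the learning rate is too small; raising the learning rate again lets it jump out of the local minimum.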
+$$
+decay{\_}steps = decay{\_}steps*ceil \left( \frac{global{\_}step}{decay{\_}steps}\right)
+$$
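+In the figure below, the red curve keeps the learning rate at the minimum once it is reached, and the green curve rises and falls cyclically after reaching the minimum.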
-例如:
+
-```
-自然状态码为:000,001,010,011,100,101
-独热编码为:000001,000010,000100,001000,010000,100000
-```
+### 3.9.7 Cosine decay
-可以这样理解,对于每一个特征,如果它有 m 个可能值,那么经过独热编码后,就变成了 m 个二元特征(如成绩这个特征有好,中,差变成 one-hot 就是 100, 010, 001)。并且,这些特征互斥,每次只有一个激活。因此,数据会变成稀疏的。
+Cosine decay reduces the learning rate along a cosine curve. The update rule is:
+$$
+global{\_}step=min(global{\_}step,decay{\_}steps)
+$$
-这样做的好处主要有:
+$$
+cosine{\_}decay=0.5*\left( 1+cos\left( \pi* \frac{global{\_}step}{decay{\_}steps}\right)\right)
+$$
-1. 解决了分类器不好处理属性数据的问题;
-2. 在一定程度上也起到了扩充特征的作用。
+$$
+decayed=(1-\alpha)*cosine{\_}decay+\alpha
+$$
-## 3.11 常用的优化器有哪些
+$$
+decayed{\_}learning{\_}rate=learning{\_}rate*decayed
+$$
-分别列举
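+In the figure below, the red curve is standard cosine decay: the learning rate falls from its initial value to the minimum learning rate and then stays there. The blue curve is linear cosine decay, in which the learning rate falls from the initial value to the minimum in a roughly linear fashion. The green curve is noisy linear cosine decay.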
-```
-Optimizer:
-tf.train.GradientDescentOptimizer
-tf.train.AdadeltaOptimizer
-tf.train.AdagradOptimizer
-tf.train.AdagradDAOptimizer
-tf.train.MomentumOptimizer
-tf.train.AdamOptimizer
-tf.train.FtrlOptimizer
-tf.train.ProximalGradientDescentOptimizer
-tf.train.ProximalAdagradOptimizer
-tf.train.RMSPropOptimizer
-```
+
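The four cosine decay formulas of section 3.9.7 translate almost line by line into code. The sketch below is a plain-Python illustration under the assumption that `alpha` is the minimum learning rate expressed as a fraction of the initial rate; the function name `cosine_decay` and its defaults are chosen for this example only.

```python
import math


def cosine_decay(learning_rate, global_step, decay_steps, alpha=0.0):
    """Return the cosine-decayed learning rate for the given step (sketch)."""
    # Clamp the step so the rate stays at its minimum after decay_steps.
    global_step = min(global_step, decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * global_step / decay_steps))
    # alpha is the floor, as a fraction of the initial learning rate.
    decayed = (1.0 - alpha) * cosine + alpha
    return learning_rate * decayed


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000, 1500):
        print(step, cosine_decay(0.1, step, 1000, alpha=0.1))
```

At step 0 the function returns the initial rate, and from `decay_steps` onward it returns `learning_rate * alpha`.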
## 3.12 Dropout-related questions
@@ -1059,15 +1096,15 @@ tf.train.RMSPropOptimizer
### 3.12.2 Why does regularization help prevent overfitting?
-
-
+
+
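The left plot shows high bias, the right plot shows high variance, and the middle one is "just right"; we saw these plots in earlier lessons.
### 3.12.3 Understanding dropout regularization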
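-Dropout randomly removes units from the network. Why does it work so well as a regularizer?
+  Dropout randomly removes units from the network. Why does it work so well as a regularizer?
-Intuitively: a unit cannot rely on any single feature, because any of its inputs may be dropped at any time. The unit therefore spreads its weight out, giving a little weight to each of its inputs (four in the example), and spreading the weights in this way shrinks their squared norm, much like the L2 regularization discussed earlier. The net effect of dropout is to shrink the weights and to act as an outer regularizer that helps prevent overfitting. The L2-like decay it applies differs from weight to weight, depending on the size of the activations being multiplied.
+  Intuitively: a unit cannot rely on any single feature, because any of its inputs may be dropped at any time. The unit therefore spreads its weight out, giving a little weight to each of its inputs (four in the example), and spreading the weights in this way shrinks their squared norm, much like the L2 regularization discussed earlier. The net effect of dropout is to shrink the weights and to act as an outer regularizer that helps prevent overfitting. The L2-like decay it applies differs from weight to weight, depending on the size of the activations being multiplied.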
### 3.12.4 Choosing the dropout rate
@@ -1080,10 +1117,11 @@ Dropout可以随机删除网络中的神经单元,它为什么可以通过正
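### 3.12.5 What are the drawbacks of dropout?
-A major drawback of dropout is that the cost function J is no longer well defined. Every iteration randomly removes some nodes, so it is hard to double-check how gradient descent is behaving. A well-defined cost function J decreases after every iteration, but the J we are actually optimizing is not well defined, or is at least hard to compute, so we lose the debugging tool of plotting it. I usually turn dropout off by setting keep-prob to 1, run the code, and make sure J decreases monotonically. Then I turn dropout back on and hope that no bug was introduced along with it. You can also try other methods; we have no performance statistics for them, but they can be used together with dropout.
+  A major drawback of dropout is that the cost function J is no longer well defined. Every iteration randomly removes some nodes, so it is hard to double-check how gradient descent is behaving. A well-defined cost function J decreases after every iteration, but the J we are actually optimizing is not well defined, or is at least hard to compute, so we lose the debugging tool of plotting it. I usually turn dropout off by setting keep-prob to 1, run the code, and make sure J decreases monotonically. Then I turn dropout back on and hope that no bug was introduced along with it. You can also try other methods; we have no performance statistics for them, but they can be used together with dropout.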
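
The debugging routine of section 3.12.5 (set keep-prob to 1, check that J decreases, then re-enable dropout) works because dropout with keep-prob equal to 1 leaves the activations untouched. A minimal plain-Python sketch follows; it assumes the common "inverted dropout" scaling, which the text does not spell out, and `inverted_dropout` and its parameters are hypothetical names used only for illustration.

```python
import random


def inverted_dropout(activations, keep_prob, rng=None):
    """Apply inverted dropout to a list of activations (illustrative sketch)."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    rng = rng or random.Random()
    output = []
    for a in activations:
        # Keep each unit with probability keep_prob; scale the survivors by
        # 1 / keep_prob so the expected activation is unchanged.
        keep = rng.random() < keep_prob
        output.append(a / keep_prob if keep else 0.0)
    return output


if __name__ == "__main__":
    layer = [1.0, 2.0, 4.0, 8.0]
    print(inverted_dropout(layer, 1.0))                    # dropout switched off
    print(inverted_dropout(layer, 0.5, random.Random(0)))  # dropout switched on
```

With `keep_prob=1.0` the call returns the input values unchanged, which is the "dropout off" setting used for the monotonicity check.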
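+## 3.13 What data augmentation methods are commonly used in deep learning?
-## 3.13 What data augmentation methods are commonly used in deep learning (Data Augmentation)? (Contributor: 黄钦建, South China University of Technology)
+**(Contributor: 黄钦建, South China University of Technology)**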
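- Color Jittering: color augmentation that varies image brightness, saturation, and contrast (the contributor is not sure this reading of color jittering is accurate);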
@@ -1103,15 +1141,17 @@ dropout一大缺点就是代价函数J不再被明确定义,每次迭代,都
- Label Shuffle: augmentation for class-imbalanced data;
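-## 3.14 How should Internal Covariate Shift be understood? (Contributor: 黄钦建, South China University of Technology)
+## 3.14 How should Internal Covariate Shift be understood?
+
+**(Contributor: 黄钦建, South China University of Technology)**
-Why is training a deep neural network difficult? One important reason is that a deep network stacks many layers, and every parameter update in one layer changes the distribution of the inputs to the layers above it. As the layers stack up, the input distribution of the higher layers changes drastically, so the higher layers must keep re-adapting to parameter updates in the lower layers. To train the model well, we have to set the learning rate, the weight initialization, and the parameter update strategy very carefully.
+  Why is training a deep neural network difficult? One important reason is that a deep network stacks many layers, and every parameter update in one layer changes the distribution of the inputs to the layers above it. As the layers stack up, the input distribution of the higher layers changes drastically, so the higher layers must keep re-adapting to parameter updates in the lower layers. To train the model well, we have to set the learning rate, the weight initialization, and the parameter update strategy very carefully.
-Google summarized this phenomenon as Internal Covariate Shift, or ICS for short. What is ICS?
+  Google summarized this phenomenon as Internal Covariate Shift, or ICS for short. What is ICS?
-A classic assumption in statistical machine learning is that "the data distributions of the source domain and the target domain are the same". If they differ, new machine learning problems arise, such as transfer learning / domain adaptation. Covariate shift is one branch of the mismatched-distribution setting: the conditional probabilities of the source and target domains are the same, but their marginal probabilities differ.
+  A classic assumption in statistical machine learning is that "the data distributions of the source domain and the target domain are the same". If they differ, new machine learning problems arise, such as transfer learning / domain adaptation. Covariate shift is one branch of the mismatched-distribution setting: the conditional probabilities of the source and target domains are the same, but their marginal probabilities differ.
-On reflection this fits neural networks well. The outputs of each layer have passed through that layer's operations, so their distribution clearly differs from the distribution of the layer's input signal, and the difference grows with network depth. Yet the sample label they "indicate" stays the same, which matches the definition of covariate shift. Because the analysis concerns signals between layers, it is called "internal".
+  On reflection this fits neural networks well. The outputs of each layer have passed through that layer's operations, so their distribution clearly differs from the distribution of the layer's input signal, and the difference grows with network depth. Yet the sample label they "indicate" stays the same, which matches the definition of covariate shift. Because the analysis concerns signals between layers, it is called "internal".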
**So what problems does ICS cause?**
@@ -1123,9 +1163,83 @@ Google 将这一现象总结为 Internal Covariate Shift,简称 ICS。 什么
Third, an update to any layer affects the other layers, so each layer's parameter update strategy has to be as cautious as possible.
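-## 3.15 When should local-conv be used, and when full convolution? (Contributor: 梁志成, Meizu Technology)
-1. When the dataset has a globally shared distribution of local features, that is, when local features are strongly correlated, full convolution is a good fit.
-2. When different regions have different feature distributions, local-conv is a good fit.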
+
+
+## References
+
+[1] Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain.[J]. Psychological Review, 1958, 65(6):386-408.
+
+[2] Duvenaud D , Rippel O , Adams R P , et al. Avoiding pathologies in very deep networks[J]. Eprint Arxiv, 2014:202-210.
+
+[3] Rumelhart D E, Hinton G E, Williams R J. Learning representations by back-propagating errors[J]. Cognitive modeling, 1988, 5(3): 1.
+
+[4] Hecht-Nielsen R. Theory of the backpropagation neural network[M]//Neural networks for perception. Academic Press, 1992: 65-93.
+
+[5] Felice M. Which deep learning network is best for you?| CIO[J]. 2017.
+
+[6] Conneau A, Schwenk H, Barrault L, et al. Very deep convolutional networks for natural language processing[J]. arXiv preprint arXiv:1606.01781, 2016, 2.
+
+[7] Ba J, Caruana R. Do deep nets really need to be deep?[C]//Advances in neural information processing systems. 2014: 2654-2662.
+
+[8] Nielsen M A. Neural networks and deep learning[M]. USA: Determination press, 2015.
+
+[9] Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT press, 2016.
+
+[10] 周志华. 机器学习[M].清华大学出版社, 2016.
+
+[11] Kim J, Kwon Lee J, Mu Lee K. Accurate image super-resolution using very deep convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 1646-1654.
+
+[12] Chen Y, Lin Z, Zhao X, et al. Deep learning-based classification of hyperspectral data[J]. IEEE Journal of Selected topics in applied earth observations and remote sensing, 2014, 7(6): 2094-2107.
+
+[13] Domhan T, Springenberg J T, Hutter F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves[C]//Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.
+
+[14] Maclaurin D, Duvenaud D, Adams R. Gradient-based hyperparameter optimization through reversible learning[C]//International Conference on Machine Learning. 2015: 2113-2122.
+
+[15] Srivastava R K, Greff K, Schmidhuber J. Training very deep networks[C]//Advances in neural information processing systems. 2015: 2377-2385.
+
+[16] Bergstra J, Bengio Y. Random search for hyper-parameter optimization[J]. Journal of Machine Learning Research, 2012, 13(Feb): 281-305.
+
+[17] Ngiam J, Khosla A, Kim M, et al. Multimodal deep learning[C]//Proceedings of the 28th international conference on machine learning (ICML-11). 2011: 689-696.
+
+[18] Deng L, Yu D. Deep learning: methods and applications[J]. Foundations and Trends® in Signal Processing, 2014, 7(3–4): 197-387.
+
+[19] Erhan D, Bengio Y, Courville A, et al. Why does unsupervised pre-training help deep learning?[J]. Journal of Machine Learning Research, 2010, 11(Feb): 625-660.
+
+[20] Dong C, Loy C C, He K, et al. Learning a deep convolutional network for image super resolution[C]//European conference on computer vision. Springer, Cham, 2014: 184-199.
+
+[21] 郑泽宇,梁博文,顾思宇.TensorFlow:实战Google深度学习框架(第2版)[M].电子工业出版社,2018.
+
+[22] 焦李成. 深度学习优化与识别[M].清华大学出版社,2017.
+
+[23] 吴岸城. 神经网络与深度学习[M].电子工业出版社,2016.
+
+[24] Wei, W.G.H., Liu, T., Song, A., et al. (2018) An Adaptive Natural Gradient Method with Adaptive Step Size in Multilayer Perceptrons. Chinese Automation Congress, 1593-1597.
+
+[25] Y Feng, Y Li.An Overview of Deep Learning Optimization Methods and Learning Rate Attenuation Methods[J].Hans Journal of Data Mining,2018,8(4),186-200.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/LeNet-5.jpg" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/LeNet-5.jpg"
new file mode 100644
index 00000000..fb65aedd
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/LeNet-5.jpg" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/alexnet.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/alexnet.png"
new file mode 100644
index 00000000..1a954ca5
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/alexnet.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/featureMap.jpg" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/featureMap.jpg"
new file mode 100644
index 00000000..2e077ef9
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/featureMap.jpg" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image1.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image1.png"
index 41dea102..69c95112 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image1.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image1.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image21.jpeg" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image21.jpeg"
index c2fa5934..32b53478 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image21.jpeg" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image21.jpeg" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image21.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image21.png"
new file mode 100644
index 00000000..d4bda094
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image21.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image23.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image23.png"
new file mode 100644
index 00000000..800434cd
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image23.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image27.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image27.png"
new file mode 100644
index 00000000..ecb4c217
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image27.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image28.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image28.png"
new file mode 100644
index 00000000..486eef7a
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image28.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image31.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image31.png"
index 5d238865..dfc81ca4 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image31.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image31.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image32.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image32.png"
index 7b463dae..64fc988c 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image32.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image32.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image34.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image34.png"
index 95016085..b04d68bc 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image34.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image34.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image35.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image35.png"
index 083675ca..d18574c2 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image35.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image35.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image36.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image36.png"
index 404390cf..89333b38 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image36.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image36.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image37.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image37.png"
index 6f163002..21c976e7 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image37.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image37.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image38.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image38.png"
index 2bed01bc..8bdea2d7 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image38.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image38.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image46.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image46.png"
index 3dbb5118..9cbf8abf 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image46.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image46.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image47.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image47.png"
index adcebabd..50618ab9 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image47.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image47.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image63.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image63.png"
index 38167b0c..5cb3858f 100644
Binary files "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image63.png" and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/image63.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_01.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_01.png"
new file mode 100644
index 00000000..d410b750
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_01.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_02.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_02.png"
new file mode 100644
index 00000000..495c67c0
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_02.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_03.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_03.png"
new file mode 100644
index 00000000..9575cd47
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_03.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_04.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_04.png"
new file mode 100644
index 00000000..db318cc2
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_04.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_05.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_05.png"
new file mode 100644
index 00000000..6e621a4f
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/img_inception_05.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/vgg16.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/vgg16.png"
new file mode 100644
index 00000000..3dca57d1
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/vgg16.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/zfnet-layer1.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/zfnet-layer1.png"
new file mode 100644
index 00000000..7fb93f7f
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/zfnet-layer1.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/zfnet-layer2.png" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/zfnet-layer2.png"
new file mode 100644
index 00000000..c2e2f6c7
Binary files /dev/null and "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/img/ch4/zfnet-layer2.png" differ
diff --git "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/\347\254\254\345\233\233\347\253\240_\347\273\217\345\205\270\347\275\221\347\273\234.md" "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/\347\254\254\345\233\233\347\253\240_\347\273\217\345\205\270\347\275\221\347\273\234.md"
index 78f29c4e..1e1bec50 100644
--- "a/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/\347\254\254\345\233\233\347\253\240_\347\273\217\345\205\270\347\275\221\347\273\234.md"
+++ "b/ch04_\347\273\217\345\205\270\347\275\221\347\273\234/\347\254\254\345\233\233\347\253\240_\347\273\217\345\205\270\347\275\221\347\273\234.md"
@@ -1,239 +1,140 @@
[TOC]
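-# Chapter 4 Classic Networks
-## 4.1 LeNet5
+# Chapter 4 Classic Networks Explained
+## 4.1 LeNet-5
-A typical convolutional network for recognizing digits is LeNet-5.
-### 4.1.1 Model structure
+### 4.1.1 Model introduction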
-
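+  LeNet-5 is a convolutional neural network (CNN) proposed by $LeCun$ for recognizing handwritten digits and machine-printed characters$^{[1]}$. Its name comes from the author $LeCun$, and the 5 is the code number of this research result; the lesser-known LeNet-4 and LeNet-1 came before it. LeNet-5 showed that correlations among pixel features in an image can be extracted by convolutions with shared parameters. Its combination of convolution, subsampling (pooling), and nonlinear mapping is the basis of most deep image recognition networks in use today.
### 4.1.2 Model structure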
-LeNet-5共有7层(不包含输入层),每层都包含可训练参数;每个层有多个Feature Map,每个FeatureMap通过一种卷积滤波器提取输入的一种特征,然后每个FeatureMap有多个神经元。
-
-- C1层是一个卷积层
- 输入图片:32 \* 32
- 卷积核大小:5 \* 5
- 卷积核种类:6
- 输出featuremap大小:28 \* 28 (32-5+1)
- 神经元数量:28 \* 28 \* 6
- 可训练参数:(5 \* 5+1) \* 6(每个滤波器5 \* 5=25个unit参数和一个bias参数,一共6个滤波器)
- 连接数:(5 \* 5+1) \* 6 \* 28 \* 28
-
-- S2层是一个下采样层
- 输入:28 \* 28
- 采样区域:2 \* 2
- 采样方式:4个输入相加,乘以一个可训练参数,再加上一个可训练偏置。结果通过sigmoid
- 采样种类:6
- 输出featureMap大小:14 \* 14(28/2)
- 神经元数量:14 \* 14 \* 6
- 可训练参数:2 \* 6(和的权+偏置)
- 连接数:(2 \* 2+1) \* 6 \* 14 \* 14
- S2中每个特征图的大小是C1中特征图大小的1/4
-
-- C3层也是一个卷积层
- 输入:S2中所有6个或者几个特征map组合
- 卷积核大小:5 \* 5
- 卷积核种类:16
- 输出featureMap大小:10 \* 10
- C3中的每个特征map是连接到S2中的所有6个或者几个特征map的,表示本层的特征map是上一层提取到的特征map的不同组合
- 存在的一个方式是:C3的前6个特征图以S2中3个相邻的特征图子集为输入。接下来6个特征图以S2中4个相邻特征图子集为输入。然后的3个以不相邻的4个特征图子集为输入。最后一个将S2中所有特征图为输入。 则:可训练参数:6 \* (3 \* 25+1)+6 \* (4 \* 25+1)+3 \* (4 \* 25+1)+(25 \* 6+1)=1516
- 连接数:10 \* 10 \* 1516=151600
-
-- S4层是一个下采样层
- 输入:10 \* 10
- 采样区域:2 \* 2
- 采样方式:4个输入相加,乘以一个可训练参数,再加上一个可训练偏置。结果通过sigmoid
- 采样种类:16
- 输出featureMap大小:5 \* 5(10/2)
- 神经元数量:5 \* 5 \* 16=400
- 可训练参数:2 \* 16=32(和的权+偏置)
- 连接数:16 \* (2 \* 2+1) \* 5 \* 5=2000
- S4中每个特征图的大小是C3中特征图大小的1/4
-
-- C5层是一个卷积层
- 输入:S4层的全部16个单元特征map(与s4全相连)
- 卷积核大小:5 \* 5
- 卷积核种类:120
- 输出featureMap大小:1 \* 1(5-5+1)
- 可训练参数/连接:120 \* (16 \* 5 \* 5+1)=48120
-
-- F6层全连接层
- 输入:c5 120维向量
- 计算方式:计算输入向量和权重向量之间的点积,再加上一个偏置,结果通过sigmoid函数
- 可训练参数:84 \* (120+1)=10164
-### 4.1.3 模型特性
-- 卷积网络使用一个3层的序列:卷积、池化、非线性——这可能是自这篇论文以来面向图像的深度学习的关键特性!
-- 使用卷积提取空间特征
-- 使用映射的空间均值进行降采样
-- tanh或sigmoids非线性
-- 多层神经网络(MLP)作为最终的分类器
-- 层间的稀疏连接矩阵以避免巨大的计算开销
-
-## 4.2 AlexNet
-
-### 4.2.1 模型介绍
-
- AlexNet在2012年ILSVRC竞赛中赢得了第一名,其Top5错误率为15.3%。AlexNet模型证明了CNN在复杂模型下的有效性,并且在可接受时间范围内,部署GPU得到了有效结果。
-
-### 4.2.2 模型结构
-
-
-
-### 4.2.3 模型解读
-
-AlexNet共8层,前五层为卷积层,后三层为全连接层。
-
-1. **conv1阶段**:
-
-
-
-
-
-- 输入图片:227 \* 227 \* 3
-- 卷积核大小:11* 11 *3
-- 卷积核数量:96
-- 滤波器stride:4
-
-- 输出featuremap大小:(227-11)/4+1=55 (227个像素减去11,然后除以4,生成54个像素,再加上被减去的11也对应生成一个像素)
-
-- 输出featuremap大小:55 \* 55
-
-- 共有96个卷积核,会生成55 \* 55 \* 96个卷积后的像素层。96个卷积核分成2组,每组48个卷积核。对应生成2组55 \* 55 \* 48的卷积后的像素层数据。
-
-- 这些像素层经过relu1单元的处理,生成激活像素层,尺寸仍为2组55 \* 55 \* 48的像素层数据。
-
-- 这些像素层经过pool运算的处理,池化运算尺度为3 \* 3,运算的步长为2,则池化后图像的尺寸为(55-3)/2+1=27。 即池化后像素的规模为27 \* 27 \* 96;
-
-- 然后经过归一化处理,归一化运算的尺度为5 \* 5;第一卷积层运算结束后形成的像素层的规模为27 \* 27 \* 96。分别对应96个卷积核所运算形成。这96层像素层分为2组,每组48个像素层,每组在一个独立的GPU上进行运算。
-
-- 反向传播时,每个卷积核对应一个偏差值。即第一层的96个卷积核对应上层输入的96个偏差值。
-
-
-2. **conv2阶段**:
-
-
-
-
-
-- 输入图片:27 \* 27 \* 96(第一层输出)
-- 为便于后续处理,每幅像素层的左右两边和上下两边都要填充2个像素
-- 27 \* 27 \* 96的像素数据分成27 \* 27 \* 48的两组像素数据,两组数据分别再两个不同的GPU中进行运算。
-- 卷积核大小:5 \* 5 \* 48
-- 滤波器stride:1
-
-- 输出featuremap大小:卷积核在移动的过程中会生成(27-5+2 \* 2)/1+1=27个像素。(27个像素减去5,正好是22,在加上上下、左右各填充的2个像素,即生成26个像素,再加上被减去的5也对应生成一个像素),行和列的27 \* 27个像素形成对原始图像卷积之后的像素层。共有256个5 \* 5 \* 48卷积核;这256个卷积核分成两组,每组针对一个GPU中的27 \* 27 \* 48的像素进行卷积运算。会生成两组27 \* 27 \* 128个卷积后的像素层。
-
-- 这些像素层经过relu2单元的处理,生成激活像素层,尺寸仍为两组27 \* 27 \* 128的像素层。
-
-- 这些像素层经过pool运算(池化运算)的处理,池化运算的尺度为3 \* 3,运算的步长为2,则池化后图像的尺寸为(57-3)/2+1=13。 即池化后像素的规模为2组13 \* 13 \* 128的像素层;
-
-- 然后经过归一化处理,归一化运算的尺度为5 \* 5;
-
-- 第二卷积层运算结束后形成的像素层的规模为2组13 \* 13 \* 128的像素层。分别对应2组128个卷积核所运算形成。每组在一个GPU上进行运算。即共256个卷积核,共2个GPU进行运算。
-
-- 反向传播时,每个卷积核对应一个偏差值。即第一层的96个卷积核对应上层输入的256个偏差值。
-
-
-3. **conv3阶段**:
-
-
-
-- 第三层输入数据为第二层输出的2组13 \* 13 \* 128的像素层;
-- 为便于后续处理,每幅像素层的左右两边和上下两边都要填充1个像素;
-- 2组像素层数据都被送至2个不同的GPU中进行运算。每个GPU中都有192个卷积核,每个卷积核的尺寸是3 \* 3 \* 256。因此,每个GPU中的卷积核都能对2组13 \* 13 \* 128的像素层的所有数据进行卷积运算。
-- 移动的步长是1个像素。
-- 运算后的卷积核的尺寸为(13-3+1 \* 2)/1+1=13(13个像素减去3,正好是10,在加上上下、左右各填充的1个像素,即生成12个像素,再加上被减去的3也对应生成一个像素),每个GPU中共13 \* 13 \* 192个卷积核。2个GPU中共13 \* 13 \* 384个卷积后的像素层。这些像素层经过relu3单元的处理,生成激活像素层,尺寸仍为2组13 \* 13 \* 192像素层,共13 \* 13 \* 384个像素层。
+
+  Figure 4.1 LeNet-5 network architecture
-4. **conv4阶段DFD**:
-
-
-
-
-
-- 第四层输入数据为第三层输出的2组13 \* 13 \* 192的像素层;
-
-- 为便于后续处理,每幅像素层的左右两边和上下两边都要填充1个像素;
-
-- 2组像素层数据都被送至2个不同的GPU中进行运算。每个GPU中都有192个卷积核,每个卷积核的尺寸是3 \* 3 \* 192。因此,每个GPU中的卷积核能对1组13 \* 13 \* 192的像素层的数据进行卷积运算。
-
-- 移动的步长是1个像素。
-
-- 运算后的卷积核的尺寸为(13-3+1 \* 2)/1+1=13(13个像素减去3,正好是10,在加上上下、左右各填充的1个像素,即生成12个像素,再加上被减去的3也对应生成一个像素),每个GPU中共13 \* 13 \* 192个卷积核。2个GPU中共13 \* 13 \* 384个卷积后的像素层。
-
-- 这些像素层经过relu4单元的处理,生成激活像素层,尺寸仍为2组13 \* 13 \* 192像素层,共13 \* 13 \* 384个像素层。
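+  As shown in Figure 4.1, LeNet-5 has 7 layers in total (the input layer is not counted): 2 convolutional layers, 2 subsampling layers, and 3 connected layers. The parameter configuration is listed in Table 4.1. For subsampling layers and fully connected layers, the kernel size column gives the sampling window and the size of the connection matrix respectively. For convolutional layers, a kernel size such as $“5\times5\times1/1,6”$ means 6 kernels of size $5\times5\times1$ applied with stride $1$.
+  Table 4.1 LeNet-5 parameter configuration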
- 5. **conv5阶段**:
+| Layer | Input size | Kernel size | Output size | Trainable parameters |
+| :-------------: | :------------------: | :----------------------: | :------------------: | :-----------------------------: |
+| Convolution $C_1$ | $32\times32\times1$ | $5\times5\times1/1,6$ | $28\times28\times6$ | $(5\times5\times1+1)\times6$ |
+| Subsampling $S_2$ | $28\times28\times6$ | $2\times2/2$ | $14\times14\times6$ | $(1+1)\times6$ $^*$ |
+| Convolution $C_3$ | $14\times14\times6$ | $5\times5\times6/1,16$ | $10\times10\times16$ | $1516^*$ |
+| Subsampling $S_4$ | $10\times10\times16$ | $2\times2/2$ | $5\times5\times16$ | $(1+1)\times16$ |
+| Convolution $C_5$$^*$ | $5\times5\times16$ | $5\times5\times16/1,120$ | $1\times1\times120$ | $(5\times5\times16+1)\times120$ |
+| Fully connected $F_6$ | $1\times1\times120$ | $120\times84$ | $1\times1\times84$ | $(120+1)\times84$ |
+| Output | $1\times1\times84$ | $84\times10$ | $1\times1\times10$ | $(84+1)\times10$ |
- 
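+> $^*$ In LeNet, subsampling is similar to pooling, but the sampled result is then multiplied by a coefficient and a bias term is added, so each subsampling layer has $(1+1)\times6$ parameters rather than zero.
+>
+> $^*$ The $C_3$ convolutional layer is not connected to all of the feature maps in $S_2$. Instead it uses the sparse connection scheme shown in Figure 4.2: of the 16 output feature maps, 6 take 3 adjacent input maps, 6 take 4 adjacent input maps, 3 take 4 non-adjacent input maps, and 1 takes all 6 input maps. The parameter count is therefore $6\times(25\times3+1)+6\times(25\times4+1)+3\times(25\times4+1)+1\times(25\times6+1)=1516$. The original paper gives two reasons for this scheme: it keeps the number of connections manageable (computing power was limited at the time), and forcing different combinations of input maps makes the output maps learn different feature patterns.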
-
+
-- 第五层输入数据为第四层输出的2组13 \* 13 \* 192的像素层;
-- 为便于后续处理,每幅像素层的左右两边和上下两边都要填充1个像素;
-- 2组像素层数据都被送至2个不同的GPU中进行运算。每个GPU中都有128个卷积核,每个卷积核的尺寸是3 \* 3 \* 192。因此,每个GPU中的卷积核能对1组13 \* 13 \* 192的像素层的数据进行卷积运算。
-- 移动的步长是1个像素。
-- 因此,运算后的卷积核的尺寸为(13-3+1 \* 2)/1+1=13(13个像素减去3,正好是10,在加上上下、左右各填充的1个像素,即生成12个像素,再加上被减去的3也对应生成一个像素),每个GPU中共13 \* 13 \* 128个卷积核。2个GPU中共13 \* 13 \* 256个卷积后的像素层。
-- 这些像素层经过relu5单元的处理,生成激活像素层,尺寸仍为2组13 \* 13 \* 128像素层,共13 \* 13 \* 256个像素层。
-- 2组13 \* 13 \* 128像素层分别在2个不同GPU中进行池化(pool)运算处理。池化运算的尺度为3 \* 3,运算的步长为2,则池化后图像的尺寸为(13-3)/2+1=6。 即池化后像素的规模为两组6 \* 6 \* 128的像素层数据,共6 \* 6 \* 256规模的像素层数据。
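+  Figure 4.2 Sparse feature-map connections between $S_2$ and $C_3$
+> $^*$ The $C_5$ convolutional layer appears as a fully connected layer in Figure 4.1. The original paper explains that it is actually a convolution; the $5\times5$ convolution just happens to reduce the spatial size to $1\times1$, so the output looks much like that of a fully connected layer.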
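
The parameter counts in Table 4.1 can be checked with a few lines of arithmetic. The sketch below simply evaluates the formulas from the table and its footnotes; the helper name `conv_params` and the dictionary layout are assumptions made for this example, and the total is valid only under the table's own counting conventions.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per output feature map of a dense convolution."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels


# Trainable parameters per layer, following Table 4.1.
lenet5_params = {
    "C1": conv_params(5, 5, 1, 6),
    "S2": (1 + 1) * 6,            # one coefficient and one bias per map
    # C3 is sparsely connected: 6 maps see 3 inputs, 9 maps see 4, 1 map sees 6.
    "C3": 6 * (25 * 3 + 1) + 6 * (25 * 4 + 1) + 3 * (25 * 4 + 1) + 1 * (25 * 6 + 1),
    "S4": (1 + 1) * 16,
    "C5": conv_params(5, 5, 16, 120),
    "F6": (120 + 1) * 84,
    "Output": (84 + 1) * 10,
}

if __name__ == "__main__":
    for name, count in lenet5_params.items():
        print(name, count)
    print("total", sum(lenet5_params.values()))
```

For comparison, a densely connected $C_3$ would need `conv_params(5, 5, 6, 16)` = 2416 parameters, so the sparse scheme saves 900 of them.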
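+### 4.1.3 Model features
+- The network uses a three-step sequence: convolution, subsampling (pooling), and nonlinear mapping (the most important feature of LeNet-5, which laid the foundation for today's deep convolutional networks)
+- Uses convolution to extract spatial features
+- Uses the spatial mean of the maps for subsampling
+- Uses $tanh$ or $sigmoid$ for the nonlinear mapping
+- Uses a multilayer perceptron (MLP) as the final classifier
+- Uses a sparse connection matrix between layers to avoid a huge computational cost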
+## 4.2 AlexNet
-6. **fc6阶段**:
+### 4.2.1 Model introduction
-
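+  AlexNet, proposed by $Alex$ $Krizhevsky $, was the first deep convolutional neural network applied to image classification. It won the image classification task of the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Competition) with a top-5 test error rate of 15.3%$^{[2]}$. AlexNet ran on GPUs instead of CPUs, which allowed a more complex model to be trained in an acceptable amount of time. It demonstrated that deep convolutional networks are effective as complex models, popularized CNNs in computer vision, and directly or indirectly set off the deep learning boom.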
-
+### 4.2.2 Model structure
-- 第六层输入数据的尺寸是6 \* 6 \* 256
-- 采用6 \* 6 \* 256尺寸的滤波器对第六层的输入数据进行卷积运算
-- 共有4096个6 \* 6 \* 256尺寸的滤波器对输入数据进行卷积运算,通过4096个神经元输出运算结果;
-- 这4096个运算结果通过relu激活函数生成4096个值;
-- 通过drop运算后输出4096个本层的输出结果值。
-- 由于第六层的运算过程中,采用的滤波器的尺寸(6 \* 6 \* 256)与待处理的feature map的尺寸(6 \* 6 \* 256)相同,即滤波器中的每个系数只与feature map中的一个像素值相乘;而其它卷积层中,每个滤波器的系数都会与多个feature map中像素值相乘;因此,将第六层称为全连接层。
-- 第五层输出的6 \* 6 \* 256规模的像素层数据与第六层的4096个神经元进行全连接,然后经由relu6进行处理后生成4096个数据,再经过dropout6处理后输出4096个数据。
+
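+  Figure 4.3 AlexNet network architecture
+  As shown in Figure 4.3, leaving aside the subsampling (pooling) layers and the Local Response Normalization (LRN) operations, AlexNet has 8 layers: the first 5 are convolutional layers and the remaining 3 are fully connected layers. The network is split into an upper and a lower stream, one per GPU. Except at certain layers (the $C_3$ convolutional layer and the $F_{6-8}$ fully connected layers, where the two GPUs exchange data), each GPU computes its results separately. The output of the last fully connected layer is fed to a $softmax$, which yields probabilities for 1000 image class labels. Apart from the parallel GPU design, the AlexNet architecture is very similar to LeNet. Its parameter configuration is listed in Table 4.2.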
-7. **fc7阶段**:
+  Table 4.2 AlexNet parameter configuration
-
+| Layer | Input size | Kernel size | Output size | Trainable parameters |
+| :-------------------: | :----------------------------------: | :--------------------------------------: | :----------------------------------: | :-------------------------------------: |
+| Convolution $C_1$ $^*$ | $224\times224\times3$ | $11\times11\times3/4,48(\times2_{GPU})$ | $55\times55\times48(\times2_{GPU})$ | $(11\times11\times3+1)\times48\times2$ |
+| Subsampling $S_{max}$$^*$ | $55\times55\times48(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $27\times27\times48(\times2_{GPU})$ | 0 |
+| Convolution $C_2$ | $27\times27\times48(\times2_{GPU})$ | $5\times5\times48/1,128(\times2_{GPU})$ | $27\times27\times128(\times2_{GPU})$ | $(5\times5\times48+1)\times128\times2$ |
+| Subsampling $S_{max}$ | $27\times27\times128(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $13\times13\times128(\times2_{GPU})$ | 0 |
+| Convolution $C_3$ $^*$ | $13\times13\times128\times2_{GPU}$ | $3\times3\times256/1,192(\times2_{GPU})$ | $13\times13\times192(\times2_{GPU})$ | $(3\times3\times256+1)\times192\times2$ |
+| Convolution $C_4$ | $13\times13\times192(\times2_{GPU})$ | $3\times3\times192/1,192(\times2_{GPU})$ | $13\times13\times192(\times2_{GPU})$ | $(3\times3\times192+1)\times192\times2$ |
+| Convolution $C_5$ | $13\times13\times192(\times2_{GPU})$ | $3\times3\times192/1,128(\times2_{GPU})$ | $13\times13\times128(\times2_{GPU})$ | $(3\times3\times192+1)\times128\times2$ |
+| Subsampling $S_{max}$ | $13\times13\times128(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $6\times6\times128(\times2_{GPU})$ | 0 |
+| Fully connected $F_6$ $^*$ | $6\times6\times128\times2_{GPU}$ | $9216\times2048(\times2_{GPU})$ | $1\times1\times2048(\times2_{GPU})$ | $(9216+1)\times2048\times2$ |
+| Fully connected $F_7$ | $1\times1\times2048\times2_{GPU}$ | $4096\times2048(\times2_{GPU})$ | $1\times1\times2048(\times2_{GPU})$ | $(4096+1)\times2048\times2$ |
+| Fully connected $F_8$ | $1\times1\times2048\times2_{GPU}$ | $4096\times1000$ | $1\times1\times1000$ | $(4096+1)\times1000$ |
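+>The input to convolutional layer $C_1$ is a $224\times224\times3$ image. On each of the two GPUs it is convolved with $11\times11\times3$ kernels at stride 4, producing two independent $55\times55\times48$ outputs.
+>
+>The subsampling layer $S_{max}$ is really a max-pooling step nested inside the convolution stage; it is listed separately here to distinguish convolutional layers that are followed by max pooling from those that are not. In $C_{1-2}$ there is also an LRN operation, which in the original paper is applied after the ReLU activation and before pooling, and which normalizes each activation across neighboring feature maps.
+>
+>The input to convolutional layer $C_3$ differs from that of the other convolutional layers: $13\times13\times128\times2_{GPU}$ means that the outputs of the previous layer on both GPUs are gathered as the input, so the kernels span 256 channels.
+>
+>The inputs to the fully connected layers $F_{6-8}$ are similar to that of $C_3$: each one merges the outputs of both GPU streams.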
-- 第六层输出的4096个数据与第七层的4096个神经元进行全连接
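
The spatial sizes in the AlexNet table follow from the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch below traces them layer by layer. The padding values are assumptions (the table does not list padding), and the helper name `conv_output_size` is hypothetical. Note that a 224-pixel input with no padding gives 54 rather than 55; the 55 in the table matches either a 227-pixel input or a padding of 2.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Output width of a convolution or pooling window, rounding down."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def alexnet_spatial_sizes(input_size=227, c1_padding=0):
    """Trace the feature-map width through AlexNet (padding values assumed)."""
    sizes = {}
    sizes["C1"] = conv_output_size(input_size, 11, 4, c1_padding)
    sizes["pool1"] = conv_output_size(sizes["C1"], 3, 2)
    sizes["C2"] = conv_output_size(sizes["pool1"], 5, 1, padding=2)
    sizes["pool2"] = conv_output_size(sizes["C2"], 3, 2)
    sizes["C3"] = conv_output_size(sizes["pool2"], 3, 1, padding=1)
    sizes["C4"] = conv_output_size(sizes["C3"], 3, 1, padding=1)
    sizes["C5"] = conv_output_size(sizes["C4"], 3, 1, padding=1)
    sizes["pool5"] = conv_output_size(sizes["C5"], 3, 2)
    return sizes


if __name__ == "__main__":
    print(alexnet_spatial_sizes(227, 0))
    print(alexnet_spatial_sizes(224, 2))
    print(conv_output_size(224, 11, 4))  # no padding on a 224-pixel input
```

The final 6 x 6 maps with 256 channels across both GPUs give the 9216 inputs of the first fully connected layer.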
-- 然后经由relu7进行处理后生成4096个数据,再经过dropout7处理后输出4096个数据。
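+### 4.2.3 Model features
+- All convolutional layers use ReLU as the nonlinear mapping function, which makes the model converge faster
+- The model is trained on multiple GPUs, which both speeds up training and allows more data to be used
+- LRN normalizes local features; in the original paper it is applied to the ReLU outputs and is reported to lower the error rate
+- Overlapping max pooling, where the pooling window z and the stride s satisfy $z>s$ (for example, the $3\times3/2$ kernel in $S_{max}$), avoids the averaging effect of average pooling
+- Dropout selectively ignores individual neurons during training to keep the model from overfitting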
-8. **fc8阶段**:
+## 4.3 ZFNet
+### 4.3.1 Model introduction
-
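+  ZFNet is a large convolutional network proposed by $Matthew$ $D. Zeiler$ and $Rob$ $Fergus$ on the basis of AlexNet. It is usually credited with winning the 2013 ILSVRC image classification task with an error rate of 11.19%. Strictly speaking, the team behind the original ZFNet was not the winner: the original ZFNet ranked 8th with an error rate of 13.51%, and the actual winner was the $Clarifai$ team. $Clarifai$ is a startup whose CEO is $Zeiler$, and its changes to ZFNet were relatively small, so ZFNet is generally regarded as the winner$^{[3-4]}$. ZFNet is essentially a fine-tuned AlexNet. It visualizes the output feature maps of each layer through deconvolution, which helps explain why convolution works so well in large networks.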
-
+### 4.3.2 Model structure
-- 第七层输出的4096个数据与第八层的1000个神经元进行全连接,经过训练后输出被训练的数值。
+
+
-### 4.2.4 模型特性
-- 使用ReLU作为非线性
+  Figure 4.4 ZFNet network architecture (original diagram and AlexNet-style diagram)
-- 使用dropout技术选择性地忽略训练中的单个神经元,避免模型的过拟合
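+  As shown in Figure 4.4, ZFNet resembles AlexNet: both are 8-layer convolutional neural networks with 5 convolutional layers and 3 fully connected layers. The biggest structural difference is that the first convolutional layer of ZFNet uses $7\times7\times3/2$ kernels in place of the $11\times11\times3/4$ kernels in the first layer of AlexNet. In Figure 4.5, the first-layer feature maps of ZFNet contain more mid-frequency information than those of AlexNet, whose first-layer feature maps are mostly low-frequency or high-frequency. This lack of mid-frequency features means that later layers, as in Figure 4.5(c), cannot learn sufficiently fine-grained features. The root cause is that the kernel size and stride in the first layer of AlexNet are too large.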
-- 重叠最大池化(overlapping max pooling),避免平均池化(average pooling)的平均效应
+
-- 使用NVIDIA GTX 580 GPU减少训练时间
+
-- 当时,GPU比CPU提供了更多的核心,可以将训练速度提升10倍,从而允许使用更大的数据集和更大的图像。
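+  Figure 4.5 (a) First-layer feature maps of ZFNet (b) First-layer feature maps of AlexNet (c) Second-layer feature maps of AlexNet (d) Second-layer feature maps of ZFNet
+  Table 4.3 ZFNet parameter configuration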
+| Layer | Input size | Kernel size | Output size | Trainable parameters |
+| :-------------------: | :----------------------------------: | :--------------------------------------: | :----------------------------------: | :-------------------------------------: |
+| Convolution $C_1$ $^*$ | $224\times224\times3$ | $7\times7\times3/2,96$ | $110\times110\times96$ | $(7\times7\times3+1)\times96$ |
+| Subsampling $S_{max}$ | $110\times110\times96$ | $3\times3/2$ | $55\times55\times96$ | 0 |
+| Convolution $C_2$ $^*$ | $55\times55\times96$ | $5\times5\times96/2,256$ | $26\times26\times256$ | $(5\times5\times96+1)\times256$ |
+| Subsampling $S_{max}$ | $26\times26\times256$ | $3\times3/2$ | $13\times13\times256$ | 0 |
+| Convolution $C_3$ | $13\times13\times256$ | $3\times3\times256/1,384$ | $13\times13\times384$ | $(3\times3\times256+1)\times384$ |
+| Convolution $C_4$ | $13\times13\times384$ | $3\times3\times384/1,384$ | $13\times13\times384$ | $(3\times3\times384+1)\times384$ |
+| Convolution $C_5$ | $13\times13\times384$ | $3\times3\times384/1,256$ | $13\times13\times256$ | $(3\times3\times384+1)\times256$ |
+| Subsampling $S_{max}$ | $13\times13\times256$ | $3\times3/2$ | $6\times6\times256$ | 0 |
+| Fully connected $F_6$ | $6\times6\times256$ | $9216\times4096$ | $1\times1\times4096$ | $(9216+1)\times4096$ |
+| Fully connected $F_7$ | $1\times1\times4096$ | $4096\times4096$ | $1\times1\times4096$ | $(4096+1)\times4096$ |
+| Fully connected $F_8$ | $1\times1\times4096$ | $4096\times1000$ | $1\times1\times1000$ | $(4096+1)\times1000$ |
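+> Convolutional layer $C_1$ differs from $C_1$ in AlexNet: it uses $7\times7\times3/2$ kernels instead of $11\times11\times3/4$, so the first-layer output retains more mid-frequency features. This gives later layers more options for combining diverse features and helps capture finer detail.
+>
+> Convolutional layer $C_2$ uses kernels with stride 2, unlike the stride of $C_2$ in AlexNet, so its output dimensions differ.
+### 4.3.3 Model features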
-## 4.3 可视化ZFNet-转置卷积
-### 4.3.1 基本的思想及其过程
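+  ZFNet is almost identical to AlexNet in structure. Although this section is titled model features, the points below are more accurately contributions of the visualization technique in the original ZFNet paper.
- The visualization technique reveals the input stimuli that excite individual feature maps at each layer of the model.
- The visualization technique makes it possible to observe how features evolve during training and to diagnose potential problems with the model.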
@@ -242,429 +143,178 @@ AlexNet共8层,前五层为卷积层,后三层为全连接层。
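- The visualization technique gives a non-parametric view of invariance by showing which patches from the training set activate which feature map. Beyond cropping the input images, it uses top-down projections to reveal the structures within each patch that activate a given feature map.
- The visualization technique relies on deconvolution, the reverse of the convolution operation, which maps features back onto pixels.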
-### 4.3.2 卷积与转置卷积
-
-
-下图为卷积过程
-
-
Figure 33: Deep network transfer experiment results 1
+
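+In fact, by this stage the features have become more and more specific, so accuracy drops. Then why does the accuracy stop changing at layers 6 and 7? Because the whole network has only 8 layers: once layers 6 and 7 are also frozen, there is little left for the network to learn, so its accuracy is naturally almost the same as that of the original network B.
+
+  For BnB+, the results stay essentially unchanged, which shows that finetuning helps the model considerably.
+
+  Now focus on AnB and AnB+. For AnB, transferring the first 3 layers of network A directly to B appears to have no effect, which again shows that the first 3 layers learn almost entirely general features. Further up, at layers 4 and 5, accuracy starts to drop, and we can say directly that the features are no longer general. At layers 6 and 7, however, accuracy rises slightly and then drops again. Why? The authors point to two factors: co-adaptation and feature representation. At layers 4 and 5, accuracy drops mainly because datasets A and B differ considerably. At layers 6 and 7, the network barely iterates any more and has very little capacity left to learn, so the features cannot be learned and accuracy drops even further.
+
+  Now look at AnB+. With finetuning added, AnB+ performs very well for almost every n, even slightly better than baseB (the original B). This shows that finetuning is a great help for deep transfer.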
+
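+  Combining the results above gives the following figure (Figure [34](#bookmark138)):
+
+  That essentially completes AnB and BnB. The authors then asked whether the split into datasets A and B might contain similar classes that improved the results. For example, A might contain cats and B might contain lions. To rule this out, they split the data again so that A and B contain almost no similar classes. Repeating AnB under this condition and comparing with the original accuracy (with 0% as the baseline) gives the figure below (Figure [35](#bookmark139)):
+
+  What does this figure show? Simply put, model performance drops as the number of transferred layers grows. However, the first 3 layers can still be transferred. Compared with randomly initializing all weights, transfer learning also achieves high accuracy. In short:
+
+- A deep transfer network works better than randomly initialized weights;
+
+
+- Transferring network layers can speed up learning and optimization of the network.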
+
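+### 11.3.9 What is deep network adaptation?
+
+**Basic idea**
+
+Fine-tuning a deep network saves training time and improves accuracy. But fine-tuning has an inherent limitation: it cannot handle the case where training and test data follow different distributions, which is everywhere in practice. Fine-tuning rests on the basic assumption that training and test data follow the same distribution, and that assumption does not hold in transfer learning either. We therefore need to go one step further and develop better methods for deep networks so that they handle transfer learning tasks better.
+
+Taking the data distribution adaptation methods introduced earlier as a reference, many deep learning methods [[Tzeng et al., 2014](#bookmark307), [Long et al., 2015a](#bookmark275)] add an adaptation layer to adapt between source-domain and target-domain data. Adaptation brings the source and target data distributions closer together, which improves the network's performance.
+
+From the analysis above, adapting a deep network involves two main decisions:
+
+First, which layers to adapt, which determines how much the network learns;
+
+Second, which adaptation method (metric) to use, which determines how well the network generalizes.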
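+
+The most important element of a deep network is the definition of its loss. The vast majority of deep transfer learning methods define the loss as follows:
+
+$$\ell = \ell_c(\mathcal{D}_s, \mathbf{y}_s) + \lambda\,\ell_A(\mathcal{D}_s, \mathcal{D}_t)$$
+
+Here $\ell$ is the network's final loss, $\ell_c(\mathcal{D}_s, \mathbf{y}_s)$ is the ordinary classification loss on the labeled data (mostly from the source domain), exactly as in a conventional deep network, and $\ell_A(\mathcal{D}_s, \mathcal{D}_t)$ is the network's adaptation loss. This last term is what conventional deep networks lack and is unique to transfer learning. It expresses, in principle, the same thing as the distribution discrepancy between source and target domains discussed earlier. $\lambda$ is the weight that balances the two terms.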
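+
+This analysis gives the basic recipe for designing a deep transfer network: choose the adaptation layers, add an adaptation metric at those layers, and finally fine-tune the network.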
+
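The loss described in this section, a classification loss on labeled source data plus a weighted adaptation loss between source and target features, can be sketched in a few lines. The adaptation metric is left open in the text, so the squared distance between mean feature vectors (a linear-kernel MMD) used here is an assumption, as are the function names and the default weight `lam=0.5`.

```python
import math


def cross_entropy(probs, label):
    """Classification loss for one sample: negative log-probability of the true class."""
    return -math.log(probs[label])


def linear_mmd(source_feats, target_feats):
    """Adaptation loss: squared distance between the mean feature vectors."""
    dim = len(source_feats[0])
    mean_s = [sum(f[j] for f in source_feats) / len(source_feats) for j in range(dim)]
    mean_t = [sum(f[j] for f in target_feats) / len(target_feats) for j in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_t))


def total_loss(source_probs, source_labels, source_feats, target_feats, lam=0.5):
    """Total loss = mean classification loss + lam * adaptation loss."""
    l_c = sum(cross_entropy(p, y) for p, y in zip(source_probs, source_labels))
    l_c /= len(source_labels)
    l_a = linear_mmd(source_feats, target_feats)
    return l_c + lam * l_a


# Toy example: one labeled source sample, one unlabeled target sample.
loss = total_loss(
    source_probs=[[0.5, 0.5]],
    source_labels=[0],
    source_feats=[[0.0, 0.0]],
    target_feats=[[1.0, 2.0]],
)
print(round(loss, 4))  # 3.1931
```

Note that only the source samples need labels: the adaptation term compares features alone, which is why it can use unlabeled target data.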
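+### 11.3.10 GANs in transfer learning
+
+Generative adversarial networks (GAN, Generative Adversarial Nets) [[Goodfellow et al., 2014](#bookmark256)] are one of the hottest concepts in artificial intelligence today. Yann LeCun, a leading figure in deep learning, has called them one of the most exciting achievements of recent years. The adversarial networks that grew out of GANs have also become a powerful tool for improving network performance. This subsection introduces the basic idea and representative work on using deep adversarial networks for transfer learning.
+
+**Basic idea**
+
+GANs were inspired by the two-player zero-sum game in game theory. A GAN has two parts. The generative network, called the generator, produces samples that look as realistic as possible. The discriminative network, called the discriminator, judges whether a sample is real or produced by the generator. Adversarial training is the contest between generator and discriminator.
+
+The goal of a GAN is clear: generate training samples. That seems somewhat at odds with the overall goal of transfer learning. However, transfer learning naturally has a source domain and a target domain, so we can skip sample generation and directly treat the data of one domain (usually the target domain) as the generated samples. The generator's role then changes: it no longer generates new samples but acts as a feature extractor, continually learning features of the domain data so that the discriminator cannot tell the two domains apart. The original generator can therefore also be called a feature extractor.
+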
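+
+The feature extractor is usually denoted $G_f$ and the discriminator $G_d$. This domain-adversarial idea is what lets deep adversarial networks work well for transfer learning. As with deep network adaptation, the loss of a deep adversarial network has two parts, the network training loss $\ell_c$ and the domain discrimination loss $\ell_d$:
+
+$$\ell = \ell_c(\mathcal{D}_s, \mathbf{y}_s) + \lambda\,\ell_d(\mathcal{D}_s, \mathcal{D}_t)$$
+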
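+**DANN**
+
+Yaroslav Ganin et al. [[Ganin et al., 2016](#bookmark251)] were the first to add an adversarial mechanism to neural network training, and they called their network DANN (Domain-Adversarial Neural Network). In this work, the learning objective is for the extracted features to be as useful as possible for the label prediction task while leaving the discriminator unable to tell the two domains apart. The domain-adversarial loss of the method is: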
+
+
+
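+$$\ell_d = \max_{u,z}\left[-\frac{1}{n}\sum_{i=1}^{n} L_d^i(W,b,u,z) - \frac{1}{n'}\sum_{i=n+1}^{N} L_d^i(W,b,u,z)\right]$$
+
+where $L_d$ is
+
+$$L_d\big(G_d(G_f(x_i)), d_i\big) = d_i \log\frac{1}{G_d(G_f(x_i))} + (1-d_i)\log\frac{1}{1-G_d(G_f(x_i))}$$
+
+Here $n$ and $n'$ are the numbers of source and target samples ($N = n + n'$), $d_i$ is the domain label of sample $x_i$ (0 for source, 1 for target), $(W, b)$ are the parameters of the feature extractor, and $(u, z)$ are those of the domain discriminator.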
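The per-sample domain discrimination loss referred to above can be illustrated with a small sketch, assuming the usual binary cross-entropy form on the discriminator's output. The function names and the toy probabilities are illustrative only, and the convention that domain label 1 means "target" is an assumption of this sketch.

```python
import math


def domain_loss(p_domain, d):
    """Per-sample loss: d * log(1/p) + (1 - d) * log(1/(1 - p)).

    p_domain is the discriminator's estimated probability that the sample
    has domain label 1; d is the true domain label (0 or 1).
    """
    return d * math.log(1.0 / p_domain) + (1 - d) * math.log(1.0 / (1.0 - p_domain))


def mean_domain_loss(p_list, d_list):
    """Average domain loss over a batch of samples."""
    return sum(domain_loss(p, d) for p, d in zip(p_list, d_list)) / len(p_list)


# A confident, correct discriminator has a low loss ...
confident = mean_domain_loss([0.1, 0.9], [0, 1])
# ... while a fully confused one (always 0.5) has loss log(2) on every sample.
confused = mean_domain_loss([0.5, 0.5], [0, 1])
print(confident < confused)  # True
```

The discriminator is trained to lower this loss, while the feature extractor is trained to push it up toward the confused state, which is the adversarial part of the objective.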
+
+
+
+
+
+
+
+## References
+
+Wang Jindong, *迁移学习简明手册* (A Concise Handbook of Transfer Learning)
+
+[Baktashmotlagh et al., 2013] Baktashmotlagh, M., Harandi, M. T., Lovell, B. C., and Salzmann, M. (2013). Unsupervised domain adaptation by domain invariant projection. In *ICCV*, pages 769-776.
+
+[Baktashmotlagh et al., 2014] Baktashmotlagh, M., Harandi, M. T., Lovell, B. C., and Salzmann, M. (2014). Domain adaptation on the statistical manifold. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, pages 2481-2488.
+
+[Ben-David et al., 2010] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. (2010). A theory of learning from different domains. *Machine Learning*, 79(1-2):151-175.
+
+[Ben-David et al., 2007] Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. (2007). Analysis of representations for domain adaptation. In *NIPS*, pages 137-144.
+
+[Blitzer et al., 2008] Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Wortman, J. (2008). Learning bounds for domain adaptation. In *Advances in neural information processing systems*, pages 129-136.
+
+[Blitzer et al., 2006] Blitzer, J., McDonald, R., and Pereira, F. (2006). Domain adaptation with structural correspondence learning. In *Proceedings of the 2006 conference on empirical methods in natural language processing*, pages 120-128. Association for Computational Linguistics.
+
+[Borgwardt et al., 2006] Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Scholkopf, B., and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. *Bioinformatics*, 22(14):e49-e57.
+
+[Bousmalis et al., 2016] Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., and Erhan, D. (2016). Domain separation networks. In *Advances in Neural Information Processing Systems*, pages 343-351.
+
+[Cai et al., 2011] Cai, D., He, X., Han, J., and Huang, T. S. (2011). Graph regularized nonnegative matrix factorization for data representation. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 33(8):1548-1560.
+
+[Cao et al., 2017] Cao, Z., Long, M., Wang, J., and Jordan, M. I. (2017). Partial transfer learning with selective adversarial networks. *arXiv preprint arXiv:1707.07901*.
+
+[Carlucci et al., 2017] Carlucci, F. M., Porzi, L., Caputo, B., Ricci, E., and Bulo, S. R. (2017). Autodial: Automatic domain alignment layers. In *International Conference on Computer Vision*.
+
+[Cook et al., 2013] Cook, D., Feuz, K. D., and Krishnan, N. C. (2013). Transfer learning for activity recognition: A survey. *Knowledge and information systems*, 36(3):537-556.
+
+[Cortes et al., 2008] Cortes, C., Mohri, M., Riley, M., and Rostamizadeh, A. (2008). Sample selection bias correction theory. In *International Conference on Algorithmic Learning Theory*, pages 38-53, Budapest, Hungary. Springer.
+
+[Dai et al., 2007] Dai, W., Yang, Q., Xue, G.-R., and Yu, Y. (2007). Boosting for transfer learning. In *ICML*, pages 193-200. ACM.
+
+[Davis and Domingos, 2009] Davis, J. and Domingos, P. (2009). Deep transfer via second-order markov logic. In *Proceedings of the 26th annual international conference on machine learning*, pages 217-224. ACM.
+
+[Deng et al., 2014] Deng, W., Zheng, Q., and Wang, Z. (2014). Cross-person activity recognition using reduced kernel extreme learning machine. *Neural Networks*, 53:1-7.
+
+[Donahue et al., 2014] Donahue, J., Jia, Y., et al. (2014). Decaf: A deep convolutional activation feature for generic visual recognition. In *ICML*, pages 647-655.
+
+[Dorri and Ghodsi, 2012] Dorri, F. and Ghodsi, A. (2012). Adapting component analysis. In *Data Mining (ICDM), 2012 IEEE 12th International Conference on*, pages 846-851. IEEE.
+
+[Duan et al., 2012] Duan, L., Tsang, I. W., and Xu, D. (2012). Domain transfer multiple kernel learning. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 34(3):465-479.
+
+[Fernando et al., 2013] Fernando, B., Habrard, A., Sebban, M., and Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In *ICCV*, pages 2960-2967.
+
+[Fodor, 2002] Fodor, I. K. (2002). A survey of dimension reduction techniques. *Center for Applied Scientific Computing, Lawrence Livermore National Laboratory*, 9:1-18.
+
+[Ganin et al., 2016] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). Domain-adversarial training of neural networks. *Journal of Machine Learning Research*, 17(59):1-35.
+
+[Gao et al., 2012] Gao, C., Sang, N., and Huang, R. (2012). Online transfer boosting for object tracking. In *Pattern Recognition (ICPR), 2012 21st International Conference on*, pages 906-909. IEEE.
+
+[Ghifary et al., 2017] Ghifary, M., Balduzzi, D., Kleijn, W. B., and Zhang, M. (2017). Scatter component analysis: A unified framework for domain adaptation and domain generalization. *IEEE transactions on pattern analysis and machine intelligence*, 39(7):1414-1430.
+
+[Ghifary et al., 2014] Ghifary, M., Kleijn, W. B., and Zhang, M. (2014). Domain adaptive neural networks for object recognition. In *PRICAI*, pages 898-904.
+
+[Gong et al., 2012] Gong, B., Shi, Y., Sha, F., and Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In *CVPR*, pages 2066-2073.
+
+[Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In *Advances in neural information processing systems*, pages 2672-2680.
+
+[Gopalan et al., 2011] Gopalan, R., Li, R., and Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In *ICCV*, pages 999-1006. IEEE.
+
+[Gretton et al., 2012] Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. (2012). Optimal kernel choice for large-scale two-sample tests. In *Advances in neural information processing systems*, pages 1205-1213.
+
+[Gu et al., 2011] Gu, Q., Li, Z., Han, J., et al. (2011). Joint feature selection and subspace learning. In *IJCAI Proceedings-International Joint Conference on Artificial Intelligence*, volume 22, page 1294.
+
+[Hamm and Lee, 2008] Hamm, J. and Lee, D. D. (2008). Grassmann discriminant analysis: a unifying view on subspace-based learning. In *ICML*, pages 376-383. ACM.
+
+[Hou et al., 2015] Hou, C.-A., Yeh, Y.-R., and Wang, Y.-C. F. (2015). An unsupervised domain adaptation approach for cross-domain visual classification. In *Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on*, pages 1-6. IEEE.
+
+[Hsiao et al., 2016] Hsiao, P.-H., Chang, F.-J., and Lin, Y.-Y. (2016). Learning discriminatively reconstructed source data for object recognition with few examples. *IEEE Transactions on Image Processing*, 25(8):3518-3532.
+
+[Hu and Yang, 2011] Hu, D. H. and Yang, Q. (2011). Transfer learning for activity recognition via sensor mapping. In *IJCAI Proceedings-International Joint Conference on Artificial Intelligence*, volume 22, page 1962, Barcelona, Catalonia, Spain. IJCAI.
+
+[Huang et al., 2007] Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., Scholkopf, B., et al. (2007). Correcting sample selection bias by unlabeled data. *Advances in neural information processing systems*, 19:601.
+
+[Jaini et al., 2016] Jaini, P., Chen, Z., Carbajal, P., Law, E., Middleton, L., Regan, K., Schaekermann, M., Trimponias, G., Tung, J., and Poupart, P. (2016). Online bayesian transfer learning for sequential data modeling. In *ICLR 2017*.
+
+[Kermany et al., 2018] Kermany, D. S., Goldbaum, M., Cai, W., Valentim, C. C., Liang, H., Baxter, S. L., McKeown, A., Yang, G., Wu, X., Yan, F., et al. (2018). Identifying medical diagnoses and treatable diseases by image-based deep learning. *Cell*, 172(5):1122-1131.
+
+[Khan and Heisterkamp, 2016] Khan, M. N. A. and Heisterkamp, D. R. (2016). Adapting instance weights for unsupervised domain adaptation using quadratic mutual information and subspace learning. In *Pattern Recognition (ICPR), 2016 23rd International Conference on*, pages 1560-1565, Mexican City. IEEE.
+
+[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In *Advances in neural information processing systems*, pages 1097-1105.
+
+[Li et al., 2012] Li, H., Shi, Y., Liu, Y., Hauptmann, A. G., and Xiong, Z. (2012). Cross-domain video concept detection: A joint discriminative and generative active learning approach. *Expert Systems with Applications*, 39(15):12220-12228.
+
+[Li et al., 2016] Li, J., Zhao, J., and Lu, K. (2016). Joint feature selection and structure preservation for domain adaptation. In *Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence*, pages 1697-1703. AAAI Press.
+
+[Li et al., 2018] Li, Y., Wang, N., Shi, J., Hou, X., and Liu, J. (2018). Adaptive batch normalization for practical domain adaptation. *Pattern Recognition*, 80:109-117.
+
+[Liu et al., 2011] Liu, J., Shah, M., Kuipers, B., and Savarese, S. (2011). Cross-view action recognition via view knowledge transfer. In *Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on*, pages 3209-3216, Colorado Springs, CO, USA. IEEE.
+
+[Liu and Tuzel, 2016] Liu, M.-Y. and Tuzel, O. (2016). Coupled generative adversarial networks. In *Advances in neural information processing systems*, pages 469-477.
+
+[Liu et al., 2017] Liu, T., Yang, Q., and Tao, D. (2017). Understanding how feature structure transfers in transfer learning. In *IJCAI*.
+
+[Long et al., 2015a] Long, M., Cao, Y., Wang, J., and Jordan, M. (2015a). Learning transferable features with deep adaptation networks. In *ICML*, pages 97-105.
+
+[Long et al., 2016] Long, M., Wang, J., Cao, Y., Sun, J., and Philip, S. Y. (2016). Deep learning of transferable representation for scalable domain adaptation. *IEEE Transactions on Knowledge and Data Engineering*, 28(8):2027-2040.
+
+[Long et al., 2014a] Long, M., Wang, J., Ding, G., Pan, S. J., and Yu, P. S. (2014a). Adaptation regularization: A general framework for transfer learning. *IEEE TKDE*, 26(5):1076-1089.
+
+[Long et al., 2014b] Long, M., Wang, J., Ding, G., Sun, J., and Yu, P. S. (2014b). Transfer joint matching for unsupervised domain adaptation. In *CVPR*, pages 1410-1417.
+
+[Long et al., 2013] Long, M., Wang, J., et al. (2013). Transfer feature learning with joint distribution adaptation. In *ICCV*, pages 2200-2207.
+
+[Long et al., 2017] Long, M., Wang, J., and Jordan, M. I. (2017). Deep transfer learning with joint adaptation networks. In *ICML*, pages 2208-2217.
+
+[Long et al., 2015b] Long, M., Wang, J., Sun, J., and Philip, S. Y. (2015b). Domain invariant transfer kernel learning. *IEEE Transactions on Knowledge and Data Engineering*, 27(6):1519-1532.
+
+[Luo et al., 2017] Luo, Z., Zou, Y., Hoffman, J., and Fei-Fei, L. F. (2017). Label efficient learning of transferable representations across domains and tasks. In *Advances in Neural Information Processing Systems*, pages 164-176.
+
+[Mihalkova et al., 2007] Mihalkova, L., Huynh, T., and Mooney, R. J. (2007). Mapping and revising markov logic networks for transfer learning. In *AAAI*, volume 7, pages 608-614.
+
+[Mihalkova and Mooney, 2008] Mihalkova, L. and Mooney, R. J. (2008). Transfer learning by mapping with minimal target data. In *Proceedings of the AAAI-08 workshop on transfer learning for complex tasks*.
+
+[Nater et al., 2011] Nater, F., Tommasi, T., Grabner, H., Van Gool, L., and Caputo, B. (2011). Transferring activities: Updating human behavior analysis. In *Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on*, pages 1737-1744, Barcelona, Spain. IEEE.
+
+[Pan et al., 2008a] Pan, S. J., Kwok, J. T., and Yang, Q. (2008a). Transfer learning via dimensionality reduction. In *Proceedings of the 23rd AAAI conference on Artificial intelligence*, volume 8, pages 677-682.
+
+[Pan et al., 2008b] Pan, S. J., Shen, D., Yang, Q., and Kwok, J. T. (2008b). Transferring localization models across space. In *Proceedings of the 23rd AAAI Conference on Artificial Intelligence*, pages 1383-1388.
+
+[Pan et al., 2011] Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. (2011). Domain adaptation via transfer component analysis. *IEEE TNN*, 22(2):199-210.
+
+[Pan and Yang, 2010] Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. *IEEE TKDE*, 22(10):1345-1359.
+
+[Patil and Phursule, 2013] Patil, D. M. and Phursule, R. (2013). Knowledge transfer using cost sensitive online learning classification. *International Journal of Science and Research*, pages 527-529.
+
+[Razavian et al., 2014] Razavian, A. S., Azizpour, H., Sullivan, J., and Carlsson, S. (2014). Cnn features off-the-shelf: an astounding baseline for recognition. In *Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on*, pages 512-519. IEEE.
+
+[Saito et al., 2017] Saito, K., Ushiku, Y., and Harada, T. (2017). Asymmetric tri-training for unsupervised domain adaptation. In *International Conference on Machine Learning*.
+
+[Sener et al., 2016] Sener, O., Song, H. O., Saxena, A., and Savarese, S. (2016). Learning transferrable representations for unsupervised domain adaptation. In *Advances in Neural Information Processing Systems*, pages 2110-2118.
+
+[Shen et al., 2018] Shen, J., Qu, Y., Zhang, W., and Yu, Y. (2018). Wasserstein distance guided representation learning for domain adaptation. In *AAAI*.
+
+[Si et al., 2010] Si, S., Tao, D., and Geng, B. (2010). Bregman divergence-based regularization for transfer subspace learning. *IEEE Transactions on Knowledge and Data Engineering*, 22(7):929-942.
+
+[Silver et al., 2017] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of go without human knowledge. *Nature*, 550(7676):354.
+
+[Stewart and Ermon, 2017] Stewart, R. and Ermon, S. (2017). Label-free supervision of neural networks with physics and domain knowledge. In *AAAI*, pages 2576-2582.
+
+[Sun et al., 2016] Sun, B., Feng, J., and Saenko, K. (2016). Return of frustratingly easy domain adaptation. In *AAAI*, volume 6, page 8.
+
+[Sun and Saenko, 2015] Sun, B. and Saenko, K. (2015). Subspace distribution alignment for unsupervised domain adaptation. In *BMVC*, pages 24-1.
+
+[Sun and Saenko, 2016] Sun, B. and Saenko, K. (2016). Deep coral: Correlation alignment for deep domain adaptation. In *European Conference on Computer Vision*, pages 443-450. Springer.
+
+[Tahmoresnezhad and Hashemi, 2016] Tahmoresnezhad, J. and Hashemi, S. (2016). Visual domain adaptation via transfer feature learning. *Knowledge and Information Systems*, pages 1-21.
+
+[Tan et al., 2015] Tan, B., Song, Y., Zhong, E., and Yang, Q. (2015). Transitive transfer learning. In *Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, pages 1155-1164. ACM.
+
+[Tan et al., 2017] Tan, B., Zhang, Y., Pan, S. J., and Yang, Q. (2017). Distant domain transfer learning. In *Thirty-First AAAI Conference on Artificial Intelligence*.
+
+[Taylor and Stone, 2009] Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. *Journal of Machine Learning Research*, 10(Jul):1633-1685.
+
+[Tzeng et al., 2015] Tzeng, E., Hoffman, J., Darrell, T., and Saenko, K. (2015). Simultaneous deep transfer across domains and tasks. In *Proceedings of the IEEE International Conference on Computer Vision*, pages 4068-4076, Santiago, Chile. IEEE.
+
+[Tzeng et al., 2017] Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017). Adversarial discriminative domain adaptation. In *CVPR*, pages 2962-2971.
+
+[Tzeng et al., 2014] Tzeng, E., Hoffman, J., Zhang, N., et al. (2014). Deep domain confusion: Maximizing for domain invariance. *arXiv preprint arXiv:1412.3474*.
+
+[Wang et al., 2017] Wang, J., Chen, Y., Hao, S., et al. (2017). Balanced distribution adaptation for transfer learning. In *ICDM*, pages 1129-1134.
+
+[Wang et al., 2018] Wang, J., Chen, Y., Hu, L., Peng, X., and Yu, P. S. (2018). Stratified transfer learning for cross-domain activity recognition. In *2018 IEEE International Conference on Pervasive Computing and Communications (PerCom)*.
+
+[Wang et al., 2014] Wang, J., Zhao, P., Hoi, S. C., and Jin, R. (2014). Online feature selection and its applications. *IEEE Transactions on Knowledge and Data Engineering*, 26(3):698-710.
+
+[Wei et al., 2016a] Wei, P., Ke, Y., and Goh, C. K. (2016a). Deep nonlinear feature coding for unsupervised domain adaptation. In *IJCAI*, pages 2189-2195.
+
+[Wei et al., 2017] Wei, Y., Zhang, Y., and Yang, Q. (2017). Learning to transfer. *arXiv preprint arXiv:1708.05629*.
+
+[Wei et al., 2016b] Wei, Y., Zhu, Y., Leung, C. W.-k., Song, Y., and Yang, Q. (2016b). Instilling social to physical: Co-regularized heterogeneous transfer learning. In *Thirtieth AAAI Conference on Artificial Intelligence*.
+
+[Weiss et al., 2016] Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. *Journal of Big Data*, 3(1):1-40.
+
+[Wu et al., 2017] Wu, Q., Zhou, X., Yan, Y., Wu, H., and Min, H. (2017). Online transfer learning by leveraging multiple source domains. *Knowledge and Information Systems*, 52(3):687-707.
+
+[xinhua, 2016] xinhua (2016). http://mp.weixin.qq.com/s?__biz=MjM5ODYzNzAyMQ==&mid=2651933920&idx=1&sn=ae2866bd12000f1644eae1094497837e.
+
+[Yan et al., 2017] Yan, Y., Wu, Q., Tan, M., Ng, M. K., Min, H., and Tsang, I. W. (2017). Online heterogeneous transfer by hedge ensemble of offline and online decisions. *IEEE transactions on neural networks and learning systems*.
+
+[Yosinski et al., 2014] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In *Advances in neural information processing systems*, pages 3320-3328.
+
+[Zadrozny, 2004] Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In *Proceedings of the twenty-first international conference on Machine learning*, page 114, Alberta, Canada. ACM.
+
+[Zellinger et al., 2017] Zellinger, W., Grubinger, T., Lughofer, E., Natschlager, T., and Saminger-Platz, S. (2017). Central moment discrepancy (cmd) for domain-invariant representation learning. *arXiv preprint arXiv:1702.08811*.
+
+[Zhang et al., 2017a] Zhang, J., Li, W., and Ogunbona, P. (2017a). Joint geometrical and statistical alignment for visual domain adaptation. In *CVPR*.
+
+[Zhang et al., 2017b] Zhang, X., Zhuang, Y., Wang, W., and Pedrycz, W. (2017b). Online feature transformation learning for cross-domain object category recognition. *IEEE transactions on neural networks and learning systems*.
+
+[Zhao and Hoi, 2010] Zhao, P. and Hoi, S. C. (2010). Otl: A framework of online transfer learning. In *Proceedings of the 27th international conference on machine learning (ICML-10)*, pages 1231-1238.
+
+[Zhao et al., 2010] Zhao, Z., Chen, Y., Liu, J., and Liu, M. (2010). Cross-mobile elm based activity recognition. *International Journal of Engineering and Industries*, 1(1):30-38.
+
+[Zhao et al., 2011] Zhao, Z., Chen, Y., Liu, J., Shen, Z., and Liu, M. (2011). Cross-people mobile-phone based activity recognition. In *Proceedings of the Twenty-Second international joint conference on Artificial Intelligence (IJCAI)*, volume 11, pages 2545-2550. Citeseer.
+
+[Zheng et al., 2009] Zheng, V. W., Hu, D. H., and Yang, Q. (2009). Cross-domain activity recognition. In *Proceedings of the 11th international conference on Ubiquitous computing*, pages 61-70. ACM.
+
+[Zheng et al., 2008] Zheng, V. W., Pan, S. J., Yang, Q., and Pan, J. J. (2008). Transferring multi-device localization models using latent multi-task learning. In *AAAI*, volume 8, pages 1427-1432, Chicago, Illinois, USA. AAAI.
+
+[Zhuang et al., 2015] Zhuang, F., Cheng, X., Luo, P., Pan, S. J., and He, Q. (2015). Supervised representation learning: Transfer learning with deep autoencoders. In *IJCAI*, pages 4119-4125.
+
+[Zhuo et al., 2017] Zhuo, J., Wang, S., Zhang, W., and Huang, Q. (2017). Deep unsupervised convolutional domain adaptation. In *Proceedings of the 2017 ACM on Multimedia Conference*, pages 261-269. ACM.
diff --git "a/ch12_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203/img/ch12/readme.md" "b/ch12_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203/img/ch12/readme.md"
deleted file mode 100644
index f099ebf1..00000000
--- "a/ch12_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203/img/ch12/readme.md"
+++ /dev/null
@@ -1,2 +0,0 @@
-Add the corresponding chapter picture under img/ch*
-img/ch*ӶӦ½ͼƬ
\ No newline at end of file
diff --git "a/ch12_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203/\347\254\254\345\215\201\344\272\214\347\253\240_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203.md" "b/ch12_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203/\347\254\254\345\215\201\344\272\214\347\253\240_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203.md"
index e4f8210c..ee02c68c 100644
--- "a/ch12_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203/\347\254\254\345\215\201\344\272\214\347\253\240_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203.md"
+++ "b/ch12_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203/\347\254\254\345\215\201\344\272\214\347\253\240_\347\275\221\347\273\234\346\220\255\345\273\272\345\217\212\350\256\255\347\273\203.md"
@@ -3,44 +3,6 @@
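# Chapter 12: Network Construction and Training
-Contents
-Introduction to common frameworks
-Comparison of common frameworks (as a table): the 16 best deep learning frameworks https://baijiahao.baidu.com/s?id=1599943447101946075&wfr=spider&for=pc
-Network construction example based on TensorFlow
-Notes on training CNNs
-Training tips
-Pain points in training deep learning models and how to address them https://blog.csdn.net/weixin_40581617/article/details/80537559
-Deep learning model training workflow https://blog.csdn.net/Quincuntial/article/details/79242364
-Deep learning model training tips https://blog.csdn.net/w7256037/article/details/52071345
-https://blog.csdn.net/u012033832/article/details/79017951
-https://blog.csdn.net/u012968002/article/details/72122965
-
-Major difficulties in deep learning https://blog.csdn.net/m0_37867246/article/details/79766371
-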
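-## Notes on training CNNs
-http://www.cnblogs.com/softzrp/p/6724884.html
-1. Training a neural network with mini-batch SGD proceeds as follows:
-
-Repeat:
-
-① Sample a batch of data (for example, 32 images)
-
-② Run the forward pass to compute the loss
-
-③ Backpropagate to compute the gradients (over one batch)
-
-④ Use these gradients to update the weights
-
-2. Mean subtraction
-
-There are generally two ways to subtract the mean. The first computes, at every pixel position, the mean of each of the 3 color channels (a mean image) and subtracts it, as in AlexNet. The second computes a single set of values over the whole dataset (one mean per channel), without distinguishing pixel positions, as in VGGNet.
-3. Weight initialization
-4. Dropout
-
-
-
-# Chapter 12: Introduction to TensorFlow, PyTorch, and Caffe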
-
# 12.1 TensorFlow
## 12.1.1 What is TensorFlow?
diff --git "a/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/img/ch13/figure_13_18_1.png" "b/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/img/ch13/figure_13_18_1.png"
index 8edafaf4..5d53edc7 100644
Binary files "a/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/img/ch13/figure_13_18_1.png" and "b/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/img/ch13/figure_13_18_1.png" differ
diff --git "a/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/img/ch13/figure_13_19_1.png" "b/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/img/ch13/figure_13_19_1.png"
index 4c283d61..0ada3972 100644
Binary files "a/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/img/ch13/figure_13_19_1.png" and "b/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/img/ch13/figure_13_19_1.png" differ
diff --git "a/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/\347\254\254\345\215\201\344\270\211\347\253\240_\344\274\230\345\214\226\347\256\227\346\263\225.md" "b/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/\347\254\254\345\215\201\344\270\211\347\253\240_\344\274\230\345\214\226\347\256\227\346\263\225.md"
index 88eae7fd..4641022b 100644
--- "a/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/\347\254\254\345\215\201\344\270\211\347\253\240_\344\274\230\345\214\226\347\256\227\346\263\225.md"
+++ "b/ch13_\344\274\230\345\214\226\347\256\227\346\263\225/\347\254\254\345\215\201\344\270\211\347\253\240_\344\274\230\345\214\226\347\256\227\346\263\225.md"
@@ -4,222 +4,124 @@
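# Chapter 13: Optimization Algorithms
+## 13.1 How to handle a shortage of training samples
-## 13.1 What is the difference between a CPU and a GPU?
-
-
-**Concepts:**
+Most deep learning models still depend on massive amounts of data. ImageNet, for example, contains more than 14 million images. In real production settings, datasets are usually much smaller, with only tens of thousands or even a few hundred samples. How can deep learning be applied in such cases?
+(1) Fine-tune a pretrained model (transfer learning). Pretrained models usually encode semantically rich features, so fine-tuning the model on the small dataset is often enough to get good results. This is currently the usual way to train on most small datasets. In computer vision, models pretrained on ImageNet are typically used; in natural language processing, pretrained models such as BERT are available.
-CPU stands for central processing unit. It is a very-large-scale integrated circuit that serves as the computing and control core of a computer; its main job is to interpret instructions and process the data in software.
+(2) One-shot or few-shot learning. This suits extreme datasets in which there are many classes but very few samples per class, for example 1000 classes with only 1 to 5 samples each. Few-shot learning also relies on pretrained models, but it differs from fine-tuning: fine-tuning usually still learns the semantics of the individual classes, whereas few-shot learning usually learns a distance metric between samples. A Siamese network (Siamese Neural Network), for example, trains two networks with the same structure (typically with shared weights) to decide whether two input images belong to the same class.
+These two are the common ways to train on small datasets. Other common techniques, such as data augmentation, regularization, and semi-supervised learning, also help with small-sample training.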
-
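To make the distance-metric idea behind one-shot learning concrete, here is a minimal sketch that classifies a query by finding the nearest support example. It assumes the embeddings have already been produced by some pretrained or Siamese-style network; the two-dimensional vectors, the class names, and the choice of Euclidean distance are all illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_shot_classify(query, support):
    """Return the label of the support example closest to the query embedding.

    support maps each class label to the embedding of its single example.
    """
    return min(support, key=lambda label: euclidean(query, support[label]))


# One example per class (hypothetical embeddings).
support_set = {"cat": [0.9, 0.1], "dog": [0.1, 0.8], "bird": [0.5, 0.5]}
print(one_shot_classify([0.8, 0.2], support_set))  # cat
```

Adding a new class only requires adding one embedding to the support set, with no retraining of the classifier, which is what makes this approach workable with 1 to 5 samples per class.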
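-GPU stands for graphics processing unit. It converts and drives the display information a computer system needs, supplies scan signals to the monitor, and controls what the monitor displays. It is a key component connecting the monitor to the PC motherboard and one of the important devices for human-computer interaction.
+## 13.2 Can deep learning handle every dataset?
-
-**Cache:**
-
-A CPU has an extensive cache hierarchy; mainstream CPU chips currently have four levels of cache. These caches consume a large number of transistors and a lot of power at run time. GPU caches, by contrast, are simple; mainstream GPU chips currently have at most two levels. The die area and power that a CPU spends on those transistors can go into ALUs on a GPU, which is why GPUs are somewhat more efficient than CPUs.
+Deep learning is not suited to every data setting. Two cases are listed below:
-
-**Response pattern:**
-
-A CPU is expected to respond in real time and needs high single-task speed, so it uses many cache levels to keep single-task latency low. For a GPU, nobody cares when the first pixel is finished, only when the last one is, so the GPU queues up all the work and processes it in batches, which places much lower demands on the cache. As a rough analogy, given 10 mouse clicks, the CPU must respond promptly to each one, whereas the GPU waits until the 10th click and then handles them all in one batch.
+(1) Deep learning owes much of its current success to massive datasets and high-performance, compute-dense hardware. When the dataset is too small, consider whether deep learning really has an advantage over traditional machine learning in accuracy and hardware efficiency.
+(2) Deep learning has achieved good results in fields such as vision and natural language processing. The defining trait of these fields is local correlation: in an image, a person's ears sit on either side of the head and the nose between the eyes; in text, words form sentences. Shuffling these locally correlated elements destroys the meaning or changes it. When the data has no such correlation, deep learning struggles to deliver good results.
-
-**Floating-point computation:**
-
-Besides floating-point and integer arithmetic, a CPU carries the load of many other instruction sets, such as multimedia decoding and hardware decoding, so it is a versatile device. A GPU does essentially nothing but floating-point arithmetic, and because of that its design is simpler and can be made faster. Graphics-card GPUs also differ somewhat from GPUs built purely for high-performance floating-point computing: a graphics-card GPU must also handle display output, whereas some dedicated GPU devices are just a PCI card carrying a powerful floating-point GPU with no display output, built to accelerate the floating-point work of certain programs. A CPU emphasizes single-thread performance, that is, latency. It must keep the instruction stream from stalling, so it spends more transistors and power on control logic, which leaves less of its power budget for floating-point computation. A GPU emphasizes throughput: a single instruction drives many computations, so comparatively little power goes to control logic and the savings can be spent on floating-point computation.
+## 13.3 Is it possible to find an algorithm better than the known ones?
-
-**Application areas:**
-
-Applications such as operating systems must respond quickly to real-time information and are optimized for latency, so transistors and power go into control logic such as branch prediction, out-of-order execution, and low-latency caches; this is what CPUs are good at. Workloads such as matrix operations, which are highly predictable and consist of many similar computations, suit a high-latency, high-throughput architecture, which is exactly what a GPU provides.
+Optimization theory has a no free lunch theorem. Its main point is that, without considering the specific context and details, every algorithm has the same expected performance as random guessing. In other words, no algorithm outperforms all others on every problem; averaged over all problems, none is even better than random guessing. Deep learning, as a branch of machine learning, is subject to this theorem as well. So although deep learning has achieved very good results, we should not follow it blindly.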
-
-**浅显解释:**
-
-一块 CPU 相当于一个数学教授,一块 GPU 相当于 100 个小学生。
-
-第一回合,四则运算,一百个题。教授拿到卷子一道道计算。100 个小学生各拿一道题。 教授刚开始计算到第二题的时候,小学生就集体交卷了。
-
-第二回合,高等函数,一百个题。当教授搞定后。一百个小学生可能还不知道该做些什么。
-
-这两个回合就是 CPU 和 GPU 的区别了。
+优化算法本质上是在寻找和探索更符合数据集和问题的算法,这里数据集是算法的驱动力,而需要通过数据集解决的问题就是算法的核心,任何算法脱离了数据都会没有实际价值,任何算法的假设都不能脱离实际问题。因此,实际应用中,面对不同的场景和不同的问题,可以从多个角度针对问题进行分析,寻找更优的算法。
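-## 13.2 How to handle a shortage of training samples
+## 13.4 What is collinearity, and how can it be detected and resolved?
-
-Training a good CNN model usually requires a lot of training data, especially when the model architecture is complex, as with models trained on ImageNet. Although deep learning has been hugely successful on ImageNet, a practical problem is that many applications have small training sets. How can deep learning be applied in that case? Three approaches are offered for reference.
+In regression, whether ordinary regression or logistic regression, the predictor variables may be correlated with one another when several are used for prediction; this is multicollinearity. Collinearity makes the features redundant, which destabilizes the coefficient estimates and can lead to overfitting.
-
-(1) Take a model trained on ImageNet as the starting point and continue training it on the target training set with backpropagation, adapting the model to the specific application. ImageNet serves as pretraining.
-
-(2) If the target training set is not large enough, fix the parameters of the lower layers, reusing what was learned on ImageNet, and update only the upper layers. This is because the lower-layer parameters are the hardest to update, and the low-level filters learned from ImageNet tend to describe a variety of local edge and texture information that generalizes well to images in general.
-
-(3) Use the ImageNet-trained model directly, taking the output of the top hidden layer as the feature representation in place of the usual hand-designed features.
+Common ways to check for collinearity are:
-## 13.3 What kinds of datasets are unsuitable for deep learning?
+(1) Correlation analysis. A correlation coefficient above 0.8 indicates multicollinearity; a low correlation coefficient, however, does not show that multicollinearity is absent;
-
-(1) The dataset is too small. With too few samples, deep learning has no clear advantage over other machine learning algorithms.
-
-(2) The dataset has no local correlation. The fields where deep learning currently performs well are mainly image, speech, and natural language processing, and what they share is local correlation: pixels in an image form objects, phonemes in a speech signal combine into words, and words in text combine into sentences. Once the combination of these feature elements is shuffled, the meaning changes too. Datasets without this kind of local correlation are not suited to deep learning. For example, when predicting a person's health, the relevant parameters include age, occupation, income, family situation, and so on; shuffling the order of these elements does not affect the result.
+(2) Variance inflation factor (VIF). A VIF above 5 or 10 indicates a serious collinearity problem in the model;
-## 13.4 Is it possible to find an algorithm better than the known ones?
+(3) Condition number test. A condition number above 100 or 1000 indicates a serious collinearity problem in the model.
+Collinearity can usually be removed with methods such as PCA dimensionality reduction, stepwise regression, and LASSO regression.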
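The three checks listed above can be tried out on synthetic data. The sketch below uses NumPy and computes VIF as the diagonal of the inverse correlation matrix, which is equivalent to regressing each feature on the others. The synthetic data, the helper names `vif` and `condition_number`, and the choice to standardise the columns before taking the condition number are assumptions of this sketch, not a prescribed procedure.

```python
import numpy as np


def vif(X):
    """Variance inflation factor per column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def condition_number(X):
    """Condition number of the standardised design matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.linalg.cond(Z)


rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + 0.01 * rng.normal(size=500)  # nearly a linear combination of x1 and x2

independent = np.column_stack([x1, x2])
collinear = np.column_stack([x1, x2, x3])

print(np.abs(np.corrcoef(collinear, rowvar=False)).round(2))  # pairwise correlations
print(vif(independent))             # close to 1: no collinearity
print(vif(collinear))               # far above 10: severe collinearity
print(condition_number(collinear))  # far above 100
```

In this example the pairwise correlations of `x3` with `x1` and `x2` are only around 0.7, yet the VIF values are very large, which illustrates the caveat in item (1): a low pairwise correlation does not rule out multicollinearity.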
+## 13.5 What weight initialization methods are there?
-
-The no free lunch theorem:
-
-