
Commit 816bdea

1 parent a6da0bf commit 816bdea

14 files changed (+19867 −13 lines)

.ipynb_checkpoints/02. LassoRidge Regression-checkpoint.ipynb (+19 −2)

@@ -116,6 +116,10 @@
 "metadata": {},
 "source": [
 "## 5. Lasso Regression\n",
+"- a type of linear regression that uses shrinkage\n",
+"- in shrinkage, data values are shrunk towards a central point such as the mean\n",
+"- encourages models with fewer parameters\n",
+"- well-suited for models showing high levels of multicollinearity\n",
 "- Add penalty for large coefficients \n",
 "- Penalty function – L1 norm of regression coefficients \n",
 "- Penalty weighted by hyperparameter (alpha) \n",
@@ -354,7 +358,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Lasso"
+"## Lasso (Least Absolute Shrinkage and Selection Operator)"
 ]
 },
 {
@@ -658,7 +662,9 @@
 {
 "cell_type": "code",
 "execution_count": 35,
-"metadata": {},
+"metadata": {
+"scrolled": false
+},
 "outputs": [
 {
 "name": "stdout",
@@ -707,6 +713,17 @@
 "r_square"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Differences between Lasso and Ridge\n",
+"* Ridge Regression is used when regressors are grouped (multicollinearity)\n",
+"* Ridge shrinks a group of coefficients proportionally, whereas Lasso does not\n",
+"* Lasso sets individual regression coefficients to 0 to reduce model size\n",
+"* Higher-dimensional problems often exhibit multicollinearity, so Ridge tends to perform better\n"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {

.ipynb_checkpoints/03. Naive Bayes-checkpoint.ipynb (+121 −1)

@@ -18,6 +18,11 @@
 "8. It's simple & out-performs many sophisticated methods. \n",
 "9. Stable to data changes. \n",
 "\n",
+"## Three types of Naive Bayes\n",
+"* Gaussian Naive Bayes - feature columns are normally distributed\n",
+"* Multinomial Naive Bayes - feature columns are counts\n",
+"* Bernoulli Naive Bayes - feature columns are Boolean\n",
+"\n",
 "\n",
 "## Bayes’s Theorem\n",
 "It describes the probability of an event, based on prior knowledge of conditions that might be related to the event. \n",
@@ -258,6 +263,121 @@
 "nb_model = joblib.load(\"pima-trained-model.pkl\")"
 ]
 },
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"# Comparison\n",
+"### Bernoulli Naive Bayes\n",
+"* assumes features are binary (0 or 1)\n",
+"* 0: the word does not occur in the document\n",
+"* 1: the word occurs in the document\n",
+"\n",
+"### Multinomial Naive Bayes\n",
+"* used for discrete data (e.g., rolling dice, movie ratings from 1 to 10)\n",
+"* in text learning, the count of each word is used to predict the class or label\n",
+"\n",
+"### Gaussian Naive Bayes\n",
+"* used when all features are continuous and assumed to be normally distributed\n"
+]
+},
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"# Bernoulli vs Multinomial\n",
+"In the case of an email classifier:\n",
+"### Bernoulli\n",
+"* Assume spam mail has an email handle in the subject\n",
+"* Build a feature that is 0 when it is absent and 1 when it is present\n",
+"* Binomial distribution\n",
+"\n",
+"### Multinomial\n",
+"* In addition to the above, more dollar signs make spam more likely\n",
+"* Likewise for words such as CASH or LOTTERY\n",
+"* Label these words by their count\n",
+"* Multinomial distribution"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 1,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"<class 'pandas.core.frame.DataFrame'>\n",
+"RangeIndex: 19579 entries, 0 to 19578\n",
+"Data columns (total 3 columns):\n",
+" # Column Non-Null Count Dtype \n",
+"--- ------ -------------- ----- \n",
+" 0 id 19579 non-null object\n",
+" 1 text 19579 non-null object\n",
+" 2 author 19579 non-null object\n",
+"dtypes: object(3)\n",
+"memory usage: 459.0+ KB\n",
+"None\n",
+" id text author\n",
+"658 id10627 I did; but the fragile spirit clung to its ten... EAP\n",
+"4187 id00256 I have merely set down certain things appealin... HPL\n",
+"267 id08711 The remains of the half finished creature, who... MWS\n",
+"6672 id18249 \"No, Justine,\" said Elizabeth; \"he is more con... MWS\n",
+"12051 id20451 In the rash pursuit of this object, he rushes ... EAP\n",
+"EAP 7900\n",
+"MWS 6044\n",
+"HPL 5635\n",
+"Name: author, dtype: int64\n",
+"0.8265577119509704\n"
+]
+},
+{
+"data": {
+"text/plain": [
+"array([[1623, 169, 118],\n",
+" [ 134, 1098, 65],\n",
+" [ 242, 121, 1325]], dtype=int64)"
+]
+},
+"execution_count": 1,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"import pandas as pd\n",
+"dataset = pd.read_csv(\"Data/Classification/horror-train.csv\")\n",
+"print(dataset.info())\n",
+"print(dataset.sample(5))\n",
+"print(dataset.author.value_counts())\n",
+"X = dataset.text\n",
+"y = dataset.author\n",
+"\n",
+"from sklearn.model_selection import train_test_split\n",
+"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)\n",
+"\n",
+"from sklearn.feature_extraction.text import CountVectorizer\n",
+"spam_fil = CountVectorizer(stop_words='english')\n",
+"\n",
+"X_train = spam_fil.fit_transform(X_train).toarray()\n",
+"X_test = spam_fil.transform(X_test).toarray()\n",
+"\n",
+"from sklearn.naive_bayes import MultinomialNB\n",
+"mnb = MultinomialNB()\n",
+"\n",
+"mnb.fit(X_train, y_train)\n",
+"\n",
+"print(mnb.score(X_test, y_test))\n",
+"\n",
+"y_pred = mnb.predict(X_test)\n",
+"\n",
+"from sklearn.metrics import confusion_matrix\n",
+"confusion_matrix(y_pred, y_test)"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -282,7 +402,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.6"
+"version": "3.8.5"
 }
 },
 "nbformat": 4,

02. LassoRidge Regression.ipynb (+19 −2)

@@ -116,6 +116,10 @@
 "metadata": {},
 "source": [
 "## 5. Lasso Regression\n",
+"- a type of linear regression that uses shrinkage\n",
+"- in shrinkage, data values are shrunk towards a central point such as the mean\n",
+"- encourages models with fewer parameters\n",
+"- well-suited for models showing high levels of multicollinearity\n",
 "- Add penalty for large coefficients \n",
 "- Penalty function – L1 norm of regression coefficients \n",
 "- Penalty weighted by hyperparameter (alpha) \n",
@@ -354,7 +358,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Lasso"
+"## Lasso (Least Absolute Shrinkage and Selection Operator)"
 ]
 },
 {
@@ -658,7 +662,9 @@
 {
 "cell_type": "code",
 "execution_count": 35,
-"metadata": {},
+"metadata": {
+"scrolled": false
+},
 "outputs": [
 {
 "name": "stdout",
@@ -707,6 +713,17 @@
 "r_square"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Differences between Lasso and Ridge\n",
+"* Ridge Regression is used when regressors are grouped (multicollinearity)\n",
+"* Ridge shrinks a group of coefficients proportionally, whereas Lasso does not\n",
+"* Lasso sets individual regression coefficients to 0 to reduce model size\n",
+"* Higher-dimensional problems often exhibit multicollinearity, so Ridge tends to perform better\n"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {
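
The Lasso-vs-Ridge points added in this notebook can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the notebook's own experiment; the `make_regression` parameters and `alpha=1.0` are arbitrary choices:

```python
# Sketch: Lasso zeroes out coefficients (feature selection),
# while Ridge only shrinks them toward zero.
# Synthetic data; all parameters below are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso's L1 penalty drives many coefficients to exactly 0
print("Lasso zero coefficients:", int((lasso.coef_ == 0).sum()))
# Ridge's L2 penalty shrinks coefficients but leaves them non-zero
print("Ridge zero coefficients:", int((ridge.coef_ == 0).sum()))
```

Comparing the two coefficient vectors makes the notebook's point concrete: the L1 penalty performs model-size reduction, the L2 penalty only proportional shrinkage.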

03. Naive Bayes.ipynb (+120)

@@ -18,6 +18,11 @@
 "8. It's simple & out-performs many sophisticated methods. \n",
 "9. Stable to data changes. \n",
 "\n",
+"## Three types of Naive Bayes\n",
+"* Gaussian Naive Bayes - feature columns are normally distributed\n",
+"* Multinomial Naive Bayes - feature columns are counts\n",
+"* Bernoulli Naive Bayes - feature columns are Boolean\n",
+"\n",
 "\n",
 "## Bayes’s Theorem\n",
 "It describes the probability of an event, based on prior knowledge of conditions that might be related to the event. \n",
@@ -258,6 +263,121 @@
 "nb_model = joblib.load(\"pima-trained-model.pkl\")"
 ]
 },
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"# Comparison\n",
+"### Bernoulli Naive Bayes\n",
+"* assumes features are binary (0 or 1)\n",
+"* 0: the word does not occur in the document\n",
+"* 1: the word occurs in the document\n",
+"\n",
+"### Multinomial Naive Bayes\n",
+"* used for discrete data (e.g., rolling dice, movie ratings from 1 to 10)\n",
+"* in text learning, the count of each word is used to predict the class or label\n",
+"\n",
+"### Gaussian Naive Bayes\n",
+"* used when all features are continuous and assumed to be normally distributed\n"
+]
+},
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"# Bernoulli vs Multinomial\n",
+"In the case of an email classifier:\n",
+"### Bernoulli\n",
+"* Assume spam mail has an email handle in the subject\n",
+"* Build a feature that is 0 when it is absent and 1 when it is present\n",
+"* Binomial distribution\n",
+"\n",
+"### Multinomial\n",
+"* In addition to the above, more dollar signs make spam more likely\n",
+"* Likewise for words such as CASH or LOTTERY\n",
+"* Label these words by their count\n",
+"* Multinomial distribution"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 1,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"<class 'pandas.core.frame.DataFrame'>\n",
+"RangeIndex: 19579 entries, 0 to 19578\n",
+"Data columns (total 3 columns):\n",
+" # Column Non-Null Count Dtype \n",
+"--- ------ -------------- ----- \n",
+" 0 id 19579 non-null object\n",
+" 1 text 19579 non-null object\n",
+" 2 author 19579 non-null object\n",
+"dtypes: object(3)\n",
+"memory usage: 459.0+ KB\n",
+"None\n",
+" id text author\n",
+"658 id10627 I did; but the fragile spirit clung to its ten... EAP\n",
+"4187 id00256 I have merely set down certain things appealin... HPL\n",
+"267 id08711 The remains of the half finished creature, who... MWS\n",
+"6672 id18249 \"No, Justine,\" said Elizabeth; \"he is more con... MWS\n",
+"12051 id20451 In the rash pursuit of this object, he rushes ... EAP\n",
+"EAP 7900\n",
+"MWS 6044\n",
+"HPL 5635\n",
+"Name: author, dtype: int64\n",
+"0.8265577119509704\n"
+]
+},
+{
+"data": {
+"text/plain": [
+"array([[1623, 169, 118],\n",
+" [ 134, 1098, 65],\n",
+" [ 242, 121, 1325]], dtype=int64)"
+]
+},
+"execution_count": 1,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"import pandas as pd\n",
+"dataset = pd.read_csv(\"Data/Classification/horror-train.csv\")\n",
+"print(dataset.info())\n",
+"print(dataset.sample(5))\n",
+"print(dataset.author.value_counts())\n",
+"X = dataset.text\n",
+"y = dataset.author\n",
+"\n",
+"from sklearn.model_selection import train_test_split\n",
+"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)\n",
+"\n",
+"from sklearn.feature_extraction.text import CountVectorizer\n",
+"spam_fil = CountVectorizer(stop_words='english')\n",
+"\n",
+"X_train = spam_fil.fit_transform(X_train).toarray()\n",
+"X_test = spam_fil.transform(X_test).toarray()\n",
+"\n",
+"from sklearn.naive_bayes import MultinomialNB\n",
+"mnb = MultinomialNB()\n",
+"\n",
+"mnb.fit(X_train, y_train)\n",
+"\n",
+"print(mnb.score(X_test, y_test))\n",
+"\n",
+"y_pred = mnb.predict(X_test)\n",
+"\n",
+"from sklearn.metrics import confusion_matrix\n",
+"confusion_matrix(y_pred, y_test)"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
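
The Bernoulli-vs-Multinomial distinction added in this notebook can be sketched on a toy corpus. The messages and labels below are invented for illustration; the point is that the same texts get binary features for `BernoulliNB` and count features for `MultinomialNB`:

```python
# Sketch: the same texts vectorized two ways, one per NB variant.
# The tiny spam/ham corpus is an illustrative assumption, not real data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

texts = [
    "WIN CASH CASH CASH now",    # spam: repeated words carry weight
    "claim LOTTERY prize CASH",  # spam
    "meeting agenda for monday", # ham
    "lunch on monday?",          # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Multinomial NB: features are raw word counts
counts = CountVectorizer().fit(texts)
mnb = MultinomialNB().fit(counts.transform(texts), labels)

# Bernoulli NB: features are word presence/absence (0/1)
binary = CountVectorizer(binary=True).fit(texts)
bnb = BernoulliNB().fit(binary.transform(texts), labels)

print(mnb.predict(counts.transform(["CASH CASH prize"])))
print(bnb.predict(binary.transform(["agenda for lunch"])))
```

`CountVectorizer(binary=True)` clips counts to 0/1, which matches Bernoulli NB's assumption; `BernoulliNB` also penalizes the *absence* of words, which Multinomial NB ignores.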

03. Support Vector Machines.ipynb (+1 −1)

@@ -728,7 +728,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.6"
+"version": "3.8.5"
 }
 },
 "nbformat": 4,

05. Artificial Neural Network.ipynb (+1 −1)

@@ -700,7 +700,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.5.5"
+"version": "3.8.5"
 }
 },
 "nbformat": 4,
