bansalrishi
diff --git a/.ipynb_checkpoints/02. LassoRidge Regression-checkpoint.ipynb b/.ipynb_checkpoints/02. LassoRidge Regression-checkpoint.ipynb (+19 -2)
diff --git a/.ipynb_checkpoints/03. Naive Bayes-checkpoint.ipynb b/.ipynb_checkpoints/03. Naive Bayes-checkpoint.ipynb (+121 -1)
diff --git a/02. LassoRidge Regression.ipynb b/02. LassoRidge Regression.ipynb (+19 -2)
diff --git a/03. Naive Bayes.ipynb b/03. Naive Bayes.ipynb (+120)
diff --git a/03. Support Vector Machines.ipynb b/03. Support Vector Machines.ipynb (+1 -1)
diff --git a/05. Artificial Neural Network.ipynb b/05. Artificial Neural Network.ipynb (+1 -1)
@@ -116,6 +116,10 @@
    "metadata": {},
    "source": [
     "## 5. Lasso Regression\n",
+    "- a type of linear regression that uses shrinkage\n",
+    "- in shrinkage, data values are shrunk towards a central point such as the mean\n",
+    "- encourages models with fewer parameters\n",
+    "- well-suited for models showing high levels of multicollinearity\n",
     "- Add penalty for large coefficients  \n",
     "- Penalty function – L1 norm of regression coefficients  \n",
     "- Penalty weighted by hyperparameter (alpha)  \n",
@@ -354,7 +358,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Lasso"
+    "## Lasso (Least Absolute Shrinkage and Selection Operator)"
    ]
   },
   {
@@ -658,7 +662,9 @@
   {
    "cell_type": "code",
    "execution_count": 35,
-   "metadata": {},
+   "metadata": {
+    "scrolled": false
+   },
    "outputs": [
     {
      "name": "stdout",
@@ -707,6 +713,17 @@
     "r_square"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Difference between Lasso and Ridge\n",
+    "* Ridge Regression is used when regressors are grouped (multicollinear)\n",
+    "* Ridge shrinks the coefficients of a correlated group proportionally, whereas Lasso doesn't\n",
+    "* Lasso sets individual regression coefficients to 0 to reduce model size\n",
+    "* Higher-dimensional problems tend to have multicollinearity, so Ridge often performs better\n"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {
 
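The contrast drawn in the Lasso-Ridge notes above (Lasso zeroes individual coefficients, Ridge shrinks proportionally) can be seen in the textbook special case of an orthonormal design matrix, where both penalties have closed-form effects on the least-squares coefficients. The sketch below is an illustration under that assumption only: the coefficient values are made up, the helper names are hypothetical, and the scaling of `alpha` is chosen for readability and does not match scikit-learn's `Lasso` or `Ridge`.

```python
def lasso_shrink(beta, alpha):
    """Soft-thresholding: move beta towards 0 by alpha, clipping at exactly 0."""
    if beta > alpha:
        return beta - alpha
    if beta < -alpha:
        return beta + alpha
    return 0.0


def ridge_shrink(beta, alpha):
    """Proportional shrinkage: scale beta by 1 / (1 + alpha)."""
    return beta / (1.0 + alpha)


# Made-up least-squares coefficients, for illustration only.
ols_coefficients = [3.0, -1.5, 0.4, -0.2]
alpha = 0.5

lasso_coefficients = [lasso_shrink(b, alpha) for b in ols_coefficients]
ridge_coefficients = [ridge_shrink(b, alpha) for b in ols_coefficients]

print("lasso:", lasso_coefficients)  # small coefficients become exactly 0
print("ridge:", ridge_coefficients)  # every coefficient shrinks, none reaches 0
```

With these inputs the two smallest coefficients are dropped by the L1 rule, while the L2 rule keeps all four at two thirds of their original size.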
@@ -18,6 +18,11 @@
     "8. It's simple & out-performs many sophisticated methods.  \n",
     "9. Stable to data changes.  \n",
     "\n",
+    "## Three types of Naive Bayes\n",
+    "* Gaussian Naive Bayes - feature columns follow a normal distribution\n",
+    "* Multinomial Naive Bayes - feature columns are counts\n",
+    "* Bernoulli Naive Bayes - feature columns are boolean\n",
+    "\n",
     "\n",
     "## Bayes’s Theorem\n",
     "It describes the probability of an event, based on prior knowledge of conditions that might be related to the event.  \n",
@@ -258,6 +263,121 @@
     "nb_model = joblib.load(\"pima-trained-model.pkl\")"
    ]
   },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Comparison\n",
+    "### Bernoulli Naive Bayes:\n",
+    "* assumes features are binary (e.g. 0 or 1)\n",
+    "* 0: word does not occur in the document\n",
+    "* 1: word occurs in the document\n",
+    "\n",
+    "### Multinomial Naive Bayes:\n",
+    "* used for discrete data (e.g. rolling dice, movie rating from 1 to 10, etc.)\n",
+    "* In text learning, the count of each word is used to predict the class or label.\n",
+    "\n",
+    "### Gaussian Naive Bayes:\n",
+    "* used when all features are continuous and assumed to follow a normal distribution\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Bernoulli vs Multinomial\n",
+    "In the case of an email classifier\n",
+    "### Bernoulli:\n",
+    "* Assume spam mail has an email handle in the subject\n",
+    "* Build a feature that is 0 if it is not present and 1 if it is\n",
+    "* Bernoulli distribution (a binomial with a single trial)\n",
+    "\n",
+    "### Multinomial:\n",
+    "* In addition to the above condition, more dollar signs make spam more likely\n",
+    "* The same applies to words such as CASH or LOTTERY\n",
+    "* Represent these words by their counts\n",
+    "* Multinomial distribution"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 19579 entries, 0 to 19578\n",
+      "Data columns (total 3 columns):\n",
+      " #   Column  Non-Null Count  Dtype \n",
+      "---  ------  --------------  ----- \n",
+      " 0   id      19579 non-null  object\n",
+      " 1   text    19579 non-null  object\n",
+      " 2   author  19579 non-null  object\n",
+      "dtypes: object(3)\n",
+      "memory usage: 459.0+ KB\n",
+      "None\n",
+      "            id                                               text author\n",
+      "658    id10627  I did; but the fragile spirit clung to its ten...    EAP\n",
+      "4187   id00256  I have merely set down certain things appealin...    HPL\n",
+      "267    id08711  The remains of the half finished creature, who...    MWS\n",
+      "6672   id18249  \"No, Justine,\" said Elizabeth; \"he is more con...    MWS\n",
+      "12051  id20451  In the rash pursuit of this object, he rushes ...    EAP\n",
+      "EAP    7900\n",
+      "MWS    6044\n",
+      "HPL    5635\n",
+      "Name: author, dtype: int64\n",
+      "0.8265577119509704\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([[1623,  134,  242],\n",
+       "       [ 169, 1098,  121],\n",
+       "       [ 118,   65, 1325]], dtype=int64)"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "dataset = pd.read_csv(\"Data/Classification/horror-train.csv\")\n",
+    "print(dataset.info())\n",
+    "print(dataset.sample(5))\n",
+    "print(dataset.author.value_counts())\n",
+    "X = dataset.text\n",
+    "y = dataset.author\n",
+    "\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)\n",
+    "\n",
+    "from sklearn.feature_extraction.text import CountVectorizer\n",
+    "vectorizer = CountVectorizer(stop_words='english')\n",
+    "\n",
+    "X_train = vectorizer.fit_transform(X_train).toarray()\n",
+    "X_test = vectorizer.transform(X_test).toarray()\n",
+    "\n",
+    "from sklearn.naive_bayes import MultinomialNB\n",
+    "mnb = MultinomialNB()\n",
+    "\n",
+    "mnb.fit(X_train, y_train)\n",
+    "\n",
+    "print(mnb.score(X_test, y_test))\n",
+    "\n",
+    "y_pred = mnb.predict(X_test)\n",
+    "\n",
+    "from sklearn.metrics import confusion_matrix\n",
+    "confusion_matrix(y_test, y_pred)"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -282,7 +402,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,
 
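The Bernoulli vs Multinomial notes in this notebook come down to how one document is turned into a feature vector: presence flags for Bernoulli, occurrence counts for Multinomial. The following is a minimal sketch of that difference using only the standard library; the vocabulary, the sample email text, and the helper names are invented for illustration and are not part of the notebook.

```python
from collections import Counter


def count_features(text, vocabulary):
    """Multinomial-style features: how many times each vocabulary word occurs."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def binary_features(text, vocabulary):
    """Bernoulli-style features: 1 if the word occurs at all, otherwise 0."""
    return [1 if c > 0 else 0 for c in count_features(text, vocabulary)]


# Hypothetical vocabulary and email, for illustration only.
vocabulary = ["cash", "lottery", "meeting"]
email = "CASH cash lottery win CASH now"

multinomial_vector = count_features(email, vocabulary)
bernoulli_vector = binary_features(email, vocabulary)

print("counts:  ", multinomial_vector)
print("presence:", bernoulli_vector)
```

In scikit-learn the same distinction can be obtained with `CountVectorizer(binary=True)` for presence flags, or by relying on the `binarize` parameter of `BernoulliNB`.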
@@ -116,6 +116,10 @@
    "metadata": {},
    "source": [
     "## 5. Lasso Regression\n",
+    "- a type of linear regression that uses shrinkage\n",
+    "- in shrinkage, data values are shrunk towards a central point such as the mean\n",
+    "- encourages models with fewer parameters\n",
+    "- well-suited for models showing high levels of multicollinearity\n",
     "- Add penalty for large coefficients  \n",
     "- Penalty function – L1 norm of regression coefficients  \n",
     "- Penalty weighted by hyperparameter (alpha)  \n",
@@ -354,7 +358,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Lasso"
+    "## Lasso (Least Absolute Shrinkage and Selection Operator)"
    ]
   },
   {
@@ -658,7 +662,9 @@
   {
    "cell_type": "code",
    "execution_count": 35,
-   "metadata": {},
+   "metadata": {
+    "scrolled": false
+   },
    "outputs": [
     {
      "name": "stdout",
@@ -707,6 +713,17 @@
     "r_square"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Difference between Lasso and Ridge\n",
+    "* Ridge Regression is used when regressors are grouped (multicollinear)\n",
+    "* Ridge shrinks the coefficients of a correlated group proportionally, whereas Lasso doesn't\n",
+    "* Lasso sets individual regression coefficients to 0 to reduce model size\n",
+    "* Higher-dimensional problems tend to have multicollinearity, so Ridge often performs better\n"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {
 
@@ -18,6 +18,11 @@
     "8. It's simple & out-performs many sophisticated methods.  \n",
     "9. Stable to data changes.  \n",
     "\n",
+    "## Three types of Naive Bayes\n",
+    "* Gaussian Naive Bayes - feature columns follow a normal distribution\n",
+    "* Multinomial Naive Bayes - feature columns are counts\n",
+    "* Bernoulli Naive Bayes - feature columns are boolean\n",
+    "\n",
     "\n",
     "## Bayes’s Theorem\n",
     "It describes the probability of an event, based on prior knowledge of conditions that might be related to the event.  \n",
@@ -258,6 +263,121 @@
     "nb_model = joblib.load(\"pima-trained-model.pkl\")"
    ]
   },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Comparison\n",
+    "### Bernoulli Naive Bayes:\n",
+    "* assumes features are binary (e.g. 0 or 1)\n",
+    "* 0: word does not occur in the document\n",
+    "* 1: word occurs in the document\n",
+    "\n",
+    "### Multinomial Naive Bayes:\n",
+    "* used for discrete data (e.g. rolling dice, movie rating from 1 to 10, etc.)\n",
+    "* In text learning, the count of each word is used to predict the class or label.\n",
+    "\n",
+    "### Gaussian Naive Bayes:\n",
+    "* used when all features are continuous and assumed to follow a normal distribution\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Bernoulli vs Multinomial\n",
+    "In the case of an email classifier\n",
+    "### Bernoulli:\n",
+    "* Assume spam mail has an email handle in the subject\n",
+    "* Build a feature that is 0 if it is not present and 1 if it is\n",
+    "* Bernoulli distribution (a binomial with a single trial)\n",
+    "\n",
+    "### Multinomial:\n",
+    "* In addition to the above condition, more dollar signs make spam more likely\n",
+    "* The same applies to words such as CASH or LOTTERY\n",
+    "* Represent these words by their counts\n",
+    "* Multinomial distribution"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 19579 entries, 0 to 19578\n",
+      "Data columns (total 3 columns):\n",
+      " #   Column  Non-Null Count  Dtype \n",
+      "---  ------  --------------  ----- \n",
+      " 0   id      19579 non-null  object\n",
+      " 1   text    19579 non-null  object\n",
+      " 2   author  19579 non-null  object\n",
+      "dtypes: object(3)\n",
+      "memory usage: 459.0+ KB\n",
+      "None\n",
+      "            id                                               text author\n",
+      "658    id10627  I did; but the fragile spirit clung to its ten...    EAP\n",
+      "4187   id00256  I have merely set down certain things appealin...    HPL\n",
+      "267    id08711  The remains of the half finished creature, who...    MWS\n",
+      "6672   id18249  \"No, Justine,\" said Elizabeth; \"he is more con...    MWS\n",
+      "12051  id20451  In the rash pursuit of this object, he rushes ...    EAP\n",
+      "EAP    7900\n",
+      "MWS    6044\n",
+      "HPL    5635\n",
+      "Name: author, dtype: int64\n",
+      "0.8265577119509704\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([[1623,  134,  242],\n",
+       "       [ 169, 1098,  121],\n",
+       "       [ 118,   65, 1325]], dtype=int64)"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "dataset = pd.read_csv(\"Data/Classification/horror-train.csv\")\n",
+    "print(dataset.info())\n",
+    "print(dataset.sample(5))\n",
+    "print(dataset.author.value_counts())\n",
+    "X = dataset.text\n",
+    "y = dataset.author\n",
+    "\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)\n",
+    "\n",
+    "from sklearn.feature_extraction.text import CountVectorizer\n",
+    "vectorizer = CountVectorizer(stop_words='english')\n",
+    "\n",
+    "X_train = vectorizer.fit_transform(X_train).toarray()\n",
+    "X_test = vectorizer.transform(X_test).toarray()\n",
+    "\n",
+    "from sklearn.naive_bayes import MultinomialNB\n",
+    "mnb = MultinomialNB()\n",
+    "\n",
+    "mnb.fit(X_train, y_train)\n",
+    "\n",
+    "print(mnb.score(X_test, y_test))\n",
+    "\n",
+    "y_pred = mnb.predict(X_test)\n",
+    "\n",
+    "from sklearn.metrics import confusion_matrix\n",
+    "confusion_matrix(y_test, y_pred)"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
 
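A note on reading the confusion matrix in this notebook: scikit-learn's `confusion_matrix` takes its arguments in the order `(y_true, y_pred)`, and rows then correspond to true labels while columns correspond to predicted labels. The pure-Python sketch below mimics that convention on a tiny made-up label set (the helper name and the six sample labels are hypothetical) to show that swapping the arguments transposes the matrix while leaving the diagonal, and therefore the accuracy, unchanged.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows index the true label, columns index the predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


# Made-up labels for illustration; the author codes mirror the dataset's classes.
labels = ["EAP", "HPL", "MWS"]
y_true = ["EAP", "EAP", "HPL", "MWS", "MWS", "MWS"]
y_pred = ["EAP", "HPL", "HPL", "MWS", "EAP", "MWS"]

matrix = confusion_counts(y_true, y_pred, labels)
swapped = confusion_counts(y_pred, y_true, labels)

# Accuracy only depends on the diagonal, so it is the same either way.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)

print(matrix)
print(swapped)
print(accuracy)
```

Because the diagonal is unaffected, an argument swap does not show up in the score; it only changes which off-diagonal cell a given kind of mistake lands in.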
@@ -728,7 +728,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,
 
@@ -700,7 +700,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.5.5"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,