|
18 | 18 | "8. It's simple and is often competitive with more sophisticated methods. \n",
|
19 | 19 | "9. Stable to data changes. \n",
|
20 | 20 | "\n",
|
| 21 | + "## Three types of Naive Bayes\n", |
| 22 | + "* Gaussian Naive Bayes - Feature columns are continuous and assumed to be normally distributed\n", |
| 23 | + "* Multinomial Naive Bayes - Feature columns are counts\n", |
| 24 | + "* Bernoulli Naive Bayes - Feature columns are boolean\n", |
| 25 | + "\n", |
21 | 26 | "\n",
|
22 | 27 | "## Bayes’s Theorem\n",
|
23 | 28 | "It describes the probability of an event, based on prior knowledge of conditions that might be related to the event. \n",
|
|
258 | 263 | "nb_model = joblib.load(\"pima-trained-model.pkl\")"
|
259 | 264 | ]
|
260 | 265 | },
|
| 266 | + { |
| 267 | + "attachments": {}, |
| 268 | + "cell_type": "markdown", |
| 269 | + "metadata": {}, |
| 270 | + "source": [ |
| 271 | + "# Comparison\n", |
| 272 | + "### Bernoulli Naive Bayes :\n", |
| 273 | + "* assumes features are binary (e.g., 0 or 1)\n", |
| 274 | + "* 0: word does not occur in the document\n", |
| 275 | + "* 1: word occurs in the document\n", |
| 276 | + "\n", |
| 277 | + "### Multinomial Naive Bayes :\n", |
| 278 | + "* used for discrete data (e.g., rolling dice, movie ratings from 1 to 10)\n", |
| 279 | + "* In text classification, the count of each word is used to predict the class label.\n", |
| 280 | + "\n", |
| 281 | + "### Gaussian Naive Bayes :\n", |
| 282 | + "* used when features are continuous and assumed to follow a normal (Gaussian) distribution within each class\n" |
| 283 | + ] |
| 284 | + }, |
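To make the comparison above concrete, here is a minimal sketch of the per-feature likelihood each variant assumes. It uses only the standard library; the function names and the example probabilities are illustrative assumptions, not taken from the notebook or from scikit-learn.

```python
import math


def bernoulli_likelihood(x, p):
    """P(x | class) for a binary feature x in {0, 1}, where p = P(x = 1 | class)."""
    return p if x == 1 else 1.0 - p


def multinomial_likelihood(counts, probs):
    """Unnormalised P(counts | class): product of p_i ** count_i over the vocabulary.

    The multinomial coefficient is left out because it is identical for
    every class and so does not change which class scores highest.
    """
    return math.prod(p ** c for p, c in zip(probs, counts))


def gaussian_likelihood(x, mean, var):
    """Normal density of a continuous feature x given the class mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Hypothetical numbers, chosen only for illustration.
print(bernoulli_likelihood(1, 0.8))                        # word present
print(multinomial_likelihood([2, 0, 1], [0.5, 0.3, 0.2]))  # word counts
print(gaussian_likelihood(0.0, 0.0, 1.0))                  # continuous value
```

In each case Naive Bayes multiplies these per-feature likelihoods together with the class prior; only the assumed distribution of a single feature differs between the three variants.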
| 285 | + { |
| 286 | + "attachments": {}, |
| 287 | + "cell_type": "markdown", |
| 288 | + "metadata": {}, |
| 289 | + "source": [ |
| 290 | + "# Bernoulli vs Multinomial\n", |
| 291 | + "Consider an email spam classifier:\n", |
| 292 | + "### Bernoulli :\n", |
| 293 | + "* Assume spam mails tend to have an email handle in the subject\n", |
| 294 | + "* Build a feature that is 0 if the handle is not present and 1 if it is \n", |
| 295 | + "* Bernoulli distribution\n", |
| 296 | + "\n", |
| 297 | + "### Multinomial: \n", |
| 298 | + "* In addition to the condition above, more dollar signs make spam more likely\n", |
| 299 | + "* The same holds for words such as CASH or LOTTERY\n", |
| 300 | + "* Represent these words by their counts\n", |
| 301 | + "* Multinomial distribution" |
| 302 | + ] |
| 303 | + }, |
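The difference between the two encodings described above can be sketched with a small standard-library example. The vocabulary, the tokenisation rule, and the sample email are all hypothetical and chosen only to mirror the dollar sign, CASH, and LOTTERY cues mentioned in the cell.

```python
from collections import Counter

# Hypothetical vocabulary of spam cues (an assumption for illustration).
VOCAB = ["$", "cash", "lottery"]


def count_features(text):
    """Multinomial-style features: how many times each cue occurs."""
    # Pad dollar signs with spaces so that "$$$" becomes three separate tokens.
    tokens = text.lower().replace("$", " $ ").split()
    counts = Counter(tokens)
    return [counts[term] for term in VOCAB]


def binary_features(text):
    """Bernoulli-style features: 1 if the cue occurs at all, else 0."""
    return [1 if c > 0 else 0 for c in count_features(text)]


email = "Win CASH now $$$ cash lottery"
print(count_features(email))   # counts per cue
print(binary_features(email))  # presence per cue
```

The count vector keeps the information that the email contains three dollar signs, while the binary vector only records that at least one is present. That is the extra signal the multinomial model can use.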
| 304 | + { |
| 305 | + "cell_type": "code", |
| 306 | + "execution_count": 1, |
| 307 | + "metadata": {}, |
| 308 | + "outputs": [ |
| 309 | + { |
| 310 | + "name": "stdout", |
| 311 | + "output_type": "stream", |
| 312 | + "text": [ |
| 313 | + "<class 'pandas.core.frame.DataFrame'>\n", |
| 314 | + "RangeIndex: 19579 entries, 0 to 19578\n", |
| 315 | + "Data columns (total 3 columns):\n", |
| 316 | + " # Column Non-Null Count Dtype \n", |
| 317 | + "--- ------ -------------- ----- \n", |
| 318 | + " 0 id 19579 non-null object\n", |
| 319 | + " 1 text 19579 non-null object\n", |
| 320 | + " 2 author 19579 non-null object\n", |
| 321 | + "dtypes: object(3)\n", |
| 322 | + "memory usage: 459.0+ KB\n", |
| 323 | + "None\n", |
| 324 | + " id text author\n", |
| 325 | + "658 id10627 I did; but the fragile spirit clung to its ten... EAP\n", |
| 326 | + "4187 id00256 I have merely set down certain things appealin... HPL\n", |
| 327 | + "267 id08711 The remains of the half finished creature, who... MWS\n", |
| 328 | + "6672 id18249 \"No, Justine,\" said Elizabeth; \"he is more con... MWS\n", |
| 329 | + "12051 id20451 In the rash pursuit of this object, he rushes ... EAP\n", |
| 330 | + "EAP 7900\n", |
| 331 | + "MWS 6044\n", |
| 332 | + "HPL 5635\n", |
| 333 | + "Name: author, dtype: int64\n", |
| 334 | + "0.8265577119509704\n" |
| 335 | + ] |
| 336 | + }, |
| 337 | + { |
| 338 | + "data": { |
| 339 | + "text/plain": [ |
| 340 | + "array([[1623, 169, 118],\n", |
| 341 | + " [ 134, 1098, 65],\n", |
| 342 | + " [ 242, 121, 1325]], dtype=int64)" |
| 343 | + ] |
| 344 | + }, |
| 345 | + "execution_count": 1, |
| 346 | + "metadata": {}, |
| 347 | + "output_type": "execute_result" |
| 348 | + } |
| 349 | + ], |
| 350 | + "source": [ |
| 351 | + "import pandas as pd\n", |
| 352 | + "dataset = pd.read_csv(\"Data/Classification/horror-train.csv\")\n", |
| 353 | + "print(dataset.info())\n", |
| 354 | + "print(dataset.sample(5))\n", |
| 355 | + "print(dataset.author.value_counts())\n", |
| 356 | + "X = dataset.text\n", |
| 357 | + "y = dataset.author\n", |
| 358 | + "\n", |
| 359 | + "from sklearn.model_selection import train_test_split\n", |
| 360 | + "X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.25, random_state = 0)\n", |
| 361 | + "\n", |
| 362 | + "from sklearn.feature_extraction.text import CountVectorizer\n", |
| 363 | + "vectorizer = CountVectorizer(stop_words='english')\n", |
| 364 | + "\n", |
| 365 | + "X_train = vectorizer.fit_transform(X_train)  # sparse word counts; MultinomialNB accepts sparse input\n", |
| 366 | + "X_test = vectorizer.transform(X_test)\n", |
| 367 | + "\n", |
| 368 | + "from sklearn.naive_bayes import MultinomialNB\n", |
| 369 | + "mnb = MultinomialNB()\n", |
| 370 | + "\n", |
| 371 | + "mnb.fit(X_train, y_train)\n", |
| 372 | + "\n", |
| 373 | + "print(mnb.score(X_test, y_test))\n", |
| 374 | + "\n", |
| 375 | + "y_pred = mnb.predict(X_test)\n", |
| 376 | + "\n", |
| 377 | + "from sklearn.metrics import confusion_matrix\n", |
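| 378 | + "confusion_matrix(y_pred, y_test)  # arguments are swapped relative to sklearn's (y_true, y_pred), so rows = predicted, columns = true" |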
| 379 | + ] |
| 380 | + }, |
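As a sanity check on the output of the cell above, the printed accuracy can be recomputed from the confusion matrix. This is a sketch under two assumptions: that the label order is EAP, HPL, MWS (scikit-learn sorts labels by default), and that rows are predictions and columns are true labels, because the call was `confusion_matrix(y_pred, y_test)`.

```python
# Confusion matrix copied from the cell output.
# Assumed orientation: rows = predicted label, columns = true label.
labels = ["EAP", "HPL", "MWS"]  # assumed (sorted) label order
cm = [
    [1623, 169, 118],
    [134, 1098, 65],
    [242, 121, 1325],
]

n = len(labels)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
accuracy = correct / total
print(f"accuracy = {correct}/{total} = {accuracy:.4f}")

for k, label in enumerate(labels):
    predicted_as_k = sum(cm[k])                       # row sum
    truly_k = sum(cm[i][k] for i in range(n))         # column sum
    precision = cm[k][k] / predicted_as_k
    recall = cm[k][k] / truly_k
    print(f"{label}: precision = {precision:.3f}, recall = {recall:.3f}")
```

The diagonal sums to 4046 out of 4895 test samples, which reproduces the 0.8266 printed by `mnb.score(X_test, y_test)`. Accuracy does not depend on the matrix orientation, but the precision and recall figures would swap if the orientation assumption were wrong.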
261 | 381 | {
|
262 | 382 | "cell_type": "code",
|
263 | 383 | "execution_count": null,
|
|
282 | 402 | "name": "python",
|
283 | 403 | "nbconvert_exporter": "python",
|
284 | 404 | "pygments_lexer": "ipython3",
|
285 |
| - "version": "3.6.6" |
| 405 | + "version": "3.8.5" |
286 | 406 | }
|
287 | 407 | },
|
288 | 408 | "nbformat": 4,
|
|