
Commit 9276182

committed
.
1 parent c8844f7 commit 9276182

File tree

2 files changed: +380 -0 lines changed

@@ -0,0 +1,190 @@
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hashing and Spam\n",
    "#### Article classification (e.g. spam classification) using the bag-of-words (BOW) representation\n",
    "\n",
    "Suppose we have two sentences, one spam and the other not spam: \n",
    "- i earn 20 lakh rupees per month just chitchating on the net! (spam) \n",
    "- are you free for a meeting anytime tomorrow? (not spam) \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 0 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 0]\n",
      " [1 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1]]\n",
      "{'earn': 3, 'twenty': 16, 'lakh': 7, 'rupees': 13, 'per': 12, 'month': 9, 'just': 6, 'chitchating': 2, 'on': 11, 'the': 14, 'net': 10, 'are': 1, 'you': 17, 'free': 5, 'for': 4, 'meeting': 8, 'anytime': 0, 'tomorrow': 15}\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "corpus = [\n",
    "    'i earn twenty lakh rupees per month just chitchating on the net!',\n",
    "    'are you free for a meeting anytime tomorrow?',\n",
    "]\n",
    "df = pd.DataFrame({'Text': corpus})\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "count_v = CountVectorizer()\n",
    "X = count_v.fit_transform(df.Text).toarray()\n",
    "print(X)\n",
    "print(count_v.vocabulary_)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1]]\n"
     ]
    }
   ],
   "source": [
    "# Now we get a new sentence; build its vector with the pre-built vocabulary\n",
    "new_txt = ['io etrn are you free ruppee for a monnth meeting chitcchting anytime tomorrow neet']\n",
    "df_new = pd.DataFrame({'new_txt': new_txt})\n",
    "y = count_v.transform(df_new.new_txt).toarray()\n",
    "print(y)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Vocabulary: \n",
    "{'anytime': 0, 'are': 1, 'chitchating': 2, 'earn': 3, 'for': 4, 'free': 5, 'just': 6, 'lakh': 7, 'meeting': 8, 'month': 9, 'net': 10, 'on': 11, 'per': 12, 'rupees': 13, 'the': 14, 'tomorrow': 15, 'twenty': 16, 'you': 17} \n",
    "\n",
    "- i earn 20 lakh rupees per month just chitchating on the net! \n",
    "vector1: [0 0 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 0] \n",
    "\n",
    "- are you free for a meeting anytime tomorrow? \n",
    "vector2: [1 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1] \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "New sentence: \n",
    "- io etrn are you free ruppee for a monnth meeting chitcchting anytime tomorrow neet \n",
    "With the existing CountVectorizer vocabulary: \n",
    "vector3: [1 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1] \n"
   ]
  },
91+
{
92+
"cell_type": "markdown",
93+
"metadata": {},
94+
"source": [
95+
"- So, above vector3 is same as vector2, because new words can't be taken into account as this is CountVectorizer. So machine learning will classify vector3 as not spam same as vector2 which is not correct. \n",
96+
"- Even if we increase the vector length, whole machine learning model need to be trained again. \n",
97+
"- Solution is Hashing\n"
98+
]
99+
},
100+
{
101+
"attachments": {},
102+
"cell_type": "markdown",
103+
"metadata": {},
104+
"source": [
105+
"## Hashing\n",
106+
"- Apply Hash Function\n",
107+
"- “Rishi Bansal” --> 23\n",
108+
"- “Rashi Bansal” --> 72\n",
109+
"- Output number depend on Hash function\n"
110+
]
111+
},
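  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a minimal, illustrative sketch of hashing strings into a fixed range of buckets. The bucket values 23 and 72 above are only examples; the actual numbers depend on the hash function and range you pick (hashlib.md5 and 100 buckets here are arbitrary choices).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: map any string to a bucket in a fixed range.\n",
    "# hashlib is used because Python's built-in hash() is randomized per process.\n",
    "import hashlib\n",
    "\n",
    "def bucket(text, n_buckets=100):\n",
    "    digest = hashlib.md5(text.encode('utf-8')).hexdigest()\n",
    "    return int(digest, 16) % n_buckets\n",
    "\n",
    "for name in ['Rishi Bansal', 'Rashi Bansal']:\n",
    "    print(name, '->', bucket(name))  # same string always maps to the same bucket\n"
   ]
  },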
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Features:**\n",
    "- Same value for the same string\n",
    "- Collision: possibility of the same value for different strings \n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a huge count vector of length 2*2^20 and choose a hash function that generates numbers between 0 and 2*2^20. \n",
    "\n",
    "'i earn 20 lakh rupees per month just chitchating on the net!' \n",
    "[0 0 1 1 … 0 0 1 1 0 … 1 1 1 1 … 1 1 0 1 0] – length 2*2^20 \n",
    "\n",
    "'are you free for a meeting anytime tomorrow?' \n",
    "[1 1 … 0 0 1 1 0 0 … 1 0 … 0 0 0 0 … 0 1 0 1] – length 2*2^20 \n",
    "\n",
    "'io etrn are you free ruppee for a monnth meeting chitcchting anytime tomorrow neet' \n",
    "[0 0 … 1 0 1 1 0 0 … 1 0 … 1 0 1 0 … 0 0 0 1] – length 2*2^20 \n"
   ]
  },
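  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of this idea with scikit-learn's HashingVectorizer. The settings below (n_features=2**20, alternate_sign=False, norm=None) are illustrative choices that give plain counts; since the vectorizer keeps no vocabulary, the misspelled new sentence still gets its own non-zero buckets without any refitting.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: hash tokens directly into a fixed-size vector,\n",
    "# so unseen or misspelled words still land in a bucket without refitting anything.\n",
    "from sklearn.feature_extraction.text import HashingVectorizer\n",
    "\n",
    "hash_v = HashingVectorizer(n_features=2**20, alternate_sign=False, norm=None)\n",
    "docs = [\n",
    "    'i earn twenty lakh rupees per month just chitchating on the net!',\n",
    "    'are you free for a meeting anytime tomorrow?',\n",
    "    'io etrn are you free ruppee for a monnth meeting chitcchting anytime tomorrow neet',\n",
    "]\n",
    "H = hash_v.transform(docs)     # stateless: no fit step, no vocabulary_ to store\n",
    "print(H.shape)                 # (3, 1048576)\n",
    "print([row.nnz for row in H])  # non-zero buckets per sentence\n"
   ]
  },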
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When to Use HashingVectorizer?"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### HashingVectorizer: \n",
    "1. The dataset is large and you have no use for the resulting dictionary of tokens.\n",
    "2. You have maxed out your computing resources and it's time to optimize.\n",
    "\n",
    "#### CountVectorizer: \n",
    "1. You need access to the actual tokens.\n",
    "2. You are worried about hash collisions (when the matrix size is small). \n"
   ]
  },
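  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small hypothetical illustration of the collision trade-off: with very few hash buckets, distinct words are likely to share a slot, which is exactly when CountVectorizer's explicit vocabulary is the safer choice. The tiny n_features=8 below is deliberately unrealistic.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical illustration: squeeze several distinct words into only 8 buckets,\n",
    "# so some bucket is likely to receive more than one word (a collision).\n",
    "from sklearn.feature_extraction.text import HashingVectorizer\n",
    "\n",
    "tiny = HashingVectorizer(n_features=8, alternate_sign=False, norm=None)\n",
    "v = tiny.transform(['earn lakh rupees month meeting anytime tomorrow'])\n",
    "print(v.toarray())  # a bucket with a count greater than 1 means two words collided\n"
   ]
  },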
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

07. Hashing and Spam.ipynb: +190 lines
