Commit 95edc5c

Initial commit - new notebook
1 parent ee26d46 commit 95edc5c

1 file changed: +281 -0 lines changed

1 file changed

+281
-0
lines changed

Diff for: 20181127-top-hashtags-json.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Computing the top hashtags (JSON)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So you have tweets in a JSON file, and you'd like to get a list of the hashtags, from the most frequently occurring hashtags on down.\n",
    "\n",
    "There are *many, many* different ways to accomplish this. Since we're working with the tweets in JSON format, this solution will use `jq`, as well as a few bash shell / command line tools: `cat`, `sort`, `uniq`, and `wc`. If you haven't used `jq` yet, our [Working with Twitter Using jq](https://github.com/gwu-libraries/notebooks/blob/master/20160407-twitter-analysis-with-jq/Working-with-twitter-using-jq.ipynb) notebook is a good place to start.\n",
    "\n",
    "### Where are the hashtags in tweet JSON?\n",
    "\n",
    "When we look at a tweet, we see that it has a key called `entities`, and that the value of `entities` contains a key called `hashtags`. The value of `hashtags` is a list (note the square brackets); each item in the list contains the text of a single hashtag, and the indices of the characters in the tweet text where the hashtag begins and ends. \n",
    "```\n",
    "{\n",
    "  created_at: \"Tue Oct 30 09:15:45 +0000 2018\",\n",
    "  id: 1057199367411679200,\n",
    "  id_str: \"1057199367411679234\",\n",
    "  text: \"Lesson from Indra's elephant https://t.co/h5K3y5g4Ju #India #Hinduism #Buddhism #History #Culture https://t.co/qFyipqzPnE\",\n",
    "\n",
    "  ...\n",
    "\n",
    "  entities: {\n",
    "    hashtags: [\n",
    "      {\n",
    "        text: \"India\",\n",
    "        indices: [\n",
    "          54,\n",
    "          60\n",
    "        ]\n",
    "      },\n",
    "      {\n",
    "        text: \"Hinduism\",\n",
    "        indices: [\n",
    "          61,\n",
    "          70\n",
    "        ]\n",
    "      },\n",
    "      {\n",
    "        text: \"Buddhism\",\n",
    "        indices: [\n",
    "          71,\n",
    "          80\n",
    "        ]\n",
    "      },\n",
    "      {\n",
    "        text: \"History\",\n",
    "        indices: [\n",
    "          81,\n",
    "          89\n",
    "        ]\n",
    "      },\n",
    "      {\n",
    "        text: \"Culture\",\n",
    "        indices: [\n",
    "          90,\n",
    "          98\n",
    "        ]\n",
    "      }\n",
    "    ],\n",
    "  ...\n",
    "```\n",
    "\n",
    "When we use `jq`, we'll need to construct a filter that pulls out the hashtag text values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "김유겸\r\n",
      "유겸\r\n",
      "Yugyeom\r\n",
      "GOT7\r\n",
      "갓세븐\r\n",
      "PresentYou\r\n",
      "LifeSite\r\n",
      "あなたの名前から想像される色\r\n",
      "صباح_الخير\r\n",
      "music\r\n",
      "network\r\n",
      "ShootOut1stWin\r\n",
      "acabateloparaustedes\r\n"
     ]
    }
   ],
   "source": [
    "!cat 50tweets.json | jq -cr '[.entities.hashtags][0][].text'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat tweets4hashtags.json | jq -cr '[.entities.hashtags][0][].text' > allhashtags.txt"
   ]
  },
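  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you'd rather do the extraction in Python instead of `jq`, a rough sketch using the standard library's `json` module might look like the cell below. It's only an illustrative alternative, not part of the original workflow, and it assumes `tweets4hashtags.json` is newline-delimited JSON (one tweet object per line), which is also how the `jq` pipeline above reads it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch (assumes newline-delimited tweet JSON, one object per line)\n",
    "import json\n",
    "\n",
    "hashtags = []\n",
    "with open('tweets4hashtags.json') as f:\n",
    "    for line in f:\n",
    "        line = line.strip()\n",
    "        if not line:\n",
    "            continue\n",
    "        tweet = json.loads(line)\n",
    "        # entities.hashtags is a list of objects; keep just each hashtag's text\n",
    "        for tag in tweet.get('entities', {}).get('hashtags', []):\n",
    "            hashtags.append(tag['text'])\n",
    "\n",
    "print(len(hashtags))"
   ]
  },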
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how many hashtags we extracted:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 878806 allhashtags.txt\r\n"
     ]
    }
   ],
   "source": [
    "!wc -l allhashtags.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What we'd like to do now is to count up how many of each hashtag we have. We'll use a combination of bash's `sort` and `uniq` commands for that. We'll also use the `-c` option for `uniq`, which prefaces each line with the count of lines it collapsed together in the process of `uniq`ing a group of identical lines. `sort`'s `-nr` options will then let us sort those counts numerically, in reverse order, so the most frequent hashtags come first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat allhashtags.txt | sort | uniq -c | sort -nr > rankedhashtags.txt"
   ]
  },
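  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, here's a rough Python equivalent of the `sort | uniq -c | sort -nr` pipeline using `collections.Counter`. It's just a sketch: it assumes the hashtag texts have already been written to `allhashtags.txt`, one per line, as in the cell above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: count and rank hashtags with Counter instead of sort/uniq\n",
    "from collections import Counter\n",
    "\n",
    "with open('allhashtags.txt') as f:\n",
    "    counts = Counter(line.strip() for line in f if line.strip())\n",
    "\n",
    "# Ten most frequent hashtags, then the number of distinct hashtags\n",
    "print(counts.most_common(10))\n",
    "print(len(counts))"
   ]
  },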
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at what we have now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8170 EXO\r\n",
      "4123 BTS\r\n",
      "4061 TEMPO_SEHUN\r\n",
      "3365 GOT7\r\n",
      "3145 ローソン\r\n",
      "2924 SEHUN\r\n",
      "2773 EXO_DontMessUpMyTempo\r\n",
      "2743 weareoneEXO\r\n",
      "2705 Halloween\r\n",
      "2661 갓세븐\r\n",
      "2647 EXO_TEMPO\r\n",
      "2645 워너블다음은없어\r\n",
      "2403 몬스타엑스\r\n",
      "2355 MONSTA_X\r\n",
      "2339 엑소\r\n",
      "2279 ごちろう\r\n",
      "2267 지니인기상_달려라상탄\r\n",
      "2165 방탄소년단\r\n",
      "2161 塩にぎり無料プレゼント\r\n",
      "2042 ShootOut1stWin\r\n",
      "1951 ハロウィン\r\n",
      "1933 아이즈원\r\n",
      "1874 เป๊กผลิตโชค\r\n",
      "1873 IZONE\r\n",
      "1753 フードファンタジー\r\n",
      "1675 フーファン\r\n",
      "1668 食霊のティアラ\r\n",
      "1635 Ask_EXO\r\n",
      "1551 어디에도_없을_완벽한_EXO\r\n",
      "1494 AppleEvent\r\n",
      "1465 도경수\r\n",
      "1444 ShootOut\r\n",
      "1403 WasteItOnMe\r\n",
      "1370 TWICE\r\n",
      "1367 NCT\r\n",
      "1271 SomosLaAudiencia30\r\n",
      "1267 NewProfilePic\r\n",
      "1253 백일의낭군님\r\n",
      "1223 ﷺ\r\n",
      "1183 BAEKHYUN\r\n",
      "1174 더쇼\r\n",
      "1163 재민\r\n",
      "1123 MONSTAX\r\n",
      "1120 트와이스\r\n",
      "1068 ジェジュン\r\n",
      "1063 ALDUBStillReigns\r\n",
      "1054 JIMIN\r\n",
      "1026 RMonoBB200\r\n",
      "1014 RT\r\n",
      "1002 EXO_QuintupleMillionSeller\r\n"
     ]
    }
   ],
   "source": [
    "!head -n 50 rankedhashtags.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Personally, I have no idea what most of these hashtags are about, but this is apparently what people were tweeting about on October 31, 2018.\n",
    "\n",
    "And as for how many unique hashtags are in this set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 211986 rankedhashtags.txt\r\n"
     ]
    }
   ],
   "source": [
    "!wc -l rankedhashtags.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, there are many different ways to approach this! Let us know your thoughts and ideas."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
