Commit 600f67a

Initial commit
1 parent d207be5 commit 600f67a

2 files changed

+1312
-0
lines changed

@@ -0,0 +1,311 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Deriving a random subset of tweets (JSON or CSV)"
8+
]
9+
},
10+
{
11+
"cell_type": "markdown",
12+
"metadata": {},
13+
"source": [
14+
"Let's say you're working with a file containing tweets, and you'd like to derive a random sample of tweets from that file. For example, your set might be in chronological order, and you want just a random sample of 100 tweets from a mix of dates."
15+
]
16+
},
17+
{
18+
"cell_type": "markdown",
19+
"metadata": {},
20+
"source": [
21+
"Note that the technique we describe here works for any type of data file with one observation per line, whether it's tweets or some other type of data, and whether the observations are in CSV format, JSON format, or some other text-based format with one observation per line."
22+
]
23+
},
24+
{
25+
"cell_type": "markdown",
26+
"metadata": {},
27+
"source": [
28+
"Let's take a look at our sample data file:"
29+
]
30+
},
31+
{
32+
"cell_type": "code",
33+
"execution_count": 26,
34+
"metadata": {
35+
"collapsed": false
36+
},
37+
"outputs": [
38+
{
39+
"name": "stdout",
40+
"output_type": "stream",
41+
"text": [
42+
"created_at,twitter_id,screen_name,followers_count,friends_count,favorite_count/like_count,retweet_count,hashtags,mentions,in_reply_to_screen_name,twitter_url,text,is_retweet,is_quote,coordinates,url1,url1_expanded,url2,url2_expanded,media_url\r",
43+
"\r\n",
44+
"2016-09-26 17:52:09+00:00,780464994479542272,tlajj5,21,18,0,0,\"TrumpTrain, MAGA\",realDonaldTrump,,http://twitter.com/tlajj5/status/780464994479542272,RT @realDonaldTrump: New national Bloomberg poll just released - thank you! Join the MOVEMENT: https://t.co/3KWOl2ibaW. #TrumpTrain #MAGA…,Yes,No,,https://t.co/3KWOl2ibaW,http://www.DonaldJTrump.com,,,\r",
45+
"\r\n",
46+
"2016-09-26 17:55:09+00:00,780465749794123777,Kosky98,87,82,0,0,Debates2016,HebertEtHalfred,,http://twitter.com/Kosky98/status/780465749794123777,RT @HebertEtHalfred: Quoi ? #Debates2016 ou quoi ?? https://t.co/NUYvbc0kYc,Yes,No,,,,,,http://pbs.twimg.com/media/CtTFQ2rWcAAKLb7.jpg\r",
47+
"\r\n",
48+
"2016-09-26 17:56:34+00:00,780466104720236545,DeplorableTink,84,66,0,0,\"sundaythoughts, ImWithHer, Debates2016, CrookedHillary\",taydark_77,,http://twitter.com/DeplorableTink/status/780466104720236545,RT @taydark_77: #sundaythoughts We all know who the real deplorable's are. Every #ImWithHer crew. #Debates2016 #CrookedHillary https://t.co…,Yes,No,,,,,,\r",
49+
"\r\n",
50+
"2016-09-26 18:03:19+00:00,780467803266682880,EmiliaSaez10,10,734,0,0,\"HILLARYPRESIDENT, HillaryClinton, presidentialelection, PresidentialDebate, debatenight\",\"VigilanteArtist, HillaryClinton\",,http://twitter.com/EmiliaSaez10/status/780467803266682880,RT @VigilanteArtist: RT #HILLARYPRESIDENT @HillaryClinton #HillaryClinton #presidentialelection #PresidentialDebate #debatenight #Debates…,Yes,No,,,,,,\r",
51+
"\r\n"
52+
]
53+
}
54+
],
55+
"source": [
56+
"!head -n 5 debatetweets.csv"
57+
]
58+
},
59+
{
60+
"cell_type": "markdown",
61+
"metadata": {},
62+
"source": [
63+
"It appears to have a header row, which we'll deal with shortly, followed by the data. It looks like the tweets were written to the file in some sort of chronological order. How long is it?"
64+
]
65+
},
66+
{
67+
"cell_type": "code",
68+
"execution_count": 27,
69+
"metadata": {
70+
"collapsed": false
71+
},
72+
"outputs": [
73+
{
74+
"name": "stdout",
75+
"output_type": "stream",
76+
"text": [
77+
" 1001 debatetweets.csv\r\n"
78+
]
79+
}
80+
],
81+
"source": [
82+
"!wc -l debatetweets.csv"
83+
]
84+
},
85+
{
86+
"cell_type": "markdown",
87+
"metadata": {},
88+
"source": [
89+
"It's 1,001 rows, so that's 1 header row followed by 1,000 observations."
90+
]
91+
},
92+
{
93+
"cell_type": "markdown",
94+
"metadata": {},
95+
"source": [
96+
"The critical tool for accomplishing this cleverly is the `shuf`/`gshuf` shell command. This command, part of the GNU Coreutils library, shows up as `shuf` in a bash shell, or as `gshuf` in Mac OSX (installed via `brew install coreutils`) and as `gshuf` in the IPython shell like we have here. Let's see what it can do:"
97+
]
98+
},
99+
{
100+
"cell_type": "code",
101+
"execution_count": 28,
102+
"metadata": {
103+
"collapsed": false
104+
},
105+
"outputs": [
106+
{
107+
"name": "stdout",
108+
"output_type": "stream",
109+
"text": [
110+
"Usage: gshuf [OPTION]... [FILE]\r\n",
111+
" or: gshuf -e [OPTION]... [ARG]...\r\n",
112+
" or: gshuf -i LO-HI [OPTION]...\r\n",
113+
"Write a random permutation of the input lines to standard output.\r\n",
114+
"\r\n",
115+
"With no FILE, or when FILE is -, read standard input.\r\n",
116+
"\r\n",
117+
"Mandatory arguments to long options are mandatory for short options too.\r\n",
118+
" -e, --echo treat each ARG as an input line\r\n",
119+
" -i, --input-range=LO-HI treat each number LO through HI as an input line\r\n",
120+
" -n, --head-count=COUNT output at most COUNT lines\r\n",
121+
" -o, --output=FILE write result to FILE instead of standard output\r\n",
122+
" --random-source=FILE get random bytes from FILE\r\n",
123+
" -r, --repeat output lines can be repeated\r\n",
124+
" -z, --zero-terminated line delimiter is NUL, not newline\r\n",
125+
" --help display this help and exit\r\n",
126+
" --version output version information and exit\r\n",
127+
"\r\n",
128+
"GNU coreutils online help: <http://www.gnu.org/software/coreutils/>\r\n",
129+
"Full documentation at: <http://www.gnu.org/software/coreutils/shuf>\r\n",
130+
"or available locally via: info '(coreutils) shuf invocation'\r\n"
131+
]
132+
}
133+
],
134+
"source": [
135+
"!gshuf --help"
136+
]
137+
},
138+
{
139+
"cell_type": "markdown",
140+
"metadata": {},
141+
"source": [
142+
"It looks like this is something we can use! First we need to peel off the header row; we'll put it back later:"
143+
]
144+
},
145+
{
146+
"cell_type": "code",
147+
"execution_count": 29,
148+
"metadata": {
149+
"collapsed": false
150+
},
151+
"outputs": [],
152+
"source": [
153+
"!head -n 1 debatetweets.csv > header.csv"
154+
]
155+
},
156+
{
157+
"cell_type": "markdown",
158+
"metadata": {},
159+
"source": [
160+
"Now for the main event. We'll pipe ( `|` ) everything *except* the first line of debatetweets.csv to `gshuf`, and we'll take advantage of the `-n` option to request only 100 lines:"
161+
]
162+
},
163+
{
164+
"cell_type": "code",
165+
"execution_count": 30,
166+
"metadata": {
167+
"collapsed": true
168+
},
169+
"outputs": [],
170+
"source": [
171+
"!tail -n +2 debatetweets.csv | gshuf -n 100 > only100tweets.csv"
172+
]
173+
},
174+
{
175+
"cell_type": "markdown",
176+
"metadata": {},
177+
"source": [
178+
"Now let's quickly size up what we got out:"
179+
]
180+
},
181+
{
182+
"cell_type": "code",
183+
"execution_count": 31,
184+
"metadata": {
185+
"collapsed": false
186+
},
187+
"outputs": [
188+
{
189+
"name": "stdout",
190+
"output_type": "stream",
191+
"text": [
192+
"2016-09-27 08:58:10+00:00,780692999651074048,beingmissdaisy,1278,1007,0,0,debatenight,Elijahkyama,,http://twitter.com/beingmissdaisy/status/780692999651074048,RT @Elijahkyama: The person who did this will be haunted for nothing https://t.co/69C2KHMAk3 #debatenight,Yes,No,,,,,,http://pbs.twimg.com/media/CtUjRLIVIAAK5VT.jpg\r",
193+
"\r\n",
194+
"2016-09-27 06:59:53+00:00,780663232537198592,PlatoSays,2141,993,0,0,\"debatenight, Debates2016\",Agent4Trump,,http://twitter.com/PlatoSays/status/780663232537198592,RT @Agent4Trump: Stop Lying Hillary Fact Check Trolls.... ICE union endorses Trump https://t.co/EaVN7m9rxN #debatenight #Debates2016,Yes,No,,https://t.co/EaVN7m9rxN,http://politi.co/2cWt8Wp,,,\r",
195+
"\r\n",
196+
"2016-09-27 09:43:24+00:00,780704382048428032,Alex70CDA,456,375,0,0,\"Narcos, Debate\",la_maquina,,http://twitter.com/Alex70CDA/status/780704382048428032,RT @la_maquina: That one time a rich ranting lunatic thought he would be President. #Narcos / #Debate https://t.co/Xp1wCcqYFA,Yes,No,,,,,,http://pbs.twimg.com/media/CtU6PrPVMAAIWU8.jpg\r",
197+
"\r\n",
198+
"2016-09-27 00:09:23+00:00,780559928876466178,proudliberalmom,5290,5762,0,0,debatenight,GovGaryJohnson,,http://twitter.com/proudliberalmom/status/780559928876466178,.@GovGaryJohnson u shouldn't b on debate stage. No US Prez candidate should do this.Fuck 3rd party vote #debatenight https://t.co/CW7TaGMDIi,No,No,,https://t.co/CW7TaGMDIi,https://youtu.be/NXhR41lsEJY,,,\r",
199+
"\r\n",
200+
"2016-09-27 02:33:48+00:00,780596271622926336,sophiecredo,199,312,0,0,\"debatenight, debates, Debates2016\",h3h3productions,,http://twitter.com/sophiecredo/status/780596271622926336,RT @h3h3productions: I am the 400 lb hacker #debatenight #debates #Debates2016 https://t.co/amWtGmGTcf,Yes,No,,,,,,http://pbs.twimg.com/media/CtU6xS8UMAAQXwX.jpg\r",
201+
"\r\n"
202+
]
203+
}
204+
],
205+
"source": [
206+
"!head -n 5 only100tweets.csv"
207+
]
208+
},
209+
{
210+
"cell_type": "markdown",
211+
"metadata": {},
212+
"source": [
213+
"That looks like the random sample we expected. It doesn't appear to be in any chronological order."
214+
]
215+
},
216+
{
217+
"cell_type": "code",
218+
"execution_count": 32,
219+
"metadata": {
220+
"collapsed": false
221+
},
222+
"outputs": [
223+
{
224+
"name": "stdout",
225+
"output_type": "stream",
226+
"text": [
227+
" 100 only100tweets.csv\r\n"
228+
]
229+
}
230+
],
231+
"source": [
232+
"!wc -l only100tweets.csv"
233+
]
234+
},
235+
{
236+
"cell_type": "markdown",
237+
"metadata": {},
238+
"source": [
239+
"Last but not least, we do need to reattach the header row (not applicable in the case of a line-oriented JSON file):"
240+
]
241+
},
242+
{
243+
"cell_type": "code",
244+
"execution_count": 33,
245+
"metadata": {
246+
"collapsed": true
247+
},
248+
"outputs": [],
249+
"source": [
250+
"!cat header.csv only100tweets.csv > debatetweets-100sample.csv"
251+
]
252+
},
253+
{
254+
"cell_type": "code",
255+
"execution_count": 34,
256+
"metadata": {
257+
"collapsed": false,
258+
"scrolled": true
259+
},
260+
"outputs": [
261+
{
262+
"name": "stdout",
263+
"output_type": "stream",
264+
"text": [
265+
"created_at,twitter_id,screen_name,followers_count,friends_count,favorite_count/like_count,retweet_count,hashtags,mentions,in_reply_to_screen_name,twitter_url,text,is_retweet,is_quote,coordinates,url1,url1_expanded,url2,url2_expanded,media_url\r",
266+
"\r\n",
267+
"2016-09-27 08:58:10+00:00,780692999651074048,beingmissdaisy,1278,1007,0,0,debatenight,Elijahkyama,,http://twitter.com/beingmissdaisy/status/780692999651074048,RT @Elijahkyama: The person who did this will be haunted for nothing https://t.co/69C2KHMAk3 #debatenight,Yes,No,,,,,,http://pbs.twimg.com/media/CtUjRLIVIAAK5VT.jpg\r",
268+
"\r\n",
269+
"2016-09-27 06:59:53+00:00,780663232537198592,PlatoSays,2141,993,0,0,\"debatenight, Debates2016\",Agent4Trump,,http://twitter.com/PlatoSays/status/780663232537198592,RT @Agent4Trump: Stop Lying Hillary Fact Check Trolls.... ICE union endorses Trump https://t.co/EaVN7m9rxN #debatenight #Debates2016,Yes,No,,https://t.co/EaVN7m9rxN,http://politi.co/2cWt8Wp,,,\r",
270+
"\r\n",
271+
"2016-09-27 09:43:24+00:00,780704382048428032,Alex70CDA,456,375,0,0,\"Narcos, Debate\",la_maquina,,http://twitter.com/Alex70CDA/status/780704382048428032,RT @la_maquina: That one time a rich ranting lunatic thought he would be President. #Narcos / #Debate https://t.co/Xp1wCcqYFA,Yes,No,,,,,,http://pbs.twimg.com/media/CtU6PrPVMAAIWU8.jpg\r",
272+
"\r\n",
273+
"2016-09-27 00:09:23+00:00,780559928876466178,proudliberalmom,5290,5762,0,0,debatenight,GovGaryJohnson,,http://twitter.com/proudliberalmom/status/780559928876466178,.@GovGaryJohnson u shouldn't b on debate stage. No US Prez candidate should do this.Fuck 3rd party vote #debatenight https://t.co/CW7TaGMDIi,No,No,,https://t.co/CW7TaGMDIi,https://youtu.be/NXhR41lsEJY,,,\r",
274+
"\r\n"
275+
]
276+
}
277+
],
278+
"source": [
279+
"!head -n 5 debatetweets-100sample.csv"
280+
]
281+
},
282+
{
283+
"cell_type": "markdown",
284+
"metadata": {},
285+
"source": [
286+
"Done!"
287+
]
288+
}
289+
],
290+
"metadata": {
291+
"kernelspec": {
292+
"display_name": "Python 3",
293+
"language": "python",
294+
"name": "python3"
295+
},
296+
"language_info": {
297+
"codemirror_mode": {
298+
"name": "ipython",
299+
"version": 3
300+
},
301+
"file_extension": ".py",
302+
"mimetype": "text/x-python",
303+
"name": "python",
304+
"nbconvert_exporter": "python",
305+
"pygments_lexer": "ipython3",
306+
"version": "3.5.2"
307+
}
308+
},
309+
"nbformat": 4,
310+
"nbformat_minor": 0
311+
}
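
For readers who don't have `shuf`/`gshuf` available, the same three steps (peel off the header, sample the remaining lines, reattach the header) can be sketched in plain Python. This is a minimal sketch and not part of the notebook above: the function name `sample_lines`, its parameters, and the demo file names are hypothetical, and the demo data is generated stand-in data rather than the real debatetweets.csv. Like the shell approach, it assumes exactly one observation per line, so it would split a CSV record whose quoted field contains a newline.

```python
import random
import tempfile
from pathlib import Path


def sample_lines(src, dst, k, has_header=True, seed=None):
    """Write a random sample of at most k lines from src to dst.

    If has_header is True, the first line of src is copied to dst unchanged
    and is excluded from sampling. Returns the number of lines sampled.
    """
    # newline="" leaves the file's own line endings (e.g. \r\n) untouched.
    with open(src, newline="", encoding="utf-8") as f:
        lines = f.readlines()

    header, rows = (lines[:1], lines[1:]) if has_header else ([], lines)

    # Sample without replacement; cap k at the number of available rows,
    # mirroring how `shuf -n COUNT` outputs "at most COUNT lines".
    rng = random.Random(seed)
    picked = rng.sample(rows, min(k, len(rows)))

    with open(dst, "w", newline="", encoding="utf-8") as f:
        f.writelines(header + picked)
    return len(picked)


# Demo on hypothetical stand-in data: 1 header row plus 1,000 observations.
workdir = Path(tempfile.mkdtemp())
demo_src = workdir / "demo.csv"
demo_dst = workdir / "demo-100sample.csv"
with open(demo_src, "w", newline="", encoding="utf-8") as f:
    f.write("id,text\r\n")
    f.writelines(f"{i},tweet {i}\r\n" for i in range(1000))

count = sample_lines(demo_src, demo_dst, 100, seed=42)
print(count)  # number of sampled rows, not counting the header
```

Passing `has_header=False` covers the line-oriented JSON case, where there is no header row to preserve. The `seed` argument is only there to make a sample reproducible; leaving it as `None` gives a different sample on each run, as `gshuf` does.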
