|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Deriving a random subset of tweets (JSON or CSV)" |
| 8 | + ] |
| 9 | + }, |
| 10 | + { |
| 11 | + "cell_type": "markdown", |
| 12 | + "metadata": {}, |
| 13 | + "source": [ |
| 14 | + "Let's say you're working with a file containing tweets, and you'd like to derive a random sample of tweets from that file. For example, your set might be in chronological order, and you want just a random sample of 100 tweets from a mix of dates." |
| 15 | + ] |
| 16 | + }, |
| 17 | + { |
| 18 | + "cell_type": "markdown", |
| 19 | + "metadata": {}, |
| 20 | + "source": [ |
| 21 | + "Note that the technique we describe here works for any text-based data file with one observation per line, whether it holds tweets or some other type of data, and whether the observations are in CSV format, JSON format, or something else." |
| 22 | + ] |
| 23 | + }, |
| 24 | + { |
| 25 | + "cell_type": "markdown", |
| 26 | + "metadata": {}, |
| 27 | + "source": [ |
| 28 | + "Let's take a look at our sample data file:" |
| 29 | + ] |
| 30 | + }, |
| 31 | + { |
| 32 | + "cell_type": "code", |
| 33 | + "execution_count": 26, |
| 34 | + "metadata": { |
| 35 | + "collapsed": false |
| 36 | + }, |
| 37 | + "outputs": [ |
| 38 | + { |
| 39 | + "name": "stdout", |
| 40 | + "output_type": "stream", |
| 41 | + "text": [ |
| 42 | + "created_at,twitter_id,screen_name,followers_count,friends_count,favorite_count/like_count,retweet_count,hashtags,mentions,in_reply_to_screen_name,twitter_url,text,is_retweet,is_quote,coordinates,url1,url1_expanded,url2,url2_expanded,media_url\r", |
| 43 | + "\r\n", |
| 44 | + "2016-09-26 17:52:09+00:00,780464994479542272,tlajj5,21,18,0,0,\"TrumpTrain, MAGA\",realDonaldTrump,,http://twitter.com/tlajj5/status/780464994479542272,RT @realDonaldTrump: New national Bloomberg poll just released - thank you! Join the MOVEMENT: https://t.co/3KWOl2ibaW. #TrumpTrain #MAGA…,Yes,No,,https://t.co/3KWOl2ibaW,http://www.DonaldJTrump.com,,,\r", |
| 45 | + "\r\n", |
| 46 | + "2016-09-26 17:55:09+00:00,780465749794123777,Kosky98,87,82,0,0,Debates2016,HebertEtHalfred,,http://twitter.com/Kosky98/status/780465749794123777,RT @HebertEtHalfred: Quoi ? #Debates2016 ou quoi ?? https://t.co/NUYvbc0kYc,Yes,No,,,,,,http://pbs.twimg.com/media/CtTFQ2rWcAAKLb7.jpg\r", |
| 47 | + "\r\n", |
| 48 | + "2016-09-26 17:56:34+00:00,780466104720236545,DeplorableTink,84,66,0,0,\"sundaythoughts, ImWithHer, Debates2016, CrookedHillary\",taydark_77,,http://twitter.com/DeplorableTink/status/780466104720236545,RT @taydark_77: #sundaythoughts We all know who the real deplorable's are. Every #ImWithHer crew. #Debates2016 #CrookedHillary https://t.co…,Yes,No,,,,,,\r", |
| 49 | + "\r\n", |
| 50 | + "2016-09-26 18:03:19+00:00,780467803266682880,EmiliaSaez10,10,734,0,0,\"HILLARYPRESIDENT, HillaryClinton, presidentialelection, PresidentialDebate, debatenight\",\"VigilanteArtist, HillaryClinton\",,http://twitter.com/EmiliaSaez10/status/780467803266682880,RT @VigilanteArtist: RT #HILLARYPRESIDENT @HillaryClinton #HillaryClinton #presidentialelection #PresidentialDebate #debatenight #Debates…,Yes,No,,,,,,\r", |
| 51 | + "\r\n" |
| 52 | + ] |
| 53 | + } |
| 54 | + ], |
| 55 | + "source": [ |
| 56 | + "!head -n 5 debatetweets.csv" |
| 57 | + ] |
| 58 | + }, |
| 59 | + { |
| 60 | + "cell_type": "markdown", |
| 61 | + "metadata": {}, |
| 62 | + "source": [ |
| 63 | + "It appears to have a header row, which we'll deal with shortly, followed by the data. It looks like the tweets were written to the file in some sort of chronological order. How long is it?" |
| 64 | + ] |
| 65 | + }, |
| 66 | + { |
| 67 | + "cell_type": "code", |
| 68 | + "execution_count": 27, |
| 69 | + "metadata": { |
| 70 | + "collapsed": false |
| 71 | + }, |
| 72 | + "outputs": [ |
| 73 | + { |
| 74 | + "name": "stdout", |
| 75 | + "output_type": "stream", |
| 76 | + "text": [ |
| 77 | + " 1001 debatetweets.csv\r\n" |
| 78 | + ] |
| 79 | + } |
| 80 | + ], |
| 81 | + "source": [ |
| 82 | + "!wc -l debatetweets.csv" |
| 83 | + ] |
| 84 | + }, |
| 85 | + { |
| 86 | + "cell_type": "markdown", |
| 87 | + "metadata": {}, |
| 88 | + "source": [ |
| 89 | + "`wc -l` reports 1,001 lines, so that's 1 header row followed by 1,000 observations." |
| 90 | + ] |
| 91 | + }, |
| 92 | + { |
| 93 | + "cell_type": "markdown", |
| 94 | + "metadata": {}, |
| 95 | + "source": [ |
| 96 | + "The key tool for this is the `shuf` shell command, part of GNU Coreutils. It is available as `shuf` on most Linux systems; on macOS, installing Coreutils with Homebrew (`brew install coreutils`) provides it as `gshuf`, which is the name we use in this notebook. Let's see what it can do:" |
| 97 | + ] |
| 98 | + }, |
| 99 | + { |
| 100 | + "cell_type": "code", |
| 101 | + "execution_count": 28, |
| 102 | + "metadata": { |
| 103 | + "collapsed": false |
| 104 | + }, |
| 105 | + "outputs": [ |
| 106 | + { |
| 107 | + "name": "stdout", |
| 108 | + "output_type": "stream", |
| 109 | + "text": [ |
| 110 | + "Usage: gshuf [OPTION]... [FILE]\r\n", |
| 111 | + " or: gshuf -e [OPTION]... [ARG]...\r\n", |
| 112 | + " or: gshuf -i LO-HI [OPTION]...\r\n", |
| 113 | + "Write a random permutation of the input lines to standard output.\r\n", |
| 114 | + "\r\n", |
| 115 | + "With no FILE, or when FILE is -, read standard input.\r\n", |
| 116 | + "\r\n", |
| 117 | + "Mandatory arguments to long options are mandatory for short options too.\r\n", |
| 118 | + " -e, --echo treat each ARG as an input line\r\n", |
| 119 | + " -i, --input-range=LO-HI treat each number LO through HI as an input line\r\n", |
| 120 | + " -n, --head-count=COUNT output at most COUNT lines\r\n", |
| 121 | + " -o, --output=FILE write result to FILE instead of standard output\r\n", |
| 122 | + " --random-source=FILE get random bytes from FILE\r\n", |
| 123 | + " -r, --repeat output lines can be repeated\r\n", |
| 124 | + " -z, --zero-terminated line delimiter is NUL, not newline\r\n", |
| 125 | + " --help display this help and exit\r\n", |
| 126 | + " --version output version information and exit\r\n", |
| 127 | + "\r\n", |
| 128 | + "GNU coreutils online help: <http://www.gnu.org/software/coreutils/>\r\n", |
| 129 | + "Full documentation at: <http://www.gnu.org/software/coreutils/shuf>\r\n", |
| 130 | + "or available locally via: info '(coreutils) shuf invocation'\r\n" |
| 131 | + ] |
| 132 | + } |
| 133 | + ], |
| 134 | + "source": [ |
| 135 | + "!gshuf --help" |
| 136 | + ] |
| 137 | + }, |
| 138 | + { |
| 139 | + "cell_type": "markdown", |
| 140 | + "metadata": {}, |
| 141 | + "source": [ |
| 142 | + "It looks like this is something we can use! First we need to peel off the header row; we'll put it back later:" |
| 143 | + ] |
| 144 | + }, |
| 145 | + { |
| 146 | + "cell_type": "code", |
| 147 | + "execution_count": 29, |
| 148 | + "metadata": { |
| 149 | + "collapsed": false |
| 150 | + }, |
| 151 | + "outputs": [], |
| 152 | + "source": [ |
| 153 | + "!head -n 1 debatetweets.csv > header.csv" |
| 154 | + ] |
| 155 | + }, |
| 156 | + { |
| 157 | + "cell_type": "markdown", |
| 158 | + "metadata": {}, |
| 159 | + "source": [ |
| 160 | + "Now for the main event. `tail -n +2` prints everything from the second line onward, that is, everything *except* the header row of debatetweets.csv. We'll pipe ( `|` ) that output to `gshuf` and use its `-n` option to request only 100 lines:" |
| 161 | + ] |
| 162 | + }, |
| 163 | + { |
| 164 | + "cell_type": "code", |
| 165 | + "execution_count": 30, |
| 166 | + "metadata": { |
| 167 | + "collapsed": true |
| 168 | + }, |
| 169 | + "outputs": [], |
| 170 | + "source": [ |
| 171 | + "!tail -n +2 debatetweets.csv | gshuf -n 100 > only100tweets.csv" |
| 172 | + ] |
| 173 | + }, |
| 174 | + { |
| 175 | + "cell_type": "markdown", |
| 176 | + "metadata": {}, |
| 177 | + "source": [ |
| 178 | + "Now let's take a quick look at what we got:" |
| 179 | + ] |
| 180 | + }, |
| 181 | + { |
| 182 | + "cell_type": "code", |
| 183 | + "execution_count": 31, |
| 184 | + "metadata": { |
| 185 | + "collapsed": false |
| 186 | + }, |
| 187 | + "outputs": [ |
| 188 | + { |
| 189 | + "name": "stdout", |
| 190 | + "output_type": "stream", |
| 191 | + "text": [ |
| 192 | + "2016-09-27 08:58:10+00:00,780692999651074048,beingmissdaisy,1278,1007,0,0,debatenight,Elijahkyama,,http://twitter.com/beingmissdaisy/status/780692999651074048,RT @Elijahkyama: The person who did this will be haunted for nothing https://t.co/69C2KHMAk3 #debatenight,Yes,No,,,,,,http://pbs.twimg.com/media/CtUjRLIVIAAK5VT.jpg\r", |
| 193 | + "\r\n", |
| 194 | + "2016-09-27 06:59:53+00:00,780663232537198592,PlatoSays,2141,993,0,0,\"debatenight, Debates2016\",Agent4Trump,,http://twitter.com/PlatoSays/status/780663232537198592,RT @Agent4Trump: Stop Lying Hillary Fact Check Trolls.... ICE union endorses Trump https://t.co/EaVN7m9rxN #debatenight #Debates2016,Yes,No,,https://t.co/EaVN7m9rxN,http://politi.co/2cWt8Wp,,,\r", |
| 195 | + "\r\n", |
| 196 | + "2016-09-27 09:43:24+00:00,780704382048428032,Alex70CDA,456,375,0,0,\"Narcos, Debate\",la_maquina,,http://twitter.com/Alex70CDA/status/780704382048428032,RT @la_maquina: That one time a rich ranting lunatic thought he would be President. #Narcos / #Debate https://t.co/Xp1wCcqYFA,Yes,No,,,,,,http://pbs.twimg.com/media/CtU6PrPVMAAIWU8.jpg\r", |
| 197 | + "\r\n", |
| 198 | + "2016-09-27 00:09:23+00:00,780559928876466178,proudliberalmom,5290,5762,0,0,debatenight,GovGaryJohnson,,http://twitter.com/proudliberalmom/status/780559928876466178,.@GovGaryJohnson u shouldn't b on debate stage. No US Prez candidate should do this.Fuck 3rd party vote #debatenight https://t.co/CW7TaGMDIi,No,No,,https://t.co/CW7TaGMDIi,https://youtu.be/NXhR41lsEJY,,,\r", |
| 199 | + "\r\n", |
| 200 | + "2016-09-27 02:33:48+00:00,780596271622926336,sophiecredo,199,312,0,0,\"debatenight, debates, Debates2016\",h3h3productions,,http://twitter.com/sophiecredo/status/780596271622926336,RT @h3h3productions: I am the 400 lb hacker #debatenight #debates #Debates2016 https://t.co/amWtGmGTcf,Yes,No,,,,,,http://pbs.twimg.com/media/CtU6xS8UMAAQXwX.jpg\r", |
| 201 | + "\r\n" |
| 202 | + ] |
| 203 | + } |
| 204 | + ], |
| 205 | + "source": [ |
| 206 | + "!head -n 5 only100tweets.csv" |
| 207 | + ] |
| 208 | + }, |
| 209 | + { |
| 210 | + "cell_type": "markdown", |
| 211 | + "metadata": {}, |
| 212 | + "source": [ |
| 213 | + "That looks like the random sample we expected: the rows are no longer in chronological order. Let's confirm the line count:" |
| 214 | + ] |
| 215 | + }, |
| 216 | + { |
| 217 | + "cell_type": "code", |
| 218 | + "execution_count": 32, |
| 219 | + "metadata": { |
| 220 | + "collapsed": false |
| 221 | + }, |
| 222 | + "outputs": [ |
| 223 | + { |
| 224 | + "name": "stdout", |
| 225 | + "output_type": "stream", |
| 226 | + "text": [ |
| 227 | + " 100 only100tweets.csv\r\n" |
| 228 | + ] |
| 229 | + } |
| 230 | + ], |
| 231 | + "source": [ |
| 232 | + "!wc -l only100tweets.csv" |
| 233 | + ] |
| 234 | + }, |
| 235 | + { |
| 236 | + "cell_type": "markdown", |
| 237 | + "metadata": {}, |
| 238 | + "source": [ |
| 239 | + "Last but not least, we do need to reattach the header row (not applicable in the case of a line-oriented JSON file):" |
| 240 | + ] |
| 241 | + }, |
| 242 | + { |
| 243 | + "cell_type": "code", |
| 244 | + "execution_count": 33, |
| 245 | + "metadata": { |
| 246 | + "collapsed": true |
| 247 | + }, |
| 248 | + "outputs": [], |
| 249 | + "source": [ |
| 250 | + "!cat header.csv only100tweets.csv > debatetweets-100sample.csv" |
| 251 | + ] |
| 252 | + }, |
| 253 | + { |
| 254 | + "cell_type": "code", |
| 255 | + "execution_count": 34, |
| 256 | + "metadata": { |
| 257 | + "collapsed": false, |
| 258 | + "scrolled": true |
| 259 | + }, |
| 260 | + "outputs": [ |
| 261 | + { |
| 262 | + "name": "stdout", |
| 263 | + "output_type": "stream", |
| 264 | + "text": [ |
| 265 | + "created_at,twitter_id,screen_name,followers_count,friends_count,favorite_count/like_count,retweet_count,hashtags,mentions,in_reply_to_screen_name,twitter_url,text,is_retweet,is_quote,coordinates,url1,url1_expanded,url2,url2_expanded,media_url\r", |
| 266 | + "\r\n", |
| 267 | + "2016-09-27 08:58:10+00:00,780692999651074048,beingmissdaisy,1278,1007,0,0,debatenight,Elijahkyama,,http://twitter.com/beingmissdaisy/status/780692999651074048,RT @Elijahkyama: The person who did this will be haunted for nothing https://t.co/69C2KHMAk3 #debatenight,Yes,No,,,,,,http://pbs.twimg.com/media/CtUjRLIVIAAK5VT.jpg\r", |
| 268 | + "\r\n", |
| 269 | + "2016-09-27 06:59:53+00:00,780663232537198592,PlatoSays,2141,993,0,0,\"debatenight, Debates2016\",Agent4Trump,,http://twitter.com/PlatoSays/status/780663232537198592,RT @Agent4Trump: Stop Lying Hillary Fact Check Trolls.... ICE union endorses Trump https://t.co/EaVN7m9rxN #debatenight #Debates2016,Yes,No,,https://t.co/EaVN7m9rxN,http://politi.co/2cWt8Wp,,,\r", |
| 270 | + "\r\n", |
| 271 | + "2016-09-27 09:43:24+00:00,780704382048428032,Alex70CDA,456,375,0,0,\"Narcos, Debate\",la_maquina,,http://twitter.com/Alex70CDA/status/780704382048428032,RT @la_maquina: That one time a rich ranting lunatic thought he would be President. #Narcos / #Debate https://t.co/Xp1wCcqYFA,Yes,No,,,,,,http://pbs.twimg.com/media/CtU6PrPVMAAIWU8.jpg\r", |
| 272 | + "\r\n", |
| 273 | + "2016-09-27 00:09:23+00:00,780559928876466178,proudliberalmom,5290,5762,0,0,debatenight,GovGaryJohnson,,http://twitter.com/proudliberalmom/status/780559928876466178,.@GovGaryJohnson u shouldn't b on debate stage. No US Prez candidate should do this.Fuck 3rd party vote #debatenight https://t.co/CW7TaGMDIi,No,No,,https://t.co/CW7TaGMDIi,https://youtu.be/NXhR41lsEJY,,,\r", |
| 274 | + "\r\n" |
| 275 | + ] |
| 276 | + } |
| 277 | + ], |
| 278 | + "source": [ |
| 279 | + "!head -n 5 debatetweets-100sample.csv" |
| 280 | + ] |
| 281 | + }, |
| 282 | + { |
| 283 | + "cell_type": "markdown", |
| 284 | + "metadata": {}, |
| 285 | + "source": [ |
| 286 | + "Done!" |
| 287 | + ] |
| 288 | + } |
| 289 | + ], |
| 290 | + "metadata": { |
| 291 | + "kernelspec": { |
| 292 | + "display_name": "Python 3", |
| 293 | + "language": "python", |
| 294 | + "name": "python3" |
| 295 | + }, |
| 296 | + "language_info": { |
| 297 | + "codemirror_mode": { |
| 298 | + "name": "ipython", |
| 299 | + "version": 3 |
| 300 | + }, |
| 301 | + "file_extension": ".py", |
| 302 | + "mimetype": "text/x-python", |
| 303 | + "name": "python", |
| 304 | + "nbconvert_exporter": "python", |
| 305 | + "pygments_lexer": "ipython3", |
| 306 | + "version": "3.5.2" |
| 307 | + } |
| 308 | + }, |
| 309 | + "nbformat": 4, |
| 310 | + "nbformat_minor": 0 |
| 311 | +} |
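
The same header-preserving sample can also be drawn without `shuf`, using only Python's standard library. What follows is a minimal sketch and not part of the notebook above: the helper name `sample_lines` and its `has_header` and `seed` parameters are assumptions introduced for illustration. Like the shell pipeline, it assumes one observation per physical line, so a CSV with newlines embedded inside quoted fields would need a real CSV parser instead.

```python
import random


def sample_lines(lines, k, has_header=True, seed=None):
    """Return the header (if any) followed by up to k randomly chosen data lines.

    Hypothetical helper, not from the notebook. It mirrors the shell steps:
    head -n 1 (keep the header), tail -n +2 | shuf -n k (sample the rest),
    and cat (put the header back on top).
    """
    lines = list(lines)
    header = lines[:1] if has_header else []
    body = lines[1:] if has_header else lines
    # Passing a seed makes the sample repeatable; None gives a fresh sample each run.
    rng = random.Random(seed)
    # Sample without replacement, returning at most k lines (as shuf -n does).
    picked = rng.sample(body, min(k, len(body)))
    return header + picked


if __name__ == "__main__":
    # Small in-memory stand-in for debatetweets.csv: 1 header + 1,000 rows.
    demo = ["id,text\n"] + ["{},tweet {}\n".format(i, i) for i in range(1000)]
    result = sample_lines(demo, 100, seed=42)
    print(len(result))  # 101: the header plus 100 sampled rows
```

To apply it to a real file, one could read the lines with `open(path).readlines()`, pass them to `sample_lines`, and write the result back out with `writelines`. For a line-oriented JSON file, set `has_header=False`.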