Initial commit - new notebook

kerchner · kerchner · commit 95edc5c0e125 · 2018-11-29T13:05:34.000-05:00
diff --git a/20181127-top-hashtags-json.ipynb b/20181127-top-hashtags-json.ipynb
@@ -0,0 +1,281 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Computing the top hashtags (JSON)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So you have tweets in a JSON file, and you'd like to get a list of the hashtags, from the most frequently occurring hashtags on down.\n",
+    "\n",
+    "There are *many, many* different ways to accomplish this.  Since we're working with the tweets in JSON format, this solution will use `jq`, as well as a few bash shell / command line tools: `cat`, `sort`, `uniq`, and `wc`.  If you haven't used `jq` yet, our [Working with Twitter Using jq](https://github.com/gwu-libraries/notebooks/blob/master/20160407-twitter-analysis-with-jq/Working-with-twitter-using-jq.ipynb) notebook is a good place to start.\n",
+    "\n",
+    "### Where are the hashtags in tweet JSON?\n",
+    "\n",
+    "When we look at a tweet, we see that it has a key called `entities`, and that the value of `entities` contains a key called `hashtags`.  The value of `hashtags` is a list (note the square brackets); each item in the list contains the text of a single hashtag, and the indices of the characters in the tweet text where the hashtag begins and ends.  \n",
+    "```\n",
+    "{\n",
+    "  created_at: \"Tue Oct 30 09:15:45 +0000 2018\",\n",
+    "  id: 1057199367411679200,\n",
+    "  id_str: \"1057199367411679234\",\n",
+    "  text: \"Lesson from Indra's elephant https://t.co/h5K3y5g4Ju #India #Hinduism #Buddhism #History #Culture https://t.co/qFyipqzPnE\",\n",
+    "\n",
+    "  ...\n",
+    "  \n",
+    "  entities: {\n",
+    "    hashtags: [\n",
+    "      {\n",
+    "        text: \"India\",\n",
+    "        indices: [\n",
+    "          54,\n",
+    "          60\n",
+    "        ]\n",
+    "      },\n",
+    "      {\n",
+    "        text: \"Hinduism\",\n",
+    "        indices: [\n",
+    "          61,\n",
+    "          70\n",
+    "        ]\n",
+    "      },\n",
+    "      {\n",
+    "        text: \"Buddhism\",\n",
+    "        indices: [\n",
+    "          71,\n",
+    "          80\n",
+    "        ]\n",
+    "      },\n",
+    "      {\n",
+    "        text: \"History\",\n",
+    "        indices: [\n",
+    "          81,\n",
+    "          89\n",
+    "        ]\n",
+    "      },\n",
+    "      {\n",
+    "        text: \"Culture\",\n",
+    "        indices: [\n",
+    "          90,\n",
+    "          98\n",
+    "        ]\n",
+    "      }\n",
+    "    ],\n",
+    " ...\n",
+    " ```\n",
+    " \n",
+    " When we use `jq`, we'll need to construct a filter that pulls out the hashtag text values.\n",
+    " "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "김유겸\r\n",
+      "유겸\r\n",
+      "Yugyeom\r\n",
+      "GOT7\r\n",
+      "갓세븐\r\n",
+      "PresentYou\r\n",
+      "LifeSite\r\n",
+      "あなたの名前から想像される色\r\n",
+      "صباح_الخير\r\n",
+      "music\r\n",
+      "network\r\n",
+      "ShootOut1stWin\r\n",
+      "acabateloparaustedes\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!cat 50tweets.json | jq -cr '[.entities.hashtags][0][].text'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!cat tweets4hashtags.json | jq -cr '[.entities.hashtags][0][].text' > allhashtags.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's see how many hashtags we extracted:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "  878806 allhashtags.txt\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!wc -l allhashtags.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "What we'd like to do now is to count up how many of each hashtag we have.  We'll use a combination of bash's `sort` and `uniq` commands for that.  We'll also use the `-c` option for `uniq`, which prefaces each line with the count of lines it collapsed together in the process of `uniq`ing a group of identical lines.  `sort`'s `-nr` options will allow us to sort by just the count on each line."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!cat allhashtags.txt | sort | uniq -c | sort -nr > rankedhashtags.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's take a look at what we have now."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "8170 EXO\r\n",
+      "4123 BTS\r\n",
+      "4061 TEMPO_SEHUN\r\n",
+      "3365 GOT7\r\n",
+      "3145 ローソン\r\n",
+      "2924 SEHUN\r\n",
+      "2773 EXO_DontMessUpMyTempo\r\n",
+      "2743 weareoneEXO\r\n",
+      "2705 Halloween\r\n",
+      "2661 갓세븐\r\n",
+      "2647 EXO_TEMPO\r\n",
+      "2645 워너블다음은없어\r\n",
+      "2403 몬스타엑스\r\n",
+      "2355 MONSTA_X\r\n",
+      "2339 엑소\r\n",
+      "2279 ごちろう\r\n",
+      "2267 지니인기상_달려라상탄\r\n",
+      "2165 방탄소년단\r\n",
+      "2161 塩にぎり無料プレゼント\r\n",
+      "2042 ShootOut1stWin\r\n",
+      "1951 ハロウィン\r\n",
+      "1933 아이즈원\r\n",
+      "1874 เป๊กผลิตโชค\r\n",
+      "1873 IZONE\r\n",
+      "1753 フードファンタジー\r\n",
+      "1675 フーファン\r\n",
+      "1668 食霊のティアラ\r\n",
+      "1635 Ask_EXO\r\n",
+      "1551 어디에도_없을_완벽한_EXO\r\n",
+      "1494 AppleEvent\r\n",
+      "1465 도경수\r\n",
+      "1444 ShootOut\r\n",
+      "1403 WasteItOnMe\r\n",
+      "1370 TWICE\r\n",
+      "1367 NCT\r\n",
+      "1271 SomosLaAudiencia30\r\n",
+      "1267 NewProfilePic\r\n",
+      "1253 백일의낭군님\r\n",
+      "1223 ﷺ\r\n",
+      "1183 BAEKHYUN\r\n",
+      "1174 더쇼\r\n",
+      "1163 재민\r\n",
+      "1123 MONSTAX\r\n",
+      "1120 트와이스\r\n",
+      "1068 ジェジュン\r\n",
+      "1063 ALDUBStillReigns\r\n",
+      "1054 JIMIN\r\n",
+      "1026 RMonoBB200\r\n",
+      "1014 RT\r\n",
+      "1002 EXO_QuintupleMillionSeller\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!head -n 50 rankedhashtags.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Personally, I have no idea what most of these hashtags are about, but this is apparently what people were tweeting about on October 31, 2018.\n",
+    "\n",
+    "And as for how many unique hashtags are in this set:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "  211986 rankedhashtags.txt\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!wc -l rankedhashtags.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Again, there are many different ways to approach this!  Let us know your thoughts and ideas."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}