Commit c2da4ff

committed Sep 7, 2019
Embeddings enhancements started
Former-commit-id: b1f75ae
1 parent 935af42 commit c2da4ff

35 files changed: 492 additions and 3 deletions

‎.gitignore

100644 → 100755
File mode changed.

LICENSE

100644 → 100755
File mode changed.

examples/0. Embeddings Generation/1. (proof of concept) DQN.ipynb

100644 → 100755
File mode changed.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,360 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Movie Parsing\n",
8+
"## OMDB\n",
9+
"OMDB is the Open Movie Database. Although it is open, you need to pay 1 dollar to get a key, which lets you send up to 100k requests/day. For 5 dollars you also get access to the poster API.\n",
10+
"\n",
11+
"http://www.omdbapi.com/"
12+
]
13+
},
14+
{
15+
"cell_type": "code",
16+
"execution_count": 1,
17+
"metadata": {},
18+
"outputs": [],
19+
"source": [
20+
"import pandas as pd\n",
21+
"import requests\n",
22+
"from tqdm import tqdm_notebook as tqdm\n",
23+
"import json\n",
24+
"\n",
25+
"myOmdbKey = 'your key here' # an OMDB key costs $1 on Patreon\n",
26+
"movies = pd.read_csv('../../../../data/ml-20m/links.csv')\n",
27+
"movies['imdbId'] = movies['imdbId'].apply(lambda i: '0' * (8 - len(str(i))) + str(i))\n",
28+
"movies['tmdbId'] = movies['tmdbId'].fillna(-1).astype(int).apply(str)\n",
29+
"movies = movies.set_index('movieId')\n",
30+
"movies = movies.to_dict(orient='index')"
31+
]
32+
},
33+
{
34+
"cell_type": "markdown",
35+
"metadata": {},
36+
"source": [
37+
"> If `failed` is printed, run the fetch block below again until it finishes without failures"
38+
]
39+
},
40+
{
41+
"cell_type": "code",
42+
"execution_count": 44,
43+
"metadata": {},
44+
"outputs": [],
45+
"source": [
46+
"movies = json.load(open(\"../../../../data/parsed/omdb.json\", \"r\") )"
47+
]
48+
},
49+
{
50+
"cell_type": "code",
51+
"execution_count": 9,
52+
"metadata": {},
53+
"outputs": [
54+
{
55+
"data": {
56+
"application/vnd.jupyter.widget-view+json": {
57+
"model_id": "c1f10c26028c4326b7c7e6e2e43d1894",
58+
"version_major": 2,
59+
"version_minor": 0
60+
},
61+
"text/plain": [
62+
"HBox(children=(IntProgress(value=0, max=27278), HTML(value='')))"
63+
]
64+
},
65+
"metadata": {},
66+
"output_type": "display_data"
67+
},
68+
{
69+
"name": "stdout",
70+
"output_type": "stream",
71+
"text": [
72+
"\n"
73+
]
74+
}
75+
],
76+
"source": [
77+
"for id in tqdm(movies.keys()):\n",
78+
"    imdbId = movies[id]['imdbId']\n",
79+
"    if movies[id].get('omdb', False):\n",
80+
"        continue\n",
81+
"    try:\n",
82+
"        movies[id]['omdb'] = requests.get(\"http://www.omdbapi.com/?i=tt{}&apikey={}&plot=full\".format(imdbId,\n",
83+
"            myOmdbKey)).json()\n",
84+
"    except Exception:\n",
85+
"        print(id, imdbId, 'failed')"
86+
]
87+
},
88+
{
89+
"cell_type": "code",
90+
"execution_count": 17,
91+
"metadata": {},
92+
"outputs": [],
93+
"source": [
94+
"with open(\"../../../../data/parsed/omdb.json\", \"w\") as write_file:\n",
95+
"    json.dump(movies, write_file)"
96+
]
97+
},
98+
{
99+
"cell_type": "code",
100+
"execution_count": 46,
101+
"metadata": {},
102+
"outputs": [
103+
{
104+
"data": {
105+
"text/plain": [
106+
"{'imdbId': '00114709',\n",
107+
" 'tmdbId': '862',\n",
108+
" 'omdb': {'Title': 'Toy Story',\n",
109+
" 'Year': '1995',\n",
110+
" 'Rated': 'G',\n",
111+
" 'Released': '22 Nov 1995',\n",
112+
" 'Runtime': '81 min',\n",
113+
" 'Genre': 'Animation, Adventure, Comedy, Family, Fantasy',\n",
114+
" 'Director': 'John Lasseter',\n",
115+
" 'Writer': 'John Lasseter (original story by), Pete Docter (original story by), Andrew Stanton (original story by), Joe Ranft (original story by), Joss Whedon (screenplay by), Andrew Stanton (screenplay by), Joel Cohen (screenplay by), Alec Sokolow (screenplay by)',\n",
116+
" 'Actors': 'Tom Hanks, Tim Allen, Don Rickles, Jim Varney',\n",
117+
" 'Plot': 'A little boy named Andy loves to be in his room, playing with his toys, especially his doll named \"Woody\". But, what do the toys do when Andy is not with them, they come to life. Woody believes that he has life (as a toy) good. However, he must worry about Andy\\'s family moving, and what Woody does not know is about Andy\\'s birthday party. Woody does not realize that Andy\\'s mother gave him an action figure known as Buzz Lightyear, who does not believe that he is a toy, and quickly becomes Andy\\'s new favorite toy. Woody, who is now consumed with jealousy, tries to get rid of Buzz. Then, both Woody and Buzz are now lost. They must find a way to get back to Andy before he moves without them, but they will have to pass through a ruthless toy killer, Sid Phillips.',\n",
118+
" 'Language': 'English',\n",
119+
" 'Country': 'USA',\n",
120+
" 'Awards': 'Nominated for 3 Oscars. Another 23 wins & 17 nominations.',\n",
121+
" 'Poster': 'https://m.media-amazon.com/images/M/MV5BMDU2ZWJlMjktMTRhMy00ZTA5LWEzNDgtYmNmZTEwZTViZWJkXkEyXkFqcGdeQXVyNDQ2OTk4MzI@._V1_SX300.jpg',\n",
122+
" 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '8.3/10'},\n",
123+
" {'Source': 'Rotten Tomatoes', 'Value': '100%'},\n",
124+
" {'Source': 'Metacritic', 'Value': '95/100'}],\n",
125+
" 'Metascore': '95',\n",
126+
" 'imdbRating': '8.3',\n",
127+
" 'imdbVotes': '810,875',\n",
128+
" 'imdbID': 'tt0114709',\n",
129+
" 'Type': 'movie',\n",
130+
" 'DVD': '20 Mar 2001',\n",
131+
" 'BoxOffice': 'N/A',\n",
132+
" 'Production': 'Buena Vista',\n",
133+
" 'Website': 'http://www.disney.com/ToyStory',\n",
134+
" 'Response': 'True'}}"
135+
]
136+
},
137+
"execution_count": 46,
138+
"metadata": {},
139+
"output_type": "execute_result"
140+
}
141+
],
142+
"source": [
143+
"movies['1']"
144+
]
145+
},
146+
{
147+
"cell_type": "markdown",
148+
"metadata": {},
149+
"source": [
150+
"## TMDB"
151+
]
152+
},
153+
{
154+
"cell_type": "code",
155+
"execution_count": 25,
156+
"metadata": {},
157+
"outputs": [],
158+
"source": [
159+
"import pandas as pd\n",
160+
"import requests\n",
161+
"from tqdm import tqdm_notebook as tqdm\n",
162+
"import json\n",
163+
"\n",
164+
"myTmdbKey = 'your key here' # you can get it for free if you ask them nicely\n",
165+
"movies = pd.read_csv('../../../../data/ml-20m/links.csv')\n",
166+
"movies['imdbId'] = movies['imdbId'].apply(lambda i: '0' * (8 - len(str(i))) + str(i))\n",
167+
"movies['tmdbId'] = movies['tmdbId'].fillna(-1).astype(int).apply(str)\n",
168+
"movies = movies.set_index('movieId')\n",
169+
"movies = movies.to_dict(orient='index')"
170+
]
171+
},
172+
{
173+
"cell_type": "code",
174+
"execution_count": 26,
175+
"metadata": {},
176+
"outputs": [],
177+
"source": [
178+
"import asyncio\n",
179+
"# ! pip install aiohttp --user\n",
180+
"import aiohttp\n",
181+
"# ! pip install asyncio-throttle --user\n",
182+
"from asyncio_throttle import Throttler"
183+
]
184+
},
185+
{
186+
"cell_type": "code",
187+
"execution_count": 27,
188+
"metadata": {},
189+
"outputs": [],
190+
"source": [
191+
"# movies = json.load(open(\"../../../../data/parsed/tmdb.json\", \"r\") )"
192+
]
193+
},
194+
{
195+
"cell_type": "markdown",
196+
"metadata": {},
197+
"source": [
198+
"> You can also run this code multiple times; movies that already have a valid response are skipped"
199+
]
200+
},
201+
{
202+
"cell_type": "code",
203+
"execution_count": 28,
204+
"metadata": {},
205+
"outputs": [
206+
{
207+
"data": {
208+
"application/vnd.jupyter.widget-view+json": {
209+
"model_id": "aaacbda54580430396f4ae20b114f146",
210+
"version_major": 2,
211+
"version_minor": 0
212+
},
213+
"text/plain": [
214+
"HBox(children=(IntProgress(value=0, max=27278), HTML(value='')))"
215+
]
216+
},
217+
"metadata": {},
218+
"output_type": "display_data"
219+
},
220+
{
221+
"name": "stdout",
222+
"output_type": "stream",
223+
"text": [
224+
"\n"
225+
]
226+
}
227+
],
228+
"source": [
229+
"throttler = Throttler(rate_limit=4, period=2)\n",
230+
"\n",
231+
"async def tmdb(session, id, tmdbId):\n",
232+
"    url = \"https://api.themoviedb.org/3/movie/{}?api_key={}\".format(tmdbId, myTmdbKey)\n",
233+
"    async with throttler:\n",
234+
"        async with session.get(url) as resp:\n",
235+
"            if resp.status == 429:\n",
236+
"                print('throttling')\n",
237+
"                await asyncio.sleep(0.2)\n",
238+
"\n",
239+
"            movies[id]['tmdb'] = await resp.json()\n",
240+
"\n",
241+
"            # this also controls the timespan between calls\n",
242+
"            await asyncio.sleep(0.05)\n",
243+
"\n",
244+
"\n",
245+
"async def main():\n",
246+
"    async with aiohttp.ClientSession() as session:\n",
247+
"        for id in tqdm(movies.keys()):\n",
248+
"            tmdbId = movies[id]['tmdbId']\n",
249+
"            if movies[id].get('tmdb', False) and 'status_code' not in movies[id]['tmdb']:\n",
250+
"                continue\n",
251+
"            await tmdb(session, id, tmdbId)\n",
252+
"\n",
253+
"\n",
254+
"if __name__ == '__main__':\n",
255+
"    loop = asyncio.get_event_loop()\n",
256+
"    loop.create_task(main())"
257+
]
258+
},
259+
{
260+
"cell_type": "code",
261+
"execution_count": 31,
262+
"metadata": {},
263+
"outputs": [],
264+
"source": [
265+
"with open(\"../../../../data/parsed/tmdb.json\", \"w\") as write_file:\n",
266+
"    json.dump(movies, write_file)"
267+
]
268+
},
269+
{
270+
"cell_type": "code",
271+
"execution_count": 43,
272+
"metadata": {},
273+
"outputs": [
274+
{
275+
"data": {
276+
"text/plain": [
277+
"{'imdbId': '00114709',\n",
278+
" 'tmdbId': '862',\n",
279+
" 'tmdb': {'adult': False,\n",
280+
" 'backdrop_path': '/dji4Fm0gCDVb9DQQMRvAI8YNnTz.jpg',\n",
281+
" 'belongs_to_collection': {'id': 10194,\n",
282+
" 'name': 'Toy Story Collection',\n",
283+
" 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg',\n",
284+
" 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'},\n",
285+
" 'budget': 30000000,\n",
286+
" 'genres': [{'id': 16, 'name': 'Animation'},\n",
287+
" {'id': 35, 'name': 'Comedy'},\n",
288+
" {'id': 10751, 'name': 'Family'}],\n",
289+
" 'homepage': 'http://toystory.disney.com/toy-story',\n",
290+
" 'id': 862,\n",
291+
" 'imdb_id': 'tt0114709',\n",
292+
" 'original_language': 'en',\n",
293+
" 'original_title': 'Toy Story',\n",
294+
" 'overview': \"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.\",\n",
295+
" 'popularity': 29.3,\n",
296+
" 'poster_path': '/rhIRbceoE9lR4veEXuwCC2wARtG.jpg',\n",
297+
" 'production_companies': [{'id': 3,\n",
298+
" 'logo_path': '/1TjvGVDMYsj6JBxOAkUHpPEwLf7.png',\n",
299+
" 'name': 'Pixar',\n",
300+
" 'origin_country': 'US'}],\n",
301+
" 'production_countries': [{'iso_3166_1': 'US',\n",
302+
" 'name': 'United States of America'}],\n",
303+
" 'release_date': '1995-10-30',\n",
304+
" 'revenue': 373554033,\n",
305+
" 'runtime': 81,\n",
306+
" 'spoken_languages': [{'iso_639_1': 'en', 'name': 'English'}],\n",
307+
" 'status': 'Released',\n",
308+
" 'tagline': '',\n",
309+
" 'title': 'Toy Story',\n",
310+
" 'video': False,\n",
311+
" 'vote_average': 7.9,\n",
312+
" 'vote_count': 10896}}"
313+
]
314+
},
315+
"execution_count": 43,
316+
"metadata": {},
317+
"output_type": "execute_result"
318+
}
319+
],
320+
"source": [
321+
"movies['1']"
322+
]
323+
},
324+
{
325+
"cell_type": "code",
326+
"execution_count": null,
327+
"metadata": {},
328+
"outputs": [],
329+
"source": []
330+
},
331+
{
332+
"cell_type": "code",
333+
"execution_count": null,
334+
"metadata": {},
335+
"outputs": [],
336+
"source": []
337+
}
338+
],
339+
"metadata": {
340+
"kernelspec": {
341+
"display_name": "Python 3",
342+
"language": "python",
343+
"name": "python3"
344+
},
345+
"language_info": {
346+
"codemirror_mode": {
347+
"name": "ipython",
348+
"version": 3
349+
},
350+
"file_extension": ".py",
351+
"mimetype": "text/x-python",
352+
"name": "python",
353+
"nbconvert_exporter": "python",
354+
"pygments_lexer": "ipython3",
355+
"version": "3.7.3"
356+
}
357+
},
358+
"nbformat": 4,
359+
"nbformat_minor": 2
360+
}
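
The OMDB cell in the notebook above zero-pads each IMDb id and skips movies that already carry a response, which is what makes it safe to rerun after a failure. A minimal sketch of that pattern follows. It assumes a hypothetical `fetch` callable standing in for the real `requests.get(...).json()` call, and the helper names `pad_imdb_id` and `fetch_missing` are illustrative, not part of the commit.

```python
def pad_imdb_id(raw_id, width=8):
    """Left-pad a numeric IMDb id with zeros, like the notebook's lambda."""
    return str(raw_id).zfill(width)


def fetch_missing(movies, fetch, key="omdb"):
    """Fill movies[id][key] where it is missing; return the ids that failed.

    `fetch` is a hypothetical callable (imdb_id -> dict) used in place of the
    HTTP request, so this sketch runs without network access.
    """
    failed = []
    for movie_id, record in movies.items():
        if record.get(key):  # already fetched on an earlier run
            continue
        try:
            record[key] = fetch(record["imdbId"])
        except Exception:  # leave the entry empty so a rerun retries it
            failed.append(movie_id)
    return failed


if __name__ == "__main__":
    # Movie "1" uses the id shown in the notebook output; "demo" is made up.
    movies = {
        "1": {"imdbId": pad_imdb_id(114709)},
        "demo": {"imdbId": pad_imdb_id(42)},
    }
    calls = {"n": 0}

    def flaky_fetch(imdb_id):
        # Hypothetical stand-in: the first call fails, later calls succeed.
        calls["n"] += 1
        if calls["n"] == 1:
            raise ConnectionError("simulated network error")
        return {"imdbID": "tt" + imdb_id, "Response": "True"}

    print(fetch_missing(movies, flaky_fetch))  # first pass reports failures
    print(fetch_missing(movies, flaky_fetch))  # rerun retries only those
```

Returning the failed ids, rather than only printing them, lets a caller loop until the list is empty instead of rerunning the cell by hand.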
@@ -0,0 +1,129 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# NLP with RoBERTa\n",
8+
"### How do you get the fixed representation? Did you do pooling or something?\n",
9+
"\n",
10+
"Yes, pooling is required to get a fixed representation of a sentence. In the default strategy REDUCE_MEAN, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling."
11+
]
12+
},
13+
{
14+
"cell_type": "code",
15+
"execution_count": 34,
16+
"metadata": {},
17+
"outputs": [],
18+
"source": [
19+
"import torch\n",
20+
"import json\n",
21+
"import torch.nn.functional as F"
22+
]
23+
},
24+
{
25+
"cell_type": "code",
26+
"execution_count": 5,
27+
"metadata": {},
28+
"outputs": [
29+
{
30+
"name": "stderr",
31+
"output_type": "stream",
32+
"text": [
33+
"Using cache found in /home/dev/.cache/torch/hub/pytorch_fairseq_master\n"
34+
]
35+
},
36+
{
37+
"name": "stdout",
38+
"output_type": "stream",
39+
"text": [
40+
"loading archive file http://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz from cache at /home/dev/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2\n",
41+
"| dictionary: 50264 types\n",
42+
"\n"
43+
]
44+
}
45+
],
46+
"source": [
47+
"roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')\n",
48+
"roberta.eval()\n",
49+
"print()"
50+
]
51+
},
52+
{
53+
"cell_type": "code",
54+
"execution_count": 9,
55+
"metadata": {},
56+
"outputs": [],
57+
"source": [
58+
"movies = json.load(open(\"../../../../data/parsed/omdb.json\", \"r\") )"
59+
]
60+
},
61+
{
62+
"cell_type": "code",
63+
"execution_count": 35,
64+
"metadata": {},
65+
"outputs": [],
66+
"source": [
67+
"features = roberta.extract_features(roberta.encode(movies['1']['omdb']['Plot']))"
68+
]
69+
},
70+
{
71+
"cell_type": "code",
72+
"execution_count": 41,
73+
"metadata": {},
74+
"outputs": [],
75+
"source": [
76+
"pooled_features = F.avg_pool2d(features, (features.size(1), 1)).squeeze()"
77+
]
78+
},
79+
{
80+
"cell_type": "code",
81+
"execution_count": 42,
82+
"metadata": {},
83+
"outputs": [
84+
{
85+
"data": {
86+
"text/plain": [
87+
"tensor([-0.0292, 0.0314, -0.2926, ..., -0.0074, 0.0038, 0.2130],\n",
88+
" grad_fn=<SqueezeBackward0>)"
89+
]
90+
},
91+
"execution_count": 42,
92+
"metadata": {},
93+
"output_type": "execute_result"
94+
}
95+
],
96+
"source": [
97+
"pooled_features"
98+
]
99+
},
100+
{
101+
"cell_type": "code",
102+
"execution_count": null,
103+
"metadata": {},
104+
"outputs": [],
105+
"source": []
106+
}
107+
],
108+
"metadata": {
109+
"kernelspec": {
110+
"display_name": "Python 3",
111+
"language": "python",
112+
"name": "python3"
113+
},
114+
"language_info": {
115+
"codemirror_mode": {
116+
"name": "ipython",
117+
"version": 3
118+
},
119+
"file_extension": ".py",
120+
"mimetype": "text/x-python",
121+
"name": "python",
122+
"nbconvert_exporter": "python",
123+
"pygments_lexer": "ipython3",
124+
"version": "3.7.3"
125+
}
126+
},
127+
"nbformat": 4,
128+
"nbformat_minor": 2
129+
}
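
The RoBERTa notebook above turns a variable-length sequence of token features into one fixed-size vector by averaging over the token axis (`F.avg_pool2d` with a kernel spanning all tokens, which for a `(1, tokens, dim)` tensor amounts to a mean over dimension 1). The sketch below shows the same mean pooling on plain Python lists so it runs without PyTorch. The function name `mean_pool` and the toy numbers are assumptions for illustration, not model output.

```python
def mean_pool(token_vectors):
    """Average equal-length token vectors into one fixed-size vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    # One output component per feature: the mean of that feature over tokens.
    return [sum(v[i] for v in token_vectors) / count for i in range(dim)]


# Toy input: 3 "tokens" with 2 features each (made-up numbers).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(tokens)
print(pooled)  # [3.0, 4.0]
```

The output length depends only on the feature dimension, not on the number of tokens, which is why plots of different lengths yield vectors of the same size.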

examples/0. Embeddings Generation/0. (optional) Statistics Approach (PCA/uMAP)/2.EDA.ipynb → examples/0. Embeddings Generation/Pipelines/ML20M/[deprecated] 2.EDA.ipynb

+1 -1
@@ -460,7 +460,7 @@
460 460
"name": "python",
461 461
"nbconvert_exporter": "python",
462 462
"pygments_lexer": "ipython3",
463-
"version": "3.7.1"
463+
"version": "3.7.3"
464 464
}
465 465
},
466 466
"nbformat": 4,

examples/0. Embeddings Generation/0. (optional) Statistics Approach (PCA/uMAP)/3. Preprocessing.ipynb → examples/0. Embeddings Generation/Pipelines/ML20M/[deprecated] 3. Preprocessing.ipynb

+1 -1
@@ -2331,7 +2331,7 @@
2331 2331
"name": "python",
2332 2332
"nbconvert_exporter": "python",
2333 2333
"pygments_lexer": "ipython3",
2334-
"version": "3.7.1"
2334+
"version": "3.7.3"
2335 2335
}
2336 2336
},
2337 2337
"nbformat": 4,

examples/0. Embeddings Generation/0. (optional) Statistics Approach (PCA/uMAP)/1.MovieParsing .ipynb → examples/0. Embeddings Generation/Pipelines/ML20M/[deprecated] 1.MovieParsing .ipynb

+1 -1
@@ -678,7 +678,7 @@
678 678
"name": "python",
679 679
"nbconvert_exporter": "python",
680 680
"pygments_lexer": "ipython3",
681-
"version": "3.7.1"
681+
"version": "3.7.3"
682 682
}
683 683
},
684 684
"nbformat": 4,

‎examples/1. Vanilla RL/1. Anomaly Detection.ipynb

100644 → 100755
File mode changed.

examples/1. Vanilla RL/2. DDPG.ipynb

100644 → 100755
File mode changed.

examples/1. Vanilla RL/3. TD3.ipynb

100644 → 100755
File mode changed.

examples/1. Vanilla RL/4. SAC.ipynb

100644 → 100755
File mode changed.

examples/1. Vanilla RL/5. LSTM State Encoder.ipynb

100644 → 100755
File mode changed.

examples/2. BCQ/1. BCQ PyTorch .ipynb

100644 → 100755
File mode changed.

examples/2. BCQ/2. BCQ Pyro.ipynb

100644 → 100755
File mode changed.

examples/_ Results/1. Ranking.ipynb

100644 → 100755
File mode changed.

examples/_ Results/2. Diversity Test (Indexes).ipynb

100644 → 100755
File mode changed.

examples/_ Results/3. Distances Test.ipynb

100644 → 100755
File mode changed.

examples/_ Results/4. BCQ Stochastic Diversity .ipynb

100644 → 100755
File mode changed.

readme.md

100644 → 100755
File mode changed.

recnn/__init__.py

100644 → 100755
File mode changed.

recnn/data.py

100644 → 100755
File mode changed.

recnn/debugger.py

100644 → 100755
File mode changed.

recnn/learning.py

100644 → 100755
File mode changed.

recnn/misc.py

100644 → 100755
File mode changed.

recnn/models.py

100644 → 100755
File mode changed.

recnn/optim.py

100644 → 100755
File mode changed.

recnn/plot.py

100644 → 100755
File mode changed.

res/Anomaly_Detection.png

100644 → 100755
File mode changed.

res/Article old.png

100644 → 100755
File mode changed.

res/Article.png

100644 → 100755
File mode changed.

res/Losses.png

100644 → 100755
File mode changed.

res/gen_dist.png

100644 → 100755
File mode changed.

res/logo.png

100644 → 100755
File mode changed.

res/real_dist.png

100644 → 100755
File mode changed.
