Commit dc9d06c

data loading: caching and ray/dask pandas backends
1 parent cab134f commit dc9d06c

26 files changed: +631 -405 lines changed

.gitignore

+3
Original file line numberDiff line numberDiff line change
@@ -115,3 +115,6 @@ venv.bak/
 
 # mypy
 .mypy_cache/
+
+# other
+.pyre_configuration

.pyre/pyre.stderr

+133
@@ -0,0 +1,133 @@
1+
2020-08-09 14:09:48,144 INFO Binary found at `/home/mikew/anaconda3/envs/recnn/bin/pyre.bin`
2+
2020-08-09 14:09:48,145 DEBUG Trying with: `/home/mikew/anaconda3/envs/recnn/lib/python3.8/site-packages/pyre_check/client/pyre_check/typeshed/`
3+
2020-08-09 14:09:48,145 DEBUG Trying with: `/home/mikew/anaconda3/envs/recnn/lib/python3.8/site-packages/pyre_check/pyre_check/typeshed/`
4+
2020-08-09 14:09:48,145 DEBUG Trying with: `/home/mikew/anaconda3/envs/recnn/lib/python3.8/site-packages/pyre_check/typeshed/`
5+
2020-08-09 14:09:48,145 DEBUG Trying with: `/home/mikew/anaconda3/envs/recnn/lib/python3.8/pyre_check/typeshed/`
6+
2020-08-09 14:09:48,145 DEBUG Trying with: `/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/`
7+
2020-08-09 14:09:48,145 DEBUG Trying with: `/home/mikew/anaconda3/envs/recnn/lib/python3.8/site-packages/pyre_check/client/pyre_check/taint/`
8+
2020-08-09 14:09:48,145 DEBUG Trying with: `/home/mikew/anaconda3/envs/recnn/lib/python3.8/site-packages/pyre_check/pyre_check/taint/`
9+
2020-08-09 14:09:48,145 DEBUG Trying with: `/home/mikew/anaconda3/envs/recnn/lib/python3.8/site-packages/pyre_check/taint/`
10+
2020-08-09 14:09:48,145 DEBUG Trying with: `/home/mikew/anaconda3/envs/recnn/lib/python3.8/pyre_check/taint/`
11+
2020-08-09 14:09:48,145 DEBUG Trying with: `/home/mikew/anaconda3/envs/recnn/lib/pyre_check/taint/`
12+
2020-08-09 14:09:48,145 PROMPT Which directory should pyre be initialized in? (Default: `.`):
13+
2020-08-09 14:09:52,066 INFO Successfully initialized pyre! You can view the configuration at `/home/mikew/Documents/programming/python/RecNN/.pyre_configuration`.
14+
2020-08-09 14:11:08,489 DEBUG Running `/home/mikew/anaconda3/envs/recnn/bin/pyre.bin check -logging-sections parser -project-root /home/mikew/Documents/programming/python/RecNN -log-directory /home/mikew/Documents/programming/python/RecNN/.pyre -filter-directories /home/mikew/Documents/programming/python/RecNN/. -workers 12 -ignore-all-errors /home/mikew/Documents/programming/python/RecNN/.pyre -search-path /home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3.9,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3.7,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3.6,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/2and3,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/third_party/3,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/third_party/2and3 /home/mikew/Documents/programming/python/RecNN/.`
15+
2020-08-09 14:11:08,492 DEBUG Registering process with pid 97893 in pid file `/home/mikew/Documents/programming/python/RecNN/.pyre/pid_files/check-97893.pid`
16+
2020-08-09 14:11:08,549 PERFORMANCE Module tracker built: 0.033682s
17+
2020-08-09 14:11:08,549 INFO Building type environment...
18+
2020-08-09 14:11:08,768 INFO Parsing 1015 stubs and sources...
19+
2020-08-09 14:11:09,654 PERFORMANCE Sources parsed: 0.886113s
20+
2020-08-09 14:11:10,163 PERFORMANCE Full environment built: 1.614207s
21+
2020-08-09 14:11:10,164 INFO Checking 165 functions...
22+
2020-08-09 14:11:10,278 INFO Processed 165 of 165 functions
23+
2020-08-09 14:11:10,278 PERFORMANCE Check_TypeCheck: 0.113675s
24+
2020-08-09 14:11:10,278 MEMORY Shared memory size post-typecheck (size: 10)
25+
2020-08-09 14:11:10,278 INFO Postprocessing 1015 sources...
26+
2020-08-09 14:11:10,302 INFO Postprocessed 43 of 1015 sources
27+
2020-08-09 14:11:10,307 INFO Postprocessed 86 of 1015 sources
28+
2020-08-09 14:11:10,317 INFO Postprocessed 129 of 1015 sources
29+
2020-08-09 14:11:10,318 INFO Postprocessed 172 of 1015 sources
30+
2020-08-09 14:11:10,318 INFO Postprocessed 215 of 1015 sources
31+
2020-08-09 14:11:10,319 INFO Postprocessed 258 of 1015 sources
32+
2020-08-09 14:11:10,320 INFO Postprocessed 301 of 1015 sources
33+
2020-08-09 14:11:10,320 INFO Postprocessed 344 of 1015 sources
34+
2020-08-09 14:11:10,321 INFO Postprocessed 387 of 1015 sources
35+
2020-08-09 14:11:10,322 INFO Postprocessed 430 of 1015 sources
36+
2020-08-09 14:11:10,324 INFO Postprocessed 473 of 1015 sources
37+
2020-08-09 14:11:10,324 INFO Postprocessed 516 of 1015 sources
38+
2020-08-09 14:11:10,332 INFO Postprocessed 559 of 1015 sources
39+
2020-08-09 14:11:10,336 INFO Postprocessed 602 of 1015 sources
40+
2020-08-09 14:11:10,345 INFO Postprocessed 645 of 1015 sources
41+
2020-08-09 14:11:10,346 INFO Postprocessed 688 of 1015 sources
42+
2020-08-09 14:11:10,347 INFO Postprocessed 731 of 1015 sources
43+
2020-08-09 14:11:10,347 INFO Postprocessed 774 of 1015 sources
44+
2020-08-09 14:11:10,347 INFO Postprocessed 817 of 1015 sources
45+
2020-08-09 14:11:10,348 INFO Postprocessed 860 of 1015 sources
46+
2020-08-09 14:11:10,348 INFO Postprocessed 903 of 1015 sources
47+
2020-08-09 14:11:10,352 INFO Postprocessed 929 of 1015 sources
48+
2020-08-09 14:11:10,352 INFO Postprocessed 972 of 1015 sources
49+
2020-08-09 14:11:10,355 INFO Postprocessed 1015 of 1015 sources
50+
2020-08-09 14:11:10,359 PERFORMANCE Check: 1.846867s
51+
2020-08-09 14:11:10,405 DEBUG Removing pid file: `/home/mikew/Documents/programming/python/RecNN/.pyre/pid_files/check-97893.pid`
52+
2020-08-09 14:11:10,410 ERROR Found 46 type errors!
53+
54+
2020-08-09 14:15:56,219 DEBUG Running `/home/mikew/anaconda3/envs/recnn/bin/pyre.bin check -logging-sections parser -project-root /home/mikew/Documents/programming/python/RecNN -log-directory /home/mikew/Documents/programming/python/RecNN/.pyre -filter-directories /home/mikew/Documents/programming/python/RecNN/. -workers 12 -ignore-all-errors /home/mikew/Documents/programming/python/RecNN/.pyre -search-path /home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3.9,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3.7,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3.6,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/2and3,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/third_party/3,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/third_party/2and3 /home/mikew/Documents/programming/python/RecNN/.`
55+
2020-08-09 14:15:56,222 DEBUG Registering process with pid 101427 in pid file `/home/mikew/Documents/programming/python/RecNN/.pyre/pid_files/check-101427.pid`
56+
2020-08-09 14:15:56,275 PERFORMANCE Module tracker built: 0.032691s
57+
2020-08-09 14:15:56,275 INFO Building type environment...
58+
2020-08-09 14:15:56,489 INFO Parsing 1015 stubs and sources...
59+
2020-08-09 14:15:57,345 PERFORMANCE Sources parsed: 0.855059s
60+
2020-08-09 14:15:57,848 PERFORMANCE Full environment built: 1.573319s
61+
2020-08-09 14:15:57,849 INFO Checking 165 functions...
62+
2020-08-09 14:15:57,949 INFO Processed 165 of 165 functions
63+
2020-08-09 14:15:57,949 PERFORMANCE Check_TypeCheck: 0.100245s
64+
2020-08-09 14:15:57,949 MEMORY Shared memory size post-typecheck (size: 10)
65+
2020-08-09 14:15:57,949 INFO Postprocessing 1015 sources...
66+
2020-08-09 14:15:57,979 INFO Postprocessed 43 of 1015 sources
67+
2020-08-09 14:15:57,980 INFO Postprocessed 86 of 1015 sources
68+
2020-08-09 14:15:57,980 INFO Postprocessed 129 of 1015 sources
69+
2020-08-09 14:15:57,981 INFO Postprocessed 172 of 1015 sources
70+
2020-08-09 14:15:57,982 INFO Postprocessed 215 of 1015 sources
71+
2020-08-09 14:15:57,998 INFO Postprocessed 258 of 1015 sources
72+
2020-08-09 14:15:57,998 INFO Postprocessed 301 of 1015 sources
73+
2020-08-09 14:15:58,001 INFO Postprocessed 344 of 1015 sources
74+
2020-08-09 14:15:58,003 INFO Postprocessed 387 of 1015 sources
75+
2020-08-09 14:15:58,003 INFO Postprocessed 430 of 1015 sources
76+
2020-08-09 14:15:58,004 INFO Postprocessed 473 of 1015 sources
77+
2020-08-09 14:15:58,005 INFO Postprocessed 516 of 1015 sources
78+
2020-08-09 14:15:58,014 INFO Postprocessed 559 of 1015 sources
79+
2020-08-09 14:15:58,023 INFO Postprocessed 602 of 1015 sources
80+
2020-08-09 14:15:58,025 INFO Postprocessed 645 of 1015 sources
81+
2020-08-09 14:15:58,027 INFO Postprocessed 688 of 1015 sources
82+
2020-08-09 14:15:58,027 INFO Postprocessed 731 of 1015 sources
83+
2020-08-09 14:15:58,029 INFO Postprocessed 757 of 1015 sources
84+
2020-08-09 14:15:58,031 INFO Postprocessed 800 of 1015 sources
85+
2020-08-09 14:15:58,039 INFO Postprocessed 843 of 1015 sources
86+
2020-08-09 14:15:58,039 INFO Postprocessed 886 of 1015 sources
87+
2020-08-09 14:15:58,042 INFO Postprocessed 929 of 1015 sources
88+
2020-08-09 14:15:58,043 INFO Postprocessed 972 of 1015 sources
89+
2020-08-09 14:15:58,043 INFO Postprocessed 1015 of 1015 sources
90+
2020-08-09 14:15:58,047 PERFORMANCE Check: 1.806000s
91+
2020-08-09 14:15:58,090 DEBUG Removing pid file: `/home/mikew/Documents/programming/python/RecNN/.pyre/pid_files/check-101427.pid`
92+
2020-08-09 14:15:58,097 ERROR Found 46 type errors!
93+
94+
2020-08-09 14:19:55,130 DEBUG Running `/home/mikew/anaconda3/envs/recnn/bin/pyre.bin check -logging-sections parser -project-root /home/mikew/Documents/programming/python/RecNN -log-directory /home/mikew/Documents/programming/python/RecNN/.pyre -filter-directories /home/mikew/Documents/programming/python/RecNN/ -workers 12 -ignore-all-errors /home/mikew/Documents/programming/python/RecNN/.pyre -search-path /home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3.9,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3.7,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3.6,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/3,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/stdlib/2and3,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/third_party/3,/home/mikew/anaconda3/envs/recnn/lib/pyre_check/typeshed/third_party/2and3 /home/mikew/Documents/programming/python/RecNN/`
95+
2020-08-09 14:19:55,133 DEBUG Registering process with pid 104816 in pid file `/home/mikew/Documents/programming/python/RecNN/.pyre/pid_files/check-104816.pid`
96+
2020-08-09 14:19:55,181 PERFORMANCE Module tracker built: 0.033456s
97+
2020-08-09 14:19:55,181 INFO Building type environment...
98+
2020-08-09 14:19:55,394 INFO Parsing 1015 stubs and sources...
99+
2020-08-09 14:19:56,261 PERFORMANCE Sources parsed: 0.867394s
100+
2020-08-09 14:19:56,765 PERFORMANCE Full environment built: 1.583210s
101+
2020-08-09 14:19:56,765 INFO Checking 165 functions...
102+
2020-08-09 14:19:56,875 INFO Processed 165 of 165 functions
103+
2020-08-09 14:19:56,875 PERFORMANCE Check_TypeCheck: 0.110074s
104+
2020-08-09 14:19:56,875 MEMORY Shared memory size post-typecheck (size: 10)
105+
2020-08-09 14:19:56,875 INFO Postprocessing 1015 sources...
106+
2020-08-09 14:19:56,901 INFO Postprocessed 43 of 1015 sources
107+
2020-08-09 14:19:56,903 INFO Postprocessed 86 of 1015 sources
108+
2020-08-09 14:19:56,904 INFO Postprocessed 129 of 1015 sources
109+
2020-08-09 14:19:56,904 INFO Postprocessed 172 of 1015 sources
110+
2020-08-09 14:19:56,905 INFO Postprocessed 215 of 1015 sources
111+
2020-08-09 14:19:56,906 INFO Postprocessed 258 of 1015 sources
112+
2020-08-09 14:19:56,917 INFO Postprocessed 301 of 1015 sources
113+
2020-08-09 14:19:56,919 INFO Postprocessed 344 of 1015 sources
114+
2020-08-09 14:19:56,920 INFO Postprocessed 387 of 1015 sources
115+
2020-08-09 14:19:56,922 INFO Postprocessed 430 of 1015 sources
116+
2020-08-09 14:19:56,926 INFO Postprocessed 473 of 1015 sources
117+
2020-08-09 14:19:56,928 INFO Postprocessed 516 of 1015 sources
118+
2020-08-09 14:19:56,929 INFO Postprocessed 559 of 1015 sources
119+
2020-08-09 14:19:56,930 INFO Postprocessed 602 of 1015 sources
120+
2020-08-09 14:19:56,933 INFO Postprocessed 645 of 1015 sources
121+
2020-08-09 14:19:56,933 INFO Postprocessed 688 of 1015 sources
122+
2020-08-09 14:19:56,933 INFO Postprocessed 731 of 1015 sources
123+
2020-08-09 14:19:56,939 INFO Postprocessed 774 of 1015 sources
124+
2020-08-09 14:19:56,942 INFO Postprocessed 817 of 1015 sources
125+
2020-08-09 14:19:56,950 INFO Postprocessed 860 of 1015 sources
126+
2020-08-09 14:19:56,951 INFO Postprocessed 903 of 1015 sources
127+
2020-08-09 14:19:56,952 INFO Postprocessed 929 of 1015 sources
128+
2020-08-09 14:19:56,952 INFO Postprocessed 972 of 1015 sources
129+
2020-08-09 14:19:56,963 INFO Postprocessed 1015 of 1015 sources
130+
2020-08-09 14:19:56,967 PERFORMANCE Check: 1.820369s
131+
2020-08-09 14:19:57,013 DEBUG Removing pid file: `/home/mikew/Documents/programming/python/RecNN/.pyre/pid_files/check-104816.pid`
132+
2020-08-09 14:19:57,021 ERROR Found 46 type errors!
133+

docs/source/examples/getting_started.rst

+8-1
@@ -32,7 +32,14 @@ In order to initialize an env, you need to provide embeddings and ratings direct
 frame_size = 10
 batch_size = 25
 # embeddings: https://drive.google.com/open?id=1EQ_zXBR3DKpmJR3jBgLvt-xoOvArGMsL
-env = recnn.data.env.FrameEnv('ml20_pca128.pkl','ml-20m/ratings.csv', frame_size, batch_size)
+dirs = recnn.data.env.DataPath(
+    base="../../../data/",
+    embeddings="embeddings/ml20_pca128.pkl",
+    ratings="ml-20m/ratings.csv",
+    cache="cache/frame_env.pkl", # the cache file is generated on the first run
+    use_cache=True
+)
+env = recnn.data.env.FrameEnv(dirs, frame_size, batch_size)
 
 train = env.train_batch()
 test = env.train_batch()
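
The `cache` and `use_cache` arguments of `DataPath` suggest a load-or-build pattern: reuse a pickled result when one exists, otherwise do the expensive work and save it. The sketch below illustrates that general pattern with the standard `pickle` module. It is an assumption about the idea, not RecNN's actual implementation, and `load_or_build` is a hypothetical helper name.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build, use_cache=True):
    """Return a cached object if available, otherwise build and cache it.

    `build` is any zero-argument callable that produces the object.
    This is a hypothetical helper, not part of RecNN.
    """
    path = Path(cache_path)

    # Fast path: reuse the pickled result from an earlier run.
    if use_cache and path.is_file():
        with path.open("rb") as f:
            return pickle.load(f)

    # Slow path: do the expensive work.
    result = build()

    # Save the result so the next run can skip the build step.
    if use_cache:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as f:
            pickle.dump(result, f)

    return result
```

With `use_cache=False` the helper always rebuilds and never touches the cache file, which matches how the docs disable caching when timing the backends.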

docs/source/examples/pandas_backend

-10
This file was deleted.
+68
@@ -0,0 +1,68 @@
+Using Pandas Backends
+==========================
+
+
+RecNN supports several pandas backends for faster data loading and processing, both in and out of core.
+
+
+Pandas is the default backend::
+    # but you can also set it explicitly:
+    recnn.pd.set("pandas")
+    frame_size = 10
+    batch_size = 25
+    dirs = recnn.data.env.DataPath(
+        base="../../../data/",
+        embeddings="embeddings/ml20_pca128.pkl",
+        ratings="ml-20m/ratings.csv",
+        cache="cache/frame_env.pkl", # the cache file is generated on the first run
+        use_cache=False # disabled here for timing purposes
+    )
+
+    %%time
+    env = recnn.data.env.FrameEnv(dirs, frame_size, batch_size)
+
+    # Output:
+    100%|██████████| 20000263/20000263 [00:13<00:00, 1469488.15it/s]
+    100%|██████████| 20000263/20000263 [00:15<00:00, 1265183.17it/s]
+    100%|██████████| 138493/138493 [00:06<00:00, 19935.53it/s]
+    CPU times: user 41.6 s, sys: 1.89 s, total: 43.5 s
+    Wall time: 43.5 s
+
+
+P.S. Install Modin from `here
+<https://github.com/modin-project/modin/>`_; it is not installed with RecNN's dependencies.
+
+You can also use Modin with Dask or Ray.
+
+Here is a short Ray example::
+    import os
+    import ray
+
+    if ray.is_initialized():
+        ray.shutdown()
+    os.environ["MODIN_ENGINE"] = "ray" # Modin will use Ray
+    ray.init(num_cpus=10) # adjust to your liking
+    recnn.pd.set("modin")
+    %%time
+    env = recnn.data.env.FrameEnv(dirs, frame_size, batch_size)
+
+    100%|██████████| 138493/138493 [00:07<00:00, 18503.97it/s]
+    CPU times: user 12 s, sys: 2.06 s, total: 14 s
+    Wall time: 21.4 s
+
+Using Dask::
+    ### dask
+    import os
+    os.environ["MODIN_ENGINE"] = "dask" # Modin will use Dask
+    recnn.pd.set("modin")
+    %%time
+    env = recnn.data.env.FrameEnv(dirs, frame_size, batch_size)
+
+    100%|██████████| 138493/138493 [00:06<00:00, 19785.99it/s]
+    CPU times: user 14.2 s, sys: 2.13 s, total: 16.3 s
+    Wall time: 22 s
+    <recnn.data.env.FrameEnv at 0x7f623fb30250>
+
+
+Free 2x improvement in loading speed
+====================================
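
The wall times quoted in the examples above (43.5 s for pandas, 21.4 s for Modin on Ray, 22 s for Modin on Dask) can be turned into speedup factors with a few lines of Python. This is only arithmetic on the reported numbers, not a new benchmark. The dictionary keys and the `speedups` helper are illustrative names, not part of RecNN.

```python
# Wall times in seconds, copied from the outputs shown in the docs.
wall_times = {
    "pandas": 43.5,
    "modin[ray]": 21.4,
    "modin[dask]": 22.0,
}


def speedups(times, baseline="pandas"):
    """Return each backend's speedup relative to the baseline backend."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


for name, factor in speedups(wall_times).items():
    print(f"{name}: {factor:.2f}x")
# pandas: 1.00x
# modin[ray]: 2.03x
# modin[dask]: 1.98x
```

Both Modin engines land at roughly a 2x speedup over plain pandas on this single run. Results will vary with the machine and the `num_cpus` setting.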
