Shuffle protein data on first load and then store in memmap (on disk) instead of in memory. #54

georgeamccarthy · 2021-08-10T11:59:22Z

PR type

🏆 Enhancements
🏠 Internal

Purpose

Shuffles proteins on first index.
Uses memmap to store proteins on disk instead of keeping them in memory, because of their size (~50 MB+ for ~100k proteins).
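
As a rough illustration of the memmap idea (not the code in this PR), the sketch below uses only Python's standard `mmap` module: records are written to a file once and then read back by offset, so the operating system pages in only the bytes that are touched. The file name, the fixed record width `RECORD_SIZE`, and the helpers `write_records` / `read_record` are assumptions made for this example; the PR presumably relies on Jina's own memmap-backed document storage rather than anything like this.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 64  # hypothetical fixed width, in bytes, for each stored sequence


def write_records(path, sequences):
    """Write each sequence as a fixed-width, null-padded ASCII record."""
    with open(path, "wb") as f:
        for seq in sequences:
            data = seq.encode("ascii")
            if len(data) > RECORD_SIZE:
                raise ValueError("sequence longer than RECORD_SIZE")
            f.write(data.ljust(RECORD_SIZE, b"\0"))


def read_record(path, index):
    """Read one record by position without loading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = index * RECORD_SIZE
            return mm[start:start + RECORD_SIZE].rstrip(b"\0").decode("ascii")


# Toy sequences, invented for the example.
path = os.path.join(tempfile.mkdtemp(), "proteins.bin")
write_records(path, ["MKTAYIAKQR", "GSHMLEDPVA", "MVLSPADKTN"])
second = read_record(path, 1)
```

Fixed-width records keep the offset arithmetic trivial; real protein sequences vary widely in length, so a practical store would need an index of offsets instead.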

Why?

Currently embeddings are computed sequentially in order of increasing PDB ID. As a result, indexing only a fraction of the dataset favours lower PDB IDs, which makes evaluating the similarity metric less meaningful.
Memmap is preferred for large doc arrays.
Helps avoid memory issues on a small deployment server.
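
The ordering problem can be seen with a small sketch. It uses Python's standard `random` module instead of pandas, and the integer IDs and the helper `first_fraction` are hypothetical stand-ins for the real PDB IDs and indexing code. In pandas, a comparable shuffle is commonly written as `df.sample(frac=1, random_state=seed)`.

```python
import random

# Hypothetical stand-ins for PDB IDs, already sorted in increasing order.
ids = list(range(1000))


def first_fraction(items, fraction):
    """Return the leading fraction of a list, as a partial index would."""
    return items[: int(len(items) * fraction)]


# Without shuffling, a 10% index contains only the lowest IDs.
unshuffled = first_fraction(ids, 0.1)

# Shuffle a copy once with a fixed seed so the order is reproducible.
shuffled_ids = ids[:]
random.Random(0).shuffle(shuffled_ids)
shuffled = first_fraction(shuffled_ids, 0.1)

print(max(unshuffled))               # 99: nothing above the first 10% of IDs
print(min(shuffled), max(shuffled))  # the spread depends on the seed
```

Shuffling once and persisting the result, rather than reshuffling on every load, keeps a partial index stable between runs.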

Feedback required over

A quick pair of 👀 on the code

Mentions

@fissoreg

Future work

Currently using pandas to shuffle the data. One could use the Jina built-in .shuffle (see cookbook). However, I couldn't get this working properly.

References

Previous meeting with Jina AI devs.

Legal

I have read and agreed to the terms of contributing.

georgeamccarthy · 2021-08-10T12:20:41Z

Added a feature to log number of culled proteins.
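
The thread does not say what makes a protein get culled, so the following is only a sketch under an assumed criterion: rows with an empty sequence, or a sequence longer than a hypothetical `max_length`, are dropped and the count is logged. The logger name, the row layout, and the `cull` helper are all invented for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("protein_index")  # hypothetical logger name


def cull(rows, max_length=10):
    """Drop rows with a missing or over-long sequence and log how many went."""
    kept = [r for r in rows if r["sequence"] and len(r["sequence"]) <= max_length]
    n_culled = len(rows) - len(kept)
    logger.info("Culled %d of %d proteins", n_culled, len(rows))
    return kept, n_culled


# Toy rows, invented for the example.
rows = [
    {"id": "1ABC", "sequence": "MKTAYIAKQR"},
    {"id": "2DEF", "sequence": ""},
    {"id": "3GHI", "sequence": "GSHMLEDPVAGSHMLEDPVA"},
]
kept, n_culled = cull(rows)
```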

fissoreg · 2021-08-10T17:03:00Z

> Future work
>
> Currently using pandas to shuffle the data. One could use the jina built in .shuffle (see cookbook). However I couldn't get this working properly.

Apparently the shuffle method is a recent addition: jina-ai/serve@2302e45

It will work if you upgrade:

pip install --upgrade jina

georgeamccarthy · 2021-08-10T17:38:40Z

Great find! TODO :)

feat 1: shuffle protein data on first index. feat 2: store prots in memmap.

8784a94

georgeamccarthy added performance feature labels Aug 10, 2021

georgeamccarthy requested a review from fissoreg August 10, 2021 11:59

georgeamccarthy self-assigned this Aug 10, 2021

feat: log number of culled rows.

af5c9b6

georgeamccarthy added the backend label Aug 10, 2021

fissoreg requested review from fissoreg and removed request for fissoreg August 14, 2021 09:33
