Skip to content

Conversation

@georgeamccarthy
Copy link
Owner

@georgeamccarthy georgeamccarthy commented Aug 10, 2021

PR type

  • 🏆 Enhancements
  • 🏠 Internal

Purpose

  • Shuffles proteins on first index.
  • Uses memmap to store proteins instead of keeping them in memory due to large size (~50 MB+ for ~100k proteins).

Why?

  • Currently embeddings are computed sequentially with increasing PDB ID. This makes embedding a fraction of the dataset index favour lower PDB ID which makes evaluating metric similarity less meaningful.
  • Memmap is preferred for large doc arrays.
  • Helps avoid memory issues on small deployment server.

Feedback required over

  • A quick pair of 👀 on the code

Mentions

Future work

  • Currently using pandas to shuffle the data. One could use the jina built in .shuffle (see cookbook). However I couldn't get this working properly.

References

  • Previous meeting with Jina AI devs.

Legal

@georgeamccarthy
Copy link
Owner Author

Added a feature to log number of culled proteins.

@fissoreg
Copy link
Collaborator

Future work

  • Currently using pandas to shuffle the data. One could use the jina built in .shuffle (see cookbook). However I couldn't get this working properly.

Apparently the shuffle method is a recent addition: jina-ai/serve@2302e45

It will work if you upgrade:

pip install --upgrade jina

@georgeamccarthy
Copy link
Owner Author

Great find! TODO :)

@fissoreg fissoreg requested review from fissoreg and removed request for fissoreg August 14, 2021 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants