@georgeamccarthy georgeamccarthy commented Aug 10, 2021

PR type

Purpose

  • Allows the model and tokenizer to be stored locally; they will be downloaded if not found.

Why?

  • The indexer is unable to download the model from within the Flow on GCP (deployment).

Extra info

A new protein_search/models directory stores the models:

models/
└── prot_bert
    ├── model
    │   ├── config.json
    │   └── pytorch_model.bin
    └── tokenizer
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        └── vocab.txt

Models were downloaded from Hugging Face and then moved into these directories by hand.
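The "load locally, download if not found" logic could be sketched like this. This helper is hypothetical (it is not in the PR); it just picks the local directory when the expected files are present and otherwise falls back to the hub id, so either return value can be passed straight to `from_pretrained()`.

```python
from pathlib import Path

# Hypothetical helper (not from the PR): prefer the local copy of the model,
# falling back to the Hugging Face hub id so from_pretrained() downloads it.
def resolve_model_path(
    local_dir,
    hub_id="Rostlab/prot_bert",
    required=("config.json", "pytorch_model.bin"),
):
    local = Path(local_dir)
    # Only use the local directory if every required file is actually there.
    if all((local / name).is_file() for name in required):
        return str(local)
    return hub_id
```

For example, `resolve_model_path("protein_search/models/prot_bert/model")` would return the local path inside the repo layout above when the weights are present, and `"Rostlab/prot_bert"` otherwise.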

Feedback required over

  • A quick pair of 👀 on the code
  • Discussion on the technical approach

@georgeamccarthy

Not sure if I'm going to merge this, but I need it on GCP without the Dockerization merged. Could probably use a simpler model file structure: https://huggingface.co/Rostlab/prot_bert/tree/main

@georgeamccarthy

There may be a simpler way to get around the issue. If I try to download the model with a simple script:

from transformers import BertModel, BertTokenizer

model_path = "Rostlab/prot_bert"

print("Loading tokenizer.")
tokenizer = BertTokenizer.from_pretrained(model_path, do_lower_case=False)
print("Loading model.")
model = BertModel.from_pretrained(model_path)

print("Done.")

then the system runs out of RAM at around 1 GB and the process is terminated with Killed (the OOM killer).

To monitor RAM usage: ps -m -o %cpu,%mem,command

Instead of downloading the repo I might just be able to configure the download to use a disk cache.
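A sketch of that disk-cache idea. `TRANSFORMERS_CACHE` is a real environment variable honoured by the transformers library, and `from_pretrained()` also accepts a `cache_dir=` keyword for the same purpose; the path below is an assumption for illustration, not anything from this PR.

```python
import os

# Hypothetical persistent-disk path on the GCP instance (an assumption).
CACHE_DIR = "/mnt/models/hf_cache"

# Point the Hugging Face download cache at the persistent disk so weights
# are fetched once and reused across restarts instead of re-downloaded.
os.environ["TRANSFORMERS_CACHE"] = CACHE_DIR

# Equivalently, the cache can be set per call:
#   BertModel.from_pretrained("Rostlab/prot_bert", cache_dir=CACHE_DIR)
```

This avoids shipping the weights inside the repo entirely, which would make the models/ directory above unnecessary.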
