
Loading model on low CPU memory #1411

Open
barschiiii opened this issue Aug 9, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@barschiiii

I am struggling to load a quantized model on a machine that lacks sufficient CPU memory to hold the weights.

Usually I would split the weights into multiple shards and then load them accordingly.

Is this, or something similar, also possible in CTranslate2?

@guillaumekln
Collaborator

No, the model can only be loaded fully in memory. Execution would be very slow if the model had to be reloaded from disk for every request.
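For illustration, a minimal GPU load in CTranslate2 looks like the sketch below (the model directory name is a placeholder); the entire model.bin is materialized in host memory before the weights are copied to the device:

```python
import ctranslate2

# The whole model.bin is first read into CPU memory, then the
# weights are copied to the GPU ("model_dir" is a placeholder path).
translator = ctranslate2.Translator("model_dir", device="cuda")
```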

@barschiiii
Author

barschiiii commented Aug 10, 2023

No, not on disk. Let me elaborate:

  • After running quantization, a model.bin file is generated, say 20 GB in size.
  • When loading this bin file, it first has to be read fully into CPU memory before being dispatched to the GPU.
  • If your CPU memory is now <20 GB, you will have issues loading it.
  • In HF Transformers, this is solved by sharding the bin file into, say, 5 GB chunks.
  • Loading then works automatically by iterating over the chunks: read 5 GB into CPU memory, dispatch to the GPU, release the CPU memory, and so on (see the sketch below).

I currently can't see any such functionality here, and I am wondering how one could do that or otherwise work around having CPU memory smaller than the model.bin size.
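To make the pattern concrete, here is a rough sketch of the HF Transformers workflow described above (the model name, paths, and shard size are illustrative, not from this issue):

```python
from transformers import AutoModelForCausalLM

# Re-save the checkpoint as shards of at most 5 GB each, so that no
# single file exceeds the available CPU memory ("my-model" and the
# output directory are placeholder names).
model = AutoModelForCausalLM.from_pretrained("my-model")
model.save_pretrained("my-model-sharded", max_shard_size="5GB")

# Loading a sharded checkpoint then iterates over the shards: each
# shard is read into CPU memory, its tensors are dispatched to the
# GPU, and the host copy is released before the next shard is read.
model = AutoModelForCausalLM.from_pretrained(
    "my-model-sharded",
    device_map="auto",       # requires the accelerate package
    low_cpu_mem_usage=True,  # implied by device_map; kept for clarity
)
```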

@guillaumekln
Collaborator

OK, your first post didn't mention that you eventually want to load the model on the GPU.

There is a PR that does what you are looking for, but it is still a work in progress: #1058

Currently the model is loaded fully into CPU memory. There is no way to control that at this time.

@barschiiii
Author

I understand, thanks, but that is unfortunate.

So I assume there is no workaround for that. Unfortunately, this makes it impossible for me to load the weights, even though the GPU could easily handle them.

@guillaumekln guillaumekln added the enhancement New feature or request label Aug 28, 2023