Loading model on low CPU memory #1411
I am struggling to load a quantized model because I lack sufficient CPU memory to hold the weights.
Usually I would split the weights up into multiple shards and then load them one at a time (the pattern sketched below).
Is this, or something similar, also possible in CTranslate2?
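For reference, this is roughly the sharded-loading pattern the question describes, shown with PyTorch since CTranslate2 offers no such API (see the replies below). The toy model, shard file names, and the split itself are all hypothetical; this is a minimal sketch of the pattern, not CTranslate2 functionality.

```python
import torch
import torch.nn as nn

# Toy stand-in for a large model; only the loading pattern matters here.
def build_model() -> nn.Module:
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# One-time step (normally done on a machine with enough memory):
# split the state dict into shards and save each one separately.
full_state = build_model().state_dict()
keys = list(full_state)
torch.save({k: full_state[k] for k in keys[: len(keys) // 2]}, "shard-0.pt")
torch.save({k: full_state[k] for k in keys[len(keys) // 2 :]}, "shard-1.pt")
del full_state

# Loading step: only one shard is held in host memory at a time, and
# map_location moves its tensors straight to the GPU.
model = build_model().cuda()
for path in ("shard-0.pt", "shard-1.pt"):
    shard = torch.load(path, map_location="cuda")
    model.load_state_dict(shard, strict=False)  # strict=False: each shard is partial
    del shard
```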
Comments

No, the model can only be loaded fully in memory. Execution would be very slow if the model had to be reloaded from disk for every request.
No, not on disk. To elaborate: I currently can't see any such functionality here, and I am wondering how one could do that, or how to work around having less CPU memory than the model requires.
Ok, your first post didn't mention that you eventually want to load the model on the GPU. There is a PR that does what you are looking for, but it is still a work in progress: #1058. Currently the model is loaded fully into CPU memory; there is no way to control that at this time.
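For comparison, here is how a converted model is loaded with CTranslate2's Python API today. Per the reply above, the weights currently pass fully through CPU memory even when the target device is the GPU; the model directory and input tokens below are placeholders.

```python
import ctranslate2

# "ct2_model" is a placeholder path to a converted CTranslate2 model directory.
# Even with device="cuda", the weights are first loaded fully into host RAM
# and then transferred, so CPU memory must fit the whole model once.
translator = ctranslate2.Translator("ct2_model", device="cuda")

# Placeholder tokens; real input must match the model's tokenizer.
results = translator.translate_batch([["▁Hello", "▁world"]])
print(results[0].hypotheses[0])
```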
I understand, thanks, but that is unfortunate. So I assume there is no workaround. That makes it impossible for me to load the weights, even though the GPU could easily handle them.