
Loading model on low CPU memory #1411

Open
barschiiii opened this issue Aug 9, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@barschiiii

I am struggling to load a quantized model on a machine that lacks sufficient CPU memory to hold the weights.

Usually I would split the weights into multiple shards and then load them accordingly.

Is this, or something similar, also possible in CTranslate2?

@guillaumekln
Collaborator

No, the model can only be loaded fully in memory. Execution would be very slow if the model had to be reloaded from disk for every request.
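For illustration, a minimal GPU load in CTranslate2 looks like the sketch below (the model directory name is a placeholder); the entire model.bin is materialized in host memory before the weights are copied to the device:

```python
import ctranslate2

# The whole model.bin is first read into CPU memory, then the
# weights are copied to the GPU ("model_dir" is a placeholder path).
translator = ctranslate2.Translator("model_dir", device="cuda")
```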

@barschiiii
Author

barschiiii commented Aug 10, 2023

No, not on disk. Let me elaborate:

  • After running quantization, a model.bin file is generated, say 20 GB in size.
  • When loading this bin file, it first has to be read fully into CPU memory before being dispatched to the GPU.
  • If your CPU memory is now <20 GB, you will have issues loading it.
  • In HF Transformers, this is solved by sharding the bin file into, say, 5 GB chunks.
  • Loading then works automatically by iterating over the chunks: read 5 GB into CPU memory, dispatch to the GPU, release the CPU memory, and so on (see the sketch below).

I currently can't see any such functionality here, and I am wondering how one could do that or otherwise work around having CPU memory smaller than the model.bin size.
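To make the pattern concrete, here is a rough sketch of the HF Transformers workflow described above (the model name, paths, and shard size are illustrative, not from this issue):

```python
from transformers import AutoModelForCausalLM

# Re-save the checkpoint as shards of at most 5 GB each, so that no
# single file exceeds the available CPU memory ("my-model" and the
# output directory are placeholder names).
model = AutoModelForCausalLM.from_pretrained("my-model")
model.save_pretrained("my-model-sharded", max_shard_size="5GB")

# Loading a sharded checkpoint then iterates over the shards: each
# shard is read into CPU memory, its tensors are dispatched to the
# GPU, and the host copy is released before the next shard is read.
model = AutoModelForCausalLM.from_pretrained(
    "my-model-sharded",
    device_map="auto",       # requires the accelerate package
    low_cpu_mem_usage=True,  # implied by device_map; kept for clarity
)
```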

@guillaumekln
Collaborator

OK, your first post didn't mention that you eventually want to load the model on the GPU.

There is a PR that does what you are looking for, but it is still a work in progress: #1058

Currently the model is loaded fully into CPU memory. There is no way to control that at this time.

@barschiiii
Author

I understand, thanks, but that is unfortunate.

So I assume there is no workaround for that. Unfortunately, this makes it impossible for me to load the weights, even though the GPU could easily handle them.

@guillaumekln guillaumekln added the enhancement New feature or request label Aug 28, 2023