GPU-parallel inference #12188

vadimcn · 2025-01-19T00:45:30Z

I have a 16-GPU machine available, and a model that fits on a single GPU. What is the best way to optimize vLLM performance in this situation?

I've tried tensor parallelism; however, it seems to require that both the number of attention heads and the vocabulary size be divisible by the number of GPUs. In my case the GCD of these two values is 4, so tensor parallelism can span at most 4 of the 16 GPUs 😞
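To make the constraint concrete, here is a small sketch of the arithmetic. It assumes the rule is exactly as described above (both values must be divisible by the tensor-parallel size). The head count and vocabulary size are hypothetical placeholders chosen so that their GCD is 4; they are not my model's real values, and `valid_tp_sizes` is just an illustrative helper, not part of vLLM.

```python
import math


def valid_tp_sizes(num_heads: int, vocab_size: int, num_gpus: int) -> list[int]:
    """Return every tensor-parallel size up to num_gpus that divides both
    the attention-head count and the vocabulary size."""
    return [
        tp
        for tp in range(1, num_gpus + 1)
        if num_heads % tp == 0 and vocab_size % tp == 0
    ]


# Hypothetical model dimensions (assumed for illustration, GCD is 4).
NUM_HEADS = 20
VOCAB_SIZE = 32004
NUM_GPUS = 16

sizes = valid_tp_sizes(NUM_HEADS, VOCAB_SIZE, NUM_GPUS)
print("GCD:", math.gcd(NUM_HEADS, VOCAB_SIZE))
print("Usable tensor-parallel sizes:", sizes)

# How many independent GPU groups of each size fit on the machine.
for tp in sizes:
    print(f"tp={tp}: {NUM_GPUS // tp} group(s) of {tp} GPU(s)")
```

Every usable size is a divisor of the GCD, so with a GCD of 4 no single tensor-parallel group can cover all 16 GPUs.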

Replies: 0 comments

Select a reply

GPU-parallel inference #12188

vadimcn Jan 19, 2025

Replies: 0 comments

vadimcn
Jan 19, 2025