
[Bug] Launching Llama-3.2-11B-Vision-Instruct just hangs on generation #2619

Open

SuperMasterBlasterLaser opened this issue Dec 27, 2024 · 3 comments

@SuperMasterBlasterLaser

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I rented an RTX 6000 Ada GPU with 48 GB of VRAM via vast.ai.

Specs:

  1. Ubuntu 22.04
  2. PyTorch 2.4.1
  3. CUDA 12.4

Then I installed flashinfer with this command:

pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/

Then I installed sglang with this command:

pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/

Then I downloaded Llama-3.2-11B-Vision-Instruct and launched it like this:

python -m sglang.launch_server --model-path /root/Llama-3.2-11B-Vision-Instruct --port 8080 --host 0.0.0.0
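
Before running the client, the server can be smoke-tested directly; a quick check against the endpoints that show up in the logs below (/get_model_info and the native /generate):

curl http://localhost:8080/get_model_info
curl http://localhost:8080/generate -H "Content-Type: application/json" -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 8}}'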

Then I used this simple script to run inference on an image:

import sglang as sgl


base_url = "url.to.my.server"  # address of the server launched above


@sgl.function
def caption_image(s, image_file):
    s += sgl.user(sgl.image(image_file) + "What is the overall style of this image?")
    # Constrained generation: the model must pick one of the listed styles.
    s += sgl.assistant(sgl.gen("global_style", choices=["cinematic", "animated", "anime", "3d", "cartoon"]))
    s += sgl.user("Overall description of this image:")
    # Unconstrained generation, capped at 255 tokens.
    s += sgl.assistant(sgl.gen("description", max_tokens=255))


sgl.set_default_backend(sgl.RuntimeEndpoint(base_url))

image_path = "./example.png"

state = caption_image.run(image_file=image_path)

print(state["global_style"])
print(state["description"])
print(state.text())

However, when I run this code on a simple image, it just hangs: I receive no response and no error message.

Logs:

[2024-12-27 17:04:20 TP0] Overlap scheduler is disabled for multimodal models.
[2024-12-27 17:04:20 TP0] Automatically turn off --chunked-prefill-size for mllama.
[2024-12-27 17:04:20 TP0] Init torch distributed begin.
[2024-12-27 17:04:21 TP0] Load weight begin. avail mem=46.99 GB
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:00,  4.62it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:01,  1.63it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.22it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.15it/s]

[2024-12-27 17:04:26 TP0] Load weight end. type=MllamaForConditionalGeneration, dtype=torch.bfloat16, avail mem=26.84 GB
[2024-12-27 17:04:26 TP0] Memory pool end. avail mem=6.62 GB
[2024-12-27 17:04:26 TP0] Capture cuda graph begin. This can take up to several minutes.
[00:11<00:00,  2.00it/s]
[2024-12-27 17:04:38 TP0] Capture cuda graph end. Time elapsed: 11.53 s
[2024-12-27 17:04:38 TP0] max_total_num_tokens=125417, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2024-12-27 17:04:39] INFO:     Started server process [1126]
[2024-12-27 17:04:39] INFO:     Waiting for application startup.
[2024-12-27 17:04:39] INFO:     Application startup complete.
[2024-12-27 17:04:39] INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
[2024-12-27 17:04:40] INFO:     127.0.0.1:34840 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-27 17:04:40 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-27 17:04:40] INFO:     127.0.0.1:34854 - "POST /generate HTTP/1.1" 200 OK
[2024-12-27 17:04:40] The server is fired up and ready to roll!
[2024-12-27 17:04:47] INFO:     91.198.101.42:57416 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-27 17:05:18 TP0] Prefill batch. #new-seq: 1, #new-token: 6425, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:18] INFO:     91.198.101.42:14407 - "POST /generate HTTP/1.1" 200 OK
[2024-12-27 17:05:20 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 49.95%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:21 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6423, cache hit rate: 66.61%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:23 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6423, cache hit rate: 74.94%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:24 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 79.94%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:24 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 83.27%, token usage: 0.05, #running-req: 0, #queue-req: 0

I don't understand why this is happening.

Reproduction

See the steps in the description above.

Environment

Specs:

  1. Ubuntu 22.04
  2. PyTorch 2.4.1
  3. CUDA 12.4
  4. RTX 6000 Ada (48 GB)
@SuperMasterBlasterLaser
Author

I found out that the hang occurs when I use select or gen with choices; a plain gen without any constraints returns generated results normally.
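
Until constrained generation is fixed, the style classification can be approximated with an unconstrained gen; a rough sketch (the prompt wording, endpoint URL, and normalization step are my own assumptions, not part of the original report):

import sglang as sgl

STYLES = ["cinematic", "animated", "anime", "3d", "cartoon"]

@sgl.function
def caption_image_workaround(s, image_file):
    s += sgl.user(
        sgl.image(image_file)
        + "What is the overall style of this image? "
        + "Answer with exactly one word from: " + ", ".join(STYLES) + "."
    )
    # Plain gen does not hang; constrain via the prompt instead
    # and normalize the free-form answer afterwards.
    s += sgl.assistant(sgl.gen("global_style_raw", max_tokens=8))

# Hypothetical local endpoint; replace with the real server address.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8080"))

state = caption_image_workaround.run(image_file="./example.png")
raw = state["global_style_raw"].strip().lower()
# Fall back to the first option if the model answers off-list.
global_style = next((c for c in STYLES if c in raw), STYLES[0])
print(global_style)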

@bluenevus

I had to roll back to v0.4.0 for the 11B vision model to work again. It errors out on v0.4.1 for me.
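
For reference, pinning the older version looks like this (assuming the same wheel index as in the original install):

pip install "sglang[all]==0.4.0" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/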

@SuperMasterBlasterLaser
Author

@bluenevus do gen with choices and the select method work on v0.4.0?
