
[Bug] Launching Llama-3.2-11B-Vision-Instruct just hangs on generation #2619

Open

SuperMasterBlasterLaser opened this issue Dec 27, 2024 · 3 comments

@SuperMasterBlasterLaser

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I rented an RTX 6000 Ada GPU with 48 GB of VRAM via vast.ai.

Specs:

  1. Ubuntu 22.04
  2. PyTorch 2.4.1
  3. CUDA 12.4

Then I installed flashinfer with this command:

pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/

Then I installed sglang with this command:

pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/

Then I downloaded Llama-3.2-11B-Vision-Instruct and launched it like this:

python -m sglang.launch_server --model-path /root/Llama-3.2-11B-Vision-Instruct --port 8080 --host 0.0.0.0
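
Before running the client, the server can be smoke-tested directly; a quick check against the endpoints that show up in the logs below (/get_model_info and the native /generate):

curl http://localhost:8080/get_model_info
curl http://localhost:8080/generate -H "Content-Type: application/json" -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 8}}'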

Then I used this simple script to run inference on an image:

import sglang as sgl


base_url = "url.to.my.server"  # address of the server launched above


@sgl.function
def caption_image(s, image_file):
    s += sgl.user(sgl.image(image_file) + "What is the overall style of this image?")
    # Constrained generation: the model must pick one of the listed styles.
    s += sgl.assistant(sgl.gen("global_style", choices=["cinematic", "animated", "anime", "3d", "cartoon"]))
    s += sgl.user("Overall description of this image:")
    # Unconstrained generation, capped at 255 tokens.
    s += sgl.assistant(sgl.gen("description", max_tokens=255))


sgl.set_default_backend(sgl.RuntimeEndpoint(base_url))

image_path = "./example.png"

state = caption_image.run(image_file=image_path)

print(state["global_style"])
print(state["description"])
print(state.text())

However, when I run this code on a simple image, it just hangs: I receive no response and no error message.

Logs:

[2024-12-27 17:04:20 TP0] Overlap scheduler is disabled for multimodal models.
[2024-12-27 17:04:20 TP0] Automatically turn off --chunked-prefill-size for mllama.
[2024-12-27 17:04:20 TP0] Init torch distributed begin.
[2024-12-27 17:04:21 TP0] Load weight begin. avail mem=46.99 GB
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:00,  4.62it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:01,  1.63it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.22it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.15it/s]

[2024-12-27 17:04:26 TP0] Load weight end. type=MllamaForConditionalGeneration, dtype=torch.bfloat16, avail mem=26.84 GB
[2024-12-27 17:04:26 TP0] Memory pool end. avail mem=6.62 GB
[2024-12-27 17:04:26 TP0] Capture cuda graph begin. This can take up to several minutes.
[00:11<00:00,  2.00it/s]
[2024-12-27 17:04:38 TP0] Capture cuda graph end. Time elapsed: 11.53 s
[2024-12-27 17:04:38 TP0] max_total_num_tokens=125417, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2024-12-27 17:04:39] INFO:     Started server process [1126]
[2024-12-27 17:04:39] INFO:     Waiting for application startup.
[2024-12-27 17:04:39] INFO:     Application startup complete.
[2024-12-27 17:04:39] INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
[2024-12-27 17:04:40] INFO:     127.0.0.1:34840 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-27 17:04:40 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-27 17:04:40] INFO:     127.0.0.1:34854 - "POST /generate HTTP/1.1" 200 OK
[2024-12-27 17:04:40] The server is fired up and ready to roll!
[2024-12-27 17:04:47] INFO:     91.198.101.42:57416 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-27 17:05:18 TP0] Prefill batch. #new-seq: 1, #new-token: 6425, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:18] INFO:     91.198.101.42:14407 - "POST /generate HTTP/1.1" 200 OK
[2024-12-27 17:05:20 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 49.95%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:21 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6423, cache hit rate: 66.61%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:23 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6423, cache hit rate: 74.94%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:24 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 79.94%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:24 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 83.27%, token usage: 0.05, #running-req: 0, #queue-req: 0

I don't understand why this is happening.

Reproduction

See the steps in the description above.

Environment

Specs:

  1. Ubuntu 22.04
  2. PyTorch 2.4.1
  3. CUDA 12.4
  4. RTX 6000 Ada (48 GB)
@SuperMasterBlasterLaser
Author

I found out that the hang occurs when I use select or gen with choices; a plain gen without any constraints returns generated results normally.
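
Until constrained generation is fixed, the style classification can be approximated with an unconstrained gen; a rough sketch (the prompt wording, endpoint URL, and normalization step are my own assumptions, not part of the original report):

import sglang as sgl

STYLES = ["cinematic", "animated", "anime", "3d", "cartoon"]

@sgl.function
def caption_image_workaround(s, image_file):
    s += sgl.user(
        sgl.image(image_file)
        + "What is the overall style of this image? "
        + "Answer with exactly one word from: " + ", ".join(STYLES) + "."
    )
    # Plain gen does not hang; constrain via the prompt instead
    # and normalize the free-form answer afterwards.
    s += sgl.assistant(sgl.gen("global_style_raw", max_tokens=8))

# Hypothetical local endpoint; replace with the real server address.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8080"))

state = caption_image_workaround.run(image_file="./example.png")
raw = state["global_style_raw"].strip().lower()
# Fall back to the first option if the model answers off-list.
global_style = next((c for c in STYLES if c in raw), STYLES[0])
print(global_style)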

@bluenevus

I had to roll back to v0.4.0 for the 11B vision model to work again. It errors out on v0.4.1 for me.
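
For reference, pinning the older version looks like this (assuming the same wheel index as in the original install):

pip install "sglang[all]==0.4.0" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/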

@SuperMasterBlasterLaser
Author

@bluenevus do gen with choices and the select method work on v0.4.0?
