Replies: 1 comment 1 reply
IIUC, your confusion mainly stems from CUDA graphs. I suggest you first learn about CUDA graphs; alternatively, you can try changing the code below and debugging it.
Hi there!
I'm new to vLLM and have been studying it for a while, especially the LoRA part.
There is an example at vllm/example/multilora_inference.py, so I followed it, but I eventually got confused.
Let me first explain what I have figured out.
My understanding of the vLLM code
In the main function of multilora_inference.py, the call engine = initialize_engine() flows into vllm/worker/model_runner.py and reaches line 1656 (Code Block 1).
Code Block 1 vllm/worker/model_runner.py
class ModelRunner(...):
This leads me to the forward function (line 544) in vllm/model_executor/models/llama.py (Code Block 2).
Code Block 2 vllm/model_executor/models/llama.py
class LlamaForCausalLM(...):
If we dive into self.model(~~~), we find LlamaDecoderLayer.forward(), which calls LlamaAttention.forward() for Llama's attention computation and eventually invokes qkv_proj via MergedQKVParallelLinearWithLora.apply() in vllm/lora/layers.py.
Here we can see the LoRA calculation in that function, done through self.punica_wrapper.add_lora_packed_nslice(); inside this call are the shrink and expand operations that perform the LoRA arithmetic (a conceptual sketch follows Code Block 3).
Code Block 3 vllm/lora/punica.py
class PunicaWrapper:
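For intuition, here is a rough reference version of what a packed-nslice LoRA update does conceptually. The function name, shapes, and slicing below are my own illustration, not the actual Punica kernels (the real implementation runs this as fused batched CUDA kernels): the shrink step projects the input down to the LoRA rank with A, the expand step projects back up with B, and the result is added into the corresponding q/k/v slice of the merged projection output.

```python
import torch

def add_lora_packed_nslice_ref(
    y: torch.Tensor,                 # merged qkv output, shape (tokens, q_dim + k_dim + v_dim)
    x: torch.Tensor,                 # layer input, shape (tokens, hidden)
    lora_a: list[torch.Tensor],      # per-slice A matrices, each (hidden, rank)
    lora_b: list[torch.Tensor],      # per-slice B matrices, each (rank, slice_dim)
    slice_sizes: tuple[int, ...],    # (q_dim, k_dim, v_dim)
    scale: float = 1.0,
) -> torch.Tensor:
    """Reference semantics only: y[:, slice] += scale * (x @ A) @ B, per slice."""
    offset = 0
    for A, B, size in zip(lora_a, lora_b, slice_sizes):
        shrink = x @ A                        # "shrink": down-project to the LoRA rank
        expand = shrink @ B                   # "expand": up-project to the slice width
        y[:, offset:offset + size] += scale * expand
        offset += size
    return y

# Toy usage: hidden=16, rank=4, q/k/v slice widths 16/8/8.
tokens, hidden, rank = 2, 16, 4
sizes = (16, 8, 8)
x = torch.randn(tokens, hidden)
y = torch.randn(tokens, sum(sizes))           # stands in for the base qkv_proj output
A = [torch.randn(hidden, rank) for _ in sizes]
B = [torch.randn(rank, s) for s in sizes]
add_lora_packed_nslice_ref(y, x, A, B, sizes)
```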
This code flow holds for the entire engine = initialize_engine() part of the main function in example/multilora_inference.py, for both the prefill and the decode stage. I guess this happens during engine initialization so that the engine can check how much memory the system has, or allocate GPU memory blocks for paged attention (see the sketch below).
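On that guess: as I understand it (stated loosely, this is not vLLM's exact accounting), the initialization-time forward pass is a profiling run; the engine measures peak memory with dummy inputs and then carves whatever fits under gpu_memory_utilization into fixed-size KV-cache blocks for paged attention. A back-of-the-envelope sketch of that arithmetic, with made-up numbers and a hypothetical helper:

```python
def estimate_num_gpu_blocks(
    total_gpu_mem: int,          # bytes reported by the device
    peak_profiled_mem: int,      # weights + activations measured during the dummy run
    gpu_memory_utilization: float,
    block_size: int,             # tokens per paged-attention block (e.g. 16)
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,        # fp16/bf16
) -> int:
    # Bytes needed to store K and V for one block across all layers.
    bytes_per_block = 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes
    available = int(total_gpu_mem * gpu_memory_utilization) - peak_profiled_mem
    return max(available // bytes_per_block, 0)

# Toy numbers for a 7B-ish model on a 24 GiB GPU (illustrative only).
print(estimate_num_gpu_blocks(
    total_gpu_mem=24 * 1024**3,
    peak_profiled_mem=15 * 1024**3,
    gpu_memory_utilization=0.9,
    block_size=16,
    num_layers=32,
    num_kv_heads=32,
    head_dim=128,
))
```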
Question
My question starts here.
After initializing the engine, the multilora_inference.py example starts to serve requests. It creates test prompts (requests) and processes them with engine.step(). Up to the first step (i.e., the prefill stage), the code behaves exactly as I explained above. But from the second step onward (the start of the decode stage), Code Block 1 no longer leads me to Code Block 2; instead it goes to Code Block 4.
Code Block 4 vllm/worker/model_runner.py
class CudaGraphRunner(...):
From my investigation, Code Block 4 never reaches punica.py. And not just punica.py: this code path never reaches llama.py either. I also dug into code outside Code Block 1, such as the model's sample function.
Code Block 5 vllm/worker/model_runner.py
class ModelRunner(...):
However, these other functions, including model.sample, have no relationship with llama.py either, so now I am very confused about the decode stage. Is the decode stage actually running?
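For reference, here is a minimal sketch of the capture-and-replay pattern that a CUDA-graph runner is built on; it is my simplification, not vLLM's actual CudaGraphRunner, and it needs a CUDA GPU. The module's Python forward only runs during warm-up and capture; afterwards each call just copies new data into the captured input buffer and replays the recorded kernels, so no Python frame in the model code is entered again. That would explain seeing new tokens appear without ever hitting breakpoints in llama.py or punica.py.

```python
import torch

class GraphRunner:
    """Capture a module's forward once, then serve later calls by replaying it."""

    def __init__(self, module: torch.nn.Module):
        self.module = module
        self.graph = torch.cuda.CUDAGraph()
        self.static_input = None
        self.static_output = None

    def capture(self, example_input: torch.Tensor) -> None:
        self.static_input = example_input.clone()
        # Warm up on a side stream before capture, as the CUDA graph docs require.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            self.module(self.static_input)
        torch.cuda.current_stream().wait_stream(s)
        # Record the kernels launched by one forward pass into the graph.
        with torch.cuda.graph(self.graph):
            self.static_output = self.module(self.static_input)

    def __call__(self, new_input: torch.Tensor) -> torch.Tensor:
        # No self.module(...) call here: refresh the captured buffer and replay.
        self.static_input.copy_(new_input)
        self.graph.replay()
        return self.static_output

# Toy usage: a breakpoint inside Linear.forward would only trigger during capture.
model = torch.nn.Linear(32, 32).cuda().eval()
runner = GraphRunner(model)
with torch.no_grad():
    runner.capture(torch.randn(4, 32, device="cuda"))
    out = runner(torch.randn(4, 32, device="cuda"))  # graph replay, not a Python forward
print(out.shape)
```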
And indeed, when I check the results of multilora_inference.py, I see newly decoded tokens like this:
RequestOutput(request_id=1,
prompt='[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe airport [/user] [assistant]',
prompt_token_ids=[ ~~~~~~~,
...
...
logprobs=[{29871: Logprob(logprob=-3.3378044463461265e-05, rank=1, decoded_token=' ')},
{5097: Logprob(logprob=-0.0010131231974810362, rank=1, decoded_token=' SELECT')},
{474: Logprob(logprob=-0.9617809057235718, rank=1, decoded_token=' i')},
{1113: Logprob(logprob=-0.06918486207723618, rank=1, decoded_token='ca')},
{29877: Logprob(logprob=-1.4066597032069694e-05, rank=1, decoded_token='o')},
{3895: Logprob(logprob=-0.007424145471304655, rank=1, decoded_token=' FROM')},
{1591: Logprob(logprob=-0.0007995745982043445, rank=1, decoded_token=' table')},
{29918: Logprob(logprob=-1.5497195136049413e-06, rank=1, decoded_token='')},
{978: Logprob(logprob=-0.00010275312524754554, rank=1, decoded_token='name')},
{29918: Logprob(logprob=-6.794906312279636e-06, rank=1, decoded_token='')},
{29955: Logprob(logprob=-0.0001479277852922678, rank=1, decoded_token='7')},
{29946: Logprob(logprob=-9.238292841473594e-05, rank=1, decoded_token='4')},
{5754: Logprob(logprob=-0.00103265349753201, rank=1, decoded_token=' WHERE')},
{4799: Logprob(logprob=-0.0041597275994718075, rank=1, decoded_token=' air')},
{637: Logprob(logprob=-4.637133679352701e-05, rank=1, decoded_token='port')},
{353: Logprob(logprob=-0.0036322588566690683, rank=1, decoded_token=' =')},
{525: Logprob(logprob=-0.0027440059930086136, rank=1, decoded_token=" '")},
{29880: Logprob(logprob=-0.14139728248119354, rank=1, decoded_token='l')},
{309: Logprob(logprob=-0.0002873722987715155, rank=1, decoded_token='il')},
{549: Logprob(logprob=-6.0794889577664435e-05, rank=1, decoded_token='ong')},
{705: Logprob(logprob=-0.0006416169344447553, rank=1, decoded_token='we')},
{29915: Logprob(logprob=-0.2330627590417862, rank=1, decoded_token="'")},
{29871: Logprob(logprob=-0.0012892514932900667, rank=1, decoded_token=' ')},
{32003: Logprob(logprob=-0.0009485750924795866, rank=1, decoded_token='[/assistant]')}],
Reading the generated tokens, they look like the answer to the prompt: a SQL query starting with the keyword SELECT, so I do think decoding is working correctly.
But I cannot understand how this output is produced without passing through the llama.py module. What is wrong with my understanding?
Thanks