Replies: 1 comment 1 reply
IIUC, your confusion mainly stems from CUDA graphs. I suggest you first learn about CUDA graphs; alternatively, you can try changing the code below and debugging it.
Hi there!
I'm new to vLLM and have been studying it for a while, especially the LoRA part.
There is an example at vllm/example/multilora_inference.py, so I followed it, but I eventually got confused.
Let me first explain what I have figured out.
My understanding of the vLLM code
In the main function of multilora_inference.py, the call engine = initialize_engine() flows into vllm/worker/model_runner.py and reaches line 1656 (Code Block 1).
Code Block 1 vllm/worker/model_runner.py
class ModelRunner(...):
This leads me to the forward function (line 544) in vllm/model_executor/models/llama.py (Code Block 2).
Code Block 2 vllm/model_executor/models/llama.py
class LlamaForCausalLM(...):
If we dive into self.model(~~~), we find LlamaDecoderLayer.forward(), which calls LlamaAttention.forward() for Llama's attention computation and eventually invokes qkv_proj via MergedQKVParallelLinearWithLora.apply() in vllm/lora/layers.py.
Here we can see the LoRA calculation in that function, done through self.punica_wrapper.add_lora_packed_nslice(); inside this call are the shrink and expand operations that perform the LoRA arithmetic (a conceptual sketch follows Code Block 3).
Code Block 3 vllm/lora/punica.py
class PunicaWrapper:
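For intuition, here is a rough reference version of what a packed-nslice LoRA update does conceptually. The function name, shapes, and slicing below are my own illustration, not the actual Punica kernels (the real implementation runs this as fused batched CUDA kernels): the shrink step projects the input down to the LoRA rank with A, the expand step projects back up with B, and the result is added into the corresponding q/k/v slice of the merged projection output.

```python
import torch

def add_lora_packed_nslice_ref(
    y: torch.Tensor,                 # merged qkv output, shape (tokens, q_dim + k_dim + v_dim)
    x: torch.Tensor,                 # layer input, shape (tokens, hidden)
    lora_a: list[torch.Tensor],      # per-slice A matrices, each (hidden, rank)
    lora_b: list[torch.Tensor],      # per-slice B matrices, each (rank, slice_dim)
    slice_sizes: tuple[int, ...],    # (q_dim, k_dim, v_dim)
    scale: float = 1.0,
) -> torch.Tensor:
    """Reference semantics only: y[:, slice] += scale * (x @ A) @ B, per slice."""
    offset = 0
    for A, B, size in zip(lora_a, lora_b, slice_sizes):
        shrink = x @ A                        # "shrink": down-project to the LoRA rank
        expand = shrink @ B                   # "expand": up-project to the slice width
        y[:, offset:offset + size] += scale * expand
        offset += size
    return y

# Toy usage: hidden=16, rank=4, q/k/v slice widths 16/8/8.
tokens, hidden, rank = 2, 16, 4
sizes = (16, 8, 8)
x = torch.randn(tokens, hidden)
y = torch.randn(tokens, sum(sizes))           # stands in for the base qkv_proj output
A = [torch.randn(hidden, rank) for _ in sizes]
B = [torch.randn(rank, s) for s in sizes]
add_lora_packed_nslice_ref(y, x, A, B, sizes)
```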
This code flow holds for the entire engine = initialize_engine() part of the main function in example/multilora_inference.py, for both the prefill and the decode stage. I guess this happens during engine initialization so that the engine can check how much memory the system has, or allocate GPU memory blocks for paged attention (see the sketch below).
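On that guess: as I understand it (stated loosely, this is not vLLM's exact accounting), the initialization-time forward pass is a profiling run; the engine measures peak memory with dummy inputs and then carves whatever fits under gpu_memory_utilization into fixed-size KV-cache blocks for paged attention. A back-of-the-envelope sketch of that arithmetic, with made-up numbers and a hypothetical helper:

```python
def estimate_num_gpu_blocks(
    total_gpu_mem: int,          # bytes reported by the device
    peak_profiled_mem: int,      # weights + activations measured during the dummy run
    gpu_memory_utilization: float,
    block_size: int,             # tokens per paged-attention block (e.g. 16)
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,        # fp16/bf16
) -> int:
    # Bytes needed to store K and V for one block across all layers.
    bytes_per_block = 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes
    available = int(total_gpu_mem * gpu_memory_utilization) - peak_profiled_mem
    return max(available // bytes_per_block, 0)

# Toy numbers for a 7B-ish model on a 24 GiB GPU (illustrative only).
print(estimate_num_gpu_blocks(
    total_gpu_mem=24 * 1024**3,
    peak_profiled_mem=15 * 1024**3,
    gpu_memory_utilization=0.9,
    block_size=16,
    num_layers=32,
    num_kv_heads=32,
    head_dim=128,
))
```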
Question
My question starts here.
After initializing the engine, the multilora_inference.py example starts to serve requests. It creates test prompts (requests) and processes them with engine.step(). Up to the first step (i.e., the prefill stage), the code behaves exactly as I explained above. But from the second step onward (the start of the decode stage), Code Block 1 no longer leads me to Code Block 2; instead it goes to Code Block 4.
Code Block 4 vllm/worker/model_runner.py
class CudaGraphRunner(...):
From my investigation, Code Block 4 never reaches punica.py. And not just punica.py: this code path never reaches llama.py either. I also dug into code outside Code Block 1, such as the model's sample function.
Code Block 5 vllm/worker/model_runner.py
class ModelRunner(...):
However, these other functions, including model.sample, have no relationship with llama.py either, so now I am very confused about the decode stage. Is the decode stage actually running?
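For reference, here is a minimal sketch of the capture-and-replay pattern that a CUDA-graph runner is built on; it is my simplification, not vLLM's actual CudaGraphRunner, and it needs a CUDA GPU. The module's Python forward only runs during warm-up and capture; afterwards each call just copies new data into the captured input buffer and replays the recorded kernels, so no Python frame in the model code is entered again. That would explain seeing new tokens appear without ever hitting breakpoints in llama.py or punica.py.

```python
import torch

class GraphRunner:
    """Capture a module's forward once, then serve later calls by replaying it."""

    def __init__(self, module: torch.nn.Module):
        self.module = module
        self.graph = torch.cuda.CUDAGraph()
        self.static_input = None
        self.static_output = None

    def capture(self, example_input: torch.Tensor) -> None:
        self.static_input = example_input.clone()
        # Warm up on a side stream before capture, as the CUDA graph docs require.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            self.module(self.static_input)
        torch.cuda.current_stream().wait_stream(s)
        # Record the kernels launched by one forward pass into the graph.
        with torch.cuda.graph(self.graph):
            self.static_output = self.module(self.static_input)

    def __call__(self, new_input: torch.Tensor) -> torch.Tensor:
        # No self.module(...) call here: refresh the captured buffer and replay.
        self.static_input.copy_(new_input)
        self.graph.replay()
        return self.static_output

# Toy usage: a breakpoint inside Linear.forward would only trigger during capture.
model = torch.nn.Linear(32, 32).cuda().eval()
runner = GraphRunner(model)
with torch.no_grad():
    runner.capture(torch.randn(4, 32, device="cuda"))
    out = runner(torch.randn(4, 32, device="cuda"))  # graph replay, not a Python forward
print(out.shape)
```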
And indeed, when I check the results of multilora_inference.py, I see newly decoded tokens like this:
RequestOutput(request_id=1,
prompt='[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe airport [/user] [assistant]',
prompt_token_ids=[ ~~~~~~~,
...
...
logprobs=[{29871: Logprob(logprob=-3.3378044463461265e-05, rank=1, decoded_token=' ')},
{5097: Logprob(logprob=-0.0010131231974810362, rank=1, decoded_token=' SELECT')},
{474: Logprob(logprob=-0.9617809057235718, rank=1, decoded_token=' i')},
{1113: Logprob(logprob=-0.06918486207723618, rank=1, decoded_token='ca')},
{29877: Logprob(logprob=-1.4066597032069694e-05, rank=1, decoded_token='o')},
{3895: Logprob(logprob=-0.007424145471304655, rank=1, decoded_token=' FROM')},
{1591: Logprob(logprob=-0.0007995745982043445, rank=1, decoded_token=' table')},
{29918: Logprob(logprob=-1.5497195136049413e-06, rank=1, decoded_token='')},
{978: Logprob(logprob=-0.00010275312524754554, rank=1, decoded_token='name')},
{29918: Logprob(logprob=-6.794906312279636e-06, rank=1, decoded_token='')},
{29955: Logprob(logprob=-0.0001479277852922678, rank=1, decoded_token='7')},
{29946: Logprob(logprob=-9.238292841473594e-05, rank=1, decoded_token='4')},
{5754: Logprob(logprob=-0.00103265349753201, rank=1, decoded_token=' WHERE')},
{4799: Logprob(logprob=-0.0041597275994718075, rank=1, decoded_token=' air')},
{637: Logprob(logprob=-4.637133679352701e-05, rank=1, decoded_token='port')},
{353: Logprob(logprob=-0.0036322588566690683, rank=1, decoded_token=' =')},
{525: Logprob(logprob=-0.0027440059930086136, rank=1, decoded_token=" '")},
{29880: Logprob(logprob=-0.14139728248119354, rank=1, decoded_token='l')},
{309: Logprob(logprob=-0.0002873722987715155, rank=1, decoded_token='il')},
{549: Logprob(logprob=-6.0794889577664435e-05, rank=1, decoded_token='ong')},
{705: Logprob(logprob=-0.0006416169344447553, rank=1, decoded_token='we')},
{29915: Logprob(logprob=-0.2330627590417862, rank=1, decoded_token="'")},
{29871: Logprob(logprob=-0.0012892514932900667, rank=1, decoded_token=' ')},
{32003: Logprob(logprob=-0.0009485750924795866, rank=1, decoded_token='[/assistant]')}],
Reading the generated tokens, they look like the answer to the prompt: a SQL query starting with the keyword SELECT, so I do think decoding is working correctly.
But I cannot understand how this output is produced without passing through the llama.py module. What is wrong with my understanding?
Thanks