
Eagle speculative decoding part 4: Add EAGLE2 worker #2150

Merged · 94 commits · Jan 2, 2025

Conversation

@yukavio (Collaborator) commented Nov 24, 2024

Support EAGLE speculative decoding. The following results were obtained on a single H100.

Official EAGLE code: 200 token/s (see https://github.com/SafeAILab/EAGLE)

Normal decoding speed (SGLang): 156 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf

Eagle decoding speed (SGLang): 297 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7

Eagle decoding speed (SGLang w/ torch.compile): 316 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --enable-torch-compile --cuda-graph-max-bs 2

Benchmark script

import time
import requests

tic = time.time()
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "[INST] Give me a simple FastAPI server. Show the python code. [/INST]",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 256,
        },
    },
)
latency = time.time() - tic
ret = response.json()

print(ret["text"])
speed = ret["meta_info"]["completion_tokens"]
print(f"speed: {speed / latency:.2f} token/s")

Some sub PRs:

@merrymercy merrymercy changed the title Speculative EAGLE2 Eagle speculative decoding part 4: Add EAGLE2 worker Jan 2, 2025
@merrymercy merrymercy merged commit 815dce0 into sgl-project:main Jan 2, 2025
15 checks passed
@zhyncs (Member) commented Jan 2, 2025

🎉🎉🎉

YAMY1234 pushed a commit to YAMY1234/sglang that referenced this pull request Jan 2, 2025
XiaotongJiang pushed a commit to XiaotongJiang/sglang that referenced this pull request Jan 3, 2025
@Xu-Chen (Contributor) commented Jan 16, 2025

When the batch size increases, the time taken by eagle_verify_retrive increases considerably. When the batch size is increased to 10, the eagle_verify_retrive time grows to 0.15s for the 70B model on 4*A100, resulting in slow overall throughput.

@yukavio (Collaborator, Author) commented Jan 17, 2025

Thanks for your report. I'll confirm this and look for possible solutions.

@mmdbhs commented Jan 20, 2025

Can it be used with an AWQ model?

@Xu-Chen (Contributor) commented Jan 21, 2025

Can it be used with an AWQ model?

You need to train the EAGLE draft model based on the AWQ model, and then you can use it.
Since the hidden states from transformers and sglang have some precision differences, the acceleration will be somewhat reduced.
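For illustration only (not from this thread): launching such a setup would follow the same command pattern used earlier in this PR, with an AWQ-quantized base model and a draft model trained against that AWQ base. The model paths below are hypothetical placeholders.

python3 -m sglang.launch_server --model <your-awq-quantized-base-model> --quantization awq --speculative-algo EAGLE --speculative-draft <your-eagle-draft-trained-on-the-awq-base> --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7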

@Xu-Chen (Contributor) commented Jan 21, 2025

Thanks for your report. I'll confirm this and look for possible solutions.

Is there any progress? @yukavio

@yukavio (Collaborator, Author) commented Jan 21, 2025

Thanks for your report. I'll confirm this and look for possible solutions.

Is there any progress? @yukavio

I found that only the first execution of the kernel is slow, and it does not reduce overall throughput.
Tested on 4*H800, batch size = 10, model: Llama-2-Chat-70B.
[screenshot: kernel timing results]
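As an editorial aside, here is a minimal warmup sketch based on the benchmark script at the top of this PR (assuming a server on localhost:30000): send one untimed request first so the one-time slow kernel execution does not skew the measured speed.

import time
import requests

URL = "http://localhost:30000/generate"  # assumed endpoint, as in the benchmark script above
payload = {
    "text": "[INST] Give me a simple FastAPI server. Show the python code. [/INST]",
    "sampling_params": {"temperature": 0, "max_new_tokens": 256},
}

# Untimed warmup request: absorbs the one-time slow first kernel execution.
requests.post(URL, json=payload)

# Timed request: reflects steady-state decoding speed.
tic = time.time()
ret = requests.post(URL, json=payload).json()
latency = time.time() - tic
print(f"speed: {ret['meta_info']['completion_tokens'] / latency:.2f} token/s")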

@Xu-Chen (Contributor) commented Jan 21, 2025

Not sure if it's an A800 (sm80) issue.
According to my test results on 4 * A800 with Qwen2-72B-Instruct: when BS=1, the TPS of a single request is 75; when BS=10, the overall TPS is 230-260, and the average TPS of a single request is only 25.

Test Code

import argparse
import time
from openai import OpenAI
from multiprocessing import Pool
def infer_one(pid, openai_api_base, model_name, num):
    openai_api_key = "EMPTY"
    client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)
    start_time = time.time()
    messages = [{"role": "user", "content": "以模型部署与推理加速为题,写一篇少于 2000 字的文章"}]  # prompt: "Write an article of under 2,000 words on model deployment and inference acceleration"
    llm_response = client.chat.completions.create(
        messages=messages,
        model=model_name,
        max_tokens=2048,
        temperature=0,
        stream=True
    )
    if pid == 0:
        print("\n########## start ##########")
    answer = ""
    num_token = 0
    for each in llm_response:
        if len(each.choices) == 0:
            continue
        content = each.choices[0].delta.content
        if content is not None:
            answer += content
            num_token += 1
            if pid == 0:
                print(content,end="",  flush=True)
    if pid == 0:
        print("\n########## end  ############")

def infer_main(pid, openai_api_base, model_name):
    for i in range(3):
        infer_one(pid, openai_api_base, model_name, i)

def parser_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pid-num",
        type=int,
        help="work num",
        default=5,
    )
    parser.add_argument(
        "--host",
        type=str,
        help="host",
        default="",
    )
    parser.add_argument(
        "--model-name",
        type=str,
        help="model name",
        default="",
    )
    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = parser_args()
    print(args)
    pid_num = args.pid_num
    model_name = args.model_name
    openai_api_base = f"{args.host}/v1"
    assert openai_api_base != ""
    pool = Pool(processes=pid_num)
    process_list = []
    start_time = time.time()
    for i in range(pid_num):
        p = pool.apply_async(infer_main, (i, openai_api_base, model_name,))
        process_list.append(p)
    pool.close()
    pool.join()

@yukavio (Collaborator, Author) commented Jan 22, 2025

Sorry, I don't have an A800 for testing. Could you please send me an Nsight Systems profile of your test? (Don't profile too many requests; the file will become too large to open on a laptop.) I can help you find what is slowing the server down. @Xu-Chen

@Xu-Chen (Contributor) commented Jan 22, 2025

Sorry, I don't have an A800 for testing. Could you please send me an Nsight Systems profile of your test? (Don't profile too many requests; the file will become too large to open on a laptop.) I can help you find what is slowing the server down. @Xu-Chen

I generated a profile with nsight-systems-cli on 4 * A100: sglang.out-v1.nsys-rep.zip. Thanks for your help. @yukavio

@yukavio (Collaborator, Author) commented Jan 23, 2025

Not sure if it's an A800 (sm80) issue. According to my test results on 4 * A800 with Qwen2-72B-Instruct: when BS=1, the TPS of a single request is 75; when BS=10, the overall TPS is 230-260, and the average TPS of a single request is only 25.

What command line did you use to start the service?

@Xu-Chen (Contributor) commented Jan 23, 2025

What command line did you use to start the service?

/opt/conda/bin/python3 -m sglang.launch_server --model Qwen/Qwen2-72B-Instruct --speculative-algo EAGLE --speculative-draft ./eagle-qwen2-72b-instruct --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 16 --context-length 32768 --tp 4 --port 8080 --dtype bfloat16

./eagle-qwen2-72b-instruct is EAGLE-Qwen2-72B-Instruct, with Qwen2ForCausalLM changed to Qwen2ForCausalLMEagle in its config.
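A minimal sketch of that rename (editorial, assuming the draft model ships a standard Hugging Face config.json): update the architectures field so the EAGLE draft class is loaded.

import json

cfg_path = "./eagle-qwen2-72b-instruct/config.json"  # local draft-model path from the command above
with open(cfg_path) as f:
    cfg = json.load(f)

# Replace the base architecture name with the EAGLE draft variant.
cfg["architectures"] = ["Qwen2ForCausalLMEagle"]

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)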

The model in the test is the acceleration (draft) model we trained for Qwen2.5. You can test with the Qwen2 model and change the test input prompt to English.

We have collected open-source data in both Chinese and English, and trained an Eagle2 model based on the Qwen 2.5 model. Currently, we are facing issues with high-concurrency inference speed. Once this issue is resolved, we plan to open-source the model. We tested some samples on sglang, and under single-request scenarios, achieved a 2x acceleration for both Chinese and English.

@yukavio (Collaborator, Author) commented Jan 23, 2025

It is reasonable that the TPS of each request decreases as the batch size increases, because speculative decoding is a method that improves computational efficiency at small batch sizes.
But you can try smaller values of 'speculative-num-steps', 'speculative-eagle-topk', and 'speculative-num-draft-tokens', which will give you better performance at large batch sizes. @Xu-Chen

@Xu-Chen (Contributor) commented Jan 23, 2025

It is reasonable that the TPS of each request decreases as the batch size increases, because speculative decoding is a method that improves computational efficiency at small batch sizes. But you can try smaller values of 'speculative-num-steps', 'speculative-eagle-topk', and 'speculative-num-draft-tokens', which will give you better performance at large batch sizes. @Xu-Chen

You are right. The following parameters achieve up to 1.5x the speed at high concurrency. The speculative-num-draft-tokens parameter particularly affects performance at high batch size (we tested 10).

--speculative-num-steps 3
--speculative-eagle-topk 4
--speculative-num-draft-tokens 16
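
For reference, a sketch of the launch command from earlier in this thread with these smaller values substituted (unverified, same model paths as above):

python3 -m sglang.launch_server --model Qwen/Qwen2-72B-Instruct --speculative-algo EAGLE --speculative-draft ./eagle-qwen2-72b-instruct --speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --mem-fraction 0.7 --cuda-graph-max-bs 16 --context-length 32768 --tp 4 --port 8080 --dtype bfloat16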
