Performance optimized interleaved mode JetStream server #122
Conversation
Optimized TTFT and optimized output token throughput conflict with each other. Can we expose a parameter for tuning the trade-off between the two?
@@ -316,6 +317,12 @@ def __init__(
            queue.Queue(8)
            for _ in self._generate_engines
        ]
        self._prefill_detokenize_backlogs = [
            # We don't let detokenization accumulate more than 8 steps to avoid
Can you elaborate on why there is a synchronization issue after 8 steps?
We set it to 8, the same bound used for the decode detokenize backlog. A queue that is too large or too small causes performance issues.
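The effect of the bound can be illustrated with a minimal producer/consumer sketch (the names `detokenize_backlog`, `generate_loop`, and `detokenize_loop` are illustrative, not the actual JetStream code): a `queue.Queue(8)` makes `put()` block once 8 un-detokenized steps accumulate, so the generate loop cannot run unboundedly ahead of the detokenize thread.

```python
import queue
import threading

# Illustrative bound: mirrors the queue.Queue(8) backlog in the diff above.
MAX_BACKLOG_STEPS = 8

detokenize_backlog = queue.Queue(MAX_BACKLOG_STEPS)

def generate_loop(num_steps, produced):
    # Producer: each decode step enqueues its token buffer for detokenization.
    for step in range(num_steps):
        # put() blocks once MAX_BACKLOG_STEPS items are pending, exerting
        # backpressure on the generate loop.
        detokenize_backlog.put(f"step-{step}")
        produced.append(step)
    detokenize_backlog.put(None)  # sentinel: no more steps

def detokenize_loop(consumed):
    # Consumer: drains steps in order and "detokenizes" them.
    while True:
        item = detokenize_backlog.get()
        if item is None:
            break
        consumed.append(item)

produced, consumed = [], []
t = threading.Thread(target=detokenize_loop, args=(consumed,))
t.start()
generate_loop(20, produced)
t.join()
print(len(consumed))  # 20
```

With a much larger bound the consumer can fall far behind (hurting latency); with a much smaller one the producer stalls on nearly every step (hurting throughput), which is the tuning tension described above.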
Thanks, is this PR ready to submit?
@@ -795,6 +792,44 @@ def _detokenize_thread(self, idx: int): | |||
slot, active_request = data | |||
my_live_requests[slot] = active_request | |||
|
|||
def _prefill_detokenize_thread(self, idx: int): |
I thought we already had a prefill detokenize thread. Does current JetStream (before this PR) always return the prefill token (first token) only after the first decode step?
We only had a single detokenize thread that combined prefill detokenization and decode detokenization. The problem is that jax.block_until_ready() blocks the thread while waiting for the prefill token or decode token to finish its async copy to host, so putting both in one thread makes TTFT slow. JetStream returns the prefill token in the prefill thread (right after the prefill step generates the first token).
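The split described above can be sketched as follows. This is a simplified model, not the actual JetStream implementation: `fake_block_until_ready` stands in for `jax.block_until_ready()` (simulated with a sleep representing the device-to-host copy), and the queue and function names are hypothetical. The point is that a dedicated prefill detokenize thread keeps first tokens from waiting behind slow decode-token copies.

```python
import queue
import threading
import time

def fake_block_until_ready(seconds):
    # Stand-in for jax.block_until_ready(): blocks the calling thread
    # until the (simulated) async copy to host completes.
    time.sleep(seconds)

prefill_backlog = queue.Queue(8)  # first tokens from prefill steps
decode_backlog = queue.Queue(8)   # tokens from decode steps

first_token_times = {}

def prefill_detokenize_thread():
    # Dedicated thread: a slow decode copy can no longer delay TTFT,
    # because first tokens never queue behind decode tokens here.
    while True:
        item = prefill_backlog.get()
        if item is None:
            break
        request_id, copy_time = item
        fake_block_until_ready(copy_time)
        first_token_times[request_id] = time.monotonic()

def decode_detokenize_thread():
    while True:
        item = decode_backlog.get()
        if item is None:
            break
        _, copy_time = item
        fake_block_until_ready(copy_time)

start = time.monotonic()
threads = [
    threading.Thread(target=prefill_detokenize_thread),
    threading.Thread(target=decode_detokenize_thread),
]
for t in threads:
    t.start()

# A slow decode copy (0.2 s) and a fast prefill copy (0.01 s) arrive together.
decode_backlog.put(("req-0", 0.2))
prefill_backlog.put(("req-1", 0.01))

prefill_backlog.put(None)
decode_backlog.put(None)
for t in threads:
    t.join()

ttft = first_token_times["req-1"] - start
print(ttft < 0.1)  # prefill token was not stuck behind the 0.2 s decode copy
```

In a single combined thread, the same prefill token would have waited for the 0.2 s decode copy to clear first, roughly doubling its TTFT in this toy scenario.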
Sounds good, thanks for sharing the insights!
Currently, we prioritize prefills in interleaved mode and apply correct JAX blocking for the async copy to host to reduce wasted wait time. One more optimization to do: ensure the result is returned immediately once the return channel has the result (from the orchestrator).
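One way to get that immediate-return behavior is to have the caller block on the return channel rather than poll it on a timer. The sketch below is hypothetical (a plain `queue.Queue` standing in for the orchestrator's return channel, with made-up names); it shows why a blocking `get()` delivers the result the moment it lands, while a polling loop adds up to one polling interval of extra latency.

```python
import queue
import threading
import time

return_channel = queue.Queue()

def orchestrator():
    # Simulate the orchestrator producing a result after 50 ms.
    time.sleep(0.05)
    return_channel.put("result")

threading.Thread(target=orchestrator).start()

start = time.monotonic()
# Blocking get() wakes as soon as the result lands in the channel;
# polling with e.g. time.sleep(0.1) between checks could add ~100 ms.
result = return_channel.get()
elapsed = time.monotonic() - start

print(result)
print(elapsed < 0.1)  # returned promptly, with no polling delay
```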