
Conversation

@delavet commented Sep 28, 2025

This PR introduces a new Unix Domain Socket (UDS) based tokenizer service that can be used as an external tokenizer for the KV Cache Manager, related to #126. The changes include:

  1. Added a new tokenizer mode to the KV Cache Manager, which communicates with an external tokenizer service over HTTP via UDS.

  2. Added an example external tokenizer service with:

    • Server implementation which can do both tokenization and chat-templating
    • Tokenizer implementation with HF (transformers) code
    • Dockerfile for containerization
    • Gunicorn configuration for production deployment
    • Documentation and tests
  3. Updated the Helm chart to deploy the external tokenizer as a sidecar container alongside the KV Cache Manager

  4. Added configuration options to enable the external tokenizer in the kv events online example.

Signed-off-by: Hang Yin <[email protected]>
@vMaroon (Member) left a comment:

Great work - thank you for this contribution @delavet. Added a couple of minor comments.

Do you think we can get some profiling data and performance benchmarks here?

@@ -0,0 +1,129 @@
# Model Caching in Tokenizer Service
(Member) commented:

I think that the `uds_tokenizer` package would be better housed in a new `services` directory.

Note that special tokens can end up being added twice:

1. Once from the chat template itself (which may include the BOS token)
2. Once from the add_special_tokens parameter

vLLM handles this by setting add_special_tokens=False when using chat templates.
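A toy illustration of the pitfall, using a stand-in "tokenizer" (the template and encode functions here only mimic real HF tokenizers schematically; all names are hypothetical):

```python
BOS = "<s>"


def apply_chat_template(messages):
    # Chat templates typically render the BOS token into the prompt text.
    return BOS + " " + "".join(
        f"[{m['role']}] {m['content']}" for m in messages
    )


def encode(text, add_special_tokens=True):
    # Tokenizers prepend BOS a second time when add_special_tokens=True.
    return ([BOS] if add_special_tokens else []) + text.split()


msgs = [{"role": "user", "content": "hi"}]
prompt = apply_chat_template(msgs)

assert encode(prompt)[:2] == [BOS, BOS]                          # BOS duplicated
assert encode(prompt, add_special_tokens=False).count(BOS) == 1  # the vLLM fix
```

For token-ID-level KV-cache matching, a duplicated BOS shifts every subsequent token position, so a prompt tokenized this way would never match a cache entry produced by vLLM's own preprocessing.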
(Member) commented:

Do you think it would make sense to extract and reuse vLLM's preprocessing code as-is, serving as a lightweight vLLM sub-component for disaggregated tokenization? Its maintenance would then amount to keeping versions and dependencies in sync.

Not a blocker for this PR, but I think we should aim towards this path. What do you think?
