Is the Fast version the final version for interactive use?

Thanks for releasing Lingbot-World-Fast! While evaluating the model, I noticed something odd about the chunk inference latency:
the latency per chunk increases proportionally as the sequence length grows.

<img width="715" height="404" alt="Image" src="https://github.com/user-attachments/assets/764aefb5-1b8e-464b-9bd2-f377e461b539" />

<img width="609" height="393" alt="Image" src="https://github.com/user-attachments/assets/12098e71-b5ab-4f72-810e-fb0f80b18392" />

I then reviewed the code to investigate (with Claude Code). It seems that the current Fast version does not yet implement a sliding window or sink tokens, so the KV cache gradually becomes a bottleneck as the sequence gets longer.

<img width="1184" height="674" alt="Image" src="https://github.com/user-attachments/assets/bc1ded3a-d7cd-422f-a25d-5ee81790eb8e" />
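To make clear what I mean by a sliding window with sink tokens, here is a minimal sketch. It is not taken from the Lingbot-World code: the class name `SinkWindowKVCache` and the parameters `num_sink` and `window` are hypothetical, and plain integers stand in for the per-token key/value tensors. Whether the released weights would tolerate this kind of eviction without retraining is something I have not verified.

```python
from collections import deque


class SinkWindowKVCache:
    """Hypothetical bounded KV cache (illustration only, not the repo's API).

    Keeps the first `num_sink` entries (sink tokens) plus the most recent
    `window` entries, and evicts everything in between.
    """

    def __init__(self, num_sink: int, window: int):
        self.num_sink = num_sink
        self.sink = []                      # earliest entries, never evicted
        self.recent = deque(maxlen=window)  # rolling window of latest entries

    def append(self, kv):
        if len(self.sink) < self.num_sink:
            self.sink.append(kv)
        else:
            # When the deque is full, it drops its oldest entry automatically.
            self.recent.append(kv)

    def entries(self):
        """Entries that attention would read at the current step."""
        return self.sink + list(self.recent)

    def __len__(self):
        return len(self.sink) + len(self.recent)


cache = SinkWindowKVCache(num_sink=2, window=4)
sizes = []
for t in range(10):
    cache.append(t)  # stand-in for the (key, value) pair produced at step t
    sizes.append(len(cache))

print(sizes)            # cache size stops growing at num_sink + window
print(cache.entries())  # sink entries followed by the most recent window
```

With a bound like this, the attention cost per chunk stays constant once the window is full instead of growing with the total sequence length.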

Again, I really appreciate your contribution to the community!
May I ask whether this model supports a local attention size? Alternatively, are there plans to train a version with local attention or some KV cache compression mechanism in the future? I feel that without some constraint on KV cache growth, the model might face challenges in interactive use cases :)
