Is the Fast version the final version for interactive use? #44

@Orion-Zheng

Thanks for releasing Lingbot-World-Fast! While evaluating the model, I noticed something odd about the chunk inference latency:
the latency of each chunk increases in proportion to the sequence length as the sequence grows.

Image Image

I then reviewed the code (with Claude Code) to investigate. It seems that the current Fast version does not yet implement a sliding window or sink tokens, so the KV cache keeps growing and gradually becomes a bottleneck as the sequence gets long.

Image
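
To make the sliding window and sink tokens mentioned above concrete, here is a minimal sketch of a bounded KV cache. It is not taken from the Lingbot-World-Fast code: `SinkWindowKVCache`, `cache_sizes`, and the parameters `num_sink` and `window` are hypothetical names used only for illustration, and the keys and values are plain Python objects rather than tensors. It also ignores how positional encodings would need to be handled after eviction, which a real implementation would have to address.

```python
from collections import deque
from typing import Any, List, Tuple


class SinkWindowKVCache:
    """Hypothetical bounded KV cache (illustration only).

    Keeps the first `num_sink` entries (sink tokens) plus the most recent
    `window` entries, and evicts everything in between.
    """

    def __init__(self, num_sink: int, window: int) -> None:
        if num_sink < 0 or window <= 0:
            raise ValueError("num_sink must be >= 0 and window must be > 0")
        self.num_sink = num_sink
        self.sink: List[Tuple[Any, Any]] = []
        # A deque with maxlen drops its oldest entry automatically when full.
        self.recent: deque = deque(maxlen=window)

    def append(self, key: Any, value: Any) -> None:
        # Fill the sink slots first; afterwards everything goes to the window.
        if len(self.sink) < self.num_sink:
            self.sink.append((key, value))
        else:
            self.recent.append((key, value))

    def entries(self) -> List[Tuple[Any, Any]]:
        # What attention would see: sink entries followed by recent entries.
        return self.sink + list(self.recent)

    def __len__(self) -> int:
        return len(self.sink) + len(self.recent)


def cache_sizes(num_chunks: int, tokens_per_chunk: int,
                num_sink: int, window: int) -> List[Tuple[int, int]]:
    """Return (unbounded_size, bounded_size) after each generated chunk."""
    cache = SinkWindowKVCache(num_sink, window)
    sizes = []
    total = 0  # size an unbounded cache would have reached
    for _ in range(num_chunks):
        for _ in range(tokens_per_chunk):
            cache.append(("k", total), ("v", total))
            total += 1
        sizes.append((total, len(cache)))
    return sizes
```

With a cache like this, the number of entries each new chunk attends to is capped at `num_sink + window`, so the attention cost per chunk would stop growing once the window is full, whereas the unbounded count keeps increasing with every chunk.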

Again, really appreciate your contribution to the community!
May I ask whether this model supports a local attention size? Alternatively, are there plans to train a version with local attention or some KV cache compression mechanism in the future? I feel that without some constraint on KV cache growth, the model might face challenges in interactive use cases :)
