Thanks for releasing Lingbot-World-Fast! While evaluating the model, I noticed something odd about the chunk inference latency:
The chunk latency increases proportionally as the sequence length grows.
I then reviewed the code to investigate (with the help of Claude Code). It seems the current fast version does not yet implement a sliding window or sink tokens, so the KV cache keeps growing and gradually becomes a bottleneck as the sequence gets long.
Again, really appreciate your contribution to the community!
May I ask whether this model supports a local attention size? Alternatively, are there plans to train a version with local attention or some KV cache compression mechanism in the future? I feel that without some constraint on KV cache growth, the model may face challenges in interactive use cases :)
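To make the request concrete, here is a minimal sketch of the kind of bound I have in mind. It is not taken from the Lingbot-World-Fast code: the function name `kept_cache_positions` and the parameters `window` and `num_sink` are hypothetical, and it only counts which cached positions a new query would attend to under a "sink tokens + sliding window" policy, compared with keeping the full cache.

```python
def kept_cache_positions(seq_len, window, num_sink):
    """Return the cached token positions that stay visible to a new query.

    Hypothetical policy (an assumption, not the model's actual behavior):
    keep the first `num_sink` positions as sink tokens, plus the most
    recent `window` positions; drop everything in between.
    """
    if seq_len <= window + num_sink:
        # Cache still fits inside the budget: nothing is dropped.
        return list(range(seq_len))
    sinks = list(range(num_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


if __name__ == "__main__":
    for seq_len in (8, 64, 512):
        bounded = len(kept_cache_positions(seq_len, window=16, num_sink=4))
        # A full cache grows with seq_len; the bounded one caps at window + num_sink.
        print(f"seq_len={seq_len:4d}  full={seq_len:4d}  bounded={bounded:3d}")
```

With a full cache, the number of attended positions grows with the sequence, which would match the latency trend I observed. With a bound like this, it stays at `window + num_sink` once the sequence is long enough. Whether the model holds up under such a policy without retraining is exactly what I am unsure about.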