fix(ascend):update README#673
Conversation
📝 WalkthroughWalkthroughThis PR introduces UTF-8 safe streaming output for Qwen Ascend inference by adding prefill warmup, buffered token decoding, and UTF-8 validation logic. The tokenizer initializes with UTF-8 locale, and output is accumulated and printed only when valid UTF-8 sequences are complete. Documentation is updated to announce the feature. ChangesUTF-8 Safe Streaming and Tokenizer Configuration
Sequence DiagramsequenceDiagram
participant Gen as Token Generator
participant Buf as UTF-8 Buffer
participant Val as UTF-8 Validator
participant Out as Output
loop Generation Loop
Gen->>Buf: Decode token → pending_text
Buf->>Val: Check for valid UTF-8 prefix
Val->>Buf: Return valid prefix + remainder
Buf->>Out: Print valid UTF-8 chunk
Buf->>Buf: Keep remainder for next iteration
end
Out->>Out: Print final newline
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/qwen_ascend/main.cpp`:
- Around line 19-35: The helper takeValidUtf8Prefix currently returns empty when
an invalid byte is at the buffer start and never consumes it; update
takeValidUtf8Prefix to detect irrecoverable leading bytes (e.g., a start octet
that would require >4 continuation bytes or otherwise cannot form a valid UTF‑8
sequence) and consume/discard one invalid byte so progress continues, while
still returning empty when nothing valid precedes it; also ensure the final
flush path uses takeValidUtf8Prefix on pending_text (instead of printing
pending_text directly) so only validated UTF‑8 is emitted. Reference: function
takeValidUtf8Prefix and the final tail-flush that prints pending_text.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1e278d7c-fbd2-4317-b44b-90eb3e3d5fd7
📒 Files selected for processing (3)
README.mdexamples/qwen_ascend/main.cppmllm/models/qwen_ascend/tokenization_qwen_ascend.hpp
| std::string takeValidUtf8Prefix(std::string& pending_text) { | ||
| auto invalid = utf8::find_invalid(pending_text.begin(), pending_text.end()); | ||
| if (invalid == pending_text.begin()) { | ||
| return {}; | ||
| } | ||
|
|
||
| if (invalid == pending_text.end()) { | ||
| std::string ready_text; | ||
| ready_text.swap(pending_text); | ||
| return ready_text; | ||
| } | ||
|
|
||
| auto ready_bytes = static_cast<size_t>(std::distance(pending_text.begin(), invalid)); | ||
| auto ready_text = pending_text.substr(0, ready_bytes); | ||
| pending_text.erase(0, ready_bytes); | ||
| return ready_text; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In utf8cpp, what does utf8::find_invalid(begin, end) return for (1) an incomplete multi-byte sequence at buffer start and (2) an irrecoverably invalid leading byte?
💡 Result:
For utf8::find_invalid(begin, end), the function returns an iterator pointing to the first invalid octet; if none are found it returns end. It does this by scanning from start and calling utf8::internal::validate_next(result, end) and immediately returning result when the returned error code is not UTF8_OK. [1] (1) Incomplete multi-byte sequence at buffer start: It returns begin (i.e., the iterator to the first byte of the incomplete sequence). validate_next is invoked starting from result==start, and any non-UTF8_OK (including INCOMPLETE_SEQUENCE) causes find_invalid to return the current iterator (the start of that invalid sequence). [1][2] (2) Irrecoverably invalid leading byte: It returns an iterator pointing to that invalid leading byte (the first byte whose validate_next call fails with a non-UTF8_OK error such as INVALID_LEAD). Since find_invalid returns the iterator value at which validate_next fails, it is the position of the bad leading octet. [1][2]
Citations:
- 1: https://github.com/nemtrif/utfcpp/blob/master/source/utf8/core.h
- 2: https://github.com/ledger/utfcpp/blob/master/source/utf8/checked.h
🏁 Script executed:
# Check if the file exists and get its size
wc -l examples/qwen_ascend/main.cppRepository: UbiquitousLearning/mllm
Length of output: 101
🏁 Script executed:
# View the takeValidUtf8Prefix helper (lines 19-35)
sed -n '19,35p' examples/qwen_ascend/main.cppRepository: UbiquitousLearning/mllm
Length of output: 604
🏁 Script executed:
# View the tail flush logic (lines 250-260 for context around 252-254)
sed -n '250,260p' examples/qwen_ascend/main.cppRepository: UbiquitousLearning/mllm
Length of output: 451
🏁 Script executed:
# Check how pending_text is used in the loop - search for the streaming context
sed -n '240,270p' examples/qwen_ascend/main.cppRepository: UbiquitousLearning/mllm
Length of output: 981
Handle irrecoverable leading bytes to avoid stalled output and invalid tail flush.
When utf8::find_invalid detects an invalid byte at the buffer start, the helper returns empty without consuming it. An irrecoverably invalid leading byte (e.g., a malformed UTF-8 start octet) will therefore block progress indefinitely—the next iteration finds the same byte, returns empty again, and loops without advancing. Additionally, the final flush at line 252-254 directly prints pending_text without validation, allowing invalid UTF-8 to reach output even though the streaming loop uses the validated takeValidUtf8Prefix.
Discard irrecoverable leading bytes that exceed the maximum UTF-8 sequence length (4 bytes), and apply takeValidUtf8Prefix to the tail flush to ensure output is valid UTF-8.
Proposed fix
std::string takeValidUtf8Prefix(std::string& pending_text) {
+ if (pending_text.empty()) {
+ return {};
+ }
auto invalid = utf8::find_invalid(pending_text.begin(), pending_text.end());
if (invalid == pending_text.begin()) {
- return {};
+ // Keep short prefixes that may become valid with future bytes.
+ // UTF-8 max code point width is 4 bytes; longer invalid-at-begin likely means malformed lead.
+ if (pending_text.size() <= 4) {
+ return {};
+ }
+ pending_text.erase(0, 1);
+ return {};
}
@@
- if (!pending_text.empty()) {
- fmt::print("{}", pending_text);
- }
+ auto tail = takeValidUtf8Prefix(pending_text);
+ if (!tail.empty()) {
+ fmt::print("{}", tail);
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/qwen_ascend/main.cpp` around lines 19 - 35, The helper
takeValidUtf8Prefix currently returns empty when an invalid byte is at the
buffer start and never consumes it; update takeValidUtf8Prefix to detect
irrecoverable leading bytes (e.g., a start octet that would require >4
continuation bytes or otherwise cannot form a valid UTF‑8 sequence) and
consume/discard one invalid byte so progress continues, while still returning
empty when nothing valid precedes it; also ensure the final flush path uses
takeValidUtf8Prefix on pending_text (instead of printing pending_text directly)
so only validated UTF‑8 is emitted. Reference: function takeValidUtf8Prefix and
the final tail-flush that prints pending_text.
Please check Guidelines for Contributing.
Summary by CodeRabbit
Release Notes
New Features
Improvements