3 changes: 2 additions & 1 deletion README.md
@@ -17,6 +17,7 @@ mllm

## Latest News

- [2026 May 02] 🔥🔥🔥 MLLM now supports the Ascend NPU backend, with ATB graph execution and Qwen3 W8A8 inference on Ascend devices.
- [2026 Mar 18] 🔥🔥🔥 `pymllm` now supports CUDA on Jetson Orin and Jetson Thor devices (experimental; still under active development).
- [2026 Feb 03] 🔥🔥🔥 MLLM Qnn AOT Support for Full Graph Execution on NPU! [Quick Start](https://ubiquitouslearning.github.io/mllm/qnn_backend/aot_execute.html), [Technical Report](https://chenghuawang.github.io/News/2026-01-29-mllm-qnn-aot-support-en/)
- [2025 Nov 27] Android Demo Update: Enabled stable Qwen3 and DeepSeek-OCR streaming on Android via a novel In-App Go Server Architecture.
@@ -75,7 +76,7 @@ The mllm framework integrates seamlessly with popular community frameworks' chec

| Model(v2) | CPU | Hexagon NPU <br> INT8 | Ascend NPU |
|-----------------------------------------------------------------------------|------|-----------------------|------------|
- | [Qwen3-0.6B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-w4a32kai) | | ✔️ W8A8 |
+ | [Qwen3-0.6B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-w4a32kai) | | [✔️ W8A8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-W8A8-Ascend) |
| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai) | [W4A16-SM8650](https://modelscope.cn/models/mllmTeam/Qwen3-1.7B-Qnn-AOT-SM8650/) | |
| [Qwen3-4B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-4B-w4a8-i8mm-kai) | | |
| [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/DeepSeek-OCR-w4a8-i8mm-kai) | | |
47 changes: 43 additions & 4 deletions examples/qwen_ascend/main.cpp
@@ -3,13 +3,39 @@

#include <iostream>
#include <clocale>
#include <cstdio>
#include <iterator>
#include <string>
#include <fmt/core.h>
#include <utfcpp/utf8.h>
#include <mllm/mllm.hpp>
#include <mllm/models/qwen_ascend/modeling_qwen_ascend.hpp>
#include <mllm/models/qwen_ascend/tokenization_qwen_ascend.hpp>

using mllm::Argparse;

namespace {

std::string takeValidUtf8Prefix(std::string& pending_text) {
  auto invalid = utf8::find_invalid(pending_text.begin(), pending_text.end());
  if (invalid == pending_text.begin()) {
    return {};
  }

  if (invalid == pending_text.end()) {
    std::string ready_text;
    ready_text.swap(pending_text);
    return ready_text;
  }

  auto ready_bytes = static_cast<size_t>(std::distance(pending_text.begin(), invalid));
  auto ready_text = pending_text.substr(0, ready_bytes);
  pending_text.erase(0, ready_bytes);
  return ready_text;
}
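The helper leans on utf8::find_invalid to split the buffer. The same split can be sketched in standard C++ without utfcpp; this is a simplified illustration (it checks only lead/continuation byte patterns, not overlongs or surrogates), not the library's exact behavior:

```cpp
#include <cstddef>
#include <string>

// Number of bytes a UTF-8 sequence should have, judged by its lead byte;
// returns 0 for an impossible lead (a bare continuation byte or 0xF8..0xFF).
static std::size_t utf8SeqLen(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 0;
}

// Split off the longest prefix of `pending` made of complete UTF-8 sequences,
// leaving any trailing incomplete sequence buffered for the next call.
// Mirrors the role utf8::find_invalid plays in the helper above.
std::string takeValidUtf8Prefix(std::string& pending) {
  std::size_t i = 0;
  while (i < pending.size()) {
    const std::size_t n = utf8SeqLen(static_cast<unsigned char>(pending[i]));
    if (n == 0) break;                  // impossible lead byte: stop here
    if (i + n > pending.size()) break;  // incomplete tail: keep it buffered
    bool ok = true;
    for (std::size_t k = 1; k < n; ++k) {  // continuation bytes are 10xxxxxx
      if ((static_cast<unsigned char>(pending[i + k]) & 0xC0) != 0x80) { ok = false; break; }
    }
    if (!ok) break;
    i += n;
  }
  std::string ready = pending.substr(0, i);
  pending.erase(0, i);
  return ready;
}
```

Feeding it "A" plus the first two bytes of a three-byte code point returns "A" and keeps the partial sequence buffered; appending the final byte releases the full code point.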
Comment on lines +19 to +35
Contributor
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In utf8cpp, what does utf8::find_invalid(begin, end) return for (1) an incomplete multi-byte sequence at buffer start and (2) an irrecoverably invalid leading byte?

💡 Result:

For utf8::find_invalid(begin, end), the function returns an iterator pointing to the first invalid octet; if none is found it returns end. It scans from the start, calling utf8::internal::validate_next(result, end), and immediately returns result when the returned error code is not UTF8_OK. [1]

(1) Incomplete multi-byte sequence at buffer start: it returns begin (the iterator to the first byte of the incomplete sequence), since any non-UTF8_OK code (including INCOMPLETE_SEQUENCE) causes find_invalid to return the current iterator. [1][2]

(2) Irrecoverably invalid leading byte: it returns an iterator pointing to that invalid leading byte (the first byte whose validate_next call fails with an error such as INVALID_LEAD). [1][2]

Handle irrecoverable leading bytes to avoid stalled output and an invalid tail flush.

When utf8::find_invalid detects an invalid byte at the buffer start, the helper returns an empty string without consuming it. An irrecoverably invalid leading byte (e.g., a malformed UTF-8 start octet) will therefore block progress indefinitely: the next iteration finds the same byte, returns empty again, and loops without advancing. Additionally, the final flush at lines 252-254 prints pending_text directly without validation, so invalid UTF-8 can reach the output even though the streaming loop uses the validated takeValidUtf8Prefix.

Discard irrecoverable leading bytes that exceed the maximum UTF-8 sequence length (4 bytes), and apply takeValidUtf8Prefix to the tail flush so the output is always valid UTF-8.

Proposed fix
 std::string takeValidUtf8Prefix(std::string& pending_text) {
+  if (pending_text.empty()) {
+    return {};
+  }
   auto invalid = utf8::find_invalid(pending_text.begin(), pending_text.end());
   if (invalid == pending_text.begin()) {
-    return {};
+    // Keep short prefixes that may become valid with future bytes.
+    // UTF-8 max code point width is 4 bytes; longer invalid-at-begin likely means malformed lead.
+    if (pending_text.size() <= 4) {
+      return {};
+    }
+    pending_text.erase(0, 1);
+    return {};
   }
@@
-      if (!pending_text.empty()) {
-        fmt::print("{}", pending_text);
-      }
+      auto tail = takeValidUtf8Prefix(pending_text);
+      if (!tail.empty()) {
+        fmt::print("{}", tail);
+      }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/qwen_ascend/main.cpp` around lines 19 - 35, The helper
takeValidUtf8Prefix currently returns empty when an invalid byte is at the
buffer start and never consumes it; update takeValidUtf8Prefix to detect
irrecoverable leading bytes (e.g., a start octet that would require >4
continuation bytes or otherwise cannot form a valid UTF‑8 sequence) and
consume/discard one invalid byte so progress continues, while still returning
empty when nothing valid precedes it; also ensure the final flush path uses
takeValidUtf8Prefix on pending_text (instead of printing pending_text directly)
so only validated UTF‑8 is emitted. Reference: function takeValidUtf8Prefix and
the final tail-flush that prints pending_text.
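The stall the review describes can also be broken with a small standalone guard. This sketch is a variant of the proposed fix: instead of keeping short prefixes that may still complete, it unconditionally discards bytes that can never begin a UTF-8 sequence (bare continuation bytes 0x80-0xBF and the invalid leads 0xF8-0xFF), which is sufficient to guarantee forward progress:

```cpp
#include <cstddef>
#include <string>

// Drop leading bytes that can never start a valid UTF-8 sequence, so a
// streaming loop does not re-check the same irrecoverable byte forever.
// Returns how many bytes were discarded.
std::size_t dropIrrecoverableLead(std::string& pending) {
  std::size_t dropped = 0;
  while (!pending.empty()) {
    const auto lead = static_cast<unsigned char>(pending.front());
    const bool valid_lead = lead < 0x80 || (lead & 0xE0) == 0xC0 ||
                            (lead & 0xF0) == 0xE0 || (lead & 0xF8) == 0xF0;
    if (valid_lead) break;  // a plausible lead byte: leave it buffered
    pending.erase(0, 1);    // irrecoverable: consume it so we make progress
    ++dropped;
  }
  return dropped;
}
```

Calling this before the validity split ensures the buffer head is always either empty or a plausible sequence start; a truncated-but-recoverable tail is still left in place for the next token.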


} // namespace

MLLM_MAIN({
auto& help = Argparse::add<bool>("-h|--help").help("Show help message");
auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model path").required(true);
@@ -194,17 +220,27 @@ MLLM_MAIN({
msg.prompt = prompt_text;
auto inputs = tokenizer.convertMessage(msg);

- // Clear KV cache before generation
+ // Run a prefill warmup outside ARGeneration timing so first-use Ascend
+ // graph/runtime setup is not counted as the measured prefill time.
model.clearCache();
+ fmt::print("\nWarming up prefill path...\n");
+ (void)model.forward(inputs, {});
+ // Keep RoPE cache warmed, but reset KV state for the measured generation.
+ model.kvCache().clearCache();

fmt::print("\nAnswer:\n");
auto chat_start = std::chrono::high_resolution_clock::now();

std::vector<int64_t> generated_ids;
// Use streaming generation with the ARGeneration chat interface
std::string pending_text;
for (auto& step : model.chat(inputs)) {
generated_ids.push_back(step.cur_token_id);
- std::wcout << tokenizer.detokenize(step.cur_token_id) << std::flush;
+ pending_text += tokenizer.decode({step.cur_token_id});
+ auto ready_text = takeValidUtf8Prefix(pending_text);
+ if (!ready_text.empty()) {
+ fmt::print("{}", ready_text);
+ std::fflush(stdout);
+ }
// Stop if we've reached max_new_tokens
if (static_cast<int>(generated_ids.size()) >= gen_max_new_tokens) {
if (step.current_step > 0) {
@@ -213,7 +249,10 @@ MLLM_MAIN({
break;
}
}
- std::wcout << std::endl;
+ if (!pending_text.empty()) {
+ fmt::print("{}", pending_text);
+ }
+ fmt::print("\n");

auto chat_end = std::chrono::high_resolution_clock::now();
auto chat_ms = std::chrono::duration_cast<std::chrono::milliseconds>(chat_end - chat_start).count();
2 changes: 1 addition & 1 deletion mllm/models/qwen_ascend/tokenization_qwen_ascend.hpp
@@ -22,7 +22,7 @@ struct QwenAscendMessage {
class QwenAscendTokenizer final : public mllm::preprocessor::AutoTokenizer {
public:
explicit QwenAscendTokenizer(const std::string& file_path) {
- preprocessor::initLocal();
+ preprocessor::initLocal("C.UTF-8");
preprocessor::makeBytes2UnicodeMap(bytes_2_unicode_dict_);
for (auto& kv : bytes_2_unicode_dict_) { bytes_2_unicode_dict_inverse_.insert({kv.second, kv.first}); }
bpe_.initFromSentencePieceJson(file_path);
Expand Down
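The one-line tokenizer change pins the locale explicitly rather than relying on the environment. As a hypothetical sketch of what an initLocal("C.UTF-8")-style call might do underneath (mllm's actual preprocessor implementation may differ), it amounts to setting a UTF-8 C locale with a fallback chain, since the name "C.UTF-8" exists on glibc/musl systems but not everywhere:

```cpp
#include <clocale>
#include <string>

// Try to pin the process locale to a UTF-8 "C" locale; fall back to the
// environment locale, then to plain "C" (which is guaranteed to exist).
// Returns the locale name that was actually activated.
std::string initUtf8Locale() {
  if (const char* loc = std::setlocale(LC_ALL, "C.UTF-8")) return loc;  // preferred
  if (const char* loc = std::setlocale(LC_ALL, "")) return loc;         // environment
  return std::setlocale(LC_ALL, "C");                                   // last resort
}
```

Pinning the locale this way keeps byte-oriented I/O and ctype behavior predictable for the BPE byte-to-unicode mapping, regardless of the user's LANG/LC_* settings.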