119 changes: 97 additions & 22 deletions tools/mllm-llm-benchmark/main.cpp
@@ -1,10 +1,13 @@
// Copyright (c) MLLM Team.
// Licensed under the MIT License.

#include <string>
#include <fstream>
#include <vector>
#include <sstream>
#include <thread>
#include <chrono>
#include <algorithm> // For std::transform

#include <mllm/mllm.hpp>
#include <mllm/utils/Argparse.hpp>
@@ -16,6 +19,14 @@

#include "models/All.hpp"

#ifndef MLLM_GIT_COMMIT_HASH
#define MLLM_GIT_COMMIT_HASH unknown
#endif

#define STRINGIFY_HELPER(x) #x
#define STRINGIFY(x) STRINGIFY_HELPER(x)


MLLM_MAIN({
auto& help = mllm::Argparse::add<bool>("-h|--help").help("Show help message");
auto& model_name = mllm::Argparse::add<std::string>("-n|--model_name").help("Model name");
@@ -25,8 +36,19 @@ MLLM_MAIN({
auto& pp = mllm::Argparse::add<std::string>("-pp|--prompt_length").help("Prompt length");
auto& tg = mllm::Argparse::add<std::string>("-tg|--test_generation_length").help("Test Generation length");
auto& cache_length = mllm::Argparse::add<int32_t>("-cl|--cache_length").help("Cache length");

// New CLI Arguments
auto& runs = mllm::Argparse::add<int32_t>("-r|--runs").help("Number of benchmark runs").def(3);
auto& cooldown_s = mllm::Argparse::add<int32_t>("-cs|--cooldown_s").help("Cooldown time between runs in seconds").def(5);
auto& output_csv = mllm::Argparse::add<std::string>("-oc|--output_csv").help("Output results to a CSV file").def("");
auto& schema_version = mllm::Argparse::add<int32_t>("-sv|--schema_version").help("Schema version for output format").def(1);
auto& kv_dtype_bytes = mllm::Argparse::add<int32_t>("-kv|--kv_dtype_bytes").help("KV cache data type bytes (1: int8, 2: fp16, 4: fp32)").def(4);
🛠️ Refactor suggestion | 🟠 Major

kv_dtype_bytes is defined but never used.

This CLI argument is collected but never referenced in the code. The KV cache estimates at lines 195-196 are hardcoded to 0.0 instead of being computed using this value.

Either implement the KV cache estimation using kv_dtype_bytes, or remove the unused argument to avoid confusing users.

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/main.cpp` at line 45, The CLI option kv_dtype_bytes
(added via mllm::Argparse::add<int32_t>("-kv|--kv_dtype_bytes")) is never used;
update the KV cache size estimation (replace the hardcoded 0.0 values where KV
cache is computed) to use kv_dtype_bytes to compute bytes-per-element and derive
the KV cache memory estimate, or if you prefer to remove the feature, delete the
kv_dtype_bytes argument and any references to KV cache estimation; specifically,
use the parsed kv_dtype_bytes value to multiply by number of KV elements (or
element count computation already present) to produce the correct KV cache size
instead of 0.0.
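For reference, a minimal sketch of what that estimate could look like, assuming the usual layout of two cache tensors (K and V) per layer; the layer, head, and dimension parameters are hypothetical placeholders that would come from the model config rather than from this tool:

```cpp
#include <cassert>
#include <cstdint>

// Rough KV cache size: 2 tensors (K and V) per layer, each of shape
// [num_kv_heads, seq_len, head_dim], times bytes per element.
std::uint64_t kv_cache_bytes(int dtype_bytes, int num_layers, int num_kv_heads,
                             int head_dim, int seq_len) {
  return 2ull * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes;
}
```

With fp16 (`kv_dtype_bytes = 2`), a 32-layer model with 8 KV heads of dimension 128 holding 1024 tokens works out to 134217728 bytes (128 MiB).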


mllm::Argparse::parse(argc, argv);

mllm::Context::instance().setCpuOpThreads(num_threads.get());
mllm::setMaximumNumThreads((uint32_t)num_threads.get());
Comment on lines +49 to +50
⚠️ Potential issue | 🔴 Critical

Add a default value or validation for num_threads.

When --threads is not provided, num_threads.get() returns 0 (the default-initialized value for int32_t). Setting thread counts to 0 is likely unintended behavior. Unlike the newer arguments (runs, cooldown_s, etc.), num_threads lacks a .def() call. Either add .def() with a reasonable default or validate the value before use at lines 49-50.

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/main.cpp` around lines 49 - 50, num_threads
currently can be 0 when unset; update the flag handling to ensure a sensible
default or validate before use: either add a .def(...) when defining num_threads
to set a default (e.g., 1 or std::thread::hardware_concurrency()) or check
num_threads.get() before calling mllm::Context::instance().setCpuOpThreads and
mllm::setMaximumNumThreads and replace 0 with a fallback value; reference the
num_threads variable and the calls
mllm::Context::instance().setCpuOpThreads(...) and
mllm::setMaximumNumThreads(...) when making the change.
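One way to implement the suggested fallback, as a sketch of just the numeric logic (the Argparse wiring is unchanged; `resolve_num_threads` is a hypothetical helper name):

```cpp
#include <thread>

// Returns the requested thread count, falling back to the hardware
// concurrency (or 1 if that is unknown) when the flag was left unset.
unsigned resolve_num_threads(int requested) {
  if (requested > 0) return static_cast<unsigned>(requested);
  unsigned hw = std::thread::hardware_concurrency();  // may legally return 0
  return hw > 0 ? hw : 1u;
}
```

The guard on `hardware_concurrency()` matters because the standard permits it to return 0 when the value is not computable.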


// Print Build Version
mllm::print("MLLM Build Version :", STRINGIFY(MLLM_GIT_COMMIT_HASH));

@@ -58,6 +80,25 @@ MLLM_MAIN({
auto benchmark = createBenchmark(model_name.get());
MLLM_RT_ASSERT(benchmark != nullptr);


// Validate runs early to avoid huge reserve() when negative values cast to size_t.
int R = runs.get();
if (R <= 0) {
mllm::print("[ERROR] --runs must be > 0, got:", R);
return 1;
}

// Open file stream
std::ofstream csv_file;
if (!output_csv.get().empty()) {
csv_file.open(output_csv.get());
if (!csv_file.is_open()) {
mllm::print("[ERROR] Failed to open --output_csv:", output_csv.get());
return 1;
}
csv_file << "schema_version,git_commit,arch,model_name,pp,tg,ttft_ms,prefill_speed,decode_speed,prefill_ms,decode_ms_per_tok,kv_est_bytes_pp,kv_est_bytes_final\n";
}

// Print Model Info
mllm::print("Model Info");
benchmark->init(config_path.get(), model_path.get(), cache_length.get());
@@ -92,7 +133,7 @@ MLLM_MAIN({
for (size_t i = 0; i < pp_values.size(); ++i) { pp_tg_pairs.emplace_back(pp_values[i], tg_values[i]); }
}

// Actual run for 3 turns and gives avg results. Each turn will sleep for 5 seconds to let the SoC or GPU/NPU cool down.
// Actual run for configurable number of turns
mllm::print("\n========================================");
mllm::print("Starting Benchmark Tests");
mllm::print("========================================\n");
@@ -106,30 +147,40 @@ MLLM_MAIN({

// Storage for results
std::vector<BenchmarkTemplateResult> results;
results.reserve(3);
results.reserve(static_cast<size_t>(R));

for (int i = 0; i < 3; ++i) {
mllm::print(" Run", i + 1, "of 3...");
for (int i = 0; i < R; ++i) {
mllm::print(" Run", i + 1, "of", R, "...");

// Clear cache before each run
benchmark->clear();
// Clear cache/state before each run to reduce cross-run interference.

// Run benchmark
benchmark->clear();
// Run benchmark for this (pp, tg) pair.
auto result = benchmark->run(pp, tg);
results.push_back(result);

mllm::print(" TTFT :", result.ttft, "ms");
mllm::print(" Prefill Speed:", result.prefill_speed, "tokens/s");
mllm::print(" Decode Speed :", result.decode_speed, "tokens/s");

// Sleep for 5 seconds between runs to cool down
if (i < 2) {
mllm::print(" Cooling down for 5 seconds...");
std::this_thread::sleep_for(std::chrono::seconds(5));
// Derive per-run latency numbers from throughput (guard against divide-by-zero).

float prefill_ms = (result.prefill_speed > 0.0f) ? (pp / result.prefill_speed) * 1000.0f : 0.0f;
float decode_ms_per_tok = (result.decode_speed > 0.0f) ? (1.0f / result.decode_speed) * 1000.0f : 0.0f;
mllm::print(" Prefill Latency :", prefill_ms, "ms");
mllm::print(" Decode Latency :", decode_ms_per_tok, "ms");

// Sleep between runs to cool down (configurable).

int cool = cooldown_s.get();
if (i + 1 < R && cool > 0) {
mllm::print(" Cooling down for", cool, "seconds...");
std::this_thread::sleep_for(std::chrono::seconds(cool));
}
}

// Calculate average results
float denom = (R > 0) ? static_cast<float>(R) : 1.0f;
float avg_ttft = 0.0f;
float avg_prefill_speed = 0.0f;
float avg_decode_speed = 0.0f;
@@ -140,20 +191,44 @@ MLLM_MAIN({
avg_decode_speed += result.decode_speed;
}

avg_ttft /= 3.0f;
avg_prefill_speed /= 3.0f;
avg_decode_speed /= 3.0f;

// Print average results
mllm::print("\n========== Average Results ==========");
mllm::print("Configuration: PP=", pp, " TG=", tg);
mllm::print("Average TTFT :", avg_ttft, "ms");
mllm::print("Average Prefill Speed:", avg_prefill_speed, "tokens/s");
mllm::print("Average Decode Speed :", avg_decode_speed, "tokens/s");
mllm::print("=====================================\n");
avg_ttft /= denom;
avg_prefill_speed /= denom;
avg_decode_speed /= denom;

float avg_prefill_ms = (avg_prefill_speed > 0.0f) ? (pp / avg_prefill_speed) * 1000.0f : 0.0f;
float avg_decode_ms_per_tok = (avg_decode_speed > 0.0f) ? (1.0f / avg_decode_speed) * 1000.0f : 0.0f;

// Rough KV cache estimate (bytes)
double kv_est_bytes_pp = 0.0;
double kv_est_bytes_final = 0.0;

// Prepare one line output (avg)
std::stringstream ss;
ss << schema_version.get() << ","
<< STRINGIFY(MLLM_GIT_COMMIT_HASH) << ","
<< mllm::cpu::CURRENT_ARCH_STRING << ","
<< model_name.get() << ","
<< pp << ","
<< tg << ","
<< avg_ttft << ","
<< avg_prefill_speed << ","
<< avg_decode_speed << ","
<< avg_prefill_ms << ","
<< avg_decode_ms_per_tok << ","
<< kv_est_bytes_pp << ","
<< kv_est_bytes_final;
Comment on lines +205 to +219
⚠️ Potential issue | 🟡 Minor

Escape string fields in CSV output.

model_name, arch, and commit can contain commas or quotes, which will break CSV parsing. Quote/escape string fields before writing.

Proposed fix
-    std::stringstream ss;
-    ss << schema_version.get() << "," 
-       << STRINGIFY(MLLM_GIT_COMMIT_HASH) << "," 
-       << mllm::cpu::CURRENT_ARCH_STRING << ","
-       << model_name.get() << ","
+    auto csv_escape = [](const std::string& s) {
+      std::string out;
+      out.reserve(s.size() + 2);
+      out.push_back('"');
+      for (char c : s) {
+        if (c == '"') out += "\"\"";
+        else out.push_back(c);
+      }
+      out.push_back('"');
+      return out;
+    };
+
+    std::stringstream ss;
+    ss << schema_version.get() << ","
+       << csv_escape(STRINGIFY(MLLM_GIT_COMMIT_HASH)) << ","
+       << csv_escape(mllm::cpu::CURRENT_ARCH_STRING) << ","
+       << csv_escape(model_name.get()) << ","
        << pp << ","
        << tg << ","
        << avg_ttft << ","
🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/main.cpp` around lines 205 - 219, The CSV writer
currently concatenates raw string fields (STRINGIFY(MLLM_GIT_COMMIT_HASH),
mllm::cpu::CURRENT_ARCH_STRING, model_name.get()) into the stream which breaks
parsing when those values contain commas or quotes; update the code that builds
the stringstream (around schema_version.get(), STRINGIFY(MLLM_GIT_COMMIT_HASH),
mllm::cpu::CURRENT_ARCH_STRING, model_name.get()) to escape and quote these
string fields before inserting them (e.g., wrap in double quotes and double any
internal double-quote characters), leaving numeric fields as-is so the produced
CSV is safe to parse.
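The proposed `csv_escape` helper can be exercised standalone; a minimal sketch with RFC 4180-style quoting (embedded double quotes are doubled, the whole field is wrapped in quotes):

```cpp
#include <string>

// Quote a CSV field, doubling any embedded double quotes (RFC 4180 style).
std::string csv_escape(const std::string& s) {
  std::string out;
  out.reserve(s.size() + 2);
  out.push_back('"');
  for (char c : s) {
    if (c == '"') out += "\"\"";
    else out.push_back(c);
  }
  out.push_back('"');
  return out;
}
```

Numeric fields can stay unquoted; only the string columns (commit hash, arch, model name) need this treatment.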


if (csv_file.is_open()) {
csv_file << ss.str() << std::endl;
}
Comment on lines +194 to +223
⚠️ Potential issue | 🟡 Minor

Average results are calculated but never printed to console.

The code computes averages (avg_ttft, avg_prefill_speed, avg_decode_speed, avg_prefill_ms, avg_decode_ms_per_tok) but only writes them to the CSV file. Users who don't specify --output_csv will never see the aggregated benchmark results.

Consider printing the averages to console, similar to how per-run results are printed at lines 159-166.

Suggested fix
     avg_ttft /= denom;
     avg_prefill_speed /= denom;
     avg_decode_speed /= denom;

     float avg_prefill_ms = (avg_prefill_speed > 0.0f) ? (pp / avg_prefill_speed) * 1000.0f : 0.0f;
     float avg_decode_ms_per_tok = (avg_decode_speed > 0.0f) ? (1.0f / avg_decode_speed) * 1000.0f : 0.0f;

+    mllm::print("  ----------------------------------------");
+    mllm::print("  Average Results (", R, "runs ):");
+    mllm::print("    Avg TTFT         :", avg_ttft, "ms");
+    mllm::print("    Avg Prefill Speed:", avg_prefill_speed, "tokens/s");
+    mllm::print("    Avg Decode Speed :", avg_decode_speed, "tokens/s");
+    mllm::print("    Avg Prefill Latency   :", avg_prefill_ms, "ms");
+    mllm::print("    Avg Decode Latency    :", avg_decode_ms_per_tok, "ms/token");
+
     // Rough KV cache estimate (bytes)
🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/main.cpp` around lines 187 - 216, The averaged
metrics (avg_ttft, avg_prefill_speed, avg_decode_speed, avg_prefill_ms,
avg_decode_ms_per_tok) are only written to csv_file via the stringstream ss but
never printed to the console; update the end of the aggregation block in
main.cpp so the same aggregated line is printed to stdout (similar to the
per-run print at lines ~159-166). After building ss (which includes
schema_version, MLLM_GIT_COMMIT_HASH, mllm::cpu::CURRENT_ARCH_STRING,
model_name, pp, tg, avg_* and kv estimates), write ss.str() to std::cout (and
still to csv_file if csv_file.is_open()), so users without --output_csv still
see the average results.

}

mllm::print("\n========================================");
mllm::print("Benchmark Tests Completed");
mllm::print("========================================");

// Close file stream
if (csv_file.is_open()) {
csv_file.close();
}
})