feat(benchmark): Add CPU benchmark tool with context length sweep #639
**tools/mllm-llm-benchmark/README.md**

@@ -1,5 +1,11 @@
# MLLM LLM Benchmark Tool

## Why

There wasn't a unified way to benchmark mllm performance across varying context lengths. The existing benchmark tools had prefill-length and decode-length settings, but no automated sweep across contexts, so I put together this tool plus a bash script to run sweeps automatically.

---

## Overview

This is a benchmark tool for measuring MLLM model performance, including:

@@ -10,7 +16,6 @@ This is a benchmark tool for measuring MLLM model performance, including:
## Build

Build from the mllm_v2 project root directory:

```bash
mkdir -p build && cd build
cmake ..
```

@@ -20,7 +25,6 @@ make mllm-llm-benchmark
## Usage

### Basic Usage

```bash
./mllm-llm-benchmark \
  -n qwen3-w4a32-kai \
```

> **Collaborator:** The README is for developers, not a description of your PR. You should add your part at the top of the original README.md.

@@ -32,6 +36,47 @@ make mllm-llm-benchmark

```bash
  -cl 2048
```
### Context Sweep (New Feature)

For automated benchmarking across different context lengths, use the sweep script:

```bash
cd tools/mllm-llm-benchmark
chmod +x scripts/sweep_context_v2.sh

# Configure paths
export BIN=../../build/bin/mllm-llm-benchmark
export MODEL=/path/to/your-model.mllm
export CFG=/path/to/config.json

# Run sweep
./scripts/sweep_context_v2.sh
```

Output goes to `bench_context/context_sweep_v2.csv`.
**Configuration options:**
- `BIN`: Path to the benchmark binary (required)
- `MODEL`: Path to the model file (required)
- `CFG`: Path to the config JSON (default: `./examples/llama/config_tiny_llama.json`)
- `THREADS`: Number of threads (default: 8)
- `RUNS`: Number of runs to average (default: 1)
- `COOLDOWN`: Seconds to wait between runs (default: 0)
- `CLS`: Context lengths to test (default: "256 512 1024 2048 4096")
- `TG_DH`: Generation length for decode_heavy mode (default: 256)
- `TG_TTFT`: Generation length for prefill_ttft mode (default: 2)
- `OUTDIR`: Output directory (default: `bench_context`)
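Because these are ordinary environment variables, they can also be set inline for a single invocation; for example, `CLS="512 1024" THREADS=4 RUNS=3 ./scripts/sweep_context_v2.sh` sweeps just two context lengths on four threads, averaging three runs each.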
**Test modes:**
- `prefill_ttft`: Measures time to first token (prompt length = CL-2, generates 2 tokens)
- `decode_heavy`: Measures decode throughput (prompt length = CL-256, generates 256 tokens)
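For example, at `CL=2048`, `prefill_ttft` runs with pp = 2046 and tg = 2, while `decode_heavy` runs with pp = 1792 and tg = 256, so each request exactly fills the 2048-token context.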
### Plot Results

Visualize benchmark results:

```bash
python3 scripts/plot_sweep.py bench_context/context_sweep_v2.csv output_dir/
```

### Parameters

| Parameter | Long Format | Description | Example |

@@ -47,7 +92,6 @@ make mllm-llm-benchmark
### Examples

#### Testing Qwen3-0.6B Model

```bash
./mllm-llm-benchmark \
  -n qwen3-w4a32-kai \
```

@@ -60,7 +104,6 @@ make mllm-llm-benchmark

#### Quick Test (Single Configuration)

```bash
./mllm-llm-benchmark \
  -n qwen3-w4a32-kai \
```

@@ -73,7 +116,6 @@ make mllm-llm-benchmark
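The run-control flags this PR adds to the binary (see the entry point source below) combine with these examples; appending, say, `-r 5 -cs 10 -oc results.csv` averages five runs with a 10-second cooldown between them and writes one CSV row per configuration.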
## Output Example

```
MLLM Build Version : abc123def456
ARCH : ARM64
```

@@ -144,7 +186,6 @@ Each test configuration executes the following steps:
### 1. Create New Benchmark Class

Create `YourModel_Benchmark.hpp` in the `models/` directory:

```cpp
#include "BenchmarkTemplate.hpp"
#include <mllm/models/yourmodel/modeling_yourmodel.hpp>
```

@@ -178,7 +219,6 @@ class YourModel_Benchmark final : public BenchmarkTemplate {

### 2. Register in All.hpp

```cpp
#include "YourModel_Benchmark.hpp"
```
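The registration code itself is cut off by this diff view, but the entry point below calls `createBenchmark(model_name)` and asserts a non-null result, so registering presumably means adding a branch to that factory. A hypothetical sketch only; the function shape and smart-pointer type are assumptions, not the actual All.hpp contents:

```cpp
// Hypothetical sketch of the factory in All.hpp; the real signature may differ.
std::unique_ptr<BenchmarkTemplate> createBenchmark(const std::string& name) {
  if (name == "yourmodel") return std::make_unique<YourModel_Benchmark>();
  // ... branches for the existing models ...
  return nullptr;  // the entry point asserts the result is non-null
}
```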
---

**Benchmark entry point source** (filename not shown in this view)

@@ -1,10 +1,13 @@
// Copyright (c) MLLM Team.
// Licensed under the MIT License.

#include <string>
#include <fstream>
#include <vector>
#include <sstream>
#include <thread>
#include <chrono>
#include <algorithm>

#include <mllm/mllm.hpp>
#include <mllm/utils/Argparse.hpp>

@@ -16,6 +19,9 @@

#include "models/All.hpp"

#define STR_HELPER(x) #x
#define STR(x) STR_HELPER(x)
MLLM_MAIN({
  auto& help = mllm::Argparse::add<bool>("-h|--help").help("Show help message");
  auto& model_name = mllm::Argparse::add<std::string>("-n|--model_name").help("Model name");

@@ -25,12 +31,21 @@ MLLM_MAIN({
  auto& pp = mllm::Argparse::add<std::string>("-pp|--prompt_length").help("Prompt length");
  auto& tg = mllm::Argparse::add<std::string>("-tg|--test_generation_length").help("Test Generation length");
  auto& cache_length = mllm::Argparse::add<int32_t>("-cl|--cache_length").help("Cache length");

+ auto& runs = mllm::Argparse::add<int32_t>("-r|--runs").help("Number of benchmark runs").def(3);
+ auto& cooldown_s = mllm::Argparse::add<int32_t>("-cs|--cooldown_s").help("Cooldown time between runs in seconds").def(5);
+ auto& output_csv = mllm::Argparse::add<std::string>("-oc|--output_csv").help("Output results to a CSV file").def("");
+ auto& schema_version = mllm::Argparse::add<int32_t>("-sv|--schema_version").help("Schema version for output format").def(1);
+ auto& kv_dtype_bytes =
+     mllm::Argparse::add<int32_t>("-kv|--kv_dtype_bytes").help("KV cache data type bytes (1: int8, 2: fp16, 4: fp32)").def(4);

  mllm::Argparse::parse(argc, argv);

  // Print Build Version
+ mllm::Context::instance().setCpuOpThreads(num_threads.get());
+ mllm::setMaximumNumThreads((uint32_t)num_threads.get());

  mllm::print("MLLM Build Version :", STRINGIFY(MLLM_GIT_COMMIT_HASH));

  // Print Device Info
  mllm::print("ARCH :", mllm::cpu::CURRENT_ARCH_STRING);
  mllm::print("FP16 :", mllm::cpu::hasFP16());
  mllm::print("BF16 :", mllm::cpu::hasBF16());
@@ -53,12 +68,27 @@ MLLM_MAIN({ | |
| mllm::print("AVX512VL :", mllm::cpu::hasAVX512VL()); | ||
| mllm::print("FMA :", mllm::cpu::hasFMA()); | ||
|
|
||
| // Create benchmark | ||
| mllm::print("Create Benchmark: ", model_name.get()); | ||
| auto benchmark = createBenchmark(model_name.get()); | ||
| MLLM_RT_ASSERT(benchmark != nullptr); | ||
|
|
||
| // Print Model Info | ||
| int R = runs.get(); | ||
| if (R <= 0) { | ||
| mllm::print("[ERROR] --runs must be > 0, got:", R); | ||
| return 1; | ||
| } | ||
|
|
||
| std::ofstream csv_file; | ||
| if (!output_csv.get().empty()) { | ||
| csv_file.open(output_csv.get()); | ||
| if (!csv_file.is_open()) { | ||
| mllm::print("[ERROR] Failed to open --output_csv:", output_csv.get()); | ||
| return 1; | ||
| } | ||
| csv_file << "schema_version,git_commit,arch,model_name,pp,tg,ttft_ms,prefill_speed,decode_speed,prefill_ms,decode_ms_per_" | ||
| "tok,kv_est_bytes_pp,kv_est_bytes_final\n"; | ||
| } | ||
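  // Note: one averaged row per (pp, tg) configuration is appended to this CSV after each test; see the stringstream write near the end.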

  mllm::print("Model Info");
  benchmark->init(config_path.get(), model_path.get(), cache_length.get());
  benchmark->printModelInfo();
@@ -92,7 +122,7 @@ MLLM_MAIN({ | |
| for (size_t i = 0; i < pp_values.size(); ++i) { pp_tg_pairs.emplace_back(pp_values[i], tg_values[i]); } | ||
| } | ||
|
|
||
| // Actual run for 3 turns and gives avg results. Each turn will sleep for 5 seconds to let the SoC or GPU/NPU cool down. | ||
| // Actual run for configurable number of turns | ||
| mllm::print("\n========================================"); | ||
| mllm::print("Starting Benchmark Tests"); | ||
| mllm::print("========================================\n"); | ||
|
|
@@ -104,32 +134,33 @@ MLLM_MAIN({ | |
| mllm::print(" Generation Length (TG):", tg); | ||
| mllm::print("----------------------------------------"); | ||
|
|
||
| // Storage for results | ||
| std::vector<BenchmarkTemplateResult> results; | ||
| results.reserve(3); | ||
| results.reserve(static_cast<size_t>(R)); | ||
|
|
||
| for (int i = 0; i < 3; ++i) { | ||
| mllm::print(" Run", i + 1, "of 3..."); | ||
| for (int i = 0; i < R; ++i) { | ||
| mllm::print(" Run", i + 1, "of", R, "..."); | ||
|
|
||
| // Clear cache before each run | ||
| benchmark->clear(); | ||
|
|
||
| // Run benchmark | ||
| auto result = benchmark->run(pp, tg); | ||
| results.push_back(result); | ||
|
|
||
| mllm::print(" TTFT :", result.ttft, "ms"); | ||
| mllm::print(" Prefill Speed:", result.prefill_speed, "tokens/s"); | ||
| mllm::print(" Decode Speed :", result.decode_speed, "tokens/s"); | ||
|
|
||
| // Sleep for 5 seconds between runs to cool down | ||
| if (i < 2) { | ||
| mllm::print(" Cooling down for 5 seconds..."); | ||
| std::this_thread::sleep_for(std::chrono::seconds(5)); | ||
| float prefill_ms = (result.prefill_speed > 0.0f) ? (pp / result.prefill_speed) * 1000.0f : 0.0f; | ||
| float decode_ms_per_tok = (result.decode_speed > 0.0f) ? (1.0f / result.decode_speed) * 1000.0f : 0.0f; | ||
| mllm::print(" Prefill Latency :", prefill_ms, "ms"); | ||
| mllm::print(" Decode Latency :", decode_ms_per_tok, "ms"); | ||

+     int cool = cooldown_s.get();
+     if (i + 1 < R && cool > 0) {
+       mllm::print(" Cooling down for", cool, "seconds...");
+       std::this_thread::sleep_for(std::chrono::seconds(cool));
      }
    }

    // Calculate average results
+   float denom = (R > 0) ? static_cast<float>(R) : 1.0f;
    float avg_ttft = 0.0f;
    float avg_prefill_speed = 0.0f;
    float avg_decode_speed = 0.0f;
@@ -151,9 +182,35 @@ MLLM_MAIN({ | |
| mllm::print("Average Prefill Speed:", avg_prefill_speed, "tokens/s"); | ||
| mllm::print("Average Decode Speed :", avg_decode_speed, "tokens/s"); | ||
| mllm::print("=====================================\n"); | ||
|
|
||
| avg_ttft /= denom; | ||
| avg_prefill_speed /= denom; | ||
| avg_decode_speed /= denom; | ||
|
|
||
| float avg_prefill_ms = (avg_prefill_speed > 0.0f) ? (pp / avg_prefill_speed) * 1000.0f : 0.0f; | ||
| float avg_decode_ms_per_tok = (avg_decode_speed > 0.0f) ? (1.0f / avg_decode_speed) * 1000.0f : 0.0f; | ||
|
|
||
| // KV cache estimate | ||
| double kv_est_bytes_pp = 0.0; | ||
| double kv_est_bytes_final = 0.0; | ||
| if (auto info = benchmark->kvEstimateInfo(); info.has_value()) { | ||
| const int32_t bytes_per = kv_dtype_bytes.get(); // 1/2/4 | ||
| // LLaMA-like KV: 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes | ||
| kv_est_bytes_pp = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)pp * bytes_per; | ||
| kv_est_bytes_final = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)(pp + tg) * bytes_per; | ||
| } | ||
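    // Worked example with assumed dims: num_layers=28, num_kv_heads=8, head_dim=128, pp=2048, fp16 (2 bytes):
    // 2 * 28 * 8 * 128 * 2048 * 2 = 234,881,024 bytes ≈ 224 MiB for the prompt alone.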
> **Contributor** (on lines +197 to +202): No validation on `kv_dtype_bytes`. The help text states valid values are 1, 2, or 4, but any integer is accepted. A value of 0 would silently produce zero estimates; a negative value would produce negative estimates. Proposed validation (e.g., after line 89):
>
> ```cpp
> int kv_bpe = kv_dtype_bytes.get();
> if (kv_bpe != 1 && kv_bpe != 2 && kv_bpe != 4) {
>   mllm::print("[WARN] --kv_dtype_bytes should be 1, 2, or 4; got:", kv_bpe, "— defaulting to 4");
>   kv_bpe = 4;
> }
> ```
>
> Then use `kv_bpe`.
+   std::stringstream ss;
+   ss << schema_version.get() << "," << STRINGIFY(MLLM_GIT_COMMIT_HASH) << "," << mllm::cpu::CURRENT_ARCH_STRING << ","
+      << model_name.get() << "," << pp << "," << tg << "," << avg_ttft << "," << avg_prefill_speed << "," << avg_decode_speed
+      << "," << avg_prefill_ms << "," << avg_decode_ms_per_tok << "," << kv_est_bytes_pp << "," << kv_est_bytes_final;

+   if (csv_file.is_open()) { csv_file << ss.str() << std::endl; }
  }

  mllm::print("\n========================================");
  mllm::print("Benchmark Tests Completed");
  mllm::print("========================================");

+ if (csv_file.is_open()) { csv_file.close(); }
})
---

**BenchmarkTemplate.hpp**

@@ -3,19 +3,27 @@
#pragma once

#include <string>
#include <optional>
#include <cstdint>

/**
 * @brief Benchmark result structure
 */
struct BenchmarkTemplateResult {
  float ttft;           ///< Time To First Token in milliseconds
  float prefill_speed;  ///< Prefill phase speed in tokens/s
  float decode_speed;   ///< Decode phase speed in tokens/s
};

+ struct KVCacheEstimateInfo {
+   int32_t num_layers = 0;
+   int32_t num_kv_heads = 0;
+   int32_t head_dim = 0;  // hidden_size / num_attention_heads
+ };
> **Collaborator:** Do not delete all comments here!

/**
 * @brief Base class for benchmark templates
 *
 * All model benchmark implementations should inherit from this class and implement all virtual functions.
 */
class BenchmarkTemplate {

@@ -32,21 +40,21 @@ class BenchmarkTemplate {
  /**
   * @brief Print model information
   *
   * Should output model key parameters such as number of layers, hidden size, attention heads, etc.
   */
  virtual void printModelInfo() = 0;

  /**
   * @brief Warmup run
   *
   * Run the model once with small-scale input to ensure the model enters a stable state.
   */
  virtual void warmup() = 0;

  /**
   * @brief Clear cache
   *
   * Clear KV cache and performance counters to prepare for the next test.
   */
  virtual void clear() = 0;

@@ -58,4 +66,7 @@ class BenchmarkTemplate {
   * @return Test results
   */
  virtual BenchmarkTemplateResult run(int32_t pp, int32_t tg) = 0;

+ // KV cache size estimation; return std::nullopt if unsupported.
+ virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
};
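Models opt in to the KV estimate by overriding `kvEstimateInfo()` to report their dimensions. A minimal sketch of such an override inside a concrete benchmark class; the dimensions here are assumed for illustration (real implementations would read them from the model config), and the other required overrides are elided:

```cpp
// Sketch only: inside a concrete BenchmarkTemplate subclass (other overrides elided).
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
  // Assumed example dimensions, not taken from any real model config.
  return KVCacheEstimateInfo{/*num_layers=*/28, /*num_kv_heads=*/8, /*head_dim=*/128};
}
```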

> **Reviewer:** There is no need to have a `Why` section.