diff --git a/docs/accuracy/accuracy-benchmarking.md b/docs/accuracy/accuracy-benchmarking.md
index c58b7ac96..36612868c 100644
--- a/docs/accuracy/accuracy-benchmarking.md
+++ b/docs/accuracy/accuracy-benchmarking.md
@@ -74,6 +74,7 @@ system message).
 | `aime` | `math` | 8 | `Maxwell-Jia/AIME_2024` (trt-llm reference, 8-shot CoT) |
 | `hellaswag` | `exact_match` | 10 | `Rowan/hellaswag` (trt-llm/DeepEval reference; one few-shot per unique activity_label) |
 | `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) |
+| `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
 
 ## CLI Flags
 
diff --git a/docs/accuracy/accuracy_stubs.md b/docs/accuracy/accuracy_stubs.md
index ad8e8c71c..9c95af783 100644
--- a/docs/accuracy/accuracy_stubs.md
+++ b/docs/accuracy/accuracy_stubs.md
@@ -7,7 +7,7 @@
 
 This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.
 
-**Status summary:** With the BigBench-Hard loader landing on top of the HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, and `BigBenchBenchmark` are fully implemented; the remaining benchmarks (`aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
+**Status summary:** With the AIME24 loader landing on top of the BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, and `AIME24Benchmark` are fully implemented; the remaining benchmarks (`aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
 
 ## Table of Contents
 
@@ -173,16 +173,16 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
 | 2 | `AIMEBenchmark` | `benchmarks/aime.py` | `aime` | `math` | 8 | **IMPLEMENTED.** Loads `Maxwell-Jia/AIME_2024`, instructs the model to wrap its final integer in `\boxed{}`, supports few-shot priming and chain-of-thought. `default_enable_cot=true`. |
 | 3 | `HellaSwagBenchmark` | `benchmarks/hellaswag.py` | `hellaswag` | `exact_match` | 10 | **IMPLEMENTED.** Loads `Rowan/hellaswag` (validation split filtered per task by `activity_label`; train split feeds the "one few-shot per unique activity_label" rule). Prompt rendering delegates to `deepeval.benchmarks.HellaSwag`'s `HellaSwagTemplate.generate_output`, so output is byte-equal to the trt-llm recipe's DeepEval-backed path. Pairs with `exact_match` for strict `Scorer.exact_match_score` semantics. Requires the `[accuracy]` extras (deepeval). |
 | 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). |
+| 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. |
 
 ### Still Stubbed
 
 | # | Class | File | Plugin Key | Default Grader | Default N-Shots |
 |---|-------|------|------------|----------------|-----------------|
-| 1 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
-| 2 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
-| 3 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
-| 4 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
-| 5 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
+| 1 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
+| 2 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
+| 3 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
+| 4 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
 
 **Each benchmark has 1 method to implement:**
 
@@ -309,13 +309,13 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
 | Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
 |-----------|-------------|---------------|------------------|-------------------|
 | Graders | 7 (all) | 0 | — | 0 |
-| Benchmarks | 4 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) | 5 | 1 (`load_problems`) | 5 |
+| Benchmarks | 5 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`) | 4 | 1 (`load_problems`) | 4 |
 | Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 |
 | Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 |
 | Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 |
 | Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 |
 | Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
-| **Total** | **15** | **6** | | **6** |
+| **Total** | **16** | **5** | | **5** |
 
 ### Self-Disabling Pattern
 
@@ -323,10 +323,10 @@ Processors and exporters raise their `Disabled` exception **in `__init__`** when
 
 ### Suggested Implementation Order
 
-The processors, exporters, all graders, and four benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) are already wired end-to-end. The remaining work is the five stub benchmarks; mirror the existing loader whose grader matches:
+The processors, exporters, all graders, and five benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`) are already wired end-to-end. The remaining work is the four stub benchmarks; mirror the existing loader whose grader matches:
 
-1. **`aime24`, `aime25`, `math_500`** — mirror `AIMEBenchmark` (`benchmarks/aime.py`); pair with the `math` grader.
-2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `multiple_choice` grader.
+1. **`aime25`, `math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_expr` (aime25) or `lighteval_latex` (math_500) instead of the currently registered `math` default.
+2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader instead of the currently registered `multiple_choice` default.
 3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
 4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
 
diff --git a/src/aiperf/accuracy/benchmarks/aime24.py b/src/aiperf/accuracy/benchmarks/aime24.py
index da5ecb079..b40baf2b8 100644
--- a/src/aiperf/accuracy/benchmarks/aime24.py
+++ b/src/aiperf/accuracy/benchmarks/aime24.py
@@ -1,31 +1,108 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 
+"""AIME 2024 benchmark loader, aligned with the trt-llm lighteval reference.
+
+Mirrors the recipe's ``acc_bench_lighteval.py`` configuration:
+
+    aime24 = LightevalTaskConfig(
+        name="aime24",
+        prompt_function=aime_prompt_fn,
+        hf_repo="HuggingFaceH4/aime_2024",
+        hf_subset="default",
+        evaluation_splits=["train"],
+        few_shots_split=None,
+        few_shots_select=None,
+        generation_size=32768,
+        metric=[expr_gold_metric],
+    )
+
+The recipe's ``aime_prompt_fn`` produces a ``Doc`` whose ``query`` is
+the bare problem text — lighteval's prompt manager wraps it as a
+single user message with no instruction prefix and no few-shot
+priming (``few_shots_split=None``). We emit prompts the same way.
+Pair with ``LightevalExprGrader`` for the recipe's ``expr_gold_metric``
+extraction.
+
+Reference:
+    trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:128
+"""
+
 from __future__ import annotations
 
-from typing import TYPE_CHECKING
+import asyncio
+from typing import TYPE_CHECKING, Any
+
+from datasets import Dataset, load_dataset
 
-from aiperf.accuracy.models import BenchmarkProblem
+from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
 from aiperf.common.mixins import AIPerfLoggerMixin
 
 if TYPE_CHECKING:
     from aiperf.config.resolution.plan import BenchmarkRun
 
+DATASET_NAME = "HuggingFaceH4/aime_2024"
+TASK_NAME = "aime24"
+
+# lighteval's aime24 task config: ``generation_size=32768`` to give
+# reasoning models room to think before emitting the final answer.
+DEFAULT_GENERATION_SIZE = 32768
+
+# Schema field names in HuggingFaceH4/aime_2024 (lowercase, lighteval
+# canonical — distinct from the Maxwell-Jia mirror used by ``aime``).
+PROBLEM_FIELD = "problem"
+ANSWER_FIELD = "answer"
+
 
 class AIME24Benchmark(AIPerfLoggerMixin):
-    """Registered placeholder for a future AIME 2024 loader.
+    """AIME 2024 lighteval-aligned benchmark loader.
 
-    `load_problems()` intentionally raises NotImplementedError in this release;
-    use the MMLU benchmark when a working accuracy loader is required.
+    Loads ``HuggingFaceH4/aime_2024`` (train split) and emits one user
+    message per problem containing the bare problem text — the format
+    lighteval's ``aime_prompt_fn`` + ``PromptManager`` produce when
+    ``few_shots_split=None``. Pair with ``LightevalExprGrader`` for
+    grading parity with the recipe.
     """
 
-    def __init__(self, run: BenchmarkRun, **kwargs) -> None:
+    def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
         super().__init__(**kwargs)
         self.run = run
 
     async def load_problems(
         self, tasks: list[str] | None, n_shots: int, enable_cot: bool
     ) -> list[BenchmarkProblem]:
-        raise NotImplementedError(
-            "aime24 benchmark is not yet implemented; only 'mmlu' is available in this release."
-        )
+        """Load AIME24 problems and format them lighteval-style.
+
+        Args:
+            tasks: Ignored — AIME24 has no subtasks.
+            n_shots: Ignored — the lighteval reference is zero-shot
+                (``few_shots_split=None``); accepting the parameter
+                keeps the protocol uniform but emitting few-shots
+                here would diverge from the reference.
+            enable_cot: Ignored — lighteval's ``aime_prompt_fn`` does
+                not add a CoT trigger; the model decides whether to
+                reason based on the system prompt the user provides
+                via ``--accuracy-system-prompt``.
+
+        Returns:
+            One ``BenchmarkProblem`` per dataset row, in dataset
+            order.
+        """
+        ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="train")
+        return await asyncio.to_thread(self._build_problems, ds)
+
+    def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]:
+        problems: list[BenchmarkProblem] = []
+        for row in ds:
+            problem = row[PROBLEM_FIELD]
+            messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}]
+            problems.append(
+                BenchmarkProblem(
+                    prompt=problem,
+                    ground_truth=str(row[ANSWER_FIELD]),
+                    task=TASK_NAME,
+                    metadata={"generation_size": DEFAULT_GENERATION_SIZE},
+                    raw_messages=messages,
+                )
+            )
+        return problems
diff --git a/src/aiperf/plugin/plugins.yaml b/src/aiperf/plugin/plugins.yaml
index b11c97a2d..db3a3eb51 100644
--- a/src/aiperf/plugin/plugins.yaml
+++ b/src/aiperf/plugin/plugins.yaml
@@ -1255,11 +1255,12 @@ accuracy_benchmark:
   aime24:
     class: aiperf.accuracy.benchmarks.aime24:AIME24Benchmark
     description: |
-      AIME 2024 benchmark with problems from the 2024 competition year.
+      AIME 2024 benchmark, aligned with the trt-llm benchmark recipe's
+      lighteval-backed configuration (HuggingFaceH4/aime_2024 + lighteval
+      ``expr_gold_metric``).
     metadata:
-      default_grader: math
+      default_grader: lighteval_expr
       default_n_shots: 0
-      is_implemented: false
 
   aime25:
     class: aiperf.accuracy.benchmarks.aime25:AIME25Benchmark
diff --git a/tests/unit/accuracy/test_accuracy_config.py b/tests/unit/accuracy/test_accuracy_config.py
index 5e78125ce..8618451db 100644
--- a/tests/unit/accuracy/test_accuracy_config.py
+++ b/tests/unit/accuracy/test_accuracy_config.py
@@ -23,7 +23,6 @@
 # This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``,
 # so those names are absent from the stub lists.
 STUB_BENCHMARKS = (
-    "aime24",
     "aime25",
     "math_500",
     "gpqa_diamond",
diff --git a/tests/unit/accuracy/test_aime24_benchmark.py b/tests/unit/accuracy/test_aime24_benchmark.py
new file mode 100644
index 000000000..fc0507681
--- /dev/null
+++ b/tests/unit/accuracy/test_aime24_benchmark.py
@@ -0,0 +1,223 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Unit tests for ``AIME24Benchmark`` after lighteval alignment.
+
+The recipe's ``acc_bench_lighteval.py`` uses ``aime_prompt_fn``, which
+builds ``Doc(query=line["problem"], choices=[line["answer"]],
+gold_index=0)`` — lighteval's prompt manager then renders the bare
+problem text as a single user message with no instruction prefix and
+no few-shot priming (``few_shots_split=None``). These tests pin that
+behavior so any future divergence fails loudly.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from aiperf.accuracy.benchmarks.aime24 import (
+    DEFAULT_GENERATION_SIZE,
+    TASK_NAME,
+    AIME24Benchmark,
+)
+from aiperf.accuracy.models import BenchmarkProblem
+from aiperf.plugin.enums import AccuracyBenchmarkType, EndpointType
+from tests.unit.conftest import make_benchmark_run
+
+
+def _make_run():
+    return make_benchmark_run(
+        model_names=["test-model"],
+        endpoint_type=EndpointType.COMPLETIONS,
+        streaming=False,
+        accuracy={"benchmark": AccuracyBenchmarkType.AIME24},
+    )
+
+
+def _make_row(problem: str = "What is 1+1?", answer: int = 2) -> dict[str, Any]:
+    return {"problem": problem, "answer": answer}
+
+
+def _make_fake_dataset(rows: list[dict[str, Any]]) -> MagicMock:
+    ds = MagicMock()
+    ds.__iter__ = MagicMock(side_effect=lambda: iter(rows))
+    ds.__len__ = MagicMock(return_value=len(rows))
+    ds.__getitem__ = MagicMock(side_effect=lambda i: rows[i])
+    return ds
+
+
+class TestPromptIsBareProblemText:
+    """Lighteval's ``aime_prompt_fn`` + ``PromptManager`` produce the
+    bare problem text as the user message — no instruction prefix, no
+    decoration."""
+
+    @pytest.mark.asyncio
+    async def test_flat_prompt_is_problem_text(self) -> None:
+        rows = [_make_row("Compute the answer.", 42)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].prompt == "Compute the answer."
+
+    @pytest.mark.asyncio
+    async def test_no_instruction_prefix(self) -> None:
+        """None of the instruction phrasings used by the old or new
+        AIME loader may appear in an AIME24 prompt."""
+        rows = [_make_row("Q?", 1)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        prompt = problems[0].prompt
+        assert "**Problem**" not in prompt
+        assert "competition math" not in prompt
+        assert "Let's think" not in prompt
+        assert "boxed" not in prompt
+
+    @pytest.mark.asyncio
+    async def test_chat_message_is_single_user_message(self) -> None:
+        rows = [_make_row("Q?", 1)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        msgs = problems[0].raw_messages
+        assert msgs is not None
+        assert len(msgs) == 1
+        assert msgs[0]["role"] == "user"
+        assert msgs[0]["content"] == "Q?"
+
+
+class TestNShotsAndCoTAreIgnored:
+    """The lighteval reference is zero-shot with no CoT trigger. AIPerf
+    keeps the parameters in the signature for protocol uniformity but
+    must not let them change the prompt — that would diverge from the
+    reference."""
+
+    @pytest.mark.asyncio
+    async def test_n_shots_argument_does_not_affect_prompt(self) -> None:
+        rows = [_make_row(f"q{i}", i) for i in range(3)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            zero_shot = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+            five_shot = await bench.load_problems(
+                tasks=None, n_shots=5, enable_cot=False
+            )
+        assert [p.prompt for p in zero_shot] == [p.prompt for p in five_shot]
+
+    @pytest.mark.asyncio
+    async def test_enable_cot_does_not_affect_prompt(self) -> None:
+        rows = [_make_row("Q?", 1)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            no_cot = await bench.load_problems(tasks=None, n_shots=0, enable_cot=False)
+            with_cot = await bench.load_problems(tasks=None, n_shots=0, enable_cot=True)
+        assert no_cot[0].prompt == with_cot[0].prompt
+
+
+class TestLoadProblemsCore:
+    @pytest.mark.asyncio
+    async def test_returns_one_problem_per_row(self) -> None:
+        rows = [_make_row(f"q{i}", i) for i in range(5)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert len(problems) == 5
+        assert all(isinstance(p, BenchmarkProblem) for p in problems)
+
+    @pytest.mark.asyncio
+    async def test_ground_truth_is_string_form_of_answer(self) -> None:
+        rows = [_make_row("q", 42)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].ground_truth == "42"
+
+    @pytest.mark.asyncio
+    async def test_task_name_is_aime24(self) -> None:
+        rows = [_make_row("q", 1)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].task == TASK_NAME
+
+    @pytest.mark.asyncio
+    async def test_generation_size_is_32k(self) -> None:
+        """Lighteval's aime24 task config sets ``generation_size=32768``."""
+        rows = [_make_row("q", 1)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].metadata["generation_size"] == DEFAULT_GENERATION_SIZE
+        assert DEFAULT_GENERATION_SIZE == 32768
+
+
+class TestPathologicalDatasetRows:
+    @pytest.mark.asyncio
+    async def test_empty_dataset_returns_empty_list(self) -> None:
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset([]),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems == []
+
+    @pytest.mark.asyncio
+    async def test_unicode_problem_text_preserved(self) -> None:
+        rows = [_make_row("Solve ∑₁ⁿ k² for n=10. ✓", 385)]
+        with patch(
+            "aiperf.accuracy.benchmarks.aime24.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = AIME24Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert "∑₁ⁿ" in problems[0].prompt