cicd: Add sanity test script #2212

kahyunnam · 2025-12-12T02:07:49Z

📌 Description

This PR adds a sanity test mode for use in testing more CTK versions. It adds a --sanity-test flag to [scripts/task_test_blackwell_kernels.sh](https://github.com/flashinfer-ai/flashinfer/blob/main/scripts/task_test_blackwell_kernels.sh), our current unit testing script. Instead of running every test combination, sanity mode samples every Nth test in each test suite, where N is set by SAMPLE_RATE. The default is 5 (${SAMPLE_RATE:=5}), which gives roughly 20% coverage.

Example output for bash scripts/task_test_blackwell_kernels.sh --dry-run --sanity-test:

[1] Collecting tests from: tests/attention/test_alibi.py
  Total test cases: 216
  Sampled test cases: 44 (every 5th test)
  Sample of tests that would run:
    tests/attention/test_alibi.py::test_single_decode_alibi[128-4-1]
    tests/attention/test_alibi.py::test_single_decode_alibi[128-8-9]
    tests/attention/test_alibi.py::test_single_decode_alibi[128-32-81]
    tests/attention/test_alibi.py::test_single_decode_alibi[256-4-729]
    tests/attention/test_alibi.py::test_single_decode_alibi[256-32-1]
    ... and 39 more

[2] Collecting tests from: tests/attention/test_attention_sink.py
  Total test cases: 1504
  Sampled test cases: 301 (every 5th test)
  Sample of tests that would run:
    tests/attention/test_attention_sink.py::test_attention_sink[fa2-True--1-8-32-1-1-dtype0]
    tests/attention/test_attention_sink.py::test_attention_sink[fa2-True--1-8-32-1-16-dtype1]
    tests/attention/test_attention_sink.py::test_attention_sink[fa2-True--1-8-32-4-16-dtype0]
    tests/attention/test_attention_sink.py::test_attention_sink[fa2-True--1-8-32-16-4-dtype1]
    tests/attention/test_attention_sink.py::test_attention_sink[fa2-True--1-8-32-128-4-dtype0]
    ... and 296 more

(... output for [3] through [82] omitted ...)

[83] Collecting tests from: tests/utils/test_sampling.py
  Total test cases: 921
  Sampled test cases: 185 (every 5th test)
  Sample of tests that would run:
    tests/utils/test_sampling.py::test_softmax[True-True-1.0-normal_distribution(std=1)-111-1]
    tests/utils/test_sampling.py::test_softmax[True-True-1.0-normal_distribution(std=1)-32000-989]
    tests/utils/test_sampling.py::test_softmax[True-True-1.0-normal_distribution(std=5)-111-99]
    tests/utils/test_sampling.py::test_softmax[True-True-1.0-normal_distribution(std=5)-128256-1]
    tests/utils/test_sampling.py::test_softmax[True-True-1.0-gumbel_distribution(beta=0.1)-111-989]
    ... and 180 more

[84] Collecting tests from: tests/utils/test_triton_cascade.py
  Total test cases: 4
  Sampled test cases: 1 (every 5th test)
  Sample of tests that would run:
    tests/utils/test_triton_cascade.py::test_merge_state[128-32-2048]
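
The sampled counts in the dry-run output above are consistent with keeping the 1st, 6th, 11th, ... collected test in each file. Here is a minimal sketch, assuming the selection rule is `NR % SAMPLE_RATE == 1` over the collected node IDs. The helper name `sampled_count` is hypothetical, and `seq` stands in for pytest's collected IDs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how many of `total` items survive
# "keep every Nth item, starting with the first".
sampled_count() {
    local total=$1 rate=$2
    seq "$total" | awk -v rate="$rate" 'NR % rate == 1' | wc -l | tr -d ' '
}

# Totals taken from the dry-run output above.
for total in 216 1504 921 4; do
    echo "total=$total sampled=$(sampled_count "$total" 5)"
done
```

Under this rule the count is the ceiling of total / 5, so a very small file such as the 4-test one ends up covered at 25% rather than 20%.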

🔍 Related Issues

As discussed in weekly meeting and on slack.

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

New Features
- Added a Sanity Test mode that runs a sampled subset of tests instead of the full suite.
- New CLI option to enable sanity-mode and configure sample rate (with a random offset) to vary samples.
- Dry-run now previews per-file sampled tests and reports estimated coverage and sampling details.
- Actual sanity runs execute only sampled tests and produce per-run summaries with coverage metrics and failed-test lists.


gemini-code-assist · 2025-12-12T02:08:04Z

Summary of Changes

Hello @kahyunnam, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new sanity test script, scripts/task_test_blackwell_kernels_lite.sh, which aims to streamline the testing process for various CTK versions. Unlike the full unit test suite, this script intelligently samples a fraction of tests from each suite, providing quicker feedback and broader compatibility checks. It also features a dry-run mode for planning test execution and integrates seamlessly with existing pytest configurations.

Highlights

New Sanity Test Script: A new bash script, scripts/task_test_blackwell_kernels_lite.sh, has been added to the repository.
Partial Test Execution: This script is designed to run a sampled subset of tests, specifically every Nth test (defaulting to every 5th test, providing approximately 20% coverage) from each test suite.
Purpose: The primary goal of this new script is to enable more efficient testing across various CTK (CUDA Toolkit) versions, offering a lighter and faster alternative to the full unit test suite.
Dry Run Mode: The script includes a --dry-run option, allowing users to preview which tests would be executed without actually running them, aiding in test planning and debugging.
Pytest Integration: It leverages pytest for test collection and execution, supports JUnit XML reporting, and respects directory exclusions defined in pytest.ini.


Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2025-12-12T02:08:42Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a sampled SANITY_TEST mode to scripts/task_test_blackwell_kernels.sh that selects tests by configurable SAMPLE_RATE and random SAMPLE_OFFSET, supports dry-run previews, runs only sampled tests per file, and reports per-file and aggregate sampling and run summaries.

Changes

Cohort / File(s)	Change Summary
Test sampling script `scripts/task_test_blackwell_kernels.sh`	Adds `--sanity-test` CLI flag, `SAMPLE_RATE` (default 5) and `SAMPLE_OFFSET`; implements per-file pytest collection, computes `TOTAL_IN_FILE`/`TOTAL_TEST_CASES`, derives sampled node IDs (`SAMPLED_NODE_IDS`) and counts (`SAMPLED_IN_FILE`, `SAMPLED_TEST_CASES`) via modulo sampling; supports dry-run reporting with per-file previews and SANITY DRY RUN SUMMARY; in actual SANITY runs executes only sampled tests, records per-file PASS/FAIL and aggregates `FAILED_TESTS`; preserves FULL mode behavior and adds detailed messaging and reproduction hints.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Verify sampling arithmetic (SAMPLE_RATE, SAMPLE_OFFSET, modulo) and reproducibility.
Check per-file collection logic and handling when no tests are sampled.
Validate aggregation of counters and failure reporting for dry-run vs actual-run branches.
Review CLI parsing additions and messaging for clarity.

Suggested reviewers

yongwww
yzh119
nvmbreughe

Poem

🐰 I hop through lists of tests and pick a few,
A numbered nibble — offset makes it new.
Dry-run peeks then real runs skip the rest;
Five at a time, a tidy little quest. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'cicd: Add sanity test script' clearly and concisely describes the main change—adding a sanity test feature to the existing test script.
Description check	✅ Passed	The PR description comprehensively covers all template sections with detailed explanation of the feature, includes example output, and marks all checklist items as complete.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

scripts/task_test_blackwell_kernels_lite.sh (3)
20-22: Make collection use the same pytest flags as execution (or intentionally keep them different).
Right now PYTEST_FLAGS doesn’t apply to --collect-only, so collection failures/behavior may diverge from the run phase.
@@
-        COLLECTION_OUTPUT=$(pytest --collect-only -q "$test_file" 2>&1)
+        COLLECTION_OUTPUT=$(pytest $PYTEST_FLAGS --collect-only -q "$test_file" 2>&1)
@@
-        COLLECTION_OUTPUT=$(pytest --collect-only -q "$test_file" 2>&1)
+        COLLECTION_OUTPUT=$(pytest $PYTEST_FLAGS --collect-only -q "$test_file" 2>&1)
Also applies to: 121-126, 190-195

5-8: Remove or wire up unused env vars (MAX_JOBS, CUDA_VISIBLE_DEVICES) to avoid confusion.
They’re set but not referenced anywhere in the script.

If unused: drop them.

If intended: apply them (e.g., export CUDA_VISIBLE_DEVICES for pytest runs; use MAX_JOBS with pytest -n if xdist is in use).

48-58: pytest.ini norecursedirs parsing is likely wrong (space-separated patterns, not comma-separated).
Current code only handles comma-separated values and may miss exclusions.

Suggested direction: parse everything after = and treat it as whitespace-separated tokens (and ignore comments), without the comma transform.
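
One way to read the suggestion above is sketched below. This is not the script's actual parser. It assumes the `norecursedirs` option sits on a single line and that `#` starts a trailing comment; the helper name `parse_norecursedirs` and the demo ini contents are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one norecursedirs pattern per line.
# Assumes the option sits on a single line and that '#' starts a comment.
parse_norecursedirs() {
    local ini_file=$1
    sed -n 's/^[[:space:]]*norecursedirs[[:space:]]*=[[:space:]]*//p' "$ini_file" \
        | sed 's/[[:space:]]*#.*$//' \
        | tr -s '[:space:]' '\n' \
        | sed '/^$/d'
}

# Demo with a throwaway ini file (contents are made up for illustration).
demo_ini=$(mktemp)
printf '[pytest]\nnorecursedirs = build dist  .git  # skip these\n' > "$demo_ini"
parse_norecursedirs "$demo_ini"
rm -f "$demo_ini"
```

Splitting on any whitespace run means both spaces and tabs between patterns are handled, and no comma transform is needed.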

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f4e806 and 1b456fc.

📒 Files selected for processing (1)

scripts/task_test_blackwell_kernels_lite.sh (1 hunks)

🧰 Additional context used

🪛 Shellcheck (0.11.0)

scripts/task_test_blackwell_kernels_lite.sh

[warning] 235-235: Quote this to prevent word splitting.

(SC2046)

scripts/task_test_blackwell_kernels_lite.sh

gemini-code-assist

Code Review

This pull request introduces a new sanity test script, task_test_blackwell_kernels_lite.sh, which is a great addition for running a subset of tests. The script is well-structured, but I've identified a few areas for improvement to enhance its robustness and fix a bug in the dry-run functionality. My review includes suggestions to address these points, primarily concerning shell scripting best practices for handling environment variables and command arguments.

scripts/task_test_blackwell_kernels_lite.sh

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (3)

scripts/task_test_blackwell_kernels_lite.sh (3)
5-8: Validate SAMPLE_RATE (avoid divide-by-zero / awk modulo errors + misleading coverage).
Today SAMPLE_RATE=0 or non-integer will break arithmetic (100 / SAMPLE_RATE) and sampling (awk NR % ...).
 : ${CUDA_VISIBLE_DEVICES:=0}
 : ${SAMPLE_RATE:=5}  # Run every Nth test (5 = ~20% coverage)
+
+if ! [[ "${SAMPLE_RATE}" =~ ^[1-9][0-9]*$ ]]; then
+  echo "ERROR: SAMPLE_RATE must be a positive integer (got: ${SAMPLE_RATE})" >&2
+  exit 2
+fi
Also applies to: 17-18, 142-145, 212-215, 255-260

174-175: Fix JUnit XML output path (slashes in ${test_file} create invalid paths / missing dirs).
--junitxml=${JUNIT_DIR}/${test_file}.xml will typically fail because ${test_file} includes / (e.g., tests/foo/test_bar.py).
     mkdir -p "${JUNIT_DIR}"
@@
-        JUNIT_FLAG="--junitxml=${JUNIT_DIR}/${test_file}.xml"
+        # junitxml must be a valid file path; sanitize test_file into a flat name
+        junit_name="$(echo "${test_file}" | sed 's#^\./##' | sed 's#/#_#g')"
+        JUNIT_FLAG="--junitxml=${JUNIT_DIR}/${junit_name}.xml"
Also applies to: 226-235

23-29: Don’t override env DRY_RUN, and fix the help text (DRY_RUN=true).
Hard-setting DRY_RUN=false prevents DRY_RUN=true ./script.sh from working; also the hint currently says DRY_RUN=false.
-DRY_RUN=false
+: ${DRY_RUN:=false}
 if [[ "$1" == "--dry-run" ]] || [[ "${DRY_RUN}" == "true" ]]; then
@@
-    echo "Or set DRY_RUN=false $0"
+    echo "Or set DRY_RUN=true $0"
Also applies to: 170-173

🧹 Nitpick comments (1)

scripts/task_test_blackwell_kernels_lite.sh (1)

62-89: Prefer arrays for TEST_FILES to avoid word-splitting hazards and simplify looping.
Even if paths “shouldn’t” contain spaces, using string-accumulation (TEST_FILES="$TEST_FILES $test_file") + for test_file in $TEST_FILES is fragile and can mis-handle edge cases.

-ALL_TEST_FILES=$(find tests/ -name "test_*.py" -type f | sort)
+mapfile -t ALL_TEST_FILES < <(find tests/ -name "test_*.py" -type f | sort)
@@
-TEST_FILES=""
-for test_file in $ALL_TEST_FILES; do
+TEST_FILES=()
+for test_file in "${ALL_TEST_FILES[@]}"; do
@@
-        TEST_FILES="$TEST_FILES $test_file"
+        TEST_FILES+=("$test_file")
     fi
 done
@@
-TEST_FILES=$(echo "$TEST_FILES" | xargs)
-
-if [ -z "$TEST_FILES" ]; then
+if [ "${#TEST_FILES[@]}" -eq 0 ]; then
     echo "No test files found in tests/ directory (after exclusions)"
     exit 1
 fi
@@
-for test_file in $TEST_FILES; do
+for test_file in "${TEST_FILES[@]}"; do
     echo "  $test_file"
 done
@@
-    for test_file in $TEST_FILES; do
+    for test_file in "${TEST_FILES[@]}"; do
@@
-    for test_file in $TEST_FILES; do
+    for test_file in "${TEST_FILES[@]}"; do

Also applies to: 96-98, 114-116, 180-182

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b456fc and 5a2db82.

📒 Files selected for processing (1)

scripts/task_test_blackwell_kernels_lite.sh (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Deploy Docs

bkryu

Hi @kahyunnam, it seems like the newly added script is essentially a duplicate of task_test_blackwell_kernels.sh. Instead of creating a copy that adds maintenance burden going forward, how do we feel about extending the existing script so that it takes an input flag? For example:

bash scripts/task_test_blackwell_kernels.sh --> Same as current existing
bash scripts/task_test_blackwell_kernels.sh --sanity_test --> The subsampled testing done in this one

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

scripts/task_test_blackwell_kernels.sh (1)
185-241: Suggested refactor: Extract test collection and sampling logic into helper functions.

The test collection logic (pytest --collect-only, node ID extraction, error handling) is duplicated across the dry-run sanity block (lines 196–209) and full-execution sanity block (lines 281–295). Similarly, the sampling logic (lines 216–218 vs. 302–304) is repeated verbatim. Extracting these into helper functions would reduce duplication and make the code easier to maintain.

Consider defining helper functions near the top of the script:
# Helper: Collect test node IDs from a test file
collect_test_node_ids() {
    local test_file=$1
    # Declare separately from assignment: `local var=$(cmd)` would make $?
    # report the status of `local` (always 0) instead of pytest's exit code.
    local collection_output collection_exit_code node_ids
    set +e
    collection_output=$(pytest --collect-only -q "$test_file" 2>&1)
    collection_exit_code=$?
    set -e

    # Keep only lines that look like pytest node IDs (they contain "::")
    node_ids=$(echo "$collection_output" | grep "::" || true)

    if [ -z "$node_ids" ]; then
        return $collection_exit_code  # 0 if success, >0 if collection failed
    fi

    echo "$node_ids"
    return 0
}

# Helper: Sample test node IDs at a given rate
sample_test_node_ids() {
    local node_ids=$1
    local sample_rate=$2
    echo "$node_ids" | awk "NR % $sample_rate == 1"
}
Then replace the duplicated blocks in lines 196–218 and 281–304 with calls to these helpers, reducing cognitive load and improving maintainability.

Also applies to: 265-334

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a2db82 and f212c33.

📒 Files selected for processing (1)

scripts/task_test_blackwell_kernels.sh (3 hunks)

🔇 Additional comments (5)

scripts/task_test_blackwell_kernels.sh (5)

34-45: Mode messaging is clear and user-friendly.

The output for DRY_RUN and SANITY_TEST modes (lines 34–45) is well-structured, with emoji indicators and an intelligible coverage estimate. The calculated percentage using $((100 / SAMPLE_RATE)) is helpful context.

Once SAMPLE_RATE validation is in place (see critical issue above), this messaging will be safe and informative.

195-198: Appropriate error handling for pytest collection failures.

The temporary disabling of set -e (lines 195–198 and 280–283) before pytest collection and re-enabling afterward is a good pattern. It allows the script to gracefully handle collection errors without terminating early, while preserving the strict error-handling mode for other operations.

Also applies to: 280-283
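
The pattern described above can be sketched as follows. The function `fake_collect` is a made-up stand-in for `pytest --collect-only` so that the example runs without pytest, and `run_collect` is a hypothetical wrapper rather than code from the script.

```shell
#!/usr/bin/env bash
set -e

# Stand-in for `pytest --collect-only`: prints one node ID, then fails.
fake_collect() {
    echo "tests/x.py::test_a"
    return 2
}

run_collect() {
    # Declare first, assign later: `local output=$(...)` would reset $? to 0.
    local output rc
    set +e
    output=$(fake_collect)
    rc=$?
    set -e
    echo "rc=$rc ids=$output"
}

run_collect
```

Note that the wrapper re-enables `set -e` unconditionally, which matches a script that runs in strict mode throughout. A caller that had errexit disabled would need to save and restore the previous state instead.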

309-313: Good defensive check for empty sampled test set.

Lines 309–313 correctly skip files with no sampled tests, preventing unnecessary pytest invocations and avoiding potential issues with empty argument lists.

316-316: Correct bash array construction and argument passing.

Line 316 uses mapfile -t to safely convert newline-delimited test node IDs into a bash array with trimmed newlines. Line 324 then correctly expands the array with "${SAMPLED_NODE_IDS_ARRAY[@]}", ensuring each test node ID is passed as a separate argument to pytest.

Also applies to: 324-324
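
A small sketch of the array handling praised above, assuming bash 4 or newer (where `mapfile` is available). The node IDs and the helper `count_args` are made up, and feeding `mapfile` from a here-string is an assumption about how the script might do it.

```shell
#!/usr/bin/env bash
# Requires bash 4+ for mapfile.
# Hypothetical helper: report how many arguments were received.
count_args() { echo "$#"; }

# Made-up node IDs; the second contains a space inside its parameter list.
node_ids=$'tests/a.py::test_x[1-2]\ntests/a.py::test_y[std=1 normal]'

# One array element per line, with the trailing newline stripped by -t.
mapfile -t ids_array <<< "$node_ids"

# Quoted expansion passes each node ID as exactly one argument.
count_args "${ids_array[@]}"
```

With the quoted `"${ids_array[@]}"` expansion, the ID containing a space still reaches the command as a single argument.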

237-237: Division by zero protection in coverage calculation.

Lines 237 and 346 check if [ "$TOTAL_TEST_CASES" -gt 0 ] before calculating the coverage percentage, preventing a division-by-zero error if no tests are collected. This is solid defensive programming.

Also applies to: 346-346
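
A guarded coverage calculation along these lines might look like the sketch below. The function name `coverage_percent` is hypothetical, and reporting 0 when nothing was collected is an assumption about the desired behavior.

```shell
#!/usr/bin/env bash
# Hypothetical helper: integer coverage percentage, guarded against total=0.
coverage_percent() {
    local sampled=$1 total=$2
    if [ "$total" -gt 0 ]; then
        echo $(( sampled * 100 / total ))
    else
        echo 0
    fi
}

coverage_percent 44 216   # counts from the test_alibi.py dry-run example
coverage_percent 0 0      # nothing collected: no division is attempted
```

Bash arithmetic is integer-only, so the result is truncated. Multiplying by 100 before dividing keeps the precision that `sampled / total * 100` would lose.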

scripts/task_test_blackwell_kernels.sh

kahyunnam · 2025-12-12T21:50:49Z

/bot run

flashinfer-bot · 2025-12-12T21:51:26Z

GitLab MR !194 has been created, and the CI pipeline #40119131 is currently running. I'll report back once the pipeline job completes.

scripts/task_test_blackwell_kernels.sh

yzh119

Hi @kahyunnam, thanks for adding this feature.

Do we want any stochastic behavior in the sampling? If we always select every 5th unit test, the same combinations will be skipped on every run.

jimmyzho · 2025-12-15T23:01:36Z

Agree with @yzh119's comment on stochastic sampling (maybe instead of "NR % $SAMPLE_RATE == 1" we can compare against a randomly chosen remainder in the range 0 to SAMPLE_RATE-1?), but otherwise looks good from my end!

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

scripts/task_test_blackwell_kernels.sh (2)

8-13: Critical: Validate SAMPLE_RATE to prevent division by zero.

Lines 12 and 45 perform arithmetic with SAMPLE_RATE (RANDOM % SAMPLE_RATE and 100 / SAMPLE_RATE). If SAMPLE_RATE is 0, both will fail with division-by-zero errors.

Additionally, consider validating SAMPLE_OFFSET when set externally to ensure it's in the range [0, SAMPLE_RATE-1]:

 : ${SAMPLE_RATE:=5}  # Run every Nth test in sanity mode (5 = ~20% coverage)
+
+# Validate SAMPLE_RATE
+if ! [[ "$SAMPLE_RATE" =~ ^[0-9]+$ ]] || [ "$SAMPLE_RATE" -le 0 ]; then
+    echo "ERROR: SAMPLE_RATE must be a positive integer (got: $SAMPLE_RATE)" >&2
+    exit 1
+fi
 
 # Randomize starting offset (0 to SAMPLE_RATE-1) for sampling variety
 if [ -z "${SAMPLE_OFFSET:-}" ]; then
     SAMPLE_OFFSET=$((RANDOM % SAMPLE_RATE))
+else
+    # Validate externally-set SAMPLE_OFFSET
+    if ! [[ "$SAMPLE_OFFSET" =~ ^[0-9]+$ ]] || [ "$SAMPLE_OFFSET" -ge "$SAMPLE_RATE" ]; then
+        echo "ERROR: SAMPLE_OFFSET must be a non-negative integer less than SAMPLE_RATE (got: SAMPLE_OFFSET=$SAMPLE_OFFSET, SAMPLE_RATE=$SAMPLE_RATE)" >&2
+        exit 1
+    fi
 fi

320-325: Quote the command substitution (shellcheck SC2046).

Same issue as the dry-run path: Line 323's $(echo "$SAMPLED_NODE_IDS" | wc -l) should be quoted.

Apply this diff:

          # Sample every Nth test with random offset
         SAMPLED_NODE_IDS=$(echo "$ALL_NODE_IDS" | awk "NR % $SAMPLE_RATE == $SAMPLE_OFFSET")
         # Fallback: if no tests sampled (offset missed all tests), take the first test
-        if [ -z "$SAMPLED_NODE_IDS" ] || [ $(echo "$SAMPLED_NODE_IDS" | wc -l) -eq 0 ]; then
+        if [ -z "$SAMPLED_NODE_IDS" ] || [ "$(echo "$SAMPLED_NODE_IDS" | wc -l)" -eq 0 ]; then
             SAMPLED_NODE_IDS=$(echo "$ALL_NODE_IDS" | head -1)
         fi

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f212c33 and a749f5f.

📒 Files selected for processing (1)

scripts/task_test_blackwell_kernels.sh (3 hunks)

🧰 Additional context used

🪛 Shellcheck (0.11.0)

scripts/task_test_blackwell_kernels.sh

[warning] 224-224: Quote this to prevent word splitting.

(SC2046)

[warning] 323-323: Quote this to prevent word splitting.

(SC2046)

🔇 Additional comments (6)

scripts/task_test_blackwell_kernels.sh (6)

25-37: LGTM: Argument parsing is clear and correct.

The flag parsing logic properly handles both --dry-run and --sanity-test options and allows them to be combined.
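
For readers without the diff at hand, combinable flag parsing of this kind could look like the sketch below. The function name `parse_mode_flags` is hypothetical, and both resetting the variables on each call and rejecting unknown options are assumptions; the actual script may, for example, honor a `DRY_RUN` value from the environment.

```shell
#!/usr/bin/env bash
# Hypothetical parser: sets DRY_RUN and SANITY_TEST from the argument list.
parse_mode_flags() {
    DRY_RUN=false
    SANITY_TEST=false
    local arg
    for arg in "$@"; do
        case "$arg" in
            --dry-run)     DRY_RUN=true ;;
            --sanity-test) SANITY_TEST=true ;;
            *) echo "Unknown option: $arg" >&2; return 2 ;;
        esac
    done
}

parse_mode_flags --dry-run --sanity-test
echo "DRY_RUN=$DRY_RUN SANITY_TEST=$SANITY_TEST"
```

Looping over every argument, rather than inspecting only `$1`, is what lets the two flags be given together and in either order.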

44-51: LGTM: Clear mode announcements with helpful context.

The mode banners provide useful information including coverage estimates and sampling pattern details, which will help users understand and reproduce test runs.

239-255: LGTM: Comprehensive dry-run summary.

The summary provides clear metrics (coverage percentage, sample counts) and actionable reproduction instructions, which will be valuable for debugging and CI workflows.

338-354: LGTM: Correct array handling and test execution.

The use of mapfile -t to populate the array and "${SAMPLED_NODE_IDS_ARRAY[@]}" for proper expansion ensures test node IDs with spaces or special characters are handled correctly. The pass/fail tracking logic is also sound.

359-382: LGTM: Thorough sanity test summary with actionable metrics.

The summary provides comprehensive metrics including per-file pass/fail counts, coverage percentage, and exact reproduction commands. This will be very helpful for debugging CI failures and understanding test coverage.

383-418: LGTM: Full test mode preserves existing behavior.

The full test mode execution maintains the original script behavior while fitting well into the new dual-mode structure. The summary format is consistent with the sanity mode output.

scripts/task_test_blackwell_kernels.sh

kahyunnam · 2025-12-15T23:27:55Z

@yzh119 @jimmyzho I added the stochastic behavior in the latest revision, using @jimmyzho's suggested approach with a randomized offset in the range [0, sample_rate-1].
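
A minimal sketch of offset-based sampling with a first-test fallback, assuming node IDs arrive one per line on stdin and that the input is non-empty. The helper name `sample_with_offset` is hypothetical; `SAMPLE_RATE` and `SAMPLE_OFFSET` follow the names used in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read node IDs on stdin, keep lines where
# NR % rate == offset, and fall back to the first line if none match.
sample_with_offset() {
    local rate=$1 offset=$2 ids sampled
    ids=$(cat)
    sampled=$(printf '%s\n' "$ids" | awk -v r="$rate" -v o="$offset" 'NR % r == o')
    if [ -z "$sampled" ]; then
        sampled=$(printf '%s\n' "$ids" | head -n 1)
    fi
    printf '%s\n' "$sampled"
}

SAMPLE_RATE=5
SAMPLE_OFFSET=$((RANDOM % SAMPLE_RATE))   # random value in 0 .. SAMPLE_RATE-1
echo "offset=$SAMPLE_OFFSET"
seq 12 | sample_with_offset "$SAMPLE_RATE" "$SAMPLE_OFFSET"
```

The fallback matters for small files: with a rate of 5 and an offset of 0, a file with only 4 tests has no line number divisible by 5, so without it nothing would be selected. Passing a fixed offset instead of the random one reproduces a given sample.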

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (3)

scripts/task_test_blackwell_kernels.sh (3)
8-13: Consider validating user-provided SAMPLE_OFFSET.

The SAMPLE_RATE division-by-zero issue is already flagged in a previous comment. Additionally, when SAMPLE_OFFSET is provided externally, there's no validation ensuring 0 ≤ SAMPLE_OFFSET < SAMPLE_RATE. While the fallback logic (lines 224-226, 323-325) ensures at least one test runs even with an invalid offset, validating the range up front would provide clearer error messages.

Consider adding validation after line 13:
 if [ -z "${SAMPLE_OFFSET:-}" ]; then
     SAMPLE_OFFSET=$((RANDOM % SAMPLE_RATE))
 fi
+
+# Validate SAMPLE_OFFSET is in valid range
+if ! [[ "$SAMPLE_OFFSET" =~ ^[0-9]+$ ]] || [ "$SAMPLE_OFFSET" -ge "$SAMPLE_RATE" ]; then
+    echo "ERROR: SAMPLE_OFFSET must be an integer in range [0, SAMPLE_RATE) (got: SAMPLE_OFFSET=$SAMPLE_OFFSET, SAMPLE_RATE=$SAMPLE_RATE)" >&2
+    exit 1
+fi
191-237: LGTM: Solid dry-run sampling logic.

The collection and sampling logic is well-implemented:

Proper error handling for collection failures

Correct awk-based sampling with randomized offset

Fallback ensures at least one test per file

Clear per-file reporting

The unquoted command substitution at line 224 is already flagged in previous comments.

284-357: LGTM: Correct sanity test execution.

The execution mode properly mirrors the dry-run logic with consistent sampling behavior. The use of mapfile and array expansion to pass sampled test IDs to pytest is the correct approach for handling test node IDs with spaces or special characters.

The unquoted command substitution at line 323 is already flagged in previous comments.

🧹 Nitpick comments (1)

scripts/task_test_blackwell_kernels.sh (1)
222-226: Optional: Simplify fallback condition.

The second condition [ $(echo "$SAMPLED_NODE_IDS" | wc -l) -eq 0 ] is redundant because echo always produces at least one line of output (even for empty strings, wc -l returns 1). The -z check alone is sufficient to detect empty results.

Optionally simplify to:
-        if [ -z "$SAMPLED_NODE_IDS" ] || [ $(echo "$SAMPLED_NODE_IDS" | wc -l) -eq 0 ]; then
+        if [ -z "$SAMPLED_NODE_IDS" ]; then
             SAMPLED_NODE_IDS=$(echo "$ALL_NODE_IDS" | head -1)
         fi
Note: The same pattern appears at lines 323-325.
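
The redundancy is easy to confirm in isolation. The sketch below also shows one hypothetical alternative, `count_nonempty_lines`, for cases where an actual line count is wanted; it is not part of the script.

```shell
#!/usr/bin/env bash
empty=""
# echo appends a newline even for an empty string, so wc -l reports 1.
echo "wc -l on empty string: $(echo "$empty" | wc -l | tr -d ' ')"

# Hypothetical alternative: count only non-empty lines.
# grep -c exits non-zero when the count is 0, hence `|| true`.
count_nonempty_lines() {
    printf '%s\n' "$1" | grep -c . || true
}
count_nonempty_lines "$empty"
count_nonempty_lines $'a\nb'
```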

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a749f5f and 2c2ab68.

📒 Files selected for processing (1)

scripts/task_test_blackwell_kernels.sh (3 hunks)

🧰 Additional context used

🪛 Shellcheck (0.11.0)

scripts/task_test_blackwell_kernels.sh

[warning] 224-224: Quote this to prevent word splitting.

(SC2046)

[warning] 323-323: Quote this to prevent word splitting.

(SC2046)

🔇 Additional comments (5)

scripts/task_test_blackwell_kernels.sh (5)

25-37: LGTM: Clean argument parsing.

The addition of --sanity-test flag follows the existing pattern and is implemented correctly.

44-51: LGTM: Clear mode messaging.

The mode announcements clearly communicate whether sanity or full testing is active, and the sampling pattern details (offset, example test numbers) are helpful for reproducibility. Note that line 45's division operation is covered by the existing validation comment for SAMPLE_RATE.

239-255: LGTM: Comprehensive summary reporting.

The dry-run summary provides excellent visibility into sampling behavior with clear metrics and a reproduction command. The coverage calculation is straightforward and correct.

384-404: LGTM: Full test mode preserved correctly.

The full test mode maintains existing behavior while integrating cleanly with the new sanity test infrastructure. The separation of concerns between sanity and full modes is well-implemented.

359-382: LGTM: Thorough test summary.

The sanity test summary provides complete visibility into test results with clear pass/fail counts, coverage metrics, and reproduction instructions. The failed test reporting is well-formatted and helpful for debugging.

jimmyzho

lgtm!

sanity test script

1b456fc

kahyunnam requested review from nvmbreughe, yongwww and yzh119 as code owners December 12, 2025 02:07

coderabbitai bot reviewed Dec 12, 2025

gemini-code-assist bot reviewed Dec 12, 2025

kahyunnam changed the title ~~sanity test script~~ cicd: Add sanity test script Dec 12, 2025

fix caat arg limit potential error

5a2db82

coderabbitai bot reviewed Dec 12, 2025

bkryu reviewed Dec 12, 2025

Remove task_test_blackwell_kernels_lite.sh (merged into main script)

f212c33

kahyunnam requested a review from jimmyzho as a code owner December 12, 2025 21:47

coderabbitai bot reviewed Dec 12, 2025

dierksen reviewed Dec 12, 2025

kahyunnam requested a review from bkryu December 13, 2025 00:45

kahyunnam self-assigned this Dec 15, 2025

yzh119 reviewed Dec 15, 2025

add random offset

a749f5f

kahyunnam force-pushed the knam/sanity_test_script branch from 6ced71a to a749f5f Compare December 15, 2025 23:23

spacing nit

2c2ab68

coderabbitai bot reviewed Dec 15, 2025

coderabbitai bot reviewed Dec 15, 2025

jimmyzho approved these changes Dec 16, 2025

yzh119 enabled auto-merge (squash) December 16, 2025 00:24

yzh119 merged commit 0fa89cd into flashinfer-ai:main Dec 16, 2025
4 checks passed
