-
Notifications
You must be signed in to change notification settings - Fork 596
cicd: Add sanity test script #2212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @kahyunnam, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new sanity test script, Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughAdds a sampled SANITY_TEST mode to Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
scripts/task_test_blackwell_kernels_lite.sh (3)
20-22: Make collection use the same pytest flags as execution (or intentionally keep them different).
Right nowPYTEST_FLAGSdoesn’t apply to--collect-only, so collection failures/behavior may diverge from the run phase.@@ - COLLECTION_OUTPUT=$(pytest --collect-only -q "$test_file" 2>&1) + COLLECTION_OUTPUT=$(pytest $PYTEST_FLAGS --collect-only -q "$test_file" 2>&1) @@ - COLLECTION_OUTPUT=$(pytest --collect-only -q "$test_file" 2>&1) + COLLECTION_OUTPUT=$(pytest $PYTEST_FLAGS --collect-only -q "$test_file" 2>&1)Also applies to: 121-126, 190-195
5-8: Remove or wire up unused env vars (MAX_JOBS,CUDA_VISIBLE_DEVICES) to avoid confusion.
They’re set but not referenced anywhere in the script.
- If unused: drop them.
- If intended: apply them (e.g., export
CUDA_VISIBLE_DEVICESfor pytest runs; useMAX_JOBSwithpytest -nif xdist is in use).
48-58:pytest.ininorecursedirsparsing is likely wrong (space-separated patterns, not comma-separated).
Current code only handles comma-separated values and may miss exclusions.Suggested direction: parse everything after
=and treat it as whitespace-separated tokens (and ignore comments), without the comma transform.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/task_test_blackwell_kernels_lite.sh(1 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/task_test_blackwell_kernels_lite.sh
[warning] 235-235: Quote this to prevent word splitting.
(SC2046)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new sanity test script, task_test_blackwell_kernels_lite.sh, which is a great addition for running a subset of tests. The script is well-structured, but I've identified a few areas for improvement to enhance its robustness and fix a bug in the dry-run functionality. My review includes suggestions to address these points, primarily concerning shell scripting best practices for handling environment variables and command arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
scripts/task_test_blackwell_kernels_lite.sh (3)
5-8: ValidateSAMPLE_RATE(avoid divide-by-zero / awk modulo errors + misleading coverage).
TodaySAMPLE_RATE=0or non-integer will break arithmetic (100 / SAMPLE_RATE) and sampling (awk NR % ...).: ${CUDA_VISIBLE_DEVICES:=0} : ${SAMPLE_RATE:=5} # Run every Nth test (5 = ~20% coverage) + +if ! [[ "${SAMPLE_RATE}" =~ ^[1-9][0-9]*$ ]]; then + echo "ERROR: SAMPLE_RATE must be a positive integer (got: ${SAMPLE_RATE})" >&2 + exit 2 +fiAlso applies to: 17-18, 142-145, 212-215, 255-260
174-175: Fix JUnit XML output path (slashes in${test_file}create invalid paths / missing dirs).
--junitxml=${JUNIT_DIR}/${test_file}.xmlwill typically fail because${test_file}includes/(e.g.,tests/foo/test_bar.py).mkdir -p "${JUNIT_DIR}" @@ - JUNIT_FLAG="--junitxml=${JUNIT_DIR}/${test_file}.xml" + # junitxml must be a valid file path; sanitize test_file into a flat name + junit_name="$(echo "${test_file}" | sed 's#^./##' | sed 's#/#_#g')" + JUNIT_FLAG="--junitxml=${JUNIT_DIR}/${junit_name}.xml"Also applies to: 226-235
23-29: Don’t override envDRY_RUN, and fix the help text (DRY_RUN=true).
Hard-settingDRY_RUN=falsepreventsDRY_RUN=true ./script.shfrom working; also the hint currently saysDRY_RUN=false.-DRY_RUN=false +: ${DRY_RUN:=false} if [[ "$1" == "--dry-run" ]] || [[ "${DRY_RUN}" == "true" ]]; then @@ - echo "Or set DRY_RUN=false $0" + echo "Or set DRY_RUN=true $0"Also applies to: 170-173
🧹 Nitpick comments (1)
scripts/task_test_blackwell_kernels_lite.sh (1)
62-89: Prefer arrays forTEST_FILESto avoid word-splitting hazards and simplify looping.
Even if paths “shouldn’t” contain spaces, using string-accumulation (TEST_FILES="$TEST_FILES $test_file") +for test_file in $TEST_FILESis fragile and can mis-handle edge cases.-ALL_TEST_FILES=$(find tests/ -name "test_*.py" -type f | sort) +mapfile -t ALL_TEST_FILES < <(find tests/ -name "test_*.py" -type f | sort) @@ -TEST_FILES="" -for test_file in $ALL_TEST_FILES; do +TEST_FILES=() +for test_file in "${ALL_TEST_FILES[@]}"; do @@ - TEST_FILES="$TEST_FILES $test_file" + TEST_FILES+=("$test_file") fi done @@ -TEST_FILES=$(echo "$TEST_FILES" | xargs) - -if [ -z "$TEST_FILES" ]; then +if [ "${#TEST_FILES[@]}" -eq 0 ]; then echo "No test files found in tests/ directory (after exclusions)" exit 1 fi @@ -for test_file in $TEST_FILES; do +for test_file in "${TEST_FILES[@]}"; do echo " $test_file" done @@ - for test_file in $TEST_FILES; do + for test_file in "${TEST_FILES[@]}"; do @@ - for test_file in $TEST_FILES; do + for test_file in "${TEST_FILES[@]}"; doAlso applies to: 96-98, 114-116, 180-182
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/task_test_blackwell_kernels_lite.sh(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
bkryu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kahyunnam, it seems like the newly added script is essentially a duplicate of task_test_blackwell_kernels.sh. Instead of creating a copy that adds maintenance burden moving forward, how do we feel about expanding the existing script such that it takes in an input flag? For example
bash scripts/task_test_blackwell_kernels.sh--> Same as current existingbash scripts/task_test_blackwell_kernels.sh --sanity_test--> The subsampled testing done in this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/task_test_blackwell_kernels.sh (1)
185-241: Suggested refactor: Extract test collection and sampling logic into helper functions.The test collection logic (pytest --collect-only, node ID extraction, error handling) is duplicated across the dry-run sanity block (lines 196–209) and full-execution sanity block (lines 281–295). Similarly, the sampling logic (lines 216–218 vs. 302–304) is repeated verbatim. Extracting these into helper functions would reduce duplication and make the code easier to maintain.
Consider defining helper functions near the top of the script:
# Helper: Collect test node IDs from a test file collect_test_node_ids() { local test_file=$1 set +e local collection_output=$(pytest --collect-only -q "$test_file" 2>&1) local collection_exit_code=$? set -e local node_ids=$(echo "$collection_output" | grep "::" || true) if [ -z "$node_ids" ]; then return $collection_exit_code # 0 if success, >0 if collection failed fi echo "$node_ids" return 0 } # Helper: Sample test node IDs at a given rate sample_test_node_ids() { local node_ids=$1 local sample_rate=$2 echo "$node_ids" | awk "NR % $sample_rate == 1" }Then replace the duplicated blocks in lines 196–218 and 281–304 with calls to these helpers, reducing cognitive load and improving maintainability.
Also applies to: 265-334
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/task_test_blackwell_kernels.sh(3 hunks)
🔇 Additional comments (5)
scripts/task_test_blackwell_kernels.sh (5)
34-45: Mode messaging is clear and user-friendly.The output for
DRY_RUNandSANITY_TESTmodes (lines 34–45) is well-structured, with emoji indicators and an intelligible coverage estimate. The calculated percentage using$((100 / SAMPLE_RATE))is helpful context.Once SAMPLE_RATE validation is in place (see critical issue above), this messaging will be safe and informative.
195-198: Appropriate error handling for pytest collection failures.The temporary disabling of
set -e(lines 195–198 and 280–283) before pytest collection and re-enabling afterward is a good pattern. It allows the script to gracefully handle collection errors without terminating early, while preserving the strict error-handling mode for other operations.Also applies to: 280-283
309-313: Good defensive check for empty sampled test set.Lines 309–313 correctly skip files with no sampled tests, preventing unnecessary pytest invocations and avoiding potential issues with empty argument lists.
316-316: Correct bash array construction and argument passing.Line 316 uses
mapfile -tto safely convert newline-delimited test node IDs into a bash array with trimmed newlines. Line 324 then correctly expands the array with"${SAMPLED_NODE_IDS_ARRAY[@]}", ensuring each test node ID is passed as a separate argument to pytest.Also applies to: 324-324
237-237: Division by zero protection in coverage calculation.Lines 237 and 346 check
if [ "$TOTAL_TEST_CASES" -gt 0 ]before calculating the coverage percentage, preventing a division-by-zero error if no tests are collected. This is solid defensive programming.Also applies to: 346-346
|
/bot run |
yzh119
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kahyunnam, thanks for adding this feature.
Do we want any stochastic behavior in the sampling? If we select every 5 unittests then some combinations are determined to skip.
|
Agree with @yzh119's comment on stochastic sampling (maybe instead of "NR % $SAMPLE_RATE == 1" we can randomize between NR % $SAMPLE_RATE == 0-(SAMPLE_RATE-1) ?), but otherwise looks good from my end! |
6ced71a to
a749f5f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
scripts/task_test_blackwell_kernels.sh (2)
8-13: Critical: Validate SAMPLE_RATE to prevent division by zero.Lines 12 and 45 perform arithmetic with
SAMPLE_RATE(RANDOM % SAMPLE_RATEand100 / SAMPLE_RATE). IfSAMPLE_RATEis 0, both will fail with division-by-zero errors.Additionally, consider validating
SAMPLE_OFFSETwhen set externally to ensure it's in the range [0, SAMPLE_RATE-1]:: ${SAMPLE_RATE:=5} # Run every Nth test in sanity mode (5 = ~20% coverage) + +# Validate SAMPLE_RATE +if ! [[ "$SAMPLE_RATE" =~ ^[0-9]+$ ]] || [ "$SAMPLE_RATE" -le 0 ]; then + echo "ERROR: SAMPLE_RATE must be a positive integer (got: $SAMPLE_RATE)" >&2 + exit 1 +fi # Randomize starting offset (0 to SAMPLE_RATE-1) for sampling variety if [ -z "${SAMPLE_OFFSET:-}" ]; then SAMPLE_OFFSET=$((RANDOM % SAMPLE_RATE)) +else + # Validate externally-set SAMPLE_OFFSET + if ! [[ "$SAMPLE_OFFSET" =~ ^[0-9]+$ ]] || [ "$SAMPLE_OFFSET" -ge "$SAMPLE_RATE" ]; then + echo "ERROR: SAMPLE_OFFSET must be a non-negative integer less than SAMPLE_RATE (got: SAMPLE_OFFSET=$SAMPLE_OFFSET, SAMPLE_RATE=$SAMPLE_RATE)" >&2 + exit 1 + fi fi
320-325: Quote the command substitution (shellcheck SC2046).Same issue as the dry-run path: Line 323's
$(echo "$SAMPLED_NODE_IDS" | wc -l)should be quoted.Apply this diff:
# Sample every Nth test with random offset SAMPLED_NODE_IDS=$(echo "$ALL_NODE_IDS" | awk "NR % $SAMPLE_RATE == $SAMPLE_OFFSET") # Fallback: if no tests sampled (offset missed all tests), take the first test - if [ -z "$SAMPLED_NODE_IDS" ] || [ $(echo "$SAMPLED_NODE_IDS" | wc -l) -eq 0 ]; then + if [ -z "$SAMPLED_NODE_IDS" ] || [ "$(echo "$SAMPLED_NODE_IDS" | wc -l)" -eq 0 ]; then SAMPLED_NODE_IDS=$(echo "$ALL_NODE_IDS" | head -1) fi
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/task_test_blackwell_kernels.sh(3 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/task_test_blackwell_kernels.sh
[warning] 224-224: Quote this to prevent word splitting.
(SC2046)
[warning] 323-323: Quote this to prevent word splitting.
(SC2046)
🔇 Additional comments (6)
scripts/task_test_blackwell_kernels.sh (6)
25-37: LGTM: Argument parsing is clear and correct.The flag parsing logic properly handles both
--dry-runand--sanity-testoptions and allows them to be combined.
44-51: LGTM: Clear mode announcements with helpful context.The mode banners provide useful information including coverage estimates and sampling pattern details, which will help users understand and reproduce test runs.
239-255: LGTM: Comprehensive dry-run summary.The summary provides clear metrics (coverage percentage, sample counts) and actionable reproduction instructions, which will be valuable for debugging and CI workflows.
338-354: LGTM: Correct array handling and test execution.The use of
mapfile -tto populate the array and"${SAMPLED_NODE_IDS_ARRAY[@]}"for proper expansion ensures test node IDs with spaces or special characters are handled correctly. The pass/fail tracking logic is also sound.
359-382: LGTM: Thorough sanity test summary with actionable metrics.The summary provides comprehensive metrics including per-file pass/fail counts, coverage percentage, and exact reproduction commands. This will be very helpful for debugging CI failures and understanding test coverage.
383-418: LGTM: Full test mode preserves existing behavior.The full test mode execution maintains the original script behavior while fitting well into the new dual-mode structure. The summary format is consistent with the sanity mode output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
scripts/task_test_blackwell_kernels.sh (3)
8-13: Consider validating user-provided SAMPLE_OFFSET.The SAMPLE_RATE division-by-zero issue is already flagged in a previous comment. Additionally, when SAMPLE_OFFSET is provided externally, there's no validation ensuring
0 ≤ SAMPLE_OFFSET < SAMPLE_RATE. While the fallback logic (lines 224-226, 323-325) ensures at least one test runs even with an invalid offset, validating the range up front would provide clearer error messages.Consider adding validation after line 13:
if [ -z "${SAMPLE_OFFSET:-}" ]; then SAMPLE_OFFSET=$((RANDOM % SAMPLE_RATE)) fi + +# Validate SAMPLE_OFFSET is in valid range +if ! [[ "$SAMPLE_OFFSET" =~ ^[0-9]+$ ]] || [ "$SAMPLE_OFFSET" -ge "$SAMPLE_RATE" ]; then + echo "ERROR: SAMPLE_OFFSET must be an integer in range [0, SAMPLE_RATE) (got: SAMPLE_OFFSET=$SAMPLE_OFFSET, SAMPLE_RATE=$SAMPLE_RATE)" >&2 + exit 1 +fi
191-237: LGTM: Solid dry-run sampling logic.The collection and sampling logic is well-implemented:
- Proper error handling for collection failures
- Correct awk-based sampling with randomized offset
- Fallback ensures at least one test per file
- Clear per-file reporting
The unquoted command substitution at line 224 is already flagged in previous comments.
284-357: LGTM: Correct sanity test execution.The execution mode properly mirrors the dry-run logic with consistent sampling behavior. The use of
mapfileand array expansion to pass sampled test IDs to pytest is the correct approach for handling test node IDs with spaces or special characters.The unquoted command substitution at line 323 is already flagged in previous comments.
🧹 Nitpick comments (1)
scripts/task_test_blackwell_kernels.sh (1)
222-226: Optional: Simplify fallback condition.The second condition
[ $(echo "$SAMPLED_NODE_IDS" | wc -l) -eq 0 ]is redundant becauseechoalways produces at least one line of output (even for empty strings,wc -lreturns 1). The-zcheck alone is sufficient to detect empty results.Optionally simplify to:
- if [ -z "$SAMPLED_NODE_IDS" ] || [ $(echo "$SAMPLED_NODE_IDS" | wc -l) -eq 0 ]; then + if [ -z "$SAMPLED_NODE_IDS" ]; then SAMPLED_NODE_IDS=$(echo "$ALL_NODE_IDS" | head -1) fiNote: The same pattern appears at lines 323-325.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/task_test_blackwell_kernels.sh(3 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/task_test_blackwell_kernels.sh
[warning] 224-224: Quote this to prevent word splitting.
(SC2046)
[warning] 323-323: Quote this to prevent word splitting.
(SC2046)
🔇 Additional comments (5)
scripts/task_test_blackwell_kernels.sh (5)
25-37: LGTM: Clean argument parsing.The addition of
--sanity-testflag follows the existing pattern and is implemented correctly.
44-51: LGTM: Clear mode messaging.The mode announcements clearly communicate whether sanity or full testing is active, and the sampling pattern details (offset, example test numbers) are helpful for reproducibility. Note that line 45's division operation is covered by the existing validation comment for SAMPLE_RATE.
239-255: LGTM: Comprehensive summary reporting.The dry-run summary provides excellent visibility into sampling behavior with clear metrics and a reproduction command. The coverage calculation is straightforward and correct.
384-404: LGTM: Full test mode preserved correctly.The full test mode maintains existing behavior while integrating cleanly with the new sanity test infrastructure. The separation of concerns between sanity and full modes is well-implemented.
359-382: LGTM: Thorough test summary.The sanity test summary provides complete visibility into test results with clear pass/fail counts, coverage metrics, and reproduction instructions. The failed test reporting is well-formatted and helpful for debugging.
jimmyzho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
📌 Description
This PR adds a sanity test script to use towards testing more CTK versions. This adds
--sanity-testflag as an option to[scripts/task_test_blackwell_kernels.sh](https://github.com/flashinfer-ai/flashinfer/blob/main/scripts/task_test_blackwell_kernels.sh), our current unit testing script. But instead of running every test combination, it samples every Nth test in each test suite. N is determined bySAMPLE_RATE. The default value is 5, which is 20% coverage (${SAMPLE_RATE:=5}).Example output for
bash scripts/task_test_blackwell_kernels.sh --dry-run --sanity-test:🔍 Related Issues
As discussed in weekly meeting and on slack.
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
pre-commitby runningpip install pre-commit(or used your preferred method).pre-commit install.pre-commit run --all-filesand fixed any reported issues.🧪 Tests
unittest, etc.).Reviewer Notes
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.