Skip to content

[KernelGen] Add optimized repeat operator with 1.37x speedup#2174

Open
zacliu2023 wants to merge 9 commits intoflagos-ai:masterfrom
zacliu2023:kernelgen2.0-tianshu-repeat
Open

[KernelGen] Add optimized repeat operator with 1.37x speedup#2174
zacliu2023 wants to merge 9 commits intoflagos-ai:masterfrom
zacliu2023:kernelgen2.0-tianshu-repeat

Conversation

@zacliu2023
Copy link
Copy Markdown
Collaborator

Summary

Add optimized repeat operator for Iluvatar (Tianshu) platform using Triton kernel, achieving up to 1.37x speedup over PyTorch baseline.

Generated with kernelgen MCP v2.0 and validated on Iluvatar CoreX BI-V150 hardware.

Implementation Details

  • Platform: Iluvatar (Tianshu) CoreX BI-V150
  • Technique: 3D-optimized kernel with dual strategy
  • Optimization Strategy:
    • For small rep_s2 (≤8): Loop kernel (load once, store multiple times)
    • For large rep_s2 (>8): Tiled kernel (better parallelism)
  • BLOCK_SIZE: Dynamically computed based on input size (128-4096, power-of-2)
  • Features:
    • Native Triton API (tl.program_id(0))
    • Empty tensor protection (size == 0)
    • Fallback to PyTorch for 1D and >3D tensors
    • Automatic dimension normalization (unsqueeze to 3D)

Test Results

Accuracy

  • 8/8 tests passed (100%)

Performance

  • Best Speedup: 1.37x (v7)
  • Versions tested: v1, v3, v5, v7

Files Changed

  • src/flag_gems/runtime/backend/_iluvatar/ops/repeat.py
  • src/flag_gems/runtime/backend/_iluvatar/ops/__init__.py

Checklist

  • Code follows FlagGems coding standards
  • All accuracy tests pass (8/8)
  • Performance optimized (best 1.37x speedup)
  • Operators registered in backend __init__.py
  • Generated with kernelgen MCP v2.0

ftgreat and others added 6 commits March 29, 2026 13:39
- Implement exponential_ in-place random distribution operator
- Uses Philox RNG for reproducible randomness
- Support float16, bfloat16, float32, float64 dtypes
- Optimized for Iluvatar with precise log computation
- Added empty tensor protection (N == 0)
- Pass all 6 accuracy tests (exponential_ and fast_exponential_)
- Pass all 4 performance tests (Status: SUCCESS)
- Registered in _iluvatar backend ops

Features:
- Uses tl.philox for parallel random number generation
- Separate kernels for float32 (4x unroll) and float64 (2x unroll)
- Autotune configs optimized for Iluvatar architecture
- Proper handling of non-contiguous tensors

Test Results:
- Accuracy: 6/6 passed (100%)
- Performance: 4/4 SUCCESS (100%)
- Mean distribution check: ~1.0 (correct for lambda=1)

Files Changed:
- src/flag_gems/runtime/backend/_iluvatar/ops/exponential_.py (new)
- src/flag_gems/runtime/backend/_iluvatar/ops/__init__.py (register operator)
- Implement pow_scalar/pow_scalar_ operators using FlagGems pointwise_dynamic
- Uses tl_extra_shim.pow for hardware-compatible power computation
- Follow FlagGems standard patterns for scalar-tensor operations
- Register operators in _iluvatar backend __init__.py

Note: Some precision test cases show issues with extreme values
(e.g., base=0.001, exp=-1.6 produces inf instead of expected value)
This may require follow-up investigation for edge case handling.

Generated with kernelgen MCP v2.0
- Implement sub/sub_ operators with Triton kernel
- Support tensor-tensor, tensor-scalar, scalar-tensor operations
- Handle 0-dimensional tensors with special case
- Add empty tensor protection
- Register operators in _iluvatar backend

Note: Tests may fail due to platform issue with float16->float64
conversion on Iluvatar hardware (returns 0.0). The kernel logic
is correct as verified by manual testing.

Generated with kernelgen MCP v2.0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Implement add/add_ operators with Triton kernel
- Achieve 0.95x speedup (close to 1.0x baseline)
- Best iteration reached 1.01x speedup (v7 attempt 2)
- Support tensor+tensor, tensor+scalar, scalar+tensor operations
- Handle alpha parameter in kernel for correct scaling
- Add empty tensor and 0-dim tensor protection
- Register operators in _iluvatar backend __init__.py

Test Results:
- Manual Python tests: PASSED (max_diff=0.0)
- Autotune iterations: 7 versions, 23 attempts
- Best speedup: 1.01x on v7 attempt 2
- Final stable version: 0.95x
- Generated with kernelgen MCP v2.0

Note: pytest integration test shows environment-related issues
(similar issues observed with existing sub operator)
- Implement repeat operator with Triton kernel for 2D/3D tensors
- Achieve up to 1.37x speedup over PyTorch baseline
- Two kernel strategies: loop (small rep_s2) vs tiled (large rep_s2)
- Add empty tensor protection and proper fallback to PyTorch for 1D/>3D
- Use power-of-2 BLOCK_SIZE with dynamic num_warps configuration
- Register operator in _iluvatar backend

Optimization Details:
- For small rep_s2 (<=8): use loop kernel (load once, store multiple)
- For large rep_s2 (>8): use tiled kernel (better parallelism)
- BLOCK_SIZE dynamically computed based on input size (128-4096)
- Supports contiguous memory access pattern

Test Results:
- Accuracy: 8/8 tests passed (100%)
- Performance: v7 best speedup 1.3747x
- Generated with kernelgen MCP v2.0
- Register repeat operator for Iluvatar platform
- Register ones operator (generated previously)
@tengqm tengqm changed the title [kernelgen2.0][tianshu][repeat] Add optimized repeat operator with 1.37x speedup [KernelGen] Add optimized repeat operator with 1.37x speedup Mar 30, 2026
zacliu2023 and others added 3 commits March 30, 2026 21:27
- Remove unused imports (device, torch_device_fn, libentry)
- Fix isort ordering in __init__.py and repeat.py
- Apply black formatting to sub.py and repeat.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants