453 changes: 453 additions & 0 deletions docs/designs/ptodsl-simt-micro-op-api-design.md

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions ptodsl/docs/user_guide/01-introduction.md
@@ -223,7 +223,7 @@ Chapter 11 walks through this example in full detail.
| New to PTODSL | Chapter 2 (Quick Start), then Chapter 3 (Kernel Entries) |
| Writing your first kernel | Chapter 2 → Chapter 4 (Type System) → Chapter 5 (Control Flow) |
| Looking up a specific operation | Chapters 6–10 (organized by topic) |
| Looking up a specific operation | Chapters 6–10 and Chapter 13 (organized by topic) |
| Understanding the flash attention reference | Chapter 11 |

**Chapter overview:**
@@ -242,5 +242,5 @@ Chapter 11 walks through this example in full detail.
| 10 | Synchronization: barriers, flags, memory fences |
| 11 | Flash attention walkthrough |
| 12 | Additional examples |
| 13 | Migration from the old `@pto.vkernel`/`@pto.ckernel` API |
| 13 | SIMT micro-ops |
| 14 | Common errors and compatibility notes |
85 changes: 81 additions & 4 deletions ptodsl/docs/user_guide/03-kernel-entry-and-subkernels.md
@@ -539,7 +539,7 @@ instruction appears to operate on a single element (`lds`, `sts`, `a + b`),
but the same instruction is issued across a large number of work-items
simultaneously.

**Signature**: `@pto.simt(fn=None, *, name=None, target="a5")`
**Signature**: `@pto.simt(fn=None, *, name=None, target="a5", max_threads=None, max_regs=None)`

<!-- ptodsl-doc-test: {"mode":"compile_fragment","fixture":"kernel_entry.simt_signature","symbol":"kernel_entry_simt_signature_probe","compile":{"BLOCK":8}} -->
```python
@@ -573,13 +573,90 @@ def blend_output_rows(
scalar.store(o_next, o_next_tile[row, col])
```

SIMT kernels read and write individual scalar elements from tiles. The unit
executes the same scalar instruction across many work-items in parallel, making
it efficient for per-element operations.
SIMT kernels read and write individual scalar elements from tiles or typed
pointers. The unit executes the same scalar instruction across many work-items
in parallel, making it efficient for per-element operations.

#### SIMT resource attributes

Optional `max_threads` and `max_regs` arguments attach VPTO resource attributes
to the generated `pto.simt_entry` helper.

**Signature**: `@pto.simt(fn=None, *, name=None, target="a5", max_threads=None, max_regs=None)`

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `max_threads` | positive Python `int` | backend default `1024` | Compile-time launch envelope for this SIMT helper |
| `max_regs` | positive Python `int` | backend default `32` | Scalar register budget per work-item |

`max_threads` is not the launch size. The actual work-item count comes from the
SIMT launch dimensions. Both arguments must be Python integers known at trace
time, must be greater than zero, and must fit in a signless `i32`. They are only
valid on decorated SIMT helper functions, not on inline `with pto.simt():` scopes.
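
The constraints above can be sketched as a small trace-time check. This is an illustrative sketch only, not PTODSL's implementation: `check_simt_resource` and `I32_MAX` are hypothetical names, the upper bound assumes that "fits in signless `i32`" means at most `2**31 - 1`, and rejecting `bool` is a further assumption.

```python
from typing import Optional

# Assumption: "fits in signless i32" is modeled as the signed 32-bit maximum.
I32_MAX = 2**31 - 1


def check_simt_resource(name: str, value: Optional[int]) -> Optional[int]:
    """Hypothetical trace-time check; not PTODSL's actual implementation."""
    if value is None:
        # Nothing was requested, so the backend default applies.
        return None
    # bool is a subclass of int in Python; rejecting it here is an assumption.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{name} must be a Python int, got {type(value).__name__}")
    if value <= 0:
        raise ValueError(f"{name} must be greater than zero, got {value}")
    if value > I32_MAX:
        raise ValueError(f"{name} does not fit in i32: {value}")
    return value
```

Under these assumptions, `check_simt_resource("max_threads", 256)` returns the value unchanged, while `0`, a negative number, or a float raises before any attribute would be attached.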

**Example**:

<!-- ptodsl-doc-test: {"mode":"compile","symbol":"kernel_entry_simt_resource_probe","compile":{}} -->
```python
@pto.simt(max_threads=256, max_regs=48)
def write_tid(dst: pto.ptr(pto.i32, "gm")):
    # Review comment (Contributor, translated from Chinese): Later I would like
    # to standardize the pto.simd/pto.cube/pto.simt interfaces on
    # Tile/TensorView/scalar, so that custom SIMT ops can also be optimized by
    # the common passes.
    tid = pto.get_tid_x()
    idx = scalar.index_cast(tid)
    pto.stg(tid, dst, idx)


@pto.jit(target="a5")
def kernel_entry_simt_resource_probe(dst: pto.ptr(pto.i32, "gm")):
    write_tid[128, 1, 1](dst)
```
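
To make the difference between the launch envelope and the launch size concrete, the following pure-Python model imitates what the example is expected to write. It is a sketch under assumptions: `work_item_count` and `emulate_write_tid` are hypothetical helpers, the work-items run sequentially rather than in parallel, only the `x` dimension is modeled, and treating a launch larger than `max_threads` as an error is an assumption the guide does not state.

```python
from math import prod


def work_item_count(dims):
    """Number of work-items implied by (dim_x, dim_y, dim_z)."""
    return prod(dims)


def emulate_write_tid(dims, max_threads=256):
    """Sequential model of write_tid[dims](dst) with dim_y == dim_z == 1."""
    dim_x, dim_y, dim_z = dims
    if (dim_y, dim_z) != (1, 1):
        raise ValueError("this sketch models a one-dimensional launch only")
    # Assumption: the compile-time envelope is an upper bound on the launch.
    if work_item_count(dims) > max_threads:
        raise ValueError("launch exceeds the max_threads envelope")
    dst = [0] * dim_x
    for tid in range(dim_x):
        # Mirrors pto.stg(tid, dst, idx): store the thread id at its own index.
        dst[tid] = tid
    return dst


result = emulate_write_tid((128, 1, 1), max_threads=256)
```

Here the launch has 128 work-items even though the envelope is 256, which is the distinction drawn in the parameter description.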

**Invocation modes**: can be called from `@pto.jit` in either mode, or used
inline with `with pto.simt():` (Section 3.4).

#### Explicit SIMT launch dimensions

Calling a decorated SIMT helper directly uses the default launch descriptor
emitted by the tracer. Use indexed launch syntax when the launch dimensions must
be authored explicitly. `pto.simt_launch(...)` is the equivalent functional
form.

**Signatures**:

```python
body[dim_x, dim_y, dim_z](*args, **static_kwargs)
pto.simt_launch(body, *args, dims=(dim_x, dim_y, dim_z), **static_kwargs)
```

**Parameters**:

| Parameter | Type | Description |
|-----------|------|-------------|
| `body` | `@pto.simt` function | SIMT entry body to launch |
| `*args` | PTO values | Runtime arguments passed to the SIMT body |
| `dim_x`, `dim_y`, `dim_z` | `i32`-compatible values | Launch dimensions in source-level `x, y, z` order |
| `**static_kwargs` | hashable Python values | Trace-time specialization arguments for the SIMT body |

**Returns**: None.

**Example**:

<!-- ptodsl-doc-test: {"mode":"compile","symbol":"kernel_entry_simt_launch_probe","compile":{}} -->
```python
@pto.simt
def fill_tid(dst: pto.ptr(pto.i32, "gm")):
    tid = pto.get_tid_x()
    pto.stg(tid, dst, scalar.index_cast(tid))


@pto.jit(target="a5")
def kernel_entry_simt_launch_probe(dst: pto.ptr(pto.i32, "gm")):
    fill_tid[32, 1, 1](dst)
```
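
The equivalence between the indexed form and the functional form can be illustrated with plain Python. This is a sketch, not the tracer's implementation: `SimtBody` and this `simt_launch` are hypothetical stand-ins, the body receives its thread id as an explicit first argument instead of calling `pto.get_tid_x()`, and work-items are emulated sequentially.

```python
class SimtBody:
    """Hypothetical stand-in for a @pto.simt-decorated function."""

    def __init__(self, fn):
        self.fn = fn

    def __getitem__(self, dims):
        # body[dim_x, dim_y, dim_z] arrives here as a 3-tuple.
        dim_x, dim_y, dim_z = dims

        def launch(*args, **static_kwargs):
            # The indexed form simply forwards to the functional form.
            return simt_launch(
                self, *args, dims=(dim_x, dim_y, dim_z), **static_kwargs
            )

        return launch


def simt_launch(body, *args, dims, **static_kwargs):
    """Sequential emulation: run the body once per (x, y, z) work-item."""
    dim_x, dim_y, dim_z = dims
    for z in range(dim_z):
        for y in range(dim_y):
            for x in range(dim_x):
                body.fn((x, y, z), *args, **static_kwargs)


@SimtBody
def fill_tid(tid, dst):
    x, _, _ = tid
    dst[x] = x  # each work-item stores its own x index


indexed = [0] * 32
fill_tid[32, 1, 1](indexed)

functional = [0] * 32
simt_launch(fill_tid, functional, dims=(32, 1, 1))
```

Both buffers end up identical in this model, which is the sense in which the two spellings are described as equivalent.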

Specific SIMT micro-op APIs are documented in Chapter 13.

## 3.4 Inline context manager syntax

In addition to the decorator form, each sub-kernel unit provides a context