Commit a2c8a8d

[skill] refine monkey-patch-kernels-to-transformers based on feedback
Signed-off-by: Rundong Li <davidli@nvidia.com>
1 parent 87d29d5 commit a2c8a8d

3 files changed

Lines changed: 25 additions & 25 deletions

.agents/skills/monkey-patch-kernels-to-transformers/references/environment-setup.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ Work with user to prepare the experiment environment:
 cd /path/to/project
 docker build --target source -f modeling/transformers/Dockerfile -t auto-kernel:latest .
 ```
-4. All subsequent commands in this experiment must run in the Docker container with explicit GPU UUID spec and with current work tree mounted:
+4. Run all subsequent commands **inside this Docker container**. Do not substitute a host conda/venv. Only use a non-Docker environment if the user explicitly requests it.
 ```bash
 # Use the UUID from nvidia-smi -L output in step 1
 docker run --rm --gpus "device=GPU-d8ea7ef9-442e-488f-bd23-d6912699e32d" \

.agents/skills/monkey-patch-kernels-to-transformers/references/kernel-integration.md

Lines changed: 22 additions & 22 deletions
@@ -89,15 +89,15 @@ flowchart TD
 - Mapping note: `Step 1.1/1.2` correspond to the two explore-subagent bullets under Step 1; `Step 2.1/2.2` correspond to the two plan sub-steps under Step 2; `Step 3.2.1-3.2.7` correspond to the code-subagent sub-steps under Step 3.2.

 ### Detailed Steps
-1. Research phase: Study the target Transformer model and available kernel and monkey-patch implementation in TileGym. Launch 2 parallel explore subagents:
-* Search the model ID on HuggingFace to know what architectures does it use. Then search GitHub code to get implementation of that architecture. If the explore subagent wants to inspect certain Python module and `grep` code, do it inside the experiment Docker container---The host machine might not have `transformers` or its dependence installed. Go through details to understand computations performed on every components. Summarize a comprehensive requirement list with all necessary details included. *Focus on details*. Some model might use variants of standard Attention/MoE/normalization, and/or use distinct data types at different part of computations;
+1. Research phase: Study the target Transformer model and available kernel and monkey-patch implementation in TileGym. Launch 2 parallel explore subagents. Each subagent needs `WebSearch` + `WebFetch`; if no available agent type exposes them, the orchestrator handles web lookups itself.
+* Search the model ID on HuggingFace to know what architectures does it use. Then search GitHub code to get implementation of that architecture. To locate the integration point use any of: (a) `grep`/inspect `transformers` source **inside the Docker container** (host may lack `transformers` and deps); (b) `WebSearch`/`WebFetch` against the `transformers` GitHub repo or HF Hub model card; (c) if the model loads with `trust_remote_code=True`, the classes live not in `transformers.models.*` but in custom `modeling_*.py` downloaded to the HF modules cache (`$HF_HOME/modules/transformers_modules/<repo>/`, default `~/.cache/huggingface/...`) and resolved via `auto_map` in the model config — inspect that path inside the container after one load. Go through details to understand computations performed on every components. Summarize a comprehensive requirement list with all necessary details included. *Focus on details*. Some model might use variants of standard Attention/MoE/normalization, and/or use distinct data types at different part of computations;
 * Go through @src/tilegym/ to inventory available kernel implementations, OP interfaces, and Transformer model monkey-patches. Pay attention to the `@dispatch("<OP name>")` and `@register("<OP name>")` mappings, and `apply_tilegym_kernel_to_<transformer_module>` patch patterns. Summarize a manifest that list all available monkey-patch functions, OP interfaces, kernel implementations with sufficient details to distinguish variants of operations. *Refer to but don't rely on docstring/comments; focus on details that distinct similar kernels*. If unsure about `cuda.tile` kernel semantic, check https://docs.nvidia.com/cuda/cutile-python/operations.html.
 2. Plan phase: Check if the target model architecture is already patched. If so, inform the user and exit; Otherwise, propose an integration plan following these sub-steps:
 1. Check the requirement list and manifest to determine which set of computations could be patched by TileGym implementations. Be optimistic since subsequent steps/subagents will drop unsuitable proposals;
 2. For each of the computation selected at previous sub-step, propose matching TileGym OP interfaces or/and concrete kernel implementations. You may propose multiple candidates if uncertain, but do keep candidate pool small using your best judgement.
 3. Execute-and-verify phase: Check develop environment, launch subagents to implement monkey-patch for each of the items in integration plan once-a-time, verify it on develop environment, and accept/reject that monkey-patch. Specific sub-steps:
 1. The orchestrator agent (i.e., you) checks the Docker container is available, GPU UUID reported inside the Docker container is expected, and current git branch does not have unstaged/uncommitted changes
-2. For each of the unverified integration plan item (i.e., a mapping of Transformer model compute <-> one or more TileGym implementation candidates), launch a code subagent. Tell this subagent how to invoke command in our Docker environment and its workflow:
+2. For each unverified integration plan item (i.e., a mapping of Transformer model compute <-> one or more TileGym implementation candidates), launch a code subagent **sequentially, one at a time** — purpose is context isolation, not parallelism; concurrent subagents race on `src/tilegym/transformers/<submodule_name>/` and on the Docker container. Subagent needs filesystem read/write + Bash (in-container test runs); web access not required. Tell this subagent how to invoke command in our Docker environment and its workflow:
 1. Study @src/tilegym/transformers/monkey_patch.py and @modeling/transformers/infer.py to understand how to monkey-patch a transformer model with TileGym implementation;
 2. Locate the integration point at `transformers` library. E.g., It could be a `nn.Module` subclass that corresponds to a layer in the transformer model, or an utility function that applies certain modification to transformer models' intermediate variables/tensors;
 3. Collect inputs and outputs around integration point to serve as subsequent verifications' references. You can create a simple debug Python script that calls `transformers` library's `.generate()` API to prompt the Transformer model to output "The capital of France is", and add code before and after the integration point to save intermediate PyTorch tensors and other necessary variables to disk as future references. *Critical: unoptimized `.generate()` is slow, collect as less data as possible*;
@@ -117,22 +117,22 @@ flowchart TD
 3. Aggregate all verified computes and corresponding patches. If none of the compute can be faithfully integrated, exit the workflow and let users know; Otherwise, aggregate all patching logic to a main monkey-patch function `def apply_tilegym_kernel_to_<submodule_name>(...)` and place it at @src/tilegym/transformers/monkey_patch.py. Each compute has a corresponding boolean flag as function argument;
 4. Update @modeling/transformers/infer.py to include the main monkey-patch function in the inference and benchmark flow. Create a Bash script modeling/transformers/bench_<submodule_name>.sh similar to other bench scripts in that directory. Ensure to use `--use_cutile` at 2nd infer.py call, as we focus on cuTile backend;
 5. Run the end-to-end inference script created at sub-step 3.4. It should print ~300 lines of plain text. Collect baseline throughput, cuTile kernelized throughput, and cuTile kernel coverage by `grep -E "Average throughput|cuTile Kernel Coverage \(GPU Time\)" <output_file>`. Example output:
-```text
-Average throughput: 25.93 ± 3.20 tokens/sec
-Average throughput: 53.41 ± 0.25 tokens/sec
->>> cuTile Kernel Coverage (GPU Time): 49.21% <<<
-```
-Git commit current changes **except those standalone monkey patch files and tests** created at step 3.2.6. I.e.:
-```text
-src/tilegym/transformers/
-|- monkey_patch.py # check modifications to git
-|- <submodule_name>/
-|- __init__.py # check to git
-|- monkey_patch_<compute_name>.py # don't check to git
-|- test_monkey_patch_<compute_name>.py # don't check to git
-|- # Optional [monkey_patch_<other_compute_name>.py, test_monkey_patch_<other_compute_name>.py] pairs created by other subagents assigned with <other_compute_name>s --- don't check to git
-|- modeling_<submodule_name>.py # check to git
-|- modeling/transformers/
-|- bench_<submodule_name>.sh # check to git
-|- infer.py # check modifications to git
-```
+```text
+Average throughput: 25.93 ± 3.20 tokens/sec
+Average throughput: 53.41 ± 0.25 tokens/sec
+>>> cuTile Kernel Coverage (GPU Time): 49.21% <<<
+```
+Git commit current changes **except those standalone monkey patch files and tests** created at step 3.2.6. I.e.:
+```text
+src/tilegym/transformers/
+|- monkey_patch.py # check modifications to git
+|- <submodule_name>/
+|- __init__.py # check to git
+|- monkey_patch_<compute_name>.py # don't check to git
+|- test_monkey_patch_<compute_name>.py # don't check to git
+|- # Optional [monkey_patch_<other_compute_name>.py, test_monkey_patch_<other_compute_name>.py] pairs created by other subagents assigned with <other_compute_name>s --- don't check to git
+|- modeling_<submodule_name>.py # check to git
+|- modeling/transformers/
+|- bench_<submodule_name>.sh # check to git
+|- infer.py # check modifications to git
+```
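
Reviewer note on step 3.5: the three `grep` lines in the example output can be reduced to a single speedup figure. The following is a minimal Python sketch, assuming the first `Average throughput` line is the baseline run and the second is the `--use_cutile` run (the order implied by sub-step 3.4). `summarize_bench` is a hypothetical helper for illustration and is not part of TileGym.

```python
import re

# Patterns mirror the example output quoted in step 3.5.
THROUGHPUT_RE = re.compile(
    r"Average throughput:\s*([\d.]+)\s*±\s*([\d.]+)\s*tokens/sec"
)
COVERAGE_RE = re.compile(r"cuTile Kernel Coverage \(GPU Time\):\s*([\d.]+)%")


def summarize_bench(log_text):
    """Hypothetical helper: extract baseline/cuTile throughput and coverage.

    Assumes exactly two throughput lines, baseline first, cuTile second.
    """
    runs = [(float(mean), float(std)) for mean, std in THROUGHPUT_RE.findall(log_text)]
    if len(runs) != 2:
        raise ValueError(f"expected 2 throughput lines, found {len(runs)}")
    (baseline, _), (cutile, _) = runs
    coverage = COVERAGE_RE.search(log_text)
    return {
        "baseline": baseline,
        "cutile": cutile,
        "speedup": cutile / baseline,
        # Coverage is optional in case the line is missing from the log.
        "coverage_pct": float(coverage.group(1)) if coverage else None,
    }


# The example output from the skill document.
EXAMPLE = """\
Average throughput: 25.93 ± 3.20 tokens/sec
Average throughput: 53.41 ± 0.25 tokens/sec
>>> cuTile Kernel Coverage (GPU Time): 49.21% <<<
"""

summary = summarize_bench(EXAMPLE)
print(f"speedup: {summary['speedup']:.2f}x")  # prints "speedup: 2.06x"
```

Raising on an unexpected number of throughput lines is a deliberate choice here: a silently mis-ordered pair would report an inverted speedup.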

modeling/transformers/Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -100,9 +100,9 @@ COPY . /workspace/tilegym/
 # Verify repository structure
 RUN test -f setup.py || (echo "ERROR: setup.py not found! Please build from tilegym repository root." && exit 1)

-# Install runtime dependencies from requirements.txt, then TileGym in editable mode
+# Install runtime dependencies from requirements.txt, then TileGym in editable mode with dev extras
 RUN pip install --no-cache-dir -r requirements.txt && \
-pip install --no-cache-dir --no-deps -e .
+pip install --no-cache-dir -e ".[dev]"

 # Set up model cache directory
 ENV HF_HOME=/workspace/.cache/huggingface \
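
Reviewer note: because the image sets `HF_HOME`, code downloaded for `trust_remote_code=True` models lands under that directory inside the container, not under `~/.cache/huggingface`. The Python sketch below resolves the location described in kernel-integration.md (`$HF_HOME/modules/transformers_modules/<repo>/`) and lists any `modeling_*.py` files found there. It assumes that layout, plus an `HF_MODULES_CACHE` override and an `XDG_CACHE_HOME` fallback that should be checked against the installed `transformers` version. Both function names are hypothetical.

```python
import os
from pathlib import Path


def remote_code_root(env=None):
    """Hypothetical helper: where trust_remote_code modules are expected.

    Layout assumed from the skill text:
    $HF_HOME/modules/transformers_modules/<repo>/
    """
    env = os.environ if env is None else env
    # Assumed default when HF_HOME is unset: ~/.cache/huggingface
    default_home = Path(env.get("XDG_CACHE_HOME") or Path.home() / ".cache") / "huggingface"
    hf_home = Path(env.get("HF_HOME") or default_home)
    # Assumption: HF_MODULES_CACHE, if set, overrides $HF_HOME/modules.
    modules_cache = Path(env.get("HF_MODULES_CACHE") or hf_home / "modules")
    return modules_cache / "transformers_modules"


def find_remote_modeling_files(env=None):
    """List custom modeling_*.py files; empty until a model has been loaded once."""
    root = remote_code_root(env)
    return sorted(root.rglob("modeling_*.py")) if root.is_dir() else []


if __name__ == "__main__":
    print(f"searching under: {remote_code_root()}")
    for path in find_remote_modeling_files():
        print(path)
```

Run inside the container after one model load; an empty listing means either the model does not use remote code or it has not been loaded yet.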
