Commit 28ecd9a

Merge remote-tracking branch 'origin/main' into sweep-canary-gate
2 parents e05aa32 + adbaae5 commit 28ecd9a

15 files changed

Lines changed: 851 additions & 198 deletions
Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
---
description: Render an HTML dashboard of Claude/Klaud-Cold PR states (state + check breakdown per PR) and open it in the browser
---

Render an HTML dashboard for every open PR in `SemiAnalysisAI/InferenceX` that was opened by Claude (either a `claude/*` branch OR a title prefixed with `[Klaud Cold]`). Each row shows the PR's current state, a check-status breakdown, the title, and empty "Reason"/"Suggested fix" cells you can fill in afterward by reading failed-run logs.

The dashboard lives at `/tmp/klaud_pr_status.html` and is opened with `open` (macOS) at the end.

## Step 1 — list candidate PRs (`claude/*` OR `[Klaud Cold]` title)

```bash
gh pr list --repo SemiAnalysisAI/InferenceX --state open --limit 200 \
  --json number,title,headRefName,createdAt \
  --jq '.[] | select((.headRefName | startswith("claude/")) or (.title | startswith("[Klaud Cold]"))) | "\(.number)\t\(.headRefName)\t\(.createdAt)\t\(.title)"' \
  > /tmp/klaud_pr_candidates.tsv
wc -l /tmp/klaud_pr_candidates.tsv
```

## Step 2 — per-PR state classification

`gh pr list --json statusCheckRollup` truncates rollups, so enumerate candidates first, then re-query each PR individually.

Each check's effective state is `if (.conclusion // "") != "" then .conclusion else .status end`. `gh` returns `conclusion: ""` (not `null`) for in-flight checks, so jq's `//` alone does not fall through to `.status` (it only does so for `null` or `false`); hence the explicit empty-string test.

State buckets:
- **FAILED** — at least one check is `FAILURE` / `CANCELLED` / `TIMED_OUT`, AND no checks are still pending.
- **FAILED+RUNNING** — at least one failed check AND at least one pending check (sweep partially failed; some matrix jobs still running).
- **RUNNING** — no failed checks; at least one is `QUEUED` / `IN_PROGRESS` / `PENDING`.
- **READY** — no failed, no pending, and at least one `Run Sweep` check is `SUCCESS`.
- **NO_SUCCESS** — sweep ran but never produced a `SUCCESS` (e.g. all matrix jobs got SKIPPED).
- **NO_SWEEP** — no `Run Sweep` check exists for this head SHA at all (sweep never triggered — usually missing `full-sweep-enabled` label).

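The effective-state rule and the bucket precedence above can be sketched in Python, which makes the order of the tests easier to check in isolation. This is an illustrative mirror of the jq logic, not part of the command: the names `effective_state`, `classify`, `FAILED_STATES`, `PENDING_STATES`, and `example_checks` are hypothetical, and the input is assumed to be the list found under `statusCheckRollup`.

```python
FAILED_STATES = {"FAILURE", "CANCELLED", "TIMED_OUT"}
PENDING_STATES = {"QUEUED", "IN_PROGRESS", "PENDING"}


def effective_state(check):
    """Return the conclusion if it is non-empty, otherwise the status."""
    # `or` treats both None and "" as missing, matching the jq test
    # `(.conclusion // "") != ""`.
    return check.get("conclusion") or check.get("status")


def classify(checks, sweep_workflow="Run Sweep"):
    """Map one PR's check list to a dashboard bucket."""
    states = [effective_state(c) for c in checks]
    failed = any(s in FAILED_STATES for s in states)
    pending = any(s in PENDING_STATES for s in states)
    sweeps = [c for c in checks if c.get("workflowName") == sweep_workflow]
    swept = any(effective_state(c) == "SUCCESS" for c in sweeps)
    # Precedence matters: failure and pending are tested before sweep success.
    if failed and pending:
        return "FAILED+RUNNING"
    if failed:
        return "FAILED"
    if pending:
        return "RUNNING"
    if swept:
        return "READY"
    if sweeps:
        return "NO_SUCCESS"
    return "NO_SWEEP"


# Hypothetical rollup: one finished sweep job, one still in flight.
example_checks = [
    {"workflowName": "Run Sweep", "conclusion": "SUCCESS", "status": "COMPLETED"},
    {"workflowName": "Run Sweep", "conclusion": "", "status": "IN_PROGRESS"},
]
print(classify(example_checks))  # RUNNING
```

Because the pending test comes before the sweep-success test, a PR with one green sweep job and one in-flight job is reported as RUNNING rather than READY.
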
```bash
# Start from an empty status file, then append one row per candidate PR.
: > /tmp/klaud_pr_status.tsv
while IFS=$'\t' read -r pr branch created title; do
  # Re-query each PR individually: the list query truncates rollups.
  rollup=$(gh pr view "$pr" --repo SemiAnalysisAI/InferenceX --json statusCheckRollup,headRefOid)
  # Bucket the PR from the effective state of every check.
  classification=$(printf '%s' "$rollup" | jq -r '
    def state: if (.conclusion // "") != "" then .conclusion else .status end;
    . as $p
    | ([$p.statusCheckRollup[] | state]) as $s
    | ($s | any(. == "FAILURE" or . == "CANCELLED" or . == "TIMED_OUT")) as $failed
    | ($s | any(. == "QUEUED" or . == "IN_PROGRESS" or . == "PENDING")) as $pending
    | ([$p.statusCheckRollup[] | select(.workflowName == "Run Sweep" and (state) == "SUCCESS")] | length > 0) as $swept
    | ([$p.statusCheckRollup[] | select(.workflowName == "Run Sweep")] | length > 0) as $hasweep
    | if $failed and $pending then "FAILED+RUNNING"
      elif $failed then "FAILED"
      elif $pending then "RUNNING"
      elif $swept then "READY"
      elif $hasweep then "NO_SUCCESS"
      else "NO_SWEEP" end')
  # Count checks per effective state, e.g. "FAILURE=1 SUCCESS=12".
  breakdown=$(printf '%s' "$rollup" | jq -r '
    def state: if (.conclusion // "") != "" then .conclusion else .status end;
    [.statusCheckRollup[] | state] | group_by(.) | map("\(.[0])=\(length)") | join(" ")')
  head_sha=$(printf '%s' "$rollup" | jq -r '.headRefOid')
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$pr" "$classification" "$breakdown" "$branch" "$created" "$head_sha" "$title" >> /tmp/klaud_pr_status.tsv
done < /tmp/klaud_pr_candidates.tsv
```

## Step 3 — render HTML and open

State render order (action items first): `FAILED` → `FAILED+RUNNING` → `NO_SWEEP` → `NO_SUCCESS` → `RUNNING` → `READY`. Within each bucket, descending PR number.

If you have per-PR diagnoses to inject (e.g. after running `/fix-klaud-cron-prs`), write them as a JSON map `{ "1461": {"reason": "...", "fix": "..."}, ... }` to `/tmp/klaud_pr_diag.json` BEFORE running this step; the generator will pick them up. The `reason` and `fix` values are inserted as raw HTML, so they may contain inline `<code>` tags.

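The render order above amounts to a two-part sort key: bucket rank first, then PR number descending. A minimal sketch follows, assuming rows are `(pr_number, state)` pairs. The helper name `sort_rows` and the PR numbers 1470 and 1455 are hypothetical; the fallback rank of 99 for unknown states follows the generator script in this step.

```python
STATE_ORDER = {"FAILED": 0, "FAILED+RUNNING": 1, "NO_SWEEP": 2,
               "NO_SUCCESS": 3, "RUNNING": 4, "READY": 5}


def sort_rows(rows):
    """Sort (pr_number, state) pairs by bucket rank, then PR number descending."""
    # Negating the PR number gives descending order inside an ascending sort.
    return sorted(rows, key=lambda r: (STATE_ORDER.get(r[1], 99), -r[0]))


rows = [(1422, "READY"), (1461, "FAILED"), (1470, "RUNNING"), (1455, "FAILED")]
print(sort_rows(rows))
# [(1461, 'FAILED'), (1455, 'FAILED'), (1470, 'RUNNING'), (1422, 'READY')]
```
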
```bash
cat > /tmp/gen_klaud_pr_status_html.py <<'PYEOF'
#!/usr/bin/env python3
import html, json, datetime as dt
from pathlib import Path

tsv = Path("/tmp/klaud_pr_status.tsv").read_text().strip().splitlines()
diag_path = Path("/tmp/klaud_pr_diag.json")
diag = json.loads(diag_path.read_text()) if diag_path.exists() else {}

state_counts = {}
state_order = {"FAILED": 0, "FAILED+RUNNING": 1, "NO_SWEEP": 2, "NO_SUCCESS": 3, "RUNNING": 4, "READY": 5}
state_class = {
    "READY": "state-READY", "RUNNING": "state-RUNNING",
    "FAILED": "state-FAILED", "FAILED+RUNNING": "state-FAILED",
    "NO_SWEEP": "state-NOSWEEP", "NO_SUCCESS": "state-NOSWEEP",
}

rows = []
for line in tsv:
    parts = line.split("\t")
    if len(parts) < 7:
        continue
    pr, state, breakdown, branch, created, sha, title = parts[0], parts[1], parts[2], parts[3], parts[4], parts[5], "\t".join(parts[6:])
    state_counts[state] = state_counts.get(state, 0) + 1
    d = diag.get(pr, {})
    rows.append((state_order.get(state, 99), -int(pr), pr, state, breakdown, title, d.get("reason", ""), d.get("fix", "")))

rows.sort()
now = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%d %H:%M UTC")

out = ['<!doctype html>',
       '<html lang="en"><head><meta charset="utf-8"><title>Claude / Klaud Cold PR status — InferenceX</title>',
       '<style>',
       ' body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif; max-width: 1700px; margin: 24px auto; padding: 0 16px; color:#222; }',
       ' h1 { font-size: 20px; margin-bottom: 6px; }',
       ' .meta { color:#666; font-size:12px; margin-bottom: 18px; }',
       ' table { border-collapse: collapse; width: 100%; font-size: 13px; }',
       ' th, td { padding: 8px 10px; border-bottom: 1px solid #eee; text-align: left; vertical-align: top; }',
       ' th { background:#f7f7f7; position: sticky; top: 0; z-index:1; }',
       ' tr:hover { background:#fafafa; }',
       ' .state-READY { color:#0a7; font-weight: 600; }',
       ' .state-RUNNING { color:#06c; font-weight: 600; }',
       ' .state-FAILED { color:#c33; font-weight: 600; }',
       ' .state-NOSWEEP { color:#a60; font-weight: 600; }',
       ' .pr { font-family: ui-monospace, "SF Mono", Menlo, monospace; }',
       ' .breakdown { font-family: ui-monospace, "SF Mono", Menlo, monospace; color:#444; white-space: nowrap; font-size:11px; }',
       ' .reason, .fix { font-size: 12px; max-width: 460px; }',
       ' .fix { color:#444; }',
       ' code { background:#f0f0f0; padding:1px 4px; border-radius:3px; font-size:11px; }',
       ' a { color:#06c; text-decoration: none; } a:hover { text-decoration: underline; }',
       ' .summary { display:flex; gap:10px; margin-bottom: 12px; flex-wrap:wrap; }',
       ' .pill { padding: 2px 10px; border-radius: 999px; font-size: 12px; font-weight: 600; }',
       ' .pill.ready { background:#d9f5e6; color:#0a7; }',
       ' .pill.running { background:#dfeeff; color:#06c; }',
       ' .pill.failed { background:#fde0e0; color:#c33; }',
       ' .pill.noswp { background:#fbe9c8; color:#a60; }',
       '</style></head><body>',
       '<h1>Claude / [Klaud Cold] PR status &mdash; InferenceX</h1>',
       f'<div class="meta">Generated {now}. Source: <code>gh pr view --json statusCheckRollup</code> for every <code>claude/*</code> or <code>[Klaud Cold]</code>-titled open PR. Diagnoses (if any) loaded from <code>/tmp/klaud_pr_diag.json</code>.</div>']

pill_specs = [("READY", "ready"), ("RUNNING", "running"),
              ("FAILED", "failed"), ("FAILED+RUNNING", "failed"),
              ("NO_SWEEP", "noswp"), ("NO_SUCCESS", "noswp")]
pills = [f'<span class="pill {cls}">{name}: {state_counts[name]}</span>'
         for name, cls in pill_specs if state_counts.get(name, 0)]
out.append('<div class="summary">' + "".join(pills) + '</div>')

out.append('<table><thead><tr>'
           '<th>PR</th><th>State</th><th>Check breakdown</th>'
           '<th>Reason</th><th>Suggested fix</th><th>Title</th>'
           '</tr></thead><tbody>')

for _, _, pr, state, breakdown, title, reason, fix in rows:
    cls = state_class.get(state, "state-RUNNING")
    out.append(
        f'<tr><td class="pr"><a href="https://github.com/SemiAnalysisAI/InferenceX/pull/{pr}" target="_blank">#{pr}</a></td>'
        f'<td class="{cls}">{state}</td>'
        f'<td class="breakdown">{html.escape(breakdown)}</td>'
        f'<td class="reason">{reason or "&mdash;"}</td>'
        f'<td class="fix">{fix or "&mdash;"}</td>'
        f'<td>{html.escape(title)}</td></tr>'
    )

out.append('</tbody></table></body></html>')
Path("/tmp/klaud_pr_status.html").write_text("\n".join(out))
print(f"Wrote /tmp/klaud_pr_status.html — {len(rows)} rows, states: {state_counts}")
PYEOF
python3 /tmp/gen_klaud_pr_status_html.py
open /tmp/klaud_pr_status.html 2>/dev/null || true
```

Output the path (`/tmp/klaud_pr_status.html`) and the per-state counts to the user. The command is informational only — it does **not** modify any PR.

### Adding diagnoses to the dashboard

To populate the Reason / Suggested fix columns for failing PRs, write a JSON file like this **before** Step 3:

```bash
cat > /tmp/klaud_pr_diag.json <<'EOF'
{
  "1461": {
    "reason": "vLLM v0.21 CUDA-graph profiler OOM at <code>--gpu-memory-utilization 0.90</code>.",
    "fix": "Add <code>export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0</code> before vllm serve."
  },
  "1422": {
    "reason": "Upstream sglang v0.5.12 <code>flash_attn</code> SM-arch regression on B300 (<code>sm_120</code>).",
    "fix": "Pin to <code>v0.5.11-cu130</code>."
  }
}
EOF
```

See `KLAUD_DEBUG.md` for the canonical catalog of recurring failure modes to draw diagnoses from.
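
Because the generator inserts `reason` and `fix` into the table as raw HTML, literals copied from logs (flags, paths, anything containing `<` or `&`) should be escaped before they go into the JSON map. A minimal sketch of one way to build the file from Python, assuming the same `/tmp/klaud_pr_diag.json` path; the helper names `code` and `write_diag` are hypothetical and not part of the command.

```python
import html
import json
from pathlib import Path


def code(text):
    """Wrap a literal in <code>, escaping HTML metacharacters inside it."""
    return f"<code>{html.escape(text)}</code>"


def write_diag(entries, path="/tmp/klaud_pr_diag.json"):
    """Write {pr_number: (reason, fix)} as the JSON map the generator reads."""
    # JSON object keys must be strings, and the generator looks PRs up by string.
    payload = {str(pr): {"reason": reason, "fix": fix}
               for pr, (reason, fix) in entries.items()}
    Path(path).write_text(json.dumps(payload, indent=2))
    return payload
```

For example, `write_diag({1422: ("Regression in " + code("flash_attn") + ".", "Pin to " + code("v0.5.11-cu130") + ".")})` produces an entry of the same shape as the hand-written heredoc above.
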

.github/configs/amd-master.yaml

Lines changed: 25 additions & 25 deletions
@@ -66,7 +66,7 @@ dsr1-fp4-mi355x-atom-mtp:
       - { tp: 8, conc-start: 4, conc-end: 256, spec-decoding: mtp }
 
 dsr1-fp8-mi300x-sglang:
-  image: lmsysorg/sglang:v0.5.9-rocm700-mi30x
+  image: lmsysorg/sglang:v0.5.12-rocm700-mi30x
   model: deepseek-ai/DeepSeek-R1-0528
   model-prefix: dsr1
   runner: mi300x
@@ -85,7 +85,7 @@ dsr1-fp8-mi300x-sglang:
       - { tp: 8, conc-start: 4, conc-end: 64 }
 
 dsr1-fp8-mi325x-sglang:
-  image: lmsysorg/sglang:v0.5.9-rocm700-mi30x
+  image: lmsysorg/sglang:v0.5.12-rocm700-mi30x
   model: deepseek-ai/DeepSeek-R1-0528
   model-prefix: dsr1
   runner: mi325x
@@ -162,7 +162,7 @@ qwen3.5-bf16-mi355x-sglang-mtp:
       - { tp: 8, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: mtp }
 
 qwen3.5-bf16-mi300x-sglang:
-  image: lmsysorg/sglang:v0.5.10-rocm720-mi30x
+  image: lmsysorg/sglang:v0.5.12-rocm720-mi30x
   model: Qwen/Qwen3.5-397B-A17B
   model-prefix: qwen3.5
   runner: mi300x
@@ -527,7 +527,7 @@ kimik2.5-int4-mi355x-vllm:
       - { tp: 8, conc-start: 4, conc-end: 64 }
 
 kimik2.5-int4-mi325x-vllm:
-  image: vllm/vllm-openai-rocm:v0.18.0
+  image: vllm/vllm-openai-rocm:v0.21.0
   model: moonshotai/Kimi-K2.5
   model-prefix: kimik2.5
   runner: mi325x
@@ -802,7 +802,7 @@ minimaxm2.5-fp8-mi300x-vllm-agentic:
       - { tp: 4, offloading: cpu, conc-list: [16, 20, 24, 28, 32] }
 
 minimaxm2.5-fp8-mi325x-vllm:
-  image: vllm/vllm-openai-rocm:v0.18.0
+  image: vllm/vllm-openai-rocm:v0.21.0
   model: MiniMaxAI/MiniMax-M2.5
   model-prefix: minimaxm2.5
   runner: mi325x
@@ -872,7 +872,7 @@ gptoss-fp4-mi300x-vllm:
       - { tp: 8, conc-start: 1, conc-end: 16 }
 
 gptoss-fp4-mi325x-vllm:
-  image: vllm/vllm-openai-rocm:v0.17.0
+  image: vllm/vllm-openai-rocm:v0.21.0
   model: openai/gpt-oss-120b
   model-prefix: gptoss
   runner: mi325x
@@ -1706,25 +1706,6 @@ dsr1-fp4-mi355x-sglang-disagg-mtp:
           - "DECODE_MTP_SIZE=1"
 
 
-dsv4-fp8-mi355x-sglang:
-  image: rocm/sgl-dev:deepseek-v4-mi35x
-  model: sgl-project/DeepSeek-V4-Pro-FP8
-  model-prefix: dsv4
-  runner: mi355x
-  precision: fp8
-  framework: sglang
-  multinode: false
-  scenarios:
-    fixed-seq-len:
-    - isl: 1024
-      osl: 1024
-      search-space:
-      - { tp: 8, conc-start: 4, conc-end: 64 }
-    - isl: 8192
-      osl: 1024
-      search-space:
-      - { tp: 8, conc-start: 4, conc-end: 64 }
-
 # DSv4-Pro FP4 on MI355X via SGLang. Uses a rocm720 mi35x image built off the
 # amd/deepseek_v4 branch in sgl-project/sglang; the SHA is encoded in the
 # image tag, so bumping sglang is just an image tag bump here. Sweeps
@@ -1800,3 +1781,22 @@ dsv4-fp4-mi355x-atom:
       osl: 1024
       search-space:
       - { tp: 8, ep: 1, conc-start: 1, conc-end: 512 }
+
+qwen3.5-bf16-mi325x-sglang-mtp:
+  image: lmsysorg/sglang:v0.5.12-rocm720-mi30x
+  model: Qwen/Qwen3.5-397B-A17B
+  model-prefix: qwen3.5
+  runner: mi325x
+  precision: bf16
+  framework: sglang
+  multinode: false
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
+    - isl: 8192
+      osl: 1024
+      search-space:
+      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
