Merge v4_Researcher → GEAK_v4: Deep Research Agent (DRA) for kernel_workflow #293

Open
Umangatamd wants to merge 9 commits into GEAK_v4 from v4_Researcher

@Umangatamd

Collaborator

Promotes the Deep Research Agent (DRA) work from v4_Researcher into GEAK_v4 (via #291).

What DRA does

An opt-in Research phase in kernel_workflow (gated by dra_enabled, default off). After profiling, DRA:

  • extracts kernel facts + ranked bottleneck/design hypotheses,
  • generates ranked research questions (bottleneck + design-space / alternative-implementation),
  • researches them in parallel via native WebSearch/WebFetch,
  • synthesizes a compact, ranked portfolio of optimization directions in deep_search_brief.md (full evidence in deep_search.md).

The brief is handed to the TechLead planner as advisory suggestions: the optimizer does its own profile/code analysis first, then decides which (if any) to adopt.

Results (A/B: no-DRA vs DRA, budget=3)

| Kernel | no-DRA | DRA | Δ |
|---|---|---|---|
| KNN (HIP) | 9.25x | 11.70x | +2.45x (+27%) |
| gemm_a16wfp4 (Triton) | 1.539x | 1.531x | ~tie |
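
For reference, the Δ column can be recomputed from the speedups in the table. This is a minimal sketch; `relative_gain` is a helper name introduced here for illustration, and the inputs are the rounded values shown above, so the last digit may differ slightly from figures computed on unrounded measurements.

```python
def relative_gain(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, e.g. 0.10 means +10%."""
    return (candidate - baseline) / baseline


# Speedups copied from the A/B table (no-DRA vs DRA, budget=3).
knn_no_dra, knn_dra = 9.25, 11.70
gemm_no_dra, gemm_dra = 1.539, 1.531

knn_abs = knn_dra - knn_no_dra                   # absolute gain in "x"
knn_rel = relative_gain(knn_no_dra, knn_dra)     # relative gain
gemm_rel = relative_gain(gemm_no_dra, gemm_dra)  # small negative value

print(f"KNN:  +{knn_abs:.2f}x ({knn_rel:+.1%})")
print(f"gemm: {gemm_rel:+.1%}")
```

On the rounded inputs this gives roughly +26.5% for KNN and about -0.5% for gemm_a16wfp4, which is in line with the "+27%" and "~tie" entries.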

On KNN, the DRA gain traces to adopted brief directions — warp-cooperative WarpSelect (wave64), Template<K> scratch-spill elimination into VGPRs, and wrapper/output-layout fixes.

Made with Cursor

dra and others added 9 commits June 22, 2026 03:25
Add the `researcher` persona (kernel_workflow/roles/researcher.md) — a v4-native
Deep Research Agent mirroring v3's Stage 0-7 pipeline (fact extraction → ranked
research questions → per-question native web research → optional blindspot
critique → ranked-directions portfolio) with phase contracts for research_plan /
research_question / research_blindspot / research_synthesize.

Wire a new opt-in phase('Research') into kernel_workflow.js AFTER Profile and
BEFORE the optimize loop, gated behind args.dra_enabled (default off → existing
runs byte-identical). The phase fans research questions out in PARALLEL via
parallel(), wraps every research agent in the agentT() hang-guard so a hung
research agent resolves null instead of wedging the round-barrier, and writes
deep_search.md / deep_search_brief.md / deep_search.json into EVAL_DIR. Adds
RESEARCH_PLAN/QUESTION/BLINDSPOT/RESEARCH schemas and threads the brief path into
tech_lead plan_round.
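
The fan-out plus hang-guard pattern described above can be sketched as follows. This is an illustration in Python with asyncio, not the kernel_workflow.js code: `guarded`, `research_question`, and `research_phase` are hypothetical names, and the simulated delays stand in for real research agents. The point is only that a hung agent resolves to None instead of blocking the barrier.

```python
import asyncio


async def guarded(coro, timeout_s: float):
    """Hang-guard: return the agent's result, or None if it exceeds timeout_s."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def research_question(question: str, delay_s: float) -> dict:
    # Hypothetical stand-in for one research agent; delay_s simulates latency.
    await asyncio.sleep(delay_s)
    return {"question": question, "finding": f"notes on {question}"}


async def research_phase(questions, timeout_s: float) -> list:
    # Fan all questions out at once; gather() keeps results in input order.
    results = await asyncio.gather(
        *(guarded(research_question(q, d), timeout_s) for q, d in questions)
    )
    # Drop agents that hung (None) so synthesis only sees completed research.
    return [r for r in results if r is not None]


def run_demo() -> list:
    questions = [("bottleneck", 0.01), ("design-space", 0.01), ("hung-agent", 5.0)]
    return asyncio.run(research_phase(questions, timeout_s=0.2))


if __name__ == "__main__":
    for result in run_demo():
        print(result["question"])
```

In this sketch the third question times out and is dropped, so the phase finishes after roughly the timeout rather than waiting for the slowest agent.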
… brief

plan_round now Reads EVAL_DIR/deep_search_brief.md (when DEEP_SEARCH_BRIEF is
set) and seeds directions[] from the ranked DRA directions, carrying v3's
hard-won lessons: DIVERSIFY (spread different ranked directions across parallel
engineers, always keep >=1 free explorer slot, never anchor all engineers on one
theme); treat HIGH-CEILING rewrites (raw-HIP/load_inline, HIP/CUDA graph capture,
algorithmic reformulation) as FIRST-CLASS not secondary; and don't over-prescribe
(idea/mechanism only). The brief is a prior, never a cage — profile/per-case data
and measurement still rule. No-op when the brief path is empty.

Add WebSearch + WebFetch to interface/run_e2e.py ALLOWED_TOOLS so the Deep
Research Agent's per-question research agents can do native web research. Harmless
when dra_enabled is off (nothing opts into them).

Document the opt-in Research phase (Stage 0-7 flow, parallel fan-out + hang-guard,
brief->plan_round handoff with diversity + de-conservatism), the dra_enabled /
dra_max_questions / dra_blindspot / dra_max_blindspots args, the deep_search.*
artifacts + research/ trail, the researcher role, and the WebSearch/WebFetch
allowlist requirement.

CONCERN 1 (fusion): a single-kernel DRA could overlook fusion entirely. Add a
"Fusion & kernel scope" section + a fusion angle to research_plan question
generation + a synthesis rule so fusion is never buried: intra-kernel fusion
(collapse dispatches / fold epilogue) is surfaced as a first-class executable
direction; cross-kernel fusion (merge with an adjacent op) is recorded as an
e2e-level ESCALATION in open_measurements (the single-kernel layer can't extract
a neighbor against its immutable single-op oracle) rather than lost. The
researcher must not propose keeping an op standalone against an upstream fusion.

CONCERN 2 (advisory-not-dominant): add an explicit "You are ADVISORY, not the
decision-maker" section and reframe the Stage 7 portfolio as suggestions to be
vetted against the profile, never mandates.
…nant

Rewrite plan_round rule 2b so the TechLead remains THE decision-maker and the DRA
brief cannot regress into v3-style anchoring:
- brief is ADVISORY/OPTIONAL, not a plan to execute; critically evaluate each Dk
  against THIS kernel's profile/per-case data and reject/ignore ones that don't fit
- the DRA NEVER fills 100% of the round: always generate >=1 of the TechLead's own
  profile-driven directions, keep >=1 free explorer slot, brief seeds at most
  BUDGET_REMAINING-1 directions
- DIVERSIFY (spread different Dk across engineers, never converge on one theme)
- HIGH-CEILING directions first-class WHEN they fit the profile
- FUSION: intra-kernel fusion is a normal direction; a cross-kernel-fusion
  escalation is NOT executable here, leave it as the researcher's note
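
One way to read the slot rules above, as a minimal sketch. `seed_round` and the `(source, direction)` tuples are hypothetical, and the sketch assumes the free explorer slot is separate from the TechLead's own direction, which the commit message does not spell out. The real plan_round is a prompt contract, not this function.

```python
def seed_round(own, brief, budget_remaining):
    """Fill one round's slots as a list of (source, direction) pairs.

    own:   the TechLead's profile-driven directions, best first
    brief: ranked DRA suggestions (may be empty, which makes the brief a no-op)
    """
    if budget_remaining <= 0:
        return []
    if not own:
        raise ValueError("the TechLead must supply at least one own direction")

    # Rule: always >= 1 of the TechLead's own directions.
    slots = [("techlead", own[0])]
    seen = {own[0]}
    open_slots = budget_remaining - 1

    # Rule: keep >= 1 free explorer slot (assumed: whenever a slot is left).
    keep_explorer = open_slots >= 1
    if keep_explorer:
        open_slots -= 1

    # Rule: the brief seeds at most BUDGET_REMAINING - 1 directions.
    brief_cap = min(open_slots, budget_remaining - 1)
    for direction in brief:
        if brief_cap == 0:
            break
        if direction in seen:  # diversify: never repeat a direction
            continue
        slots.append(("brief", direction))
        seen.add(direction)
        brief_cap -= 1
        open_slots -= 1

    # Leftover slots go back to the TechLead's remaining directions.
    for direction in own[1:]:
        if open_slots == 0:
            break
        if direction not in seen:
            slots.append(("techlead", direction))
            seen.add(direction)
            open_slots -= 1

    if keep_explorer:
        slots.append(("explorer", None))
    return slots


if __name__ == "__main__":
    print(seed_round(["tile via LDS"], ["WarpSelect", "scratch-spill"], 3))
```

With budget 3, one TechLead direction, and two brief suggestions, this yields one TechLead slot, one brief slot, and one explorer slot, so the brief never fills the round.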
… first

Strengthen the advisory framing so the DRA brief is unambiguously a set of
SUGGESTIONS to consider, not directives. plan_round now mandates an explicit
order: the TechLead does its OWN independent profile/code analysis and forms its
own candidate directions FIRST, then consults the brief and decides by its own
judgment which (if any) suggestions to adopt — free to adapt/ignore/reject all
(adopting none is valid). researcher.md synthesis tone reworded to "consider/one
option is" rather than imperative. Existing diversity + free-explorer +
high-ceiling-first + fusion rules preserved. node --check passes.

feat(dra): Deep Research Agent (DRA) for kernel_workflow
@Umangatamd Umangatamd requested a review from zihaoanllm June 23, 2026 11:28
@zihaoanllm

Collaborator

Could you also add the runtime overhead introduced by DRA?

In addition, DRA does not appear to provide much benefit for the LLM head kernels based on the current results. Could you include more head-kernel benchmark results to better evaluate its effectiveness in that scenario?

@Umangatamd

Collaborator Author

Runtime overhead: Right now the DRA research phase adds about 20% to the run wall-clock. It is a one-time, opt-in cost before the optimize loop (tunable via the question/blindspot budget).

Head kernels: Agreed it's worth more coverage there. Could you recommend a few head kernels you'd like benchmarked? Happy to run the no-DRA vs DRA A/B on them.
