## Summary

The self-development loop has five specialized agents (triage, workers, self-update, fake-user, fake-strategist), but no agent systematically analyzes the outcomes of their work. The loop generates PRs, but nobody measures whether those PRs are getting better or worse over time, identifies recurring rejection patterns, or feeds concrete learnings back into prompts.
Issue #508 demonstrated the value of this analysis — a one-time manual review found a 40% closed-without-merge rate and identified 5 systemic prompt gaps. But this analysis was performed once by a human-triggered strategist run. It should be a continuous, automated part of the loop.
## Problem

### The feedback gap in the current self-development pipeline
The self-update agent reviews config files daily but:

- Has no structured data about which PRs succeeded or failed
- Relies on browsing recent PRs ad hoc rather than analyzing them systematically
- Cannot identify statistical trends (e.g., "PRs for `kind/api` issues fail 80% of the time")
- Cannot track whether its own previous improvements actually helped
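The missing "structured data" could be as small as one record per PR. Below is a minimal sketch of a hypothetical outcome schema — the `PrOutcome` name and its fields are illustrative assumptions, not an existing Kelos API:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-PR outcome record; field names are illustrative only.
@dataclass
class PrOutcome:
    number: int
    issue_kind: str              # label of the source issue, e.g. "kind/api"
    merged: bool
    failure_mode: Optional[str]  # rejection category, None if merged
    reset_count: int             # /reset-worker cycles before completion

outcomes = [
    PrOutcome(101, "kind/api", False, "design disagreement", 0),
    PrOutcome(102, "kind/api", False, "scope creep", 1),
    PrOutcome(103, "kind/docs", True, None, 0),
]

# The trend the self-update agent cannot currently see:
# failure rate broken down by issue kind.
api_prs = [o for o in outcomes if o.issue_kind == "kind/api"]
api_failure_rate = sum(not o.merged for o in api_prs) / len(api_prs)
print(f"kind/api failure rate: {api_failure_rate:.0%}")  # kind/api failure rate: 100%
```

With records like this persisted between runs, the "fail 80% of the time" claim becomes a one-line aggregation instead of an ad hoc browse of recent PRs.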
The workers agent creates PRs but has no way to learn from past failures. Each task starts fresh with no accumulated knowledge beyond what's hardcoded in the prompt.
### Evidence this matters

From the current PR data (last ~50 kelos-generated PRs):

- Merge rate: ~53% — nearly half of all agent work is wasted
The 40% rejection rate found in #508 has fluctuated but not systematically improved. Without continuous measurement, we can't know if prompt changes in #510 (the fix for #508) actually improved outcomes.
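The ~53% figure is derivable directly from `gh pr list` JSON. A sketch of the computation follows; the inline sample stands in for real `gh pr list --json number,state,mergedAt,closedAt` output:

```python
import json
from collections import Counter

# Inline stand-in for:
#   gh pr list --state all --label generated-by-kelos --limit 50 \
#     --json number,state,mergedAt,closedAt
prs = json.loads("""[
  {"number": 101, "state": "MERGED", "mergedAt": "2024-06-03T10:00:00Z", "closedAt": "2024-06-03T10:00:00Z"},
  {"number": 102, "state": "CLOSED", "mergedAt": null, "closedAt": "2024-06-04T09:00:00Z"},
  {"number": 103, "state": "OPEN", "mergedAt": null, "closedAt": null}
]""")

# A PR counts as "rejected" only if it was closed without ever being merged.
outcomes = Counter(
    "merged" if pr["mergedAt"] else ("closed" if pr["closedAt"] else "open")
    for pr in prs
)
finished = outcomes["merged"] + outcomes["closed"]
merge_rate = outcomes["merged"] / finished if finished else 0.0
print(f"merge rate: {merge_rate:.0%} of finished PRs")
```

Note that open PRs are excluded from the denominator; counting them as failures would understate the merge rate early in the week.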
````yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: kelos-retrospective
spec:
  when:
    cron:
      schedule: "0 0 * * 1" # Weekly on Monday at midnight UTC
  maxConcurrency: 1
  taskTemplate:
    workspaceRef:
      name: kelos-agent
    model: opus
    type: claude-code
    ttlSecondsAfterFinished: 864000
    credentials:
      type: oauth
      secretRef:
        name: kelos-credentials
    podOverrides:
      resources:
        requests:
          cpu: "250m"
          memory: "512Mi"
          ephemeral-storage: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
          ephemeral-storage: "2Gi"
    agentConfigRef:
      name: kelos-dev-agent
    promptTemplate: |
      You are a retrospective analyst for the Kelos self-development loop. Your job
      is to measure the effectiveness of agent-generated PRs and identify
      evidence-backed improvements to the worker prompt.

      ## Step 1: Collect PR outcome data (last 7 days)

      Fetch all recent kelos-generated PRs:

      ```
      gh pr list --state all --label generated-by-kelos --limit 50 --json number,title,state,mergedAt,closedAt,body,labels
      ```

      For each PR, classify it:
      - **Merged**: Successfully contributed to the project
      - **Closed without merge**: Rejected — investigate why

      For each closed PR, read its review comments to categorize the rejection:

      ```
      gh api repos/{owner}/{repo}/pulls/{number}/reviews
      gh api repos/{owner}/{repo}/pulls/{number}/comments
      gh pr view {number} --comments
      ```

      Categorize rejections into failure modes:
      - **Scope creep**: Agent added unrequested features
      - **Design disagreement**: Maintainer wanted a different approach
      - **Duplicate/existing code**: Agent created something that already existed
      - **Format/convention violation**: PR description, commit format, etc.
      - **Quality issue**: Tests missing, code incorrect, review feedback ignored
      - **Not actionable**: Issue was not suitable for autonomous agent work
      - **Stale/superseded**: Another PR addressed the same issue first

      ## Step 2: Compute metrics

      Calculate:
      - Total PRs created this week
      - Merge rate (merged / total)
      - Rejection rate by failure mode
      - Average reset count (how many /reset-worker cycles per merged PR)
      - Compare to previous weeks if data is available from prior retrospective issues

      ## Step 3: Identify actionable patterns

      For each failure mode with 2+ occurrences:
      - Read the current worker prompt in `self-development/kelos-workers.yaml`
      - Check if the prompt already addresses this failure mode
      - If not, propose a specific prompt addition with exact wording

      For merged PRs that required multiple resets:
      - What did the agent miss on the first attempt?
      - Could the prompt be clearer about that scenario?

      ## Step 4: Output

      If you find actionable improvements (new failure patterns not yet addressed
      in the prompt, or evidence that previous prompt changes didn't help):

      Create a GitHub issue with:
      - Title: "Retrospective: [week date range] — [key finding]"
      - Body with: metrics summary, failure mode breakdown, specific prompt changes

      ```
      gh issue create --title "..." --body "..." --label generated-by-kelos
      ```

      If all failure modes are already addressed in the prompt and merge rate is
      above 70%: exit without creating an issue. Not every run needs output.

      ## Constraints

      - Only analyze PRs from the last 7 days (avoid re-analyzing old data)
      - Do NOT create PRs — only create issues with proposals
      - Check existing issues first to avoid duplicates: `gh issue list --label generated-by-kelos --limit 20`
      - Be specific: cite PR numbers, quote review comments, propose exact prompt wording
      - Do not create vague "we should improve X" issues — every proposal must include concrete text changes
  pollInterval: 1m
````
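The prompt's "last 7 days" constraint has to be enforced client-side, since `gh pr list --limit 50` is count-bounded rather than time-bounded. Below is a sketch of such a filter (pure stdlib; timestamps use the GitHub API's ISO-8601 `Z` suffix, and the `within_window` helper is an illustrative name, not part of any existing tooling):

```python
from datetime import datetime, timedelta, timezone

def within_window(iso_ts, days=7, now=None):
    """True if an ISO-8601 timestamp falls within the last `days` days."""
    if iso_ts is None:  # open PRs have null mergedAt/closedAt
        return False
    now = now or datetime.now(timezone.utc)
    # Python < 3.11 fromisoformat does not accept a trailing "Z".
    ts = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    return now - ts <= timedelta(days=days)

# Pin "now" so the example is deterministic.
now = datetime(2024, 6, 10, tzinfo=timezone.utc)
print(within_window("2024-06-05T12:00:00Z", now=now))  # True  (4.5 days ago)
print(within_window("2024-05-20T12:00:00Z", now=now))  # False (about 3 weeks ago)
```

Filtering on `mergedAt`/`closedAt` rather than creation date keeps a PR opened last month but resolved this week inside the current retrospective.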
🤖 Kelos Agent @gjkim42
## What's missing vs. what exists
## Proposal: `kelos-retrospective` TaskSpawner

Add a new cron-based TaskSpawner that runs weekly, systematically analyzes PR outcomes, and produces either:

- a GitHub issue with a metrics summary, a failure-mode breakdown, and concrete prompt-change proposals, or
- no output, when the merge rate is healthy and no unaddressed failure patterns appear.
Proposed config: `self-development/kelos-retrospective.yaml` (full `TaskSpawner` manifest above).

## Why this is different from `kelos-self-update`
## Expected impact

## Implementation notes

The new config file is added under `self-development/`, alongside the existing agent configs.

## Related issues