fix: prevent fallback truncation summaries in compaction#79

Open
cryptomaltese wants to merge 1 commit into Martian-Engineering:main from cryptomaltese:fix/fallback-truncation-repair

Conversation

@cryptomaltese
Contributor

Lossless-Claw PR: Fallback Summarization Fix

Overview

This PR addresses a critical bug in the lossless-claw plugin where LLM summarizer failures (403 errors, timeouts, etc.) trigger a "fallback truncation" that creates useless ~909-token garbage summaries. These summaries pollute the context DB and provide no value for future turns.

The fix: instead of truncating to garbage, skip compaction entirely and retry on the next turn. The system keeps full, meaningful context in place while it waits for the summarizer to recover.

Problem

Current (Broken) Behavior

When CompactionEngine.summarizeWithEscalation() encounters:

  1. Summarizer failure (exception) or
  2. Poor compression (aggressive mode still >= input tokens)

...it falls back to:

const truncated = sourceText.slice(0, FALLBACK_MAX_CHARS);  // 2048 chars
summaryText = `${truncated}\n[Truncated from ${inputTokens} tokens]`;
level = "fallback";

This creates:

  • Garbage summaries (~909 tokens each, an artifact of the FALLBACK_MAX_CHARS cut)
  • Loss of semantic content (just raw text prefix)
  • DB pollution (334+ of these identified in existing DB)
  • Canary marker ([Truncated from N tokens]) that flags the garbage

Impact

  • 99+ summaries created in the last 2 weeks are fallback truncations
  • They consume ~90k tokens in the context DB (~313 bytes each × 334 summaries)
  • Future expansions / repairs must skip or fix these
  • Cascading parent condensed summaries may also be corrupted

Solution

1. Add level Column to summaries Table

File: src/db/migration.ts

ALTER TABLE summaries ADD COLUMN level TEXT NOT NULL DEFAULT 'normal' 
  CHECK (level IN ('normal', 'aggressive', 'fallback'))

Values:

  • 'normal' (default) — summary created with normal mode (good compression)
  • 'aggressive' — summary created with aggressive mode (aggressive but good compression)
  • 'fallback' — deprecated escalation level; should never be set by new code

Backfill:

  1. Query message_parts with part_type='compaction' for level='fallback' events
  2. Extract createdSummaryIds from metadata JSON
  3. Update those summaries to level='fallback' in the DB
  4. Scan remaining summaries for the [Truncated from N tokens] canary and flag them as fallback (future work)

Migration flow:

runLcmMigrations()
  → ensureSummaryLevelColumn()
  → backfillSummaryLevels()      ← NEW
  → ensureSummaryMetadataColumns()
  → backfillSummaryDepths()
  → backfillSummaryMetadata()
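The column add itself stays idempotent by checking the table's existing columns first. A minimal sketch of that guard (the function name and the plain-string-list interface are assumptions for illustration; in the real migration the column list would come from `PRAGMA table_info(summaries)`):

```typescript
// Hypothetical guard: returns the ALTER TABLE statement to run, or null
// if the level column already exists (migration is a no-op on re-run).
function summaryLevelMigrationSql(existingColumns: string[]): string | null {
  if (existingColumns.includes("level")) {
    return null; // column already present: nothing to do
  }
  return (
    "ALTER TABLE summaries ADD COLUMN level TEXT NOT NULL DEFAULT 'normal' " +
    "CHECK (level IN ('normal', 'aggressive', 'fallback'))"
  );
}
```

Returning the SQL (rather than executing it) keeps the guard trivially unit-testable without a database handle.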

2. Fix Fallback Behavior in compaction.ts

File: src/compaction.ts

New summarizeWithEscalation() signature:

/**
 * Run two-level summarization escalation with explicit error handling:
 * normal → aggressive → fail (do NOT truncate to garbage).
 * 
 * Returns SummarizationAttempt with { failed: boolean, content?: string, level?: CompactionLevel }
 */
private async summarizeWithEscalation(params: {
  sourceText: string;
  summarize: CompactionSummarizeFn;
  options?: CompactionSummarizeOptions;
  logger?: { warn: (msg: string) => void };
}): Promise<SummarizationAttempt>

Behavior changes:

  1. Phase 1: Normal mode

    • Call summarize(sourceText, false, options)
    • On error: Return { failed: true, error: "..." } immediately
    • On success but poor compression: Proceed to Phase 2
  2. Phase 2: Aggressive mode (only if normal didn't compress well)

    • Call summarize(sourceText, true, options)
    • On error: Return { failed: true, error: "..." } immediately
    • On success but poor compression: Return { failed: true, error: "..." }
  3. No Phase 3 (no fallback truncation)

    • Delete the entire truncation path
    • If compaction can't improve, the caller skips the compaction and retries next turn
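Under these rules the escalation reduces to two guarded phases. A self-contained sketch of that control flow (the ~4 chars/token estimator, the injected summarize signature, and the error strings are assumptions, not the plugin's actual code):

```typescript
type CompactionLevel = "normal" | "aggressive";

interface SummarizationAttempt {
  failed: boolean;
  content?: string;
  level?: CompactionLevel;
  error?: string;
}

type SummarizeFn = (text: string, aggressive: boolean) => Promise<string>;

// Assumption: a ~4 chars/token estimate stands in for the real token counter.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

async function summarizeWithEscalation(
  sourceText: string,
  summarize: SummarizeFn,
): Promise<SummarizationAttempt> {
  const inputTokens = estimateTokens(sourceText);

  // Phase 1: normal mode; a summarizer error bails immediately.
  let normal: string;
  try {
    normal = await summarize(sourceText, false);
  } catch (err) {
    return { failed: true, error: String(err) };
  }
  if (estimateTokens(normal) < inputTokens) {
    return { failed: false, content: normal, level: "normal" };
  }

  // Phase 2: aggressive mode, reached only on poor compression.
  let aggressive: string;
  try {
    aggressive = await summarize(sourceText, true);
  } catch (err) {
    return { failed: true, error: String(err) };
  }
  if (estimateTokens(aggressive) < inputTokens) {
    return { failed: false, content: aggressive, level: "aggressive" };
  }

  // No Phase 3: report failure so the caller skips compaction this turn.
  return { failed: true, error: "aggressive mode did not compress" };
}
```

Because the summarizer is injected, both failure paths (exception and poor compression) are easy to exercise with stub functions in unit tests.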

Caller changes:

Both leafPass() and condensedPass() must check the attempt result and return null on failure:

private async leafPass(...): Promise<{ ... } | null> {
  // ... build concatenated text ...
  
  const summary = await this.summarizeWithEscalation({ ... });
  
  // NEW: bail if summarization failed
  if (summary.failed || !summary.content) {
    return null;  // Caller sees: compaction incomplete, will retry
  }
  
  // ... persist summary ...
}

In compactLeaf() and compactFullSweep():

const leafResult = await this.leafPass(...);

// NEW: check for null
if (leafResult === null) {
  log.warn(`Leaf compaction failed; skipping and retrying next turn`);
  return {
    actionTaken: false,  // Signal: try again later
    tokensBefore,
    tokensAfter: tokensBefore,  // No progress
    condensed: false,
  };
}

3. Migration Script

File: src/db/migration.ts — function backfillSummaryLevels()

function backfillSummaryLevels(db: DatabaseSync): void {
  // Best-effort: errors are swallowed so the migration never blocks.
  try {
    // Extract fallback compaction events from message_parts.metadata
    const rows = db.prepare(
      `SELECT part_id, metadata FROM message_parts
       WHERE part_type = 'compaction' AND metadata IS NOT NULL`
    ).all() as Array<{ part_id: string; metadata: string }>;

    const mark = db.prepare(
      "UPDATE summaries SET level = 'fallback' WHERE summary_id = ?"
    );

    for (const row of rows) {
      // Parse the metadata JSON and flag summaries created by fallback events
      const meta = JSON.parse(row.metadata);
      if (meta.level !== "fallback") continue;
      for (const id of meta.createdSummaryIds ?? []) mark.run(id);
    }
  } catch {
    // Ignore: backfill is advisory; missed rows can be repaired later
  }
}

4. lcm repair Command

File: src/tools/lcm-repair-command.ts — class LcmRepairEngine

Responsibility:

  • Find summaries with level='fallback' or truncation canary
  • Re-summarize using the same prompts (from summarize.ts)
  • Update DB with new content, set level='normal'
  • Log repairs as compaction events
  • Cascade check: after fixing leaves, verify parents are still valid

Public interface:

async repair(
  input: LcmRepairInput,
  summarizeFn: LcmSummarizeFn
): Promise<LcmRepairResult>

Input:

{
  mode: "scan" | "repair",        // dry-run vs commit
  conversationId?: number,        // specific conversation
  maxSummaries?: number,          // batch size (default 10)
  verbose?: boolean               // detailed logs
}

Output:

{
  mode: "scan" | "repair",
  foundCount: number,             // total fallback summaries found
  repairedCount: number,          // successfully re-summarized
  failedCount: number,            // re-summarization failed
  skippedCount: number,           // scan mode: not attempted
  entries: RepairSummaryEntry[],  // candidates (up to maxSummaries)
  logs: string[],                 // detailed logs if verbose
  cascadeDepth: number            // parent re-condensing depth
}

Algorithm:

  1. Find Phase:

    • Query summaries WHERE level = 'fallback'
    • Query summaries WHERE content LIKE '%[Truncated from%'
    • Deduplicate by summary_id
    • Enrich with lineage (children, parents)
  2. Repair Phase (if mode='repair'):

    • For each leaf: get source messages → reconstruct input → re-summarize
    • For each condensed: get parents → reconstruct input → re-summarize
    • Update summaries with new content, set level='normal'
    • Log repair as compaction event in message_parts
  3. Cascade Phase (future):

    • After leaf repairs, find parent condensed summaries
    • Check if they still compress parents effectively
    • If not, re-condense (calls condensedPass() logic)
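The Find-phase deduplication can be a first-hit-wins merge keyed on summary_id. A sketch of that step (the candidate type and precedence rule are assumptions for illustration):

```typescript
// Hypothetical candidate record: which query surfaced the summary.
interface RepairCandidate {
  summaryId: string;
  source: "level" | "canary";
}

// First-hit-wins merge: a summary flagged by both the level query and the
// canary query is kept once, with the level-column hit taking precedence.
function dedupeCandidates(
  byLevel: RepairCandidate[],
  byCanary: RepairCandidate[],
): RepairCandidate[] {
  const seen = new Map<string, RepairCandidate>();
  for (const c of [...byLevel, ...byCanary]) {
    if (!seen.has(c.summaryId)) seen.set(c.summaryId, c);
  }
  return [...seen.values()];
}
```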

Integration:

Register as a tool in engine.ts:

tools.register({
  name: "lcm_repair",
  description: "Repair fallback truncation summaries",
  inputSchema: { ... },
  invoke: async (input: LcmRepairInput) => {
    const engine = new LcmRepairEngine(db, convStore, summStore, compEngine, config);
    return engine.repair(input, summarizeFn);
  }
});

Testing Checklist

Unit Tests

  • summarizeWithEscalation() returns { failed: true } on summarizer exception
  • summarizeWithEscalation() returns { failed: true } when aggressive mode fails
  • leafPass() returns null when summarization fails
  • condensedPass() returns null when summarization fails
  • compactLeaf() returns actionTaken: false when leafPass is null
  • compactFullSweep() skips compaction on leafPass/condensedPass failures

Integration Tests

  • Migration adds level column with CHECK constraint
  • Backfill finds fallback events in message_parts.metadata
  • Backfill updates level='fallback' on identified summaries
  • LcmRepairEngine finds 99+ fallback summaries in test DB
  • LcmRepairEngine re-summarizes leaf and condensed summaries
  • Repair sets level='normal' on fixed summaries
  • Repair logs compaction event with summary ID

Manual Testing

  1. Create test scenario:

    INSERT INTO summaries (summary_id, conversation_id, kind, depth, level, content, token_count, created_at)
    VALUES ('test_fallback_1', 1, 'leaf', 0, 'fallback', 
            'Lorem ipsum...[Truncated from 5000 tokens]', 1024, datetime('now'));
  2. Run repair in scan mode:

    lcm repair --mode scan --conversationId 1 --verbose
    

    Expected: finds 1+ summaries

  3. Run repair in repair mode:

    lcm repair --mode repair --conversationId 1 --maxSummaries 5
    

    Expected: re-summarizes and updates DB

  4. Verify DB:

    SELECT summary_id, level, content, token_count FROM summaries 
    WHERE summary_id = 'test_fallback_1';

    Expected: level='normal', shorter content, lower token count

Implementation Notes

Why Not "Lazy Fallback"?

Some might suggest: "Just mark it fallback and move on, repair later."

Problems:

  • Corrupts the context with useless garbage immediately
  • Wastes tokens during expansion/assembly
  • Requires background repair job (operational overhead)
  • Breaks the promise: "LCM never loses semantic content"

Our approach:

  • Skip compaction when it can't improve (better for immediate context)
  • Repair existing fallbacks via tool (explicit, auditable, batch-friendly)

Token Accounting

  • FALLBACK_MAX_CHARS = 512 * 4 = 2048 chars
  • At the ~4 chars/token heuristic, fallback summaries ≈ 2048 / 4 = 512 tokens
  • Actual counts vary with tokenizer density → ~300–900 tokens observed
  • level='fallback' plus the canary marker makes them detectable

Migration Safety

  • Add column with DEFAULT 'normal' → no existing rows affected
  • Backfill is best-effort (errors swallowed) → migration never blocks
  • ensureSummaryLevelColumn() checks for existing column → idempotent
  • Can run migration multiple times safely

Repair Cascading

Future work: after re-summarizing leaves, check parent condensed summaries:

  1. Get all condensed parents of repaired leaves
  2. Re-fetch their child summaries (now with new content)
  3. Check if condensation still effective
  4. If not, re-condense by calling condensedPass() logic

Currently stubbed as cascadeDepth: 0 placeholder.

Files Changed

  1. src/db/migration.ts

    • Add level column with CHECK constraint
    • Implement ensureSummaryLevelColumn()
    • Implement backfillSummaryLevels()
    • Call backfill in runLcmMigrations()
  2. src/compaction.ts

    • New SummarizationAttempt type
    • Refactor summarizeWithEscalation() with error handling
    • Delete truncation fallback path
    • Update leafPass() to return null on failure
    • Update condensedPass() to return null on failure
    • Update callers to handle null returns
  3. src/tools/lcm-repair-command.ts (NEW)

    • LcmRepairEngine class
    • findFallbackSummaries()
    • findTruncationCanaries()
    • resummarizeLeaf()
    • resummarizeCondensed()
    • repair() main flow
  4. src/engine.ts

    • Register lcm_repair tool (one-liner)

Backcompat & Migration

  • Old code: created level='fallback' summaries (existing rows still load fine; new ones are no longer produced)
  • New code: Never creates level='fallback' (skips compaction on failure)
  • Mixed environment: OK; old summaries just get level='fallback' during backfill
  • Rollback: Remove level column via down-migration (or just ignore it)

Performance Impact

  • No cost to compaction path (we skip it on failure, which is rare)
  • Repair tool: O(N) scan + O(K) re-summarization where K = maxSummaries
  • Migration: One-time scan of message_parts (fast)

Future Work

  1. Cascade repairs (implement cascadeDepth logic in LcmRepairEngine)
  2. Scheduled repair cron (background job runs nightly)
  3. Repair metrics (track # of fallbacks, success rate, tokens saved)
  4. Aggressive mode tuning (target ratios based on input size)

Status: Draft PR for review
Target: lossless-claw v0.3.1+
Reviewer: @maltese
Related: GitHub issue #334-fallback-summaries

@octalmage

I've run into this a few times; it makes it hard to judge if lcm is working at all.

@cryptomaltese force-pushed the fix/fallback-truncation-repair branch from 832133c to 6ac7145 on March 18, 2026 at 21:39
Replace the three-level summarization escalation (normal → aggressive →
deterministic truncation) with a two-level approach that returns null on
failure instead of creating garbage summaries.

When both normal and aggressive summarization fail to compress below the
input token count, the compaction engine now bails and retries on the
next turn. This prevents useless '[Truncated from N tokens]' summaries
from polluting the DAG — particularly for media-only messages where the
stored text content is just a file path (~28 tokens) that no LLM can
compress further.

Changes:
- Remove truncation fallback in summarizeWithEscalation() — return null
  on compression failure (callers already handle null since upstream)
- Wrap summarizer calls in try/catch to handle LLM errors gracefully
- Add 'level' column to summaries table (normal/aggressive/fallback)
  with migration and backfill from compaction event metadata
- Add SummaryRecord.level to store types
- Add lcm_repair tool to scan/re-summarize existing fallback summaries
- Register repair tool in plugin entry point