Implement Branch-Aware RAG with a Self-Contained, Snapshot-Based Database #7

@retran

Summary

This issue tracks the implementation of the project's core Retrieval-Augmented Generation (RAG) capabilities. The architecture is designed to be branch-aware, efficient, and fully self-contained within a single project database file.

It uses a Git-like, snapshot-based model to handle different versions of the codebase (i.e., Git branches) without data duplication, and a content-addressable store for expensive-to-compute artifacts like ASTs and vector embeddings. The entire stack will be pure Go, adhering to the project's core principles.

1. Core Architecture: Single Database with Snapshots

  • Database: A single modernc.org/sqlite database file at .meowg1k/index.db will be the sole source of truth. The database must be opened in WAL (Write-Ahead Logging) mode so that reader processes (query, generate) can run concurrently with the writer process (index) without blocking each other.
  • Content-Addressable Storage: All processed file data (ASTs, chunks, embeddings) will be stored once in a content_store table, keyed by the SHA-256 hash of the file's content.
  • Branch Snapshots: A snapshots table will represent the state of a Git branch at a specific point in time. Each snapshot is a manifest listing the files and their content hashes for that branch.
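
As a minimal sketch of the content-addressing idea, the key for a file could be derived as below. The function name contentHash and the lowercase hex encoding are assumptions for illustration; the issue only specifies that the key is the SHA-256 hash of the file's content.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentHash returns the lowercase hex SHA-256 digest of a file's bytes.
// Hypothetical helper: the hex encoding is an assumption, not part of the spec.
func contentHash(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := contentHash([]byte("package main\n"))
	b := contentHash([]byte("package main\n"))
	// Identical content always maps to the same key, regardless of path or branch.
	fmt.Println(a == b)
}
```

Because the key depends only on the bytes, a file that is unchanged between two branches resolves to the same content_store row.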

2. Database Schema

A _schema_meta table will be used for versioning the schema itself.

content_store (Stores the heavy, processed data)

  • content_hash (PK, TEXT)
  • raw_content_blob (BLOB)
  • treesitter_ast_blob (BLOB)
  • chunks_blob (BLOB)
  • vectors_blob (BLOB)

snapshots (Represents a state of a branch)

  • id (PK, INTEGER)
  • branch_name (TEXT)
  • created_at (TIMESTAMP)

snapshot_files (Links a snapshot to its content)

  • snapshot_id (FK to snapshots.id)
  • filepath (TEXT)
  • content_hash (FK to content_store.content_hash)

vector_indexes (Stores the built HNSW search indexes for a snapshot)

  • snapshot_id (FK to snapshots.id)
  • index_name (TEXT, e.g., "plain")
  • index_blob (BLOB)
  • last_updated (TIMESTAMP)

3. Configuration (indexing section)

The config.yaml will define strategies for creating documents and the final search indexes, including per-index embedding profiles.

# .meowg1k/config.yaml
indexing:
  # Default profile used if an index doesn't specify its own.
  embeddingProfile: "default-embedding-model"

  document_strategies:
    - name: "go_files"
      parser: "treesitter"
      language: "go"
      queries: ["(function_declaration) @document"]
      rules:
        - match: "**/*.go"

  indexes:
    - name: "plain"
      from_documents: ["go_files"]
      chunk_strategy: "by_line"

    - name: "go_functions"
      embeddingProfile: "powerful-embedding-model"
      from_documents: ["go_files"]
      chunk_strategy: "document_as_chunk"
      filter: "document_type == 'function'"
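
To illustrate the per-index embedding profile fallback described in the config comment, here is a small sketch. The Go type and method names (IndexingConfig, IndexConfig, ProfileFor) are hypothetical, and YAML decoding is left out so the example stays within the standard library.

```go
package main

import "fmt"

// IndexConfig mirrors one entry under indexing.indexes (hypothetical type).
type IndexConfig struct {
	Name             string
	EmbeddingProfile string // empty means "inherit the default profile"
	FromDocuments    []string
	ChunkStrategy    string
	Filter           string
}

// IndexingConfig mirrors the indexing section; document strategies are omitted.
type IndexingConfig struct {
	EmbeddingProfile string
	Indexes          []IndexConfig
}

// ProfileFor returns the embedding profile that applies to the named index.
// The second result is false when no index with that name is configured.
func (c IndexingConfig) ProfileFor(index string) (string, bool) {
	for _, ix := range c.Indexes {
		if ix.Name != index {
			continue
		}
		if ix.EmbeddingProfile != "" {
			return ix.EmbeddingProfile, true
		}
		return c.EmbeddingProfile, true
	}
	return "", false
}

// exampleConfig reproduces the values from the config.yaml shown above.
func exampleConfig() IndexingConfig {
	return IndexingConfig{
		EmbeddingProfile: "default-embedding-model",
		Indexes: []IndexConfig{
			{Name: "plain", FromDocuments: []string{"go_files"}, ChunkStrategy: "by_line"},
			{
				Name:             "go_functions",
				EmbeddingProfile: "powerful-embedding-model",
				FromDocuments:    []string{"go_files"},
				ChunkStrategy:    "document_as_chunk",
				Filter:           "document_type == 'function'",
			},
		},
	}
}

func main() {
	cfg := exampleConfig()
	for _, name := range []string{"plain", "go_functions"} {
		profile, _ := cfg.ProfileFor(name)
		fmt.Printf("%s -> %s\n", name, profile)
	}
}
```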

4. meow index Command (Snapshot Reconciliation Workflow)

This command will be the core of the indexing process.

  1. Detect Branch: Get the current Git branch name.
  2. Scan Filesystem: Scan all project files and calculate the content_hash for each.
  3. Update Content Store: For each file, if its hash is not in content_store, process it (parse AST, chunk, embed) and save the results.
  4. Create New Snapshot: Create a new entry in the snapshots table for the current branch. Populate snapshot_files with the current list of (filepath, content_hash) pairs.
  5. Build Vector Indexes: For each index defined in the config, query the data belonging to the new snapshot, build the HNSW index, and save it to the vector_indexes table, linked to the snapshot_id.
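
Steps 2 through 4 can be sketched with in-memory maps standing in for the content_store, snapshots, and snapshot_files tables. Everything here is illustrative: the Store and Snapshot types, the Index method, and the process placeholder (which stands in for AST parsing, chunking, and embedding) are assumptions, not the planned implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Snapshot is a manifest of filepath -> content_hash for one branch state.
type Snapshot struct {
	ID     int
	Branch string
	Files  map[string]string
}

// Store is an in-memory stand-in for the database tables.
type Store struct {
	content   map[string][]byte // content_hash -> processed artifact
	snapshots []Snapshot
	processed int // number of cache misses, i.e. expensive processing runs
}

func NewStore() *Store { return &Store{content: map[string][]byte{}} }

// process is a placeholder for parsing, chunking, and embedding a file.
func process(raw []byte) []byte { return raw }

// Index hashes every file, processes only unseen content, and records a
// new snapshot manifest for the branch.
func (s *Store) Index(branch string, files map[string][]byte) Snapshot {
	snap := Snapshot{ID: len(s.snapshots) + 1, Branch: branch, Files: map[string]string{}}

	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths) // deterministic processing order

	for _, p := range paths {
		sum := sha256.Sum256(files[p])
		hash := hex.EncodeToString(sum[:])
		if _, ok := s.content[hash]; !ok { // cache miss: do the expensive work once
			s.content[hash] = process(files[p])
			s.processed++
		}
		snap.Files[p] = hash
	}
	s.snapshots = append(s.snapshots, snap)
	return snap
}

func main() {
	s := NewStore()
	s.Index("main", map[string][]byte{"a.go": []byte("A"), "b.go": []byte("B")})
	s.Index("feature", map[string][]byte{"a.go": []byte("A"), "b.go": []byte("B2")})
	fmt.Println("snapshots:", len(s.snapshots), "processed:", s.processed)
}
```

In this run the second branch reuses the stored artifact for a.go and only processes the changed b.go, which is the deduplication the snapshot model is meant to provide.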

5. meow query Command

A standalone command for directly querying the indexes of the current branch's latest snapshot.

# Search the 'go_functions' index for functions related to authentication
meow query --index go_functions "user authentication logic"

6. Integration with meow generate

Tasks will use a rag key to specify which indexes to query. The command will automatically use the latest snapshot for the current branch.

generate:
  tasks:
    refactor-auth:
      rag:
        indexes: ["go_functions", "plain"]
      userPrompt: "Refactor the main authentication function..."
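
One check that meow generate might perform before fetching RAG context is that every index named under a task's rag key is actually defined in the indexing config. The sketch below is an assumption about how that validation could work; the function name unknownIndexes is hypothetical and the issue does not specify this behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// unknownIndexes returns the requested index names that are not configured,
// sorted and without duplicates.
func unknownIndexes(configured, requested []string) []string {
	known := make(map[string]bool, len(configured))
	for _, name := range configured {
		known[name] = true
	}
	seen := map[string]bool{}
	var missing []string
	for _, name := range requested {
		if !known[name] && !seen[name] {
			seen[name] = true
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	configured := []string{"plain", "go_functions"}
	// The refactor-auth task above references only configured indexes.
	fmt.Println(unknownIndexes(configured, []string{"go_functions", "plain"}))
	// A task with a typo or stale index name would be reported.
	fmt.Println(unknownIndexes(configured, []string{"go_functions", "docs"}))
}
```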

Acceptance Criteria

  • Phase 1: Foundation

    • [ ] Integrate modernc.org/sqlite and a pure Go HNSW library (e.g., coder/hnsw).
    • [ ] Implement the full SQLite schema with versioning and WAL mode enabled.
    • [ ] Implement the indexing configuration structure with per-index embedding profiles.
    • [ ] Implement the service for managing the content_store.
    • [ ] Implement the Tree-sitter backend via WASM.
  • Phase 2: Indexing

    • [ ] Implement the meow index command with the full snapshot-based reconciliation logic.
    • [ ] The command correctly populates content_store on a cache miss.
    • [ ] The command correctly creates new entries in snapshots and snapshot_files.
    • [ ] The HNSW index builder queries data based on a snapshot_id and saves the result to vector_indexes.
  • Phase 3: Querying & Integration

    • [ ] Implement the meow query command, making it branch-aware.
    • [ ] Update meow generate to use the latest snapshot for the current branch to fetch RAG context.
  • Phase 4: Documentation

    • [ ] Write comprehensive documentation for the new RAG system, explaining the architecture and commands.

Metadata

Labels: enhancement (New feature or request)

Status: In Progress
