
Conversation

@arabold (Owner) commented Oct 4, 2025

Introduce ETag and lastModified metadata for improved caching in fetchers and strategies, and implement a refresh job feature for efficient re-indexing of existing documents.

@arabold force-pushed the feature/refresh-version-tool branch from 19c64d0 to 203a561 on October 27, 2025 14:52
@arabold requested a review from Copilot on October 27, 2025 14:53

Copilot AI left a comment

Pull Request Overview

This PR introduces ETag and lastModified metadata support for improved caching in fetchers and strategies, and implements a refresh job feature for document management. The changes enable efficient re-indexing of existing documentation by leveraging HTTP conditional requests to skip unchanged content.

Key Changes:

  • Added ETag and Last-Modified header support throughout the scraping pipeline for cache validation
  • Implemented refresh job functionality to re-scrape existing library versions with conditional requests
  • Refactored type system to separate page-level metadata (etag, lastModified) from chunk-level metadata (see the type sketch after this list)
  • Introduced RefreshVersionTool for triggering refresh operations on existing library versions
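
As a rough illustration of that page/chunk split, a minimal sketch (only `etag`, `lastModified`, and `depth` are confirmed by this PR's description; the remaining field names are assumptions, not the PR's actual definitions):

```ts
// Hypothetical shapes for the metadata split; illustrative only.
interface DbPage {
  id: number;
  url: string;
  etag: string | null; // page-level cache validator from the ETag header
  lastModified: string | null; // page-level Last-Modified header value
  depth: number; // crawl depth, so refreshes can respect maxDepth
}

interface DbPageChunk {
  pageId: number; // foreign key to the owning page
  content: string; // chunk text used for search and embeddings
  types?: string[]; // chunk-level content types, e.g. "text", "code", "table"
}
```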

Reviewed Changes

Copilot reviewed 94 out of 95 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tsconfig.test.json | New TypeScript configuration for test files |
| src/types/index.ts | Removed document type definitions (moved to store/types.ts) |
| src/tools/ScrapeTool.ts | Updated to use renamed pipeline method enqueueScrapeJob |
| src/tools/ScrapeTool.test.ts | Updated tests for renamed enqueueScrapeJob method |
| src/tools/RefreshVersionTool.ts | New tool for refreshing existing library versions with ETag support |
| src/tools/ListJobsTool.test.ts | Updated type references and test data structure |
| src/tools/FetchUrlTool.ts | Updated pipeline interface usage |
| src/store/types.ts | Added depth field to DbPage, restructured chunk metadata types, added DbLibraryVersion |
| src/store/assembly/types.ts | Replaced Document type with DbPageChunk |
| src/store/assembly/strategies/*.ts | Updated to use DbPageChunk instead of Document |
| src/store/DocumentStore.ts | Added etag/lastModified storage, refactored addDocuments to accept ScrapeResult |
| src/store/DocumentStore.test.ts | Updated tests for new addDocuments signature with ScrapeResult |
| src/store/DocumentRetrieverService.ts | Updated to work with DbPageChunk instead of Document |
| src/store/DocumentManagementService.ts | Refactored addDocument to addScrapeResult, added page management methods |
| src/splitter/types.ts | Renamed ContentChunk to Chunk |
| src/scraper/types.ts | Added QueueItem with etag support, restructured ScrapeResult, renamed ScraperProgress |
| src/scraper/strategies/*.ts | Updated to handle ETag-based conditional requests and refresh operations |
| src/scraper/pipelines/*.ts | Updated pipeline interfaces to accept mimeType parameter |


This commit introduces a version refresh feature that enables efficient re-indexing of previously scraped library versions. By leveraging HTTP ETags, the new mechanism avoids re-processing unchanged pages, significantly reducing processing time, bandwidth usage, and embedding costs.

Previously, re-indexing a library version was a wasteful process that required deleting all existing data and re-scraping every page from scratch, even if most of the content was unchanged.

The refresh mechanism re-visits all previously scraped pages with their stored ETags. This allows the scraper to:
- **Skip** unchanged pages (HTTP 304 Not Modified).
- **Re-process** only pages that have changed (HTTP 200 OK).
- **Delete** documents for pages that are no longer available (HTTP 404 Not Found).
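
Concretely, the branching over those three response cases might look like the following minimal sketch (`deletePage` and `reindexPage` are hypothetical stand-ins for the real store and pipeline calls):

```ts
// Hypothetical helpers standing in for the actual store/pipeline methods.
declare function deletePage(url: string): Promise<void>;
declare function reindexPage(url: string, body: string, etag: string | null): Promise<void>;

// Sketch of an ETag-based conditional refresh for one previously scraped page.
async function refreshPage(url: string, storedEtag: string | null): Promise<void> {
  const response = await fetch(url, {
    headers: storedEtag ? { "If-None-Match": storedEtag } : {},
  });

  if (response.status === 304) {
    return; // unchanged: skip re-processing entirely
  }
  if (response.status === 404) {
    await deletePage(url); // page is gone: remove its documents
    return;
  }
  if (response.ok) {
    const body = await response.text();
    const etag = response.headers.get("ETag");
    await reindexPage(url, body, etag); // changed: re-split, re-embed, store the new ETag
  }
}
```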

- **Reused Existing Scraper Infrastructure**: The refresh operation is a standard scrape job with a pre-populated `initialQueue`. This leverages existing logic for progress tracking, error handling, and state management.
- **Database Schema Update**: A `depth` column has been added to the `pages` table to ensure that refresh operations respect the original `maxDepth` constraints. A database migration (`010-add-depth-to-pages.sql`) is included to apply this change.
- **Conditional Fetching**: The scraper's `processItem` logic has been updated to handle conditional requests. It now correctly processes 304, 404, and 200 HTTP responses to either skip, delete, or update documents.
- **Pipeline Manager Integration**: A new `enqueueRefreshJob` method was added to the `PipelineManager` to orchestrate the refresh process by fetching pages from the database and populating the `initialQueue`.
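
A sketch of how such a refresh job could be seeded (the shapes and helper names here are assumed for illustration; the real `PipelineManager.enqueueRefreshJob` may differ):

```ts
// Assumed queue-item shape: refresh entries carry their stored metadata.
interface QueueItem {
  url: string;
  depth: number; // preserved so the original maxDepth is still respected
  pageId?: number; // ties results back to the existing page row
  etag?: string | null; // enables If-None-Match conditional requests
}

// Hypothetical accessors for stored pages and the scrape-job queue.
declare function loadPages(library: string, version: string): Promise<QueueItem[]>;
declare function enqueueScrapeJob(
  library: string,
  version: string,
  initialQueue: QueueItem[],
): Promise<string>;

// A refresh is just a standard scrape job seeded with the known pages.
async function enqueueRefreshJob(library: string, version: string): Promise<string> {
  const initialQueue = await loadPages(library, version);
  return enqueueScrapeJob(library, version, initialQueue);
}
```
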
- Restructure AGENTS.md into clearer sections with improved hierarchy
- Add detailed documentation writing principles and structure guidelines
- Expand TypeScript development standards with testing and error handling
- Include comprehensive commit message format and conventions
- Add security, MCP protocol, and pull request guidelines
- Improve readability with better formatting and organization

The refactoring transforms the instruction file from a flat list into a well-organized reference document with specific sections for documentation, development practices, and contribution standards.
- Update migration to backfill depth based on source_url matching instead of default 0
- Root pages (matching source_url) assigned depth 0, discovered pages depth 1
- Add comprehensive refresh architecture documentation covering conditional requests, status handling, and change detection
- Fix test expectations to account for multiple documents processed during refresh
- Preserve page depth during refresh operations to maintain crawl hierarchy

This ensures depth values accurately reflect page discovery order and provides better context for search relevance while maintaining efficiency through HTTP conditional requests.
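
The migration's backfill rule, restated as a minimal sketch (the actual migration is SQL in `010-add-depth-to-pages.sql`; this only restates the logic):

```ts
// Backfill rule from the description above: pages matching the version's
// source_url become roots (depth 0); all other pages are treated as
// discovered one hop away (depth 1).
function backfillDepth(pageUrl: string, sourceUrl: string): number {
  return pageUrl === sourceUrl ? 0 : 1;
}
```
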
@arabold force-pushed the feature/refresh-version-tool branch from 203a561 to b82dc27 on November 10, 2025 07:44
@arabold requested a review from Copilot on November 10, 2025 07:45

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 109 out of 110 changed files in this pull request and generated 3 comments.



- Added tests for refresh mode with initialQueue to ensure items are prioritized and metadata (pageId, etag) is preserved.
- Improved BaseScraperStrategy to utilize final URLs from results for progress callbacks and link resolution.
- Updated WebScraperStrategy to return final URLs after redirects for accurate indexing (see the sketch after this list).
- Enhanced DocumentStore tests to validate refresh operations, ensuring proper handling of metadata and document existence.
- Implemented resiliency tests in the refresh pipeline to handle network timeouts and redirect chains effectively.
- Added file-based refresh scenarios to detect changes in file structure, ensuring accurate indexing of new, modified, and deleted files.
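
For the redirect handling above, a minimal sketch of capturing the final URL (this relies on the Fetch API's `Response.url`, which reflects the post-redirect URL; `fetchFinalUrl` is an illustrative name, not the PR's API):

```ts
// Sketch: index against the final URL after redirects rather than the
// requested one, so link resolution and progress callbacks stay accurate.
async function fetchFinalUrl(
  requestedUrl: string,
): Promise<{ body: string; finalUrl: string }> {
  const response = await fetch(requestedUrl, { redirect: "follow" });
  // response.url is the URL after any redirect chain has been followed.
  return { body: await response.text(), finalUrl: response.url };
}
```
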
@arabold requested a review from Copilot on November 11, 2025 13:23

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 106 out of 110 changed files in this pull request and generated 6 comments.



Comment on lines +29 to +30
```ts
// TODO: Check if `types` is properly used
types?: string[]; // Types of content in this chunk (e.g., "text", "code", "table")
```

Copilot AI Nov 11, 2025


TODO comment indicates uncertainty about the types field usage. Consider verifying if this field is being used correctly throughout the codebase or removing it if not needed.

Suggested change:

```diff
-// TODO: Check if `types` is properly used
-types?: string[]; // Types of content in this chunk (e.g., "text", "code", "table")
```

Comment on lines +84 to +88
```diff
 async process(
   item: QueueItem,
   options: ScraperOptions,
   _progressCallback?: ProgressCallback<ScraperProgress>,
   signal?: AbortSignal,
-): Promise<{ document?: Document; links?: string[] }> {
+): Promise<ProcessItemResult> {
```

Copilot AI Nov 11, 2025


The process method lacks JSDoc documentation. Since this is a public method in a processor class, it should include parameter descriptions and return value documentation for better API clarity.
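
A hedged example of the kind of JSDoc the comment asks for (the descriptions are inferred from the signature above, not taken from the PR):

```ts
/**
 * Processes a single queue item: fetches its URL (conditionally, when a
 * stored ETag is present) and produces content and/or newly discovered links.
 *
 * @param item - The queue entry to process, including URL, depth, and optional etag.
 * @param options - Scraper options such as maxDepth and URL filters.
 * @param _progressCallback - Optional callback invoked with progress updates.
 * @param signal - Optional AbortSignal used to cancel in-flight work.
 * @returns The processing outcome for this item (e.g. skipped, deleted, or processed).
 */
```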

refactor(store): improve error handling in findParentChunk method
refactor(assembly): streamline parent chunk lookup process
fix(tools): correct Etag to ETag capitalization in RefreshVersionTool documentation
refactor(tests): remove unused DocumentStore import in refresh pipeline tests