Add ETag and lastModified support with refresh job functionality #250
base: main
Conversation
Force-pushed from 19c64d0 to 203a561
Pull Request Overview
This PR introduces ETag and lastModified metadata support for improved caching in fetchers and strategies, along with a refresh job feature to enhance document management. The changes enable efficient re-indexing of existing documentation by leveraging HTTP conditional requests to skip unchanged content.
Key Changes:
- Added ETag and Last-Modified header support throughout the scraping pipeline for cache validation (see the conditional-request sketch after this list)
- Implemented refresh job functionality to re-scrape existing library versions with conditional requests
- Refactored type system to separate page-level metadata (etag, lastModified) from chunk-level metadata
- Introduced `RefreshVersionTool` for triggering refresh operations on existing library versions
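For illustration, a conditional request using stored ETag and Last-Modified values might look like the sketch below. It assumes a Node 18+ global `fetch`; the `CachedPage` shape and `fetchWithCacheValidation` helper are hypothetical names for illustration, not the PR's actual API.

```typescript
// Minimal sketch of a conditional request using stored ETag / Last-Modified values.
// CachedPage and fetchWithCacheValidation are hypothetical names, not the PR's API.
interface CachedPage {
  url: string;
  etag?: string;
  lastModified?: string;
  content?: string;
}

async function fetchWithCacheValidation(page: CachedPage): Promise<CachedPage | null> {
  const headers: Record<string, string> = {};
  if (page.etag) headers["If-None-Match"] = page.etag;
  if (page.lastModified) headers["If-Modified-Since"] = page.lastModified;

  const res = await fetch(page.url, { headers });

  if (res.status === 304) return page; // unchanged: keep the stored content
  if (res.status === 404) return null; // gone: caller should delete the page's documents
  if (!res.ok) throw new Error(`Unexpected status ${res.status} for ${page.url}`);

  // Changed (200 OK): re-process and store the new validators for the next refresh.
  return {
    url: res.url, // final URL after redirects
    etag: res.headers.get("etag") ?? undefined,
    lastModified: res.headers.get("last-modified") ?? undefined,
    content: await res.text(),
  };
}
```

A 304 response lets the refresh job skip re-chunking and re-embedding entirely, which is where the processing and embedding cost savings come from.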
Reviewed Changes
Copilot reviewed 94 out of 95 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tsconfig.test.json | New TypeScript configuration for test files |
| src/types/index.ts | Removed document type definitions (moved to store/types.ts) |
| src/tools/ScrapeTool.ts | Updated to use renamed pipeline method enqueueScrapeJob |
| src/tools/ScrapeTool.test.ts | Updated tests for renamed enqueueScrapeJob method |
| src/tools/RefreshVersionTool.ts | New tool for refreshing existing library versions with ETag support |
| src/tools/ListJobsTool.test.ts | Updated type references and test data structure |
| src/tools/FetchUrlTool.ts | Updated pipeline interface usage |
| src/store/types.ts | Added depth field to DbPage, restructured chunk metadata types, added DbLibraryVersion |
| src/store/assembly/types.ts | Replaced Document type with DbPageChunk |
| src/store/assembly/strategies/*.ts | Updated to use DbPageChunk instead of Document |
| src/store/DocumentStore.ts | Added etag/lastModified storage, refactored addDocuments to accept ScrapeResult |
| src/store/DocumentStore.test.ts | Updated tests for new addDocuments signature with ScrapeResult |
| src/store/DocumentRetrieverService.ts | Updated to work with DbPageChunk instead of Document |
| src/store/DocumentManagementService.ts | Refactored addDocument to addScrapeResult, added page management methods |
| src/splitter/types.ts | Renamed ContentChunk to Chunk |
| src/scraper/types.ts | Added QueueItem with etag support, restructured ScrapeResult, renamed ScraperProgress (see the type sketch after this table) |
| src/scraper/strategies/*.ts | Updated to handle ETag-based conditional requests and refresh operations |
| src/scraper/pipelines/*.ts | Updated pipeline interfaces to accept mimeType parameter |
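To make the restructured scraper types easier to picture, here is a rough sketch of plausible shapes for a queue item and a scrape result. Only `url`, `depth`, `etag`, `lastModified`, `pageId`, and the `ContentChunk` to `Chunk` rename are taken from the PR description and review notes; every other field name is an assumption and may differ from the actual definitions in `src/scraper/types.ts`.

```typescript
// Illustrative sketch only; the real definitions live in src/scraper/types.ts.
interface QueueItem {
  url: string;
  depth: number;          // respects the original maxDepth during refresh
  etag?: string;          // stored ETag for conditional requests
  lastModified?: string;  // stored Last-Modified value
  pageId?: number;        // set for refresh items that map to an existing page
}

interface ScrapeResult {
  url: string;            // final URL after redirects
  etag?: string;
  lastModified?: string;
  chunks: Chunk[];        // split content for storage and embedding
  links: string[];        // discovered links to enqueue
}

// Chunk is the renamed ContentChunk from src/splitter/types.ts; shown here as a placeholder.
interface Chunk {
  content: string;
  types?: string[];       // e.g., "text", "code", "table"
}
```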
This commit introduces a version refresh feature that enables efficient re-indexing of previously scraped library versions. By leveraging HTTP ETags, the new mechanism avoids re-processing unchanged pages, significantly reducing processing time, bandwidth usage, and embedding costs.

Previously, re-indexing a library version required deleting all existing data and re-scraping every page from scratch, even if most of the content was unchanged. This implementation introduces a refresh mechanism that re-visits all previously scraped pages with their stored ETags, allowing the scraper to:

- **Skip** unchanged pages (HTTP 304 Not Modified).
- **Re-process** only pages that have changed (HTTP 200 OK).
- **Delete** documents for pages that are no longer available (HTTP 404 Not Found).

Implementation notes:

- **Reuses Existing Scraper Infrastructure**: The refresh operation is a standard scrape job with a pre-populated `initialQueue`. This leverages existing logic for progress tracking, error handling, and state management.
- **Database Schema Update**: A `depth` column has been added to the `pages` table to ensure that refresh operations respect the original `maxDepth` constraints. A database migration (`010-add-depth-to-pages.sql`) is included to apply this change.
- **Conditional Fetching**: The scraper's `processItem` logic has been updated to handle conditional requests. It now processes 304, 404, and 200 HTTP responses to skip, delete, or update documents accordingly.
- **Pipeline Manager Integration**: A new `enqueueRefreshJob` method was added to the `PipelineManager` to orchestrate the refresh process by fetching pages from the database and populating the `initialQueue` (see the sketch below).
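The queue-seeding part of `enqueueRefreshJob` could look roughly like this: load the stored pages for the version and turn each into a queue item carrying its URL, ETag, and depth. The `StoredPage` shape and `buildRefreshQueue` helper are hypothetical names for illustration; only the `initialQueue` idea comes from the commit message.

```typescript
// Sketch only; StoredPage, RefreshQueueItem, and buildRefreshQueue are hypothetical names.
interface StoredPage {
  id: number;
  url: string;
  depth: number;
  etag?: string;
  lastModified?: string;
}

interface RefreshQueueItem {
  url: string;
  depth: number;          // preserved so the original maxDepth still applies
  etag?: string;          // enables If-None-Match on re-fetch
  lastModified?: string;  // enables If-Modified-Since on re-fetch
  pageId: number;         // ties a 304/404/200 result back to the stored page
}

// Turn previously scraped pages into the initialQueue of a standard scrape job.
function buildRefreshQueue(pages: StoredPage[]): RefreshQueueItem[] {
  return pages.map((page) => ({
    url: page.url,
    depth: page.depth,
    etag: page.etag,
    lastModified: page.lastModified,
    pageId: page.id,
  }));
}
```

Because the refresh is otherwise an ordinary scrape job, progress tracking, error handling, and cancellation come along for free.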
- Restructure AGENTS.md into clearer sections with improved hierarchy
- Add detailed documentation writing principles and structure guidelines
- Expand TypeScript development standards with testing and error handling
- Include comprehensive commit message format and conventions
- Add security, MCP protocol, and pull request guidelines
- Improve readability with better formatting and organization

The refactoring transforms the instruction file from a flat list into a well-organized reference document with specific sections for documentation, development practices, and contribution standards.
- Update migration to backfill depth based on source_url matching instead of default 0 (see the sketch after this list)
- Root pages (matching source_url) assigned depth 0, discovered pages depth 1
- Add comprehensive refresh architecture documentation covering conditional requests, status handling, and change detection
- Fix test expectations to account for multiple documents processed during refresh
- Preserve page depth during refresh operations to maintain crawl hierarchy

This ensures depth values accurately reflect page discovery order and provides better context for search relevance while maintaining efficiency through HTTP conditional requests.
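A minimal sketch of the backfill rule and the depth check, assuming only what the commit message states (root pages matching source_url get depth 0, other previously discovered pages get depth 1, and stored depth is compared against the original maxDepth). The actual backfill is performed by the SQL migration `010-add-depth-to-pages.sql`, and the helper names below are illustrative.

```typescript
// Illustrative only; the real backfill happens in migration 010-add-depth-to-pages.sql.

// Backfill rule: pages whose URL matches the version's source_url are roots (depth 0),
// all other previously discovered pages get depth 1.
function backfillDepth(pageUrl: string, sourceUrl: string): number {
  return pageUrl === sourceUrl ? 0 : 1;
}

// During refresh, the stored depth is preserved so the original maxDepth is still honored.
function withinMaxDepth(storedDepth: number, maxDepth: number): boolean {
  return storedDepth <= maxDepth;
}

// Example: a root page and a discovered page.
console.log(backfillDepth("https://example.com/docs/", "https://example.com/docs/"));      // 0
console.log(backfillDepth("https://example.com/docs/intro", "https://example.com/docs/")); // 1
```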
Force-pushed from 203a561 to b82dc27
Pull Request Overview
Copilot reviewed 109 out of 110 changed files in this pull request and generated 3 comments.
- Added tests for refresh mode with initialQueue to ensure items are prioritized and metadata (pageId, etag) is preserved.
- Improved BaseScraperStrategy to utilize final URLs from results for progress callbacks and link resolution.
- Updated WebScraperStrategy to return final URLs after redirects for accurate indexing (see the sketch below).
- Enhanced DocumentStore tests to validate refresh operations, ensuring proper handling of metadata and document existence.
- Implemented resiliency tests in the refresh pipeline to handle network timeouts and redirect chains effectively.
- Added file-based refresh scenarios to detect changes in file structure, ensuring accurate indexing of new, modified, and deleted files.
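The final-URL handling mentioned above amounts to resolving discovered links against the URL the server actually served (after redirects) rather than the originally requested one. A minimal sketch, assuming a Node 18+ global `fetch`; the helper name is illustrative.

```typescript
// Resolve relative links against the final (post-redirect) URL, not the requested one.
// fetchAndResolveLinks is a hypothetical helper for illustration.
async function fetchAndResolveLinks(
  requestedUrl: string,
  relativeLinks: string[],
): Promise<{ finalUrl: string; links: string[] }> {
  const res = await fetch(requestedUrl);
  const finalUrl = res.url; // e.g. "https://example.com/docs/" after a redirect from "/docs"

  // "./getting-started" is resolved against finalUrl, not requestedUrl, so redirected
  // sites are indexed and deduplicated under their canonical URLs.
  const links = relativeLinks.map((link) => new URL(link, finalUrl).toString());

  return { finalUrl, links };
}
```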
Pull Request Overview
Copilot reviewed 106 out of 110 changed files in this pull request and generated 6 comments.
// TODO: Check if `types` is properly used
types?: string[]; // Types of content in this chunk (e.g., "text", "code", "table")
Copilot AI (Nov 11, 2025)
TODO comment indicates uncertainty about the types field usage. Consider verifying if this field is being used correctly throughout the codebase or removing it if not needed.
async process(
  item: QueueItem,
  options: ScraperOptions,
  _progressCallback?: ProgressCallback<ScraperProgress>,
  signal?: AbortSignal,
-): Promise<{ document?: Document; links?: string[] }> {
+): Promise<ProcessItemResult> {
Copilot AI (Nov 11, 2025)
The process method lacks JSDoc documentation. Since this is a public method in a processor class, it should include parameter descriptions and return value documentation for better API clarity.
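A possible doc comment for the signature shown above; the wording is illustrative and not taken from the PR.

```typescript
/**
 * Processes a single queue item: fetches the URL (conditionally, when a stored
 * ETag or Last-Modified value is present), converts the content into chunks,
 * and collects links to follow.
 *
 * @param item - The queue item to process, including URL, depth, and optional ETag metadata.
 * @param options - Scraper options such as scope and depth limits.
 * @param _progressCallback - Optional callback for reporting scraping progress.
 * @param signal - Optional AbortSignal used to cancel the operation.
 * @returns The processing result, including any extracted content and discovered links.
 */
```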
- refactor(store): improve error handling in findParentChunk method
- refactor(assembly): streamline parent chunk lookup process
- fix(tools): correct Etag to ETag capitalization in RefreshVersionTool documentation
- refactor(tests): remove unused DocumentStore import in refresh pipeline tests
Introduce ETag and lastModified metadata for improved caching in fetchers and strategies, along with a refresh job feature to enhance document management.