Add ETag and lastModified support with refresh job functionality #250
base: main
Conversation
Force-pushed from 19c64d0 to 203a561
Pull Request Overview
This PR introduces ETag and lastModified metadata support for improved caching in fetchers and strategies, along with a refresh job feature to enhance document management. The changes enable efficient re-indexing of existing documentation by leveraging HTTP conditional requests to skip unchanged content.
Key Changes:
- Added ETag and Last-Modified header support throughout the scraping pipeline for cache validation (see the conditional-request sketch after this list)
- Implemented refresh job functionality to re-scrape existing library versions with conditional requests
- Refactored type system to separate page-level metadata (etag, lastModified) from chunk-level metadata
- Introduced `RefreshVersionTool` for triggering refresh operations on existing library versions
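For illustration, a conditional request using stored ETag and Last-Modified values might look like the sketch below. It assumes a Node 18+ global `fetch`; the `CachedPage` shape and `fetchWithCacheValidation` helper are hypothetical names for illustration, not the PR's actual API.

```typescript
// Minimal sketch of a conditional request using stored ETag / Last-Modified values.
// CachedPage and fetchWithCacheValidation are hypothetical names, not the PR's API.
interface CachedPage {
  url: string;
  etag?: string;
  lastModified?: string;
  content?: string;
}

async function fetchWithCacheValidation(page: CachedPage): Promise<CachedPage | null> {
  const headers: Record<string, string> = {};
  if (page.etag) headers["If-None-Match"] = page.etag;
  if (page.lastModified) headers["If-Modified-Since"] = page.lastModified;

  const res = await fetch(page.url, { headers });

  if (res.status === 304) return page; // unchanged: keep the stored content
  if (res.status === 404) return null; // gone: caller should delete the page's documents
  if (!res.ok) throw new Error(`Unexpected status ${res.status} for ${page.url}`);

  // Changed (200 OK): re-process and store the new validators for the next refresh.
  return {
    url: res.url, // final URL after redirects
    etag: res.headers.get("etag") ?? undefined,
    lastModified: res.headers.get("last-modified") ?? undefined,
    content: await res.text(),
  };
}
```

A 304 response lets the refresh job skip re-chunking and re-embedding entirely, which is where the processing and embedding cost savings come from.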
Reviewed Changes
Copilot reviewed 94 out of 95 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tsconfig.test.json | New TypeScript configuration for test files |
| src/types/index.ts | Removed document type definitions (moved to store/types.ts) |
| src/tools/ScrapeTool.ts | Updated to use renamed pipeline method enqueueScrapeJob |
| src/tools/ScrapeTool.test.ts | Updated tests for renamed enqueueScrapeJob method |
| src/tools/RefreshVersionTool.ts | New tool for refreshing existing library versions with ETag support |
| src/tools/ListJobsTool.test.ts | Updated type references and test data structure |
| src/tools/FetchUrlTool.ts | Updated pipeline interface usage |
| src/store/types.ts | Added depth field to DbPage, restructured chunk metadata types, added DbLibraryVersion |
| src/store/assembly/types.ts | Replaced Document type with DbPageChunk |
| src/store/assembly/strategies/*.ts | Updated to use DbPageChunk instead of Document |
| src/store/DocumentStore.ts | Added etag/lastModified storage, refactored addDocuments to accept ScrapeResult |
| src/store/DocumentStore.test.ts | Updated tests for new addDocuments signature with ScrapeResult |
| src/store/DocumentRetrieverService.ts | Updated to work with DbPageChunk instead of Document |
| src/store/DocumentManagementService.ts | Refactored addDocument to addScrapeResult, added page management methods |
| src/splitter/types.ts | Renamed ContentChunk to Chunk |
| src/scraper/types.ts | Added QueueItem with etag support, restructured ScrapeResult, renamed ScraperProgress (see the type sketch after this table) |
| src/scraper/strategies/*.ts | Updated to handle ETag-based conditional requests and refresh operations |
| src/scraper/pipelines/*.ts | Updated pipeline interfaces to accept mimeType parameter |
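To make the restructured scraper types easier to picture, here is a rough sketch of plausible shapes for a queue item and a scrape result. Only `url`, `depth`, `etag`, `lastModified`, `pageId`, and the `ContentChunk` to `Chunk` rename are taken from the PR description and review notes; every other field name is an assumption and may differ from the actual definitions in `src/scraper/types.ts`.

```typescript
// Illustrative sketch only; the real definitions live in src/scraper/types.ts.
interface QueueItem {
  url: string;
  depth: number;          // respects the original maxDepth during refresh
  etag?: string;          // stored ETag for conditional requests
  lastModified?: string;  // stored Last-Modified value
  pageId?: number;        // set for refresh items that map to an existing page
}

interface ScrapeResult {
  url: string;            // final URL after redirects
  etag?: string;
  lastModified?: string;
  chunks: Chunk[];        // split content for storage and embedding
  links: string[];        // discovered links to enqueue
}

// Chunk is the renamed ContentChunk from src/splitter/types.ts; shown here as a placeholder.
interface Chunk {
  content: string;
  types?: string[];       // e.g., "text", "code", "table"
}
```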
This commit introduces a version refresh feature that enables efficient re-indexing of previously scraped library versions. By leveraging HTTP ETags, the new mechanism avoids re-processing unchanged pages, significantly reducing processing time, bandwidth usage, and embedding costs.

Previously, re-indexing a library version required deleting all existing data and re-scraping every page from scratch, even if most of the content was unchanged. This implementation introduces a refresh mechanism that re-visits all previously scraped pages with their stored ETags, allowing the scraper to:

- **Skip** unchanged pages (HTTP 304 Not Modified).
- **Re-process** only pages that have changed (HTTP 200 OK).
- **Delete** documents for pages that are no longer available (HTTP 404 Not Found).

Implementation notes:

- **Reuses Existing Scraper Infrastructure**: The refresh operation is a standard scrape job with a pre-populated `initialQueue`. This leverages existing logic for progress tracking, error handling, and state management.
- **Database Schema Update**: A `depth` column has been added to the `pages` table to ensure that refresh operations respect the original `maxDepth` constraints. A database migration (`010-add-depth-to-pages.sql`) is included to apply this change.
- **Conditional Fetching**: The scraper's `processItem` logic has been updated to handle conditional requests. It now processes 304, 404, and 200 HTTP responses to skip, delete, or update documents accordingly.
- **Pipeline Manager Integration**: A new `enqueueRefreshJob` method was added to the `PipelineManager` to orchestrate the refresh process by fetching pages from the database and populating the `initialQueue` (see the sketch below).
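The queue-seeding part of `enqueueRefreshJob` could look roughly like this: load the stored pages for the version and turn each into a queue item carrying its URL, ETag, and depth. The `StoredPage` shape and `buildRefreshQueue` helper are hypothetical names for illustration; only the `initialQueue` idea comes from the commit message.

```typescript
// Sketch only; StoredPage, RefreshQueueItem, and buildRefreshQueue are hypothetical names.
interface StoredPage {
  id: number;
  url: string;
  depth: number;
  etag?: string;
  lastModified?: string;
}

interface RefreshQueueItem {
  url: string;
  depth: number;          // preserved so the original maxDepth still applies
  etag?: string;          // enables If-None-Match on re-fetch
  lastModified?: string;  // enables If-Modified-Since on re-fetch
  pageId: number;         // ties a 304/404/200 result back to the stored page
}

// Turn previously scraped pages into the initialQueue of a standard scrape job.
function buildRefreshQueue(pages: StoredPage[]): RefreshQueueItem[] {
  return pages.map((page) => ({
    url: page.url,
    depth: page.depth,
    etag: page.etag,
    lastModified: page.lastModified,
    pageId: page.id,
  }));
}
```

Because the refresh is otherwise an ordinary scrape job, progress tracking, error handling, and cancellation come along for free.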
- Restructure AGENTS.md into clearer sections with improved hierarchy
- Add detailed documentation writing principles and structure guidelines
- Expand TypeScript development standards with testing and error handling
- Include comprehensive commit message format and conventions
- Add security, MCP protocol, and pull request guidelines
- Improve readability with better formatting and organization

The refactoring transforms the instruction file from a flat list into a well-organized reference document with specific sections for documentation, development practices, and contribution standards.
- Update migration to backfill depth based on source_url matching instead of default 0 (see the sketch after this list)
- Root pages (matching source_url) assigned depth 0, discovered pages depth 1
- Add comprehensive refresh architecture documentation covering conditional requests, status handling, and change detection
- Fix test expectations to account for multiple documents processed during refresh
- Preserve page depth during refresh operations to maintain crawl hierarchy

This ensures depth values accurately reflect page discovery order and provides better context for search relevance while maintaining efficiency through HTTP conditional requests.
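A minimal sketch of the backfill rule and the depth check, assuming only what the commit message states (root pages matching source_url get depth 0, other previously discovered pages get depth 1, and stored depth is compared against the original maxDepth). The actual backfill is performed by the SQL migration `010-add-depth-to-pages.sql`, and the helper names below are illustrative.

```typescript
// Illustrative only; the real backfill happens in migration 010-add-depth-to-pages.sql.

// Backfill rule: pages whose URL matches the version's source_url are roots (depth 0),
// all other previously discovered pages get depth 1.
function backfillDepth(pageUrl: string, sourceUrl: string): number {
  return pageUrl === sourceUrl ? 0 : 1;
}

// During refresh, the stored depth is preserved so the original maxDepth is still honored.
function withinMaxDepth(storedDepth: number, maxDepth: number): boolean {
  return storedDepth <= maxDepth;
}

// Example: a root page and a discovered page.
console.log(backfillDepth("https://example.com/docs/", "https://example.com/docs/"));      // 0
console.log(backfillDepth("https://example.com/docs/intro", "https://example.com/docs/")); // 1
```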
Force-pushed from 203a561 to b82dc27
Pull Request Overview
Copilot reviewed 109 out of 110 changed files in this pull request and generated 3 comments.
- Added tests for refresh mode with initialQueue to ensure items are prioritized and metadata (pageId, etag) is preserved.
- Improved BaseScraperStrategy to utilize final URLs from results for progress callbacks and link resolution.
- Updated WebScraperStrategy to return final URLs after redirects for accurate indexing (see the sketch below).
- Enhanced DocumentStore tests to validate refresh operations, ensuring proper handling of metadata and document existence.
- Implemented resiliency tests in the refresh pipeline to handle network timeouts and redirect chains effectively.
- Added file-based refresh scenarios to detect changes in file structure, ensuring accurate indexing of new, modified, and deleted files.
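The final-URL handling mentioned above amounts to resolving discovered links against the URL the server actually served (after redirects) rather than the originally requested one. A minimal sketch, assuming a Node 18+ global `fetch`; the helper name is illustrative.

```typescript
// Resolve relative links against the final (post-redirect) URL, not the requested one.
// fetchAndResolveLinks is a hypothetical helper for illustration.
async function fetchAndResolveLinks(
  requestedUrl: string,
  relativeLinks: string[],
): Promise<{ finalUrl: string; links: string[] }> {
  const res = await fetch(requestedUrl);
  const finalUrl = res.url; // e.g. "https://example.com/docs/" after a redirect from "/docs"

  // "./getting-started" is resolved against finalUrl, not requestedUrl, so redirected
  // sites are indexed and deduplicated under their canonical URLs.
  const links = relativeLinks.map((link) => new URL(link, finalUrl).toString());

  return { finalUrl, links };
}
```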
Pull Request Overview
Copilot reviewed 106 out of 110 changed files in this pull request and generated 6 comments.
// TODO: Check if `types` is properly used
types?: string[]; // Types of content in this chunk (e.g., "text", "code", "table")
Copilot AI (Nov 11, 2025)
TODO comment indicates uncertainty about the types field usage. Consider verifying if this field is being used correctly throughout the codebase or removing it if not needed.
async process(
  item: QueueItem,
  options: ScraperOptions,
  _progressCallback?: ProgressCallback<ScraperProgress>,
  signal?: AbortSignal,
-): Promise<{ document?: Document; links?: string[] }> {
+): Promise<ProcessItemResult> {
Copilot AI (Nov 11, 2025)
The process method lacks JSDoc documentation. Since this is a public method in a processor class, it should include parameter descriptions and return value documentation for better API clarity.
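A possible doc comment for the signature shown above; the wording is illustrative and not taken from the PR.

```typescript
/**
 * Processes a single queue item: fetches the URL (conditionally, when a stored
 * ETag or Last-Modified value is present), converts the content into chunks,
 * and collects links to follow.
 *
 * @param item - The queue item to process, including URL, depth, and optional ETag metadata.
 * @param options - Scraper options such as scope and depth limits.
 * @param _progressCallback - Optional callback for reporting scraping progress.
 * @param signal - Optional AbortSignal used to cancel the operation.
 * @returns The processing result, including any extracted content and discovered links.
 */
```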
- refactor(store): improve error handling in findParentChunk method
- refactor(assembly): streamline parent chunk lookup process
- fix(tools): correct Etag to ETag capitalization in RefreshVersionTool documentation
- refactor(tests): remove unused DocumentStore import in refresh pipeline tests
Introduce ETag and lastModified metadata for improved caching in fetchers and strategies, along with a refresh job feature to enhance document management.