🚰 fix: Plug Scraped Content Leak & Cap Web Search Payloads #234

Merged
danny-avila merged 2 commits into main from claude/exciting-clarke-f9cb59 on Jun 10, 2026
@danny-avila

Owner

Summary

Continues and supersedes #63 — full credit to @dan-and for the deep investigation that identified these issues. The original branch went stale, so I reimplemented the fixes on current main, with unit tests and fresh real-API verification.

Web search was leaking full raw scraped page content into the search artifact and letting merged news results balloon far past the requested source count, overflowing smaller context windows (see the detailed analysis in #63).

  • Stripped content/references from sources without highlights in expandHighlights — the early-return path previously passed the entire scraped page (tens to hundreds of KB per source) downstream whenever the reranker returned no highlights, a scrape produced no usable text, or rerank fell back; the normal path already deleted these fields.
  • Capped scraped content at 50,000 chars per source, applied at storage time after cleaning and again before chunking, bounding reranker request sizes against degenerate pages (minified content, dense tables, very long articles). Configurable via maxContentLength on ProcessSourcesConfig/createSearchTool, or the SEARCH_MAX_CONTENT_LENGTH env var.
  • Deduplicated merged topStories by link in executeParallelSearches: the main search's own news results and the parallel news sub-search frequently return the same stories, which were then scraped, reranked, and formatted repeatedly (the "4-5x the same news" reported in #63).
  • Capped topStories to the requested source count (numElements) in processSources after enrichment, mirroring how organic results are limited via fetchContents' target; previously 24-37 news entries could flow into the formatted LLM output. SearXNG was worst affected, but Serper's news sub-search is also uncapped at the source.
  • Added source-processing.test.ts with 10 tests covering content stripping (no highlights, empty highlights, normal expansion path), both cap layers, the env var override, the 50k default, topStories capping with news on and off, and merge dedupe.
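
For illustration, the per-source content cap could look roughly like the sketch below. Only the 50,000 default, the maxContentLength option, and the SEARCH_MAX_CONTENT_LENGTH variable come from this PR; the helper names, the `Env` type, and the precedence order (config first, then env var, then default) are assumptions, not the merged code.

```typescript
// Hypothetical helpers: names and precedence order are assumptions for this sketch.
const DEFAULT_MAX_CONTENT_LENGTH = 50000;

type Env = Record<string, string | undefined>;

/** Resolve the cap: explicit config value, then the env var, then the default. */
function resolveMaxContentLength(configValue: number | undefined, env: Env): number {
  if (configValue !== undefined && Number.isFinite(configValue) && configValue > 0) {
    return Math.floor(configValue);
  }
  // parseInt('') yields NaN, which falls through to the default below.
  const fromEnv = Number.parseInt(env['SEARCH_MAX_CONTENT_LENGTH'] ?? '', 10);
  if (Number.isFinite(fromEnv) && fromEnv > 0) {
    return fromEnv;
  }
  return DEFAULT_MAX_CONTENT_LENGTH;
}

/** Truncate cleaned page text to at most `maxLength` characters (UTF-16 code units). */
function capContent(content: string, maxLength: number): string {
  return content.length > maxLength ? content.slice(0, maxLength) : content;
}
```

In a Node process the second argument would typically be `process.env`. Applying `capContent` both when content is stored and again before chunking gives the two cap layers described above.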

Real-API before/after on this branch vs. unmodified main (Serper + Firecrawl, identical query with news: true):

| Metric | Before (main) | After |
| --- | --- | --- |
| Raw content in artifact (sources without highlights) | 238,576 chars across 11 sources | 0 chars |
| Search artifact JSON | 250,236 chars | 6,481 chars |
| topStories entries | 10 | 5 |
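
The "0 chars" row reflects the stripping fix. A minimal sketch of that idea follows, assuming a simplified source shape; the `Source` and `Highlight` interfaces and the function name are hypothetical and do not reproduce the actual expandHighlights code.

```typescript
// Hypothetical shapes: the real source type in the repo may differ.
interface Highlight {
  text: string;
  score: number;
}

interface Source {
  link: string;
  content?: string;
  references?: unknown;
  highlights?: Highlight[];
}

/** Drop raw scraped fields from every source that has no usable highlights. */
function stripSourcesWithoutHighlights(sources: Source[]): Source[] {
  return sources.map((source) => {
    if (source.highlights && source.highlights.length > 0) {
      // Sources with highlights go through the normal expansion path, not shown here.
      return source;
    }
    // Copy first so the caller's object is not mutated.
    const stripped: Source = { ...source };
    delete stripped.content;
    delete stripped.references;
    return stripped;
  });
}
```

Both a missing `highlights` field and an empty array count as "no highlights" here, matching the two cases named in the test list.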

Change Type

  • Bug fix (non-breaking change which fixes an issue)

Testing

  • Ran the new unit suite plus all existing search tests: npx jest src/tools/search — 69 tests pass.
  • Ran the CI-equivalent full suite; the only failures are the 3 pre-existing API-key-dependent spec suites that fail identically on unmodified main.
  • Verified end-to-end against real APIs with a measurement harness that invokes createSearchTool and inspects the LLM output string and the web_search artifact: ran it on unmodified main and on this branch with the same query, once with a working reranker pipeline and once with rerankerType: 'none' to force the no-highlights path. The numbers in the table above are from those runs. The reranker-error path was also exercised live (Jina returned 403), confirming the fallback ranking produces bounded output.
  • npx tsc --noEmit, npx eslint src/tools/search/, and npm run sort-imports:check are clean.
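
The measurement harness itself is not part of this PR. As a rough sketch, the three table metrics could be computed from an artifact like this; the `SearchArtifact` shape and all names below are assumptions, and the real web_search artifact may carry more fields.

```typescript
// Hypothetical artifact shape used only for this measurement sketch.
interface MeasuredSource {
  link: string;
  content?: string;
  highlights?: unknown[];
}

interface SearchArtifact {
  organic?: MeasuredSource[];
  topStories?: MeasuredSource[];
}

interface ArtifactMetrics {
  rawContentChars: number;
  artifactJsonChars: number;
  topStoriesCount: number;
}

/** Compute size metrics for one search artifact. */
function measureArtifact(artifact: SearchArtifact): ArtifactMetrics {
  const all = [...(artifact.organic ?? []), ...(artifact.topStories ?? [])];
  // Count raw content only on sources that have no highlights.
  const rawContentChars = all
    .filter((s) => !s.highlights || s.highlights.length === 0)
    .reduce((sum, s) => sum + (s.content?.length ?? 0), 0);
  return {
    rawContentChars,
    artifactJsonChars: JSON.stringify(artifact).length,
    topStoriesCount: artifact.topStories?.length ?? 0,
  };
}
```

Running the same function on artifacts captured from main and from the branch gives a like-for-like comparison.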

Test Configuration:

  • Search: Serper; Scraper: Firecrawl (hosted v2 API); Rerankers: Jina (403 fallback path) and none
  • Node v24.16.0, macOS

Checklist

  • My code adheres to this project's style guidelines
  • I have performed a self-review of my own code
  • I have commented in any complex areas of my code
  • My changes do not introduce new warnings
  • I have written tests demonstrating that my changes are effective or that my feature works
  • Local unit tests pass with my changes

Continues the work from #63 by @dan-and on current main:

- Strip content/references from sources without highlights in
  expandHighlights — the early-return path leaked full scraped pages
  into the search artifact
- Cap scraped content per source at 50,000 chars (maxContentLength
  config / SEARCH_MAX_CONTENT_LENGTH env var) at storage time and
  before chunking, bounding reranker request sizes
- Dedupe merged topStories by link in executeParallelSearches and cap
  them to the requested source count in processSources
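
A minimal sketch of deduplicating merged news by link, assuming exact string equality on `link` with no URL normalization; `NewsResult` and `mergeTopStories` are hypothetical names, not the code in executeParallelSearches.

```typescript
// Hypothetical result shape for this sketch.
interface NewsResult {
  link: string;
  title?: string;
}

/** Merge news lists in order, keeping the first occurrence of each link. */
function mergeTopStories(...lists: NewsResult[][]): NewsResult[] {
  const seen = new Set<string>();
  const merged: NewsResult[] = [];
  for (const list of lists) {
    for (const story of list) {
      if (seen.has(story.link)) {
        continue; // already present from an earlier list or earlier in this one
      }
      seen.add(story.link);
      merged.push(story);
    }
  }
  return merged;
}
```

Passing the main search's news first means its entries win when the news sub-search returns the same story.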
@dan-and

dan-and commented Jun 9, 2026

Contributor

I will test-drive it as soon as it is implemented.

I have a web search / crawler / reranker combo built and running completely locally, which is just waiting for this to land in LibreChat.

@danny-avila

Owner Author

@codex review

Codex review: the cap was placed after the no-organic and empty-links
early returns, leaving those paths unbounded. Capping up front also
avoids scraping news entries the cap would discard.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb4cb3239f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/tools/search/search.ts Outdated
updateSourcesWithContent(topStories, sourceMap);
}

if (topStories.length > numElements) {

P2: Move the topStories cap before early returns

When a general search returns no organic links but does return more than numElements top stories while news is false, this new cap is never reached because processSources returns earlier when organicLinks.length === 0 && ... !news. formatResultsForLLM still formats every topStories entry, so this leaves the oversized news payload uncapped in exactly the no-organic fallback case; apply the cap before that early return or cap in the returned branch as well.

Useful? React with 👍 / 👎.

Owner Author

Addressed in 1b496c0 — the cap now runs at the top of processSources, before the no-organic and empty-links early returns (and before scraping, so entries the cap would discard are no longer fetched). Added regression tests for both early-return paths.
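
The ordering described in this reply can be sketched as follows. This shows control flow only: the types, the function name, and the simplified early-return condition are assumptions (the real condition has more terms), and scraping and reranking are omitted.

```typescript
// Hypothetical shapes for this control-flow sketch.
interface Story {
  link: string;
}

interface SearchResult {
  organic: { link: string }[];
  topStories: Story[];
}

function processSourcesSketch(
  result: SearchResult,
  numElements: number,
  news: boolean,
): SearchResult {
  // Cap first, so every return path below sees a bounded list and
  // discarded entries are never scraped.
  const capped: SearchResult = {
    ...result,
    topStories: result.topStories.slice(0, numElements),
  };

  // Simplified early return: nothing organic to scrape and news not requested.
  if (capped.organic.length === 0 && !news) {
    return capped;
  }

  // ... scraping and enrichment of the capped lists would happen here ...
  return capped;
}
```

With the cap placed after the early return instead, the no-organic path would return the full topStories list, which is the case the review comment flagged.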

@danny-avila

Owner Author

@codex review

@chatgpt-codex-connector


Codex Review: Didn't find any major issues. You're on a roll.

@danny-avila danny-avila changed the title from "🚰 fix(search): Plug Scraped Content Leak & Cap Web Search Payloads" to "🚰 fix: Plug Scraped Content Leak & Cap Web Search Payloads" on Jun 10, 2026
@danny-avila danny-avila merged commit 7d322f4 into main Jun 10, 2026
10 checks passed
2 participants