🚰 fix: Plug Scraped Content Leak & Cap Web Search Payloads by danny-avila · Pull Request #234 · danny-avila/agents

danny-avila · 2026-06-09T22:17:15Z

Summary

Continues and supersedes #63 — full credit to @dan-and for the deep investigation that identified these issues. The original branch went stale, so I reimplemented the fixes on current main, with unit tests and fresh real-API verification.

Web search was leaking full raw scraped page content into the search artifact and letting merged news results balloon far past the requested source count, overflowing smaller context windows (see the detailed analysis in #63).

Stripped content/references from sources without highlights in expandHighlights — the early-return path previously passed the entire scraped page (tens to hundreds of KB per source) downstream whenever the reranker returned no highlights, a scrape produced no usable text, or rerank fell back; the normal path already deleted these fields.
Capped scraped content at 50,000 chars per source, applied at storage time after cleaning and again before chunking, bounding reranker request sizes against degenerate pages (minified content, dense tables, very long articles). Configurable via maxContentLength on ProcessSourcesConfig/createSearchTool, or the SEARCH_MAX_CONTENT_LENGTH env var.
Deduplicated merged topStories by link in executeParallelSearches: the main search's own news results and the parallel news sub-search frequently return the same stories, which were then scraped, reranked, and formatted repeatedly (the "4-5x the same news" reported in #63).
Capped topStories to the requested source count (numElements) in processSources after enrichment, mirroring how organic results are limited via fetchContents' target; previously 24-37 news entries could flow into the formatted LLM output. SearXNG was worst affected, but Serper's news sub-search is also uncapped at the source.
Added source-processing.test.ts with 10 tests covering content stripping (no highlights, empty highlights, normal expansion path), both cap layers, the env var override, the 50k default, topStories capping with news on and off, and merge dedupe.
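
As a rough illustration of the first two changes above, the sketch below strips content and references from a source that has no highlights and applies a per-source character cap. The `SearchSource` shape, the helper names, and the precedence between the config value and SEARCH_MAX_CONTENT_LENGTH are assumptions made for this example, not the code in this PR.

```typescript
// Hypothetical shapes for illustration; the real types in the repo may differ.
interface SourceHighlight {
  text: string;
  score: number;
}

interface SearchSource {
  link: string;
  content?: string;
  references?: unknown;
  highlights?: SourceHighlight[];
}

const DEFAULT_MAX_CONTENT_LENGTH = 50_000;

/**
 * Resolve the per-source cap.
 * Assumed precedence: explicit config value, then the env var, then the 50k default.
 */
function resolveMaxContentLength(
  configured: number | undefined,
  env: Record<string, string | undefined>
): number {
  if (configured !== undefined && configured > 0) {
    return configured;
  }
  const parsed = Number.parseInt(env['SEARCH_MAX_CONTENT_LENGTH'] ?? '', 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_CONTENT_LENGTH;
}

/** Truncate scraped text so a degenerate page cannot inflate downstream requests. */
function capContent(content: string, maxLength: number): string {
  return content.length > maxLength ? content.slice(0, maxLength) : content;
}

/**
 * Early-return path only: a source with no highlights is returned
 * without its raw content or references. Sources with highlights pass through.
 */
function stripUnhighlighted(source: SearchSource): SearchSource {
  if (source.highlights !== undefined && source.highlights.length > 0) {
    return source;
  }
  const stripped: SearchSource = { ...source };
  delete stripped.content;
  delete stripped.references;
  return stripped;
}
```

In a Node process the second argument to `resolveMaxContentLength` would typically be `process.env`; it is passed in explicitly here only to keep the sketch self-contained.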

Real-API before/after on this branch vs. unmodified main (Serper + Firecrawl, identical query with news: true):

| Metric | Before (`main`) | After |
| --- | --- | --- |
| Raw content in artifact (sources without highlights) | 238,576 chars across 11 sources | 0 chars |
| Search artifact JSON | 250,236 chars | 6,481 chars |
| `topStories` entries | 10 | 5 |

Change Type

Bug fix (non-breaking change which fixes an issue)

Testing

Ran the new unit suite plus all existing search tests: npx jest src/tools/search — 69 tests pass.
Ran the CI-equivalent full suite; the only failures are the 3 pre-existing API-key-dependent spec suites that fail identically on unmodified main.
Verified end-to-end against real APIs with a measurement harness that invokes createSearchTool and inspects the LLM output string and the web_search artifact: ran it on unmodified main and on this branch with the same query, once with a working reranker pipeline and once with rerankerType: 'none' to force the no-highlights path. The numbers in the table above are from those runs. The reranker-error path was also exercised live (Jina returned 403), confirming the fallback ranking produces bounded output.
npx tsc --noEmit, npx eslint src/tools/search/, and npm run sort-imports:check are clean.
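
The three metrics reported in the before/after table could be computed from an artifact along the following lines. This is only a sketch: the `ArtifactSketch` shape and the `measureArtifact` helper are hypothetical and do not reproduce the harness actually used for this PR.

```typescript
// Hypothetical artifact shape; the real web_search artifact may be structured differently.
interface MeasuredSource {
  content?: string;
  highlights?: unknown[];
}

interface ArtifactSketch {
  organic?: MeasuredSource[];
  topStories?: MeasuredSource[];
}

interface ArtifactMetrics {
  rawContentChars: number;
  artifactJsonChars: number;
  topStoriesCount: number;
}

function measureArtifact(artifact: ArtifactSketch): ArtifactMetrics {
  const sources = [...(artifact.organic ?? []), ...(artifact.topStories ?? [])];

  // Raw content only counts for sources that carry no highlights.
  const rawContentChars = sources
    .filter((s) => s.highlights === undefined || s.highlights.length === 0)
    .reduce((sum, s) => sum + (s.content?.length ?? 0), 0);

  return {
    rawContentChars,
    artifactJsonChars: JSON.stringify(artifact).length,
    topStoriesCount: artifact.topStories?.length ?? 0,
  };
}
```

Running the same helper against the artifact from `main` and from the branch, with an identical query, gives directly comparable numbers.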

Test Configuration:

Search: Serper; Scraper: Firecrawl (hosted v2 API); Rerankers: Jina (403 fallback path) and none
Node v24.16.0, macOS

Checklist

My code adheres to this project's style guidelines
I have performed a self-review of my own code
I have commented in any complex areas of my code
My changes do not introduce new warnings
I have written tests demonstrating that my changes are effective or that my feature works
Local unit tests pass with my changes

@dan-and

Continues the work from #63 by @dan-and on current main:

- Strip content/references from sources without highlights in expandHighlights; the early-return path leaked full scraped pages into the search artifact
- Cap scraped content per source at 50,000 chars (maxContentLength config / SEARCH_MAX_CONTENT_LENGTH env var) at storage time and before chunking, bounding reranker request sizes
- Dedupe merged topStories by link in executeParallelSearches and cap them to the requested source count in processSources
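
The dedupe-by-link step can be pictured roughly as follows. The helper names are hypothetical, and keeping the first occurrence of each link is an assumption of this sketch rather than something the PR description states.

```typescript
// Generic over any result that carries a link, so it works for news entries
// from both the main search and the parallel news sub-search.
function dedupeByLink<T extends { link: string }>(items: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const item of items) {
    if (seen.has(item.link)) {
      continue; // a story with this link was already kept
    }
    seen.add(item.link);
    unique.push(item);
  }
  return unique;
}

// Merge the two news lists, keeping the first occurrence of each link (assumed ordering).
function mergeTopStories<T extends { link: string }>(mainNews: T[], subSearchNews: T[]): T[] {
  return dedupeByLink([...mainNews, ...subSearchNews]);
}
```

Deduplicating at merge time means the later scrape, rerank, and format steps each see a given story once.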

dan-and · 2026-06-09T22:25:11Z

I will test-drive it as soon as it is implemented.

I have a web search / crawler / reranker combo build running completely locally, which is just waiting for this to be available in LibreChat.

danny-avila · 2026-06-09T22:57:01Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb4cb3239f

chatgpt-codex-connector · 2026-06-09T23:01:33Z

        updateSourcesWithContent(topStories, sourceMap);
      }

+      if (topStories.length > numElements) {


Move the topStories cap before early returns

When a general search returns no organic links but does return more than numElements top stories while news is false, this new cap is never reached because processSources returns earlier when organicLinks.length === 0 && ... !news. formatResultsForLLM still formats every topStories entry, so this leaves the oversized news payload uncapped in exactly the no-organic fallback case; apply the cap before that early return or cap in the returned branch as well.

Addressed in 1b496c0 — the cap now runs at the top of processSources, before the no-organic and empty-links early returns (and before scraping, so entries the cap would discard are no longer fetched). Added regression tests for both early-return paths.
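
The ordering issue from the review can be sketched as below. The types, the `planSources` name, and the early-return condition are simplified stand-ins; in particular, the condition quoted in the review has an elided middle term that this sketch does not try to reproduce.

```typescript
// Hypothetical, simplified shapes; not the repo's actual types.
interface LinkResult {
  link: string;
}

interface SearchResultSketch {
  organic: LinkResult[];
  topStories: LinkResult[];
}

interface ProcessPlan {
  topStories: LinkResult[];
  linksToScrape: string[];
}

function planSources(
  result: SearchResultSketch,
  numElements: number,
  news: boolean
): ProcessPlan {
  const limit = Math.max(0, numElements);

  // Cap up front so every return path below sees at most `limit` stories.
  const topStories = result.topStories.slice(0, limit);
  const organicLinks = result.organic.slice(0, limit).map((r) => r.link);

  // Simplified stand-in for the no-organic early return:
  // nothing is scraped, but the cap has already been applied.
  if (organicLinks.length === 0 && !news) {
    return { topStories, linksToScrape: [] };
  }

  // Only news entries that survived the cap are queued for scraping.
  const newsLinks = news ? topStories.map((s) => s.link) : [];
  return { topStories, linksToScrape: [...organicLinks, ...newsLinks] };
}
```

Had the `slice` been placed after the early return, the first branch would hand back the full, uncapped `topStories` list, which is the case the review flagged.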

danny-avila · 2026-06-09T23:02:16Z

@codex review

chatgpt-codex-connector · 2026-06-09T23:05:46Z

Codex Review: Didn't find any major issues. You're on a roll.

danny-avila mentioned this pull request Jun 9, 2026

fix: reranked websearch content don't overload the model anymore #63

Closed

🚰 fix(search): Cap topStories Before Early Returns in processSources

1b496c0

Codex review: the cap was placed after the no-organic and empty-links early returns, leaving those paths unbounded. Capping up front also avoids scraping news entries the cap would discard.

chatgpt-codex-connector Bot reviewed Jun 9, 2026

danny-avila changed the title ~~🚰 fix(search): Plug Scraped Content Leak & Cap Web Search Payloads~~ 🚰 fix: Plug Scraped Content Leak & Cap Web Search Payloads Jun 10, 2026

danny-avila merged commit 7d322f4 into main Jun 10, 2026
10 checks passed

danny-avila merged 2 commits into main from claude/exciting-clarke-f9cb59
