🚰 fix: Plug Scraped Content Leak & Cap Web Search Payloads#234
Conversation
Continues the work from #63 by @dan-and on current main: - Strip content/references from sources without highlights in expandHighlights — the early-return path leaked full scraped pages into the search artifact - Cap scraped content per source at 50,000 chars (maxContentLength config / SEARCH_MAX_CONTENT_LENGTH env var) at storage time and before chunking, bounding reranker request sizes - Dedupe merged topStories by link in executeParallelSearches and cap them to the requested source count in processSources
|
I will testdrive it as soon as it is implemented. I have a Websearch / crawler / reranker combo build, completely locally running, which is just waiting having it in librechat. |
|
@codex review |
Codex review: the cap was placed after the no-organic and empty-links early returns, leaving those paths unbounded. Capping up front also avoids scraping news entries the cap would discard.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb4cb3239f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| updateSourcesWithContent(topStories, sourceMap); | ||
| } | ||
|
|
||
| if (topStories.length > numElements) { |
There was a problem hiding this comment.
Move the topStories cap before early returns
When a general search returns no organic links but does return more than numElements top stories while news is false, this new cap is never reached because processSources returns earlier when organicLinks.length === 0 && ... !news. formatResultsForLLM still formats every topStories entry, so this leaves the oversized news payload uncapped in exactly the no-organic fallback case; apply the cap before that early return or cap in the returned branch as well.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in 1b496c0 — the cap now runs at the top of processSources, before the no-organic and empty-links early returns (and before scraping, so entries the cap would discard are no longer fetched). Added regression tests for both early-return paths.
|
@codex review |
|
Codex Review: Didn't find any major issues. You're on a roll. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Continues and supersedes #63 — full credit to @dan-and for the deep investigation that identified these issues. The original branch went stale, so I reimplemented the fixes on current
main, with unit tests and fresh real-API verification.Web search was leaking full raw scraped page content into the search artifact and letting merged news results balloon far past the requested source count, overflowing smaller context windows (see the detailed analysis in #63).
content/referencesfrom sources without highlights inexpandHighlights— the early-return path previously passed the entire scraped page (tens to hundreds of KB per source) downstream whenever the reranker returned no highlights, a scrape produced no usable text, or rerank fell back; the normal path already deleted these fields.maxContentLengthonProcessSourcesConfig/createSearchTool, or theSEARCH_MAX_CONTENT_LENGTHenv var.topStoriesby link inexecuteParallelSearches— the main search's own news results and the parallel news sub-search frequently return the same stories, which were then scraped, reranked, and formatted repeatedly (the "4-5x the same news" from fix: reranked websearch content don't overload the model anymore #63).topStoriesto the requested source count (numElements) inprocessSourcesafter enrichment, mirroring how organic results are limited viafetchContents'target; previously 24-37 news entries could flow into the formatted LLM output. SearXNG was worst affected, but Serper's news sub-search is also uncapped at the source.source-processing.test.tswith 10 tests covering content stripping (no highlights, empty highlights, normal expansion path), both cap layers, the env var override, the 50k default, topStories capping with news on and off, and merge dedupe.Real-API before/after on this branch vs. unmodified
main(Serper + Firecrawl, identical query withnews: true):main)topStoriesentriesChange Type
Testing
npx jest src/tools/search— 69 tests pass.main.createSearchTooland inspects the LLM output string and theweb_searchartifact: ran it on unmodifiedmainand on this branch with the same query, once with a working reranker pipeline and once withrerankerType: 'none'to force the no-highlights path. The numbers in the table above are from those runs. The reranker-error path was also exercised live (Jina returned 403), confirming the fallback ranking produces bounded output.npx tsc --noEmit,npx eslint src/tools/search/, andnpm run sort-imports:checkare clean.Test Configuration:
noneChecklist