Skip to content

perf: consolidate tool metrics queries (11 → 4 pipelines)#61

Merged
Odrec merged 3 commits into
virtUOS:mainfrom
dborysenko:perf/optimize-tool-metrics-queries
May 8, 2026
Merged

perf: consolidate tool metrics queries (11 → 4 pipelines)#61
Odrec merged 3 commits into
virtUOS:mainfrom
dborysenko:perf/optimize-tool-metrics-queries

Conversation

@dborysenko
Copy link
Copy Markdown
Contributor

Summary

  • The previous implementation ran 11 sequential aggregation pipelines, each
    starting with $unwind on the full content array. On DocumentDB this
    is expensive because the array can contain large blobs (images, long text)
    that must be materialised in memory before the $match on content.type
    can filter them out — causing noticeably slow collection cycles on
    larger installations.
  • Rewrote _fetch_all_tool_metrics() to use 4 pipelines instead:
    • Query A — pre-filters documents with $match {"content.type": "tool_call"}
      (hits the multikey index), then uses $project + $filter to extract a
      tiny _tc array of only the tool-call items (typically 1–3 elements vs
      10–50+), and unwinds that. A single pass accumulates total calls,
      per-tool, per-model, and per-endpoint counts.
    • Query B — all-time error counts per tool; uses $indexOfCP instead of
      $regex for error string matching (DocumentDB-compatible, avoids a
      full string scan on every output blob).
    • Query C — 5-minute window; time filter is placed first so DocumentDB
      uses the updatedAt index before touching the content array at all.
      Total calls, per-tool counts, and error counts in a single pass.
    • Query D — messages with tool calls via count_documents() (index-only,
      no aggregation pipeline needed) + active tool users.

Test plan

  • Tested on production server — metrics are still showing as previously
  • No errors in container logs, collection time went from infinite to ~60sec on my 5Gb database

The previous implementation ran 11 sequential MongoDB aggregations, each
starting with $unwind on the full content array.  On DocumentDB this is
very expensive because the content array can contain large blobs (images,
long text) that must be materialised before the $match on content.type
can filter them out.

New approach:
- Pre-filter documents at the collection level using $match on
  content.type (hits the multikey index, skips non-tool messages).
- Use $project + $filter to extract a tiny "_tc" array containing only
  the tool_call items before unwinding — typically 1-3 elements instead
  of 10-50+.
- Query A: single all-time pass that accumulates total calls, per-tool,
  per-model, and per-endpoint counts in one pipeline.
- Query B: all-time error counts; uses $indexOfCP instead of $regex to
  avoid a full string scan on every output blob (DocumentDB-compatible).
- Query C: 5-minute window — total calls, per-tool, and errors in one
  pipeline; time filter is placed first to hit the updatedAt index.
- Query D: messages_with_tools via count_documents (index-only, no
  aggregation); active_tool_users with a minimal group pipeline.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Odrec
Copy link
Copy Markdown
Collaborator

Odrec commented May 8, 2026

Thanks for this. The strategy (prefilter at the document level, $filter the array down before $unwind) is the right shape for DocumentDB, and the inline docs explaining why are great. A couple of things before merge.

🔴 Case-sensitivity regression in error matching

The old code used {"$regex": "Error processing tool", "$options": "i"} — case-insensitive. The new $indexOfCP substring check is case-sensitive, so it won't match outputs that LibreChat actually emits in lowercase.

Verified against the LibreChat source:

api/server/services/ToolService.js:464
  output: `Error processing tool ${currentAction.tool}: …`     // capitalized
api/server/controllers/assistants/errors.js:170
api/server/controllers/assistants/chatV1.js:226
  output: 'error processing tool'                              // lowercase

The lowercase variant from the assistants/v1 paths will silently drop out of errors_per_tool, total_errors, and errors_5m. The "metrics still show as previously" sanity check on prod likely just means that code path isn't exercised on your instance.

Cheapest fix that preserves the perf intent is to OR two prefix checks:

ERROR_PREFIX_VARIANTS = ("Error processing tool", "error processing tool")

def _matches_error(output_expr):
    return {"$or": [
        {"$gte": [{"$indexOfCP": [output_expr, p]}, 0]}
        for p in ERROR_PREFIX_VARIANTS
    ]}

…and use _matches_error({"$ifNull": ["$$c.tool_call.output", ""]}) in both Query B and the $cond inside Query C.

If you'd rather match the old semantics exactly (any case, anywhere in the string), $regexMatch with options: "i" is supported on DocumentDB 4.0+ and is closer to a 1:1 swap.

🟡 Unnamed tool_call items now counted as "unknown"

The old per-tool / per-model / per-endpoint / errors_per_tool pipelines all required "content.tool_call.name": {"$exists": True}. The new pipelines drop that and use {"$ifNull": [..., "unknown"]}, so unnamed items now show up under the label unknown. Side effect: sum(per_tool.values()) now equals total_calls — it didn't before.

Probably an improvement (the old breakdowns were inconsistent with the totals), but it's a visible label change for any dashboard filtering on tool name. Either call it out in the PR description, or restore the filter inside the $filter cond if you'd rather preserve the old behavior:

"cond": {"$and": [
    {"$eq": ["$$c.type", "tool_call"]},
    {"$ne": [{"$type": "$$c.tool_call.name"}, "missing"]},
]}

🟡 Defensive cast on tool_call.output

$indexOfCP requires a string and will throw if output is ever stored as a non-string (object, number, array). $ifNull only guards null/missing. The old $regex was more tolerant. Swap the $ifNull for:

{"$convert": {"input": "$$c.tool_call.output", "to": "string", "onError": "", "onNull": ""}}

Nice work overall. The perf gain is significant and the consolidation is well thought out.

Dmytro Borysenko and others added 2 commits May 7, 2026 21:02
- Replace $indexOfCP with $regexMatch (options:"i") in Query B and C
  to match both capitalised and lowercase error strings emitted by
  LibreChat (ToolService.js vs assistants/chatV1.js).
- Replace $ifNull on tool_call.output with $convert (to:string,
  onError:"", onNull:"") to guard against non-string stored values
  that would cause $regexMatch to throw.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Remove alignment whitespace in Query A and C result loops (E221, E272)
- Wrap long $convert expressions across multiple lines (E501)

Co-authored-by: Cursor <cursoragent@cursor.com>
@dborysenko
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review, @Odrec — all three points are valid and addressed in the latest commit.

🔴 Case-sensitivity / $indexOfCP → $regexMatch
Replaced both occurrences (Query B $filter cond and Query C $cond) with $regexMatch + "options": "i", which matches the old $regex semantics exactly. Also swapped $ifNull for $convert (to: "string", onError: "", onNull: "") to guard against non-string output values.

This turned out to fix a real bug, not just a defensive improvement. We verified that our production database has 779 documents where tool_call.output is stored as a non-string type. With $indexOfCP, DocumentDB was throwing a type-mismatch error inside the $filter condition and silently treating it as "include this item" — so Query B was expanding the full tool-call arrays of all 779 documents instead of only actual error items. This was also corrupting errors_per_tool and errors_5m with false positives. The fix resolves both the performance regression and the incorrect error counts.

🟡 Unnamed tool_call items counted as "unknown"
Keeping the new behaviour. The old pipelines excluded unnamed items entirely via {"$exists": True}, which meant sum(per_tool.values()) < total_calls — the totals were internally inconsistent. The new approach counts them under "unknown", which at least makes the per-tool breakdown add up to the total. The "unknown" label also makes these calls visible in dashboards rather than silently dropping them, which is useful for diagnosing agent transfers or other tool-call variants that don't carry a name.

🟡 $convert defensive cast
Done as part of the error-matching fix above — both spots use $convert now.

@Odrec Odrec merged commit c23d5b3 into virtUOS:main May 8, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants