perf: consolidate tool metrics queries (11 → 4 pipelines) #61
The previous implementation ran 11 sequential MongoDB aggregations, each starting with $unwind on the full content array. On DocumentDB this is very expensive because the content array can contain large blobs (images, long text) that must be materialised before the $match on content.type can filter them out.

New approach:

- Pre-filter documents at the collection level using $match on content.type (hits the multikey index, skips non-tool messages).
- Use $project + $filter to extract a tiny "_tc" array containing only the tool_call items before unwinding (typically 1-3 elements instead of 10-50+).
- Query A: single all-time pass that accumulates total calls, per-tool, per-model, and per-endpoint counts in one pipeline.
- Query B: all-time error counts; uses $indexOfCP instead of $regex to avoid a full string scan on every output blob (DocumentDB-compatible).
- Query C: 5-minute window with total calls, per-tool, and errors in one pipeline; time filter is placed first to hit the updatedAt index.
- Query D: messages_with_tools via count_documents (index-only, no aggregation); active_tool_users with a minimal group pipeline.

Co-authored-by: Cursor <cursoragent@cursor.com>
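For readers who want to see the shape of the filter-then-unwind approach, here is a minimal sketch of a Query A-style pipeline. It is not the PR's actual code: the function names, the `tool_call.name` path, and the `model` and `endpoint` fields on the message document are assumptions made for illustration. It groups on a composite key and folds the rows in Python, which is one possible way to get several counters out of a single pass.

```python
from collections import Counter

TOOL_CALL = "tool_call"


def build_all_time_pipeline():
    """Build a Query A-style pipeline: filter first, shrink, then unwind."""
    return [
        # 1. Document-level pre-filter; this can use a multikey index on content.type.
        {"$match": {"content.type": TOOL_CALL}},
        # 2. Keep only the tool_call items in a small "_tc" array.
        #    "model" and "endpoint" are assumed field names on the message document.
        {"$project": {
            "model": 1,
            "endpoint": 1,
            "_tc": {"$filter": {
                "input": "$content",
                "as": "c",
                "cond": {"$eq": ["$$c.type", TOOL_CALL]},
            }},
        }},
        # 3. Unwind the small array instead of the full content array.
        {"$unwind": "$_tc"},
        # 4. One group keyed on (tool, model, endpoint); "tool_call.name" is assumed.
        {"$group": {
            "_id": {
                "tool": "$_tc.tool_call.name",
                "model": "$model",
                "endpoint": "$endpoint",
            },
            "calls": {"$sum": 1},
        }},
    ]


def summarise(rows):
    """Fold the grouped rows into total, per-tool, per-model and per-endpoint counts."""
    total = 0
    per_tool, per_model, per_endpoint = Counter(), Counter(), Counter()
    for row in rows:
        key, n = row["_id"], row["calls"]
        total += n
        per_tool[key.get("tool")] += n
        per_model[key.get("model")] += n
        per_endpoint[key.get("endpoint")] += n
    return {
        "total": total,
        "per_tool": dict(per_tool),
        "per_model": dict(per_model),
        "per_endpoint": dict(per_endpoint),
    }
```

With PyMongo this would be used as `summarise(collection.aggregate(build_all_time_pipeline()))`, where `collection` is the messages collection.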
|
Thanks for this. The strategy (prefilter at the document level,

🔴 Case-sensitivity regression in error matching

The old code used

Verified against the LibreChat source:

The lowercase variant from the assistants/v1 paths will silently drop out of

Cheapest fix that preserves the perf intent is to OR two prefix checks:

    ERROR_PREFIX_VARIANTS = ("Error processing tool", "error processing tool")

    def _matches_error(output_expr):
        return {"$or": [
            {"$gte": [{"$indexOfCP": [output_expr, p]}, 0]}
            for p in ERROR_PREFIX_VARIANTS
        ]}

…and use

If you'd rather match the old semantics exactly (any case, anywhere in the string),

🟡 Unnamed
|
- Replace $indexOfCP with $regexMatch (options:"i") in Query B and C to match both capitalised and lowercase error strings emitted by LibreChat (ToolService.js vs assistants/chatV1.js).
- Replace $ifNull on tool_call.output with $convert (to:string, onError:"", onNull:"") to guard against non-string stored values that would cause $regexMatch to throw.

Co-authored-by: Cursor <cursoragent@cursor.com>
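The combination described in this commit can be sketched as a small expression builder. This is an illustration only, not the code from the PR: the helper names and the `$$tc` variable name are hypothetical, and the pattern string is taken from the error text quoted in the review.

```python
ERROR_PATTERN = "error processing tool"  # matched case-insensitively below


def output_as_string(output_expr):
    """Coerce a stored output value to a string; fall back to "" on null or failure."""
    return {"$convert": {
        "input": output_expr,
        "to": "string",
        "onError": "",
        "onNull": "",
    }}


def is_error(output_expr):
    """Case-insensitive match anywhere in the (string-coerced) output."""
    return {"$regexMatch": {
        "input": output_as_string(output_expr),
        "regex": ERROR_PATTERN,
        "options": "i",
    }}


def error_items_filter(array_expr="$content"):
    """$filter expression keeping only tool_call items whose output looks like an error."""
    return {"$filter": {
        "input": array_expr,
        "as": "tc",
        "cond": {"$and": [
            {"$eq": ["$$tc.type", "tool_call"]},
            is_error("$$tc.tool_call.output"),
        ]},
    }}
```

Because `$convert` supplies an empty string for null or unconvertible values, `$regexMatch` always receives a string, so a non-string output simply does not match rather than raising inside the `$filter` condition.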
- Remove alignment whitespace in Query A and C result loops (E221, E272)
- Wrap long $convert expressions across multiple lines (E501)

Co-authored-by: Cursor <cursoragent@cursor.com>
|
Thanks for the thorough review, @Odrec. All three points are valid and addressed in the latest commit.

🔴 Case-sensitivity / $indexOfCP → $regexMatch

This turned out to fix a real bug, not just a defensive improvement. We verified that our production database has 779 documents where tool_call.output is stored as a non-string type. With $indexOfCP, DocumentDB was throwing a type-mismatch error inside the $filter condition and silently treating it as "include this item", so Query B was expanding the full tool-call arrays of all 779 documents instead of only actual error items. This was also corrupting errors_per_tool and errors_5m with false positives. The fix resolves both the performance regression and the incorrect error counts.

🟡 Unnamed tool_call items counted as "unknown"

🟡 $convert defensive cast
Summary
- The previous implementation ran 11 sequential MongoDB aggregations, each starting with $unwind on the full content array. On DocumentDB this is expensive because the array can contain large blobs (images, long text) that must be materialised in memory before the $match on content.type can filter them out, causing noticeably slow collection cycles on larger installations.
- _fetch_all_tool_metrics() now uses 4 pipelines instead:
  - Query A (all-time): pre-filters with $match {"content.type": "tool_call"} (hits the multikey index), then uses $project + $filter to extract a tiny _tc array of only the tool-call items (typically 1–3 elements vs 10–50+), and unwinds that. A single pass accumulates total calls, per-tool, per-model, and per-endpoint counts.
  - Query B (all-time errors): uses $indexOfCP instead of $regex for error string matching (DocumentDB-compatible, avoids a full string scan on every output blob).
  - Query C (5-minute window): the time filter comes first, so the pipeline uses the updatedAt index before touching the content array at all. Total calls, per-tool counts, and error counts in a single pass.
  - Query D: messages_with_tools via count_documents() (index-only, no aggregation pipeline needed) + active tool users.
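As a rough illustration of the last two items (time filter first, then `count_documents()` plus a minimal group), the following sketch shows one way they could be written. The function names and the `user` field are assumptions, not taken from the PR; `collection` is expected to behave like a PyMongo collection.

```python
from datetime import datetime, timedelta, timezone

TOOL_FILTER = {"content.type": "tool_call"}


def recent_window_stages(now=None, minutes=5):
    """Leading stages for a time-window query: updatedAt first, content.type second."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=minutes)
    return [
        # Time filter first, so an index on updatedAt can narrow the candidate set.
        {"$match": {"updatedAt": {"$gte": cutoff}}},
        # Only then look at the content array.
        {"$match": TOOL_FILTER},
    ]


def fetch_tool_usage_counts(collection):
    """Count messages containing tool calls and the distinct users who sent them."""
    # Plain filter count; no aggregation pipeline involved.
    messages_with_tools = collection.count_documents(TOOL_FILTER)
    # Minimal pipeline: one row per distinct user ("user" is an assumed field name).
    rows = collection.aggregate([
        {"$match": TOOL_FILTER},
        {"$group": {"_id": "$user"}},
    ])
    active_tool_users = sum(1 for _ in rows)
    return {
        "messages_with_tools": messages_with_tools,
        "active_tool_users": active_tool_users,
    }
```

Counting the grouped rows on the client keeps the pipeline to two stages; the remaining tool-call stages of a time-window query would follow the two `$match` stages returned by `recent_window_stages()`.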
Test plan