
LLM Response Quality Evaluation — Baseline (Vibe Check)

Date: 2026-02-17
Dataset: ~5,559 job listings (baseline, raw export)
Models evaluated: e5-large-instruct, bge-base-en-v1.5, gemini-embedding-001, text-embedding-3-small, text-embedding-3-large
LLM generating answers: Claude Opus 4.6 (same across all runs)
Queries: 18 (3 per category: synthesis, comparison, inference, pattern, nuanced-retrieval, analysis)


Executive Summary

High retrieval similarity does not guarantee high answer quality. The most dramatic example: Gemini scored 0.840 similarity on Q18 but retrieved only structural headers, producing a 756-char non-answer ("I cannot meaningfully answer this question"). Meanwhile, OpenAI Small scored 0.453 on Q17 but found specific companies and produced a better, more actionable answer than Gemini at 0.838.

The numbers measure retrieval. The vibe check measures utility.


Query Test Criteria

Each query tests a specific retrieval + reasoning capability. Understanding what's being measured changes the interpretation.

Q Category What It Tests
Q1 synthesis Synthesize patterns across multiple job descriptions into a coherent picture
Q2 synthesis Extract and combine technical requirements from LLM-related roles
Q3 synthesis Pull and synthesize interview details scattered across descriptions
Q4 comparison Compare and contrast requirements across seniority levels
Q5 comparison Distinguish company-stage signals and compare expectations
Q6 comparison Semantic depth, build vs integrate distinction across descriptions
Q7 inference Infer autonomy expectations from indirect language cues
Q8 inference Classify roles by inferred focus without explicit labels
Q9 inference Infer hybrid role expectations from combined skill signals
Q10 pattern Identify recurring non-technical requirements across retrieved chunks
Q11 pattern Extract co-occurring technical terms in a specific subdomain
Q12 pattern Identify perks and cultural signals across multiple listings
Q13 nuanced-retrieval Retrieve based on semantic intent, not keyword overlap with ML terms
Q14 nuanced-retrieval Retrieve on soft cultural signals buried in descriptions
Q15 nuanced-retrieval Distinguish MLOps/deployment focus from training/research focus
Q16 analysis LLM must reason about market signals and form a recommendation
Q17 analysis LLM judges content quality, requires meta-reasoning about the text itself
Q18 analysis Infer hiring difficulty from requirement complexity and specificity

Key insight: The test criteria reveal that Q7, Q8, Q9 (inference) require reading between the lines — indirect cues, no explicit labels. Q13, Q14 (nuanced-retrieval) explicitly test semantic intent over keyword matching. Q17, Q18 (analysis) require the LLM to reason about the text itself, not just summarize it. These are where embedding quality should matter most.


Per-Query Analysis

Q1 — Senior ML Engineer Day-to-Day (synthesis)

Tests: Synthesize patterns across multiple job descriptions into a coherent picture

Model Top-1 Chars Verdict
E5 0.882 2,874 Structured, 7 clear categories. Mentions LLM/GenAI as emerging expectation.
BGE 0.745 2,960 Very similar structure. Slightly more verbose.
Gemini 0.888 2,634 Clean categories, slightly shorter.
OA-Small 0.654 2,536 Same quality, different heading style (H1 vs H2).
OA-Large 0.693 2,734 Nearly identical to E5.

Verdict: All models converge. When the question is broad and the corpus has abundant data, embedding quality barely matters. No meaningful quality difference.


Q2 — Tech Stack for LLM Products (synthesis)

Tests: Extract and combine technical requirements from LLM-related roles

Model Top-1 Chars
E5 0.872 2,135
BGE 0.738 3,076
Gemini 0.864 3,370
OA-Small 0.659 3,346
OA-Large 0.632 3,046

Verdict: All models found Python, LangChain, LangGraph, OpenAI, vector databases. E5 was the most concise (2,135 chars) while hitting the same points. Gemini and OpenAI Small were more verbose but didn't add proportionally more insight. Convergent answers, different verbosity.


Q3 — Interview Process (synthesis) ★ INTERESTING

Tests: Pull and synthesize interview details scattered across descriptions

Model Top-1 Chars Key Finds
E5 0.890 1,773 Found "Technical AI Prompt Test + Behavioral Assessments" multi-stage process
BGE 0.762 2,285 Found "Talent Pool / Pre-screening model" pattern
Gemini 0.847 1,485 Admitted "very little specific information" — only found 2 mentions
OA-Small 0.677 2,449 Found Brainlancer's 3-step AI interview + QuintoAndar's pipeline
OA-Large 0.608 2,421 Found QuintoAndar + CI&T structured processes

Verdict: This is where embedding differences matter. Each model retrieved different job listings and surfaced different interview processes. No single model captured them all. BGE found the talent pool model that others missed. OpenAI models found specific Brazilian companies (QuintoAndar, CI&T). Gemini was the most honest about data scarcity but also the least useful. Different embeddings = different content discovered.


Q4 — Junior vs Senior AI Roles (comparison)

Tests: Compare and contrast requirements across seniority levels

Model Top-1 Chars
E5 0.878 2,992
BGE 0.774 2,582
Gemini 0.836 2,964
OA-Small 0.578 2,706
OA-Large 0.534 2,786

Verdict: All models captured the core contrast (experience years, autonomy, scope). E5 included the insight that junior roles are described as "viable transition roles" — a specific phrase from the data. Gemini noted "very little about junior roles specifically" which is honest and useful context. Similar quality, minor detail differences.


Q5 — Startups vs Large Companies (comparison)

Tests: Distinguish company-stage signals and compare expectations

Model Top-1 Chars
E5 0.879 2,777
BGE 0.722 2,679
Gemini 0.852 3,833
OA-Small 0.575 2,807
OA-Large 0.527 3,751

Verdict: Strong convergence across all models. Startups = ownership + breadth + velocity. Large = specialization + process + compliance. Gemini and OA-Large were notably longer without adding proportional insight. No quality gap despite 0.35 similarity gap between E5 and OA-Large.


Q6 — Build from Scratch vs Integrate APIs (comparison)

Tests: Semantic depth, build vs integrate distinction across descriptions

Model Top-1 Chars
E5 0.879 2,941
BGE 0.759 4,471
Gemini 0.839 3,778
OA-Small 0.591 4,285
OA-Large 0.575 3,984

Verdict: BGE produced the longest answer (4,471) and included the most concrete quotes from actual listings — "Design and implement VOX's initial product architecture and tech stack from scratch." More verbose but also more grounded. BGE's lower similarity didn't hurt — it may have helped by pulling more diverse chunks.


Q7 — Independent Work / Minimal Supervision (inference)

Tests: Infer autonomy expectations from indirect language cues

Model Top-1 Chars
E5 0.849 2,602
BGE 0.699 2,217
Gemini 0.819 2,231
OA-Small 0.527 3,021
OA-Large 0.496 2,518

Verdict: E5 found N2/N3 Support / Infrastructure Operations roles with explicit "high autonomy" language. BGE found fintech/startup roles. Gemini acknowledged it couldn't see role titles — just description fragments. OpenAI Small was the most comprehensive despite the lowest similarity. OA-Small's 0.527 outperformed Gemini's 0.819 in usefulness.


Q8 — Research vs Production Engineering (inference)

Tests: Classify roles by inferred focus without explicit labels

Model Top-1 Chars
E5 0.837 3,470
BGE 0.691 4,944
Gemini 0.807 3,311
OA-Small 0.517 3,608
OA-Large 0.484 5,075

Verdict: E5 named Epic Games Principal Research Engineer — a specific, real role. BGE was the most thorough (4,944 chars) with detailed categorization. OA-Large was the longest (5,075) and found Adaptable Labs roles. Gemini found GenAI R&D roles. All models successfully distinguished the spectrum. BGE and OA-Large traded verbosity for specificity — fair tradeoff.


Q9 — Full-Stack + ML Engineer (inference)

Tests: Infer hybrid role expectations from combined skill signals

Model Top-1 Chars Key Find
E5 0.900 3,098 Named "AI/ML Full Stack Engineer" title explicitly
BGE 0.737 4,202 Found "Senior AI Full-Stack Engineer (LLMs, RAG, Agentic Systems)"
Gemini 0.855 4,067 Same Senior AI Full-Stack role + NineTwoThree
OA-Small 0.599 2,904 Found same key roles, more concise
OA-Large 0.603 3,722 Found same roles with good detail

Verdict: E5's 0.900 — the second-highest top-1 in the entire benchmark (only E5's own Q15 score of 0.906 is higher) — found the most directly relevant result. But all models converged on the same core roles. When the question maps cleanly to actual job titles in the data, all models work.


Q10 — Soft Skills (pattern)

Tests: Identify recurring non-technical requirements across retrieved chunks

Model Top-1 Chars
E5 0.890 1,775
BGE 0.782 1,993
Gemini 0.864 2,550
OA-Small 0.709 2,255
OA-Large 0.628 2,108

Verdict: Universal agreement: communication, problem-solving, collaboration, adaptability, ownership. E5 was most concise. No quality difference. This is a "flat data" query — the answer is the same everywhere in the corpus.


Q11 — LLM/RAG Tools and Frameworks (pattern)

Tests: Extract co-occurring technical terms in a specific subdomain

Model Top-1 Chars
E5 0.867 2,904
BGE 0.764 2,800
Gemini 0.848 2,993
OA-Small 0.598 2,287
OA-Large 0.550 2,490

Verdict: All models: LangChain #1, LangGraph #2, LlamaIndex, Pinecone, ChromaDB, FAISS. Virtually identical. This is another flat data query. The tools are mentioned so frequently that any retrieval finds them.


Q12 — Benefits Beyond Salary (pattern) ★ BOILERPLATE INDICATOR

Tests: Identify perks and cultural signals across multiple listings

Model Top-1 Chars
E5 0.872 2,675
BGE 0.726 2,289
Gemini 0.814 2,403
OA-Small 0.640 2,579
OA-Large 0.619 3,437

Verdict: This query is a boilerplate magnet. Benefits text IS boilerplate — 401(k), health insurance, PTO. High similarity scores here likely mean the model matched on the exact content we're about to strip in the cleaning round. E5 found Brazil-specific benefits (Wellhub/Gympass, BetterFly, discounts on medicines) which shows it reached real content, not just template text. Watch this query's score change after cleaning — if it drops, the model was riding boilerplate.
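
Ahead of the cleaning round, this kind of boilerplate can be flagged mechanically: count how many distinct listings contain an identical chunk and drop anything that repeats across too many of them. The listings, chunk texts, and threshold below are invented for illustration — not the pipeline's actual cleaning rules.

```python
# Flag chunks that repeat verbatim across multiple listings as boilerplate.
# Data and min_repeats threshold are illustrative.

from collections import Counter

def boilerplate_chunks(chunks_by_listing, min_repeats=3):
    counts = Counter()
    for listing_id, chunks in chunks_by_listing.items():
        for chunk in set(chunks):      # count each chunk once per listing
            counts[chunk] += 1
    return {c for c, n in counts.items() if n >= min_repeats}

listings = {
    "jd_a": ["We offer 401(k) matching and PTO.", "Build RAG pipelines."],
    "jd_b": ["We offer 401(k) matching and PTO.", "Own the SAP stack."],
    "jd_c": ["We offer 401(k) matching and PTO.", "Ship LLM features."],
}
print(boilerplate_chunks(listings))
# → {'We offer 401(k) matching and PTO.'}
```

Counting each chunk once per listing (via `set`) matters: a benefits line repeated twice inside one listing is not the same signal as the same line appearing in three different listings.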


Q13 — Data Quality / Pipeline Roles (nuanced-retrieval)

Tests: Retrieve based on semantic intent, not keyword overlap with ML terms

Model Top-1 Chars
E5 0.892 6,029
BGE 0.726 3,542
Gemini 0.848 5,920
OA-Small 0.577 4,722
OA-Large 0.595 4,515

Verdict: E5 and Gemini produced the longest, most detailed responses (6K chars each) with specific role descriptions and quotes. BGE was shorter but still found the core roles. All models identified Senior Data Quality Engineer, Data Engineer (pipeline resilience), and MLOps roles. E5 and Gemini excelled here — nuanced retrieval rewards high similarity.


Q14 — Mentorship & Culture (nuanced-retrieval)

Tests: Retrieve on soft cultural signals buried in descriptions

Model Top-1 Chars
E5 0.893 4,474
BGE 0.717 6,397
Gemini 0.848 4,282
OA-Small 0.603 4,588
OA-Large 0.586 4,534

Verdict: BGE produced the longest answer (6,397) and specifically found Amazon/AWS's "Earth's Best Employer" standardized mentorship section — which is itself a form of corporate boilerplate. E5 found diverse mentorship patterns across role levels. BGE's verbosity here may partially reflect retrieving repetitive Amazon chunks.


Q15 — Production Deployment / Inference at Scale (nuanced-retrieval)

Tests: Distinguish MLOps/deployment focus from training/research focus

Model Top-1 Chars
E5 0.906 3,756
BGE 0.774 4,747
Gemini 0.882 4,222
OA-Small 0.606 4,806
OA-Large 0.597 3,817

Verdict: All models found MLOps/ML Platform Engineer as the dominant role. E5 was most concise while hitting all key points (containerized microservices, low-latency inference, Kubernetes). OA-Small was the longest despite the lowest similarity. Consistent quality across the board.


Q16 — Skills to Learn for AI Roles (analysis)

Tests: LLM must reason about market signals and form a recommendation

Model Top-1 Chars
E5 0.895 3,088
BGE 0.770 2,969
Gemini 0.869 3,691
OA-Small 0.677 3,743
OA-Large 0.633 3,642

Verdict: All models produced tiered skill recommendations (Tier 1: Python, ML frameworks, Cloud; Tier 2: LLMs, RAG, vector DBs; Tier 3: domain specifics). OA-Small's response was the most opinionated — "Python. No exceptions." — which arguably makes it more useful as advice. Convergent content, different personality.


Q17 — Best-Written Job Descriptions (analysis) ★ KEY DIVERGENCE

Tests: LLM judges content quality, requires meta-reasoning about the text itself

Model Top-1 Chars Best JD Found
E5 0.863 3,492 n8n/Supabase Automation Role
BGE 0.731 4,125 LLM/RAG/Agent Role
Gemini 0.838 2,386 "vast majority contain virtually no substantive content"
OA-Small 0.453 3,984 Brainlancer
OA-Large 0.416 3,432 EdTech Role

Verdict: This is the most revealing query in the benchmark. Gemini scored 0.838 similarity but its retrieved chunks were mostly empty headers — it admitted as much. OpenAI Small scored 0.453, barely more than half of Gemini's score, but found specific, well-written listings like Brainlancer and produced a better answer. Gemini matched on structural similarity (headers about "job descriptions") while OpenAI matched on semantic content (actual descriptions). This is the strongest evidence that high similarity can mean the model matched on noise, not signal.


Q18 — Hardest Roles to Fill (analysis) ★★ CRITICAL FAILURE

Tests: Infer hiring difficulty from requirement complexity and specificity

Model Top-1 Chars Answer
E5 0.871 2,966 Specialized AI/ML Research (PhD + publications)
BGE 0.741 3,460 Salesforce Specialist Roles
Gemini 0.840 756 "I cannot meaningfully answer this question"
OA-Small 0.521 4,117 Salesforce Platform Roles
OA-Large 0.495 3,692 SAP Specialist Roles

Verdict: Gemini's worst moment. Despite 0.840 top-1 similarity, the retrieved chunks contained only section headers ("REQUIREMENTS", "MISSING SKILLS", "Qualifications") with no actual content. The model honestly refused to hallucinate — which is good behavior from the LLM — but the retrieval completely failed. Meanwhile:

  • E5 found actual analysis content (AI reasoning about PhD requirements)
  • BGE and OA-Small found Salesforce roles (consistently rated POOR_FIT in the AI analysis sections)
  • OA-Large found SAP roles

E5, BGE, and OpenAI models retrieved chunks from the AI Analysis sections which contain the match reasoning data. Gemini retrieved structural headers. This query will be fascinating to compare after cleaning — the AI Analysis sections are being stripped.


Cross-Cutting Findings

1. Similarity Score ≠ Answer Quality

The correlation between top-1 similarity and response usefulness is weak for analysis/inference queries — and in the worst cases, inverted. Evidence:

Query Gemini Sim Gemini Quality OA-Small Sim OA-Small Quality
Q17 0.838 Poor (empty headers) 0.453 Good (specific companies)
Q18 0.840 Failed (756 chars) 0.521 Good (4,117 chars)
Q7 0.819 OK (no role names) 0.527 Better (comprehensive)
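
The pattern in the table above can be quantified with a rank correlation. The sketch below computes Spearman's rho over the six (similarity, usefulness) pairs from Q17, Q18, and Q7; the similarity values come from the tables in this report, but the 1-5 usefulness ratings are hypothetical stand-ins for the qualitative verdicts, not measured data.

```python
# Spearman rank correlation between top-1 similarity and answer usefulness.
# Similarities are from the Q17/Q18/Q7 tables; the 1-5 usefulness ratings
# are HYPOTHETICAL stand-ins for the vibe-check verdicts.

def rank(values):
    """1-based average ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# (similarity, usefulness) pairs: Gemini on Q17/Q18/Q7, then OA-Small.
sims    = [0.838, 0.840, 0.819, 0.453, 0.521, 0.527]
quality = [2, 1, 3, 4, 4, 4]   # hypothetical 1-5 ratings
print(f"Spearman rho: {spearman(sims, quality):+.2f}")  # → Spearman rho: -0.94
```

A strongly negative rho on pairs like these is the quantitative version of the claim above: on analysis/inference queries, chasing the highest top-1 similarity is the wrong optimization target.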

2. "Flat Data" Queries Equalize All Models

For queries where the answer exists everywhere in the corpus (Q1, Q2, Q4, Q5, Q10, Q11, Q15, Q16), all models produce near-identical answers regardless of similarity scores. These queries don't differentiate embedding quality.

Flat queries: Q1, Q2, Q4, Q5, Q10, Q11, Q15, Q16 (8 of 18)
Discriminating queries: Q3, Q6, Q7, Q8, Q9, Q12, Q13, Q14, Q17, Q18 (10 of 18)

3. Different Embeddings Find Different Content

On Q3 (interview process), each model found different companies and interview structures. No single model captured the full picture. This suggests that running the same query across multiple embedding models and aggregating results would outperform any single model.
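
A minimal way to implement that aggregation is reciprocal rank fusion: each model contributes 1/(k + rank) per chunk, so chunks surfaced by several models rise to the top even if no single model ranks them first. The chunk IDs and per-model rankings below are invented for illustration; k=60 is the conventional RRF default.

```python
# Reciprocal rank fusion over ranked chunk lists from several embedding
# models. Chunk IDs and rankings are invented for illustration.

def rrf(rankings, k=60):
    """Fuse ranked lists of chunk IDs; higher fused score = better."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 lists from three embedding models on Q3:
e5    = ["jd_1042", "jd_0031", "jd_2207"]
bge   = ["jd_0555", "jd_1042", "jd_0789"]
oa_sm = ["jd_0900", "jd_0031", "jd_1042"]

fused = rrf([e5, bge, oa_sm])
print(fused[0])  # → jd_1042 (the chunk all three models found)
```

Because RRF uses only ranks, it sidesteps the incomparable similarity scales across models (E5's 0.88 band vs OA-Small's 0.45-0.71 band) that make raw-score averaging meaningless.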

4. The AI Analysis Sections Are Doing Heavy Lifting

Q18's divergence reveals that E5, BGE, and OpenAI models are retrieving chunks from the AI Analysis sections (Copilot reasoning, match scores, gap analysis). These sections contain rich analytical content but are pipeline artifacts — not actual job description text. After cleaning removes them, we'll see if models can still answer analytical queries from raw job content alone.

5. Gemini's Retrieval Has a Structural Matching Problem

Gemini retrieved headers and structural fragments on Q17 and Q18 because those headers semantically match the query terms ("requirements", "job descriptions") without containing useful content. This is a known limitation of high-dimensional semantic matching on short, generic text fragments.
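
One mitigation, given that limitation, is to filter header-like fragments out of the index before they ever reach the embedding model. The heuristic below — short, all-caps, or sentence-free chunks are treated as structural — and its threshold are illustrative, not the pipeline's actual cleaning rules.

```python
# Drop chunks that are mostly structural (headers, section labels) before
# embedding. Heuristic and min_words threshold are illustrative.

import re

def looks_structural(chunk: str, min_words: int = 12) -> bool:
    text = chunk.strip()
    if len(text.split()) < min_words:   # headers are short
        return True
    if text.isupper():                  # "REQUIREMENTS", "MISSING SKILLS"
        return True
    if not re.search(r"[.!?]", text):   # no sentence punctuation at all
        return True
    return False

chunks = [
    "REQUIREMENTS",
    "Qualifications",
    "Design and implement VOX's initial product architecture and tech "
    "stack from scratch. You will own deployment end to end.",
]
kept = [c for c in chunks if not looks_structural(c)]
print(len(kept))  # → 1
```

This would not fix Gemini's embedding behavior, but it removes the empty targets that behavior latches onto — which is effectively what the prediction about the no-boilerplate round tests.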

6. BGE Trades Precision for Coverage

BGE consistently produced the longest answers with the most quoted material. Its lower similarity scores may actually help by pulling more diverse chunks rather than the most similar ones — which for broad analytical queries, surfaces more varied content.


Model Rankings by Response Quality (Vibe)

Tier 1: Consistently Useful

  1. E5 — Best balance of concision and specificity. Rarely wrong, never wastes words. Found the most specific roles and real data points.
  2. BGE — Most thorough. Longer answers but grounded in actual quoted content. Sometimes retrieves repetitive chunks (Amazon boilerplate on Q14) but generally strong.

Tier 2: Solid but Inconsistent

  3. OpenAI Small — Surprisingly good answers despite the lowest similarity band. Opinionated, specific, and occasionally better than Gemini. The 0.45-0.71 similarity range didn't hurt response quality on most queries.
  4. OpenAI Large — Similar to Small. Found different content (SAP roles on Q18, EdTech on Q17). No clear advantage over Small in response quality despite 3072 vs 1536 dimensions.

Tier 3: Unreliable

  5. Gemini — Highest highs (Q13: 5,920 chars of excellent detail) and lowest lows (Q18: a 756-char refusal). The structural matching problem on analytical queries is a real liability. When it works, it works well. When it doesn't, it fails completely.

Predictions for No-Boilerplate Round

  1. Q12 (Benefits) scores will drop across all models — benefits text IS the boilerplate being removed
  2. Q18 will be the biggest test — AI Analysis sections are being stripped. Models that relied on them (E5, BGE, OpenAI) will either find the signal in raw job content or degrade
  3. Gemini might improve on Q17/Q18 — with structural noise removed, fewer empty headers to match on
  4. Flat queries (Q1, Q2, Q10, Q11) will barely change — the signal is in the actual job content, not the boilerplate
  5. E5's spread may widen — if its tight 0.025 spread was partially inflated by shared boilerplate, cleaning will expose real variance

No-Boilerplate Round: What Actually Happened

Data change: 51,545 vectors down to 33,409 (HF models) / 36,523 (Gemini). ~35% chunk reduction. Same 5,559 files, same chunking.

Aggregate Comparison

Model Baseline Top-1 Clean Top-1 Baseline Spread Clean Spread
E5 0.879 0.872 0.025 0.034
Gemini 0.848 0.850 0.040 0.045
BGE 0.742 0.724 0.073 0.069
OA-Small 0.598 0.582 0.097 0.104
OA-Large 0.571 0.561 0.096 0.099

Prediction Results

  1. Q12 (Benefits) scores will drop — PARTIALLY CONFIRMED. E5 Q12 dropped from 0.872 to 0.867. BGE was essentially flat (0.726 baseline, 0.727 clean). The drop was smaller than expected because benefits content is embedded in actual job descriptions, not just boilerplate sections. The model still finds benefits language in the remaining text.

  2. Q18 will be the biggest test — CONFIRMED for Gemini, MIXED for others. Gemini Q18 stayed at 0.840 top-1 but still produced a short answer (1,113 chars vs 756 baseline). E5 Q18 dropped from 0.871 to 0.852 but still produced a good answer (4,387 chars). The AI Analysis sections being removed didn't kill analytical queries — models found signal in the raw job content.

  3. Gemini might improve on Q17/Q18 — PARTIALLY CONFIRMED. Gemini Q17 improved from 0.838 to 0.855 and answer length jumped from 2,386 to 3,523 chars. Q18 stayed flat at 0.840 but answer length went from 756 to 1,113. The structural header matching problem is slightly reduced but not eliminated.

  4. Flat queries barely change — CONFIRMED. Q1 E5: 0.882 to 0.879. Q10 E5: 0.890 to 0.888. Q11 E5: 0.867 to 0.853. The signal for broad pattern queries exists throughout the corpus regardless of boilerplate.

  5. E5's spread will widen — CONFIRMED. E5 spread went from 0.025 to 0.034 (+36%). FLAT query count dropped from 6/18 to 1/18. This was the most significant improvement from cleaning: E5 went from heavily compressed to functionally discriminating.

Key Observations

Token counts went up across all models. E5 jumped from 25,992 to 36,099 avg tokens/query (+39%). Fewer chunks total, but denser content per chunk means each of the 200 retrieval slots now carries more actual information. The LLM gets more real content per query.

BGE is the exception. Its spread actually narrowed (0.073 to 0.069). BGE was using boilerplate content productively on certain queries — Amazon mentorship boilerplate on Q14, benefits text on Q12. Removing that content removed signal for those specific questions.

Gemini's structural matching problem persists. Even with cleaned data, Q18 produced only 1,113 chars. The model matches on structural patterns (headers, section titles) that exist in both clean and dirty data. This is a model-level limitation, not a data quality issue.


Recommended Queries for Article Side-by-Side

If showing 3-4 queries to illustrate the difference:

  1. Q18 (Hardest roles) — Gemini's failure vs everyone else's success. Most dramatic divergence.
  2. Q17 (Best-written JDs) — Low similarity beating high similarity in usefulness.
  3. Q3 (Interview process) — Each model found different content. Shows embedding diversity value.
  4. Q12 (Benefits) — Boilerplate indicator. Compare baseline vs cleaned to show cleaning impact.