- Date: 2026-02-17
- Dataset: ~5,559 job listings (baseline, raw export)
- Models evaluated: e5-large-instruct, bge-base-en-v1.5, gemini-embedding-001, text-embedding-3-small, text-embedding-3-large
- LLM generating answers: Claude Opus 4.6 (same across all runs)
- Queries: 18 (3 per category: synthesis, comparison, inference, pattern, nuanced-retrieval, analysis)
High retrieval similarity does not guarantee high answer quality. The most dramatic example: Gemini scored 0.840 similarity on Q18 but retrieved only structural headers, producing a 756-char non-answer ("I cannot meaningfully answer this question"). Meanwhile, OpenAI Small scored 0.453 on Q17 but found specific companies and produced a better, more actionable answer than Gemini did at 0.838.
The numbers measure retrieval. The vibe check measures utility.
Each query tests a specific retrieval + reasoning capability. Understanding what's being measured changes the interpretation.
| Q | Category | What It Tests |
|---|---|---|
| Q1 | synthesis | Synthesize patterns across multiple job descriptions into a coherent picture |
| Q2 | synthesis | Extract and combine technical requirements from LLM-related roles |
| Q3 | synthesis | Pull and synthesize interview details scattered across descriptions |
| Q4 | comparison | Compare and contrast requirements across seniority levels |
| Q5 | comparison | Distinguish company-stage signals and compare expectations |
| Q6 | comparison | Semantic depth, build vs integrate distinction across descriptions |
| Q7 | inference | Infer autonomy expectations from indirect language cues |
| Q8 | inference | Classify roles by inferred focus without explicit labels |
| Q9 | inference | Infer hybrid role expectations from combined skill signals |
| Q10 | pattern | Identify recurring non-technical requirements across retrieved chunks |
| Q11 | pattern | Extract co-occurring technical terms in a specific subdomain |
| Q12 | pattern | Identify perks and cultural signals across multiple listings |
| Q13 | nuanced-retrieval | Retrieve based on semantic intent, not keyword overlap with ML terms |
| Q14 | nuanced-retrieval | Retrieve on soft cultural signals buried in descriptions |
| Q15 | nuanced-retrieval | Distinguish MLOps/deployment focus from training/research focus |
| Q16 | analysis | LLM must reason about market signals and form a recommendation |
| Q17 | analysis | LLM judges content quality, requires meta-reasoning about the text itself |
| Q18 | analysis | Infer hiring difficulty from requirement complexity and specificity |
Key insight: The test criteria reveal that Q7, Q8, Q9 (inference) require reading between the lines — indirect cues, no explicit labels. Q13, Q14 (nuanced-retrieval) explicitly test semantic intent over keyword matching. Q17, Q18 (analysis) require the LLM to reason about the text itself, not just summarize it. These are where embedding quality should matter most.
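The per-query numbers below come from a retrieval step that can be sketched roughly as follows. This is a minimal illustration, not the actual harness: the real chunking, model wrappers, and LLM call are omitted, and the toy vectors are made up. The "Top-1" column in every table is `sims[0]` from a step like this, computed per embedding model.

```python
import numpy as np

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5):
    """Cosine top-k over a chunk index.

    Returns the indices of the k most similar chunks and their
    similarities, best first. 'Top-1 similarity' is sims[0].
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy example with three fake 2-d chunk vectors; real vectors come
# from each embedding model and have 768-3072 dimensions.
chunks = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
idx, sims = top_k(np.array([1.0, 0.1]), chunks, k=2)
```

The important caveat, which the rest of this report illustrates repeatedly: `sims[0]` measures closeness in the model's embedding space, not whether the retrieved chunk contains anything the LLM can use.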
Q1 (synthesis). Tests: Synthesize patterns across multiple job descriptions into a coherent picture
| Model | Top-1 | Chars | Verdict |
|---|---|---|---|
| E5 | 0.882 | 2,874 | Structured, 7 clear categories. Mentions LLM/GenAI as emerging expectation. |
| BGE | 0.745 | 2,960 | Very similar structure. Slightly more verbose. |
| Gemini | 0.888 | 2,634 | Clean categories, slightly shorter. |
| OA-Small | 0.654 | 2,536 | Same quality, different heading style (H1 vs H2). |
| OA-Large | 0.693 | 2,734 | Nearly identical to E5. |
Verdict: All models converge. When the question is broad and the corpus has abundant data, embedding quality barely matters. No meaningful quality difference.
Q2 (synthesis). Tests: Extract and combine technical requirements from LLM-related roles
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.872 | 2,135 |
| BGE | 0.738 | 3,076 |
| Gemini | 0.864 | 3,370 |
| OA-Small | 0.659 | 3,346 |
| OA-Large | 0.632 | 3,046 |
Verdict: All models found Python, LangChain, LangGraph, OpenAI, vector databases. E5 was the most concise (2,135 chars) while hitting the same points. Gemini and OpenAI Small were more verbose but didn't add proportionally more insight. Convergent answers, different verbosity.
Q3 (synthesis). Tests: Pull and synthesize interview details scattered across descriptions
| Model | Top-1 | Chars | Key Finds |
|---|---|---|---|
| E5 | 0.890 | 1,773 | Found "Technical AI Prompt Test + Behavioral Assessments" multi-stage process |
| BGE | 0.762 | 2,285 | Found "Talent Pool / Pre-screening model" pattern |
| Gemini | 0.847 | 1,485 | Admitted "very little specific information" — only found 2 mentions |
| OA-Small | 0.677 | 2,449 | Found Brainlancer's 3-step AI interview + QuintoAndar's pipeline |
| OA-Large | 0.608 | 2,421 | Found QuintoAndar + CI&T structured processes |
Verdict: This is where embedding differences matter. Each model retrieved different job listings and surfaced different interview processes. No single model captured them all. BGE found the talent pool model that others missed. OpenAI models found specific Brazilian companies (QuintoAndar, CI&T). Gemini was the most honest about data scarcity but also the least useful. Different embeddings = different content discovered.
Q4 (comparison). Tests: Compare and contrast requirements across seniority levels
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.878 | 2,992 |
| BGE | 0.774 | 2,582 |
| Gemini | 0.836 | 2,964 |
| OA-Small | 0.578 | 2,706 |
| OA-Large | 0.534 | 2,786 |
Verdict: All models captured the core contrast (experience years, autonomy, scope). E5 included the insight that junior roles are described as "viable transition roles" — a specific phrase from the data. Gemini noted "very little about junior roles specifically" which is honest and useful context. Similar quality, minor detail differences.
Q5 (comparison). Tests: Distinguish company-stage signals and compare expectations
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.879 | 2,777 |
| BGE | 0.722 | 2,679 |
| Gemini | 0.852 | 3,833 |
| OA-Small | 0.575 | 2,807 |
| OA-Large | 0.527 | 3,751 |
Verdict: Strong convergence across all models. Startups = ownership + breadth + velocity. Large = specialization + process + compliance. Gemini and OA-Large were notably longer without adding proportional insight. No quality gap despite 0.35 similarity gap between E5 and OA-Large.
Q6 (comparison). Tests: Semantic depth, build vs integrate distinction across descriptions
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.879 | 2,941 |
| BGE | 0.759 | 4,471 |
| Gemini | 0.839 | 3,778 |
| OA-Small | 0.591 | 4,285 |
| OA-Large | 0.575 | 3,984 |
Verdict: BGE produced the longest answer (4,471) and included the most concrete quotes from actual listings — "Design and implement VOX's initial product architecture and tech stack from scratch." More verbose but also more grounded. BGE's lower similarity didn't hurt — it may have helped by pulling more diverse chunks.
Q7 (inference). Tests: Infer autonomy expectations from indirect language cues
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.849 | 2,602 |
| BGE | 0.699 | 2,217 |
| Gemini | 0.819 | 2,231 |
| OA-Small | 0.527 | 3,021 |
| OA-Large | 0.496 | 2,518 |
Verdict: E5 found N2/N3 Support / Infrastructure Operations roles with explicit "high autonomy" language. BGE found fintech/startup roles. Gemini acknowledged it couldn't see role titles — just description fragments. OpenAI Small was the most comprehensive despite the lowest similarity. OA-Small's 0.527 outperformed Gemini's 0.819 in usefulness.
Q8 (inference). Tests: Classify roles by inferred focus without explicit labels
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.837 | 3,470 |
| BGE | 0.691 | 4,944 |
| Gemini | 0.807 | 3,311 |
| OA-Small | 0.517 | 3,608 |
| OA-Large | 0.484 | 5,075 |
Verdict: E5 named Epic Games Principal Research Engineer — a specific, real role. BGE was the most thorough (4,944 chars) with detailed categorization. OA-Large was the longest (5,075) and found Adaptable Labs roles. Gemini found GenAI R&D roles. All models successfully distinguished the spectrum. BGE and OA-Large traded concision for specificity — a fair tradeoff.
Q9 (inference). Tests: Infer hybrid role expectations from combined skill signals
| Model | Top-1 | Chars | Key Find |
|---|---|---|---|
| E5 | 0.900 | 3,098 | Named "AI/ML Full Stack Engineer" title explicitly |
| BGE | 0.737 | 4,202 | Found "Senior AI Full-Stack Engineer (LLMs, RAG, Agentic Systems)" |
| Gemini | 0.855 | 4,067 | Same Senior AI Full-Stack role + NineTwoThree |
| OA-Small | 0.599 | 2,904 | Found same key roles, more concise |
| OA-Large | 0.603 | 3,722 | Found same roles with good detail |
Verdict: E5's 0.900 (highest similarity in the entire benchmark) found the most directly relevant result. But all models converged on the same core roles. When the question maps cleanly to actual job titles in the data, all models work.
Q10 (pattern). Tests: Identify recurring non-technical requirements across retrieved chunks
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.890 | 1,775 |
| BGE | 0.782 | 1,993 |
| Gemini | 0.864 | 2,550 |
| OA-Small | 0.709 | 2,255 |
| OA-Large | 0.628 | 2,108 |
Verdict: Universal agreement: communication, problem-solving, collaboration, adaptability, ownership. E5 was most concise. No quality difference. This is a "flat data" query — the answer is the same everywhere in the corpus.
Q11 (pattern). Tests: Extract co-occurring technical terms in a specific subdomain
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.867 | 2,904 |
| BGE | 0.764 | 2,800 |
| Gemini | 0.848 | 2,993 |
| OA-Small | 0.598 | 2,287 |
| OA-Large | 0.550 | 2,490 |
Verdict: All models: LangChain #1, LangGraph #2, LlamaIndex, Pinecone, ChromaDB, FAISS. Virtually identical. This is another flat data query. The tools are mentioned so frequently that any retrieval finds them.
Q12 (pattern). Tests: Identify perks and cultural signals across multiple listings
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.872 | 2,675 |
| BGE | 0.726 | 2,289 |
| Gemini | 0.814 | 2,403 |
| OA-Small | 0.640 | 2,579 |
| OA-Large | 0.619 | 3,437 |
Verdict: This query is a boilerplate magnet. Benefits text IS boilerplate — 401(k), health insurance, PTO. High similarity scores here likely mean the model matched on the exact content we're about to strip in the cleaning round. E5 found Brazil-specific benefits (Wellhub/Gympass, BetterFly, medicines discounts) which shows it reached real content, not just template text. Watch this query's score change after cleaning — if it drops, the model was riding boilerplate.
Q13 (nuanced-retrieval). Tests: Retrieve based on semantic intent, not keyword overlap with ML terms
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.892 | 6,029 |
| BGE | 0.726 | 3,542 |
| Gemini | 0.848 | 5,920 |
| OA-Small | 0.577 | 4,722 |
| OA-Large | 0.595 | 4,515 |
Verdict: E5 and Gemini produced the longest, most detailed responses (6K chars each) with specific role descriptions and quotes. BGE was shorter but still found the core roles. All models identified Senior Data Quality Engineer, Data Engineer (pipeline resilience), and MLOps roles. E5 and Gemini excelled here — nuanced retrieval rewards high similarity.
Q14 (nuanced-retrieval). Tests: Retrieve on soft cultural signals buried in descriptions
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.893 | 4,474 |
| BGE | 0.717 | 6,397 |
| Gemini | 0.848 | 4,282 |
| OA-Small | 0.603 | 4,588 |
| OA-Large | 0.586 | 4,534 |
Verdict: BGE produced the longest answer (6,397) and specifically found Amazon/AWS's "Earth's Best Employer" standardized mentorship section — which is itself a form of corporate boilerplate. E5 found diverse mentorship patterns across role levels. BGE's verbosity here may partially reflect retrieving repetitive Amazon chunks.
Q15 (nuanced-retrieval). Tests: Distinguish MLOps/deployment focus from training/research focus
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.906 | 3,756 |
| BGE | 0.774 | 4,747 |
| Gemini | 0.882 | 4,222 |
| OA-Small | 0.606 | 4,806 |
| OA-Large | 0.597 | 3,817 |
Verdict: All models found MLOps/ML Platform Engineer as the dominant role. E5 was most concise while hitting all key points (containerized microservices, low-latency inference, Kubernetes). OA-Small was the longest despite the lowest similarity. Consistent quality across the board.
Q16 (analysis). Tests: LLM must reason about market signals and form a recommendation
| Model | Top-1 | Chars |
|---|---|---|
| E5 | 0.895 | 3,088 |
| BGE | 0.770 | 2,969 |
| Gemini | 0.869 | 3,691 |
| OA-Small | 0.677 | 3,743 |
| OA-Large | 0.633 | 3,642 |
Verdict: All models produced tiered skill recommendations (Tier 1: Python, ML frameworks, Cloud; Tier 2: LLMs, RAG, vector DBs; Tier 3: domain specifics). OA-Small's response was the most opinionated — "Python. No exceptions." — which arguably makes it more useful as advice. Convergent content, different personality.
Q17 (analysis). Tests: LLM judges content quality, requires meta-reasoning about the text itself
| Model | Top-1 | Chars | Best JD Found |
|---|---|---|---|
| E5 | 0.863 | 3,492 | n8n/Supabase Automation Role |
| BGE | 0.731 | 4,125 | LLM/RAG/Agent Role |
| Gemini | 0.838 | 2,386 | "vast majority contain virtually no substantive content" |
| OA-Small | 0.453 | 3,984 | Brainlancer |
| OA-Large | 0.416 | 3,432 | EdTech Role |
Verdict: This is the most revealing query in the benchmark. Gemini scored 0.838 similarity but its retrieved chunks were mostly empty headers — it admitted as much. OpenAI Small scored 0.453 (nearly half) but found specific, well-written listings like Brainlancer and produced a better answer. Gemini matched on structural similarity (headers about "job descriptions") while OpenAI matched on semantic content (actual descriptions). This is the strongest evidence that high similarity can mean the model matched on noise, not signal.
Q18 (analysis). Tests: Infer hiring difficulty from requirement complexity and specificity
| Model | Top-1 | Chars | Answer |
|---|---|---|---|
| E5 | 0.871 | 2,966 | Specialized AI/ML Research (PhD + publications) |
| BGE | 0.741 | 3,460 | Salesforce Specialist Roles |
| Gemini | 0.840 | 756 | "I cannot meaningfully answer this question" |
| OA-Small | 0.521 | 4,117 | Salesforce Platform Roles |
| OA-Large | 0.495 | 3,692 | SAP Specialist Roles |
Verdict: Gemini's worst moment. Despite 0.840 top-1 similarity, the retrieved chunks contained only section headers ("REQUIREMENTS", "MISSING SKILLS", "Qualifications") with no actual content. The model honestly refused to hallucinate — which is good behavior from the LLM — but the retrieval completely failed. Meanwhile:
- E5 found actual analysis content (AI reasoning about PhD requirements)
- BGE and OA-Small found Salesforce roles (consistently rated POOR_FIT in the AI analysis sections)
- OA-Large found SAP roles
E5, BGE, and OpenAI models retrieved chunks from the AI Analysis sections which contain the match reasoning data. Gemini retrieved structural headers. This query will be fascinating to compare after cleaning — the AI Analysis sections are being stripped.
The correlation between top-1 similarity and response usefulness is weak for analysis/inference queries. Evidence:
| Query | Gemini Sim | Gemini Quality | OA-Small Sim | OA-Small Quality |
|---|---|---|---|---|
| Q17 | 0.838 | Poor (empty headers) | 0.453 | Good (specific companies) |
| Q18 | 0.840 | Failed (756 chars) | 0.521 | Good (4,117 chars) |
| Q7 | 0.819 | OK (no role names) | 0.527 | Better (comprehensive) |
For queries where the answer exists everywhere in the corpus (Q1, Q2, Q5, Q10, Q11, Q15), all models produce near-identical answers regardless of similarity scores. These queries don't differentiate embedding quality.
Flat queries: Q1, Q2, Q4, Q5, Q10, Q11, Q15, Q16 (8 of 18)
Discriminating queries: Q3, Q6, Q7, Q8, Q9, Q12, Q13, Q14, Q17, Q18 (10 of 18)
On Q3 (interview process), each model found different companies and interview structures. No single model captured the full picture. This suggests that running the same query across multiple embedding models and aggregating results would outperform any single model.
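One straightforward way to do that aggregation, assuming chunk ids are shared across the per-model indexes, is reciprocal rank fusion (RRF): each model contributes a score that decays with rank, so a chunk near the top of any list survives. This is a generic sketch, not part of the benchmark harness; `k=60` is the conventional damping constant from the RRF literature.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids via reciprocal rank fusion.

    Each chunk scores sum(1 / (k + rank + 1)) over the lists it appears
    in; chunks found by multiple models accumulate more score.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "b" is ranked near the top of both lists, so it wins the merge even
# though it is never first in either one.
merged = rrf_merge([["a", "b", "c"], ["b", "c", "a"]])
```

For Q3 specifically, this would have surfaced BGE's talent-pool listings alongside OpenAI's QuintoAndar/CI&T chunks in one context window.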
Q18's divergence reveals that E5, BGE, and OpenAI models are retrieving chunks from the AI Analysis sections (Copilot reasoning, match scores, gap analysis). These sections contain rich analytical content but are pipeline artifacts — not actual job description text. After cleaning removes them, we'll see if models can still answer analytical queries from raw job content alone.
Gemini retrieved headers and structural fragments on Q17 and Q18 because those headers semantically match the query terms ("requirements", "job descriptions") without containing useful content. This is a known limitation of high-dimensional semantic matching on short, generic text fragments.
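A cheap mitigation is to drop header-only fragments before they ever enter the index. The heuristic below is a sketch: the word-count and uppercase-ratio thresholds are guesses, not values tuned on this corpus, but it would catch the exact fragments Gemini matched on ("REQUIREMENTS", "MISSING SKILLS", "Qualifications").

```python
def is_structural_header(chunk: str, min_words: int = 8) -> bool:
    """Flag short or mostly-uppercase fragments as structural headers.

    Heuristic only: very short chunks and all-caps section titles carry
    query-matching words but no retrievable content.
    """
    text = chunk.strip()
    if len(text.split()) < min_words:
        return True
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.6:
        return True
    return False

chunks = [
    "REQUIREMENTS",
    "MISSING SKILLS",
    "We are looking for a senior engineer with 5+ years of Python "
    "experience to build RAG systems.",
]
kept = [c for c in chunks if not is_structural_header(c)]  # only the real sentence survives
```

The tradeoff is obvious: an aggressive word-count floor also discards legitimate short content (a one-line perk, a terse requirement), so the thresholds deserve a pass over real chunks before adopting anything like this.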
BGE consistently produced the longest answers with the most quoted material. Its lower similarity scores may actually help by pulling more diverse chunks rather than the most similar ones — which for broad analytical queries, surfaces more varied content.
- E5 — Best balance of concision and specificity. Rarely wrong, never wastes words. Found the most specific roles and real data points.
- BGE — Most thorough. Longer answers but grounded in actual quoted content. Sometimes retrieves repetitive chunks (Amazon boilerplate on Q14) but generally strong.
- OpenAI Small — Surprisingly good answers despite the lowest similarity band. Opinionated, specific, and occasionally better than Gemini. The 0.45-0.67 similarity range didn't hurt response quality on most queries.
- OpenAI Large — Similar to Small. Found different content (SAP roles on Q18, EdTech on Q17). No clear advantage over Small in response quality despite 3072 vs 1536 dimensions.
- Gemini — Highest highs (Q13: 5,920 chars of excellent detail) and lowest lows (Q18: 756-char refusal). The structural matching problem on analytical queries is a real liability. When it works, it works well. When it doesn't, it fails completely.
- Q12 (Benefits) scores will drop across all models — benefits text IS the boilerplate being removed
- Q18 will be the biggest test — AI Analysis sections are being stripped. Models that relied on them (E5, BGE, OpenAI) will either find the signal in raw job content or degrade
- Gemini might improve on Q17/Q18 — with structural noise removed, fewer empty headers to match on
- Flat queries (Q1, Q2, Q10, Q11) will barely change — the signal is in the actual job content, not the boilerplate
- E5's spread may widen — if its tight 0.025 spread was partially inflated by shared boilerplate, cleaning will expose real variance
Data change: 51,545 vectors down to 33,409 (HF models) / 36,523 (Gemini). ~35% chunk reduction. Same 5,559 files, same chunking.
| Model | Baseline Top-1 | Clean Top-1 | Baseline Spread | Clean Spread |
|---|---|---|---|---|
| E5 | 0.879 | 0.872 | 0.025 | 0.034 |
| Gemini | 0.848 | 0.850 | 0.040 | 0.045 |
| BGE | 0.742 | 0.724 | 0.073 | 0.069 |
| OA-Small | 0.598 | 0.582 | 0.097 | 0.104 |
| OA-Large | 0.571 | 0.561 | 0.096 | 0.099 |
- Q12 (Benefits) scores will drop — PARTIALLY CONFIRMED. E5 Q12 dropped from 0.872 to 0.867. BGE held steady at 0.727. The drop was smaller than expected because benefits content is embedded in actual job descriptions, not just boilerplate sections. The model still finds benefits language in the remaining text.
- Q18 will be the biggest test — CONFIRMED for Gemini, MIXED for others. Gemini Q18 stayed at 0.840 top-1 but still produced a short answer (1,113 chars vs 756 baseline). E5 Q18 dropped from 0.871 to 0.852 but still produced a good answer (4,387 chars). The AI Analysis sections being removed didn't kill analytical queries — models found signal in the raw job content.
- Gemini might improve on Q17/Q18 — PARTIALLY CONFIRMED. Gemini Q17 improved from 0.838 to 0.855 and answer length jumped from 2,386 to 3,523 chars. Q18 stayed flat at 0.840 but answer length went from 756 to 1,113. The structural header matching problem is slightly reduced but not eliminated.
- Flat queries barely change — CONFIRMED. Q1 E5: 0.882 to 0.879. Q10 E5: 0.890 to 0.888. Q11 E5: 0.867 to 0.853. The signal for broad pattern queries exists throughout the corpus regardless of boilerplate.
- E5's spread will widen — CONFIRMED. E5 spread went from 0.025 to 0.034 (+36%). FLAT query count dropped from 6/18 to 1/18. This was the most significant improvement from cleaning: E5 went from heavily compressed to functionally discriminating.
Token counts went up across all models. E5 jumped from 25,992 to 36,099 avg tokens/query (+39%). Fewer chunks total, but denser content per chunk means each of the 200 retrieval slots now carries more actual information. The LLM gets more real content per query.
BGE is the exception. Its spread actually narrowed (0.073 to 0.069). BGE was using boilerplate content productively on certain queries — Amazon mentorship boilerplate on Q14, benefits text on Q12. Removing that content removed signal for those specific questions.
Gemini's structural matching problem persists. Even with cleaned data, Q18 produced only 1,113 chars. The model matches on structural patterns (headers, section titles) that exist in both clean and dirty data. This is a model-level limitation, not a data quality issue.
If showing 3-4 queries to illustrate the difference:
- Q18 (Hardest roles) — Gemini's failure vs everyone else's success. Most dramatic divergence.
- Q17 (Best-written JDs) — Low similarity beating high similarity in usefulness.
- Q3 (Interview process) — Each model found different content. Shows embedding diversity value.
- Q12 (Benefits) — Boilerplate indicator. Compare baseline vs cleaned to show cleaning impact.