Merged
2 changes: 2 additions & 0 deletions .env.example
@@ -3,3 +3,5 @@ OPENAI_API_KEY=
# OPENAI_MODEL=gpt-4o-mini
# ORCHESTRATE_SEED=42
# ORCHESTRATE_MAX_FIELD_CHARS=200000
# Escalate when one ticket references multiple product ecosystems (HackerRank+Claude, etc.)
# ORCHESTRATE_DISABLE_CROSS_ECOSYSTEM_ESCALATE=1
117 changes: 103 additions & 14 deletions README.md
@@ -6,6 +6,92 @@ Build a terminal-based AI agent that triages real support tickets across three p

Read [`problem_statement.md`](./problem_statement.md) for the full task spec, input/output schema, and allowed values, and [`evaluation_criteria.md`](./evaluation_criteria.md) for how submissions are scored.

---

## Setup (evaluators — primary instructions)

**Environment:** Python **3.11+**. Work from the **repository root** (the folder that contains this `README.md`).

| Step | Action |
|------|--------|
| **1. Dependencies** | `pip install -r code/requirements.txt` |
| **2. Secrets** | Copy `.env.example` → `.env` at the repo root. Set `OPENAI_API_KEY` if you use the LLM path, **or** set `ORCHESTRATE_DISABLE_LLM=1` for a fully offline run (no API calls). Never commit `.env`. |
| **3. Run the agent** | `python code/main.py` — reads `support_tickets/support_tickets.csv`, writes `support_tickets/output.csv`. Use `python code/main.py --help` for `--input`, `--output`, `--limit`. |
| **4. Regression check** | From `code/`: `ORCHESTRATE_DISABLE_LLM=1 python run_eval.py --offline` (optional). Full smoke: `pwsh -File scripts/verify_local.ps1` or `bash scripts/verify_local.sh` from repo root. |

**Cross-platform:** Prefer `python code/main.py` from the repo root on **Windows, Linux, and macOS**. Avoid `python -m code` on Unix-like systems (stdlib name clash). Details: [`code/README.md`](./code/README.md).

---

## Approach overview

This submission implements **offline retrieval-augmented triage** over the bundled markdown corpus in **`data/`** (no live web search for answer facts):

1. **Retrieval:** Hybrid **BM25 + TF-IDF** fusion with lexical reranking to fetch relevant support chunks (`code/retrieve.py`).
2. **Routing & safety:** Regex **risk** escalation and **cross-ecosystem** detection (mixed vendors in one ticket) before answer generation (`risk.py`, `cross_ecosystem.py`).
3. **Taxonomy:** Stable **`product_area`** labels aligned to corpus structure (`taxonomy.py`).
4. **Answer generation:** Optional **OpenAI** chat with **JSON over retrieved context only** (`openai_agent.py`); if the API is missing or `ORCHESTRATE_DISABLE_LLM=1`, **offline synthesis** builds replies from retrieved text (`answer_synthesis.py`).
5. **Grounding:** Post-generation lexical overlap and numeric guards (`grounding.py`, `postprocess.py`).
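
The overview above names a hybrid BM25 + TF-IDF fusion but does not say how the two rankings are combined. As a hedged illustration only, the sketch below uses reciprocal rank fusion (RRF), one common way to merge ranked lists. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the document ids are assumptions for this example and are not taken from `code/retrieve.py`.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one scored ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical rankings from two retrievers over the same corpus.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
tfidf_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, tfidf_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Rank-based fusion sidesteps the need to put BM25 and TF-IDF scores on a common scale; a weighted sum of normalized scores would be an equally plausible reading of "fusion" here.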

Deeper design decisions and trade-offs: [`docs/decisions.md`](./docs/decisions.md). Interview / limits: [`docs/interview.md`](./docs/interview.md), [`docs/scope_and_limits.md`](./docs/scope_and_limits.md).

---

## Packaging a ZIP (full project + this README)

The challenge may ask for your **complete working project** and a **README** with setup and approach — that is this **root `README.md`**, not only `code/README.md`.

**Recommended (clean, no secrets, no `.git` folder):** from the repo root, archive **tracked** files only:

```bash
git archive --format=zip -o ../hackerrank-orchestrate-submission.zip HEAD
```

Or run **`scripts/make_submission_zip.sh`** / **`scripts/make_submission_zip.ps1`** (same idea; writes next to the repo folder).

That typically includes `README.md`, `AGENTS.md`, `problem_statement.md`, `evaluation_criteria.md`, `code/`, `data/`, `support_tickets/`, `docs/`, `scripts/`, `.github/`, etc.—whatever is **committed**. Untracked junk (e.g. `.venv`, `code/.cache`) stays out if not committed.

**Do not** put API keys in the zip (never commit `.env`).
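
A quick way to act on the warning above is to scan the archive before uploading. The following is a minimal sketch using only the standard library; `find_forbidden_entries` and the two deny-lists are hypothetical names chosen for this example, and the archive is built in memory so the snippet runs without the repository.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed deny-lists for illustration; adjust to the actual project.
FORBIDDEN_NAMES = {".env"}
FORBIDDEN_DIRS = {".git", ".venv"}


def find_forbidden_entries(archive) -> list[str]:
    """Return archive members that look like secrets or local-only folders."""
    bad: list[str] = []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            parts = PurePosixPath(name).parts
            if not parts:
                continue
            if parts[-1] in FORBIDDEN_NAMES or FORBIDDEN_DIRS.intersection(parts):
                bad.append(name)
    return bad


# Build a small in-memory archive that deliberately contains a stray .env file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("README.md", "# demo")
    zf.writestr(".env.example", "OPENAI_API_KEY=")
    zf.writestr(".github/workflows/ci.yml", "name: ci")
    zf.writestr(".env", "OPENAI_API_KEY=placeholder")
buf.seek(0)

leaked = find_forbidden_entries(buf)
print(leaked)
```

Passing a filesystem path instead of the in-memory buffer should work the same way, since `zipfile.ZipFile` accepts either.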

**If the platform requires a `code/`-only zip** instead, zip the `code/` directory — and **add a copy of the sections [Setup](#setup-evaluators--primary-instructions) and [Approach](#approach-overview) into `code/README.md`** so reviewers still see setup + approach in one place.

**Predictions** (`output.csv`) are often uploaded **separately** on HackerRank—follow the live submission page.

---

### Evaluation criteria (`evaluation_criteria.md`) — what this repo covers vs what you must bring

| Dimension | What the repo already supports | What you still own |
|-----------|----------------------------------|-------------------|
| **1. Agent Design** | Clear pipeline (`retrieve.py`, `openai_agent.py`, `postprocess.py`, `risk.py`, `taxonomy.py`), pinned `requirements.txt`, tests, CI, [`docs/decisions.md`](./docs/decisions.md) | Explaining trade-offs and alternatives in the **AI Judge interview** |
| **2. AI Judge Interview** | Prep in [`docs/interview.md`](./docs/interview.md), [`docs/demo-script.md`](./docs/demo-script.md) | Showing up, demonstrating depth, honesty about AI assistance |
| **3. Output CSV** | `main.py` → `support_tickets/output.csv`; run [`scripts/verify_local.ps1`](./scripts/verify_local.ps1) / [`scripts/verify_local.sh`](./scripts/verify_local.sh) before upload | Regenerating predictions on the final `support_tickets.csv`; hidden-set accuracy is scored by the platform |
| **4. AI Fluency (transcript)** | [`AGENTS.md`](./AGENTS.md) instructs tools to log turns to `%USERPROFILE%\hackerrank_orchestrate\log.txt` (Windows) / `$HOME/hackerrank_orchestrate/log.txt` (Unix) | **You** must collaborate visibly with intent—scoped prompts, critique, architectural steering—not blind acceptance |

**If many teams “meet” the bar — how is one winner chosen?** The public docs **do not publish exact weights or tie-break rules**. Typically: scores from **each dimension are combined** into a final score; **Output CSV** quality on **held-out rows** usually moves the leaderboard the most; **Interview** and **transcript** differentiate teams when numeric scores are close. Perfect ties across *all* dimensions are unlikely—small CSV differences still rank-order. For anything not specified here, treat **official platform / organizer communications** as source of truth.

**Baseline snapshot (Phase 0):** after meaningful routing/retrieval changes, run `python scripts/capture_baseline.py` from the repo root to refresh [`docs/superpowers/BASELINE.md`](./docs/superpowers/BASELINE.md) (git SHA, pytest count, sample routing %). **Per-row routing diff:** from `code/` after `ORCHESTRATE_DISABLE_LLM=1 python run_eval.py --offline`, run `python eval_sample.py --pred ../support_tickets/sample_pred.csv --routing-detail`.

**Verify before submit (matches CI + full offline batch):** from repo root, run `bash scripts/verify_local.sh` or `pwsh -File scripts/verify_local.ps1`. This installs `code/requirements.txt`, runs `main.py --help`, `pytest`, `run_eval.py --offline`, then `main.py --limit 0` with the LLM off. Set `VERIFY_SKIP_FULL_BATCH=1` to stop after the sample regression (faster). On macOS/Linux, `chmod +x scripts/*.sh` if needed. This does **not** prove hidden-test accuracy—only that the pipeline is healthy.

**Problem statement alignment (what this repo implements):**

| Requirement (`problem_statement.md`) | How it is addressed |
|--------------------------------------|---------------------|
| Terminal-based agent | `code/main.py` CLI; run via `python code/main.py` or `scripts/run_agent.*` |
| HackerRank / Claude / Visa | Corpus under `data/` per brand; retrieval uses brand mask + `infer_brand` when `Company` is `None` |
| Only provided corpus for answers | Retrieval from `data/` only; LLM (if enabled) is given **retrieved** chunks as context, not live web search |
| Request type, product area, reply vs escalate, justification | Output columns + `taxonomy.py`, `postprocess.py`, `risk.py`, `cross_ecosystem.py` |
| Retrieve relevant docs | Hybrid BM25 + TF-IDF fusion + rerank (`retrieve.py`) |
| Safe / grounded responses | Grounding overlap + numeric guard (`grounding.py`, `postprocess.py`); offline synthesis when LLM off |
| Escalate high-risk / sensitive | Regex risk routes before generation; low-retrieval flag; cross-ecosystem escalation |
| Handle noise / multi-topic / malicious-ish text | Invalid small-talk heuristics; risk patterns; multi-topic note (`ticket_hints.py`) |
| CSV input → CSV output | `csv_io.py`; writes `response`, `product_area`, `status`, `request_type`, `justification` |

**Cross-platform:** CI runs on **Ubuntu** (`python main.py` from `code/`). Use **`python code/main.py`** from the repo root on all OSes—avoid **`python -m code`** on Linux/macOS (stdlib `code` module name clash). **Windows:** use PowerShell scripts or `python code\main.py`. Same Python **3.11+** and `pip install -r code/requirements.txt` everywhere; keep `data/` next to `code/` as in the repo layout.

**Offline routing check:** with `ORCHESTRATE_DISABLE_LLM=1`, `cd code && python run_eval.py --offline` should show **100%** exact match on `status`, `request_type`, and `product_area` for the bundled sample (response text differs when the LLM is off).
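
To make "exact match" concrete, here is a small sketch of a per-column match rate. It is an illustration and not the logic in `run_eval.py`; the helper name `exact_match_rate` and the sample rows are invented, and the case-insensitive comparison is an assumption.

```python
def exact_match_rate(gold_rows: list[dict], pred_rows: list[dict], column: str) -> float:
    """Fraction of rows whose `column` values agree after trimming and lowercasing."""
    if len(gold_rows) != len(pred_rows):
        raise ValueError("gold and predicted rows must align one-to-one")
    if not gold_rows:
        return 0.0
    hits = sum(
        1
        for gold, pred in zip(gold_rows, pred_rows)
        if gold[column].strip().lower() == pred[column].strip().lower()
    )
    return hits / len(gold_rows)


# Invented rows; the second prediction disagrees on product_area only.
gold = [
    {"status": "replied", "request_type": "product_issue", "product_area": "screen"},
    {"status": "escalated", "request_type": "bug", "product_area": "privacy"},
]
pred = [
    {"status": "Replied", "request_type": "product_issue", "product_area": "screen"},
    {"status": "escalated", "request_type": "bug", "product_area": "community"},
]

rates = {col: exact_match_rate(gold, pred, col) for col in ("status", "request_type", "product_area")}
print(rates)
```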

### Start here (run the bundled agent)

From the **repository root** (after `pip install -r code/requirements.txt`):
@@ -27,14 +113,15 @@ Optional offline-only: set `ORCHESTRATE_DISABLE_LLM=1`, then run one of the abov

## Contents

-1. [Repository layout](#repository-layout)
-2. [What you need to build](#what-you-need-to-build)
-3. [Where your code goes](#where-your-code-goes)
-4. [Quickstart](#quickstart)
-5. [Chat transcript logging](#chat-transcript-logging)
-6. [Submission](#submission)
-7. [Judge interview](#judge-interview)
-8. [Evaluation criteria](#evaluation-criteria)
+1. [Setup](#setup-evaluators--primary-instructions) · [Approach](#approach-overview) · [Packaging a ZIP](#packaging-a-zip-full-project--this-readme)
+2. [Repository layout](#repository-layout)
+3. [What you need to build](#what-you-need-to-build)
+4. [Where your code goes](#where-your-code-goes)
+5. [Quickstart](#quickstart)
+6. [Chat transcript logging](#chat-transcript-logging)
+7. [Submission](#submission)
+8. [Judge interview](#judge-interview)
+9. [Evaluation criteria](#evaluation-criteria)

---

@@ -46,7 +133,7 @@ Optional offline-only: set `ORCHESTRATE_DISABLE_LLM=1`, then run one of the abov
├── problem_statement.md # Full task description and I/O schema
├── README.md # You are here
├── docs/ # decisions.md, interview prep, demo script, dev rubric
-├── scripts/ # run_agent.sh / run_agent.ps1 (repo-root invocation)
+├── scripts/ # run_agent.*, verify_local.*, make_submission_zip.*
├── code/ # Participant agent (see code/README.md)
│ ├── main.py # CLI entry: reads CSV, writes predictions
│ ├── retrieve.py # Hybrid retrieval + reranking
@@ -117,7 +204,7 @@ python code/main.py

This writes `support_tickets/output.csv`. **Regression:** `cd code` then `python run_eval.py --offline` (compares to `sample_support_tickets.csv`).

-Submission expects a **zip of `code/` only** (no `data/` in the zip); evaluators use their own corpus copy. Your **`output.csv`** is uploaded separately.
+For **ZIP packaging**, see [Packaging a ZIP](#packaging-a-zip-full-project--this-readme). Historically some docs mentioned zipping only `code/`; **follow the current submission UI**—often the **full starter-repo layout** (including `data/` and this README) is what “complete project” means.

---

@@ -139,11 +226,13 @@ You don't need to do anything to enable it — just use your AI tool normally. Y
Submit on the HackerRank Community Platform:
<https://www.hackerrank.com/contests/hackerrank-orchestrate-may26/challenges/support-agent/submission>

-You will upload **three** files:
+You will typically upload **three** artifacts:

+1. **Code / project zip** — Often the **full repository** (this README + `code/` + `data/` + …); see [Packaging a ZIP](#packaging-a-zip-full-project--this-readme). Exclude secrets and local venvs (`git archive` helps).
+2. **Predictions CSV** — agent output for `support_tickets/support_tickets.csv` (usually `output.csv`), **if** the platform asks for it separately.
+3. **Chat transcript** — `log.txt` from [Chat transcript logging](#chat-transcript-logging), **if** required.
+
-1. **Code zip** — zip your `code/` directory and upload it. Exclude virtualenvs, `node_modules`, build artifacts, the `data/` corpus, and the `support_tickets/` CSVs.
-2. **Predictions CSV** — your agent's output for `support_tickets/support_tickets.csv` (i.e. the populated `output.csv`).
-3. **Chat transcript** — the `log.txt` from the path in [Chat transcript logging](#chat-transcript-logging).
+Always confirm fields on the **live submission page**—wording can change between rounds.

---

2 changes: 2 additions & 0 deletions code/README.md
@@ -1,5 +1,7 @@
# Support triage agent (Orchestrate)

> **Evaluators:** primary **setup + approach overview** for the whole submission is the **repository root** [`../README.md`](../README.md). This file focuses on `code/` module details and flags.

Terminal agent that reads `support_tickets/support_tickets.csv`, retrieves grounded snippets from the offline `data/` corpus (**BM25 + TF‑IDF fusion + lexical rerank**), applies risk-based escalation rules + taxonomy mapping, and writes predictions to `support_tickets/output.csv`.

**Design rationale & decision flowchart:** [`../docs/decisions.md`](../docs/decisions.md). **Interview / demo / rubric:** [`../docs/interview.md`](../docs/interview.md), [`../docs/demo-script.md`](../docs/demo-script.md), [`../docs/DEV_EVAL.md`](../docs/DEV_EVAL.md).
14 changes: 14 additions & 0 deletions code/conftest.py
@@ -0,0 +1,14 @@
"""Shared pytest fixtures (session-scoped retrieval index)."""
from __future__ import annotations

import pytest

from config import CACHE_PATH, DATA_DIR
from retrieve import BM25Index


@pytest.fixture(scope="session")
def bm25_index_session() -> BM25Index:
    if not DATA_DIR.is_dir():
        pytest.skip(f"Corpus missing: {DATA_DIR}")
    return BM25Index.load(CACHE_PATH, DATA_DIR)
52 changes: 52 additions & 0 deletions code/cross_ecosystem.py
@@ -0,0 +1,52 @@
"""Detect tickets that span multiple distinct product ecosystems — safer to escalate than guess one answer."""
from __future__ import annotations

import os
import re


def cross_ecosystem_escalation_reason(issue: str, subject: str) -> str | None:
    """Return human-readable escalate reason, or None.

    Conservative pairwise checks avoid false positives such as "HackerRank visa sponsorship"
    (mentions Visa immigration language without Visa-the-network product context).
    Disable entirely with ``ORCHESTRATE_DISABLE_CROSS_ECOSYSTEM_ESCALATE=1``.
    """
    if os.environ.get("ORCHESTRATE_DISABLE_CROSS_ECOSYSTEM_ESCALATE", "").strip().lower() in {
        "1",
        "true",
        "yes",
        "y",
    }:
        return None

    blob = f"{subject}\n{issue}".strip()
    low = blob.lower()

    has_hr = bool(re.search(r"\bhackerrank\b", low))
    has_claude = bool(re.search(r"\bclaude\b|\banthropic\b", low))
    # Visa Inc. product context (cards/travel/payment), not generic immigration "visa".
    has_visa_financial = bool(
        re.search(r"\bvisa\b", low)
        and re.search(
            r"\b(card|cards|credit|debit|cheque|cheques|gcas|lost|stolen|"
            r"traveller|traveler|payment|pin|atm|fraud|chargeback)\b",
            low,
        )
    )

    tags: list[str] = []
    if has_hr and has_claude:
        tags.append("HackerRank + Claude/Anthropic")
    if has_hr and has_visa_financial:
        tags.append("HackerRank + Visa payment/travel")
    if has_claude and has_visa_financial:
        tags.append("Claude + Visa payment/travel")

    if not tags:
        return None
    return (
        "Multiple distinct product ecosystems in one ticket ("
        + "; ".join(tags)
        + "); escalating for human routing."
    )
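
To see how the pairwise checks behave on concrete tickets, the sketch below condenses the same regular expressions into a standalone helper so it runs without the repository. The name `detect_pairs` and the sample ticket texts are hypothetical; the patterns are copied from the function above.

```python
import re

# Same financial-context keywords as in cross_ecosystem.py.
FINANCIAL = re.compile(
    r"\b(card|cards|credit|debit|cheque|cheques|gcas|lost|stolen|"
    r"traveller|traveler|payment|pin|atm|fraud|chargeback)\b"
)


def detect_pairs(text: str) -> list[str]:
    """Return the ecosystem pairs that the pairwise checks would flag."""
    low = text.lower()
    has_hr = bool(re.search(r"\bhackerrank\b", low))
    has_claude = bool(re.search(r"\bclaude\b|\banthropic\b", low))
    has_visa_financial = bool(re.search(r"\bvisa\b", low)) and bool(FINANCIAL.search(low))

    pairs: list[str] = []
    if has_hr and has_claude:
        pairs.append("HackerRank + Claude/Anthropic")
    if has_hr and has_visa_financial:
        pairs.append("HackerRank + Visa payment/travel")
    if has_claude and has_visa_financial:
        pairs.append("Claude + Visa payment/travel")
    return pairs


samples = [
    "Does HackerRank offer visa sponsorship for candidates?",
    "My HackerRank test froze, and separately my Visa card was stolen.",
    "I lost my HackerRank login and need visa sponsorship info.",
]
for ticket in samples:
    print(detect_pairs(ticket), "<-", ticket)
```

The third sample shows a limit of the keyword approach: a financial keyword such as "lost" anywhere in the ticket satisfies the Visa check, so an immigration question can still be flagged.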
26 changes: 26 additions & 0 deletions code/eval_sample.py
@@ -28,6 +28,11 @@ def main() -> None:
    ap.add_argument("--sample", type=str, default=str(Path("..") / "support_tickets" / "sample_support_tickets.csv"))
    ap.add_argument("--pred", type=str, default=str(Path("..") / "support_tickets" / "output.csv"))
    ap.add_argument("--report", type=str, default=str(Path("..") / "support_tickets" / "sample_eval_report.csv"))
    ap.add_argument(
        "--routing-detail",
        action="store_true",
        help="Print per-row gold vs predicted routing (status, request_type, product_area).",
    )
    args = ap.parse_args()

    try:
@@ -90,6 +95,27 @@ def exact_acc(gold: str, pred_col: str) -> float:
    mism = merged[merged["Status"] != merged["Pred_Status"]][key_cols + ["Status", "Pred_Status"]]
    print(f"\nStatus mismatches: {len(mism)}")

    if args.routing_detail:
        print("\n=== Per-row routing (gold vs pred) ===")
        for i, r in merged.iterrows():
            subj = str(r.get("Subject", ""))[:60]
            g_st = str(r.get("Status", "")).strip()
            p_st = str(r.get("Pred_Status", "")).strip()
            g_rt = str(r.get("Request Type", "")).strip()
            p_rt = str(r.get("Pred_Request Type", "")).strip()
            g_pa = str(r.get("Product Area", "")).strip()
            p_pa = str(r.get("Pred_Product Area", "")).strip()
            ok = (
                _norm_status(g_st) == _norm_status(p_st)
                and g_rt.lower() == p_rt.lower()
                and g_pa.lower() == p_pa.lower()
            )
            mark = "OK" if ok else "MISMATCH"
            print(f"[{mark}] row={i} subject={subj!r}…")
            print(f" status: gold={g_st!r} pred={p_st!r}")
            print(f" request_type: gold={g_rt!r} pred={p_rt!r}")
            print(f" product_area: gold={g_pa!r} pred={p_pa!r}")

    report_cols = key_cols + [
        "Status",
        "Pred_Status",
5 changes: 5 additions & 0 deletions code/main.py
@@ -11,6 +11,7 @@
import pandas as pd

from config import DATA_DIR, INPUT_CSV, MAX_FIELD_CHARS, OUTPUT_CSV, SEED, TOP_K
from cross_ecosystem import cross_ecosystem_escalation_reason
from csv_io import TicketCsvError, canonicalize_ticket_columns, read_tickets_csv
from openai_agent import decide_with_openai, fallback_from_hits
from postprocess import finalize_decision
@@ -127,6 +128,10 @@ def process_row(row: pd.Series, index: BM25Index) -> dict[str, Any]:
fb["request_type"] = hit.force_request_type
return _validate_row(fb)

    eco = cross_ecosystem_escalation_reason(issue, subject)
    if eco:
        return _validate_row(fallback_from_hits([], escalated=True, esc_reason=eco, low_retrieval=False))

    hits, raw_top_score = index.search(f"{subject}\n{issue}", brand, TOP_K)
    hits = rerank_hits(f"{subject}\n{issue}", hits)
    low = should_escalate_low_retrieval(raw_top_score)
1 change: 1 addition & 0 deletions code/openai_agent.py
@@ -70,6 +70,7 @@ def fallback_from_hits(
{"status":"replied"|"escalated","product_area":"string","response":"string","justification":"string","request_type":"product_issue"|"feature_request"|"bug"|"invalid"}
Rules:
- status=escalated for fraud, legal threats, account takeover, grading disputes, bug bounty reports needing security team, or when CONTEXT lacks needed facts.
- If the ticket mixes unrelated products (e.g. HackerRank assessment workflow AND Visa card dispute in one message), status=escalated — humans must split routing.
- product_area: short snake_case like sample outputs (e.g. screen, community, privacy, travel_support). Prefer last breadcrumb or doc topic from CONTEXT.
- request_type: bug if outage/errors; feature_request for new capability; invalid for spam/thanks/off-topic; else product_issue.
- response: concise, user-facing, only facts supported by CONTEXT. If status=replied, no fabricated steps.
4 changes: 4 additions & 0 deletions code/pytest.ini
@@ -0,0 +1,4 @@
[pytest]
# On some Windows/Python builds the pdb hook imports stdlib `code`; this repo's top-level
# package folder is also named `code`, which can shadow the stdlib module and break pytest startup.
addopts = -p no:debugging
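
When diagnosing the name clash described in the comment above, it can help to check where a module name actually resolves. The sketch below is a diagnostic aid under stated assumptions, not part of the PR: `resolves_to_stdlib` is a hypothetical helper, and it treats a module as standard library only if its file lives under the interpreter's stdlib directory.

```python
import importlib.util
import sysconfig
from pathlib import Path


def resolves_to_stdlib(module_name: str) -> bool:
    """Return True if `module_name` would be imported from the stdlib directory."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return False  # not importable, or a namespace package without a file
    origin = Path(spec.origin)
    if not origin.is_absolute():
        return False  # e.g. "built-in" or "frozen" modules have no file path
    stdlib_dir = Path(sysconfig.get_paths()["stdlib"]).resolve()
    return stdlib_dir in origin.resolve().parents


# Run from the directory where pytest is launched to see which `code` wins.
print("code resolves to stdlib:", resolves_to_stdlib("code"))
```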