
---
title: CustomerSupportTriage-v0
emoji: 🎫
colorFrom: blue
colorTo: teal
sdk: docker
app_port: 7860
tags:
  - openenv
  - customer-support
  - agent-benchmark
  - nlp
  - real-world
pinned: false
---

# CustomerSupportTriage-v0

An OpenEnv-compliant environment for benchmarking AI agents on real-world customer support triage.

## Overview

The agent receives a queue of customer support tickets and must, for each one:

  1. **Assign a priority** — low / medium / high / urgent
  2. **Route to a department** — billing / technical / shipping / returns / general / escalation
  3. **Draft a customer reply** — 1–3 professional, empathetic sentences
  4. **Flag for human review** — `true` for legal threats, security incidents, accessibility issues

This mirrors the actual workflow of a Tier-1 support agent at a SaaS company.
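For illustration, the four per-ticket decisions can be sketched as a toy rule-based baseline. The keyword lists, department hints, and canned reply below are invented for this sketch; a real agent would draft replies with an LLM:

```python
# Toy rule-based triage baseline. Keyword lists are illustrative only;
# substring matching like this is deliberately naive.
URGENT_WORDS = {"outage", "down", "breach", "data loss"}
REVIEW_WORDS = {"lawsuit", "legal", "gdpr", "ada", "wcag", "security"}
DEPARTMENT_HINTS = {
    "invoice": "billing",
    "refund": "returns",
    "delivery": "shipping",
    "error": "technical",
}

def triage(ticket):
    """Produce one triage action for a ticket from the observation queue."""
    text = (ticket["subject"] + " " + ticket["body"]).lower()
    priority = "urgent" if any(w in text for w in URGENT_WORDS) else "medium"
    department = next(
        (dept for hint, dept in DEPARTMENT_HINTS.items() if hint in text),
        "general",
    )
    return {
        "ticket_id": ticket["ticket_id"],
        "priority": priority,
        "department": department,
        "response": "Thanks for reaching out -- we're looking into this now.",
        "needs_human": any(w in text for w in REVIEW_WORDS),
    }
```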


## Tasks

| Task   | Tickets | Difficulty | Description |
|--------|---------|------------|-------------|
| easy   | 5       | Easy       | Unambiguous signals — clear priority, obvious department |
| medium | 10      | Medium     | Multi-issue bodies, ambiguous routing, partial overlap |
| hard   | 15      | Hard       | Misleading sentiment, legal edge-cases, downplayed urgency, security incidents |

### Hard task examples

- A ticket starting "Thanks for the quick response!" that describes a production outage
- A GDPR deletion request that also involves a billing dispute
- An accessibility complaint citing ADA/WCAG legal requirements
- An enterprise customer calmly describing a catastrophic data migration failure

## Action & Observation Spaces

### Observation

```json
{
  "queue": [
    {
      "ticket_id": "H001",
      "subject": "...",
      "body": "...",
      "customer_name": "Olivia Park",
      "customer_tier": "enterprise",
      "created_at": "2024-03-15T06:00:00Z",
      "sentiment": "positive",
      "tags": ["sso", "outage", "enterprise"]
    }
  ],
  "processed": 3,
  "total_tickets": 15,
  "task_name": "hard",
  "step_number": 2,
  "time_remaining": 13
}
```

### Action

```json
{
  "actions": [
    {
      "ticket_id": "H001",
      "priority": "urgent",
      "department": "technical",
      "response": "I'm treating this as a P0 incident...",
      "needs_human": true,
      "reasoning": "Enterprise SSO down = production blocker"
    }
  ]
}
```
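As a convenience, a client could validate actions against the documented value sets before submitting them. The helper below is a hypothetical sketch, not part of the environment API:

```python
# Hypothetical client-side helper: checks an action against the
# priority and department values documented above before POST /step.
PRIORITIES = {"low", "medium", "high", "urgent"}
DEPARTMENTS = {"billing", "technical", "shipping",
               "returns", "general", "escalation"}

def make_action(ticket_id, priority, department, response,
                needs_human=False, reasoning=""):
    """Build one entry for the "actions" list, validating the enums."""
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    if department not in DEPARTMENTS:
        raise ValueError(f"unknown department: {department!r}")
    return {
        "ticket_id": ticket_id,
        "priority": priority,
        "department": department,
        "response": response,
        "needs_human": bool(needs_human),
        "reasoning": reasoning,
    }
```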

## Reward Function

Each ticket is graded by a deterministic grader:

| Component | Weight | Description |
|---|---|---|
| Priority accuracy | 30% | Exact match = 1.0; off by one level = 0.5 |
| Routing accuracy | 30% | Exact match = 1.0; adjacent department = 0.3 |
| Response quality | 25% | Keyword coverage from ground-truth model answers |
| Escalation correctness | 15% | Correct `needs_human` flag |

Partial credit is awarded throughout — the reward signal is dense, not sparse.
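Assuming the weights in the table, the per-ticket combination can be sketched as follows. The scoring helpers are simplified stand-ins for the environment's actual grader; only the priority rule is spelled out in the table:

```python
# Component weights from the table above.
WEIGHTS = {"priority": 0.30, "routing": 0.30, "response": 0.25, "escalation": 0.15}

PRIORITY_LEVELS = ["low", "medium", "high", "urgent"]

def priority_score(predicted, truth):
    """Exact match = 1.0; off by one level = 0.5; otherwise 0.0."""
    gap = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(truth))
    return {0: 1.0, 1: 0.5}.get(gap, 0.0)

def ticket_reward(scores):
    """Weighted sum of per-component scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * s for name, s in scores.items())

# Example: priority off by one, routing and escalation correct,
# moderate keyword coverage in the drafted response.
reward = ticket_reward({
    "priority": priority_score("high", "urgent"),  # 0.5
    "routing": 1.0,
    "response": 0.6,
    "escalation": 1.0,
})  # -> 0.75
```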


## API Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/reset` | Start new episode |
| POST | `/step` | Submit triage actions |
| GET | `/state` | Full environment snapshot |
| GET | `/health` | Liveness probe |
| GET | `/tasks` | List available tasks |
| GET | `/openenv.yaml` | Spec file |
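The payload shapes above can be exercised with a minimal stdlib client (a sketch: `BASE_URL` is a placeholder, and the `opener` parameter is an invented hook so the function can be tested without a live server):

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # or your deployed Space URL

def post_json(path, payload, opener=urllib.request.urlopen):
    """POST a JSON body to the environment and decode the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with opener(req) as resp:
        return json.load(resp)

def reset(task="easy", seed=42, **kwargs):
    """Start a new episode via POST /reset."""
    return post_json("/reset", {"task": task, "seed": seed}, **kwargs)

def step(actions, **kwargs):
    """Submit triage actions via POST /step."""
    return post_json("/step", {"actions": actions}, **kwargs)
```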

## Setup & Usage

### Local development

```bash
# Clone and install
git clone https://huggingface.co/spaces/YOUR_HF_USERNAME/support-triage
cd support-triage
pip install -r requirements.txt

# Run the server
cd server
uvicorn server:app --host 0.0.0.0 --port 7860 --reload
```

### Docker

```bash
docker build -t support-triage .
docker run -p 7860:7860 support-triage

# Test it
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" \
  -d '{"task": "easy", "seed": 42}'
```

### Run inference script

```bash
# Against local server
export HF_TOKEN="your-hf-token"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export ENV_BASE_URL="http://localhost:7860"
python inference.py

# Against deployed Space
export ENV_BASE_URL="https://YOUR_SPACE.hf.space"
python inference.py
```

## Baseline Scores

Tested with Qwen/Qwen2.5-72B-Instruct (seed=42):

| Task | Score | Notes |
|---|---|---|
| easy | ~0.82 | Strong on clear cases |
| medium | ~0.68 | Struggles with multi-issue routing |
| hard | ~0.54 | Misses misleading-sentiment tickets |
| overall | ~0.68 | Room for improvement with better prompting |

## Project Structure

```
support-triage/
├── server/
│   ├── server.py      # FastAPI HTTP server
│   ├── env.py         # CustomerSupportTriageEnv (step/reset/state)
│   ├── models.py      # Pydantic typed models
│   └── tasks.py       # Ticket corpora + grader
├── inference.py       # Baseline LLM inference script
├── openenv.yaml       # OpenEnv spec
├── Dockerfile         # Container build
├── requirements.txt   # Python dependencies
└── README.md          # This file
```

## Evaluation Criteria Alignment

| Criterion | How addressed |
|---|---|
| Real-world utility | Mirrors a Tier-1 SaaS support triage workflow |
| 3+ tasks with graders | easy/medium/hard with deterministic keyword + priority graders |
| Meaningful reward | 4-component partial-credit reward, dense signal every step |
| OpenEnv spec | Fully typed models, step/reset/state, `openenv.yaml` |
| Deployment | Docker + HF Spaces |
| Baseline script | `inference.py` with `[START]`/`[STEP]`/`[END]` logs |

## About

CustomerSupportTriage-v0 is an OpenEnv-compliant benchmark for AI agents handling real-world support triage. Agents process ticket queues by assigning priorities, routing to departments, drafting replies, and flagging tickets for human review. It features three difficulty tiers and a partial-credit reward function.
