
---
title: CustomerSupportTriage-v0
emoji: 🎫
colorFrom: blue
colorTo: teal
sdk: docker
app_port: 7860
tags:
  - openenv
  - customer-support
  - agent-benchmark
  - nlp
  - real-world
pinned: false
---

# CustomerSupportTriage-v0

An OpenEnv-compliant environment for benchmarking AI agents on real-world customer support triage.

## Overview

The agent receives a queue of customer support tickets and must, for each one:

  1. **Assign a priority** — low / medium / high / urgent
  2. **Route to a department** — billing / technical / shipping / returns / general / escalation
  3. **Draft a customer reply** — 1–3 professional, empathetic sentences
  4. **Flag for human review** — `true` for legal threats, security incidents, accessibility issues

This mirrors the actual workflow of a Tier-1 support agent at a SaaS company.
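For illustration, the four per-ticket decisions can be sketched as a toy rule-based baseline. The keyword lists, department hints, and canned reply below are invented for this sketch; a real agent would draft replies with an LLM:

```python
# Toy rule-based triage baseline. Keyword lists are illustrative only;
# substring matching like this is deliberately naive.
URGENT_WORDS = {"outage", "down", "breach", "data loss"}
REVIEW_WORDS = {"lawsuit", "legal", "gdpr", "ada", "wcag", "security"}
DEPARTMENT_HINTS = {
    "invoice": "billing",
    "refund": "returns",
    "delivery": "shipping",
    "error": "technical",
}

def triage(ticket):
    """Produce one triage action for a ticket from the observation queue."""
    text = (ticket["subject"] + " " + ticket["body"]).lower()
    priority = "urgent" if any(w in text for w in URGENT_WORDS) else "medium"
    department = next(
        (dept for hint, dept in DEPARTMENT_HINTS.items() if hint in text),
        "general",
    )
    return {
        "ticket_id": ticket["ticket_id"],
        "priority": priority,
        "department": department,
        "response": "Thanks for reaching out -- we're looking into this now.",
        "needs_human": any(w in text for w in REVIEW_WORDS),
    }
```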


## Tasks

| Task   | Tickets | Difficulty | Description |
|--------|---------|------------|-------------|
| easy   | 5       | Easy       | Unambiguous signals — clear priority, obvious department |
| medium | 10      | Medium     | Multi-issue bodies, ambiguous routing, partial overlap |
| hard   | 15      | Hard       | Misleading sentiment, legal edge-cases, downplayed urgency, security incidents |

### Hard task examples

- A ticket starting "Thanks for the quick response!" that describes a production outage
- A GDPR deletion request that also involves a billing dispute
- An accessibility complaint citing ADA/WCAG legal requirements
- An enterprise customer calmly describing a catastrophic data migration failure

## Action & Observation Spaces

### Observation

```json
{
  "queue": [
    {
      "ticket_id": "H001",
      "subject": "...",
      "body": "...",
      "customer_name": "Olivia Park",
      "customer_tier": "enterprise",
      "created_at": "2024-03-15T06:00:00Z",
      "sentiment": "positive",
      "tags": ["sso", "outage", "enterprise"]
    }
  ],
  "processed": 3,
  "total_tickets": 15,
  "task_name": "hard",
  "step_number": 2,
  "time_remaining": 13
}
```

### Action

```json
{
  "actions": [
    {
      "ticket_id": "H001",
      "priority": "urgent",
      "department": "technical",
      "response": "I'm treating this as a P0 incident...",
      "needs_human": true,
      "reasoning": "Enterprise SSO down = production blocker"
    }
  ]
}
```
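As a convenience, a client could validate actions against the documented value sets before submitting them. The helper below is a hypothetical sketch, not part of the environment API:

```python
# Hypothetical client-side helper: checks an action against the
# priority and department values documented above before POST /step.
PRIORITIES = {"low", "medium", "high", "urgent"}
DEPARTMENTS = {"billing", "technical", "shipping",
               "returns", "general", "escalation"}

def make_action(ticket_id, priority, department, response,
                needs_human=False, reasoning=""):
    """Build one entry for the "actions" list, validating the enums."""
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    if department not in DEPARTMENTS:
        raise ValueError(f"unknown department: {department!r}")
    return {
        "ticket_id": ticket_id,
        "priority": priority,
        "department": department,
        "response": response,
        "needs_human": bool(needs_human),
        "reasoning": reasoning,
    }
```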

## Reward Function

Each ticket is graded by a deterministic grader:

| Component | Weight | Description |
|---|---|---|
| Priority accuracy | 30% | Exact match = 1.0; off by one level = 0.5 |
| Routing accuracy | 30% | Exact match = 1.0; adjacent department = 0.3 |
| Response quality | 25% | Keyword coverage from ground-truth model answers |
| Escalation correctness | 15% | Correct `needs_human` flag |

Partial credit is awarded throughout — the reward signal is dense, not sparse.
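Assuming the weights in the table, the per-ticket combination can be sketched as follows. The scoring helpers are simplified stand-ins for the environment's actual grader; only the priority rule is spelled out in the table:

```python
# Component weights from the table above.
WEIGHTS = {"priority": 0.30, "routing": 0.30, "response": 0.25, "escalation": 0.15}

PRIORITY_LEVELS = ["low", "medium", "high", "urgent"]

def priority_score(predicted, truth):
    """Exact match = 1.0; off by one level = 0.5; otherwise 0.0."""
    gap = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(truth))
    return {0: 1.0, 1: 0.5}.get(gap, 0.0)

def ticket_reward(scores):
    """Weighted sum of per-component scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * s for name, s in scores.items())

# Example: priority off by one, routing and escalation correct,
# moderate keyword coverage in the drafted response.
reward = ticket_reward({
    "priority": priority_score("high", "urgent"),  # 0.5
    "routing": 1.0,
    "response": 0.6,
    "escalation": 1.0,
})  # -> 0.75
```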


## API Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/reset` | Start new episode |
| POST | `/step` | Submit triage actions |
| GET | `/state` | Full environment snapshot |
| GET | `/health` | Liveness probe |
| GET | `/tasks` | List available tasks |
| GET | `/openenv.yaml` | Spec file |
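The payload shapes above can be exercised with a minimal stdlib client (a sketch: `BASE_URL` is a placeholder, and the `opener` parameter is an invented hook so the function can be tested without a live server):

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # or your deployed Space URL

def post_json(path, payload, opener=urllib.request.urlopen):
    """POST a JSON body to the environment and decode the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with opener(req) as resp:
        return json.load(resp)

def reset(task="easy", seed=42, **kwargs):
    """Start a new episode via POST /reset."""
    return post_json("/reset", {"task": task, "seed": seed}, **kwargs)

def step(actions, **kwargs):
    """Submit triage actions via POST /step."""
    return post_json("/step", {"actions": actions}, **kwargs)
```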

## Setup & Usage

### Local development

```bash
# Clone and install
git clone https://huggingface.co/spaces/YOUR_HF_USERNAME/support-triage
cd support-triage
pip install -r requirements.txt

# Run the server
cd server
uvicorn server:app --host 0.0.0.0 --port 7860 --reload
```

### Docker

```bash
docker build -t support-triage .
docker run -p 7860:7860 support-triage

# Test it
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" \
  -d '{"task": "easy", "seed": 42}'
```

### Run inference script

```bash
# Against local server
export HF_TOKEN="your-hf-token"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export ENV_BASE_URL="http://localhost:7860"
python inference.py

# Against deployed Space
export ENV_BASE_URL="https://YOUR_SPACE.hf.space"
python inference.py
```

## Baseline Scores

Tested with Qwen/Qwen2.5-72B-Instruct (seed=42):

| Task | Score | Notes |
|---|---|---|
| easy | ~0.82 | Strong on clear cases |
| medium | ~0.68 | Struggles with multi-issue routing |
| hard | ~0.54 | Misses misleading-sentiment tickets |
| overall | ~0.68 | Room for improvement with better prompting |

## Project Structure

```
support-triage/
├── server/
│   ├── server.py      # FastAPI HTTP server
│   ├── env.py         # CustomerSupportTriageEnv (step/reset/state)
│   ├── models.py      # Pydantic typed models
│   └── tasks.py       # Ticket corpora + grader
├── inference.py       # Baseline LLM inference script
├── openenv.yaml       # OpenEnv spec
├── Dockerfile         # Container build
├── requirements.txt   # Python dependencies
└── README.md          # This file
```

## Evaluation Criteria Alignment

| Criterion | How addressed |
|---|---|
| Real-world utility | Mirrors a Tier-1 SaaS support triage workflow |
| 3+ tasks with graders | easy/medium/hard with deterministic keyword + priority graders |
| Meaningful reward | 4-component partial-credit reward, dense signal every step |
| OpenEnv spec | Fully typed models, step/reset/state, `openenv.yaml` |
| Deployment | Docker + HF Spaces |
| Baseline script | `inference.py` with `[START]`/`[STEP]`/`[END]` logs |

## About

CustomerSupportTriage-v0 is an OpenEnv-compliant benchmark for AI agents handling real-world support triage. Agents process ticket queues by assigning priorities, routing to departments, drafting replies, and flagging tickets for human review. It features three difficulty tiers and a partial-credit reward function.
