TicketWorld generates realistic customer service datasets and environments for training and evaluating LLM systems. It creates customer support scenarios with interconnected databases, policy documents, and resolution plans.
TicketWorld generates synthetic customer service data that challenges LLM systems with:
- Multi-hop policy reasoning: Tickets require understanding interactions between multiple company policies
- Tool use and effective lookup: Every ticket requires customer, product, and order information plus the company policy document to produce an accurate resolution. These assets are stored separately, in a database and a standalone .txt file, so resolving a ticket takes effective multi-hop queries and search.
- Realistic customer scenarios: Edge cases, partial information, and complex situations
- Policy compliance validation: Resolutions must reference and apply specific policy clauses
- Authentic data relationships: Customers, orders, and products with realistic transaction histories
The system uses a carefully orchestrated synthetic data pipeline that respects asset dependencies and provides targeted information access:
- Policy Graph Creation: Company policies are modeled as interconnected clauses with relationships (overrides, modifies, requires)
- Scenario Templates: Pre-built templates define customer situations (returns, exchanges, warranty claims, etc.) with varying conditions that require combining and reasoning over multiple policy rules
- Asset Generation Pipeline:
  - Generate customers with realistic profiles and contact information
  - Generate products with pricing, categories, and specifications
  - Generate orders using both customers and products, creating authentic transaction relationships
- Email Generation: Using scenario templates and specific customer/order context, an LLM writes each email from the customer's perspective. Realism comes from varying how much information the customer provides: missing order numbers, misspelled order numbers, emails sent from a secondary address, and so on. These gaps require database lookups, best-guess inference, or follow-up clarification requests.
- Resolution Generation: Using all previous assets plus metadata, an LLM acts as a customer service representative and writes a policy-compliant resolution
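The dependency order of the asset pipeline can be illustrated with a small sketch. The function names and record fields below are hypothetical and are not taken from factory.py; the only point is that orders are built last, from customers and products that already exist, so every reference resolves.

```python
import random


def generate_customers(n, rng):
    """Create n customer records (hypothetical schema)."""
    tiers = ["standard", "premium", "vip"]
    return [
        {"customer_id": f"CUST-{i:04d}", "tier": rng.choice(tiers)}
        for i in range(1, n + 1)
    ]


def generate_products(n, rng):
    """Create n product records (hypothetical schema)."""
    return [
        {"product_id": f"PROD-{1000 + i}", "price": round(rng.uniform(10, 1000), 2)}
        for i in range(1, n + 1)
    ]


def generate_orders(n, customers, products, rng):
    """Orders come last so they can only point at existing customers and products."""
    orders = []
    for i in range(1, n + 1):
        customer = rng.choice(customers)
        product = rng.choice(products)
        orders.append({
            "order_id": f"ORD-{i:04d}",
            "customer_id": customer["customer_id"],
            "product_id": product["product_id"],
            "total_amount": product["price"],
        })
    return orders


rng = random.Random(0)  # fixed seed so the sketch is reproducible
customers = generate_customers(50, rng)
products = generate_products(35, rng)
orders = generate_orders(70, customers, products, rng)
```

Generating in this order means no later repair pass is needed to fix dangling customer or product references.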
The Key Innovation: This synthetic data pipeline addresses the core challenge of generating high-quality datasets that nevertheless remain difficult for LLMs to solve. During generation, we provide targeted information access (specific customer records, relevant policies) and deterministic metadata to ensure consistency and minimize hallucination. During evaluation, these scaffolds are removed: the LLM must accurately retrieve information from large databases and reason over many pieces of data, some of them irrelevant.
This approach generates datasets with minimal errors and maximum consistency while creating genuinely challenging multi-hop reasoning scenarios that require effective tool use and lookup capabilities.
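As an illustration of the lookup problem an evaluated system faces, here is a minimal sketch of resolving a possibly misspelled order number against the known order IDs. It uses Python's standard difflib; the function name and the 0.85 cutoff are assumptions made for this example, not part of TicketWorld.

```python
import difflib


def resolve_order_id(claimed_id, known_ids, cutoff=0.85):
    """Return an exact match, a single close match, or None.

    None means the ID is unknown or ambiguous, in which case a follow-up
    clarification request is the safer action.
    """
    if claimed_id in known_ids:
        return claimed_id
    matches = difflib.get_close_matches(claimed_id, known_ids, n=2, cutoff=cutoff)
    # Accept a fuzzy match only when exactly one candidate clears the cutoff
    return matches[0] if len(matches) == 1 else None


known = ["ORD-20250609-1002", "ORD-20241130-7745"]
print(resolve_order_id("ORD-20250609-1020", known))  # transposed digits
print(resolve_order_id("XYZ-1", known))              # nothing similar
```

Refusing to guess when two candidates are equally plausible trades recall for safety, which suits cases where acting on the wrong order is costly.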
In addition, this setup allows fresh batches of ticket data to be generated on demand. As with FreshStack, this helps mitigate the overfitting, data contamination, and staleness often seen with static train and test sets.
Prerequisites:
- Python 3.11+
- uv for dependency management
- Google Gemini API access
# Clone the repository
cd ticketworld
# Install dependencies
uv sync
# Set up environment variables
cp .env.example .env # Create this file
Create a .env file in the project root:
# Required: Google Gemini API key
GEMINI_API_KEY=your-gemini-api-key-here
Get your API key from Google AI Studio.
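Before starting a long generation run, it can help to confirm that the key is actually set. The sketch below is a standalone check written for this README; how factory.py itself loads the key is not shown here, and both function names are hypothetical.

```python
import os


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments (no quoting rules)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def gemini_key_is_configured(env_text, environ=os.environ):
    """True if GEMINI_API_KEY is set to something other than the placeholder."""
    key = environ.get("GEMINI_API_KEY") or parse_env_file(env_text).get("GEMINI_API_KEY", "")
    return bool(key) and key != "your-gemini-api-key-here"
```

For example, `gemini_key_is_configured(open(".env").read())` returns False while the placeholder value is still in place.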
Generate a dataset with default settings:
# Run with test configuration (100 tickets, 50 customers, 35 products, 70 orders)
uv run python factory.py
# Generate larger dataset
uv run python factory.py --tickets 500 --customers 200 --products 100 --orders 300
# Append to existing dataset
uv run python factory.py --mode append --tickets 100
# Custom output directory
uv run python factory.py --output-dir ./my_dataset --tickets 200
# Exclude debug metadata (for clean training data)
uv run python factory.py --no-debug --tickets 1000
For a full dataset with all enhancements:
# 1. Generate core dataset
uv run python factory.py --tickets 500 --customers 200
# 2. Add policy dilution (makes policy document more realistic)
uv run python utils/policy_dilution_script.py
# 3. Convert to SQLite for easier querying
uv run python utils/convert_to_sqlite.py

| Script | Purpose | When to Run |
|---|---|---|
| policy_dilution_script.py | Adds irrelevant content to the policy document to simulate real-world policy complexity | After factory.py |
| convert_to_sqlite.py | Converts the JSON customer database to SQLite for easier querying and analysis | After factory.py |

| Script | Purpose | Use Case |
|---|---|---|
| audit_tickets.py | Reviews generated tickets for policy compliance and errors | Quality assurance, debugging |
| validate_templates.py | Analyzes scenario templates and discovers policy interactions | Template development, validation |

# Add policy dilution
cd utils && python policy_dilution_script.py
# Convert to SQLite
cd utils && python convert_to_sqlite.py
# Audit ticket quality (optional)
cd utils && python audit_tickets.py
# Validate templates (development tool)
cd utils && python validate_templates.py
After running the factory, the assets/ directory contains:

| File | Description | Size (typical) |
|---|---|---|
| support_tickets.json | Complete ticket dataset with customer emails and resolutions | ~350KB (100 tickets) |
| customer_database.json | Customer profiles, orders, and product catalog | ~80KB (50 customers) |
| company_policy.txt | Clean company policy document | ~3KB |

| File | Description | Generated By |
|---|---|---|
| company_policy_full.txt | Policy document with realistic dilution content | policy_dilution_script.py |
| customer_database.db | SQLite version of customer database | convert_to_sqlite.py |
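
A quick way to inspect the converted database is to list its tables and row counts with Python's built-in sqlite3 module. This sketch assumes nothing about the schema that convert_to_sqlite.py produces; the helper name `list_tables` is ours.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling `list_tables("assets/customer_database.db")` after the conversion step shows which tables are available to query. Note that sqlite3.connect creates an empty file if the path does not exist, so an empty result usually means the conversion has not been run yet.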

| File | Description | Contents |
|---|---|---|
| policy_graph.json | Policy interaction structure and metadata | Policy relationships, complexity analysis |
| ticket_audit_results.json | Quality analysis of generated tickets | Compliance scores, error detection |
| ticket_audit_report.txt | Human-readable audit summary | Policy violations, recommendations |

The example ticket below shows a customer email, its resolution plan, and the metadata (product, customer, scenario template, policy interactions, etc.) the system used to generate both.
{
"ticket_id": "TK-20250618-2052",
"customer_email": "[email protected]",
"subject": "Defective Tablet - Order ORD-20250609-1002 - Exchange Request",
"body": "Dear Customer Support,\n\nI am writing to you today because I received a defective item in my recent order, ORD-20250609-1002. I ordered the Tablet Basic 10-inch on June 9, 2025, so it's only been a little over a week since it arrived.\n\nUnfortunately, the tablet is not working correctly. The screen frequently flickers and freezes, making it impossible to use. I've tried restarting it several times, but the problem persists. It's really frustrating to receive a brand new item that's already faulty.\n\nI would like to request an exchange for a working Tablet Basic 10-inch. I really need this specific model and would prefer to get a replacement rather than a refund. Could you please let me know the process for exchanging a defective item?\n\nThank you for your time and assistance.\n\nSincerely,\nDavid Chen",
"timestamp": "2025-06-18T10:30:00",
"customer_id": "CUST-0002",
"order_id": "ORD-20250609-1002",
"resolution_plan": {
"order_id": "ORD-20250609-1002",
"order_date": "2025-06-09",
"customer_lookup": {
"status": "found",
"customer_id": "CUST-0002",
"lookup_method": "email_match",
"notes": "Customer found in database"
},
"policy_references": [
"POL-EXCHANGE-002",
"POL-RETURN-001",
"POL-EXCHANGE-001",
"POL-SHIP-006"
],
"policy_reasoning": "The customer reported receiving a defective Tablet Basic 10-inch within 9 days of purchase. This falls within the 30-day return/exchange window as per POL-RETURN-001 and POL-EXCHANGE-001. According to POL-EXCHANGE-002, defective items are to be exchanged for the same item at no cost. Since the item's value is $249.99, which is under $500, an immediate replacement can be authorized based on POL-SHIP-006 (Damaged items under $500).",
"actions": [
{
"type": "process_exchange",
"reason": "Customer is requesting an exchange for a defective item received within the exchange window, as per POL-EXCHANGE-002 and POL-RETURN-001. The item's value is under $500, allowing for immediate replacement per POL-SHIP-006.",
"value": 249.99,
"details": "Exchange for one (1) Tablet Basic 10-inch (PROD-1031) due to defect. No additional cost to customer."
},
{
"type": "send_replacement",
"reason": "Replacement authorized for a defective item under $500 as per POL-SHIP-006.",
"value": 249.99,
"details": "Ship one (1) new Tablet Basic 10-inch (PROD-1031) to customer David Chen. Provide return label for the defective unit."
}
],
"escalation_required": false,
"escalation_reason": null,
"priority": "medium",
"total_resolution_value": 249.99
},
"_scenario_dimensions": {
"query_type": "exchange_request",
"information_completeness": "complete",
"complexity": "requires_lookup",
"customer_sentiment": "pleading"
},
"_scenario_template": {
"scenario_id": "EXCHANGE-002",
"name": "exchange_defective_product",
"primary_policy": "POL-EXCHANGE-002",
"complexity_level": 2,
"expected_outcome": "approve"
},
"_policy_analysis": {
"all_relevant_policies": [
"POL-EXCHANGE-002",
"POL-RETURN-004"
],
"applicable_policies": [
"POL-EXCHANGE-002",
"POL-RETURN-004"
],
"context_used": {
"has_receipt": true,
"customer_tier": "standard",
"days_since_purchase": 9,
"months_since_purchase": 0.2956636005256242,
"order_status": "delivered",
"total_order_value": 249.99,
"item_value": 249.99,
"product_warranty_days": 365,
"item_condition": "defective",
"exchange_reason": "defective",
"purchase_month": 6
},
"policy_interactions": "Multi-hop reasoning required"
}
}
--tickets N # Number of tickets to generate (default: 100)
--customers N # Number of customers (default: 50)
--products N # Number of products (default: 35)
--orders N # Number of orders (default: 70)
--mode MODE # "create" or "append" (default: create)
--output-dir DIR # Output directory (default: ./assets)
--company-name NAME # Company name for policies (default: TechNest)
--no-debug # Exclude debug metadata for clean training data
The generator creates realistic distributions:
- Ticket Types: Returns (25%), Shipping Issues (20%), Billing Disputes (20%), Warranty Claims (15%), etc.
- Complexity Levels: Simple (40%), Requires Lookup (35%), Edge Cases (20%), Escalation Required (5%)
- Customer Tiers: Standard (70%), Premium (20%), VIP (10%)
- Information Completeness: Complete (30%), Missing Details (40%), Wrong Info (30%)
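
Distributions like these can be reproduced with weighted sampling. The sketch below is an illustration, not the generator's actual code: the labels `requires_lookup`, `complete`, and `standard` appear in the example ticket, while the remaining label spellings are assumptions. Ticket types are omitted because the list above is not exhaustive.

```python
import random

# Weights taken from the percentages listed above
COMPLEXITY = {"simple": 0.40, "requires_lookup": 0.35, "edge_case": 0.20, "escalation_required": 0.05}
CUSTOMER_TIER = {"standard": 0.70, "premium": 0.20, "vip": 0.10}
COMPLETENESS = {"complete": 0.30, "missing_details": 0.40, "wrong_info": 0.30}


def sample_dimension(weights, rng):
    """Draw one label according to its weight."""
    labels = list(weights)
    return rng.choices(labels, weights=[weights[label] for label in labels], k=1)[0]


def sample_scenario(rng):
    """Draw one independent value per dimension."""
    return {
        "complexity": sample_dimension(COMPLEXITY, rng),
        "customer_tier": sample_dimension(CUSTOMER_TIER, rng),
        "information_completeness": sample_dimension(COMPLETENESS, rng),
    }


rng = random.Random(42)
print(sample_scenario(rng))
```

Sampling each dimension independently is the simplest choice; a real generator may correlate them, for example by making escalations more likely for edge cases.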
Typical use cases:
- Policy Reasoning: Test multi-hop policy application
- Customer Service: Train on realistic support scenarios
- Edge Case Handling: Challenge models with incomplete information
- Business Logic: Validate understanding of complex rules
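
For the policy reasoning use case, one possible way to score a model is to compare the policy IDs it cites against `resolution_plan.policy_references` in the ticket. TicketWorld does not prescribe this metric; the function below is a hypothetical sketch using set-based precision, recall, and F1.

```python
def score_policy_references(predicted, expected):
    """Precision, recall, and F1 over sets of policy IDs."""
    predicted, expected = set(predicted), set(expected)
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Expected IDs come from the example ticket; the prediction is made up
expected = ["POL-EXCHANGE-002", "POL-RETURN-001", "POL-EXCHANGE-001", "POL-SHIP-006"]
predicted = ["POL-EXCHANGE-002", "POL-RETURN-001", "POL-RETURN-004"]
print(score_policy_references(predicted, expected))
```

Set-based scoring ignores citation order and duplicates, which seems reasonable when only the choice of clauses matters.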
import json
# Load tickets
with open('assets/support_tickets.json') as f:
    tickets = json.load(f)
# Analyze policy complexity (tickets without debug metadata count as zero policies)
complex_tickets = [
    t for t in tickets
    if len(t.get('_policy_analysis', {}).get('applicable_policies', [])) > 2
]
print(f"Multi-policy tickets: {len(complex_tickets)}")
-- Find high-value orders (total over 500)
SELECT c.name, o.order_id, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.total_amount > 500;
-- Customer purchase patterns
SELECT customer_id, COUNT(*) as order_count, AVG(total_amount) as avg_order
FROM orders
GROUP BY customer_id
ORDER BY order_count DESC;
- Policy Compliance: All resolutions reference specific policy clauses
- Realistic Timing: Email timestamps align with customer descriptions ("last week", "a few months ago")
- Data Consistency: Customer/order relationships are maintained across all tickets
- Edge Cases: Wrong emails, missing information, partial customer matches
- Multi-hop Reasoning: Complex scenarios requiring multiple policy interactions
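
The data consistency property can be spot-checked with a referential integrity pass over the tickets. The ticket fields (`ticket_id`, `customer_id`, `order_id`) match the example ticket above; the shape of the customer and order records is an assumption, so adapt the arguments to the actual layout of customer_database.json.

```python
def find_inconsistent_tickets(tickets, customers, orders):
    """Return (ticket_id, problem) pairs for tickets whose IDs do not resolve.

    customers: iterable of dicts with "customer_id" (assumed shape)
    orders: iterable of dicts with "order_id" and "customer_id" (assumed shape)
    """
    customer_ids = {c["customer_id"] for c in customers}
    order_owner = {o["order_id"]: o["customer_id"] for o in orders}
    problems = []
    for ticket in tickets:
        cid, oid = ticket.get("customer_id"), ticket.get("order_id")
        if cid is not None and cid not in customer_ids:
            problems.append((ticket["ticket_id"], "unknown customer"))
        if oid is not None:
            if oid not in order_owner:
                problems.append((ticket["ticket_id"], "unknown order"))
            elif cid is not None and order_owner[oid] != cid:
                problems.append((ticket["ticket_id"], "order belongs to another customer"))
    return problems
```

An empty result means every ticket points at an existing customer and at an order owned by that customer.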
To add a new scenario:
- Edit scenario templates in factory.py (create_scenario_templates())
- Run utils/validate_templates.py to discover policy interactions
- Test with utils/audit_tickets.py for compliance
To add a new policy:
- Add new policy clauses in create_policy_graph()
- Define relationships (overrides, modifies, requires)
- Update scenario templates to reference new policies
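
To make the three relationship types concrete, here is a minimal sketch of a clause graph and of how "requires" and "overrides" might combine. The `PolicyClause` class, the `effective_clauses` function, and the `POL-DEMO-*` clauses are all hypothetical; the real structure returned by create_policy_graph() may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyClause:
    clause_id: str
    text: str
    overrides: list = field(default_factory=list)  # clauses this one replaces
    modifies: list = field(default_factory=list)   # clauses this one adjusts
    requires: list = field(default_factory=list)   # clauses that must also apply


def effective_clauses(graph, triggered):
    """Expand triggered clauses with everything they require, then drop
    any clause that an active clause overrides (a simplification)."""
    active, stack = set(), list(triggered)
    while stack:
        clause_id = stack.pop()
        if clause_id in active:
            continue
        active.add(clause_id)
        stack.extend(graph[clause_id].requires)
    overridden = {o for cid in active for o in graph[cid].overrides}
    return sorted(active - overridden)


graph = {
    "POL-DEMO-001": PolicyClause("POL-DEMO-001", "30-day return window"),
    "POL-DEMO-002": PolicyClause("POL-DEMO-002", "Defective items are exchanged at no cost",
                                 requires=["POL-DEMO-001"]),
    "POL-DEMO-003": PolicyClause("POL-DEMO-003", "Holiday purchases get a 60-day window",
                                 overrides=["POL-DEMO-001"]),
}
print(effective_clauses(graph, ["POL-DEMO-002", "POL-DEMO-003"]))
```

A ticket that triggers both the defect clause and the holiday clause ends up governed by those two, with the default window dropped. This is the kind of multi-hop interaction that scenario templates are meant to exercise.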
- Generation Speed: ~1-2 tickets/second (depends on LLM response time)
- Memory Usage: ~100MB for typical datasets
- Output Size: ~5MB for 1000 tickets with full metadata
TicketWorld creates comprehensive testing environments for customer service AI systems, ensuring robust handling of real-world complexity and multi-policy reasoning scenarios.
