zxkane/openhands-infra

🚀 OpenHands on AWS

Self-host your AI coding agent: fully serverless, zero idle cost

Deploy OpenHands on AWS with production-grade infrastructure in minutes. ECS Fargate • Bedrock LLM • Per-conversation isolation • Self-healing architecture.

Getting Started · Architecture · Cost Estimate · Blog Post


Why This Project?

Running OpenHands locally is great for trying it out. Running it for a team or in production is a different story:

| Challenge | How This Project Solves It |
| --- | --- |
| "I don't want to manage servers" | Fully serverless: ECS Fargate, Aurora Serverless, no EC2 instances |
| "Idle cost is too high" | Sandboxes scale to zero when not in use; pay only for active conversations |
| "Multi-user access control" | Cognito authentication with 30-day sessions, per-user conversation isolation |
| "My conversations disappear on restart" | Self-healing: Aurora + S3 + EFS persist everything across Fargate task replacements |
| "I need AWS access from the AI agent" | Optional scoped IAM credentials for sandbox containers (least-privilege) |
| "Setting up infra is painful" | One cdk deploy --all command: 10 stacks deployed in the right order automatically |

✨ Key Features

  • 🏗️ Fully Serverless: ECS Fargate (ARM64) for compute, Aurora Serverless v2 for database, no instances to patch
  • 💰 Zero Idle Cost: Sandbox containers spin up per-conversation and stop automatically after idle timeout
  • 🔒 Per-Conversation Isolation: Each sandbox gets a dedicated EFS access point; no cross-conversation access
  • 🔄 Self-Healing Architecture: Conversations resume seamlessly after Fargate task replacement (Aurora + S3 + EFS)
  • 🤖 AWS Bedrock: LLM inference via IAM Role, no API keys to manage
  • 🌐 Multi-Domain Support: Share one backend across multiple CloudFront distributions and domains
  • 🔐 Enterprise Security: Cognito auth, WAF, VPC Endpoints, private subnets, KMS encryption, Secrets Manager
  • 🚀 Runtime Subdomain: Agent-built apps accessible via {port}-{convId}.runtime.{subdomain}.{domain}
  • 📊 Observability: CloudWatch Logs, Alarms, Container Insights, AWS Backup (14-day retention)
  • 🏎️ Warm Pool: Pre-warmed sandbox tasks for instant conversation starts

Architecture Overview

User → CloudFront (WAF+Lambda@Edge Auth) → ALB (origin verified) → ECS Fargate (App + OpenResty)
           │                                                                  ↓
           └── Cognito (OAuth2, Managed Login v2)                  Cloud Map → Sandbox Fargate Tasks
                                                                              ↓
                                                        VPC Endpoints → Bedrock / CloudWatch Logs
                                                                              ↓
                                                        RDS Proxy → Aurora Serverless v2 PostgreSQL

Sandbox Orchestration:
App → Orchestrator Lambda → DynamoDB Registry → Sandbox Fargate Tasks (per-conversation EFS isolation)

Runtime Apps:
{port}-{convId}.runtime.{subdomain}.{domain} → CloudFront → Lambda@Edge → OpenResty → Sandbox Fargate Task

📐 For a detailed architecture deep dive (10-stack breakdown, data flows, sandbox lifecycle), see docs/ARCHITECTURE.md.

🚀 Quick Start

Prerequisites

  • AWS CLI configured with appropriate credentials
  • Node.js 22+ and npm
  • Existing VPC with private subnets and NAT Gateway
  • Existing Route 53 Hosted Zone

1. Install Dependencies

git clone https://github.com/zxkane/openhands-infra.git
cd openhands-infra
npm install

2. Bootstrap CDK (First Time Only)

npx cdk bootstrap --region <your-main-region>
npx cdk bootstrap --region us-east-1  # Required for Lambda@Edge and CloudFront

3. Create Sandbox Secret Key (First Time Only)

aws secretsmanager create-secret \
  --name openhands/sandbox-secret-key \
  --secret-string "$(openssl rand -base64 32)" \
  --region <your-main-region> \
  --description "OpenHands sandbox secret key for session encryption"

Note: This secret must exist in each region where you deploy.
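
For scripted setups, a value like the one produced by `openssl rand -base64 32` can also be generated in Python. This is a minimal sketch using only the standard library; the function name `generate_sandbox_secret` is illustrative and not part of this project, and storing the value is still done with the `aws secretsmanager create-secret` command above.

```python
import base64
import secrets


def generate_sandbox_secret(num_bytes: int = 32) -> str:
    """Return num_bytes of cryptographically secure randomness, Base64-encoded.

    Comparable to `openssl rand -base64 32`. The function name is
    illustrative and not part of this project.
    """
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Print a fresh secret; pass it to --secret-string when creating the secret.
    print(generate_sandbox_secret())
```

With the default of 32 bytes, the encoded string is 44 characters long.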

4. Deploy

npx cdk deploy --all \
  --context vpcId=<vpc-id> \
  --context hostedZoneId=<hosted-zone-id> \
  --context domainName=<domain-name> \
  --context subDomain=<subdomain> \
  --context region=<region> \
  --require-approval never

That's it! Access OpenHands at https://<subdomain>.<domain-name> 🎉

Configuration

📋 All Context Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| vpcId | Existing VPC ID | vpc-0123456789abcdef0 |
| hostedZoneId | Route 53 Hosted Zone ID | Z0123456789ABCDEFGHIJ |
| domainName | Domain name | example.com |
| subDomain | Subdomain for OpenHands | openhands |
| region | AWS region (optional, defaults to us-east-1) | us-west-2 |
| siteName | Cognito managed login site name (optional) | Openhands on AWS |
| authCallbackDomains | Extra OAuth callback domains for shared Cognito client (optional; JSON array or comma-separated) | ["openhands.example.com","openhands.test.example.com"] |
| authDomainPrefixSuffix | Suffix for Cognito domain prefix (optional; avoids collisions) | shared |
| edgeStackSuffix | Suffix for Edge stack name in us-east-1 (optional; enables multiple Edge stacks) | my-project |
| sandboxAwsAccess | Enable sandbox AWS access (optional, defaults to false) | true |
| sandboxAwsPolicyFile | Path to custom IAM policy JSON for sandbox (optional) | config/sandbox-aws-policy.json |
| skipS3Endpoint | Skip S3 Gateway endpoint if VPC already has one (optional) | true |
| warmPoolSize | Number of pre-warmed sandbox Fargate tasks (optional, default: 2) | 3 |
| idleTimeoutMinutes | Minutes before idle sandbox is stopped (optional, default: 30, staging: 10) | 15 |
| sandboxSociImageUri | SOCI v2 image URI for Fargate lazy loading (optional, see AGENTS.md) | <ecr-uri>:tag-soci |
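
Because authCallbackDomains accepts either a JSON array or a comma-separated string, it can help to see how both forms reduce to the same list. The sketch below is an assumption about reasonable parsing behavior, not the project's actual code (the CDK app is TypeScript and may treat edge cases differently); `parse_callback_domains` is a hypothetical helper name.

```python
import json
from typing import List


def parse_callback_domains(raw: str) -> List[str]:
    """Normalize a JSON array or comma-separated string into a list of domains.

    Hypothetical helper for illustration; not part of this project.
    """
    text = raw.strip()
    if not text:
        return []
    if text.startswith("["):
        # JSON form, e.g. '["a.example.com","b.example.com"]'
        items = [str(item) for item in json.loads(text)]
    else:
        # Comma-separated form, e.g. 'a.example.com, b.example.com'
        items = text.split(",")
    # Trim whitespace and drop empty entries such as a trailing comma.
    return [item.strip() for item in items if item.strip()]


print(parse_callback_domains('["openhands.example.com","openhands.test.example.com"]'))
print(parse_callback_domains("openhands.example.com, openhands.test.example.com"))
```

Both calls print the same two-element list.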

Stack Structure

The project deploys 10 stacks with automatic dependency resolution:

| Stack | Region | Description |
| --- | --- | --- |
| OpenHands-Auth | us-east-1 | Cognito User Pool + Managed Login v2 branding |
| OpenHands-Network | Main | VPC import, VPC Endpoints |
| OpenHands-Monitoring | Main | CloudWatch Logs, Alarms, S3 Data Bucket, Backup |
| OpenHands-Security | Main | IAM Roles, Security Groups, KMS key |
| OpenHands-Database | Main | Aurora Serverless v2 PostgreSQL with RDS Proxy |
| OpenHands-UserConfig | Main | User Configuration API Lambda (MCP, Secrets, Integrations) |
| OpenHands-Cluster | Main | Shared ECS Cluster + Cloud Map namespace |
| OpenHands-Sandbox | Main | Sandbox Fargate tasks, DynamoDB registry, Orchestrator Lambda |
| OpenHands-Compute | Main | Fargate services (App + OpenResty), ALB, EFS |
| OpenHands-Edge-* | us-east-1 | Lambda@Edge, CloudFront, WAF, Route 53 (per domain/environment) |

Deployment Order (handled automatically by CDK): 0. Auth → 1. Network → 2. Monitoring → 3. Security → 4. Database → 5. UserConfig → 6. Cluster → 7. Sandbox → 8. Compute → 9. Edge

Cost Estimate

Base Infrastructure (~$250-350/month)

| Component | Monthly Cost (USD) | Notes |
| --- | --- | --- |
| Fargate App Service (1 vCPU / 2 GB ARM64) | ~$30 | Auto-scales 1-3 |
| Fargate OpenResty Service (0.25 vCPU / 512 MB) | ~$8 | Auto-scales 1-3 |
| Fargate Sandbox Tasks | ~$0-50 | On-demand, per-conversation |
| Aurora Serverless v2 | ~$43-80 | 0.5-4 ACU |
| RDS Proxy | ~$18 | |
| CloudFront | ~$85 | 1TB data transfer |
| VPC Endpoints (10) | ~$60 | |
| ALB | ~$25 | |
| Other (EFS, S3, NAT, CW, R53, DDB) | ~$10-50 | Usage-dependent |
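
The rows above can be added up to sanity-check an estimate for your own usage. The sketch below treats single-value rows as fixed and ranged rows as (low, high) pairs. With the table's figures the rows sum to roughly $279 to $406, a little above the rounded range in the section heading, so both are best read as approximations.

```python
# (low, high) monthly cost in USD, copied from the table above.
COMPONENTS = {
    "Fargate App Service": (30, 30),
    "Fargate OpenResty Service": (8, 8),
    "Fargate Sandbox Tasks": (0, 50),
    "Aurora Serverless v2": (43, 80),
    "RDS Proxy": (18, 18),
    "CloudFront": (85, 85),
    "VPC Endpoints": (60, 60),
    "ALB": (25, 25),
    "Other": (10, 50),
}


def total_range(components):
    """Sum the low ends and the high ends of every component separately."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


print(total_range(COMPONENTS))  # (279, 406)
```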

Bedrock LLM Cost (Variable)

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| Claude Opus 4.5 | $5 | $25 |
| Claude Sonnet 4.5 | $3 | $15 |
| Claude Haiku 4.5 | $1 | $5 |

Example: 10M input + 2M output tokens/month with Claude Sonnet 4.5 ≈ $60/month
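
The example works out as 10 × $3 + 2 × $15 = $60. A small sketch of the same arithmetic, using only the per-million-token prices listed in this README (actual Bedrock pricing may vary by region and change over time); `monthly_llm_cost` is an illustrative helper name.

```python
# (input, output) price in USD per 1M tokens, as listed in this README.
PRICES_PER_MILLION = {
    "Claude Opus 4.5": (5.0, 25.0),
    "Claude Sonnet 4.5": (3.0, 15.0),
    "Claude Haiku 4.5": (1.0, 5.0),
}


def monthly_llm_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly LLM cost in USD from token counts."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price


print(monthly_llm_cost("Claude Sonnet 4.5", 10_000_000, 2_000_000))  # 60.0
```

The same token volume would come to $20 with Claude Haiku 4.5 and $100 with Claude Opus 4.5 at the listed prices.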

Advanced Topics

🌐 Multi-Domain Deployment

You can deploy multiple OpenHands instances on different domains, all sharing the same backend infrastructure.

Architecture

                                 ┌─────────────────────────────────┐
                                 │      AuthStack (us-east-1)      │
                                 │  Shared Cognito User Pool       │
                                 │  - Multi-domain callbacks       │
                                 └─────────────────────────────────┘
                                              │
              ┌───────────────────────────────┼───────────────────────────────┐
              │                               │                               │
              ▼                               ▼                               ▼
┌─────────────────────────┐   ┌─────────────────────────┐   ┌─────────────────────────┐
│  EdgeStack-Domain1      │   │  EdgeStack-Domain2      │   │  EdgeStack-DomainN      │
│  (us-east-1)            │   │  (us-east-1)            │   │  (us-east-1)            │
│  - CloudFront           │   │  - CloudFront           │   │  - CloudFront           │
│  - Lambda@Edge          │   │  - Lambda@Edge          │   │  - Lambda@Edge          │
│  - WAF                  │   │  - WAF                  │   │  - WAF                  │
│  - Route 53 records     │   │  - Route 53 records     │   │  - Route 53 records     │
│  - ACM Certificate      │   │  - ACM Certificate      │   │  - ACM Certificate      │
└─────────────────────────┘   └─────────────────────────┘   └─────────────────────────┘
              │                               │                               │
              └───────────────────────────────┼───────────────────────────────┘
                                              │
                                              ▼
                           ┌─────────────────────────────────────┐
                           │     ComputeStack (main region)      │
                           │  - ALB with origin verification     │
                           │  - Fargate services (App+OpenResty) │
                           │  - SSM parameters in us-east-1      │
                           └─────────────────────────────────────┘
                                              │
                           ┌──────────────────┴──────────────────┐
                           ▼                                     ▼
              ┌─────────────────────────┐          ┌─────────────────────────┐
              │     DatabaseStack       │          │    MonitoringStack      │
              │  Aurora PostgreSQL      │          │  S3, CloudWatch         │
              └─────────────────────────┘          └─────────────────────────┘

Step 1: Configure Shared Authentication

npx cdk deploy OpenHands-Auth \
  --context vpcId=<vpc-id> \
  --context hostedZoneId=<primary-hosted-zone-id> \
  --context domainName=<primary-domain> \
  --context subDomain=openhands \
  --context region=<main-region> \
  --context authCallbackDomains='["openhands.domain1.com","openhands.domain2.com"]' \
  --require-approval never

Step 2: Deploy Backend Infrastructure

npx cdk deploy OpenHands-Network OpenHands-Monitoring OpenHands-Security \
  OpenHands-Database OpenHands-UserConfig OpenHands-Cluster \
  OpenHands-Sandbox OpenHands-Compute \
  --context vpcId=<vpc-id> \
  --context hostedZoneId=<primary-hosted-zone-id> \
  --context domainName=<primary-domain> \
  --context subDomain=openhands \
  --context region=<main-region> \
  --require-approval never

Step 3: Deploy Edge Stacks for Each Domain

# Domain 1
npx cdk deploy OpenHands-Edge-Test \
  --context vpcId=<vpc-id> \
  --context hostedZoneId=<hosted-zone-for-test-example-com> \
  --context domainName=test.example.com \
  --context subDomain=openhands \
  --context region=<main-region> \
  --context edgeStackSuffix=Test \
  --exclusively \
  --require-approval never

# Domain 2
npx cdk deploy OpenHands-Edge-Prod \
  --context vpcId=<vpc-id> \
  --context hostedZoneId=<hosted-zone-for-prod-example-com> \
  --context domainName=prod.example.com \
  --context subDomain=openhands \
  --context region=<main-region> \
  --context edgeStackSuffix=Prod \
  --exclusively \
  --require-approval never

Important: Use the --exclusively flag when deploying individual Edge stacks so that CDK does not also redeploy the backend stacks with a different domain context.

Managing Domains

Adding a new domain:

  1. Update Auth stack with the new callback domain
  2. Deploy a new Edge stack with --context edgeStackSuffix=<Name> --exclusively

Removing a domain:

  1. aws cloudformation delete-stack --stack-name OpenHands-Edge-<Suffix> --region us-east-1
  2. Optionally update Auth stack to remove the callback domain

🔄 Conversation Resume (Self-Healing)

When sandbox Fargate tasks stop (idle timeout, crash, or deployment), conversations become ARCHIVED. All data is preserved:

| Data | Storage | Survives Task Stop |
| --- | --- | --- |
| Conversation metadata | Aurora PostgreSQL | ✅ |
| Conversation events/history | S3 | ✅ |
| Workspace files | EFS (per-conversation access point) | ✅ |

Auto-Resume Flow:

User clicks archived conversation
    ↓
Frontend detects ARCHIVED status
    ↓
Calls POST /api/v1/app-conversations/{id}/resume
    ↓
App → Orchestrator Lambda:
  - Creates new EFS access point for conversation
  - Registers new task definition with access point
  - Launches Fargate sandbox task
  - Updates DynamoDB registry
    ↓
Page reloads → conversation is usable again

Workspace files on EFS are preserved via the access point, so code and files from the previous session remain available after resume.
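
The orchestration steps in the flow above can be sketched as a small in-memory model. Everything here is hypothetical: the class names, the `RUNNING` status, and the identifier formats are assumptions for illustration, and nothing in the sketch calls AWS or mirrors the project's actual Lambda code. It only shows the ordering of the steps and that resuming twice should not launch a second task.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Conversation:
    conv_id: str
    status: str = "ARCHIVED"
    task_id: Optional[str] = None


@dataclass
class InMemoryOrchestrator:
    """Hypothetical stand-in for the Orchestrator Lambda; makes no AWS calls."""

    registry: Dict[str, dict] = field(default_factory=dict)  # plays the DynamoDB registry
    launched: int = 0

    def resume(self, conv: Conversation) -> Conversation:
        if conv.status != "ARCHIVED":
            return conv  # already active: do not launch a second task
        access_point = f"access-point/{conv.conv_id}"  # 1. EFS access point
        task_definition = f"sandbox-{conv.conv_id}"    # 2. task definition using it
        self.launched += 1
        conv.task_id = f"task-{self.launched}"         # 3. launch the sandbox task
        self.registry[conv.conv_id] = {                # 4. update the registry
            "task_id": conv.task_id,
            "access_point": access_point,
            "task_definition": task_definition,
        }
        conv.status = "RUNNING"  # assumed status name; the README only names ARCHIVED
        return conv


orchestrator = InMemoryOrchestrator()
conversation = orchestrator.resume(Conversation("abc123def456"))
print(conversation.status, conversation.task_id)
```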

🔐 Sandbox AWS Access

Enable AI agents in sandbox containers to access AWS services with scoped IAM credentials:

npx cdk deploy --all \
  --context sandboxAwsAccess=true \
  --context sandboxAwsPolicyFile=config/sandbox-aws-policy.json \
  ...

⚠️ Customize the Policy File

The default config/sandbox-aws-policy.json grants broad permissions. Customize this for your use case!

Example: Purpose-built policy for S3 and DynamoDB only:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3Access",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"]
    },
    {
      "Sid": "AllowDynamoDB",
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:*:*:table/my-table"
    }
  ]
}

Hardcoded Explicit Denies

These actions are always denied regardless of your policy:

| Category | Denied Actions |
| --- | --- |
| IAM Users | iam:CreateUser, iam:DeleteUser, iam:CreateAccessKey |
| IAM Policies | iam:AttachUserPolicy, iam:PutUserPolicy, iam:PutRolePolicy |
| IAM Roles | iam:CreateRole, iam:DeleteRole, iam:AttachRolePolicy |
| Account | organizations:*, account:*, billing:* |
| Role Assumption | sts:AssumeRole (prevents lateral movement) |
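
In IAM policy evaluation an explicit deny overrides any allow, which is why these denies hold even when a custom policy is broad. The sketch below illustrates that precedence for action names only. It is a simplification that ignores resources, conditions, and every other part of real IAM evaluation, and `is_allowed` is a hypothetical helper rather than project code.

```python
from fnmatch import fnmatchcase
from typing import List

# Denied actions copied from the table above.
EXPLICIT_DENIES = [
    "iam:CreateUser", "iam:DeleteUser", "iam:CreateAccessKey",
    "iam:AttachUserPolicy", "iam:PutUserPolicy", "iam:PutRolePolicy",
    "iam:CreateRole", "iam:DeleteRole", "iam:AttachRolePolicy",
    "organizations:*", "account:*", "billing:*",
    "sts:AssumeRole",
]


def _matches(action: str, patterns: List[str]) -> bool:
    # IAM action names are case-insensitive, so compare in lowercase.
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def is_allowed(action: str, allowed_patterns: List[str]) -> bool:
    """Explicit deny wins; otherwise the action must match an allow pattern."""
    if _matches(action, EXPLICIT_DENIES):
        return False
    return _matches(action, allowed_patterns)


print(is_allowed("s3:GetObject", ["s3:GetObject", "s3:PutObject"]))  # True
print(is_allowed("iam:CreateUser", ["*"]))                           # False
```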

🌐 Runtime Subdomain Routing

When AI agents run applications (e.g., Flask, Node.js) inside the sandbox, they are accessible via dedicated runtime subdomains:

https://{port}-{convId}.runtime.{subdomain}.{domain}/

Example: https://5000-abc123def456.runtime.openhands.example.com/

| Feature | Benefit |
| --- | --- |
| Domain Root | Apps run at /, so internal routes work correctly |
| Cookie Isolation | Each runtime has isolated cookies |
| Security Headers | X-Frame-Options, CSP, X-XSS-Protection applied automatically |
| No Authentication | Runtime subdomains bypass Cognito (public within conversation) |

Architecture

User Browser
    ↓
https://5000-{convId}.runtime.openhands.example.com/
    ↓
CloudFront (matches *.runtime.* wildcard certificate)
    ↓
Lambda@Edge (viewer-request: parse subdomain, rewrite URI)
    ↓
ALB → OpenResty → Sandbox Discovery (DynamoDB) → User App
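
The "parse subdomain" step can be illustrated by splitting a runtime hostname into its port and conversation ID. This is a sketch under assumptions: the real Lambda@Edge function is not shown in this README, the conversation ID is assumed to be alphanumeric (as in the example URL), and `parse_runtime_host` is a hypothetical name.

```python
import re
from typing import NamedTuple, Optional

# {port}-{convId}.runtime.{subdomain}.{domain}; the convId charset is an assumption.
RUNTIME_HOST = re.compile(
    r"^(?P<port>\d{1,5})-(?P<conv_id>[A-Za-z0-9]+)\.runtime\.(?P<base>.+)$"
)


class RuntimeTarget(NamedTuple):
    port: int
    conv_id: str
    base_domain: str


def parse_runtime_host(host: str) -> Optional[RuntimeTarget]:
    """Return the port, conversation ID, and base domain, or None if no match."""
    match = RUNTIME_HOST.match(host)
    if match is None:
        return None
    port = int(match.group("port"))
    if not 1 <= port <= 65535:
        return None  # reject values outside the valid TCP port range
    return RuntimeTarget(port, match.group("conv_id"), match.group("base"))


print(parse_runtime_host("5000-abc123def456.runtime.openhands.example.com"))
```

Hosts that do not follow the pattern, such as the main openhands.example.com domain, return None and would be handled by the regular authenticated path.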

💾 Data Persistence

| Data Type | Storage | Persistence |
| --- | --- | --- |
| Conversation Metadata | Aurora PostgreSQL | Permanent (via RDS Proxy) |
| Conversation Events | S3 | Permanent (survives task replacement) |
| User Settings / Secrets | S3 | Permanent (KMS envelope encryption) |
| Workspace Files | EFS | Persistent (per-conversation access points) |
| SDK Conversation Cache | EFS | Persistent (enables LLM context restoration) |
| Sandbox Registry | DynamoDB | Permanent (task state, user ownership) |

Aurora Serverless v2: PostgreSQL 15.8, RDS Proxy connection pooling, 0.5-4 ACU auto-scaling, 35-day backups.

S3 Bucket: SSE-S3 encryption, versioning (30-day retention), RETAIN removal policy.

🔒 Security

  • Fargate tasks in private subnets only
  • Per-conversation EFS isolation via access points
  • All AWS service access via VPC Endpoints
  • IAM Roles with least privilege per service
  • Database credentials in Secrets Manager
  • RDS Proxy with TLS-encrypted connections
  • User secrets protected by KMS envelope encryption
  • Cognito authentication (30-day sessions)
  • Lambda@Edge header spoofing prevention
  • WAF protection with rate limiting
  • S3 and Aurora storage encryption

Session Management:

| Token Type | Validity | Description |
| --- | --- | --- |
| Access Token | 1 hour | API access token |
| ID Token | 1 day | Identity token (stored in cookie) |
| Refresh Token | 30 days | Used to obtain new tokens |

VPC Requirements

Your existing VPC must have:

  • At least 2 private subnets in different AZs
  • NAT Gateway for outbound internet access
  • DNS hostnames enabled
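
The first requirement can be checked mechanically once you have a list of your subnets. The sketch below assumes a simple, hypothetical input shape (a list of dicts with `az` and `private` keys that you would populate yourself, for example from your own inventory or from `aws ec2 describe-subnets` output); it is not part of this project.

```python
from typing import Dict, List


def has_multi_az_private_subnets(subnets: List[Dict]) -> bool:
    """True if private subnets exist in at least two different Availability Zones.

    The dict shape ({"az": str, "private": bool}) is a hypothetical
    simplification for illustration.
    """
    zones = {s["az"] for s in subnets if s.get("private")}
    return len(zones) >= 2


example = [
    {"az": "us-west-2a", "private": True},
    {"az": "us-west-2b", "private": True},
    {"az": "us-west-2a", "private": False},
]
print(has_multi_az_private_subnets(example))  # True
```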

CI/CD

| Workflow | Trigger | Description |
| --- | --- | --- |
| CI | Push/PR to main, develop | Build TypeScript, run all tests (Jest + pytest) |
| Security Scan | Push/PR to main, daily | npm audit, Checkov, git-secrets, Semgrep SAST, cfn-lint |

npm run test        # Run all tests
npm run test:ts     # TypeScript tests only
npm run test:py     # Python tests only
npm run test:ts -- -u  # Update snapshots

Useful Commands

npm run build       # Build TypeScript
npm run watch       # Watch for changes
npx cdk diff --all  # Show diff before deploy
npx cdk synth --all # Synthesize CloudFormation
npx cdk destroy --all  # Destroy all stacks

Troubleshooting

Common issues

VPC Lookup Fails: Ensure the VPC exists and your AWS credentials have ec2:DescribeVpcs permission.

Certificate Validation Pending: ACM certificates use DNS validation. Ensure the Hosted Zone is correctly configured.

Fargate Task Not Starting: Check CloudWatch Logs at /openhands/application for container startup errors. Check ECS service events for Fargate capacity issues.

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

AI Agent Skills

This project uses autonomous-dev-team skills for AI-assisted development with Claude Code, Kiro CLI, and Codex. Install after cloning:

npx skills add zxkane/autonomous-dev-team -s '*' -a claude-code -a kiro-cli -a codex -y

Or restore from the lock file:

npx skills experimental_install

These skills enforce TDD, git worktree isolation, PR workflows, and E2E testing. See CLAUDE.md for the full workflow.

License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.

This infrastructure project deploys OpenHands. See the OpenHands License for the main application.


If this project helps you deploy OpenHands, consider giving it a ⭐

Built with ❤️ using AWS CDK and OpenHands