Production-grade AWS infrastructure demonstrating enterprise networking, high availability, auto-scaling, and comprehensive monitoring.
- Project Overview
- Architecture
- Key Features
- Design Decisions
- Implementation
- Performance Metrics
- Security
- Monitoring
- Deliverables
- Challenges & Solutions
- Cost Analysis
- Future Improvements
- Getting Started
- Contributing
This project implements a production-grade 3-tier web application architecture on AWS, demonstrating enterprise networking patterns, high availability design, and modern DevOps practices.
| Layer | Technology |
|---|---|
| Frontend | Nginx (Reverse Proxy) |
| Backend | Apache Tomcat 9 (Java 11) |
| Database | RDS MySQL 8.0.39 (Multi-AZ) |
| Infrastructure | AWS (EC2, VPC, ALB, ASG, RDS, Transit Gateway) |
| CI/CD | GitHub Actions → SonarCloud → Nexus → CodeDeploy |
| Monitoring | CloudWatch (Metrics, Logs, Alarms, Dashboard) |
- Duration: 6 days (February 20-25, 2026)
- Total Cost: $46.40 ($7.73/day)
- Lines of Infrastructure Code: 200+ AWS Console configurations
- Tests Performed: 43 tests across 8 categories
┌─────────────────────────────────────────────────────────────────────┐
│ INTERNET │
└────────────────────────────┬────────────────────────────────────────┘
│
┌────────▼────────┐
│ Frontend ALB │ (Internet-Facing)
│ (us-east-1a/b) │
└────────┬────────┘
│
┌──────────────────┴──────────────────┐
│ │
┌─────▼─────┐ ┌────▼──────┐
│ Nginx-1 │ (Public Subnet AZ1a) │ Nginx-2 │ (Public Subnet AZ1b)
│ t3.micro │ │ t3.micro │
└─────┬─────┘ └────┬──────┘
│ │
└──────────────────┬──────────────────┘
│
┌────────▼────────┐
│ Backend ALB │ (Internal)
│ (us-east-1a/b) │
└────────┬────────┘
│
┌──────────────────┴──────────────────┐
│ │
┌─────▼──────┐ ┌────▼───────┐
│ Tomcat-1 │ (Private Subnet AZ1a) │ Tomcat-2 │ (Private Subnet AZ1b)
│ t3.micro │ │ t3.micro │
└─────┬──────┘ └────┬───────┘
│ │
└──────────────────┬──────────────────┘
│
┌────────▼─────────┐
│ RDS MySQL │
│ Multi-AZ │
│ (db.t3.micro) │
│ Primary: AZ1a │
│ Standby: AZ1b │
└──────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ MANAGEMENT VPC (172.32.0.0/16) │
│ │
│ ┌──────────────┐ Transit Gateway ┌──────────┐ │
│ │ Bastion Host │◄───────────────┬─────────────────►│ AppVPC │ │
│ │ (t2.micro) │ │ │ │ │
│ └──────────────┘ │ └──────────┘ │
│ ▲ │ │
│ │ │ │
│ SSH from Laptop └─ Secure Admin Access │
└─────────────────────────────────────────────────────────────────────┘
AppVPC (192.168.0.0/16):
- Public Subnets: 192.168.1.0/24, 192.168.2.0/24 (Nginx instances)
- Private App Subnets: 192.168.3.0/24, 192.168.4.0/24 (Tomcat instances)
- Private DB Subnets: 192.168.5.0/24, 192.168.6.0/24 (RDS MySQL)
- Availability Zones: us-east-1a, us-east-1b
ManagementVPC (172.32.0.0/16):
- Bastion Subnet: 172.32.1.0/24
- Purpose: Secure admin access via Transit Gateway
- Multi-AZ Deployment across 2 availability zones
- RDS Multi-AZ with automatic failover (tested: 2 min 8 sec)
- Auto-healing instances (replacement in 6 minutes)
- No single points of failure in application path
- Auto Scaling Groups (2-4 instances per tier)
- Dynamic scaling based on CPU utilization
- Load distribution across multiple instances
- Tested capacity: 200,000 users/day
- Defense-in-depth: 4 security layers (VPC, Subnets, Security Groups, IAM)
- Private subnets for application and database tiers
- Bastion host for secure admin access
- Least privilege IAM roles and security group rules
- CloudWatch Dashboard with 5 widgets
- 6 proactive alarms with email notifications
- Comprehensive metrics (CPU, memory, response time, request count)
- Health checks at ALB and ASG levels
- Automated deployment: GitHub → Actions → SonarCloud → Nexus → CodeDeploy
- Quality gates (SonarCloud analysis on every deployment)
- Rolling deployments (zero downtime)
- Artifact management (versioned deployments to Nexus)
- Deployment Speed: 85 seconds end-to-end (code push → live production)
Pipeline Breakdown:
Build & Test: 53 seconds (includes SonarCloud analysis)
Artifact Upload: 8 seconds (Nexus repository)
AWS Configuration: 1 second
S3 Upload: 6 seconds
CodeDeploy Trigger: 2 seconds
Instance Deployment: 7-10 seconds per instance
────────────────────────────────────────────────
Total: 85 seconds (1 min 25 sec)
Need: 3-tier Java app requiring HA, auto-scaling, and enterprise networking
Decision: AWS provides mature RDS Multi-AZ (automatic failover), native ASG+ALB integration (self-healing), and Transit Gateway (enterprise networking demonstration).
vs Others: Azure better for .NET; GCP better for containers. AWS best for traditional infrastructure patterns.
Need: Tomcat-based Java app with persistent connections
Decision: EC2 fits traditional Tomcat architecture. Lambda has 3-5s cold starts (poor UX). Containers deferred to Round 3 (learn infrastructure first, then abstraction).
Trade-off: More operational overhead but deeper infrastructure understanding.
Need: Relational data with HA requirement
Decision: MySQL for structured relationships. Multi-AZ for automatic failover.
Trade-off: 2x cost ($30 vs $15/month) but 2-minute automatic recovery vs 20-30 minute manual restore.
vs Others: DynamoDB adds NoSQL complexity; Aurora costs 2x for features not needed at this scale.
| Decision | Cost | Benefit | Justification |
|---|---|---|---|
| Multi-AZ RDS | +$15/mo | 2-min failover | Worth it: Automatic recovery vs 20-min manual |
| 2 NAT Gateways | +$32/mo | Per-AZ independence | Eliminated in Round 2 for cost optimization |
| Transit Gateway | +$36/mo | Enterprise networking | Demonstrates complex patterns; simplified in Round 2 |
| 2 ALBs vs 1 | +$20/mo | Security isolation | Backend completely isolated from internet |
- Created 2 VPCs with Transit Gateway connectivity
- Configured 7 subnets across 2 availability zones
- Set up NAT Gateways for private subnet internet access
- Implemented security groups with least privilege
- Created Launch Templates with User Data scripts
- Configured Auto Scaling Groups (2-4 instances per tier)
- Installed Nginx (frontend) and Tomcat (backend)
- Integrated CloudWatch Agent for custom metrics
- Deployed public-facing ALB for Nginx tier
- Deployed internal ALB for Tomcat tier
- Configured health checks (
/healthendpoint) - Set up target groups with proper routing
- Provisioned RDS MySQL Multi-AZ (db.t3.micro)
- Created DB subnet group in private subnets
- Configured automated backups (7-day retention)
- Tested automatic failover (2 min 8 sec)
- Set up Nexus Repository on EC2
- Configured GitHub Actions workflow
- Integrated SonarCloud for code quality
- Deployed CodeDeploy for automated deployments
Rolling Deployment via Auto Scaling Groups:
- Update Launch Template with new configuration
- ASG terminates one old instance
- ASG launches one new instance
- New instance passes health checks (2 consecutive successes)
- ALB routes traffic to new instance
- Repeat for remaining instances
- Result: Zero downtime deployment
Test Configuration:
- Tool: Artillery
- Duration: ~6 minutes per run
- Virtual Users: 4,500
- Requests: ~9,000 HTTP requests at 23 req/sec
- Test Phases: Warm-up (5 users/sec) → Sustained (10 users/sec) → Peak (20 users/sec)
Results: Baseline vs Optimized
| Metric | Baseline (Before) | Optimized (After) | Improvement | Assessment |
|---|---|---|---|---|
| Success Rate | 98.8% (8,860/8,963) | 100% (9,000/9,000) | +1.2% | Perfect |
| Timeouts (ETIMEDOUT) | 103 errors | 0 errors | 100% eliminated | Excellent |
| Response Time (Min) | 195 ms | 194 ms | Consistent | |
| Response Time (Median) | 247 ms | 223 ms | -9.7% faster | Excellent |
| Response Time (Mean) | 379.7 ms | 231.8 ms | -38.9% faster | Excellent |
| Response Time (p95) | 757.6 ms | 262.5 ms | -65.3% faster | Excellent |
| Response Time (p99) | 4,770.6 ms | 314.2 ms | -93.4% faster | Outstanding |
| Response Time (Max) | 8,901 ms | 2,671 ms | -70% reduction | |
| Throughput | 23 req/sec | 23 req/sec | Consistent |
What Changed Between Tests:
- Auto-Scaling Policy: Added dual-metric scaling (CPU + ALB RequestCountPerTarget)
- Scale-Out Threshold: RequestCountPerTarget > 500 for 3 datapoints
- Result: System now scales proactively based on actual request pressure, not just CPU
Key Achievement:
p99 latency reduced by 93.4% (4,770ms → 314ms) — Nearly eliminated slow outliers that cause user abandonment. Zero timeouts under sustained load proves the system can handle peak traffic without connection exhaustion.
Capacity Analysis:
- Current: ~200,000 users/day (2 instances per tier)
- Scaled (4 instances): ~400,000 users/day
- Cost per User: $0.00048/user
RDS Multi-AZ Failover Test:
Failover Initiated: 14:35:22 UTC
Status "Rebooting": 14:35:45 UTC (+23 sec)
Status "Failing-over": 14:36:12 UTC (+50 sec)
Status "Available": 14:37:18 UTC (+1 min 56 sec)
Application Recovery: 14:37:30 UTC (+2 min 8 sec)
Zero data loss (synchronous replication)
Zero manual intervention
Automatic DNS update
Auto-Healing Test:
Instance Failure: 00:00 (Tomcat stopped)
Unhealthy Detection: 01:00 (+1 min, 2 failed health checks)
ASG Termination: 02:00 (+2 min)
New Instance Launch: 02:15 (+2 min 15 sec)
New Instance Healthy: 06:30 (+6 min 30 sec)
Automatic replacement
No user impact (other instance serving traffic)
Zero manual intervention
Auto-Scaling Test:
Test Setup:
- Tool: Artillery (from EC2 inside VPC)
- Target: Internal Backend ALB
- Load: 50 arrivals/second for 5 minutes
- Policy: Target Tracking (RequestCountPerTarget > 500)
Results:
High Load Generated: 11:20 UTC (50 req/sec)
Scale-Out Triggered: 11:23 UTC (+3 min)
New Instances Launched: 11:23 UTC (2 → 4 instances)
CPU Before Scale: 36%
CPU After Scale: <10% (load distributed)
Result: 100% capacity increase
Dual-metric scaling (CPU + RequestCountPerTarget)
Proactive scaling based on actual request pressure
70% CPU reduction proves effective load distribution
- Target: 99.95% uptime (21.9 minutes downtime/month)
- Achievable: 99.88% uptime (52 minutes downtime/month)
- Gap: Deployment downtime, scaling delays (can be closed with blue/green deployments and pre-baked AMIs)
Layer 1: VPC Isolation
- 2 separate VPCs (Management + Application)
- Transit Gateway for controlled cross-VPC communication
- Network segmentation by function
Layer 2: Subnet Segmentation
- Public subnets: Internet-facing services only (Nginx)
- Private app subnets: Application logic (Tomcat)
- Private DB subnets: Database (RDS MySQL)
Layer 3: Security Groups (Stateful Firewall)
Frontend-SG:
Inbound: HTTP (80) from 0.0.0.0/0, SSH (22) from 172.32.0.0/16
Outbound: All traffic
Backend-SG:
Inbound: TCP (8080) from Frontend-SG, SSH (22) from 172.32.0.0/16
Outbound: All traffic
Database-SG:
Inbound: MySQL (3306) from Backend-SG only
Outbound: None
Layer 4: IAM Roles (Least Privilege)
- EC2-SSM-Role: SSM access, CloudWatch logs/metrics, CodeDeploy artifacts only
- CodeDeployServiceRole: Describe instances, register/deregister from ALB only
- No permissions: Terminate instances, modify infrastructure, IAM changes
Admin Access Pattern:
Laptop → Bastion Host (ManagementVPC) → Transit Gateway → Private Instances
Benefits:
- Single point of entry (audit trail)
- IP whitelisting (only your IP)
- Key-based authentication (no passwords)
- Jump host pattern (can't directly SSH to private instances)
5 Widgets Configured:
- Frontend Metrics: CPU, Network, Status Checks (Nginx ASG)
- Backend Metrics: CPU, Network, Status Checks (Tomcat ASG)
- ALB Performance: RequestCount, TargetResponseTime, HTTPCode_2XX/5XX
- RDS Performance: CPUUtilization, DatabaseConnections, FreeableMemory, Latency
- Target Health: HealthyHostCount, UnhealthyHostCount (both tiers)
| Alarm | Metric | Threshold | Action |
|---|---|---|---|
| Backend-Unhealthy-Targets | UnhealthyHostCount | ≥ 1 | SNS Email |
| Frontend-Unhealthy-Targets | UnhealthyHostCount | ≥ 1 | SNS Email |
| Backend-High-CPU | CPUUtilization | > 80% (2 periods) | SNS Email |
| Frontend-High-CPU | CPUUtilization | > 80% (2 periods) | SNS Email |
| RDS-High-CPU | CPUUtilization | > 75% (2 periods) | SNS Email |
| Backend-High-Response-Time | TargetResponseTime | > 2 seconds | SNS Email |
All alarms tested and confirmed working
ALB Health Checks:
- Protocol: HTTP
- Path:
/health(Nginx),/(Tomcat) - Interval: 30 seconds
- Timeout: 5 seconds
- Healthy/Unhealthy Threshold: 2 consecutive checks
ASG Health Checks:
- Type: ELB (detects application failures, not just EC2 status)
- Grace Period: 300 seconds (allows User Data to complete)
Problem: Created NAT Gateways in private subnets → instances couldn't download packages
Root Cause: NAT Gateways need internet access themselves; must be in public subnets
Solution: Deleted and recreated NAT Gateways in public subnets with Elastic IPs
Problem: Transit Gateway configured, routes added, but ping still failed
Root Cause: Network ACL had default DENY rule blocking subnet-level traffic
Solution: Changed Network ACL from DENY to ALLOW
Lesson: Security Groups (instance-level) + Network ACLs (subnet-level) both must allow traffic
Problem: Nginx running, port 80 open, but ALB marked instances unhealthy
Root Cause: Health check path was / (slow, proxies to backend) instead of /health (fast)
Solution: Changed Target Group health check path to /health endpoint
Lesson: Health checks should be fast and independent (not depend on backend services)
Impact: Instances became healthy in 2 minutes after fix
| Component | Daily Cost | Total (6 days) | % of Total |
|---|---|---|---|
| NAT Gateway (2) | $2.20 | $13.20 | 28.5% |
| Transit Gateway | $1.20 | $7.20 | 15.5% |
| Load Balancers (2) | $1.33 | $7.98 | 17.2% |
| RDS Multi-AZ | $1.00 | $6.00 | 12.9% |
| EC2 Instances (4) | $0.60 | $3.60 | 7.8% |
| Nexus (t3.medium) | $0.80 | $4.80 | 10.4% |
| Bastion (t2.micro) | $0.13 | $0.78 | 1.7% |
| CloudWatch | $0.05 | $0.30 | 0.6% |
| TOTAL | $7.73/day | $46.40 | 100% |
24/7 Operation: $232/month
Optimized (stop when not working): $55-80/month
With Reserved Instances (1-year): $143/month (38% savings)
Round 2 Simplified Architecture: $50/month (eliminates NAT/TG)
- NAT Gateways (30%): $65/month → Eliminated in Round 2
- Transit Gateway (17%): $36/month → Eliminated in Round 2
- Load Balancers (17%): $40/month → Essential, can't optimize much
- Current: $0.00048 per user/day
- At 1M users/day: $0.00023 per user
- Competitive with Heroku ($0.00083/user), AWS Fargate ($0.00031/user)
Priority: High
- Add SSL/TLS certificate (ACM + HTTPS redirect)
- Implement AWS Secrets Manager for credentials
- Enable WAF with OWASP Top 10 rules
- Enable RDS encryption at rest
Priority: High
- Rebuild with Terraform (Round 2)
- Version control infrastructure
- Faster deployment (5 min vs 3 days)
Priority: Medium
- Pre-bake AMI (reduce scaling time: 7.5 → 3.5 min)
- Add Redis caching (reduce DB load 80%)
- Implement CloudFront CDN (global distribution)
Priority: Future
- Migrate to ECS Fargate (Round 3)
- Faster scaling (< 1 minute)
- Modern cloud-native architecture
- AWS Account with billing enabled
- AWS CLI v2 installed and configured
- Basic understanding of AWS services
- SSH key pair for EC2 access
round-1-aws-3tier/
├── README.md
├── docs/
│ ├── architecture-diagram.png
│ ├── network-diagram.png
│ ├── Round-1-Complete-Testing-Documentation.md
│ └── Round-1-Complete-Lessons-Learned-Guide.md
├── scripts/
│ ├── nginx-user-data.sh
│ ├── tomcat-user-data.sh
│ └── cloudwatch-agent-config.json
├── screenshots/
│ ├── cloudwatch-dashboard.png
│ ├── healthy-targets.png
│ ├── load-test-results.png
│ └── rds-multi-az.png
└── ci-cd/
├── .github/workflows/deploy.yml
├── appspec.yml
└── deployment-scripts/
-
Clone Repository
git clone https://github.com/yourusername/round-1-aws-3tier.git cd round-1-aws-3tier -
Review Documentation
- Read architecture overview
- Understand network design
- Review security considerations
-
Deploy Infrastructure
- Follow AWS Console setup (Round 1)
- Or use Terraform (Round 2)
-
Test Deployment
- Run load tests
- Test failover scenarios
- Verify monitoring
-
Clean Up
- Delete resources to avoid charges
- Follow deletion order in documentation
Contributions, issues, and feature requests are welcome!
- Fork the repository
- Create your feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
| Metric | Value |
|---|---|
| Success Rate | 100% (optimized) |
| p99 Response Time | 314 ms (was 4,771ms, 93% improvement) |
| Mean Response Time | 232 ms (was 380ms, 39% improvement) |
| Timeouts | 0 (eliminated 100%) |
| Failover Time | ~60 sec (avg of 3 tests) |
| Auto-Scaling | 2 → 4 instances (100% capacity increase) |
| Throughput | 23 req/sec sustained |
| Deployment Speed | 85 seconds (end-to-end) |
| User Capacity | 200k-400k users/day |
| Uptime Target | 99.95% |
| Cost | $7.73/day |
⭐ Star this repo if you found it helpful!
Made with ❤️






