Is this related to an existing feature request or issue?
No response
Summary
The AWS Resiliency Plugin gives your AI agent a failure-mode-first approach to architecture reviews. Share your IaC or application code, and the agent reviews it across 7 resiliency domains — surfacing single points of failure, missing failover configurations, and application-level gaps that cause production incidents.
Use case
Target users: Cloud Engineers, SREs, Platform Engineers, and Cloud Architects preparing production deployments, operational readiness reviews (ORRs), game-day exercises, or Well-Architected Reviews (WARs).
Key use cases:
- Pre-production hardening: Review a CDK/Terraform stack before it goes to production: find the missing Multi-AZ, the absent DLQ, the Lambda timeout that will silently drop requests.
- ORR preparation: Structured review across all seven resiliency domains to validate readiness criteria before launch.
- Incident post-mortem: Share the IaC of a service that just had an outage, and the agent finds the architectural gap that caused it.
- Game-day planning: Identify blast radius and failure modes to design meaningful chaos experiments.
- WAR prep: Map findings to Well-Architected REL pillar questions with evidence from the actual code.
### Before / After user experience
Before (without the plugin):
User: Is this CDK stack resilient?
Agent: Here are some general resiliency best practices:
1. Use Multi-AZ deployments
2. Implement retry logic
3. Consider disaster recovery
[...10 generic bullet points with no reference to the actual code...]
After (with aws-resiliency):
User: Is this CDK stack resilient?
[pastes CDK TypeScript stack]
Agent: Two-layer resiliency review complete across 7 domains.
| Domain | Finding | Severity | Blast Radius | RTO Impact | Fix |
|-------------|--------------------------------------|----------|---------------------------|-------------|----------------------------------------|
| COMPUTE | ECS desiredCount=1, no multi-AZ | Critical | Full service outage | ~5 min | desiredCount≥2 + AZ placement strategy |
| DATA | RDS missing multiAz | Critical | Read+write outage | 10–30 min | multiAz: true |
| COMPUTE | Lambda timeout=3s on payments fn | High | Silent 504s to ALB | Per-request | timeout: 29 (ALB max) |
| MESSAGING | SQS consumer missing DLQ | High | Message loss on failure | N/A | deadLetterQueue, maxReceiveCount: 3 |
| NETWORKING | Single NAT Gateway (one AZ) | Medium | Outbound loss if AZ fails | ~5 min | One NAT GW per AZ |
Layer 2 (application code): RDS client is correctly initialised outside the handler, but connection pool maxConnections=10 will be exhausted during the RDS failover reconnection storm. Recommend maxConnections=2 with exponential backoff on connect.
Top risk: RDS without Multi-AZ + ECS single-task together mean any AZ failure causes a full outage with manual recovery. Fix both before production.
Well-Architected mapping: REL 6 (REL_FAULT_ISOLATION), REL 9 (REL_MITIGATE_BULK_FAILURES).
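To make the Layer 1 findings in the table above more concrete, here is a minimal sketch of how such checks could be expressed as plain rules. It is not the plugin's implementation: `StackSummary`, `Finding`, and `reviewStack` are hypothetical names introduced for illustration, and the severities and fixes are simply copied from the example table.

```typescript
// Illustrative types only: not part of AWS CDK or of the proposed plugin.
type Severity = "Critical" | "High" | "Medium";
type Domain = "COMPUTE" | "DATA" | "MESSAGING" | "NETWORKING";

interface Finding {
  domain: Domain;
  finding: string;
  severity: Severity;
  fix: string;
}

// Hypothetical summary of the settings a reviewer would pull out of a stack.
interface StackSummary {
  ecsDesiredCount: number;
  rdsMultiAz: boolean;
  sqsHasDeadLetterQueue: boolean;
  natGatewayCount: number;
  availabilityZoneCount: number;
}

const SEVERITY_RANK: Record<Severity, number> = { Critical: 0, High: 1, Medium: 2 };

// Apply each rule in turn and return the findings, most severe first.
function reviewStack(stack: StackSummary): Finding[] {
  const findings: Finding[] = [];

  if (stack.ecsDesiredCount < 2) {
    findings.push({
      domain: "COMPUTE",
      finding: `ECS desiredCount=${stack.ecsDesiredCount}, no multi-AZ`,
      severity: "Critical",
      fix: "desiredCount>=2 + AZ placement strategy",
    });
  }
  if (!stack.rdsMultiAz) {
    findings.push({
      domain: "DATA",
      finding: "RDS missing multiAz",
      severity: "Critical",
      fix: "multiAz: true",
    });
  }
  if (!stack.sqsHasDeadLetterQueue) {
    findings.push({
      domain: "MESSAGING",
      finding: "SQS consumer missing DLQ",
      severity: "High",
      fix: "deadLetterQueue, maxReceiveCount: 3",
    });
  }
  if (stack.natGatewayCount < stack.availabilityZoneCount) {
    findings.push({
      domain: "NETWORKING",
      finding: `${stack.natGatewayCount} NAT Gateway(s) for ${stack.availabilityZoneCount} AZs`,
      severity: "Medium",
      fix: "One NAT GW per AZ",
    });
  }

  return findings.sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
}
```

Calling `reviewStack` with `ecsDesiredCount: 1`, `rdsMultiAz: false`, no DLQ, and one NAT Gateway across two AZs yields four findings ordered Critical, Critical, High, Medium, mirroring the rows of the example table apart from the Lambda timeout row, which this sketch does not model.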
Potential Challenges
- IaC language detection in monorepos: Multi-file CDK projects with multiple stacks require heuristic detection. Mitigation: SKILL.md instructs the agent to ask the user to identify the main stack file if auto-detection is ambiguous.
- Two-layer correlation: Connecting an IaC finding (RDS Multi-AZ enabled) with an application finding (connection pool not handling the 60s DNS failover window) requires both layers to be present. Mitigation: SKILL.md instructs the agent to flag missing layers and deliver a partial review with explicit caveats.
- Well-Architected Framework (WAF) question ID currency: REL pillar question IDs change across framework versions. Mitigation: References use stable question titles rather than volatile IDs; aws-documentation-mcp-server fetches current content when available.
- Reference file size: Domain reference files covering IaC checks, application code checks, and failure modes will approach the 100-line guideline. Mitigation: Files load only when the relevant domain is detected, so a Lambda-only review never loads multi-region-dr.md. SKILL.md stays under 200 lines.
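The two-layer correlation item above refers to a connection pool that has to ride out a 60s DNS failover window. As a rough worked example of what "exponential backoff on connect" means for that window, the sketch below computes a capped backoff schedule and counts how many retries are needed before the cumulative wait spans it. The function names and the 500 ms base and 10 s cap are assumptions chosen for illustration, not values from the plugin or from AWS documentation.

```typescript
// Capped exponential backoff: baseMs * 2^attempt, never more than capMs.
// `attempt` is 0-based. All parameter values used here are illustrative.
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// "Full jitter" variant: a random delay in [0, capped delay), so that many
// clients reconnecting after a failover do not retry in lockstep.
function jitteredDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * backoffDelayMs(attempt, baseMs, capMs));
}

// How many retries are needed before the summed (un-jittered) delays
// cover a failover window of `windowMs`?
function retriesToCover(windowMs: number, baseMs: number, capMs: number): number {
  if (baseMs <= 0 || capMs <= 0) {
    throw new Error("baseMs and capMs must be positive");
  }
  let waitedMs = 0;
  let attempt = 0;
  while (waitedMs < windowMs) {
    waitedMs += backoffDelayMs(attempt, baseMs, capMs);
    attempt++;
  }
  return attempt;
}

// With a 500 ms base and a 10 s cap the delays are
// 500, 1000, 2000, 4000, 8000, then 10000 repeatedly.
const retriesFor60sWindow = retriesToCover(60000, 500, 10000); // 10
```

Under these assumed values a client needs a retry budget of about ten attempts to outlast the window, which is the kind of application-level detail the second review layer would compare against the IaC finding.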
Proposal
Plugin structure
MCP server dependencies
awslabs.aws-documentation-mcp-server is the only required server: it provides Well-Architected REL pillar grounding for all findings. The IaC and Terraform servers are conditionally invoked based on detected input format. The knowledge server is invoked when the agent needs to confirm specific service behaviours (e.g., exact RDS DNS failover propagation window, DynamoDB eventual consistency semantics under partition).
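The conditional invocation described above could look roughly like the following sketch. The server names come from this RFC; the file-name heuristics (`.tf`, `cdk.json`, `.template.json`/`.template.yaml`) and the function names are assumptions made for illustration and are not taken from the plugin. More than one detected format corresponds to the ambiguous monorepo case, where the agent would ask the user which stack to review.

```typescript
type IacFormat = "cdk" | "cloudformation" | "terraform";

// Assumed file-name heuristics; a real detector would likely inspect content too.
const FORMAT_PATTERNS: Array<[IacFormat, RegExp]> = [
  ["terraform", /\.tf$/i],
  ["cdk", /(^|\/)cdk\.json$/i],
  ["cloudformation", /\.template\.(json|ya?ml)$/i],
];

// Return every format that at least one file name matches.
function detectIacFormats(fileNames: string[]): IacFormat[] {
  return FORMAT_PATTERNS
    .filter(([, pattern]) => fileNames.some((name) => pattern.test(name)))
    .map(([format]) => format);
}

// The documentation server is always required; the others depend on the input.
// The knowledge server is left out because it is invoked on demand.
function serversFor(formats: IacFormat[]): string[] {
  const servers = ["awslabs.aws-documentation-mcp-server"];
  if (formats.indexOf("cdk") !== -1 || formats.indexOf("cloudformation") !== -1) {
    servers.push("awslabs.aws-iac-mcp-server");
  }
  if (formats.indexOf("terraform") !== -1) {
    servers.push("awslabs.terraform-mcp-server");
  }
  return servers;
}

// More than one format means auto-detection is ambiguous.
function isAmbiguous(formats: IacFormat[]): boolean {
  return formats.length > 1;
}
```

For a file list such as `["cdk.json", "lib/app-stack.ts"]` this selects the documentation and IaC servers only; a repository containing both `main.tf` and `cdk.json` is reported as ambiguous.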
Dependencies and Integrations
MCP dependencies (all from AWS Labs):
- awslabs.aws-documentation-mcp-server: Well-Architected REL pillar, service docs, SLAs
- awslabs.aws-iac-mcp-server: CDK + CloudFormation validation, cfn-lint, cfn-guard
- awslabs.terraform-mcp-server: Terraform provider/module validation
- awslabs.aws-knowledge-mcp-server: Architecture best practices, cross-service integration patterns, and service-specific failure behaviour FAQs; invoked on demand for service behaviour lookups

Integration with existing plugins:
- deploy-on-aws: deploy-on-aws generates dev-sized IaC; aws-resiliency reviews it for production hardening. Recommended workflow: generate → review → harden → redeploy.
- aws-observability (RFC #67, "Add AWS Observability plugin"): Resiliency review surfaces missing CloudWatch alarms, absent X-Ray tracing, and health check gaps. These findings feed directly into aws-observability workflows.

Reference implementation: A working version of the skill and reference files is available at https://github.com/nirmal84/aws-resiliency-plugin