cloud-aws
Use this skill when architecting on AWS, selecting services, optimizing costs, or following the Well-Architected Framework. Triggers on EC2, S3, Lambda, RDS, DynamoDB, CloudFront, IAM, VPC, ECS, EKS, SQS, SNS, API Gateway, and any task requiring AWS architecture decisions, service selection, or cost management.
What is cloud-aws?
cloud-aws is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers architecting on AWS, selecting services, optimizing costs, and following the Well-Architected Framework.
Quick Facts
| Field | Value |
|---|---|
| Category | cloud |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
  npx skills add AbsolutelySkilled/AbsolutelySkilled --skill cloud-aws
- The cloud-aws skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
A practical guide to building production systems on AWS following the Well-Architected Framework. This skill covers service selection, VPC design, IAM least-privilege, serverless patterns, cost optimization, and monitoring - with an emphasis on when to use each service, not just how. Designed for engineers who know AWS basics and need opinionated guidance on trade-offs and common pitfalls.
Tags
aws cloud infrastructure serverless well-architected
Frequently Asked Questions
What is cloud-aws?
Use this skill when architecting on AWS, selecting services, optimizing costs, or following the Well-Architected Framework. Triggers on EC2, S3, Lambda, RDS, DynamoDB, CloudFront, IAM, VPC, ECS, EKS, SQS, SNS, API Gateway, and any task requiring AWS architecture decisions, service selection, or cost management.
How do I install cloud-aws?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill cloud-aws in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support cloud-aws?
This skill works with claude-code, gemini-cli, and openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
AWS Cloud Architecture
A practical guide to building production systems on AWS following the Well-Architected Framework. This skill covers service selection, VPC design, IAM least-privilege, serverless patterns, cost optimization, and monitoring - with an emphasis on when to use each service, not just how. Designed for engineers who know AWS basics and need opinionated guidance on trade-offs and common pitfalls.
When to use this skill
Trigger this skill when the user:
- Chooses between AWS compute options (EC2, ECS, Fargate, Lambda, App Runner)
- Designs or reviews a VPC, subnet, or security group setup
- Needs IAM roles, policies, or permission boundaries
- Architects a serverless application (API Gateway + Lambda + DynamoDB)
- Asks about cost reduction, Reserved Instances, Savings Plans, or right-sizing
- Sets up CloudWatch alarms, dashboards, or log insights
- Selects a database service (RDS, Aurora, DynamoDB, ElastiCache)
- Plans multi-region or high-availability architecture
Do NOT trigger this skill for:
- General Linux/shell scripting unrelated to AWS
- Kubernetes internals that are cloud-agnostic (use a k8s skill instead)
Key principles
Operational excellence - Automate everything that can be automated. Infrastructure-as-code (CloudFormation, CDK, Terraform) is not optional. Every change should be reviewable, reproducible, and reversible. Run post-incident reviews and feed learnings back into runbooks.
Security - Apply least-privilege IAM everywhere. No `*` actions in production policies. Encrypt data at rest (KMS) and in transit (TLS). Treat every AWS account boundary as a trust boundary. Use VPC endpoints to keep traffic off the public internet where possible.
Reliability - Design for multi-AZ by default. Use health checks, auto-scaling, and managed services that handle failure transparently. Define Recovery Time Objective (RTO) and Recovery Point Objective (RPO) before choosing a database tier.
Performance efficiency - Right-size before you scale out. Understand the access patterns of your workload and match them to the service that handles them natively (e.g., DynamoDB for key-value at scale, Aurora for relational OLTP). Use CloudFront and edge caching to reduce origin load.
Cost optimization - Cost is an architecture decision, not an afterthought. Tag every resource. Use Cost Explorer weekly. Commit to Reserved Instances or Savings Plans for stable workloads. Delete idle resources aggressively.
Core concepts
Regions and Availability Zones
A region is a geographic area with multiple isolated data centers. Each region contains at least 3 Availability Zones (AZs) - physically separate facilities with independent power and networking. Deploy stateful services across 2+ AZs for high availability. Some services (IAM, CloudFront, Route 53) are global; most are regional, including S3, where each bucket lives in a single region even though bucket names are globally unique.
IAM model
IAM has four building blocks:
| Concept | What it is |
|---|---|
| Principal | Who is acting (user, role, service) |
| Policy | JSON document defining allowed/denied actions |
| Role | Identity assumed by services or users (no long-term credentials) |
| Trust policy | Who is allowed to assume a role |
The golden rule: use roles, not users. EC2 instances, Lambda functions, and ECS tasks all assume roles at runtime. Never embed access keys in code or AMIs.
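To make the role and trust policy rows concrete, here is a minimal sketch of a trust policy that lets the Lambda service assume a role. The function name `lambda_trust_policy` is hypothetical; the document structure (`Principal`, `sts:AssumeRole`) follows the standard IAM policy grammar.

```python
import json


def lambda_trust_policy() -> dict:
    """Build a trust policy that allows the Lambda service to assume the role.

    The helper name is illustrative; only the returned document matters.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Who may assume the role: the Lambda service principal.
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy_json = json.dumps(lambda_trust_policy(), indent=2)
print(trust_policy_json)
```

The trust policy only says who can assume the role. What the role may do once assumed is defined separately by the permission policies attached to it.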
Compute spectrum
Control / Cost                    Managed / Speed
<------------------------------------------>
EC2 -> ECS on EC2 -> ECS Fargate -> Lambda -> App Runner
- EC2 - full OS control, GPU support, long-running workloads
- ECS on EC2 - containerized, you manage the host fleet
- ECS Fargate - containerized, AWS manages hosts (preferred default)
- Lambda - event-driven, sub-second billing, 15-min max duration
- App Runner - HTTP services from container or source, zero infra management
Storage tiers
| Service | Use case |
|---|---|
| S3 Standard | Frequently accessed objects |
| S3 Intelligent-Tiering | Unpredictable access patterns |
| S3 Glacier Instant Retrieval | Archives needing millisecond retrieval |
| EBS | Block storage attached to EC2 |
| EFS | Shared POSIX filesystem across multiple EC2s |
Networking primitives
A VPC is a logically isolated network. Inside it, each subnet lives in a single AZ. Public subnets have a route to an Internet Gateway; private subnets do not. Security groups are stateful firewalls attached to ENIs (inbound traffic is denied by default, outbound is allowed by default). NACLs are stateless subnet-level firewalls (less common). Use VPC endpoints to reach AWS services (S3, DynamoDB, SQS) without traversing the internet.
Common tasks
Choose the right compute service
| Workload type | Recommended service | Why |
|---|---|---|
| Long-running stateful app, GPU needed | EC2 | Full OS control, persistent storage |
| Containerized microservice, >15 min tasks | ECS Fargate | No host management, predictable billing |
| Event-driven, short tasks (<15 min) | Lambda | Pay-per-invocation, auto-scales to zero |
| HTTP API from container, zero-ops | App Runner | Automated deployments, TLS, scaling |
| Large-scale batch processing | AWS Batch on Fargate | Managed job queues, spot support |
| Kubernetes required | EKS | When you need k8s primitives or portability |
Decision rule: start with Lambda or Fargate. Move to EC2 only when you need control over the OS, persistent GPU, or a runtime Lambda does not support.
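The decision rule can be sketched as a small function. This is an illustration of the table above, not an exhaustive selector: the `Workload` fields and the `choose_compute` name are assumptions made for the example, and App Runner and AWS Batch are left out for brevity.

```python
from dataclasses import dataclass

LAMBDA_MAX_MINUTES = 15  # Lambda's maximum duration, as stated above


@dataclass
class Workload:
    # All field names are hypothetical, chosen for this sketch.
    needs_os_or_gpu: bool = False
    needs_kubernetes: bool = False
    event_driven: bool = False
    max_runtime_minutes: float = 1.0


def choose_compute(w: Workload) -> str:
    """Apply the rule: Lambda or Fargate first, EC2 only when control is needed."""
    if w.needs_os_or_gpu:
        return "EC2"
    if w.needs_kubernetes:
        return "EKS"
    if w.event_driven and w.max_runtime_minutes < LAMBDA_MAX_MINUTES:
        return "Lambda"
    # Everything else defaults to containers without host management.
    return "ECS Fargate"


print(choose_compute(Workload(event_driven=True, max_runtime_minutes=2)))
print(choose_compute(Workload(max_runtime_minutes=45)))
```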
Design a VPC with public/private subnets
A standard 3-tier VPC layout:
VPC 10.0.0.0/16
  Public subnets (10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24) - one per AZ
    - Internet Gateway route
    - Load balancers, NAT Gateways, bastion hosts
  Private subnets (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24) - one per AZ
    - Application servers, ECS tasks, Lambda (VPC-attached)
    - Route outbound through NAT Gateway in the public subnet
  Database subnets (10.0.20.0/24, 10.0.21.0/24, 10.0.22.0/24) - one per AZ
    - RDS, ElastiCache
    - No internet route at all
CIDR planning rules:
- Use `/16` for the VPC to leave room for growth
- Use `/24` per subnet (251 usable IPs - AWS reserves 5 per subnet)
- Reserve CIDR ranges to avoid conflicts with on-premises networks or VPC peering
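The CIDR arithmetic above can be checked with Python's standard `ipaddress` module. This is a planning sketch only: the tier-to-index mapping mirrors the layout above, and the on-premises range `192.168.0.0/16` is a hypothetical value used to show the overlap check.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves 5 addresses in every subnet

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))  # every /24 block inside the /16


def usable_ips(subnet: ipaddress.IPv4Network) -> int:
    """Addresses left for workloads after the AWS-reserved ones."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


# Three /24 blocks per tier, one per AZ, matching the layout above.
tiers = {
    "public": subnets[0:3],
    "private": subnets[10:13],
    "database": subnets[20:23],
}

for tier, blocks in tiers.items():
    for block in blocks:
        print(f"{tier:9} {block}  usable={usable_ips(block)}")

# Hypothetical on-premises range; an overlap would break peering or VPN routing.
on_prem = ipaddress.ip_network("192.168.0.0/16")
print("conflicts with on-prem:", vpc.overlaps(on_prem))
```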
Never put application workloads in public subnets. Only load balancers and NAT Gateways belong in public subnets.
Set up IAM roles with least privilege
Start from zero-permissions and add only what's needed. Example Lambda role that reads from one S3 bucket and writes to DynamoDB:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/MyTable"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
Key rules:
- Scope `Resource` to specific ARNs, never `"*"` for data plane actions
- Use permission boundaries to cap what a role can grant to child roles
- Use IAM Access Analyzer to find overly permissive policies automatically
- Rotate any long-term credentials (access keys) every 90 days or eliminate them
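A rough way to enforce the first rule in review or CI is to scan policy documents for wildcards. The sketch below is a simplified local check, not a substitute for IAM Access Analyzer; the `find_wildcards` name and the finding messages are assumptions for illustration.

```python
def as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def find_wildcards(policy: dict) -> list[str]:
    """Flag Allow statements with a bare '*' resource or a wildcard action."""
    findings = []
    for index, statement in enumerate(as_list(policy["Statement"])):
        if statement.get("Effect") != "Allow":
            continue
        for action in as_list(statement.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: wildcard action {action!r}")
        for resource in as_list(statement.get("Resource", [])):
            if resource == "*":
                findings.append(f"statement {index}: wildcard resource")
    return findings


too_broad = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}
for finding in find_wildcards(too_broad):
    print(finding)
```

Note that this check only catches exact `*` resources and `service:*` actions; partially wildcarded ARNs still need human review.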
Design a serverless API
Standard pattern: API Gateway -> Lambda -> DynamoDB
Client
  -> API Gateway (REST or HTTP API)
       - Request validation, auth (Cognito/JWT authorizer), throttling
  -> Lambda function (per route or single handler)
       - Business logic, input validation
  -> DynamoDB table
       - Partition key = entity type + ID, sort key = operation/timestamp
  -> (optional) SQS for async fan-out, SNS for notifications
Choose HTTP API over REST API unless you need WAF integration, edge caching via API Gateway caches, or request/response transformation. HTTP API costs ~70% less.
DynamoDB access pattern design:
- Define all queries before designing the table (single-table design when possible)
- Use a composite sort key to support range queries (`STATUS#TIMESTAMP`)
- Enable DynamoDB Streams if downstream Lambdas need to react to changes
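The composite sort key idea can be modeled locally without touching DynamoDB. In the sketch below, `order_item` and `query_begins_with` are hypothetical helpers: the first builds `PK`/`SK` values in the `STATUS#TIMESTAMP` shape, the second imitates a query on one partition with a begins-with condition on the sort key, returning items in sort-key order.

```python
def order_item(customer_id: str, status: str, iso_timestamp: str) -> dict:
    """Build keys: partition = entity type + ID, sort = STATUS#TIMESTAMP."""
    return {"PK": f"CUSTOMER#{customer_id}", "SK": f"{status}#{iso_timestamp}"}


def query_begins_with(table: list[dict], pk: str, sk_prefix: str) -> list[dict]:
    """Local stand-in for a query: one partition, sort key prefix match."""
    matches = [
        item for item in table
        if item["PK"] == pk and item["SK"].startswith(sk_prefix)
    ]
    # Within a partition, items come back ordered by sort key.
    return sorted(matches, key=lambda item: item["SK"])


table = [
    order_item("42", "SHIPPED", "2024-03-02T10:00:00Z"),
    order_item("42", "PENDING", "2024-03-05T09:30:00Z"),
    order_item("42", "PENDING", "2024-03-01T08:00:00Z"),
    order_item("7", "PENDING", "2024-03-03T12:00:00Z"),
]

pending = query_begins_with(table, "CUSTOMER#42", "PENDING#")
print([item["SK"] for item in pending])
```

Because ISO 8601 timestamps sort lexicographically, putting the status first and the timestamp second yields "all pending orders for a customer, oldest first" from a single query.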
Optimize costs
| Strategy | When to apply | Typical saving |
|---|---|---|
| Reserved Instances (1yr no-upfront) | EC2/RDS running most of the day (an RI is billed for every hour of the term, used or not), stable size | ~30-40% |
| Compute Savings Plans | Any EC2/Fargate/Lambda, flexible family | ~20-30% |
| Spot Instances | Batch, stateless, fault-tolerant workloads | ~60-80% |
| Right-sizing | Instances with <20% avg CPU over 2 weeks | Varies |
| S3 Intelligent-Tiering | Objects with unpredictable access | ~40% for cold data |
| Delete idle resources | Unattached EBS volumes, old snapshots, unused EIPs | Immediate |
Cost hygiene checklist:
- Set up AWS Budgets with alerts at 80% and 100% of monthly target
- Enable Cost Allocation Tags and tag every resource with `env`, `team`, `service`
- Review Trusted Advisor weekly for underutilized resources
- Use Lambda Power Tuning to find the optimal memory/cost configuration
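Two items from this checklist are easy to express as checks. The sketch below is a local illustration under stated assumptions: `missing_tags` and `crossed_alerts` are hypothetical helper names, and the tag dictionaries and spend figures are made-up inputs rather than values fetched from AWS.

```python
REQUIRED_TAGS = ("env", "team", "service")  # tag keys named in the checklist
ALERT_THRESHOLDS = (0.8, 1.0)               # 80% and 100% of the monthly target


def missing_tags(tags: dict) -> list[str]:
    """Return required tag keys that are absent or empty on a resource."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


def crossed_alerts(spend: float, monthly_target: float) -> list[float]:
    """Return the budget thresholds that current spend has reached."""
    return [t for t in ALERT_THRESHOLDS if spend >= t * monthly_target]


print(missing_tags({"env": "prod", "team": "payments"}))
print(crossed_alerts(spend=850.0, monthly_target=1000.0))
```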
Set up monitoring
Build four layers of observability using CloudWatch:
Metrics - Enable detailed monitoring (1-min granularity) for production EC2.
For Lambda, track Errors, Throttles, Duration, and ConcurrentExecutions.
Alarms - Follow the pattern: metric -> alarm -> SNS topic -> PagerDuty/Slack.
# Example: Lambda error alarm (AWS CLI) - fires on 5 or more errors in one minute
aws cloudwatch put-metric-alarm \
  --alarm-name "my-function-errors" \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --dimensions Name=FunctionName,Value=my-function \
  --statistic Sum \
  --period 60 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alerts
Dashboards - One dashboard per service with: error rate, latency (p50/p99), throughput, and saturation (CPU %, queue depth). Use CloudWatch Contributor Insights to find the top contributors to errors or high latency.
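To reason about when an alarm configured with a `Sum` statistic, a 60-second period, and a threshold of 5 would breach, the following simplified model may help. It is an assumption-laden sketch: `breaching_periods` is a hypothetical helper, error timestamps are plain seconds, and real CloudWatch behavior such as missing-data handling is ignored.

```python
from collections import Counter


def breaching_periods(error_times, period=60, threshold=5):
    """Return start times of periods whose error Sum is >= threshold.

    error_times: iterable of timestamps in seconds, one per error.
    """
    # Bucket each error into the period that contains it.
    sums = Counter(int(t // period) * period for t in error_times)
    return sorted(start for start, total in sums.items() if total >= threshold)


# Five errors in the first minute, then one error in each of two later minutes.
errors = [0, 10, 20, 30, 40, 70, 130]
print(breaching_periods(errors))
```

With a single evaluation period, one breaching minute is enough to move the alarm into the ALARM state; raising the number of evaluation periods trades detection speed for fewer false pages.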
Logs - Use structured JSON logging. Query with CloudWatch Logs Insights:
  fields @timestamp, @message
  | filter @message like /ERROR/
  | stats count() by bin(5m)
Choose a database service
| Need | Service | Notes |
|---|---|---|
| Relational, OLTP, <100k writes/s | RDS (PostgreSQL/MySQL) | Familiar SQL, managed backups |
| Relational, high throughput, auto-scaling storage | Aurora | Up to 5x MySQL throughput, Global Database for multi-region |
| Key-value / document at any scale | DynamoDB | Single-digit ms at any scale, requires upfront access pattern design |
| In-memory caching, session store | ElastiCache (Redis) | Sub-ms reads, Lua scripting, pub/sub |
| Full-text search | OpenSearch Service | Elasticsearch-compatible, managed |
| Analytical queries (OLAP) | Redshift | Columnar, petabyte-scale |
| Graph traversals | Neptune | Gremlin/SPARQL, highly connected data |
Decision rule: if access patterns are known and throughput exceeds RDS capacity, use DynamoDB. If you need joins, aggregations, or ad-hoc SQL, use Aurora.
Anti-patterns / common mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Using `*` in IAM policies | Grants unintended access, violates least privilege | Scope to specific actions and ARNs; use IAM Access Analyzer |
| Putting databases in public subnets | Direct internet exposure, no network-layer defense | Database subnets with no internet route; security groups scoped to app tier |
| Hardcoding AWS credentials in code | Credentials leak via source control, logs, or container images | Use IAM roles assigned to compute resources; retrieve secrets from Secrets Manager |
| Single-AZ RDS in production | One maintenance event or hardware failure causes downtime | Enable Multi-AZ deployments; use Aurora for automatic failover |
| Lambda functions without concurrency limits | Runaway invocations can exhaust account concurrency and starve other functions | Set reserved concurrency; use SQS with a DLQ as a buffer |
| Over-provisioned EC2 for bursty workloads | Paying for idle capacity 20h/day | Switch to Fargate + auto-scaling or Lambda for bursty traffic patterns |
Gotchas
RDS encryption cannot be added after creation - You cannot enable encryption on an existing unencrypted RDS instance in place. The only path is to take a snapshot, copy it with encryption enabled, and restore to a new instance. Plan encryption at creation time for any instance that might hold regulated or sensitive data.
Lambda concurrency exhaustion is account-wide - Lambda functions share a per-region concurrency limit (default 1,000). A single runaway function (e.g., triggered by an SQS loop) can consume all available concurrency and throttle every other Lambda in the account. Always set reserved concurrency on high-traffic or loop-risky functions.
NAT Gateway costs accumulate silently - NAT Gateways charge per GB processed plus an hourly fee. A private subnet with heavy outbound traffic (e.g., Lambda pulling large S3 objects) can generate surprising bills. Use VPC endpoints for S3 and DynamoDB to bypass NAT Gateway entirely for those services.
S3 consistency assumptions (pre-2020 style) - S3 has provided strong read-after-write consistency for PUT, DELETE, and LIST operations since December 2020, so a `ListObjects` call issued after a completed delete or recreate reflects the latest state. What S3 still does not provide is coordination between concurrent writers: simultaneous PUTs to the same key resolve as last-writer-wins, and cross-region replication remains asynchronous. Don't carry sleep-and-retry workarounds for stale reads into new pipelines, and don't assume a replica bucket is current immediately after a write.
IAM policy evaluation order surprises - An explicit `Deny` anywhere in the evaluation chain (SCPs, permission boundaries, identity policies, resource policies) overrides any `Allow`. A service control policy at the organization level silently blocking an action is a common source of "permission denied" errors on a policy that looks correctly configured in the IAM console.
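The evaluation order can be approximated with a small model. This is a deliberately simplified sketch for same-account requests where both an SCP and a permission boundary apply; the `evaluate` function and the `Effect` enum are hypothetical names, and session policies and cross-account rules are left out.

```python
from enum import Enum
from typing import Optional


class Effect(Enum):
    ALLOW = "Allow"
    DENY = "Deny"


def evaluate(
    scp: Optional[Effect],
    boundary: Optional[Effect],
    identity: Optional[Effect],
    resource: Optional[Effect],
) -> str:
    """Each argument is the layer's verdict, or None if no statement matches."""
    layers = (scp, boundary, identity, resource)
    # 1. An explicit Deny in any layer wins outright.
    if Effect.DENY in layers:
        return "Deny (explicit)"
    # 2. Guardrail layers must each allow the action; silence is an implicit deny.
    if scp is not Effect.ALLOW or boundary is not Effect.ALLOW:
        return "Deny (implicit)"
    # 3. Some policy must actually grant the permission.
    if identity is Effect.ALLOW or resource is Effect.ALLOW:
        return "Allow"
    return "Deny (implicit)"


# Identity policy looks correct, but the organization's SCP denies the action.
print(evaluate(Effect.DENY, Effect.ALLOW, Effect.ALLOW, None))
# Identity policy looks correct, but the SCP never allows the action.
print(evaluate(None, Effect.ALLOW, Effect.ALLOW, None))
```

Both calls show the situation described above: the identity policy alone says Allow, yet the request is denied by a layer that is not visible on the role itself.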
References
For detailed patterns and service-specific guidance, read the relevant file from the `references/` folder:
- `references/service-map.md` - quick reference mapping use cases to AWS services
Only load a references file when the current task requires detailed service lookup - they consume context and the SKILL.md covers the most common decisions.
References
service-map.md
AWS Service Map
Quick-reference table mapping use cases to the right AWS service. Use this when a task involves service selection or when translating requirements to AWS primitives.
Compute
| Use case | Service | Notes |
|---|---|---|
| Long-running stateful app, OS control needed | EC2 | Choose instance family: M (general), C (compute), R (memory), G (GPU) |
| Containerized workload, no host management | ECS Fargate | Preferred default for containers |
| Containerized, need Kubernetes | EKS | Use when k8s portability or ecosystem is required |
| Event-driven, short tasks (<15 min) | Lambda | Sub-second billing, scales to zero |
| HTTP service from container or source, zero ops | App Runner | Auto-deploys, TLS, scaling handled |
| Large-scale batch jobs | AWS Batch | Managed job queues, Fargate or EC2 backing |
| Edge compute / CDN logic | Lambda@Edge / CloudFront Functions | CloudFront Functions for lightweight transforms (<1ms budget) |
Storage
| Use case | Service | Notes |
|---|---|---|
| Object storage, media, backups, static assets | S3 | Unlimited scale, 11 nines durability |
| Block storage for EC2 | EBS | gp3 is the default general-purpose volume type |
| Shared filesystem (NFS) across EC2 | EFS | POSIX-compliant, multi-AZ |
| High-performance shared filesystem (HPC) | FSx for Lustre | Scratch or persistent mode |
| File shares (Windows/SMB) | FSx for Windows File Server | Active Directory integration |
| Archival, long-term retention | S3 Glacier Instant / Flexible / Deep Archive | Deep Archive cheapest (~$1/TB/month), hours retrieval |
| Content delivery / CDN | CloudFront | 400+ PoPs, S3 or custom origin |
Database
| Use case | Service | Notes |
|---|---|---|
| Relational OLTP (Postgres/MySQL) | RDS | Managed, Multi-AZ, automated backups |
| High-throughput relational, auto-scaling storage | Aurora | Up to 5x MySQL throughput; Aurora Serverless v2 for variable load |
| Key-value / document at massive scale | DynamoDB | Single-digit ms, design around access patterns first |
| In-memory cache / session store | ElastiCache for Redis | Sub-ms, supports data structures and pub/sub |
| Simple key-value cache (no persistence) | ElastiCache for Memcached | Multi-threaded, simpler than Redis |
| Full-text and log search | OpenSearch Service | Managed Elasticsearch/OpenSearch |
| Analytical / data warehouse | Redshift | Columnar, petabyte-scale, RA3 nodes |
| Serverless analytics on S3 | Athena | Presto-based, pay per query scanned |
| Highly connected data (graph) | Neptune | Gremlin and SPARQL APIs |
| Ledger / immutable audit log | QLDB | Cryptographically verifiable, document model |
| Time-series data | Timestream | Purpose-built, automatic tiering |
Networking
| Use case | Service | Notes |
|---|---|---|
| Isolated private network | VPC | One per workload/account; CIDR plan carefully |
| Layer 7 HTTP(S) load balancing | ALB (Application Load Balancer) | Path/host routing, WebSocket, Cognito auth |
| Layer 4 TCP/UDP load balancing | NLB (Network Load Balancer) | Static IPs, ultra-low latency, PrivateLink |
| DNS management | Route 53 | Health-check-based failover, latency routing |
| Private connectivity to AWS services | VPC Endpoints (Gateway / Interface) | Avoid internet traversal for S3, DynamoDB, etc. |
| Connect on-premises to VPC | Site-to-Site VPN / Direct Connect | VPN for quick setup; DX for dedicated bandwidth |
| Hub-and-spoke multi-VPC routing | Transit Gateway | Replaces VPC peering mesh at scale |
| Global accelerator for TCP/UDP | Global Accelerator | Anycast IPs, routes via AWS backbone |
| DDoS protection | Shield Standard / Advanced | Standard is automatic; Advanced adds 24/7 DDoS response team |
| Web Application Firewall | WAF | Attach to ALB, API Gateway, or CloudFront |
Messaging and Integration
| Use case | Service | Notes |
|---|---|---|
| Decoupled async message queue | SQS | Standard (at-least-once) or FIFO (exactly-once, ordered) |
| Fan-out pub/sub notifications | SNS | Push to SQS, Lambda, HTTP, email, SMS |
| Real-time streaming / event bus | Kinesis Data Streams | Ordered, replayable, shards scale throughput |
| Managed Kafka | MSK (Managed Streaming for Kafka) | When Kafka ecosystem/tooling required |
| Event-driven integration / routing | EventBridge | Schema registry, cross-account, SaaS integrations |
| Workflow orchestration | Step Functions | Standard (audit, long-running) or Express (high-volume, short) |
| Managed message broker (AMQP/STOMP) | Amazon MQ | Lift-and-shift for RabbitMQ or ActiveMQ |
Security and Identity
| Use case | Service | Notes |
|---|---|---|
| Identity and access management | IAM | Roles, policies, permission boundaries |
| User authentication / OIDC | Cognito | User pools (auth), identity pools (AWS credentials) |
| Secrets storage and rotation | Secrets Manager | Automatic rotation for RDS, Redshift, DocumentDB |
| Config/environment parameters | Parameter Store (SSM) | Free tier for standard params; use SecureString for sensitive values |
| Encryption key management | KMS | CMKs for envelope encryption; key policies control access |
| Certificate management | ACM (Certificate Manager) | Free TLS certs for ALB/CloudFront; auto-renewal |
| Threat detection (logs analysis) | GuardDuty | ML-based anomaly detection on VPC flow logs, CloudTrail, DNS |
| Security findings aggregation | Security Hub | Aggregates GuardDuty, Inspector, Macie findings |
| S3 sensitive data discovery | Macie | PII detection in S3 buckets |
| Vulnerability scanning (EC2/containers) | Inspector | CVE scanning, network reachability |
| Audit trail for API calls | CloudTrail | Enable in all regions; store in S3 with integrity validation |
Monitoring and Observability
| Use case | Service | Notes |
|---|---|---|
| Metrics, alarms, dashboards | CloudWatch Metrics + Alarms | 1-min granularity for detailed monitoring |
| Log aggregation and querying | CloudWatch Logs + Logs Insights | Structured JSON logs; Logs Insights for ad-hoc queries |
| Distributed tracing | X-Ray | Trace across Lambda, ECS, API Gateway, SDK-instrumented services |
| Synthetic monitoring (uptime) | CloudWatch Synthetics | Canary scripts to test endpoints |
| Application performance monitoring | CloudWatch Application Insights | Auto-detects and groups related metrics/logs |
| Infrastructure events | EventBridge / CloudWatch Events | React to AWS service state changes |
Developer Tools and IaC
| Use case | Service | Notes |
|---|---|---|
| Infrastructure as code (native) | CloudFormation / CDK | CDK (TypeScript/Python) compiles to CloudFormation |
| Source control | CodeCommit | Managed Git; most teams use GitHub/GitLab instead |
| CI/CD pipeline | CodePipeline + CodeBuild | Managed pipeline; CodeBuild for build/test steps |
| Container image registry | ECR (Elastic Container Registry) | Private, integrated with ECS/EKS, image scanning |
| Artifact storage | CodeArtifact | npm, Maven, pip, NuGet package proxy and hosting |
Cost Optimization Quick Reference
| Strategy | Best for | Typical saving |
|---|---|---|
| Reserved Instances (1-year, no upfront) | Stable EC2 and RDS | ~30-40% vs on-demand |
| Compute Savings Plans | EC2 + Fargate + Lambda mix | ~20-30% |
| Spot Instances | Fault-tolerant batch, stateless workers | ~60-80% vs on-demand |
| S3 Intelligent-Tiering | Objects with unknown access frequency | ~40% on cold objects |
| Graviton (ARM) instances | General-purpose EC2, ECS, RDS | ~10-20% vs x86 equivalents |
| Lambda right-sizing (Power Tuning tool) | All Lambda functions | 20-50% memory/cost balance |