cloud-aws

Use this skill when architecting on AWS, selecting services, optimizing costs, or following the Well-Architected Framework. Triggers on EC2, S3, Lambda, RDS, DynamoDB, CloudFront, IAM, VPC, ECS, EKS, SQS, SNS, API Gateway, and any task requiring AWS architecture decisions, service selection, or cost management.

Quick Start

Open your terminal or command prompt
Run: npx skills add AbsolutelySkilled/AbsolutelySkilled --skill cloud-aws
Start your AI coding agent (Claude Code, Cursor, Gemini CLI, or any supported agent)
The cloud-aws skill is now active and ready to use

cloud-aws is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers architecting on AWS, selecting services, optimizing costs, and following the Well-Architected Framework.

Quick Facts

Field	Value
Category	cloud
Version	0.1.0
Platforms	claude-code, gemini-cli, openai-codex
License	MIT

How to Install

Make sure you have Node.js installed on your machine.
Run the following command in your terminal:

npx skills add AbsolutelySkilled/AbsolutelySkilled --skill cloud-aws

The cloud-aws skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).

Overview

A practical guide to building production systems on AWS following the Well-Architected Framework. This skill covers service selection, VPC design, IAM least-privilege, serverless patterns, cost optimization, and monitoring - with an emphasis on when to use each service, not just how. Designed for engineers who know AWS basics and need opinionated guidance on trade-offs and common pitfalls.

Platforms

claude-code
gemini-cli
openai-codex

Frequently Asked Questions

How do I install cloud-aws?

Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill cloud-aws in your terminal. The skill will be immediately available in your AI coding agent.

What AI agents support cloud-aws?

This skill works with claude-code, gemini-cli, and openai-codex. Install it once and use it across any supported AI coding agent.

Maintainers

@maddhruv

Generated from AbsolutelySkilled

SKILL.md

AWS Cloud Architecture

When to use this skill

Trigger this skill when the user:

Chooses between AWS compute options (EC2, ECS, Fargate, Lambda, App Runner)
Designs or reviews a VPC, subnet, or security group setup
Needs IAM roles, policies, or permission boundaries
Architects a serverless application (API Gateway + Lambda + DynamoDB)
Asks about cost reduction, Reserved Instances, Savings Plans, or right-sizing
Sets up CloudWatch alarms, dashboards, or log insights
Selects a database service (RDS, Aurora, DynamoDB, ElastiCache)
Plans multi-region or high-availability architecture

Do NOT trigger this skill for:

General Linux/shell scripting unrelated to AWS
Kubernetes internals that are cloud-agnostic (use a k8s skill instead)

Key principles

Operational excellence - Automate everything that can be automated. Infrastructure-as-code (CloudFormation, CDK, Terraform) is not optional. Every change should be reviewable, reproducible, and reversible. Run post-incident reviews and feed learnings back into runbooks.
Security - Apply least-privilege IAM everywhere. No * actions in production policies. Encrypt data at rest (KMS) and in transit (TLS). Treat every AWS account boundary as a trust boundary. Use VPC endpoints to keep traffic off the public internet where possible.
Reliability - Design for multi-AZ by default. Use health checks, auto-scaling, and managed services that handle failure transparently. Define Recovery Time Objective (RTO) and Recovery Point Objective (RPO) before choosing a database tier.
Performance efficiency - Right-size before you scale out. Understand the access patterns of your workload and match them to the service that handles them natively (e.g., DynamoDB for key-value at scale, Aurora for relational OLTP). Use CloudFront and edge caching to reduce origin load.
Cost optimization - Cost is an architecture decision, not an afterthought. Tag every resource. Use Cost Explorer weekly. Commit to Reserved Instances or Savings Plans for stable workloads. Delete idle resources aggressively.

Core concepts

Regions and Availability Zones

A region is a geographic area with multiple isolated data centers. Each region contains at least 3 Availability Zones (AZs) - physically separate facilities with independent power and networking. Deploy stateful services across 2+ AZs for high availability. Some services (IAM, CloudFront, Route 53) are global; most are regional. S3 bucket names share a global namespace, but each bucket and its data live in a single region.

IAM model

IAM has four building blocks:

Concept	What it is
Principal	Who is acting (user, role, service)
Policy	JSON document defining allowed/denied actions
Role	Identity assumed by services or users (no long-term credentials)
Trust policy	Who is allowed to assume a role

The golden rule: use roles, not users. EC2 instances, Lambda functions, and ECS tasks all assume roles at runtime. Never embed access keys in code or AMIs.
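To make the trust policy row concrete, here is a minimal sketch that builds the trust policy document a role needs before a service can assume it. The helper name `build_trust_policy` is hypothetical and exists only for illustration; the JSON shape follows the standard IAM policy grammar.

```python
import json


def build_trust_policy(service_principal: str) -> dict:
    """Return a trust policy that lets one AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Who may assume the role (here, an AWS service)
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Example: a role intended for Lambda functions
print(json.dumps(build_trust_policy("lambda.amazonaws.com"), indent=2))
```

The trust policy only controls who can assume the role. What the role can do once assumed is defined separately by its permission policies.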

Compute spectrum

Control / Cost                              Managed / Speed
<------------------------------------------>
EC2 -> ECS on EC2 -> ECS Fargate -> Lambda -> App Runner

EC2 - full OS control, GPU support, long-running workloads
ECS on EC2 - containerized, you manage the host fleet
ECS Fargate - containerized, AWS manages hosts (preferred default)
Lambda - event-driven, sub-second billing, 15-min max duration
App Runner - HTTP services from container or source, zero infra management

Storage tiers

Service	Use case
S3 Standard	Frequently accessed objects
S3 Intelligent-Tiering	Unpredictable access patterns
S3 Glacier Instant	Archives needing millisecond retrieval
EBS	Block storage attached to EC2
EFS	Shared POSIX filesystem across multiple EC2 instances

Networking primitives

A VPC is a logically isolated network. Inside it, each subnet lives in a single AZ. Public subnets have a route to an Internet Gateway; private subnets do not. Security groups are stateful firewalls attached to ENIs (inbound traffic is denied by default; outbound is allowed by default). NACLs are stateless subnet-level firewalls (less common). Use VPC endpoints to reach AWS services (S3, DynamoDB, SQS) without traversing the internet.

Common tasks

Choose the right compute service

Workload type	Recommended service	Why
Long-running stateful app, GPU needed	EC2	Full OS control, persistent storage
Containerized microservice, >15 min tasks	ECS Fargate	No host management, predictable billing
Event-driven, short tasks (<15 min)	Lambda	Pay-per-invocation, auto-scales to zero
HTTP API from container, zero-ops	App Runner	Automated deployments, TLS, scaling
Large-scale batch processing	AWS Batch on Fargate	Managed job queues, spot support
Kubernetes required	EKS	When you need k8s primitives or portability

Decision rule: start with Lambda or Fargate. Move to EC2 only when you need control over the OS, persistent GPU, or a runtime Lambda does not support.
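The decision rule can be sketched as a small function. This is a simplification under stated assumptions: the `Workload` fields and the `recommend_compute` name are hypothetical, and App Runner and AWS Batch are left out to keep the branching readable.

```python
from dataclasses import dataclass

LAMBDA_MAX_MINUTES = 15  # Lambda's maximum duration per invocation


@dataclass
class Workload:
    needs_os_control: bool = False
    needs_gpu: bool = False
    needs_kubernetes: bool = False
    event_driven: bool = False
    max_duration_minutes: float = 1.0


def recommend_compute(w: Workload) -> str:
    """Apply the table's rules in order of how much control the workload needs."""
    if w.needs_os_control or w.needs_gpu:
        return "EC2"
    if w.needs_kubernetes:
        return "EKS"
    if w.event_driven and w.max_duration_minutes < LAMBDA_MAX_MINUTES:
        return "Lambda"
    # Default for containerized or long-running work without special needs
    return "ECS Fargate"


print(recommend_compute(Workload(event_driven=True, max_duration_minutes=2)))
print(recommend_compute(Workload(event_driven=True, max_duration_minutes=30)))
```

The order of the checks matters: hard constraints (OS control, GPU, Kubernetes) are tested first, so the managed options are only chosen when nothing rules them out.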

Design a VPC with public/private subnets

A standard 3-tier VPC layout:

VPC 10.0.0.0/16
  Public subnets  (10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24)  - one per AZ
    - Internet Gateway route
    - Load balancers, NAT Gateways, bastion hosts
  Private subnets (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24) - one per AZ
    - Application servers, ECS tasks, Lambda (VPC-attached)
    - Route outbound through NAT Gateway in the public subnet
  Database subnets (10.0.20.0/24, 10.0.21.0/24, 10.0.22.0/24) - one per AZ
    - RDS, ElastiCache
    - No internet route at all

CIDR planning rules:

Use /16 for the VPC to leave room for growth
Use /24 per subnet (251 usable IPs - AWS reserves 5 per subnet)
Reserve CIDR ranges to avoid conflicts with on-premises networks or VPC peering

Never put application workloads in public subnets. Only load balancers, NAT Gateways, and bastion hosts belong in public subnets.
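The CIDR arithmetic above can be checked with Python's standard `ipaddress` module. This is a sketch only: the `tier_subnets` and `usable_ips` helpers are hypothetical names, and the starting offsets (0, 10, 20) simply mirror the example layout.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves 5 addresses in every subnet

vpc = ipaddress.ip_network("10.0.0.0/16")
all_24s = list(vpc.subnets(new_prefix=24))  # every /24 that fits in the /16


def tier_subnets(start_index: int, az_count: int = 3):
    """Pick one /24 per AZ, starting at the given third-octet index."""
    return all_24s[start_index:start_index + az_count]


def usable_ips(subnet: ipaddress.IPv4Network) -> int:
    """Addresses left for workloads after AWS's reserved ones."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


layout = {
    "public": tier_subnets(0),
    "private": tier_subnets(10),
    "database": tier_subnets(20),
}

for tier, subnets in layout.items():
    print(tier, [str(s) for s in subnets], usable_ips(subnets[0]))
```

A /16 yields 256 possible /24 subnets, so this layout uses 9 of them and leaves the rest for future tiers, additional AZs, or peered environments.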

Set up IAM roles with least privilege

Start from zero permissions and add only what's needed. Example Lambda role policy that reads from one S3 bucket, writes to one DynamoDB table, and writes its own logs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/MyTable"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}

Key rules:

Scope Resource to specific ARNs, never "*" for data plane actions
Use permission boundaries to cap what a role can grant to child roles
Use IAM Access Analyzer to find overly permissive policies automatically
Rotate any long-term credentials (access keys) every 90 days or eliminate them
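As a rough illustration of the first rule, a policy document can be scanned for wildcard grants before it is deployed. The `find_wildcards` function below is a hypothetical helper, not an AWS API, and it is far less thorough than IAM Access Analyzer: it ignores conditions, `NotAction`, and partial wildcards inside ARNs.

```python
from typing import List


def find_wildcards(policy: dict) -> List[str]:
    """Flag Allow statements that use '*' or 'service:*' actions, or a '*' resource."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement may be a bare object
        statements = [statements]
    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Both fields may be a single string or a list of strings
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {i}: wildcard action {action!r}")
        for resource in resources:
            if resource == "*":
                findings.append(f"statement {i}: wildcard resource")
    return findings


risky = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::my-bucket/*"},
    ],
}
for finding in find_wildcards(risky):
    print(finding)
```

A check like this fits naturally into a CI step that runs against policy files in the infrastructure-as-code repository.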

Design a serverless API

Standard pattern: API Gateway -> Lambda -> DynamoDB

Client
  -> API Gateway (REST or HTTP API)
      - Request validation, auth (Cognito/JWT authorizer), throttling
  -> Lambda function (per route or single handler)
      - Business logic, input validation
  -> DynamoDB table
      - Partition key = entity type + ID, sort key = operation/timestamp
  -> (optional) SQS for async fan-out, SNS for notifications

Choose HTTP API over REST API unless you need WAF integration, API Gateway response caching, or request/response transformation. HTTP API costs roughly 70% less per request.

DynamoDB access pattern design:

Define all queries before designing the table (single-table design when possible)
Use a composite sort key to support range queries (STATUS#TIMESTAMP)
Enable DynamoDB Streams if downstream Lambdas need to react to changes
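The composite sort key idea can be sketched without touching DynamoDB at all, because the range behavior comes from plain string ordering. The `sort_key` helper and the sample statuses are hypothetical; the one real requirement is a timestamp format that sorts chronologically as text, such as ISO-8601 in UTC.

```python
from datetime import datetime, timezone


def sort_key(status: str, when: datetime) -> str:
    """Build a STATUS#TIMESTAMP sort key using a fixed-width UTC timestamp."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{status}#{stamp}"


items = [
    sort_key("SHIPPED", datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)),
    sort_key("PENDING", datetime(2024, 3, 2, 12, 30, tzinfo=timezone.utc)),
    sort_key("PENDING", datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)),
    sort_key("PENDING", datetime(2024, 2, 20, 17, 45, tzinfo=timezone.utc)),
]

# Items in one partition are kept in sort-key order; sorted() mimics that here.
ordered = sorted(items)

# Stand-in for a BETWEEN key condition on the sort key
lower, upper = "PENDING#2024-02-01T00:00:00Z", "PENDING#2024-03-31T23:59:59Z"
in_range = [sk for sk in ordered if lower <= sk <= upper]
print(in_range)
```

Because the status is the prefix, one query can return "all PENDING items in a date range" from a single partition without a filter expression or a scan.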

Optimize costs

Strategy	When to apply	Typical saving
Reserved Instances (1yr no-upfront)	EC2/RDS running >8h/day, stable size	~30-40%
Compute Savings Plans	Any EC2/Fargate/Lambda, flexible family	~20-30%
Spot Instances	Batch, stateless, fault-tolerant workloads	~60-80%
Right-sizing	Instances with <20% avg CPU over 2 weeks	Varies
S3 Intelligent-Tiering	Objects with unpredictable access	~40% for cold data
Delete idle resources	Unattached EBS volumes, old snapshots, unused EIPs	Immediate

Cost hygiene checklist:

Set up AWS Budgets with alerts at 80% and 100% of monthly target
Enable Cost Allocation Tags and tag every resource with env, team, service
Review Trusted Advisor weekly for underutilized resources
Use Lambda Power Tuning to find the optimal memory/cost configuration
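The right-sizing row in the table (under 20% average CPU over 2 weeks) translates into a simple filter. This is a sketch on made-up data: the instance IDs and CPU samples are hypothetical, and in practice the samples would come from CloudWatch metrics.

```python
from statistics import mean
from typing import Dict, List

CPU_THRESHOLD_PERCENT = 20.0  # threshold from the right-sizing row above


def rightsizing_candidates(cpu_by_instance: Dict[str, List[float]]) -> List[str]:
    """Return instance IDs whose average CPU over the window is below the threshold."""
    return sorted(
        instance_id
        for instance_id, samples in cpu_by_instance.items()
        if samples and mean(samples) < CPU_THRESHOLD_PERCENT
    )


# Hypothetical daily average CPU % over 14 days
fleet = {
    "i-aaa": [12.0] * 14,
    "i-bbb": [55.0] * 7 + [61.0] * 7,
    "i-ccc": [5.0] * 13 + [90.0],
}
print(rightsizing_candidates(fleet))
```

Note that `i-ccc` is flagged despite one day at 90%: an average hides peaks, so it is worth checking maximum or p99 utilization and memory pressure before downsizing a flagged instance.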

Set up monitoring

Build four layers of observability using CloudWatch:

Metrics - Enable detailed monitoring (1-min granularity) for production EC2. For Lambda, track Errors, Throttles, Duration, and ConcurrentExecutions.

Alarms - Follow the pattern: metric -> alarm -> SNS topic -> PagerDuty/Slack.

# Example: Lambda error-count alarm (AWS CLI); fires when Errors >= 5 in one 60-second period
aws cloudwatch put-metric-alarm \
  --alarm-name "my-function-errors" \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --dimensions Name=FunctionName,Value=my-function \
  --statistic Sum \
  --period 60 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alerts

Dashboards - One dashboard per service with: error rate, latency (p50/p99), throughput, and saturation (CPU %, queue depth). Use CloudWatch Contributor Insights to find the top contributors to errors or high latency.

Logs - Use structured JSON logging. Query with CloudWatch Logs Insights:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)
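One way to produce structured JSON logs from application code is a custom formatter on Python's standard `logging` module. This is a minimal sketch, not a prescribed setup: the `JsonFormatter` class, the `context` attribute, and the field names are all assumptions chosen for illustration.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured fields passed via extra={"context": {...}} (hypothetical convention)
        context = getattr(record, "context", None)
        if context:
            payload.update(context)
        return json.dumps(payload)


def get_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = get_logger("orders")
log.error("payment failed", extra={"context": {"order_id": "o-123", "latency_ms": 840}})
```

With JSON log lines, Logs Insights discovers the fields automatically, so queries can filter on a field such as `level` directly instead of pattern-matching on `@message`.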

Choose a database service

Need	Service	Notes
Relational, OLTP, <100k writes/s	RDS (PostgreSQL/MySQL)	Familiar SQL, managed backups
Relational, high throughput, auto-scaling storage	Aurora	Up to 5x MySQL throughput (per AWS), Global Database for multi-region
Key-value / document at any scale	DynamoDB	Single-digit ms at any scale, requires upfront access pattern design
In-memory caching, session store	ElastiCache (Redis)	Sub-ms reads, Lua scripting, pub/sub
Full-text search	OpenSearch Service	Elasticsearch-compatible, managed
Analytical queries (OLAP)	Redshift	Columnar, petabyte-scale
Graph traversals	Neptune	Gremlin/SPARQL, highly connected data

Decision rule: if access patterns are known and throughput exceeds RDS capacity, use DynamoDB. If you need joins, aggregations, or ad-hoc SQL, use Aurora.

Anti-patterns / common mistakes

Mistake	Why it's wrong	What to do instead
Using `*` in IAM policies	Grants unintended access, violates least privilege	Scope to specific actions and ARNs; use IAM Access Analyzer
Putting databases in public subnets	Direct internet exposure, no network-layer defense	Database subnets with no internet route; security groups scoped to app tier
Hardcoding AWS credentials in code	Credentials leak via source control, logs, or container images	Use IAM roles assigned to compute resources; retrieve secrets from Secrets Manager
Single-AZ RDS in production	One maintenance event or hardware failure causes downtime	Enable Multi-AZ deployments; use Aurora for automatic failover
Lambda functions without concurrency limits	Runaway invocations can exhaust account concurrency and starve other functions	Set reserved concurrency; use SQS with a DLQ as a buffer
Over-provisioned EC2 for bursty workloads	Paying for idle capacity 20h/day	Switch to Fargate + auto-scaling or Lambda for bursty traffic patterns

Gotchas

RDS encryption cannot be added after creation - You cannot enable encryption on an existing unencrypted RDS instance in place. The only path is to take a snapshot, copy it with encryption enabled, and restore to a new instance. Plan encryption at creation time for any instance that might hold regulated or sensitive data.
Lambda concurrency exhaustion is account-wide - All Lambda functions in an account share a per-region concurrency limit (default 1,000). A single runaway function (e.g., triggered by an SQS loop) can consume all available concurrency and throttle every other Lambda function in that region. Always set reserved concurrency on high-traffic or loop-risky functions.
NAT Gateway costs accumulate silently - NAT Gateways charge per GB processed plus an hourly fee. A private subnet with heavy outbound traffic (e.g., Lambda pulling large S3 objects) can generate surprising bills. Use VPC endpoints for S3 and DynamoDB to bypass NAT Gateway entirely for those services.
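S3 consistency assumptions (pre-2020 habits) - Since December 2020, S3 provides strong read-after-write consistency for object PUTs, overwrites, deletes, and list operations, so a ListObjects after a successful write reflects that write. What is still not immediate: concurrent writes to the same key resolve as last-writer-wins with no locking, bucket-level configuration changes are eventually consistent, and replicated copies in another bucket or region lag the source. Don't assume a replica or a concurrent writer sees your latest state in automated pipelines.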
IAM policy evaluation order surprises - An explicit Deny anywhere in the evaluation chain (SCPs, permission boundaries, identity policies, resource policies) overrides any Allow. A service control policy at the organization level silently blocking an action is a common source of "permission denied" that looks correctly configured in the IAM console.
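The explicit-deny behavior in the last gotcha can be modeled in a few lines. This is a deliberately simplified sketch, not the real IAM evaluation engine: it matches on action names only and ignores resources, conditions, resource-based policies, and session policies. The `evaluate` function and the sample policies are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Optional


def _matches(statements: List[dict], effect: str, action: str) -> bool:
    """True if any statement with this effect covers the action (case-insensitive)."""
    for stmt in statements:
        if stmt["Effect"] != effect:
            continue
        if any(fnmatchcase(action.lower(), pattern.lower()) for pattern in stmt["Action"]):
            return True
    return False


def evaluate(
    action: str,
    identity_policy: List[dict],
    scp: Optional[List[dict]] = None,
    boundary: Optional[List[dict]] = None,
) -> str:
    layers = [p for p in (scp, boundary, identity_policy) if p is not None]
    # 1. An explicit Deny in any layer wins over every Allow.
    if any(_matches(layer, "Deny", action) for layer in layers):
        return "ExplicitDeny"
    # 2. Otherwise every layer that is present must allow the action.
    if all(_matches(layer, "Allow", action) for layer in layers):
        return "Allow"
    return "ImplicitDeny"


identity = [{"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"]}]
org_scp = [
    {"Effect": "Allow", "Action": ["*"]},
    {"Effect": "Deny", "Action": ["s3:PutObject"]},
]

print(evaluate("s3:GetObject", identity, scp=org_scp))
print(evaluate("s3:PutObject", identity, scp=org_scp))
```

In the second call the identity policy looks correct on its own, which is exactly why an organization-level Deny is easy to miss when debugging from the IAM console.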

References

For detailed patterns and service-specific guidance, read the relevant file from the references/ folder:

references/service-map.md - quick reference mapping use cases to AWS services

Only load a references file when the current task requires detailed service lookup - they consume context and the SKILL.md covers the most common decisions.

service-map.md

AWS Service Map

Quick-reference table mapping use cases to the right AWS service. Use this when a task involves service selection or when translating requirements to AWS primitives.

Compute

Use case	Service	Notes
Long-running stateful app, OS control needed	EC2	Choose instance family: M (general), C (compute), R (memory), G (GPU)
Containerized workload, no host management	ECS Fargate	Preferred default for containers
Containerized, need Kubernetes	EKS	Use when k8s portability or ecosystem is required
Event-driven, short tasks (<15 min)	Lambda	Sub-second billing, scales to zero
HTTP service from container or source, zero ops	App Runner	Auto-deploys, TLS, scaling handled
Large-scale batch jobs	AWS Batch	Managed job queues, Fargate or EC2 backing
Edge compute / CDN logic	Lambda@Edge / CloudFront Functions	CloudFront Functions for lightweight transforms (<1ms budget)

Storage

Use case	Service	Notes
Object storage, media, backups, static assets	S3	Unlimited scale, 11 nines durability
Block storage for EC2	EBS	gp3 is the default general-purpose volume type
Shared filesystem (NFS) across EC2	EFS	POSIX-compliant, multi-AZ
High-performance shared filesystem (HPC)	FSx for Lustre	Scratch or persistent mode
File shares (Windows/SMB)	FSx for Windows File Server	Active Directory integration
Archival, long-term retention	S3 Glacier Instant / Flexible / Deep Archive	Deep Archive cheapest (~$1/TB/month), hours retrieval
Content delivery / CDN	CloudFront	400+ PoPs, S3 or custom origin

Database

Use case	Service	Notes
Relational OLTP (Postgres/MySQL)	RDS	Managed, Multi-AZ, automated backups
High-throughput relational, auto-scaling storage	Aurora	Up to 5x MySQL throughput (per AWS); Aurora Serverless v2 for variable load
Key-value / document at massive scale	DynamoDB	Single-digit ms, design around access patterns first
In-memory cache / session store	ElastiCache for Redis	Sub-ms, supports data structures and pub/sub
Simple key-value cache (no persistence)	ElastiCache for Memcached	Multi-threaded, simpler than Redis
Full-text and log search	OpenSearch Service	Managed Elasticsearch/OpenSearch
Analytical / data warehouse	Redshift	Columnar, petabyte-scale, RA3 nodes
Serverless analytics on S3	Athena	Presto/Trino-based SQL engine; pay per amount of data scanned
Highly connected data (graph)	Neptune	Gremlin and SPARQL APIs
Ledger / immutable audit log	QLDB	Cryptographically verifiable, document model
Time-series data	Timestream	Purpose-built, automatic tiering

Networking

Use case	Service	Notes
Isolated private network	VPC	One per workload/account; CIDR plan carefully
Layer 7 HTTP(S) load balancing	ALB (Application Load Balancer)	Path/host routing, WebSocket, Cognito auth
Layer 4 TCP/UDP load balancing	NLB (Network Load Balancer)	Static IPs, ultra-low latency, PrivateLink
DNS management	Route 53	Health-check-based failover, latency routing
Private connectivity to AWS services	VPC Endpoints (Gateway / Interface)	Avoid internet traversal for S3, DynamoDB, etc.
Connect on-premises to VPC	Site-to-Site VPN / Direct Connect	VPN for quick setup; DX for dedicated bandwidth
Hub-and-spoke multi-VPC routing	Transit Gateway	Replaces VPC peering mesh at scale
Global accelerator for TCP/UDP	Global Accelerator	Anycast IPs, routes via AWS backbone
DDoS protection	Shield Standard / Advanced	Standard is automatic; Advanced adds 24/7 DDoS response team
Web Application Firewall	WAF	Attach to ALB, API Gateway, or CloudFront

Messaging and Integration

Use case	Service	Notes
Decoupled async message queue	SQS	Standard (at-least-once) or FIFO (exactly-once, ordered)
Fan-out pub/sub notifications	SNS	Push to SQS, Lambda, HTTP, email, SMS
Real-time streaming / event bus	Kinesis Data Streams	Ordered, replayable, shards scale throughput
Managed Kafka	MSK (Managed Streaming for Kafka)	When Kafka ecosystem/tooling required
Event-driven integration / routing	EventBridge	Schema registry, cross-account, SaaS integrations
Workflow orchestration	Step Functions	Standard (audit, long-running) or Express (high-volume, short)
Managed message broker (AMQP/STOMP)	Amazon MQ	Lift-and-shift for RabbitMQ or ActiveMQ

Security and Identity

Use case	Service	Notes
Identity and access management	IAM	Roles, policies, permission boundaries
User authentication / OIDC	Cognito	User pools (auth), identity pools (AWS credentials)
Secrets storage and rotation	Secrets Manager	Automatic rotation for RDS, Redshift, DocumentDB
Config/environment parameters	Parameter Store (SSM)	Free tier for standard params; use SecureString for sensitive values
Encryption key management	KMS	CMKs for envelope encryption; key policies control access
Certificate management	ACM (Certificate Manager)	Free TLS certs for ALB/CloudFront; auto-renewal
Threat detection (logs analysis)	GuardDuty	ML-based anomaly detection on VPC flow logs, CloudTrail, DNS
Security findings aggregation	Security Hub	Aggregates GuardDuty, Inspector, Macie findings
S3 sensitive data discovery	Macie	PII detection in S3 buckets
Vulnerability scanning (EC2/containers)	Inspector	CVE scanning, network reachability
Audit trail for API calls	CloudTrail	Enable in all regions; store in S3 with integrity validation

Monitoring and Observability

Use case	Service	Notes
Metrics, alarms, dashboards	CloudWatch Metrics + Alarms	1-min granularity for detailed monitoring
Log aggregation and querying	CloudWatch Logs + Logs Insights	Structured JSON logs; Logs Insights for ad-hoc queries
Distributed tracing	X-Ray	Trace across Lambda, ECS, API Gateway, SDK-instrumented services
Synthetic monitoring (uptime)	CloudWatch Synthetics	Canary scripts to test endpoints
Application performance monitoring	CloudWatch Application Insights	Auto-detects and groups related metrics/logs
Infrastructure events	EventBridge / CloudWatch Events	React to AWS service state changes

Developer Tools and IaC

Use case	Service	Notes
Infrastructure as code (native)	CloudFormation / CDK	CDK (TypeScript/Python) compiles to CloudFormation
Source control	CodeCommit	Managed Git; most teams use GitHub/GitLab instead
CI/CD pipeline	CodePipeline + CodeBuild	Managed pipeline; CodeBuild for build/test steps
Container image registry	ECR (Elastic Container Registry)	Private, integrated with ECS/EKS, image scanning
Artifact storage	CodeArtifact	npm, Maven, pip, NuGet package proxy and hosting

Cost Optimization Quick Reference

Strategy	Best for	Typical saving
Reserved Instances (1-year, no upfront)	Stable EC2 and RDS	~30-40% vs on-demand
Compute Savings Plans	EC2 + Fargate + Lambda mix	~20-30%
Spot Instances	Fault-tolerant batch, stateless workers	~60-80% vs on-demand
S3 Intelligent-Tiering	Objects with unknown access frequency	~40% on cold objects
Graviton (ARM) instances	General-purpose EC2, ECS, RDS	~10-20% vs x86 equivalents
Lambda right-sizing (Power Tuning tool)	All Lambda functions	20-50% memory/cost balance

Frequently Asked Questions

Is cloud-aws free?

Yes, cloud-aws is completely free and open source under the MIT license. Install it with a single command and start using it immediately.

What is the difference between cloud-aws and similar tools?

cloud-aws is an AI agent skill that teaches your coding agent specialized cloud knowledge. Unlike standalone tools, it integrates directly into claude-code, gemini-cli, openai-codex, and other AI agents.

Can I use cloud-aws with Cursor or Windsurf?

cloud-aws works with any AI coding agent that supports the skills protocol, including Claude Code, Cursor, Windsurf, GitHub Copilot, Gemini CLI, and 40+ more.
