AWS Advanced

15 cards — 🟢 1 easy | 🟡 12 medium | 🔴 2 hard

🟢 Easy (1)

1. What is the difference between cost allocation tags and resource tags?

**Resource tags**: key-value pairs you attach to AWS resources for identification, automation, and access control. Any tag can be a resource tag.
**Cost allocation tags**: a SUBSET of tags activated in the Billing console for cost tracking in Cost Explorer and Cost & Usage Reports. Two types:
- AWS-generated: `aws:createdBy` (auto-generated, activate in billing)
- User-defined: any tag you create, then activate as a cost allocation tag
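By default, activating a tag as a cost allocation tag does NOT retroactively tag past usage -- cost data carries the tag only from the activation date forward (a backfill of past months can be requested separately in the Billing console). Best practice: standardize tag keys (Environment, Team, Project) and enforce them via SCPs or Tag Policies.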

🟡 Medium (12)

1. How does RDS Multi-AZ failover work?

RDS Multi-AZ maintains a synchronous standby replica in a different Availability Zone. On failure (host crash, AZ outage, storage failure), AWS automatically fails over by updating the DNS CNAME record to point to the standby.
Failover typically takes 60-120 seconds. Your application should use the RDS endpoint (DNS name), NOT an IP address. Connection strings survive failover without changes, but existing TCP connections are dropped -- apps must reconnect.
Multi-AZ is for HA, NOT read scaling. For read scaling, use Read Replicas (async). Aurora Multi-AZ works differently -- it uses a shared storage layer with up to 15 read replicas that can be promoted.

2. What are the S3 lifecycle policy transition rules?

S3 lifecycle policies automate object transitions between storage classes and expiration.
Transition waterfall: Standard -> Standard-IA -> Intelligent-Tiering -> One Zone-IA -> Glacier Instant Retrieval -> Glacier Flexible Retrieval -> Glacier Deep Archive. A rule can skip classes, but objects only move down this order, never back up.
Key constraints: objects must spend at least 30 days in Standard before transitioning to Standard-IA or One Zone-IA. Standard-IA and One Zone-IA bill a minimum of 128KB per object, and lifecycle rules skip objects smaller than 128KB by default.
Expiration rules delete objects after a specified number of days. Combine with versioning: `NoncurrentVersionExpiration` cleans up old versions.
Use S3 analytics to identify optimal transition timing based on access patterns.
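
The rules above can be sketched as a lifecycle configuration written as a Python dict, in the shape the S3 API accepts for a bucket lifecycle configuration. The rule ID, prefix, and day counts are hypothetical, and `check_rule` is an illustrative helper that covers only the two constraints named above, not S3's full validation:

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-logs",            # hypothetical rule name
            "Filter": {"Prefix": "logs/"},   # hypothetical prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},        # Glacier Flexible Retrieval
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 2555},
            # Only meaningful on versioned buckets
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }
    ]
}

# Waterfall order of the storage classes a rule may transition into
WATERFALL = [
    "STANDARD_IA", "INTELLIGENT_TIERING", "ONEZONE_IA",
    "GLACIER_IR", "GLACIER", "DEEP_ARCHIVE",
]


def check_rule(rule):
    """Return a list of problems found in one rule (empty list means none found)."""
    problems = []
    previous_rank = -1
    for transition in sorted(rule.get("Transitions", []), key=lambda t: t["Days"]):
        storage_class = transition["StorageClass"]
        rank = WATERFALL.index(storage_class)
        if storage_class in ("STANDARD_IA", "ONEZONE_IA") and transition["Days"] < 30:
            problems.append(f"{storage_class}: needs at least 30 days, got {transition['Days']}")
        if rank <= previous_rank:
            problems.append(f"{storage_class}: transitions cannot move back up the waterfall")
        previous_rank = rank
    return problems


problems = check_rule(lifecycle_configuration["Rules"][0])
```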

3. How do AWS Organizations and SCPs work?

AWS Organizations manages multiple AWS accounts with consolidated billing and centralized governance.
SCPs (Service Control Policies) set permission guardrails for member accounts. Key behaviors:
- SCPs do NOT grant permissions -- they set maximum allowed permissions. Users still need IAM policies.
- SCPs apply to all users/roles in the account INCLUDING the root user.
- The management (master) account is NOT affected by SCPs.
- SCPs use IAM policy syntax with Allow/Deny. Effective permissions = intersection of SCP + IAM policy.
- Common SCPs: deny all access outside approved regions, prevent disabling CloudTrail, require encryption.
Organizational Units (OUs) inherit SCPs from parent OUs. Attach SCPs at the OU or account level.
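
As an illustration of the guardrail idea, here is a sketch of a region-restriction SCP built as a Python dict and serialized to JSON. The approved regions and the list of exempted global services are assumptions to adapt. `effective_permissions` is a toy model of the intersection rule using exact action names only; it does not reproduce IAM's real evaluation logic (wildcards, conditions, resource policies):

```python
import json

APPROVED_REGIONS = ["us-east-1", "eu-west-1"]  # hypothetical allow-list

region_guardrail_scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideApprovedRegions",
            "Effect": "Deny",
            # Global services are exempted so the deny does not break them
            "NotAction": ["iam:*", "organizations:*", "route53:*", "cloudfront:*", "support:*"],
            "Resource": "*",
            "Condition": {"StringNotEquals": {"aws:RequestedRegion": APPROVED_REGIONS}},
        },
        {
            "Sid": "ProtectCloudTrail",
            "Effect": "Deny",
            "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
            "Resource": "*",
        },
    ],
}

scp_json = json.dumps(region_guardrail_scp, indent=2)


def effective_permissions(iam_allowed, scp_allowed, scp_denied=frozenset()):
    """Toy model: an action is usable only if IAM and the SCP both allow it
    and no SCP statement explicitly denies it."""
    return (set(iam_allowed) & set(scp_allowed)) - set(scp_denied)


usable = effective_permissions(
    iam_allowed={"s3:GetObject", "ec2:RunInstances", "cloudtrail:StopLogging"},
    scp_allowed={"s3:GetObject", "s3:PutObject", "cloudtrail:StopLogging"},
    scp_denied={"cloudtrail:StopLogging"},
)
```

Here `usable` ends up containing only `s3:GetObject`: `ec2:RunInstances` is missing from the SCP allow set, `s3:PutObject` is missing from IAM, and `cloudtrail:StopLogging` is explicitly denied.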

4. Compare ECS task definitions and EKS pod specs.

**ECS Task Definition**: JSON document defining containers, images, CPU/memory, ports, env vars, volumes, IAM role (taskRoleArn), logging (awslogs driver). Task = one or more containers that run together. Service = desired count of tasks.
**EKS Pod Spec**: YAML manifest with containers, images, resources (requests/limits), ports, env, volumes, service accounts. Pod = one or more containers sharing network/storage.
Key differences: ECS uses `taskRoleArn` (IAM role per task); EKS uses IRSA (IAM Roles for Service Accounts). ECS has native Fargate support; EKS Fargate has more limitations. ECS uses awslogs for CloudWatch; EKS typically uses Fluent Bit/Fluentd.
ECS is simpler for AWS-only shops; EKS offers K8s portability.
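
To make the comparison concrete, here is the same single-container workload sketched in both shapes as Python dicts (ECS task definitions are JSON; the pod manifest would normally be written as YAML). All names, ARNs, and sizes are hypothetical, and both documents are trimmed to the fields discussed above rather than being complete deployable definitions:

```python
ROLE_ARN = "arn:aws:iam::123456789012:role/web-role"  # hypothetical

ecs_task_definition = {
    "family": "web",
    "taskRoleArn": ROLE_ARN,                 # IAM role per task
    "networkMode": "awsvpc",
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",
    "memory": "512",
    "containerDefinitions": [
        {
            "name": "web",
            "image": "nginx:1.27",
            "portMappings": [{"containerPort": 80}],
            "environment": [{"name": "ENV", "value": "prod"}],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/web",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "web",
                },
            },
        }
    ],
}

# With IRSA the IAM role is attached to a Kubernetes service account ...
eks_service_account = {
    "apiVersion": "v1",
    "kind": "ServiceAccount",
    "metadata": {
        "name": "web-sa",
        "annotations": {"eks.amazonaws.com/role-arn": ROLE_ARN},
    },
}

# ... and the pod refers to that service account by name.
eks_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {
        "serviceAccountName": "web-sa",
        "containers": [
            {
                "name": "web",
                "image": "nginx:1.27",
                "ports": [{"containerPort": 80}],
                "env": [{"name": "ENV", "value": "prod"}],
                "resources": {
                    "requests": {"cpu": "250m", "memory": "256Mi"},
                    "limits": {"cpu": "500m", "memory": "512Mi"},
                },
            }
        ],
    },
}
```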

5. When do you use Kinesis vs SQS vs SNS?

**SNS** (Simple Notification Service): pub/sub fanout. One message -> many subscribers (Lambda, SQS, email, HTTP). No persistence. Use for: event notifications, fanout to multiple consumers.
**SQS** (Simple Queue Service): message queue. One message -> one consumer (pull-based). Messages persist until processed. Use for: decoupling services, work queues, buffering.
**Kinesis Data Streams**: real-time streaming. Ordered per shard, replayable, multiple consumers read independently. Retention is 24 hours by default and can be extended up to 365 days. Use for: real-time analytics, log aggregation, IoT telemetry, event sourcing.
Decision tree: Need fanout? SNS. Need a work queue? SQS. Need ordered, replayable, real-time streaming? Kinesis. Common pattern: SNS -> SQS (fanout to multiple queues).
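
The decision tree can be written down as a small function. This is only a restatement of the rule of thumb above; the function name and flags are made up for illustration, and real choices also depend on throughput, cost, and ordering scope:

```python
def pick_messaging_service(*, fanout=False, work_queue=False, ordered_replay=False):
    """Map the three questions from the decision tree to a service choice."""
    if ordered_replay:
        return "Kinesis Data Streams"
    if fanout and work_queue:
        return "SNS -> SQS"  # fan out to several queues, each with its own consumer
    if fanout:
        return "SNS"
    if work_queue:
        return "SQS"
    return "no match"


choice = pick_messaging_service(fanout=True, work_queue=True)
```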

6. What are VPC peering limitations and when do you need Transit Gateway?

VPC Peering connects two VPCs directly. Limitations:
- Non-transitive: if A peers with B and B with C, A cannot reach C through B.
- No overlapping CIDR blocks allowed.
- No edge routing (cannot route through a peered VPC's VPN/DX/NAT).
- Each peering is point-to-point; N VPCs need N*(N-1)/2 peerings.
**Transit Gateway (TGW)**: a hub that connects multiple VPCs, VPNs, and Direct Connect in a star topology. Supports transitive routing, route tables, and multicast.
Use peering for: 2-3 VPCs, simple connectivity. Use TGW for: many VPCs, hub-and-spoke, hybrid connectivity, centralized egress. TGW charges per attachment + per GB; peering charges per GB only.
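
A quick calculation shows why a full mesh of peerings stops scaling. The sketch below compares the N*(N-1)/2 peering count with one Transit Gateway attachment per VPC (assuming a single TGW and one attachment each):

```python
def peering_connections(vpc_count):
    """Peerings needed for a full mesh: N * (N - 1) / 2."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count):
    """Hub-and-spoke: one attachment per VPC (assumes a single Transit Gateway)."""
    return vpc_count


for n in (3, 10, 50):
    print(f"{n:>3} VPCs: {peering_connections(n):>5} peerings vs {tgw_attachments(n):>3} TGW attachments")
```

At 3 VPCs the two approaches need the same number of links; at 50 VPCs the mesh needs 1225 peerings against 50 attachments.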

7. What is the difference between AWS Config and CloudTrail?

**CloudTrail**: records WHO did WHAT and WHEN -- API calls made in your account. Answers: "Who launched this instance?" Logs management events (free) and data events (paid). Stores JSON event logs in S3.
**AWS Config**: records WHAT your resources look like and HOW they changed over time. Answers: "Was this security group open to 0.0.0.0/0 last Tuesday?" Provides a configuration timeline per resource.
AWS Config also supports **Config Rules** for continuous compliance evaluation (e.g., "all EBS volumes must be encrypted"). Remediation actions can auto-fix non-compliant resources.
Use together: CloudTrail tells you who changed a resource; Config tells you what changed.

8. Explain DynamoDB partition keys, GSI vs LSI.

**Partition key**: the primary hash key that determines which partition stores the item. Must be chosen for even distribution -- avoid hot partitions.
**Sort key** (optional): combined with partition key, allows range queries within a partition.
**LSI (Local Secondary Index)**: same partition key, different sort key. Must be created at table creation. Limited to 10GB per partition key value. Shares throughput with the base table.
**GSI (Global Secondary Index)**: different partition key and optional sort key. Can be created anytime. Has its own throughput provisioning. Eventually consistent reads only.
Best practices: choose high-cardinality partition keys, use GSIs for access patterns that do not align with the base table, monitor partition-level metrics.
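
A sketch of how these pieces fit together in one table definition, written as a Python dict in the shape of the DynamoDB `CreateTable` request. The table, attribute, and index names are hypothetical, as are the capacity numbers:

```python
orders_table = {
    "TableName": "Orders",
    "BillingMode": "PROVISIONED",
    # Only attributes used in a key schema are declared up front
    "AttributeDefinitions": [
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
        {"AttributeName": "order_total", "AttributeType": "N"},
        {"AttributeName": "product_id", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "customer_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "order_date", "KeyType": "RANGE"},   # sort key
    ],
    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    # LSI: same partition key, different sort key; must be declared at creation
    "LocalSecondaryIndexes": [
        {
            "IndexName": "by_total",
            "KeySchema": [
                {"AttributeName": "customer_id", "KeyType": "HASH"},
                {"AttributeName": "order_total", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
        }
    ],
    # GSI: different partition key, with its own throughput
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "by_product",
            "KeySchema": [
                {"AttributeName": "product_id", "KeyType": "HASH"},
                {"AttributeName": "order_date", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    ],
}


def partition_key(key_schema):
    """Return the HASH attribute name from a key schema."""
    return next(k["AttributeName"] for k in key_schema if k["KeyType"] == "HASH")


table_pk = partition_key(orders_table["KeySchema"])
lsi_pk = partition_key(orders_table["LocalSecondaryIndexes"][0]["KeySchema"])
gsi_pk = partition_key(orders_table["GlobalSecondaryIndexes"][0]["KeySchema"])
```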

9. Compare ALB vs NLB features.

**ALB (Application Load Balancer)**: Layer 7 (HTTP/HTTPS). Supports path-based and host-based routing, WebSockets, HTTP/2, gRPC, Lambda targets, sticky sessions, authentication (Cognito/OIDC).
**NLB (Network Load Balancer)**: Layer 4 (TCP/UDP/TLS). Ultra-low latency, millions of RPS, static IPs (one per AZ), preserves client source IP, supports PrivateLink.
Key differences: ALB terminates TLS and inspects HTTP; NLB forwards TCP/UDP and either passes TLS through untouched or terminates it on a TLS listener. ALB supports Lambda as a target group; NLB does not. NLB gets static/Elastic IPs; ALB uses dynamic IPs.
Use ALB for: web apps, microservices, API routing. Use NLB for: extreme performance, TCP/UDP protocols, static IPs, PrivateLink endpoints.

10. What are Route 53 routing policies?

Route 53 supports multiple routing policies:
- **Simple**: single resource, no health checks.
- **Weighted**: distribute traffic by percentage (e.g., 80/20 for A/B testing).
- **Latency-based**: route to the region with lowest latency for the user.
- **Failover**: active-passive; route to secondary if primary health check fails.
- **Geolocation**: route based on user's geographic location (continent, country, state).
- **Geoproximity**: route based on geographic distance, with bias to shift traffic.
- **Multivalue answer**: return multiple healthy IPs (like simple + health checks).
Combine policies using alias records. Health checks can monitor endpoints, CloudWatch alarms, or other health checks (calculated).
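
For weighted routing, each record's share of traffic is its weight divided by the sum of all weights in the group. A small sketch of that arithmetic, with hypothetical record names and assuming at least one non-zero weight:

```python
def traffic_shares(weights):
    """Return the expected fraction of DNS responses per record.

    `weights` maps a record identifier to its weight. Assumes the weights
    sum to more than zero.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = traffic_shares({"blue": 80, "green": 20})
```

The shares are statistical: resolvers cache answers for the record's TTL, so short measurement windows can deviate from the configured split.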

11. Compare AWS Secrets Manager vs Systems Manager Parameter Store.

**Parameter Store**: free tier (standard params), stores strings, string lists, and SecureString (encrypted with KMS). Max 10,000 params (standard), 4KB/8KB size limit. No automatic rotation.
**Secrets Manager**: $0.40/secret/month + $0.05 per 10K API calls. Built-in automatic rotation (Lambda-based) for RDS, Redshift, DocumentDB. Cross-account access. Max 64KB per secret. Stores JSON (multiple key-value pairs).
Use Parameter Store for: configuration values, non-sensitive data, simple secrets without rotation needs. Use Secrets Manager for: database credentials, API keys that need automatic rotation, cross-account secret sharing.
Both integrate with CloudFormation dynamic references: `{{resolve:ssm:name}}` and `{{resolve:secretsmanager:name}}`.
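
The Secrets Manager prices quoted above translate into a simple monthly estimate. The sketch below uses only those two figures; actual pricing can change and varies by account, so treat the constants as assumptions:

```python
PRICE_PER_SECRET = 0.40        # USD per secret per month (figure quoted above)
PRICE_PER_10K_CALLS = 0.05     # USD per 10,000 API calls (figure quoted above)


def secrets_manager_monthly_cost(secret_count, api_calls):
    """Estimate the monthly Secrets Manager bill in USD."""
    return secret_count * PRICE_PER_SECRET + (api_calls / 10_000) * PRICE_PER_10K_CALLS


# 25 secrets read 2 million times a month: about $10 storage + $10 API calls
estimate = secrets_manager_monthly_cost(25, 2_000_000)
```

Caching secrets in the application cuts the API-call portion, which in this example is half of the bill.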

12. Compare Step Functions vs Lambda orchestration.

**Lambda-to-Lambda** (direct invocation): simple but creates tight coupling, error handling is manual, hard to visualize, timeout limited to 15 minutes per function.
**Step Functions**: visual workflow orchestrator using JSON state machines (ASL). Supports sequential, parallel, branching, error handling with retries/catch, wait states, map (fan-out), and human approval steps.
Two types: Standard (up to 1 year, exactly-once, $0.025/1K transitions) and Express (up to 5 min, at-least-once, cheaper for high-volume).
Use Step Functions when: workflows have multiple steps, need error handling/retries, require visual monitoring, or run longer than 15 minutes. Use direct Lambda calls for simple 1-2 step processes.
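
A sketch of what retries and catch look like in Amazon States Language, built as a Python dict and serialized to JSON. The state names and Lambda ARNs are hypothetical. `dangling_targets` is an illustrative helper that checks that every `Next` target exists; it is not part of any AWS SDK:

```python
import json

state_machine = {
    "Comment": "Two-step order workflow with retry and catch (illustrative)",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,   # waits 2s, 4s, 8s between attempts
                }
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "OrderFailed"}],
            "Next": "ChargeCard",
        },
        "ChargeCard": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
            "End": True,
        },
        "OrderFailed": {
            "Type": "Fail",
            "Error": "OrderFailed",
            "Cause": "Validation did not succeed after retries",
        },
    },
}


def dangling_targets(definition):
    """Return state names that are referenced but not defined."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return targets - set(states)


missing = dangling_targets(state_machine)
definition_json = json.dumps(state_machine, indent=2)
```

With direct Lambda-to-Lambda calls, the retry loop and the failure branch would have to be hand-written inside the calling function.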

🔴 Hard (2)

1. What are the key CloudFormation intrinsic functions?

- `!Ref` -- returns the value of a parameter or physical ID of a resource. `!Ref MyBucket` returns the bucket name.
- `!GetAtt` -- gets an attribute of a resource. `!GetAtt MyBucket.Arn` returns the ARN.
- `!Sub` -- string substitution. `!Sub "arn:aws:s3:::${BucketName}"` or with mapping: `!Sub ["www.${Domain}", {Domain: !Ref RootDomain}]`
- `!Join` -- concatenate strings. `!Join ["-", [!Ref Env, app, bucket]]`
- `!If` -- conditional value. `!If [IsProd, m5.large, t3.micro]` (references a Conditions key)
- `!Select` -- pick from a list. `!Select [0, !GetAZs ""]`
- `Fn::ImportValue` -- import an exported value from another stack for cross-stack references.
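
The short forms above are YAML tags; in JSON templates each one becomes a single-key object (`Ref`, `Fn::GetAtt`, `Fn::Sub`, and so on). The sketch below builds a small template as a Python dict to show the functions working together. The parameter, resource, and export names are hypothetical, and the template is meant as an illustration rather than something to deploy as-is:

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "Env": {"Type": "String", "AllowedValues": ["prod", "dev"], "Default": "dev"},
        "AmiId": {"Type": "AWS::EC2::Image::Id"},
    },
    "Conditions": {
        # !If refers to this key by name
        "IsProd": {"Fn::Equals": [{"Ref": "Env"}, "prod"]},
    },
    "Resources": {
        "MyBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                # !Join ["-", [!Ref Env, app, bucket]]
                "BucketName": {"Fn::Join": ["-", [{"Ref": "Env"}, "app", "bucket"]]},
            },
        },
        "AppInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": {"Ref": "AmiId"},
                # !If [IsProd, m5.large, t3.micro]
                "InstanceType": {"Fn::If": ["IsProd", "m5.large", "t3.micro"]},
                # !Select [0, !GetAZs ""]
                "AvailabilityZone": {"Fn::Select": [0, {"Fn::GetAZs": ""}]},
            },
        },
    },
    "Outputs": {
        "BucketArn": {
            # !GetAtt MyBucket.Arn
            "Value": {"Fn::GetAtt": ["MyBucket", "Arn"]},
            # Another stack can read this with Fn::ImportValue
            "Export": {"Name": {"Fn::Sub": "${AWS::StackName}-bucket-arn"}},
        }
    },
}

template_json = json.dumps(template, indent=2)
```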

2. How does Lambda concurrency work (reserved, provisioned, cold starts)?

**Concurrency** = number of simultaneous executions. Default account limit: 1000 per Region (soft limit, increase via Service Quotas or support).
**Reserved concurrency**: guarantees N concurrent executions for a function and caps it at N. Other functions cannot steal this capacity. Free.
**Provisioned concurrency**: pre-initializes N execution environments to eliminate cold starts. Charged per provisioned-concurrency-hour.
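**Cold starts**: occur when Lambda creates a new execution environment (download code, start the runtime, run initialization code outside the handler) before the handler can run. Typical: 100ms-2s depending on runtime and package size. VPC-attached Lambdas used to be the worst case because an ENI was set up during the cold start; since AWS moved ENI creation to function configuration time (2019), that penalty is largely gone.
Mitigation: provisioned concurrency, keep deployment packages and init code small, use SnapStart (Java), attach to a VPC only when needed. Layers help share dependencies but still count toward package size, so they do not shorten cold starts by themselves.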
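
The interaction between the account limit and reserved concurrency can be modeled in a few lines. This is a toy admission check, not Lambda's real scheduler. The function names are hypothetical, and the 100-execution minimum for the unreserved pool is stated as an assumption about current service behavior:

```python
ACCOUNT_LIMIT = 1000   # default regional concurrency limit
MIN_UNRESERVED = 100   # assumed floor Lambda keeps for functions without a reservation


def unreserved_pool(reserved, account_limit=ACCOUNT_LIMIT):
    """Concurrency left for functions that have no reservation."""
    pool = account_limit - sum(reserved.values())
    if pool < MIN_UNRESERVED:
        raise ValueError("reservations leave less than the minimum unreserved pool")
    return pool


def admit(function, in_flight, reserved, account_limit=ACCOUNT_LIMIT):
    """Return True if one more invocation can start, False if it would be throttled.

    `in_flight` maps function name -> currently running executions.
    `reserved` maps function name -> reserved concurrency.
    """
    if function in reserved:
        # Reserved functions get exactly their reservation: a guarantee and a cap
        return in_flight.get(function, 0) < reserved[function]
    used = sum(n for name, n in in_flight.items() if name not in reserved)
    return used < unreserved_pool(reserved, account_limit)


reserved = {"checkout": 200}
pool = unreserved_pool(reserved)
```

With 200 reserved for `checkout`, every other function shares a pool of 800: a burst on `report` can be throttled at 800 while `checkout` still has its full reservation available.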