Task Statement 2.2: Design highly available and/or fault-tolerant architectures.
📘 AWS Certified Solutions Architect – Associate (SAA-C03)
1. What is Disaster Recovery (DR)?
Disaster Recovery (DR) is the process of restoring applications, data, and infrastructure after a failure such as:
- Region failure
- Data center outage
- Application crash
- Data corruption
The goal is to minimize downtime and data loss.
2. Key Concepts You MUST Know for the Exam
2.1 RTO (Recovery Time Objective)
- Definition: Maximum acceptable time to restore a system after failure
- Example: System must be back within 10 minutes
👉 Lower RTO = faster recovery = higher cost
2.2 RPO (Recovery Point Objective)
- Definition: Maximum acceptable data loss measured in time
- Example: Losing 5 minutes of data is acceptable
👉 Lower RPO = less data loss = higher cost
2.3 Relationship Between RTO, RPO, and Cost
| Requirement | Impact |
|---|---|
| Low RTO | Requires faster failover → expensive |
| Low RPO | Requires continuous replication → expensive |
| High RTO/RPO | Slower recovery → cheaper |
👉 Exam Tip: Always match DR strategy with business requirements (RTO + RPO)
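To make the two objectives concrete, here is a minimal sketch that checks a single outage against them. The function name `meets_objectives` and the timestamps are made up for illustration; the 10-minute RTO and 5-minute RPO reuse the examples above.

```python
from datetime import datetime, timedelta

def meets_objectives(failure_at, last_recovery_point_at, restored_at, rto, rpo):
    """Return (rto_ok, rpo_ok) for one outage."""
    downtime = restored_at - failure_at              # how long the system was unavailable
    data_loss = failure_at - last_recovery_point_at  # window of writes that cannot be recovered
    return downtime <= rto, data_loss <= rpo

# Hypothetical outage: last replicated write 3 minutes before failure,
# service restored 12 minutes after failure.
failure = datetime(2024, 1, 1, 12, 0)
result = meets_objectives(
    failure_at=failure,
    last_recovery_point_at=failure - timedelta(minutes=3),
    restored_at=failure + timedelta(minutes=12),
    rto=timedelta(minutes=10),
    rpo=timedelta(minutes=5),
)
print(result)  # (False, True): RTO missed, RPO met
```

RTO is measured forward from the failure (time to restore), while RPO is measured backward from it (age of the last recoverable data).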
3. AWS Disaster Recovery Strategies (Important)
AWS defines four main DR strategies. Know how each one works and how they differ.
3.1 Backup and Restore (Lowest Cost)
How it works:
- Data is backed up regularly
- Infrastructure is recreated after failure
AWS Services Used:
- Amazon S3
- Amazon S3 Glacier
- AWS Backup
- EBS Snapshots
- RDS Snapshots
Characteristics:
- RTO: High (hours to days)
- RPO: High (data written since the last backup is lost)
- Cost: Very low
When to Use:
- Non-critical applications
- Systems that can tolerate downtime
Key Idea:
👉 Nothing is running until disaster happens
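With Backup and Restore, the backup interval sets the worst-case data loss: a failure just before the next backup loses almost a full interval of writes. The sketch below checks a schedule against an RPO. It assumes backups complete instantly and are always restorable, which real backups are not; the function name is hypothetical.

```python
from datetime import timedelta

def backup_schedule_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss (one full interval) stays within the RPO."""
    worst_case_loss = backup_interval  # simplification: ignores backup duration
    return worst_case_loss <= rpo

# Daily backups cannot satisfy a 4-hour RPO; hourly backups can.
daily_ok = backup_schedule_meets_rpo(timedelta(hours=24), timedelta(hours=4))
hourly_ok = backup_schedule_meets_rpo(timedelta(hours=1), timedelta(hours=4))
print(daily_ok, hourly_ok)  # False True
```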
3.2 Pilot Light
How it works:
- Core components (such as the database) are always running and kept up to date through replication
- The rest of the infrastructure is provisioned only when a disaster occurs
AWS Services Used:
- Amazon RDS / DynamoDB (replicated)
- Amazon EC2 (minimal or no instances running)
- AMIs (Amazon Machine Images)
- AWS CloudFormation
Characteristics:
- RTO: Medium (minutes to hours)
- RPO: Low (data is replicated)
- Cost: Low to medium
When to Use:
- Important applications
- Applications that need faster recovery than Backup and Restore
Key Idea:
👉 Only critical components stay active
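Because most of a Pilot Light environment must be built at failover time, its RTO is roughly the sum of the recovery steps. A rough sketch, with step names and durations that are purely hypothetical (measure your own in a DR drill):

```python
from datetime import timedelta

# Hypothetical durations for illustration only.
PILOT_LIGHT_STEPS = [
    ("promote the replicated database", timedelta(minutes=5)),
    ("launch application servers from AMIs", timedelta(minutes=15)),
    ("switch DNS to the recovery Region", timedelta(minutes=5)),
]

def estimated_rto(steps):
    """Sum the durations of recovery steps that run one after another."""
    return sum((duration for _, duration in steps), timedelta())

total = estimated_rto(PILOT_LIGHT_STEPS)
print(total)  # 0:25:00
```

Steps that can run in parallel would shorten this estimate; the sketch assumes they run sequentially.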
3.3 Warm Standby
How it works:
- A full copy of the system runs continuously at reduced capacity
- It scales up to production capacity during a disaster
AWS Services Used:
- EC2 Auto Scaling
- RDS cross-Region read replicas (Multi-AZ alone covers only Availability Zone failures)
- Elastic Load Balancing
- Route 53
Characteristics:
- RTO: Low (minutes)
- RPO: Low
- Cost: Medium to high
When to Use:
- Business-critical applications
- Applications that need recovery within minutes
Key Idea:
👉 System is always running, just scaled down
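The "scaled down" idea can be expressed as simple arithmetic: the standby runs a fraction of production capacity, and failover adds the difference. A minimal sketch, assuming capacity is counted in identical instances; the function name and the 25% figure are illustrative, not AWS guidance.

```python
import math

def instances_to_add(full_capacity: int, standby_fraction: float) -> int:
    """Instances to launch at failover so the standby matches production capacity."""
    # Keep at least one instance running so the standby can serve traffic at all.
    running = max(1, math.ceil(full_capacity * standby_fraction))
    return full_capacity - running

added = instances_to_add(full_capacity=20, standby_fraction=0.25)
print(added)  # 15
```

In practice this step is usually handled by raising the desired capacity of an Auto Scaling group rather than by launching instances one by one.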
3.4 Multi-Site (Active-Active) (Highest Cost)
How it works:
- Full system runs in multiple Regions simultaneously
- Traffic is shared between them
AWS Services Used:
- Route 53 (latency-based routing with health checks)
- DynamoDB Global Tables
- S3 Cross-Region Replication
- Aurora Global Database
Characteristics:
- RTO: Near zero
- RPO: Near zero
- Cost: Very high
When to Use:
- Mission-critical systems
- No downtime allowed
Key Idea:
👉 Both environments are always active
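The sketch below illustrates why an active-active setup has a near-zero RTO: when one Region fails its health check, the remaining Regions simply absorb its traffic. It is a simplified model that uses static weights rather than latency measurements, and the Region names and weights are examples only.

```python
def traffic_shares(weights: dict[str, int], healthy: set[str]) -> dict[str, float]:
    """Split traffic by weight across healthy Regions only."""
    live = {name: w for name, w in weights.items() if name in healthy}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy Region available")
    return {name: w / total for name, w in live.items()}

weights = {"us-east-1": 50, "eu-west-1": 50}

normal = traffic_shares(weights, {"us-east-1", "eu-west-1"})
degraded = traffic_shares(weights, {"eu-west-1"})
print(normal)    # {'us-east-1': 0.5, 'eu-west-1': 0.5}
print(degraded)  # {'eu-west-1': 1.0}
```

Each Region must therefore be sized (or able to scale quickly) to carry the full load on its own.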
4. Comparison Table (VERY IMPORTANT FOR EXAM)
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | High | High | Low | Low |
| Pilot Light | Medium | Low | Low-Medium | Medium |
| Warm Standby | Low | Low | Medium-High | Medium |
| Multi-Site | Very Low | Very Low | Very High | High |
👉 Exam Trick:
If the question mentions:
- “Cheapest” → Backup & Restore
- “Fast recovery, low cost” → Pilot Light
- “Quick failover” → Warm Standby
- “Zero downtime” → Multi-Site
5. Choosing the Right DR Strategy (Exam Logic)
To select the correct DR strategy, follow this thinking process:
Step 1: Check RTO requirement
- Seconds/minutes → Multi-Site or Warm Standby
- Hours → Backup & Restore or Pilot Light
Step 2: Check RPO requirement
- Near zero data loss → Continuous replication needed
- Some data loss acceptable → Backup-based solutions
Step 3: Check Budget
- Low budget → Backup & Restore
- Medium → Pilot Light / Warm Standby
- High → Multi-Site
Step 4: Check Application Criticality
- Non-critical → Backup & Restore
- Important → Pilot Light
- Business-critical → Warm Standby
- Mission-critical → Multi-Site
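The first two steps can be sketched as a small decision function that returns the cheapest strategy meeting the objectives. The numeric thresholds below are assumptions chosen for illustration; AWS does not publish fixed cut-offs, and exam questions describe requirements in words rather than exact numbers.

```python
from datetime import timedelta

def choose_dr_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Pick the cheapest strategy that meets the objectives (illustrative thresholds)."""
    if rto <= timedelta(minutes=1) and rpo <= timedelta(minutes=1):
        return "Multi-Site"        # near-zero RTO and RPO
    if rto <= timedelta(minutes=30):
        return "Warm Standby"      # recovery within minutes
    if rto <= timedelta(hours=4) or rpo <= timedelta(minutes=15):
        return "Pilot Light"       # replicated data, infrastructure built on demand
    return "Backup & Restore"      # hours to days of downtime is acceptable

print(choose_dr_strategy(timedelta(minutes=10), timedelta(minutes=5)))  # Warm Standby
print(choose_dr_strategy(timedelta(hours=24), timedelta(hours=24)))     # Backup & Restore
```

Budget and criticality (Steps 3 and 4) then act as a sanity check: if the chosen strategy exceeds the budget, the RTO/RPO targets need to be renegotiated with the business.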
6. AWS Services Used in DR (Exam Focus)
Data Replication
- S3 Cross-Region Replication (CRR)
- DynamoDB Global Tables
- Aurora Global Database
- RDS Read Replicas
Backup Services
- AWS Backup
- EBS Snapshots
- S3 Glacier
Traffic Routing & Failover
- Route 53
  - Failover routing
  - Health checks
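DNS-based failover is not instantaneous, and the delay counts toward RTO. A rough estimate is the time for health checks to detect the failure plus the time for cached DNS answers to expire. The defaults below (30-second check interval, failure threshold of 3, 60-second TTL) are assumptions for this sketch; check the values configured on your own health checks and records.

```python
def failover_delay_seconds(request_interval: int = 30,
                           failure_threshold: int = 3,
                           dns_ttl: int = 60) -> int:
    """Approximate worst-case seconds before clients reach the secondary endpoint."""
    detection = request_interval * failure_threshold  # consecutive failed checks
    return detection + dns_ttl                        # plus cached DNS answers expiring

default_delay = failover_delay_seconds()
fast_delay = failover_delay_seconds(request_interval=10, dns_ttl=30)
print(default_delay)  # 150
print(fast_delay)     # 60
```

Shorter check intervals and lower TTLs reduce the delay at the cost of more health-check and DNS query traffic.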
Compute Recovery
- EC2 AMIs
- Auto Scaling Groups
- CloudFormation (infrastructure automation)
7. Important Exam Scenarios
Scenario 1:
- “Restoring the system after several hours is acceptable”
👉 Answer: Backup & Restore
Scenario 2:
- “Keep the database running; start the application servers only during a failure”
👉 Answer: Pilot Light
Scenario 3:
- “System must recover within minutes”
👉 Answer: Warm Standby
Scenario 4:
- “No downtime allowed, global users”
👉 Answer: Multi-Site
8. Best Practices (Exam Must-Know)
- Always define RTO and RPO first
- Automate recovery using:
  - CloudFormation
  - Auto Scaling
- Use Multi-AZ for high availability (it is not a DR strategy on its own)
- Use multiple Regions for disaster recovery
- Test the DR strategy regularly
- Encrypt backups and replicate securely
9. Common Mistakes (Exam Traps)
❌ Confusing High Availability vs Disaster Recovery
- HA = within the same Region (Multi-AZ)
- DR = across Regions
❌ Choosing expensive solution unnecessarily
- Match the stated requirement; do not default to the most capable option
❌ Ignoring RPO/RTO
- Most questions are based on these
10. Final Summary
- DR ensures systems recover after failure
- Four key strategies:
  - Backup & Restore (cheapest, slowest)
  - Pilot Light (partial running)
  - Warm Standby (scaled-down full system)
  - Multi-Site (fully active, fastest)
👉 Golden Rule for Exam:
The correct DR strategy is the one that meets the RTO, RPO, and cost requirements, not the most advanced one.
