Task Statement 2.2: Design highly available and/or fault-tolerant architectures.

📘 AWS Certified Solutions Architect – Associate (SAA-C03)

1. What is Disaster Recovery (DR)?

Disaster Recovery (DR) is the process of restoring applications, data, and infrastructure after a failure such as:

Region failure
Data center outage
Application crash
Data corruption

The goal is to minimize downtime and data loss.

2. Key Concepts You MUST Know for the Exam

2.1 RTO (Recovery Time Objective)

Definition: Maximum acceptable time to restore a system after failure
Example: System must be back within 10 minutes

👉 Lower RTO = faster recovery = higher cost

2.2 RPO (Recovery Point Objective)

Definition: Maximum acceptable data loss measured in time
Example: Losing 5 minutes of data is acceptable

👉 Lower RPO = less data loss = higher cost

2.3 Relationship Between RTO, RPO, and Cost

Requirement	Impact
Low RTO	Requires faster failover → expensive
Low RPO	Requires continuous replication → expensive
High RTO/RPO	Slower recovery and more data loss → cheaper

👉 Exam Tip: Always match DR strategy with business requirements (RTO + RPO)
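
To make the two definitions concrete, the sketch below checks a design against its targets. It assumes periodic backups, so the worst-case data loss equals the backup interval. The function name and the sample figures are illustrative only, not AWS values.

```python
def meets_objectives(backup_interval_min, restore_time_min, rpo_min, rto_min):
    """Return (rpo_ok, rto_ok) for a design that relies on periodic backups.

    Assumption: a failure just before the next backup loses one full
    interval of data, so worst-case data loss == backup interval.
    """
    rpo_ok = backup_interval_min <= rpo_min
    rto_ok = restore_time_min <= rto_min
    return rpo_ok, rto_ok


# Hourly backups with a 4-hour restore, checked against RPO 5 min / RTO 10 min
print(meets_objectives(60, 240, rpo_min=5, rto_min=10))  # (False, False)
# Backups every 5 minutes with an 8-minute restore meet the same targets
print(meets_objectives(5, 8, rpo_min=5, rto_min=10))     # (True, True)
```

Tightening either target forces more frequent replication or faster failover, which is where the extra cost comes from.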

3. AWS Disaster Recovery Strategies (Important)

AWS defines 4 main DR strategies. You must know all of them clearly.

3.1 Backup and Restore (Lowest Cost)

How it works:

Data is backed up regularly
Infrastructure is recreated after failure

AWS Services Used:

Amazon S3
Amazon S3 Glacier
AWS Backup
EBS Snapshots
RDS Snapshots

Characteristics:

RTO: High (hours to days)
RPO: High (changes made since the last backup are lost)
Cost: Very low

When to Use:

Non-critical applications
Systems that can tolerate downtime

Key Idea:

👉 Nothing is running until disaster happens
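
A backup only helps in a Region failure if a copy lives in another Region. The sketch below shows one way this might look for an EBS snapshot. It assumes the caller passes in an object that behaves like `boto3.client("ec2", region_name=<recovery Region>)`; the function name `copy_snapshot_to_dr` is hypothetical, and the client is injected so the code does not depend on boto3 or AWS credentials.

```python
def copy_snapshot_to_dr(ec2_dr_client, source_region, snapshot_id):
    """Copy an EBS snapshot into the Region that ec2_dr_client points at.

    ec2_dr_client is assumed to expose copy_snapshot() the way the
    boto3 EC2 client does. The copy is requested from the destination
    (recovery) Region and pulls from source_region.
    """
    response = ec2_dr_client.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
        Encrypted=True,  # keep the copy encrypted at rest
    )
    # The response carries the ID of the new snapshot in the recovery Region.
    return response["SnapshotId"]
```

In practice AWS Backup can schedule cross-Region copies without custom code; the function above only illustrates what such a copy involves.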

3.2 Pilot Light

How it works:

Core components (such as the database) are always running and replicating data
The rest of the infrastructure is created when a disaster occurs

AWS Services Used:

Amazon RDS / DynamoDB (replicated)
Amazon EC2 (minimal footprint; application servers are launched during recovery)
AMI templates
CloudFormation

Characteristics:

RTO: Medium (minutes to hours)
RPO: Low (data is replicated)
Cost: Low to medium

When to Use:

Important applications
Need faster recovery than backup

Key Idea:

👉 Only critical components stay active
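
Activating a pilot light usually comes down to two actions: make the replicated database writable, then start the application tier. The sketch below is one possible shape for that runbook. It assumes both arguments behave like boto3 `rds` and `autoscaling` clients for the recovery Region, and that an Auto Scaling group already exists there with desired capacity 0 and a maximum size at least as large as the requested capacity. The function name and parameters are hypothetical.

```python
def activate_pilot_light(rds_client, autoscaling_client,
                         replica_id, asg_name, desired_capacity):
    """Turn a pilot-light environment into a serving environment.

    Clients are injected so the sketch runs without AWS access.
    """
    # 1. Promote the replicated database so it accepts writes.
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)

    # 2. Launch application servers from the pre-built AMI / launch template
    #    by raising the group's desired capacity from zero.
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=desired_capacity,
        HonorCooldown=False,  # do not wait for a cooldown during recovery
    )
```

Traffic still has to be redirected afterwards (for example with Route 53), which is why the RTO is longer than with Warm Standby.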

3.3 Warm Standby

How it works:

Full system is running but at reduced capacity
Scales up during disaster

AWS Services Used:

EC2 Auto Scaling
RDS Multi-AZ / cross-Region read replicas
Elastic Load Balancer
Route 53

Characteristics:

RTO: Low (minutes)
RPO: Low
Cost: Medium to high

When to Use:

Business-critical applications
Need quick recovery

Key Idea:

👉 System is always running, just scaled down
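
Because the standby fleet is already serving at reduced size, failover is mostly a resize. A minimal sketch, assuming `autoscaling_client` behaves like a boto3 `autoscaling` client for the recovery Region; the function name and the capacity figures passed to it are hypothetical.

```python
def scale_up_standby(autoscaling_client, asg_name, full_min, full_max):
    """Resize a scaled-down standby fleet to production capacity."""
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=full_min,
        MaxSize=full_max,
        DesiredCapacity=full_min,  # start at the production minimum
    )
```

Unlike Pilot Light, nothing has to be launched from scratch, which is what keeps the RTO in the range of minutes.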

3.4 Multi-Site (Active-Active) (Highest Cost)

How it works:

Full system runs in multiple Regions simultaneously
Traffic is shared between them

AWS Services Used:

Route 53 (latency/health-based routing)
DynamoDB Global Tables
S3 Cross-Region Replication
Aurora Global Database

Characteristics:

RTO: Near zero
RPO: Near zero
Cost: Very high

When to Use:

Mission-critical systems
No downtime allowed

Key Idea:

👉 Both environments are always active

4. Comparison Table (VERY IMPORTANT FOR EXAM)

Strategy	RTO	RPO	Cost	Complexity
Backup & Restore	High	High	Low	Low
Pilot Light	Medium	Low	Low-Medium	Medium
Warm Standby	Low	Low	Medium-High	Medium
Multi-Site	Very Low	Very Low	Very High	High

👉 Exam Trick:
If question mentions:

“Cheapest” → Backup & Restore
“Fast recovery, low cost” → Pilot Light
“Quick failover” → Warm Standby
“Zero downtime” → Multi-Site

5. Choosing the Right DR Strategy (Exam Logic)

To select the correct DR strategy, follow this thinking process:

Step 1: Check RTO requirement

Seconds/minutes → Multi-Site or Warm Standby
Tens of minutes to hours → Pilot Light
Hours to days → Backup & Restore

Step 2: Check RPO requirement

Near zero data loss → Continuous replication needed
Some data loss acceptable → Backup-based solutions

Step 3: Check Budget

Low budget → Backup & Restore
Medium → Pilot Light / Warm Standby
High → Multi-Site

Step 4: Check Application Criticality

Non-critical → Backup
Important → Pilot Light
Business-critical → Warm Standby
Mission-critical → Multi-Site
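
The four steps above can be sketched as a small lookup: walk the strategies from cheapest to most expensive and stop at the first one that meets both targets. The RTO/RPO figures per strategy are rough assumptions chosen for this illustration, not numbers published by AWS.

```python
# (name, assumed typical RTO in minutes, assumed typical RPO in minutes),
# ordered cheapest first. The numbers are illustrative only.
STRATEGIES = [
    ("Backup & Restore", 24 * 60, 24 * 60),
    ("Pilot Light", 60, 5),
    ("Warm Standby", 10, 5),
    ("Multi-Site", 1, 1),
]


def choose_strategy(rto_min, rpo_min):
    """Return the cheapest strategy whose assumed RTO and RPO fit the targets."""
    for name, rto, rpo in STRATEGIES:
        if rto <= rto_min and rpo <= rpo_min:
            return name
    # Nothing cheaper fits, so fall back to the strongest option.
    return "Multi-Site"


print(choose_strategy(rto_min=48 * 60, rpo_min=24 * 60))  # Backup & Restore
print(choose_strategy(rto_min=120, rpo_min=15))           # Pilot Light
print(choose_strategy(rto_min=15, rpo_min=5))             # Warm Standby
print(choose_strategy(rto_min=1, rpo_min=1))              # Multi-Site
```

Ordering by cost first mirrors the exam logic: pick the least expensive option that still satisfies the stated requirements.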

6. AWS Services Used in DR (Exam Focus)

Data Replication

S3 Cross-Region Replication (CRR)
DynamoDB Global Tables
Aurora Global Database
RDS read replicas (cross-Region)

Backup Services

AWS Backup
EBS Snapshots
S3 Glacier

Traffic Routing & Failover

Route 53
- Failover routing
- Health checks
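
Failover routing pairs a PRIMARY record that has a health check with a SECONDARY record that takes over when the check fails. The sketch below builds the change batch for such a pair as a plain dictionary, in the shape the Route 53 `ChangeResourceRecordSets` API expects. The function name, record name, IP addresses, and health check ID are placeholders for illustration.

```python
def failover_change_batch(name, primary_ip, secondary_ip, health_check_id, ttl=60):
    """Build a ChangeBatch with PRIMARY and SECONDARY failover A records."""

    def record(set_id, role, ip, extra=None):
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,   # distinguishes records sharing a name
            "Failover": role,          # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                # short TTL so clients re-resolve quickly
            "ResourceRecords": [{"Value": ip}],
        }
        rrset.update(extra or {})
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "DR failover records",
        "Changes": [
            # Only the primary carries the health check in this sketch.
            record("primary", "PRIMARY", primary_ip,
                   {"HealthCheckId": health_check_id}),
            record("secondary", "SECONDARY", secondary_ip),
        ],
    }


batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10", "hypothetical-check-id"
)
print(len(batch["Changes"]))  # 2
```

With boto3, a dictionary like this would be passed as the `ChangeBatch` argument of `change_resource_record_sets`, together with the hosted zone ID.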

Compute Recovery

EC2 AMIs
Auto Scaling Groups
CloudFormation (infrastructure automation)

7. Important Exam Scenarios

Scenario 1:

“Restoring the system after several hours is acceptable”
👉 Answer: Backup & Restore

Scenario 2:

“Keep database running, start app during failure”
👉 Answer: Pilot Light

Scenario 3:

“System must recover within minutes”
👉 Answer: Warm Standby

Scenario 4:

“No downtime allowed, global users”
👉 Answer: Multi-Site

8. Best Practices (Exam Must-Know)

Always define RTO and RPO first
Automate recovery using:
- CloudFormation
- Auto Scaling
Use Multi-AZ for high availability (it is not a DR strategy on its own)
Use multiple Regions for disaster recovery
Regularly test DR strategy
Encrypt backups and replicate securely

9. Common Mistakes (Exam Traps)

❌ Confusing High Availability vs Disaster Recovery

HA = within the same Region (Multi-AZ)
DR = across Regions

❌ Choosing expensive solution unnecessarily

Always match the stated requirement, not the maximum possible performance

❌ Ignoring RPO/RTO

Most questions are based on these

10. Final Summary

DR ensures systems recover after failure
4 key strategies:
1. Backup & Restore (cheapest, slowest)
2. Pilot Light (partial running)
3. Warm Standby (scaled-down full system)
4. Multi-Site (fully active, fastest)

👉 Golden Rule for Exam:

The correct DR strategy is the one that meets RTO, RPO, and cost requirements — not the most advanced one