Task Statement 2.2: Design highly available and/or fault-tolerant architectures.

📘 AWS Certified Solutions Architect – Associate (SAA-C03)

1. What is Failover?

Failover is the process of automatically switching to a backup system when the primary system fails.

In AWS terms, this usually involves:

Switching traffic from one EC2 instance, Availability Zone (AZ), or Region to another healthy instance, AZ, or Region

2. Key Concepts You MUST Understand

2.1 High Availability vs Fault Tolerance

High Availability (HA)
- System continues running with minimal downtime
- Some disruption may occur
- Example: switching to another AZ after failure
Fault Tolerance (FT)
- No interruption at all
- System keeps running even if components fail
- Example: active-active setup across multiple AZs

👉 Exam Tip:

HA = quick recovery
Fault tolerant = no downtime
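
To make the effect of redundancy concrete, here is a minimal sketch of the standard parallel-availability formula. It assumes each copy (for example, one deployment per AZ) fails independently, which real systems only approximate. The function names and the 99% figure are illustrative, not AWS numbers.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components.

    Assumes failures are independent: the system is down only
    when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    one_az = 0.99  # illustrative value only
    two_az = combined_availability(one_az, 2)
    print(f"1 copy:   {downtime_hours_per_year(one_az):.2f} h/year")
    print(f"2 copies: {downtime_hours_per_year(two_az):.2f} h/year")
```

Under these assumptions, adding a second independent copy cuts expected downtime sharply, which is why both HA and fault-tolerant designs start with redundancy across AZs.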

2.2 Failure Detection

Before failover happens, AWS must detect failure using:

Health checks (e.g., from Elastic Load Balancer (ELB))
DNS-level health checks (e.g., Amazon Route 53 health checks)
Monitoring (e.g., Amazon CloudWatch alarms)

👉 If the system cannot detect failure, failover will NOT happen.
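
Health checks usually do not react to a single failed probe; they wait for several consecutive failures before marking a target unhealthy. The sketch below models that idea in plain Python. The class name and the threshold defaults are hypothetical and are not taken from any specific AWS service.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks target health from a stream of pass/fail probe results."""

    failure_threshold: int = 3  # consecutive failures before "unhealthy"
    success_threshold: int = 2  # consecutive successes before "healthy" again
    healthy: bool = True
    _fail_streak: int = 0
    _ok_streak: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one probe result and return the current health state."""
        if check_passed:
            self._ok_streak += 1
            self._fail_streak = 0
            if not self.healthy and self._ok_streak >= self.success_threshold:
                self.healthy = True
        else:
            self._fail_streak += 1
            self._ok_streak = 0
            if self.healthy and self._fail_streak >= self.failure_threshold:
                self.healthy = False
        return self.healthy


if __name__ == "__main__":
    tracker = HealthTracker()
    for result in [False, False, False, True, True]:
        print(result, "->", tracker.record(result))
```

Higher thresholds reduce false alarms but delay detection, which adds directly to recovery time.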

3. Types of Failover Strategies in AWS

3.1 Active-Passive Failover

How it works:

One system is active (serving traffic)
Another system is passive (standby, not serving traffic)
Passive system takes over only when active fails

AWS Services Used:

Amazon Route 53
Elastic Load Balancer (ELB)
Amazon RDS Multi-AZ

Example (AWS Architecture):

Primary EC2 instances in us-east-1a
Standby EC2 instances in us-east-1b
Route 53 routes traffic to primary
If failure occurs → switches to standby
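
The routing decision in this example can be sketched as a small function. This is a simplified model, not Route 53's actual algorithm; the endpoint names and the health dictionary are hypothetical inputs.

```python
def pick_active_passive(primary: str, standby: str, health: dict[str, bool]) -> str:
    """Return the endpoint that should receive traffic.

    The standby is used only when the primary is reported unhealthy.
    Endpoints missing from `health` are treated as unhealthy.
    """
    if health.get(primary, False):
        return primary
    if health.get(standby, False):
        return standby
    # Simplification: this sketch gives up when nothing is healthy.
    raise RuntimeError("no healthy endpoint available")


if __name__ == "__main__":
    status = {"us-east-1a": False, "us-east-1b": True}
    print(pick_active_passive("us-east-1a", "us-east-1b", status))
```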

Advantages:

Simple to design
Lower cost than active-active

Disadvantages:

Standby resources are idle (wasted capacity)

3.2 Active-Active Failover

How it works:

Multiple systems are active at the same time
Traffic is distributed across all systems
If one fails, others continue serving traffic

AWS Services Used:

Elastic Load Balancer
Amazon Route 53 (Latency-based or Weighted routing)
Multi-region deployments

Example:

Application running in:
- Region A
- Region B
Route 53 distributes traffic based on:
- Latency
- Health checks
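
As a rough illustration of active-active behavior, the sketch below splits requests across endpoints by weight and skips any endpoint that is unhealthy. It is a toy model of weighted distribution, not an AWS API; the Region names and weights are hypothetical.

```python
def distribute(requests: int, weights: dict[str, int], health: dict[str, bool]) -> dict[str, int]:
    """Split `requests` across healthy endpoints in proportion to their weights."""
    live = {name: w for name, w in weights.items() if health.get(name, False) and w > 0}
    if not live:
        raise RuntimeError("no healthy endpoint available")
    total = sum(live.values())
    shares = {name: requests * w // total for name, w in live.items()}
    # Integer division can leave a few requests unassigned;
    # hand them out one each, in name order.
    leftover = requests - sum(shares.values())
    for name in sorted(live)[:leftover]:
        shares[name] += 1
    return shares


if __name__ == "__main__":
    weights = {"region-a": 50, "region-b": 50}
    print(distribute(100, weights, {"region-a": True, "region-b": True}))
    print(distribute(100, weights, {"region-a": False, "region-b": True}))
```

When one Region fails, the surviving Region absorbs all traffic, so each side needs enough capacity (or fast scaling) to carry the full load.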

Advantages:

Little to no downtime when one site fails
High performance
Efficient resource usage

Disadvantages:

More complex
Higher cost

3.3 Pilot Light Failover

How it works:

Core components are always running in a secondary region
Minimal infrastructure is kept active
Full system is started only when needed

Example:

Database replication is active
Application servers are not running in standby region
When failure occurs → spin up application servers

AWS Services Used:

Amazon RDS cross-region read replica
Amazon EC2 (on-demand launch)
Amazon Machine Images (AMIs)

Advantages:

Lower cost than warm standby
Faster recovery than backup/restore

Disadvantages:

Requires scaling up during failover

3.4 Warm Standby Failover

How it works:

A scaled-down version of the system runs in another region
It is always running but at lower capacity
Quickly scaled up when failure occurs

Example:

20% capacity running in secondary region
100% capacity in primary region
Failover → scale secondary to full capacity
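
The capacity arithmetic in this example can be sketched as follows. The function names are hypothetical, and the rounding rule (round up, keep at least one instance) is an assumption made for illustration.

```python
def standby_size(primary_instances: int, standby_percent: int) -> int:
    """Instances kept running in the standby Region (rounded up, at least 1)."""
    if primary_instances < 1:
        raise ValueError("primary_instances must be at least 1")
    if not 0 < standby_percent <= 100:
        raise ValueError("standby_percent must be in 1..100")
    # Ceiling division using integers only.
    return max(1, -((-primary_instances * standby_percent) // 100))


def scale_out_on_failover(primary_instances: int, standby_percent: int) -> int:
    """Extra instances the standby must launch to reach full capacity."""
    return primary_instances - standby_size(primary_instances, standby_percent)


if __name__ == "__main__":
    print(standby_size(10, 20))           # instances running before failover
    print(scale_out_on_failover(10, 20))  # instances to add during failover
```

In practice this step would correspond to raising the desired capacity of an Auto Scaling group in the secondary Region.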

AWS Services Used:

Auto Scaling Groups
Elastic Load Balancer
CloudFormation

Advantages:

Faster failover than pilot light
Lower cost than active-active

Disadvantages:

Still some cost due to standby resources

3.5 Backup and Restore (Cold Standby)

How it works:

No standby system is running
Only backups exist
System is rebuilt from backups after failure

Example:

Data stored in Amazon S3
After failure → restore to new EC2 instances

AWS Services Used:

Amazon S3
AWS Backup
Amazon Machine Images (AMI)

Advantages:

Cheapest option

Disadvantages:

Slow recovery time (high downtime)

4. AWS Services That Support Failover

4.1 Amazon Route 53 Failover Routing

Routes traffic to primary resource
If health check fails → routes to secondary

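👉 Routing policies commonly used for failover:

Failover routing (primary/secondary, active-passive)
Weighted routing (active-active when combined with health checks)
Latency-based routing (active-active when combined with health checks)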

4.2 Elastic Load Balancer (ELB)

Distributes traffic across multiple targets
Automatically removes unhealthy instances
Supports failover across AZs within a single Region (cross-Region failover needs Route 53 or Global Accelerator)

4.3 Amazon RDS Failover

Multi-AZ deployment
Automatically fails over to the standby DB instance in another AZ if the primary fails; the database endpoint name stays the same

4.4 Auto Scaling Groups

Replaces failed EC2 instances automatically
Ensures minimum capacity is maintained
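
The replace-and-maintain behavior can be pictured as a reconciliation loop: drop unhealthy instances, then launch replacements until the desired count is met. The sketch below is a toy model with hypothetical instance IDs, not the Auto Scaling API.

```python
from itertools import count


def reconcile(instances: dict[str, bool], desired: int) -> dict[str, bool]:
    """Return a new fleet with unhealthy instances replaced.

    `instances` maps instance ID -> healthy flag. Replacement IDs are
    made up for illustration ("new-1", "new-2", ...).
    """
    fleet = {iid: True for iid, ok in instances.items() if ok}
    new_ids = (f"new-{n}" for n in count(1))
    while len(fleet) < desired:
        fleet[next(new_ids)] = True
    return fleet


if __name__ == "__main__":
    current = {"i-a": True, "i-b": False, "i-c": True}
    print(reconcile(current, desired=3))
```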

4.5 AWS Global Accelerator

Provides static anycast IP addresses as a fixed entry point
Routes traffic to healthy endpoints globally
Supports fast failover between regions

5. Recovery Objectives (Important for Exam)

Recovery Time Objective (RTO)

Maximum acceptable downtime
The chosen failover strategy largely determines the achievable RTO

Recovery Point Objective (RPO)

Maximum acceptable data loss
Depends on backup frequency and replication lag

👉 Exam Tip:

Active-active → low RTO, low RPO
Backup/restore → high RTO, high RPO
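
A worked example helps keep the two objectives apart: RPO looks backward from the failure to the last recovery point, while RTO looks forward from the failure to restored service. The sketch below uses made-up timestamps and hypothetical function names.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, failure: datetime) -> timedelta:
    """Data written in this window is lost (compare against RPO)."""
    return failure - last_recovery_point


def downtime(failure: datetime, restored: datetime) -> timedelta:
    """Time the service was unavailable (compare against RTO)."""
    return restored - failure


def meets_objectives(last_recovery_point: datetime, failure: datetime,
                     restored: datetime, rto: timedelta, rpo: timedelta) -> bool:
    """True when both the downtime and the data-loss window are within target."""
    return (downtime(failure, restored) <= rto
            and data_loss_window(last_recovery_point, failure) <= rpo)


if __name__ == "__main__":
    last_backup = datetime(2024, 1, 1, 11, 30)
    failure = datetime(2024, 1, 1, 12, 0)
    restored = datetime(2024, 1, 1, 12, 45)
    print("data loss:", data_loss_window(last_backup, failure))
    print("downtime: ", downtime(failure, restored))
    print(meets_objectives(last_backup, failure, restored,
                           rto=timedelta(hours=1), rpo=timedelta(hours=1)))
```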

6. Common Exam Scenarios

Scenario 1:

“Minimize downtime with automatic failover across AZs”

✔ Use:

Multi-AZ deployment behind an Elastic Load Balancer
Route 53 failover routing (when failing over between separate endpoints)

Scenario 2:

“Highly available system with no downtime”

✔ Use:

Active-active architecture
Multi-region + load balancing

Scenario 3:

“Low-cost disaster recovery with acceptable downtime”

✔ Use:

Backup and restore

Scenario 4:

“Fast failover with minimal cost”

✔ Use:

Warm standby

Scenario 5:

“Switch traffic if health check fails”

✔ Use:

Route 53 health checks
ELB health checks

7. Best Practices for Failover (Exam Focus)

Deploy across multiple Availability Zones
Use Route 53 health checks
Use Elastic Load Balancer
Enable Auto Scaling
Use Multi-AZ databases
Replicate data across regions
Design for automation, not manual failover
Monitor using CloudWatch

8. Key Differences to Remember

Strategy	Cost	RTO	RPO	Complexity
Active-Active	High	Very Low	Very Low	High
Active-Passive	Medium	Low	Low	Medium
Warm Standby	Medium	Medium	Medium	Medium
Pilot Light	Low	Medium	Medium	Medium
Backup/Restore	Very Low	High	High	Low
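
One way to practice with this table is to treat it as a lookup: pick the cheapest strategy whose RTO and RPO are no worse than what the scenario allows. The sketch below transcribes the table's qualitative levels and ranks them. The ranking scheme and function name are assumptions for illustration, not an official selection method.

```python
# Ordinal ranking of the qualitative levels used in the table.
LEVELS = {"Very Low": 0, "Low": 1, "Medium": 2, "High": 3}

# (strategy, cost, RTO, RPO), transcribed from the comparison table.
STRATEGIES = [
    ("Active-Active", "High", "Very Low", "Very Low"),
    ("Active-Passive", "Medium", "Low", "Low"),
    ("Warm Standby", "Medium", "Medium", "Medium"),
    ("Pilot Light", "Low", "Medium", "Medium"),
    ("Backup/Restore", "Very Low", "High", "High"),
]


def cheapest_strategy(max_rto: str, max_rpo: str) -> str:
    """Return the lowest-cost strategy that satisfies both limits.

    Ties on cost are broken alphabetically by strategy name.
    """
    candidates = [
        (LEVELS[cost], name)
        for name, cost, rto, rpo in STRATEGIES
        if LEVELS[rto] <= LEVELS[max_rto] and LEVELS[rpo] <= LEVELS[max_rpo]
    ]
    return min(candidates)[1]


if __name__ == "__main__":
    print(cheapest_strategy("High", "High"))
    print(cheapest_strategy("Medium", "Medium"))
    print(cheapest_strategy("Very Low", "Very Low"))
```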

Final Exam Tips

Always think:
👉 “How quickly should the system recover?” (RTO)
👉 “How much data loss is acceptable?” (RPO)
If question mentions:
- Automatic failover → Route 53 / Multi-AZ
- No downtime → Active-active
- Cost optimization → Backup/restore or pilot light
- Fast recovery → Warm standby