Task Statement 2.2: Design highly available and/or fault-tolerant architectures.
📘 AWS Certified Solutions Architect – Associate (SAA-C03)
1. What is Failover?
Failover is the process of automatically switching to a backup system when the primary system fails.
In AWS terms, this usually means switching traffic from a failed EC2 instance, Availability Zone (AZ), or Region to a healthy one.
2. Key Concepts You MUST Understand
2.1 High Availability vs Fault Tolerance
- High Availability (HA)
- System continues running with minimal downtime
- Some disruption may occur
- Example: switching to another AZ after failure
- Fault Tolerance (FT)
- No interruption at all
- System keeps running even if components fail
- Example: active-active setup across multiple AZs
👉 Exam Tip:
- HA = quick recovery
- Fault tolerant = no downtime
2.2 Failure Detection
Before failover can happen, the failure must be detected using:
- Health checks (e.g., from Elastic Load Balancer (ELB))
- DNS-level health checks (e.g., Amazon Route 53 health checks)
- Monitoring (e.g., Amazon CloudWatch alarms)
👉 If the system cannot detect failure, failover will NOT happen.
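The detection step can be sketched as a small state machine. This is a minimal illustration, not an AWS API: the class name `HealthTracker` and the threshold defaults are assumptions, loosely modelled on how ELB and Route 53 health checks wait for several consecutive failed probes before treating a target as unhealthy.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks probe results for one target (hypothetical helper, not an AWS API)."""
    unhealthy_threshold: int = 3   # assumed: consecutive failures before "unhealthy"
    healthy_threshold: int = 2     # assumed: consecutive successes before "healthy"
    healthy: bool = True
    _streak: int = 0               # consecutive probes that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            # Probe agrees with the current state: reset the contradiction streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= threshold:
            # Enough consecutive contradicting probes: flip the state.
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

Requiring consecutive failures keeps a single dropped probe from triggering an unnecessary failover, at the price of slower detection.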
3. Types of Failover Strategies in AWS
3.1 Active-Passive Failover
How it works:
- One system is active (serving traffic)
- Another system is passive (standby, not serving traffic)
- Passive system takes over only when active fails
AWS Services Used:
- Amazon Route 53
- Elastic Load Balancer (ELB)
- Amazon RDS Multi-AZ
Example (AWS Architecture):
- Primary EC2 instances in us-east-1a
- Standby EC2 instances in us-east-1b
- Route 53 routes traffic to primary
- If failure occurs → switches to standby
Advantages:
- Simple to design
- Lower cost than active-active
Disadvantages:
- Standby resources are idle (wasted capacity)
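The active-passive decision described above can be sketched in a few lines. The function name `choose_endpoint` and the string labels are hypothetical; real services apply their own rules when every endpoint is unhealthy, so the last branch is a simplification.

```python
from typing import Optional


def choose_endpoint(primary_healthy: bool, standby_healthy: bool) -> Optional[str]:
    """Return which endpoint should receive traffic in an active-passive pair."""
    if primary_healthy:
        return "primary"   # normal operation: the standby stays idle
    if standby_healthy:
        return "standby"   # failover: the primary is down, the standby takes over
    return None            # both down: nothing healthy to route to (simplified)
```

Note that the standby receives traffic only when the primary is unhealthy, which is why its capacity sits idle the rest of the time.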
3.2 Active-Active Failover
How it works:
- Multiple systems are active at the same time
- Traffic is distributed across all systems
- If one fails, others continue serving traffic
AWS Services Used:
- Elastic Load Balancer
- Amazon Route 53 (Latency-based or Weighted routing)
- Multi-region deployments
Example:
- Application running in:
- Region A
- Region B
- Route 53 distributes traffic based on:
- Latency
- Health checks
Advantages:
- No downtime
- High performance
- Efficient resource usage
Disadvantages:
- More complex
- Higher cost
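A rough way to picture active-active behaviour is to compute how traffic is shared among the endpoints that are still healthy. The sketch below assumes simple proportional weights; the function name `traffic_shares` and the endpoint names are illustrative and do not correspond to any AWS API.

```python
from typing import Dict, Set


def traffic_shares(weights: Dict[str, int], healthy: Set[str]) -> Dict[str, float]:
    """Split traffic across healthy endpoints in proportion to their weights."""
    # Keep only endpoints that are healthy and have a positive weight.
    live = {name: w for name, w in weights.items() if name in healthy and w > 0}
    total = sum(live.values())
    if total == 0:
        return {}  # no healthy endpoint left to serve traffic
    return {name: w / total for name, w in live.items()}
```

With two equally weighted Regions, each takes half the traffic; if one becomes unhealthy, the other absorbs all of it, so each side needs enough headroom to carry the full load.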
3.3 Pilot Light Failover
How it works:
- Core components are always running in a secondary region
- Minimal infrastructure is kept active
- Full system is started only when needed
Example:
- Database replication is active
- Application servers are not running in standby region
- When failure occurs → spin up application servers
AWS Services Used:
- Amazon RDS cross-region read replica
- Amazon EC2 (on-demand launch)
- Amazon Machine Images (AMIs)
Advantages:
- Lower cost than warm standby
- Faster recovery than backup/restore
Disadvantages:
- Requires scaling up during failover
3.4 Warm Standby Failover
How it works:
- A scaled-down version of the system runs in another region
- It is always running but at lower capacity
- Quickly scaled up when failure occurs
Example:
- 20% capacity running in secondary region
- 100% capacity in primary region
- Failover → scale secondary to full capacity
AWS Services Used:
- Auto Scaling Groups
- Elastic Load Balancer
- CloudFormation
Advantages:
- Faster failover than pilot light
- Lower cost than active-active
Disadvantages:
- Still some cost due to standby resources
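The scale-up step in a warm standby failover comes down to simple arithmetic. The sketch below is illustrative only: `standby_scale_up` is a hypothetical helper, and the 20% figure in the usage note is taken from the example above.

```python
from typing import Tuple


def standby_scale_up(primary_instances: int, standby_percent: int) -> Tuple[int, int]:
    """Return (instances already running in standby, instances to add on failover)."""
    # Round up using integer math, and keep at least one instance running.
    running = max(1, (primary_instances * standby_percent + 99) // 100)
    to_add = max(0, primary_instances - running)
    return running, to_add
```

For a primary fleet of 10 instances with a 20% standby, `standby_scale_up(10, 20)` returns `(2, 8)`: two instances are already serving, and eight more must be launched before the secondary Region can carry full load.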
3.5 Backup and Restore (Cold Standby)
How it works:
- No standby system is running
- Only backups exist
- System is rebuilt from backups after failure
Example:
- Data stored in Amazon S3
- After failure → restore to new EC2 instances
AWS Services Used:
- Amazon S3
- AWS Backup
- Amazon Machine Images (AMI)
Advantages:
- Cheapest option
Disadvantages:
- Slow recovery time (high downtime)
4. AWS Services That Support Failover
4.1 Amazon Route 53 Failover Routing
- Routes traffic to primary resource
- If health check fails → routes to secondary
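👉 Routing policies that work with health checks:
- Failover routing (primary/secondary, active-passive)
- Weighted routing (can be used for active-active)
- Latency-based routing (can be used for active-active)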
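A primary/secondary failover pair can be expressed as two record sets that share a name and differ in their failover role. The sketch below only builds the dictionary in the shape expected by the `ChangeBatch` parameter of boto3's `change_resource_record_sets`; it makes no AWS call. The helper name, domain, IP addresses, and health check ID are placeholders, not values from a real account.

```python
from typing import Any, Dict, Optional


def failover_change_batch(name: str, primary_ip: str, secondary_ip: str,
                          primary_health_check_id: str, ttl: int = 60) -> Dict[str, Any]:
    """Build a ChangeBatch with a PRIMARY and a SECONDARY failover A record."""

    def record(role: str, ip: str, health_check_id: Optional[str] = None) -> Dict[str, Any]:
        rrset: Dict[str, Any] = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-record",  # must be unique per record set
            "Failover": role,                           # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                                 # a short TTL speeds up client switchover
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            # The primary needs a health check so its failure can be detected.
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {"Changes": [
        record("PRIMARY", primary_ip, primary_health_check_id),
        record("SECONDARY", secondary_ip),
    ]}
```

A call such as `failover_change_batch("app.example.com.", "192.0.2.10", "192.0.2.20", "hc-placeholder")` yields a batch that could then be passed to Route 53 together with a hosted zone ID.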
4.2 Elastic Load Balancer (ELB)
- Distributes traffic across multiple targets
- Automatically removes unhealthy instances
- Supports failover across AZs within a single Region (not across Regions)
4.3 Amazon RDS Failover
- Multi-AZ deployment
- Automatically switches to standby DB if primary fails
4.4 Auto Scaling Groups
- Replaces failed EC2 instances automatically
- Ensures minimum capacity is maintained
4.5 AWS Global Accelerator
- Provides static IPs
- Routes traffic to healthy endpoints globally
- Supports fast failover between regions
5. Recovery Objectives (Important for Exam)
Recovery Time Objective (RTO)
- Maximum acceptable downtime
- The chosen failover strategy largely determines the achievable RTO
Recovery Point Objective (RPO)
- Maximum acceptable data loss
- Depends on backup frequency and replication lag
👉 Exam Tip:
- Active-active → low RTO, low RPO
- Backup/restore → high RTO, high RPO
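The two objectives can be checked with simple arithmetic. The sketch below is illustrative: both function names are hypothetical, and the numbers in the usage note are made-up examples rather than AWS figures.

```python
def worst_case_data_loss_minutes(backup_interval_minutes: float) -> float:
    """With periodic backups, a failure just before the next backup loses one full interval."""
    return backup_interval_minutes


def meets_objectives(recovery_minutes: float, data_loss_minutes: float,
                     rto_minutes: float, rpo_minutes: float) -> bool:
    """True only if both the recovery time and the data loss fit within the targets."""
    return recovery_minutes <= rto_minutes and data_loss_minutes <= rpo_minutes
```

For example, nightly backups give a worst-case data loss of 1440 minutes, so a backup/restore design with a 4-hour rebuild fails an RTO of 60 minutes and an RPO of 15 minutes: `meets_objectives(240, worst_case_data_loss_minutes(1440), 60, 15)` is `False`.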
6. Common Exam Scenarios
Scenario 1:
“Minimize downtime with automatic failover across AZs”
✔ Use:
- Multi-AZ deployment behind an Elastic Load Balancer
- Route 53 failover routing (when failing over between separate endpoints)
Scenario 2:
“Highly available system with no downtime”
✔ Use:
- Active-active architecture
- Multi-region + load balancing
Scenario 3:
“Low-cost disaster recovery with acceptable downtime”
✔ Use:
- Backup and restore
Scenario 4:
“Fast failover with minimal cost”
✔ Use:
- Warm standby
Scenario 5:
“Switch traffic if health check fails”
✔ Use:
- Route 53 health checks
- ELB health checks
7. Best Practices for Failover (Exam Focus)
- Deploy across multiple Availability Zones
- Use Route 53 health checks
- Use Elastic Load Balancer
- Enable Auto Scaling
- Use Multi-AZ databases
- Replicate data across regions
- Design for automation, not manual failover
- Monitor using CloudWatch
8. Key Differences to Remember
| Strategy | Cost | RTO | RPO | Complexity |
|---|---|---|---|---|
| Active-Active | High | Very Low | Very Low | High |
| Active-Passive | Medium | Low | Low | Medium |
| Warm Standby | Medium | Low to Medium | Medium | Medium |
| Pilot Light | Low | Medium | Medium | Medium |
| Backup/Restore | Very Low | High | High | Low |
Final Exam Tips
- Always think:
👉 “How quickly should the system recover?” (RTO)
👉 “How much data loss is acceptable?” (RPO)
- If the question mentions:
- Automatic failover → Route 53 / Multi-AZ
- No downtime → Active-active
- Cost optimization → Backup/restore or pilot light
- Fast recovery → Warm standby
