Failover strategies

Task Statement 2.2: Design highly available and/or fault-tolerant architectures.

📘 AWS Certified Solutions Architect – Associate (SAA-C03)


1. What is Failover?

Failover is the process of automatically switching to a backup system when the primary system fails.

In AWS terms, this usually involves:

  • Switching traffic away from a failed EC2 instance, Availability Zone (AZ), or Region
  • Redirecting it to another healthy instance, AZ, or Region

2. Key Concepts You MUST Understand

2.1 High Availability vs Fault Tolerance

  • High Availability (HA)
    • System continues running with minimal downtime
    • Some disruption may occur
    • Example: switching to another AZ after failure
  • Fault Tolerance (FT)
    • No interruption at all
    • System keeps running even if components fail
    • Example: active-active setup across multiple AZs

👉 Exam Tip:

  • HA = quick recovery
  • Fault tolerant = no downtime

2.2 Failure Detection

Before failover happens, AWS must detect failure using:

  • Health checks (e.g., from Elastic Load Balancer (ELB))
  • DNS-level health checks (e.g., Amazon Route 53 health checks)
  • Monitoring (e.g., Amazon CloudWatch alarms)

👉 If the system cannot detect failure, failover will NOT happen.
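To make failure detection concrete, here is a minimal sketch of how a health checker might decide that a target is unhealthy only after several failed checks in a row. The class name `HealthTracker` and the threshold values are assumptions for illustration; this is not the actual ELB or Route 53 implementation.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks consecutive check results for one target (hypothetical helper)."""

    unhealthy_threshold: int = 3  # assumed: failures in a row before marking unhealthy
    healthy_threshold: int = 2    # assumed: successes in a row before marking healthy again
    healthy: bool = True
    _streak: int = 0              # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the current health state."""
        if check_passed == self.healthy:
            # The result agrees with the current state, so the streak resets.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.healthy_threshold if check_passed else self.unhealthy_threshold
        if self._streak >= threshold:
            # Enough contradicting results in a row: flip the state.
            self.healthy = check_passed
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
results = [tracker.record(passed) for passed in (False, False, False, True, True)]
```

Using thresholds instead of reacting to a single failed check avoids failing over because of one slow or dropped request.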


3. Types of Failover Strategies in AWS


3.1 Active-Passive Failover

How it works:

  • One system is active (serving traffic)
  • Another system is passive (standby, not serving traffic)
  • Passive system takes over only when active fails

AWS Services Used:

  • Amazon Route 53
  • Elastic Load Balancer (ELB)
  • Amazon RDS Multi-AZ

Example (AWS Architecture):

  • Primary EC2 instances in us-east-1a
  • Standby EC2 instances in us-east-1b
  • Route 53 routes traffic to primary
  • If failure occurs → switches to standby
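The routing decision in this example can be sketched in a few lines. The function name `pick_active_passive` and the endpoint labels are hypothetical, and the logic is a simplified illustration of the active-passive idea, not Route 53's actual algorithm.

```python
from typing import Dict


def pick_active_passive(primary: str, standby: str, health: Dict[str, bool]) -> str:
    """Return the endpoint that should receive traffic in an active-passive setup."""
    if health.get(primary, False):
        return primary  # normal operation: all traffic goes to the primary
    if health.get(standby, False):
        return standby  # failover: the standby takes over
    raise RuntimeError("no healthy endpoint available")


# Hypothetical endpoint labels matching the example above.
normal = pick_active_passive(
    "us-east-1a", "us-east-1b", {"us-east-1a": True, "us-east-1b": True}
)
failed_over = pick_active_passive(
    "us-east-1a", "us-east-1b", {"us-east-1a": False, "us-east-1b": True}
)
```

Note that the standby receives no traffic at all while the primary is healthy, which is exactly what separates active-passive from active-active.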

Advantages:

  • Simple to design
  • Lower cost than active-active

Disadvantages:

  • Standby resources are idle (wasted capacity)

3.2 Active-Active Failover

How it works:

  • Multiple systems are active at the same time
  • Traffic is distributed across all systems
  • If one fails, others continue serving traffic

AWS Services Used:

  • Elastic Load Balancer
  • Amazon Route 53 (Latency-based or Weighted routing)
  • Multi-region deployments

Example:

  • Application running in:
    • Region A
    • Region B
  • Route 53 distributes traffic based on:
    • Latency
    • Health checks
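As a rough illustration of latency-based routing combined with health checks, the sketch below picks the healthy Region with the lowest measured latency. The function name, Region labels, and latency figures are made up for the example; real latency-based routing uses AWS's own latency measurements.

```python
from typing import Dict


def pick_lowest_latency(latency_ms: Dict[str, float], health: Dict[str, bool]) -> str:
    """Return the healthy Region with the lowest latency for a given user."""
    healthy = {region: ms for region, ms in latency_ms.items() if health.get(region, False)}
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=healthy.get)


# Illustrative latencies seen by one user (assumed values).
latencies = {"region-a": 40.0, "region-b": 95.0}

both_up = pick_lowest_latency(latencies, {"region-a": True, "region-b": True})
a_down = pick_lowest_latency(latencies, {"region-a": False, "region-b": True})
```

Because both Regions already serve traffic, losing one only removes it from the candidate set; nothing has to be started or scaled before users are redirected.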

Advantages:

  • Little to no downtime when one system fails
  • High performance
  • Efficient resource usage

Disadvantages:

  • More complex
  • Higher cost

3.3 Pilot Light Failover

How it works:

  • Core components are always running in a secondary region
  • Minimal infrastructure is kept active
  • Full system is started only when needed

Example:

  • Database replication is active
  • Application servers are not running in standby region
  • When failure occurs → spin up application servers

AWS Services Used:

  • Amazon RDS cross-region read replica
  • Amazon EC2 (on-demand launch)
  • Amazon Machine Images (AMIs)

Advantages:

  • Lower cost than warm standby
  • Faster recovery than backup/restore

Disadvantages:

  • Requires scaling up during failover

3.4 Warm Standby Failover

How it works:

  • A scaled-down version of the system runs in another region
  • It is always running but at lower capacity
  • Quickly scaled up when failure occurs

Example:

  • 20% capacity running in secondary region
  • 100% capacity in primary region
  • Failover → scale secondary to full capacity
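The capacity arithmetic in this example can be worked through as follows. The function names and the instance count of 10 are assumptions for illustration; in practice the scale-up would typically be done by changing the desired capacity of an Auto Scaling Group.

```python
def standby_instances(full_capacity: int, standby_percent: int) -> int:
    """Instances kept running in the secondary Region (rounded up, at least 1)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, (full_capacity * standby_percent + 99) // 100)


def instances_to_add_on_failover(full_capacity: int, standby_percent: int) -> int:
    """Extra instances the secondary Region must launch to reach full capacity."""
    return full_capacity - standby_instances(full_capacity, standby_percent)


# Assumed: the primary Region needs 10 instances at full load, standby runs at 20%.
running_in_standby = standby_instances(10, 20)       # 2 instances always on
to_launch = instances_to_add_on_failover(10, 20)     # 8 instances added at failover
```

The gap between the two numbers is what drives the RTO of warm standby: the more capacity that must be launched during failover, the longer full recovery takes.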

AWS Services Used:

  • Auto Scaling Groups
  • Elastic Load Balancer
  • CloudFormation

Advantages:

  • Faster failover than pilot light
  • Lower cost than active-active

Disadvantages:

  • Still some cost due to standby resources

3.5 Backup and Restore (Cold Standby)

How it works:

  • No standby system is running
  • Only backups exist
  • System is rebuilt from backups after failure

Example:

  • Data stored in Amazon S3
  • After failure → restore to new EC2 instances

AWS Services Used:

  • Amazon S3
  • AWS Backup
  • Amazon Machine Images (AMI)

Advantages:

  • Cheapest option

Disadvantages:

  • Slow recovery time (high downtime)

4. AWS Services That Support Failover


4.1 Amazon Route 53 Failover Routing

  • Routes traffic to primary resource
  • If health check fails → routes to secondary

👉 Routing policies commonly used for failover:

  • Failover routing (primary/secondary, active-passive)
  • Weighted routing (active-active)
  • Latency-based routing (active-active)
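For reference, a primary/secondary pair of failover records can be described as a change batch. The sketch below only builds the dictionary; it follows the request shape that boto3's Route 53 `change_resource_record_sets` call accepts as `ChangeBatch`, to the best of my understanding. The domain name, IP addresses, set identifiers, and health check ID are placeholders, and `failover_record` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def failover_record(
    name: str,
    ip_address: str,
    role: str,
    set_identifier: str,
    health_check_id: Optional[str] = None,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build one UPSERT change for a failover A record (hypothetical helper)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_identifier,  # distinguishes records that share a name
        "Failover": role,
        "TTL": ttl,                       # a short TTL lets clients pick up the switch sooner
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Placeholder values: documentation-range IPs and a made-up health check ID.
change_batch = {
    "Changes": [
        failover_record(
            "app.example.com", "192.0.2.10", "PRIMARY", "primary",
            health_check_id="hypothetical-health-check-id",
        ),
        failover_record("app.example.com", "198.51.100.10", "SECONDARY", "secondary"),
    ]
}
```

With real values, this dictionary would be passed as the `ChangeBatch` argument together with a `HostedZoneId`. The primary record carries the health check, since that check is what triggers the switch to the secondary.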

4.2 Elastic Load Balancer (ELB)

  • Distributes traffic across multiple targets
  • Automatically removes unhealthy instances
  • Supports failover within and across AZs

4.3 Amazon RDS Failover

  • Multi-AZ deployment
  • Automatically switches to standby DB if primary fails

4.4 Auto Scaling Groups

  • Replaces failed EC2 instances automatically
  • Ensures minimum capacity is maintained

4.5 AWS Global Accelerator

  • Provides static IPs
  • Routes traffic to healthy endpoints globally
  • Supports fast failover between regions

5. Recovery Objectives (Important for Exam)

Recovery Time Objective (RTO)

  • Maximum acceptable downtime
  • Largely determined by the failover strategy you choose

Recovery Point Objective (RPO)

  • Maximum acceptable data loss
  • Depends on backup frequency and replication lag

👉 Exam Tip:

  • Active-active → low RTO, low RPO
  • Backup/restore → high RTO, high RPO
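A small worked example shows how RTO and RPO turn into a pass/fail check for a design. All durations below are illustrative assumptions, not AWS figures: with nightly backups, the worst-case data loss is roughly the time since the last backup (up to 24 hours).

```python
from datetime import timedelta


def meets_objectives(
    estimated_recovery: timedelta,
    estimated_data_loss: timedelta,
    rto: timedelta,
    rpo: timedelta,
) -> bool:
    """True if a design's estimated recovery time and data loss fit within RTO and RPO."""
    return estimated_recovery <= rto and estimated_data_loss <= rpo


# Business targets (assumed): recover within 1 hour, lose at most 15 minutes of data.
rto = timedelta(hours=1)
rpo = timedelta(minutes=15)

# Backup/restore with nightly backups (assumed estimates).
backup_restore_ok = meets_objectives(timedelta(hours=8), timedelta(hours=24), rto, rpo)

# Warm standby with continuous replication (assumed estimates).
warm_standby_ok = meets_objectives(timedelta(minutes=10), timedelta(minutes=1), rto, rpo)
```

Under these assumptions backup/restore misses both targets while warm standby meets them, which is the kind of reasoning exam questions about RTO and RPO expect.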

6. Common Exam Scenarios


Scenario 1:

“Minimize downtime with automatic failover across AZs”

✔ Use:

  • Multi-AZ deployment
  • Route 53 failover routing

Scenario 2:

“Highly available system with no downtime”

✔ Use:

  • Active-active architecture
  • Multi-region + load balancing

Scenario 3:

“Low-cost disaster recovery with acceptable downtime”

✔ Use:

  • Backup and restore

Scenario 4:

“Fast failover with minimal cost”

✔ Use:

  • Warm standby

Scenario 5:

“Switch traffic if health check fails”

✔ Use:

  • Route 53 health checks
  • ELB health checks

7. Best Practices for Failover (Exam Focus)

  • Deploy across multiple Availability Zones
  • Use Route 53 health checks
  • Use Elastic Load Balancer
  • Enable Auto Scaling
  • Use Multi-AZ databases
  • Replicate data across regions
  • Design for automation, not manual failover
  • Monitor using CloudWatch

8. Key Differences to Remember

Strategy       | Cost     | RTO      | RPO      | Complexity
Active-Active  | High     | Very Low | Very Low | High
Active-Passive | Medium   | Low      | Low      | Medium
Warm Standby   | Medium   | Medium   | Medium   | Medium
Pilot Light    | Low      | Medium   | Medium   | Medium
Backup/Restore | Very Low | High     | High     | Low
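One way to practise with this comparison is to encode it as data and ask for the cheapest strategy that still meets a required RTO and RPO rating. The sketch below is only a study aid: the ratings are copied from the comparison above, while the ordering of the rating labels and the function name `cheapest_strategy` are assumptions.

```python
# Rating labels ordered from best (lowest) to worst (highest); assumed ordering.
LEVELS = {"Very Low": 0, "Low": 1, "Medium": 2, "High": 3}

# (strategy, cost, RTO, RPO) as rated in the comparison above.
STRATEGIES = [
    ("Active-Active", "High", "Very Low", "Very Low"),
    ("Active-Passive", "Medium", "Low", "Low"),
    ("Warm Standby", "Medium", "Medium", "Medium"),
    ("Pilot Light", "Low", "Medium", "Medium"),
    ("Backup/Restore", "Very Low", "High", "High"),
]


def cheapest_strategy(max_rto: str, max_rpo: str) -> str:
    """Return the lowest-cost strategy whose RTO and RPO ratings are acceptable."""
    candidates = [
        (name, cost)
        for name, cost, rto, rpo in STRATEGIES
        if LEVELS[rto] <= LEVELS[max_rto] and LEVELS[rpo] <= LEVELS[max_rpo]
    ]
    # Active-Active is rated "Very Low" on both, so candidates is never empty.
    return min(candidates, key=lambda item: LEVELS[item[1]])[0]


relaxed = cheapest_strategy("High", "High")
moderate = cheapest_strategy("Medium", "Medium")
strict = cheapest_strategy("Very Low", "Very Low")
```

Tighter recovery requirements push the answer toward the more expensive rows, which mirrors the cost versus recovery trade-off that exam questions test.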

Final Exam Tips

  • Always think:
    👉 “How quickly should the system recover?” (RTO)
    👉 “How much data loss is acceptable?” (RPO)
  • If question mentions:
    • Automatic failover → Route 53 / Multi-AZ
    • No downtime → Active-active
    • Cost optimization → Backup/restore or pilot light
    • Fast recovery → Warm standby