Disaster recovery (DR) strategies (for example, backup and restore, pilot light, warm standby, active-active failover, recovery point objective [RPO], recovery time objective [RTO])

Task Statement 2.2: Design highly available and/or fault-tolerant architectures.

📘 AWS Certified Solutions Architect – Associate (SAA-C03)


1. What is Disaster Recovery (DR)?

Disaster Recovery is the plan and process to restore IT systems, applications, and data after a failure or outage. Failures could be due to hardware crashes, software bugs, human errors, or even natural disasters.

In AWS, DR focuses on keeping workloads available and restoring them quickly using AWS services like Amazon S3, Amazon EC2, Amazon RDS, and AWS Backup.

Two key metrics in DR:

  1. RPO (Recovery Point Objective):
    • How much data can you afford to lose?
    • Example: If RPO = 1 hour, you can lose up to 1 hour of data.
  2. RTO (Recovery Time Objective):
    • How quickly must systems be restored?
    • Example: If RTO = 4 hours, your system must be back online within 4 hours.
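
The two metrics can be checked against a single incident with simple time arithmetic. The sketch below is illustrative only: the function name and the incident timestamps are hypothetical, and the objectives reuse the example values above (RPO = 1 hour, RTO = 4 hours).

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, failure, restored, rpo, rto):
    """Return (data_loss, downtime, rpo_met, rto_met) for one incident."""
    data_loss = failure - last_backup  # changes made after the last recovery point are lost
    downtime = restored - failure      # how long the system was unavailable
    return data_loss, downtime, data_loss <= rpo, downtime <= rto


# Hypothetical incident: last backup 09:30, failure 10:15, service restored 13:00.
day = datetime(2024, 1, 1)
loss, down, rpo_ok, rto_ok = meets_objectives(
    last_backup=day.replace(hour=9, minute=30),
    failure=day.replace(hour=10, minute=15),
    restored=day.replace(hour=13),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(loss, down, rpo_ok, rto_ok)
```

Here 45 minutes of data is lost and the outage lasts 2 hours 45 minutes, so both objectives are met.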

2. Disaster Recovery Strategies

AWS defines four common DR strategies. They vary by cost, complexity, and recovery speed.

A. Backup and Restore

  • Description:
    You store backups of your data and applications in a safe location (like Amazon S3 or Amazon S3 Glacier).
    If the primary system fails, you restore from backups.
  • AWS Examples:
    • Database snapshots in Amazon RDS.
    • Object storage backups in Amazon S3 or Amazon S3 Glacier.
    • EC2 AMIs (Amazon Machine Images) for server restore.
  • RPO & RTO:
    • RPO: Hours or days, depending on backup frequency.
    • RTO: Hours to days, because you need to restore data and restart servers.
  • Pros:
    • Low cost (you pay only for storage).
  • Cons:
    • Slow recovery (not suitable for mission-critical apps).
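
A rough way to see why backup and restore lands in the "hours to days" range is to estimate both metrics from the backup schedule and the restore speed. This is a back-of-the-envelope sketch, not an AWS formula: the function, the parameter names, and the sample numbers are assumptions, and it assumes servers are provisioned first and data is restored afterwards.

```python
def backup_restore_estimate(backup_interval_h, data_gb, restore_gb_per_h, provision_h):
    """Rough worst-case RPO and estimated RTO, both in hours."""
    # Worst case: the failure happens just before the next scheduled backup.
    worst_case_rpo_h = backup_interval_h
    # Assumed sequence: rebuild servers, then reload the data.
    estimated_rto_h = provision_h + data_gb / restore_gb_per_h
    return worst_case_rpo_h, estimated_rto_h


# Hypothetical workload: daily backups, 500 GB restored at 100 GB per hour,
# plus 1 hour to launch servers from AMIs.
rpo_h, rto_h = backup_restore_estimate(
    backup_interval_h=24, data_gb=500, restore_gb_per_h=100, provision_h=1
)
print(rpo_h, rto_h)  # 24 6.0
```

Shortening the backup interval improves the RPO, but the RTO only improves with faster restores or pre-provisioned infrastructure, which is what the next strategies add.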

B. Pilot Light

  • Description:
    A minimal version of your environment runs continuously in AWS. Only critical core components are always on. During a disaster, you scale it up to full production.
  • AWS Examples:
    • Keep a small EC2 instance with key app components running.
    • Data is replicated continuously to the standby database, for example with an Amazon RDS read replica.
    • When disaster strikes, launch additional EC2 instances using Auto Scaling to reach full capacity.
  • RPO & RTO:
    • RPO: Minutes to hours (depends on replication setup).
    • RTO: Shorter than backup & restore; can be tens of minutes to a few hours.
  • Pros:
    • Cost-effective; not everything runs continuously.
  • Cons:
    • Some manual steps or automation needed to scale up.
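
The scale-up step can be automated. The sketch below assumes the recovery fleet is an EC2 Auto Scaling group and uses the boto3 Auto Scaling call `update_auto_scaling_group`; the group name, the capacity value, and the `RecordingClient` stand-in are hypothetical and exist only so the example runs without AWS credentials.

```python
def scale_up_pilot_light(autoscaling, group_name, full_capacity):
    """Raise a DR Auto Scaling group from its idle size to full production size."""
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=full_capacity,
        MaxSize=full_capacity,
        DesiredCapacity=full_capacity,
    )


class RecordingClient:
    """Stand-in for the boto3 client; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def update_auto_scaling_group(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
scale_up_pilot_light(client, "dr-web-asg", full_capacity=6)  # hypothetical group name
print(client.calls)
```

Against a real account you would pass `boto3.client("autoscaling", region_name=...)` for the recovery Region instead of the stand-in. Setting the minimum, maximum, and desired sizes to the same value is a simplification; a production setup would likely keep room for scaling policies.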

C. Warm Standby

  • Description:
    A scaled-down version of your production environment runs continuously in AWS. During a disaster, you scale up resources to full capacity.
  • AWS Examples:
    • EC2 instances running at half capacity.
    • RDS read replicas for databases.
    • ELB (Elastic Load Balancer) already configured.
  • RPO & RTO:
    • RPO: Near real-time (data is replicated continuously).
    • RTO: Short; often less than an hour.
  • Pros:
    • Faster recovery than pilot light.
    • Lower cost than fully active system.
  • Cons:
    • Costlier than pilot light.
    • Partial resources are always running.

D. Active-Active (Multi-Site / Hot Standby)

  • Description:
    Full production workloads run simultaneously in multiple AWS Regions or Availability Zones. If one site becomes unavailable, traffic can fail over automatically to the remaining sites.
  • AWS Examples:
    • Deploy your web application across multiple AWS Regions.
    • Use Amazon Route 53 for DNS failover.
    • Database replication using Amazon Aurora Global Database.
  • RPO & RTO:
    • RPO: Seconds (near-zero data loss).
    • RTO: Seconds to minutes (almost instantaneous failover).
  • Pros:
    • Fastest recovery.
    • Near-zero downtime and data loss.
  • Cons:
    • Most expensive; full environment runs continuously.
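
The routing behaviour behind an active-active setup can be modelled in a few lines. This is a simplified illustration of health-check-driven routing, not the Route 53 API: the function name is hypothetical, and it assumes traffic is split equally across healthy Regions.

```python
def traffic_shares(health):
    """Split traffic equally across Regions whose health check passes."""
    healthy = [region for region, ok in health.items() if ok]
    if not healthy:
        return {}  # nothing left to serve traffic
    share = 1 / len(healthy)
    return {region: share for region in healthy}


# Both Regions healthy: each serves half of the traffic.
print(traffic_shares({"us-east-1": True, "us-west-2": True}))
# {'us-east-1': 0.5, 'us-west-2': 0.5}

# One Region fails its health check: the other takes all traffic.
print(traffic_shares({"us-east-1": False, "us-west-2": True}))
# {'us-west-2': 1.0}
```

The second call shows why each site must be sized to absorb the full load, which is the main cost driver of this strategy.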

3. Choosing the Right Strategy

  • Non-critical systems → Backup and Restore.
  • Critical apps with some tolerance for downtime → Pilot Light.
  • Apps that must run with minimal downtime → Warm Standby.
  • Mission-critical, always-on apps → Active-Active.

Factors to consider:

  • Cost vs. Recovery Speed (trade-off).
  • RPO and RTO requirements.
  • Complexity of management and automation.
  • AWS services available to implement the strategy.
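
The cost versus recovery speed trade-off can be expressed as "pick the cheapest strategy that still meets both objectives". The sketch below is one possible way to encode that rule. The per-strategy floors are illustrative assumptions loosely based on the ranges in this guide, not official AWS figures.

```python
# (strategy, best achievable RPO in minutes, best achievable RTO in minutes),
# ordered from cheapest to most expensive. The numbers are assumptions.
STRATEGIES = [
    ("Backup and Restore", 60, 240),
    ("Pilot Light", 5, 30),
    ("Warm Standby", 1, 10),
    ("Active-Active", 0, 0),
]


def suggest_strategy(rpo_minutes, rto_minutes):
    """Return the cheapest strategy whose floors satisfy both requirements."""
    for name, rpo_floor, rto_floor in STRATEGIES:
        if rpo_minutes >= rpo_floor and rto_minutes >= rto_floor:
            return name
    return "Active-Active"  # unreachable with the floors above, kept as a safe default


print(suggest_strategy(rpo_minutes=24 * 60, rto_minutes=48 * 60))  # Backup and Restore
print(suggest_strategy(rpo_minutes=15, rto_minutes=60))            # Pilot Light
print(suggest_strategy(rpo_minutes=1, rto_minutes=15))             # Warm Standby
print(suggest_strategy(rpo_minutes=0, rto_minutes=1))              # Active-Active
```

Exam questions follow the same logic: read the stated RPO and RTO, discard strategies that cannot meet them, and choose the least expensive one that remains.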

4. AWS Services Commonly Used in DR

  • Amazon S3 & S3 Glacier: Backup storage.
  • Amazon EC2: Compute resources for pilot light or warm standby.
  • Amazon RDS / Aurora: Database replication and failover.
  • AWS Backup: Centralized backup service.
  • Amazon Route 53: DNS failover for active-active setups.
  • Amazon CloudWatch & Lambda: Automation for failover and scaling.

5. Exam Tips

  1. RPO vs RTO: Always remember:
    • RPO = data loss tolerance
    • RTO = downtime tolerance
  2. Cost vs Recovery Speed: DR strategy questions often ask you to balance cost and speed.
  3. AWS Services: Know which AWS service fits each DR type.

Summary Table for Quick Exam Reference

DR Strategy      | Cost       | RPO            | RTO            | Key AWS Services
Backup & Restore | Low        | Hours-Days     | Hours-Days     | S3, Glacier, EC2 AMI, RDS
Pilot Light      | Low-Medium | Minutes-Hours  | Minutes-Hours  | EC2, RDS, S3, Auto Scaling
Warm Standby     | Medium     | Near-Real-Time | <1 Hour        | EC2, RDS, ELB, Auto Scaling
Active-Active    | High       | Seconds        | Seconds        | Route 53, Aurora Global DB, EC2, ELB