Task Statement 2.2: Design highly available and/or fault-tolerant architectures.
📘 AWS Certified Solutions Architect – Associate (SAA-C03)
1. What This Topic Means
In AWS architecture, metrics are measurable values that help you understand how your system is performing.
To design a highly available (HA) system, you must:
- Know what your business expects
- Translate those expectations into measurable metrics
- Use AWS tools to monitor and act on those metrics
👉 In simple terms:
Business requirement → Convert to metric → Monitor → Take action
2. Why Metrics Are Important for High Availability
Metrics help you:
- Detect failures quickly
- Prevent downtime
- Maintain performance under load
- Automatically recover systems
Without metrics, you are “blind” and cannot ensure availability.
3. Key Business Requirements You Must Understand
Before choosing metrics, identify what the business needs:
1. Availability Requirement
- Example: “System must be available 99.99% of the time”
2. Performance Requirement
- Example: “Requests must respond within 200 ms”
3. Scalability Requirement
- Example: “System must handle traffic spikes”
4. Reliability Requirement
- Example: “No data loss allowed”
5. Recovery Requirement
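- RTO (Recovery Time Objective): The maximum acceptable time to restore service after a failure
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss, measured as the time between the last recoverable copy of the data and the failure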
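An availability percentage becomes easier to reason about once it is converted into a downtime budget. The sketch below is only arithmetic: the helper name `allowed_downtime_minutes` and the 365-day year are assumptions made for illustration, not part of any AWS API.

```python
# Allowed downtime implied by an availability target (illustrative helper).
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Return the maximum downtime, in minutes, over a window of `days` days."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * days * MINUTES_PER_DAY


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.99% target leaves roughly 52.6 minutes of downtime per year, which is why such targets usually call for automated detection and recovery rather than manual response.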
4. Types of Metrics You Should Identify
A. Availability Metrics
Measure if your system is up and reachable.
Common metrics:
- Uptime percentage
- Health check status
- Number of failed requests
AWS services:
- Amazon CloudWatch
- Elastic Load Balancing (ELB) health checks
B. Performance Metrics
Measure how fast the system responds.
Common metrics:
- Latency (response time)
- Throughput (requests per second)
- CPU utilization
Important for:
- Web applications
- APIs
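A latency requirement such as "respond within 200 ms" is usually checked against a percentile rather than an average, because an average can hide a slow tail. The sketch below uses the nearest-rank method on made-up sample data; the `percentile` helper and the sample values are illustrative assumptions. CloudWatch itself can report percentile statistics such as p99, so in practice you would rarely compute this by hand.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample that is >= pct% of all samples."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical response times in milliseconds: mostly fast, with a slow tail.
    samples = [100] * 97 + [150, 800, 900]
    average = sum(samples) / len(samples)
    print(f"average = {average:.1f} ms, p99 = {percentile(samples, 99)} ms")
```

With this sample data the average stays well under 200 ms while the p99 does not, so an alarm on the average alone would miss the slow requests.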
C. Error Metrics
Measure failures in the system.
Examples:
- HTTP 5xx errors
- Application exceptions
- Database connection failures
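Raw error counts are hard to compare across quiet and busy periods, so they are commonly normalized into an error rate. A minimal sketch, assuming you already have a 5xx count and a total request count for the same time window (the function name is hypothetical):

```python
def error_rate_percent(error_count: int, request_count: int) -> float:
    """Return errors as a percentage of requests; 0.0 when there was no traffic."""
    if error_count < 0 or request_count < 0:
        raise ValueError("counts must not be negative")
    if request_count == 0:
        return 0.0  # no traffic in the window, so nothing failed
    return 100.0 * error_count / request_count


if __name__ == "__main__":
    print(error_rate_percent(5, 1000))
```

Treating a window with no traffic as a 0% error rate is a design choice in this sketch; a real alarm may instead need to decide explicitly how missing data is handled.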
D. Scaling Metrics
Used to automatically increase/decrease resources.
Examples:
- CPU utilization
- Memory usage (not published by EC2 by default; requires the CloudWatch agent or a custom metric)
- Request count
Used in:
- Auto Scaling Groups
- AWS Lambda concurrency
E. Storage and Data Metrics
Measure data reliability and performance.
Examples:
- Disk I/O
- Read/write latency
- Replication lag
5. Important AWS Monitoring Service
Amazon CloudWatch
Main service used to collect and monitor metrics.
Key features:
- Metrics collection
- Alarms
- Dashboards
- Logs
👉 Exam Tip:
CloudWatch is the central monitoring service in AWS
6. Converting Business Requirements into Metrics
This is very important for the exam.
Step-by-step:
Step 1: Understand Requirement
Example:
- “System must respond quickly”
Step 2: Convert to Metric
- Latency (e.g., < 200 ms)
Step 3: Set Threshold
- Alarm if latency > 200 ms
Step 4: Take Action
- Trigger Auto Scaling
- Send alert
- Restart service
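The four steps above can be sketched as a small rule that turns a threshold into an alarm state. This is a simplified model for study purposes, not CloudWatch's actual evaluation logic (which also handles missing data, among other things); the `AlarmRule` class and `alarm_state` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmRule:
    metric_name: str
    threshold: float
    evaluation_periods: int   # N: how many recent datapoints to look at
    datapoints_to_alarm: int  # M: how many of those must breach the threshold

    def __post_init__(self):
        if not 1 <= self.datapoints_to_alarm <= self.evaluation_periods:
            raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")


def alarm_state(rule: AlarmRule, datapoints: list) -> str:
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA' for the latest datapoints."""
    recent = datapoints[-rule.evaluation_periods:]
    if len(recent) < rule.evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in recent if value > rule.threshold)
    return "ALARM" if breaching >= rule.datapoints_to_alarm else "OK"


if __name__ == "__main__":
    # Requirement: "respond quickly" -> metric: latency -> threshold: 200 ms.
    rule = AlarmRule("latency_ms", threshold=200.0, evaluation_periods=3, datapoints_to_alarm=2)
    print(alarm_state(rule, [120, 250, 180, 260]))
```

Requiring 2 breaching datapoints out of the last 3 is one way to avoid reacting to a single brief spike; the action in Step 4 would then be attached to the transition into the ALARM state.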
7. Using Alarms for High Availability
Metrics alone are not enough; you must act on them.
CloudWatch Alarms:
- Monitor metrics continuously
- Trigger actions when thresholds are crossed
Actions:
- Send notification (SNS)
- Scale resources
- Recover EC2 instances
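As an illustration of an alarm that both notifies and recovers, the sketch below builds the parameters for an alarm on the EC2 `StatusCheckFailed_System` metric. The helper name, alarm name, period, and evaluation settings are assumptions chosen for the example. The dictionary uses the parameter names of the CloudWatch `PutMetricAlarm` API and could be passed to boto3 as `boto3.client("cloudwatch").put_metric_alarm(**params)`, but no AWS call is made here, and the example has not been run against a real account.

```python
def build_recover_alarm(instance_id: str, region: str, sns_topic_arn: str) -> dict:
    """Build PutMetricAlarm parameters: recover the instance and notify via SNS."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",  # 1 when the system check fails
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,              # seconds per datapoint (assumed value)
        "EvaluationPeriods": 2,    # two consecutive failing minutes (assumed value)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [
            f"arn:aws:automate:{region}:ec2:recover",  # EC2 recover action
            sns_topic_arn,                             # notification target
        ],
    }


if __name__ == "__main__":
    params = build_recover_alarm(
        "i-0123456789abcdef0",  # placeholder instance ID
        "us-east-1",
        "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder topic ARN
    )
    for key, value in params.items():
        print(f"{key}: {value}")
```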
8. Example IT-Based Scenario (Exam Style)
Scenario:
A web application must remain highly available and handle traffic spikes.
Metrics to use:
- CPU utilization → for scaling
- Request count → traffic monitoring
- Latency → performance check
- Error rate → failure detection
Solution:
- Use Auto Scaling based on CPU
- Use CloudWatch alarms
- Use Load Balancer health checks
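To see how a CPU-based scaling decision in this scenario might work, here is a simplified proportional model: scale the fleet so that average CPU moves back toward a target value, within the group's minimum and maximum size. This is an illustration of the idea only, not the exact algorithm AWS Auto Scaling uses, and the function name and numbers are assumptions.

```python
import math


def desired_capacity(current_instances: int, current_metric: float,
                     target_metric: float, min_size: int, max_size: int) -> int:
    """Proportionally resize the fleet, clamped to [min_size, max_size]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    if current_instances < 1 or min_size < 0 or max_size < min_size:
        raise ValueError("invalid fleet sizes")
    # Round up so the fleet is never sized below what the target implies.
    raw = math.ceil(current_instances * current_metric / target_metric)
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # 4 instances at 80% average CPU, aiming for 50%, group limits 2..10.
    print(desired_capacity(4, 80.0, 50.0, min_size=2, max_size=10))
```

The minimum size protects availability during quiet periods, and the maximum size caps cost during spikes; both limits matter as much as the scaling metric itself.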
9. High Availability Design Using Metrics
To achieve HA, metrics are used to:
1. Detect Failures
- Health checks fail → remove instance
2. Trigger Scaling
- High CPU → add instances
3. Improve Performance
- High latency → optimize or scale
4. Enable Failover
- Region/instance unhealthy → switch to backup
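The detect-then-fail-over pattern above can be sketched in a few lines. The model below marks an endpoint unhealthy after a number of consecutive failed health checks and then routes to the first healthy endpoint in priority order. It is a simplified illustration with hypothetical names, not the behavior of any specific AWS service.

```python
def is_healthy(check_results, unhealthy_threshold: int = 2) -> bool:
    """Unhealthy once the last `unhealthy_threshold` checks have all failed."""
    if unhealthy_threshold < 1:
        raise ValueError("unhealthy_threshold must be at least 1")
    recent = check_results[-unhealthy_threshold:]
    all_recent_failed = len(recent) == unhealthy_threshold and not any(recent)
    return not all_recent_failed


def choose_endpoint(priority_order, check_history, unhealthy_threshold: int = 2):
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for name in priority_order:
        if is_healthy(check_history.get(name, []), unhealthy_threshold):
            return name
    return None


if __name__ == "__main__":
    # True = health check passed, False = failed (made-up history).
    history = {
        "primary": [True, False, False],
        "standby": [True, True, True],
    }
    print(choose_endpoint(["primary", "standby"], history))
```

Requiring several consecutive failures before failing over trades a slightly slower reaction for fewer false alarms caused by a single dropped check.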
10. Key Exam Concepts to Remember
MUST KNOW:
✔ Metrics come from business requirements
✔ CloudWatch is the primary monitoring tool
✔ Alarms trigger automated actions
✔ Metrics help with:
- Scaling
- Failover
- Recovery
IMPORTANT METRICS FOR EXAM:
- CPU Utilization
- Latency
- Error Rate (4xx, 5xx)
- Request Count
- Disk I/O
- Network Traffic
11. Common Exam Questions (What They Test)
You may be asked:
1. Which metric to use?
- Slow responses → monitor a latency metric (for example, load balancer target response time)
- Scaling → CPU or request count
2. What action to take?
- Use Auto Scaling
- Use CloudWatch alarms
3. How to meet availability requirement?
- Monitor health checks
- Use failover mechanisms
12. Best Practices (Exam-Focused)
- Always define metrics based on business needs
- Use multiple metrics (not just one)
- Set proper thresholds
- Use alarms for automation
- Monitor continuously
13. Common Mistakes to Avoid
❌ Not setting alarms
❌ Monitoring only one metric
❌ Ignoring error rates
❌ Not linking metrics to scaling
14. Quick Summary
- Metrics = measurable system performance values
- Derived from business requirements
- Used to ensure high availability
- Monitored using CloudWatch
- Alarms trigger automatic recovery and scaling
