Identifying metrics based on business requirements to deliver a highly available solution

Task Statement 2.2: Design highly available and/or fault-tolerant architectures.

📘 AWS Certified Solutions Architect – Associate (SAA-C03)

1. What This Topic Means

In AWS architecture, metrics are measurable values that help you understand how your system is performing.

To design a highly available (HA) system, you must:

Know what your business expects
Translate those expectations into measurable metrics
Use AWS tools to monitor and act on those metrics

👉 In simple terms:
Business requirement → Convert to metric → Monitor → Take action

2. Why Metrics Are Important for High Availability

Metrics help you:

Detect failures quickly
Prevent downtime
Maintain performance under load
Automatically recover systems

Without metrics, you are “blind”: you cannot tell whether the system is meeting its availability target, and you cannot react when it is not.

3. Key Business Requirements You Must Understand

Before choosing metrics, identify what the business needs:

1. Availability Requirement

Example: “System must be available 99.99% of the time”
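
An availability percentage is easier to reason about as a downtime budget. The sketch below converts a target into allowed minutes of downtime, assuming a 365-day year; the function name is illustrative only.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")
```

At 99.99%, the budget works out to roughly 52.6 minutes per year, so any manual recovery process will likely be too slow.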

2. Performance Requirement

Example: “Requests must respond within 200 ms”

3. Scalability Requirement

Example: “System must handle traffic spikes”

4. Reliability Requirement

Example: “No data loss allowed”

5. Recovery Requirement

RTO (Recovery Time Objective): Maximum acceptable time to restore service after a failure
RPO (Recovery Point Objective): Maximum acceptable data loss, measured as a window of time before the failure
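
To make RTO and RPO concrete, the following sketch checks one incident against both objectives. It assumes a backup-based recovery, where everything written after the last backup is lost; the function name and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def meets_recovery_objectives(last_backup, failure, restored, rto, rpo):
    """Compare one incident against RTO and RPO.

    Downtime is (restored - failure).
    The data-loss window is (failure - last_backup).
    """
    downtime = restored - failure
    data_loss_window = failure - last_backup
    return {
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
        "downtime": downtime,
        "data_loss_window": data_loss_window,
    }


result = meets_recovery_objectives(
    last_backup=datetime(2024, 1, 1, 11, 45),
    failure=datetime(2024, 1, 1, 12, 0),
    restored=datetime(2024, 1, 1, 12, 50),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=5),
)
print("RTO met:", result["rto_met"])
print("RPO met:", result["rpo_met"])
```

In this example the service is back within the hour, but 15 minutes of data is lost against a 5-minute RPO, so backups (or replication) would need to run more often.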

4. Types of Metrics You Should Identify

A. Availability Metrics

Measure if your system is up and reachable.

Common metrics:

Uptime percentage
Health check status
Number of failed requests

AWS services:

Amazon CloudWatch
Elastic Load Balancing (ELB) health checks
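
One simple way to turn "number of failed requests" into an availability figure is to measure the share of successful requests. This is a sketch under that assumption (request-based availability rather than time-based uptime); the function names are hypothetical.

```python
def availability_percent(total_requests: int, failed_requests: int) -> float:
    """Percentage of requests that succeeded."""
    if total_requests == 0:
        # Assumption: with no traffic there is no observed failure.
        return 100.0
    return 100.0 * (total_requests - failed_requests) / total_requests


def meets_target(total_requests: int, failed_requests: int, target_percent: float) -> bool:
    """True if measured availability is at or above the target."""
    return availability_percent(total_requests, failed_requests) >= target_percent


print(meets_target(1_000_000, 50, 99.99))   # 50 failures out of 1,000,000
print(meets_target(1_000_000, 200, 99.99))  # 200 failures out of 1,000,000
```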

B. Performance Metrics

Measure how fast the system responds.

Common metrics:

Latency (response time)
Throughput (requests per second)
CPU utilization

Important for:

Web applications
APIs
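
Average latency can hide slow requests, so performance targets are often checked with percentiles. The sketch below uses the nearest-rank percentile method on a small, made-up sample of response times and also derives throughput; the sample values and the 5-second window are assumptions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [120, 95, 180, 210, 130, 90, 400, 150, 110, 160]
window_seconds = 5  # the samples above were collected over this window

average_ms = sum(latencies_ms) / len(latencies_ms)
throughput_rps = len(latencies_ms) / window_seconds

print("average:", average_ms, "ms")
print("p50:", percentile(latencies_ms, 50), "ms")
print("p90:", percentile(latencies_ms, 90), "ms")
print("throughput:", throughput_rps, "requests/second")
```

Here the average stays under 200 ms while the 90th percentile does not, which is why a "respond within 200 ms" requirement should say which statistic it applies to.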

C. Error Metrics

Measure failures in the system.

Examples:

HTTP 5xx errors
Application exceptions
Database connection failures
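
Error metrics are usually tracked as a rate, not a raw count. As a minimal sketch, the function below groups HTTP status codes into 4xx (client errors) and 5xx (server errors) and returns each as a fraction of all requests; the sample data is invented.

```python
from collections import Counter


def error_rates(status_codes):
    """Return the fraction of 4xx and 5xx responses among all responses."""
    total = len(status_codes)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    classes = Counter(code // 100 for code in status_codes)
    return {"4xx": classes[4] / total, "5xx": classes[5] / total}


codes = [200] * 95 + [404] * 3 + [500, 503]
rates = error_rates(codes)
print(f"4xx rate: {rates['4xx']:.1%}")
print(f"5xx rate: {rates['5xx']:.1%}")
```

For availability, the 5xx rate matters most, because it reflects failures on the server side.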

D. Scaling Metrics

Used to automatically increase/decrease resources.

Examples:

CPU utilization
Memory usage (for EC2, this is not a default CloudWatch metric; it must be published by the CloudWatch agent)
Request count

Used in:

Auto Scaling Groups
AWS Lambda concurrency
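
The idea behind scaling on a metric can be sketched as a proportional rule: size the fleet so the metric returns to its target. This is a simplified illustration, not the exact algorithm used by AWS Auto Scaling, which also applies cooldowns and instance warm-up; the function name and numbers are assumptions.

```python
import math


def desired_capacity(current_instances, current_metric, target_metric, min_size, max_size):
    """Estimate how many instances bring the metric back to its target value."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_instances * current_metric / target_metric)
    # Never go outside the group's configured limits.
    return max(min_size, min(max_size, raw))


# 4 instances at 80% average CPU, target 50%
print(desired_capacity(4, 80, 50, min_size=2, max_size=10))
# 4 instances at 20% average CPU, target 50%
print(desired_capacity(4, 20, 50, min_size=2, max_size=10))
```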

E. Storage and Data Metrics

Measure data reliability and performance.

Examples:

Disk I/O
Read/write latency
Replication lag

5. Important AWS Monitoring Service

Amazon CloudWatch

Main service used to collect and monitor metrics.

Key features:

Metrics collection
Alarms
Dashboards
Logs

👉 Exam Tip:
CloudWatch is the central monitoring service in AWS

6. Converting Business Requirements into Metrics

This is very important for the exam.

Step-by-step:

Step 1: Understand Requirement

Example:

“System must respond quickly”

Step 2: Convert to Metric

Latency (e.g., < 200 ms)

Step 3: Set Threshold

Alarm if latency > 200 ms

Step 4: Take Action

Trigger Auto Scaling
Send alert
Restart service
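
The four steps above can be sketched as a CloudWatch alarm definition. The function below only builds the parameter dictionary; it does not call AWS. The keys match the parameters of the boto3 CloudWatch client's `put_metric_alarm` method, and `TargetResponseTime` is the Application Load Balancer latency metric, reported in seconds. The function name, alarm name, load balancer value, and SNS topic ARN are placeholders.

```python
def latency_alarm_params(alarm_name, load_balancer, threshold_ms, sns_topic_arn):
    """Build parameters for an alarm on load balancer response time."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Average",
        "Period": 60,                 # evaluate one-minute datapoints
        "EvaluationPeriods": 3,       # threshold must be breached 3 periods in a row
        "Threshold": threshold_ms / 1000.0,  # the metric is in seconds
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],     # Step 4: send alert
    }


params = latency_alarm_params(
    alarm_name="web-high-latency",                       # placeholder
    load_balancer="app/my-load-balancer/0123456789abcdef",  # placeholder
    threshold_ms=200,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder
)
print(params["MetricName"], ">", params["Threshold"], "seconds")

# With AWS credentials configured, the alarm could then be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```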

7. Using Alarms for High Availability

Metrics alone are not enough; you must act on them.

CloudWatch Alarms:

Monitor metrics continuously
Trigger actions when thresholds are crossed

Actions:

Send notification (SNS)
Scale resources
Recover EC2 instances
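
An alarm normally fires only after a metric breaches its threshold for several periods, which avoids reacting to a single spike. The sketch below imitates that "M out of N datapoints" idea. It is a simplification written for illustration: real CloudWatch alarms also handle missing data and other cases not modeled here.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return a simplified alarm state based on the most recent datapoints."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Latency in ms, one datapoint per minute; alarm if 2 of the last 3 exceed 200 ms
print(alarm_state([150, 250, 260, 240], 200, evaluation_periods=3, datapoints_to_alarm=2))
print(alarm_state([150, 160, 250], 200, evaluation_periods=3, datapoints_to_alarm=2))
print(alarm_state([250], 200, evaluation_periods=3, datapoints_to_alarm=2))
```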

8. Example IT-Based Scenario (Exam Style)

Scenario:

A web application must remain highly available and handle traffic spikes.

Metrics to use:

CPU utilization → for scaling
Request count → traffic monitoring
Latency → performance check
Error rate → failure detection

Solution:

Use Auto Scaling based on CPU
Use CloudWatch alarms
Use Load Balancer health checks

9. High Availability Design Using Metrics

To achieve HA, metrics are used to:

1. Detect Failures

Health checks fail → remove instance

2. Trigger Scaling

High CPU → add instances

3. Improve Performance

High latency → optimize or scale

4. Enable Failover

Region/instance unhealthy → switch to backup
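
Failure detection and failover both depend on health checks with thresholds: a target is marked unhealthy only after several consecutive failures, and healthy again only after several consecutive successes. The class below is a minimal simulation of that behavior; the class name, function name, and threshold values are assumptions and do not correspond to any AWS API.

```python
class HealthTracker:
    """Track health state from a stream of pass/fail health check results."""

    def __init__(self, unhealthy_threshold=2, healthy_threshold=3):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the current state."""
        if check_passed == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            limit = self.healthy_threshold if check_passed else self.unhealthy_threshold
            if self._streak >= limit:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy


def pick_endpoint(primary: HealthTracker) -> str:
    """Route to the primary while it is healthy, otherwise fail over."""
    return "primary" if primary.healthy else "backup"


primary = HealthTracker()
for passed in (True, False, False):
    primary.record(passed)
print(pick_endpoint(primary))  # two consecutive failures have been recorded
```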

10. Key Exam Concepts to Remember

MUST KNOW:

✔ Metrics come from business requirements
✔ CloudWatch is the primary monitoring tool
✔ Alarms trigger automated actions
✔ Metrics help with:

Scaling
Failover
Recovery

IMPORTANT METRICS FOR EXAM:

CPU Utilization
Latency
Error Rate (4xx, 5xx)
Request Count
Disk I/O
Network Traffic

11. Common Exam Questions (What They Test)

You may be asked:

1. Which metric to use?

Slow responses → monitor latency (response time)
Scaling → CPU or request count

2. What action to take?

Use Auto Scaling
Use CloudWatch alarms

3. How to meet availability requirement?

Monitor health checks
Use failover mechanisms

12. Best Practices (Exam-Focused)

Always define metrics based on business needs
Use multiple metrics (not just one)
Set thresholds that match the business requirement (e.g., alarm when latency > 200 ms)
Use alarms for automation
Monitor continuously

13. Common Mistakes to Avoid

❌ Not setting alarms
❌ Monitoring only one metric
❌ Ignoring error rates
❌ Not linking metrics to scaling

14. Quick Summary

Metrics = measurable system performance values
Derived from business requirements
Used to ensure high availability
Monitored using CloudWatch
Alarms trigger automatic recovery and scaling