Identifying metrics based on business requirements to deliver a highly available solution

Task Statement 2.2: Design highly available and/or fault-tolerant architectures.

📘 AWS Certified Solutions Architect – Associate (SAA-C03)


1. What This Topic Means

In AWS architecture, metrics are measurable values that help you understand how your system is performing.

To design a highly available (HA) system, you must:

  • Know what your business expects
  • Translate those expectations into measurable metrics
  • Use AWS tools to monitor and act on those metrics

👉 In simple terms:
Business requirement → Convert to metric → Monitor → Take action


2. Why Metrics Are Important for High Availability

Metrics help you:

  • Detect failures quickly
  • Prevent downtime
  • Maintain performance under load
  • Automatically recover systems

Without metrics, you are “blind” and cannot ensure availability.


3. Key Business Requirements You Must Understand

Before choosing metrics, identify what the business needs:

1. Availability Requirement

  • Example: “System must be available 99.99% of the time”

2. Performance Requirement

  • Example: “Requests must respond within 200 ms”

3. Scalability Requirement

  • Example: “System must handle traffic spikes”

4. Reliability Requirement

  • Example: “No data loss allowed”

5. Recovery Requirement

  • RTO (Recovery Time Objective): Maximum acceptable time to restore service after a failure
  • RPO (Recovery Point Objective): Maximum acceptable data loss, measured as a window of time
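
An availability percentage becomes easier to reason about once it is converted into a downtime budget. The sketch below is only illustrative arithmetic: the function name is made up for this example, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (assumption: no leap year)


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {budget:.2f} minutes of downtime per year")
```

Under these assumptions, a 99.99% target leaves roughly 52.56 minutes of downtime per year, while 99.9% leaves roughly 525.6 minutes.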

4. Types of Metrics You Should Identify

A. Availability Metrics

Measure if your system is up and reachable.

Common metrics:

  • Uptime percentage
  • Health check status
  • Number of failed requests

AWS services:

  • Amazon CloudWatch
  • Elastic Load Balancing (ELB) health checks

B. Performance Metrics

Measure how fast the system responds.

Common metrics:

  • Latency (response time)
  • Throughput (requests per second)
  • CPU utilization

Important for:

  • Web applications
  • APIs

C. Error Metrics

Measure failures in the system.

Examples:

  • HTTP 5xx errors
  • Application exceptions
  • Database connection failures
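
An error metric is usually tracked as a rate rather than a raw count, so that it stays meaningful as traffic grows. A minimal sketch, assuming you already have the total request count and the number of failed (for example HTTP 5xx) requests for the same time window; the function name is hypothetical.

```python
def error_rate_percent(total_requests: int, failed_requests: int) -> float:
    """Return failed requests as a percentage of all requests in a window."""
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("counts must be non-negative")
    if failed_requests > total_requests:
        raise ValueError("failed_requests cannot exceed total_requests")
    if total_requests == 0:
        return 0.0  # no traffic in the window, so nothing failed
    return 100 * failed_requests / total_requests


if __name__ == "__main__":
    # 10 failures out of 2,000 requests
    print(error_rate_percent(2000, 10))
```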

D. Scaling Metrics

Used to automatically increase/decrease resources.

Examples:

  • CPU utilization
  • Memory usage (on EC2, available only if the CloudWatch agent publishes it)
  • Request count

Used in:

  • Auto Scaling Groups
  • AWS Lambda concurrency
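
To see how a scaling metric can drive capacity, here is a simplified proportional rule: scale the instance count by the ratio of the observed metric to the target, then clamp it to the group's limits. This is a sketch of the idea only, not the exact algorithm AWS Auto Scaling uses, and all names are hypothetical.

```python
import math


def desired_capacity(current_instances: int, current_metric: float,
                     target_metric: float, min_size: int, max_size: int) -> int:
    """Estimate how many instances keep the metric near its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    # Round up so the group is never left under-provisioned.
    raw = math.ceil(current_instances * current_metric / target_metric)
    # Never go below the minimum or above the maximum group size.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # 4 instances at 80% average CPU, aiming for 50%
    print(desired_capacity(4, 80.0, 50.0, min_size=2, max_size=10))
```

With 4 instances at 80% CPU and a 50% target, the rule asks for 7 instances; the minimum and maximum sizes stop the group from shrinking to nothing or growing without limit.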

E. Storage and Data Metrics

Measure data reliability and performance.

Examples:

  • Disk I/O
  • Read/write latency
  • Replication lag
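
Replication lag ties directly back to the RPO: with asynchronous replication, the lag at the moment of failover is roughly the amount of recent data that could be lost. The sketch below assumes lag samples are collected in seconds; the function names are made up for illustration.

```python
def worst_case_data_loss_seconds(lag_samples_seconds: list[float]) -> float:
    """Treat the largest observed replication lag as the worst-case data loss."""
    if not lag_samples_seconds:
        raise ValueError("at least one lag sample is required")
    return max(lag_samples_seconds)


def within_rpo(lag_samples_seconds: list[float], rpo_seconds: float) -> bool:
    """Return True if the observed lag never exceeded the RPO."""
    return worst_case_data_loss_seconds(lag_samples_seconds) <= rpo_seconds


if __name__ == "__main__":
    samples = [2, 5, 40, 3]  # hypothetical lag readings in seconds
    print(within_rpo(samples, rpo_seconds=60))
```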

5. Important AWS Monitoring Service

Amazon CloudWatch

Main service used to collect and monitor metrics.

Key features:

  • Metrics collection
  • Alarms
  • Dashboards
  • Logs

👉 Exam Tip:
CloudWatch is the central monitoring service in AWS


6. Converting Business Requirements into Metrics

This is very important for the exam.

Step-by-step:

Step 1: Understand Requirement

Example:

  • “System must respond quickly”

Step 2: Convert to Metric

  • Latency (e.g., < 200 ms)

Step 3: Set Threshold

  • Alarm if latency > 200 ms

Step 4: Take Action

  • Trigger Auto Scaling
  • Send alert
  • Restart service
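
The four steps above can be sketched as a CloudWatch alarm definition. This example only builds the parameter dictionary and does not call AWS. It assumes an Application Load Balancer, whose `TargetResponseTime` metric is reported in seconds (so 200 ms becomes 0.2). The load balancer identifier, SNS topic ARN, period, and evaluation settings are placeholder values, not recommendations.

```python
def latency_alarm_params(load_balancer: str, threshold_ms: float,
                         sns_topic_arn: str) -> dict:
    """Build keyword arguments for a CloudWatch latency alarm."""
    return {
        "AlarmName": f"latency-above-{threshold_ms}ms",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",   # Step 2: the metric
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Average",
        "Period": 60,                          # seconds per datapoint (assumed)
        "EvaluationPeriods": 3,                # assumed; tune to your needs
        "Threshold": threshold_ms / 1000,      # Step 3: metric is in seconds
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],       # Step 4: notify via SNS
    }


if __name__ == "__main__":
    params = latency_alarm_params(
        "app/my-alb/1234567890abcdef",                    # hypothetical ALB
        200,
        "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # hypothetical topic
    )
    print(params)
```

If boto3 is installed and credentials are configured, a dictionary like this could be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`.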

7. Using Alarms for High Availability

Metrics alone are not enough — you must act on them.

CloudWatch Alarms:

  • Monitor metrics continuously
  • Trigger actions when thresholds are crossed

Actions:

  • Send notification (SNS)
  • Scale resources
  • Recover EC2 instances
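
An alarm does not usually fire on a single bad datapoint. CloudWatch lets you require that M of the last N datapoints breach the threshold. The sketch below imitates that idea in plain Python. It is a simplification (for instance, it ignores how CloudWatch treats missing data), and the function and parameter names are chosen for this example only.

```python
def alarm_state(datapoints: list[float], threshold: float,
                evaluation_periods: int, datapoints_to_alarm: int) -> str:
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA' for the latest window."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Only the most recent N datapoints are evaluated.
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    latency_ms = [120, 250, 260, 240]  # hypothetical latency readings
    print(alarm_state(latency_ms, threshold=200,
                      evaluation_periods=3, datapoints_to_alarm=2))
```

Requiring several breaching datapoints reduces false alarms from short spikes, at the cost of reacting a little later.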

8. Example IT-Based Scenario (Exam Style)

Scenario:

A web application must remain highly available and handle traffic spikes.

Metrics to use:

  • CPU utilization → for scaling
  • Request count → traffic monitoring
  • Latency → performance check
  • Error rate → failure detection

Solution:

  • Use Auto Scaling based on CPU
  • Use CloudWatch alarms
  • Use Load Balancer health checks

9. High Availability Design Using Metrics

To achieve HA, metrics are used to:

1. Detect Failures

  • Health checks fail → remove instance

2. Trigger Scaling

  • High CPU → add instances

3. Improve Performance

  • High latency → optimize or scale

4. Enable Failover

  • Region/instance unhealthy → switch to backup
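
Failure detection through health checks normally relies on consecutive results: a target is marked unhealthy after several failed checks in a row and healthy again after several successes. The simulation below is a sketch of that behavior; the threshold values are illustrative, not AWS defaults, and the function name is hypothetical.

```python
def track_health(check_results: list[bool], healthy_threshold: int = 3,
                 unhealthy_threshold: int = 2,
                 initially_healthy: bool = True) -> list[bool]:
    """Return the target's health state after each health check result."""
    healthy = initially_healthy
    streak = 0  # consecutive results that contradict the current state
    history = []
    for passed in check_results:
        if passed == healthy:
            streak = 0  # result agrees with current state; reset the streak
        else:
            streak += 1
            needed = healthy_threshold if passed else unhealthy_threshold
            if streak >= needed:
                healthy = passed  # enough consecutive results to flip state
                streak = 0
        history.append(healthy)
    return history


if __name__ == "__main__":
    results = [True, False, False, True, True, True]
    print(track_health(results))
```

In this run the target is taken out of service after the second consecutive failure and returns only after three consecutive successes, so a single failed check does not remove an instance.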

10. Key Exam Concepts to Remember

MUST KNOW:

✔ Metrics come from business requirements
✔ CloudWatch is the primary monitoring tool
✔ Alarms trigger automated actions
✔ Metrics help with:

  • Scaling
  • Failover
  • Recovery

IMPORTANT METRICS FOR EXAM:

  • CPU Utilization
  • Latency
  • Error Rate (4xx, 5xx)
  • Request Count
  • Disk I/O
  • Network Traffic

11. Common Exam Questions (What They Test)

You may be asked:

1. Which metric to use?

  • High latency → use latency metric
  • Scaling → CPU or request count

2. What action to take?

  • Use Auto Scaling
  • Use CloudWatch alarms

3. How to meet availability requirement?

  • Monitor health checks
  • Use failover mechanisms

12. Best Practices (Exam-Focused)

  • Always define metrics based on business needs
  • Use multiple metrics (not just one)
  • Set proper thresholds
  • Use alarms for automation
  • Monitor continuously

13. Common Mistakes to Avoid

❌ Not setting alarms
❌ Monitoring only one metric
❌ Ignoring error rates
❌ Not linking metrics to scaling


14. Quick Summary

  • Metrics = measurable system performance values
  • Derived from business requirements
  • Used to ensure high availability
  • Monitored using CloudWatch
  • Alarms trigger automatic recovery and scaling