3.3 Explain disaster recovery (DR) concepts
DR Metrics
📘CompTIA Network+ (N10-009)
Definition
Mean Time to Repair (MTTR) is a key disaster recovery (DR) and IT operations metric. It measures how long it takes, on average, to repair a failed system, device, or component and restore it to full working condition.
- Think of it as “the average downtime from the moment a problem is detected until it is fully fixed.”
- MTTR helps organizations understand system reliability and the efficiency of their repair processes.
Why MTTR Matters in IT
MTTR is crucial in IT because downtime costs money and productivity:
- Servers – If a web server goes down, MTTR measures how quickly IT can get it back online.
- Network Devices – If a switch or router fails, MTTR tracks how fast the network is restored.
- Applications – For business-critical apps, MTTR helps assess how quickly users can access services again.
- Data Recovery – MTTR can also apply to restoring lost data from backups.
Key point: Lower MTTR = faster recovery = less downtime = happier users.
How MTTR is Calculated
The formula for MTTR is straightforward:
$$\text{MTTR} = \frac{\text{Total Downtime}}{\text{Number of Incidents}}$$
Step-by-step explanation:
- Add up the total downtime (from detection until full repair) for all failures over a period.
- Count the number of repair incidents in that same period.
- Divide the total downtime by the number of incidents.
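The three steps can be sketched in Python. This is a minimal illustration, not a standard tool. The `mttr` function and the sample timestamps are hypothetical, and the sketch assumes each incident is logged as a (detected, restored) pair, so downtime runs from detection until service is fully restored.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Hypothetical helper: total downtime divided by number of incidents.

    Each incident is a (detected_at, restored_at) pair.
    """
    if not incidents:
        raise ValueError("MTTR is undefined with zero incidents")
    total_downtime = timedelta()
    for detected_at, restored_at in incidents:
        if restored_at < detected_at:
            raise ValueError("restore time precedes detection time")
        total_downtime += restored_at - detected_at  # step 1: add up downtime
    # steps 2 and 3: count the incidents and divide
    return total_downtime / len(incidents)


# Made-up incident log for one month
log = [
    (datetime(2024, 5, 2, 9, 0), datetime(2024, 5, 2, 10, 30)),    # 1.5 h
    (datetime(2024, 5, 14, 22, 15), datetime(2024, 5, 15, 0, 45)), # 2.5 h
]
print(mttr(log))  # 2:00:00
```

Working from timestamps instead of hand-entered durations avoids arithmetic slips when an outage crosses midnight, as the second incident does.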
Example in IT environment:
- Suppose a company had 3 server outages this month:
- Outage 1: 2 hours
- Outage 2: 1.5 hours
- Outage 3: 2.5 hours
$$\text{Total downtime} = 2 + 1.5 + 2.5 = 6 \text{ hours}$$
$$\text{MTTR} = \frac{6 \text{ hours}}{3 \text{ incidents}} = 2 \text{ hours}$$
✅ This means, on average, it takes 2 hours to repair a server when it fails.
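The same arithmetic can be checked in a few lines of Python, using the three outage durations from the example (the variable names are illustrative only):

```python
# Outage durations from the server example, in hours
outages_hours = [2.0, 1.5, 2.5]

total_downtime = sum(outages_hours)               # 2 + 1.5 + 2.5 = 6.0 hours
mttr_hours = total_downtime / len(outages_hours)  # 6.0 hours / 3 incidents

print(f"Total downtime: {total_downtime} h")  # Total downtime: 6.0 h
print(f"MTTR: {mttr_hours} h")                # MTTR: 2.0 h
```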
Key Factors Affecting MTTR
Several factors can influence MTTR in IT systems:
- Availability of spare parts – If a failed hard drive is needed but not in stock, MTTR increases.
- Skill level of IT staff – More experienced technicians can troubleshoot faster.
- Monitoring and alerting systems – Faster detection of failures reduces MTTR.
- Documentation and procedures – Clear instructions and checklists speed up repairs.
- Automation and tools – Automated recovery scripts can reduce manual repair time.
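To see how one of these factors moves the average, the sketch below splits each repair into phases and compares MTTR with and without a spare part on hand. It is a hypothetical model: the `Repair` class, its three phases, and all the hour values are made up for illustration, and real incidents rarely break down this cleanly.

```python
from dataclasses import dataclass


@dataclass
class Repair:
    """Hypothetical breakdown of one repair, in hours (detection to restored)."""
    diagnose: float        # troubleshooting until the fault is identified
    wait_for_parts: float  # time waiting for a replacement component
    fix: float             # hands-on replacement, configuration, and testing

    @property
    def downtime(self) -> float:
        return self.diagnose + self.wait_for_parts + self.fix


def mttr(repairs: list[Repair]) -> float:
    """Average downtime across all repairs."""
    return sum(r.downtime for r in repairs) / len(repairs)


# Made-up example: three failed-drive incidents where the spare had to be ordered
ordered = [Repair(0.5, 24.0, 1.0), Repair(1.0, 24.0, 1.0), Repair(0.5, 24.0, 1.5)]

# Same incidents, assuming a spare drive was already on the shelf
in_stock = [Repair(r.diagnose, 0.0, r.fix) for r in ordered]

print(f"MTTR, parts ordered:  {mttr(ordered):.1f} h")   # MTTR, parts ordered:  25.8 h
print(f"MTTR, parts in stock: {mttr(in_stock):.1f} h")  # MTTR, parts in stock: 1.8 h
```

The same pattern applies to the other factors: better documentation or staff skill would shrink `diagnose`, and automation would shrink `fix`.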
MTTR vs. Other Metrics
It’s important to differentiate MTTR from related metrics:
- MTBF (Mean Time Between Failures): Measures average time a system runs before failing. MTBF focuses on uptime; MTTR focuses on downtime.
- RTO (Recovery Time Objective): The maximum acceptable downtime for a system during a disaster. MTTR should ideally be less than or equal to RTO.
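The difference between the metrics shows up when both are computed from the same outage log. This is a simplified sketch that assumes the system should run continuously for the whole period, so uptime is the period minus total downtime. The `mtbf_and_mttr` helper, the 30-day period, and the 4-hour RTO are hypothetical; the outage durations reuse the server example.

```python
from datetime import timedelta


def mtbf_and_mttr(period: timedelta,
                  downtimes: list[timedelta]) -> tuple[timedelta, timedelta]:
    """Hypothetical helper returning (MTBF, MTTR) for one system over a period.

    MTBF averages the time the system was up; MTTR averages the time it was down.
    """
    failures = len(downtimes)
    if failures == 0:
        raise ValueError("need at least one failure")
    total_down = sum(downtimes, timedelta())
    total_up = period - total_down  # assumes 24/7 expected operation
    return total_up / failures, total_down / failures


period = timedelta(days=30)
downtimes = [timedelta(hours=2), timedelta(hours=1.5), timedelta(hours=2.5)]
rto = timedelta(hours=4)  # hypothetical RTO for this server

mtbf, mttr = mtbf_and_mttr(period, downtimes)
print("MTBF:", mtbf)                                      # MTBF: 9 days, 22:00:00
print("MTTR:", mttr)                                      # MTTR: 2:00:00
print("MTTR within RTO:", mttr <= rto)                    # MTTR within RTO: True
print("Worst outage within RTO:", max(downtimes) <= rto)  # Worst outage within RTO: True
```

The last check is included because an average can hide a single outage that exceeded the RTO.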
MTTR in Disaster Recovery Planning
In a disaster recovery (DR) context, MTTR helps:
- Set expectations – Stakeholders know how quickly systems can be restored.
- Evaluate DR effectiveness – A high MTTR signals that DR procedures may need improvement.
- Prioritize resources – Focus efforts on critical systems that need the fastest repair times.
- Measure improvements – Track MTTR over time to see if IT recovery processes are getting faster.
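Tracking MTTR over time can be as simple as grouping incidents by month. The sketch below is illustrative only: the `monthly_mttr` helper and the incident log are hypothetical, and downtime is assumed to be recorded in hours.

```python
from collections import defaultdict

# Hypothetical incident log: (month, downtime in hours)
incident_log = [
    ("2024-01", 4.0), ("2024-01", 3.0), ("2024-01", 5.0),
    ("2024-02", 3.0), ("2024-02", 2.0),
    ("2024-03", 1.5), ("2024-03", 2.5), ("2024-03", 1.0), ("2024-03", 1.0),
]


def monthly_mttr(log: list[tuple[str, float]]) -> dict[str, float]:
    """Return MTTR per month, ordered by month."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for month, hours in log:
        totals[month] += hours
        counts[month] += 1
    return {m: totals[m] / counts[m] for m in sorted(totals)}


trend = monthly_mttr(incident_log)
for month, value in trend.items():
    print(f"{month}: MTTR = {value:.2f} h")

# True if MTTR never got worse from one month to the next
values = list(trend.values())
improving = all(earlier >= later for earlier, later in zip(values, values[1:]))
print("Improving every month:", improving)
```

With this made-up log, MTTR falls from 4.00 h to 2.50 h to 1.50 h, which is the kind of trend that suggests recovery procedures are getting faster.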
IT Examples of MTTR
Each example below is a single incident, so its repair time is also the MTTR for that period:
- Server Crash: A database server fails. IT restores from backup in 3 hours. MTTR = 3 hours.
- Network Switch Failure: A core switch goes down. IT replaces it in 1 hour. MTTR = 1 hour.
- Application Outage: A business-critical application stops working. Automated scripts restart the service in 15 minutes. MTTR = 15 minutes.
Tip for the exam: Always associate MTTR with “average repair time after failure”, not with prevention or uptime.
Summary for Exam
- MTTR (Mean Time to Repair) = Average time to repair a failed system and restore service.
- Purpose: Measure downtime efficiency and help improve IT recovery processes.
- Calculation: Total downtime ÷ Number of incidents.
- Key points: Lower MTTR = faster recovery; MTTR should be less than or equal to RTO; affected by staffing, tools, monitoring, and procedures.
- Usage: Critical in DR plans, IT operations, and service-level agreements (SLAs).
