Fault tolerance - Learn Tech From Zero

2.4 Explain the key concepts of high availability for servers.

📘CompTIA Server+ (SK0-005)

Fault tolerance is the ability of a server or system to keep running when a component fails. It is central to high availability because services and applications stay online, or recover within seconds, even when part of the system stops working.

Fault tolerance can be achieved in two main ways:

Server-level redundancy
Component-level redundancy

1. Server-Level Redundancy

Definition:
Server-level redundancy means running more than one server for the same task so that if one server fails, another takes over. It is usually implemented with clustering or load balancing.

Key Points:

The servers are usually configured in a cluster (active-active or active-passive).
If one server crashes, another server in the cluster continues serving clients.
It’s like having a backup server ready to run at any time.

Example in IT environment:

A company has a web server cluster hosting a website.
Server A and Server B are in an active-passive setup.
If Server A fails, Server B automatically takes over, usually after a brief failover delay while the failure is detected.
In most cases, users barely notice that one server failed.
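
The failover in this example can be sketched in a few lines of Python. This is a minimal illustration, not how any particular clustering product works: the Server and ActivePassiveCluster classes are hypothetical, and the healthy flag stands in for a real heartbeat or health check.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """A hypothetical cluster node; `healthy` stands in for a real heartbeat check."""
    name: str
    healthy: bool = True


class ActivePassiveCluster:
    """Sends every request to the active node and promotes the standby on failure."""

    def __init__(self, active: Server, standby: Server) -> None:
        self.active = active
        self.standby = standby

    def handle_request(self, request: str) -> str:
        # Fail over only when the active node is down and the standby is usable.
        if not self.active.healthy:
            if not self.standby.healthy:
                raise RuntimeError("no healthy server available")
            self.active, self.standby = self.standby, self.active
        return f"{self.active.name} served {request}"


server_a = Server("Server A")
server_b = Server("Server B")
cluster = ActivePassiveCluster(active=server_a, standby=server_b)

print(cluster.handle_request("GET /"))  # Server A served GET /
server_a.healthy = False                # simulate a crash of Server A
print(cluster.handle_request("GET /"))  # Server B served GET /
```

Real clusters detect the failure over the network, so the switch takes a short time instead of happening instantly as it does here.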

Advantages:

Minimizes downtime.
Can handle more load by distributing traffic (active-active setup).
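
To illustrate the active-active case, here is a rough sketch of round-robin load distribution that skips failed servers. The make_balancer function, the server names, and the down set are assumptions made for this example; a production load balancer would run its own health checks.

```python
from itertools import cycle


def make_balancer(servers: list[str], is_healthy):
    """Return a function that picks the next healthy server in round-robin order."""
    rotation = cycle(servers)

    def next_server() -> str:
        # Try each server at most once per call so a fully failed pool raises.
        for _ in range(len(servers)):
            candidate = next(rotation)
            if is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy server available")

    return next_server


down: set[str] = set()
pick = make_balancer(["Server A", "Server B"], lambda name: name not in down)

first_two = [pick(), pick()]      # both servers share the traffic
down.add("Server A")              # simulate Server A failing
after_failure = [pick(), pick()]  # Server B now takes every request
print(first_two, after_failure)
```

While both servers are healthy they split the requests; once one fails, the survivor carries the full load, so it must be sized for it.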

Considerations:

Requires extra servers, which means higher cost.
Needs proper configuration of clustering software and network.

2. Component-Level Redundancy

Definition:
Component-level redundancy means duplicating parts inside a single server, rather than having multiple servers. If one component fails, the server can continue running using the backup component.

Key Components that can be redundant:

Power supplies: Dual power supplies so if one fails, the other keeps the server powered.
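Network cards (NICs): Multiple NICs, usually combined through NIC teaming (bonding), so if one fails, the other carries the traffic.
Hard drives: RAID (Redundant Array of Independent Disks) levels that mirror data or store parity (for example RAID 1, 5, 6, or 10), so the array survives a disk failure. Striping alone (RAID 0) adds no redundancy.
Memory (RAM): ECC memory detects and corrects single-bit errors. Some servers also support memory mirroring, which keeps a second copy of the data so a failed memory module does not crash the server.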
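
For the RAID entry above, the sketch below compares common RAID levels by usable capacity and by how many disk failures they are guaranteed to survive. The raid_summary function is a hypothetical helper written for this example, and it assumes equal-sized disks and a valid disk count for each level.

```python
def raid_summary(level: str, disks: int, disk_tb: float) -> tuple[float, int]:
    """Return (usable capacity in TB, disk failures guaranteed to be survivable)."""
    if level == "RAID 0":   # striping only: no redundancy
        return disks * disk_tb, 0
    if level == "RAID 1":   # every disk holds a full copy
        return disk_tb, disks - 1
    if level == "RAID 5":   # one disk's worth of parity
        return (disks - 1) * disk_tb, 1
    if level == "RAID 6":   # two disks' worth of parity
        return (disks - 2) * disk_tb, 2
    if level == "RAID 10":  # striped two-way mirrors; worst case is one failure
        return disks / 2 * disk_tb, 1
    raise ValueError(f"unsupported level: {level}")


for level in ("RAID 0", "RAID 1", "RAID 5", "RAID 6", "RAID 10"):
    print(level, raid_summary(level, disks=4, disk_tb=2.0))
```

RAID 10 can survive more than one failure when the failed disks sit in different mirror pairs, but only one failure is guaranteed.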

Example in IT environment:

A database server has 2 power supplies.
One power supply fails. The server continues running on the second supply, provided that supply can carry the full load on its own.
Or a server uses RAID 1 (mirrored disks). If one hard drive fails, the data is still available from the second drive.
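
The RAID 1 behavior can be modeled with a toy example. The Raid1Mirror class below is purely illustrative and assumes a two-disk mirror held in memory; real RAID is handled by a controller or the operating system and also rebuilds a replaced disk, which this sketch leaves out.

```python
class Raid1Mirror:
    """Toy two-disk mirror: writes go to every working disk, reads use any working disk."""

    def __init__(self) -> None:
        self.disks: list[dict[int, bytes]] = [{}, {}]
        self.failed: set[int] = set()

    def write(self, block: int, data: bytes) -> None:
        working = [i for i in range(len(self.disks)) if i not in self.failed]
        if not working:
            raise IOError("all disks have failed")
        for i in working:
            self.disks[i][block] = data

    def read(self, block: int) -> bytes:
        # Return the block from the first disk that is still working.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                return disk[block]
        raise IOError("all disks have failed")

    def fail_disk(self, index: int) -> None:
        self.failed.add(index)


mirror = Raid1Mirror()
mirror.write(0, b"customer orders")
mirror.fail_disk(0)    # simulate the first drive failing
print(mirror.read(0))  # the data is still served from the second drive
```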

Advantages:

Provides redundancy without needing extra servers.
Lower cost compared to server-level redundancy.

Considerations:

Only protects against failures of the duplicated hardware. It does not cover software faults or the loss of the whole server (for example, a failed motherboard).
Needs careful planning to ensure critical components are redundant.

Comparison: Server-Level vs Component-Level Redundancy

Feature	Server-Level Redundancy	Component-Level Redundancy
What fails?	Entire server	Individual components (disk, power, NIC, etc.)
Cost	Higher (requires extra servers)	Lower (just duplicate components)
Protection against	Server failure, heavy load	Hardware failures only
Example	Web server cluster	RAID disks, dual power supplies
Complexity	High (clustering and failover configuration)	Medium (hardware setup inside server)
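
One way to reason about the cost-versus-protection trade-off is simple availability arithmetic. The sketch below is a simplified model that assumes failures are independent, which real systems only approximate; the function names and the 99% figure are illustrative, not measured values.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of redundant units when any one working copy is enough."""
    return 1 - (1 - single) ** copies


def series_availability(*parts: float) -> float:
    """Availability of a chain in which every part must work."""
    result = 1.0
    for part in parts:
        result *= part
    return result


one_server = 0.99                                  # hypothetical 99% availability
two_servers = parallel_availability(one_server, 2)
print(f"one server:  {one_server:.4%}")
print(f"two servers: {two_servers:.4%}")           # roughly 99.99%
```

Under these assumptions, a second server turns about 1% expected downtime into about 0.01%. The same parallel formula applies to duplicated components such as power supplies, while parts that are not duplicated multiply together as in series_availability.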

Exam Tips:

Remember: server-level redundancy = multiple servers; component-level redundancy = backup parts inside a server.
Think about failover scenarios:
- Server-level redundancy handles a full server crash.
- Component-level redundancy handles a hardware failure such as a failed power supply or hard drive.
Questions may ask about cost vs protection:
- Server-level is more expensive but protects from server crashes.
- Component-level is cheaper but limited to hardware issues.

In short:

Fault tolerance keeps your system running even if something fails.
Server-level redundancy = multiple servers, protects against server failure.
Component-level redundancy = multiple parts inside one server, protects against hardware failure.