Greenwich Foot Tunnel , London, England

Image by: Maria Teneva

About System Design, Chapter 2

Published on: February 3, 2025

In this second part of the series, I’ll focus on key properties that affect how systems behave: Consistency, Availability, Partition Tolerance, Latency, Durability, Fault Tolerance, and Scalability.

Table of Contents

Introduction

In the first part of this series, I introduced some fundamental system design concepts. Now, in this second part, I’ll focus on key properties that affect how systems behave: Consistency, Availability, Partition Tolerance, Latency, Durability, Fault Tolerance, and Scalability.

These concepts help us understand how systems handle failures, performance, and growth. I’ll explain each one in a simple way with practical examples. Let’s get started!

Consistency

Consistency means that a system or data remains correct and aligned with certain rules and expectations at all times.

Data Consistency

It means that a dataset or database is always in a valid and consistent state. Example, in an e-commerce site, if a product shows 10 items in stock but goes negative when multiple users buy it, that is an inconsistency.

It is an important part of ACID (Atomicity, Consistency, Isolation, Durability), especially in relational databases. Example, in a banking system, when money is withdrawn from an account, the same amount should be deducted elsewhere. If one part updates but the other does not, it creates an inconsistency.

Okay, what is the ACID?

We don’t need to go deeper into these concepts for now—just knowing what they are is enough.

Consistency in Distributed Systems

In a system spread across multiple servers or data centers, all nodes must have the same data or update it in a specific way.

They are primarily two types of consistency models that can be used in distributed systems:

We get approximately a table like this.

FeatureStrong ConsistencyEventual Consistency
SpeedSlower due to synchronization overhead.Faster because updates propagate asynchronously.
AvailabilityLower, as system may reject requests during synchronization.Higher, as updates are accepted without waiting for synchronization.
Use CasesFinancial systems, critical applications.Social media feeds, caching systems.
ComplexityMore complex due to strict coordination.Simpler and more scalable.
Trade-offMore consistent data, but slower operations.Faster operations, but less consistent data.

Okay, let’s see a diagram to understand this better.

graph TD;
    A[n = Total Replicas] -->|Read from r replicas| B(Read Process);
    A -->|Write to w replicas| C(Write Process);
    B --> D{r + w > n ?};
    C --> D;
    D -->|Yes| E(Strong Consistency);
    D -->|No| F(Eventual Consistency);

For a practical example, you can check out my simple implementation using Go and Redis at github.com/yplog/consistency-demo.

Availability

Availability is a critical concept in distributed systems to ensure that the system is always accessible. Techniques like redundancy, fault tolerance, partition tolerance, and load balancing help improve availability in system design.

Availability is usually measured by uptime percentage:

Partition Tolerance

Partition Tolerance means that a distributed system can keep working even if there are network partitions.

A network partition happens when the connection between nodes or data centers is lost. Systems with Partition Tolerance can still serve users and prevent data loss even when this happens.

graph TD;
    A[Client Request] -->|Request| B[Server 1];
    A -->|Request| C[Server 2];

    subgraph Network Partition
        B -.->|Network Failure| C;
    end

    B -->|Response| A;
    C -->|Response| A;

Latency

Latency is the time between sending a request and receiving a response. In other words, it is the time it takes for data to travel from one point to another. Generally, it is measured in milliseconds.

Durability

Durability means that data is stored permanently and does not get lost. If data is successfully written, it will not disappear even if the system crashes. To keep data safe, a distributed system can use methods like copying data (replication) and saving backups.

Fault Tolerance

Fault Tolerance means a system can keep working even if some parts fail. If a server crashes, the network goes down, or hardware breaks, the system should still provide service.

It might sound more like the concept of Availability.

A Fault Tolerant system usually has high Availability, but a high Availability system is not always Fault Tolerant.

Scalability

Scalability is a system’s ability to handle more load and users while keeping good performance. This means the system should stay fast and work smoothly as demand increases.

Vertical Scaling

Vertical Scaling is the process of increasing the resources of a single server.

graph TD;
    A[Single Server] --> B[More CPU];
    A --> C[More RAM];
    A --> D[More Storage];

    B & C & D --> E[Improved Performance];

Horizontal Scaling

Horizontal Scaling is the process of increasing the number of servers in a system.

graph TD;
    A[Load Balancer] -->|Distribute Requests| B[Server 1];
    A -->|Distribute Requests| C[Server 2];
    A -->|Distribute Requests| D[Server 3];

Conclusion

In this post, we learned about the key properties that affect how systems behave: Consistency, Availability, Partition Tolerance, Latency, Durability, Fault Tolerance, and Scalability.

In the next chapter, I will discuss Distributed System Theorems.

References

And further reading:

Tags: #System Design #Software Architecture #en