Distributed System Design Patterns

Original link: https://colobu.com/2022/06/26/distributed-system-design-patterns/


Key design patterns for common problems associated with distributed systems.

Bloom filter

A Bloom filter is a space-efficient probabilistic data structure for testing whether an element is a member of a set. It can return false positives but never false negatives, so it is used in scenarios where we only need a quick check of whether an element might belong to a set.

In BigTable (and Cassandra), any read operation must consult the SSTables that make up a Tablet. If these SSTables are not in memory, a read may end up performing many disk accesses to load them. To reduce the number of disk accesses, BigTable uses Bloom filters, which let it ask whether an SSTable might contain data for a given row/column pair without reading the SSTable from disk.
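
To make the idea concrete, here is a minimal Bloom filter sketch in Go (illustrative only, not BigTable's implementation); it derives k bit positions from two FNV hashes via double hashing, and the sizes and hash choices are arbitrary assumptions:

```go
// A minimal Bloom filter sketch: k bit positions per key are derived
// from two FNV hashes via double hashing.
package main

import (
	"fmt"
	"hash/fnv"
)

type BloomFilter struct {
	bits []bool
	k    uint32
}

func NewBloomFilter(m int, k uint32) *BloomFilter {
	return &BloomFilter{bits: make([]bool, m), k: k}
}

// positions derives k bit positions for a key using double hashing.
func (bf *BloomFilter) positions(key string) []uint32 {
	h1 := fnv.New32a()
	h1.Write([]byte(key))
	a := h1.Sum32()
	h2 := fnv.New32()
	h2.Write([]byte(key))
	b := h2.Sum32() | 1 // odd step so it cycles through the whole table
	pos := make([]uint32, bf.k)
	for i := uint32(0); i < bf.k; i++ {
		pos[i] = (a + i*b) % uint32(len(bf.bits))
	}
	return pos
}

func (bf *BloomFilter) Add(key string) {
	for _, p := range bf.positions(key) {
		bf.bits[p] = true
	}
}

// MightContain returns false only if the key was definitely never added.
func (bf *BloomFilter) MightContain(key string) bool {
	for _, p := range bf.positions(key) {
		if !bf.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	bf := NewBloomFilter(1024, 3)
	bf.Add("row:42")
	fmt.Println(bf.MightContain("row:42")) // true
	fmt.Println(bf.MightContain("row:99")) // very likely false
}
```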

Consistent hashing

Consistent hashing makes it easy to scale a cluster and to replicate data efficiently, which improves availability and fault tolerance.

Each data item identified by a key is assigned to a node by hashing the item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node whose position is greater than the item’s position. That node becomes the home of the data item.

The main advantage of consistent hashing is incremental stability: nodes joining or leaving the cluster only affect their immediate neighbors, while all other nodes are unaffected.
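
The ring lookup described above can be sketched in a few lines of Go. This toy version (an illustrative assumption, with no virtual nodes or replication) hashes both nodes and keys with CRC32 and walks clockwise via a binary search:

```go
// A minimal consistent-hash ring: keys and nodes are hashed onto the same
// 32-bit ring, and a key is owned by the first node found clockwise from
// the key's position.
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

type Ring struct {
	hashes []uint32          // sorted node positions on the ring
	nodes  map[uint32]string // position -> node name
}

func NewRing(nodes []string) *Ring {
	r := &Ring{nodes: make(map[uint32]string)}
	for _, n := range nodes {
		h := crc32.ChecksumIEEE([]byte(n))
		r.hashes = append(r.hashes, h)
		r.nodes[h] = n
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// Locate walks clockwise from the key's hash to the first node position.
func (r *Ring) Locate(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) { // wrapped past the largest position
		i = 0
	}
	return r.nodes[r.hashes[i]]
}

func main() {
	ring := NewRing([]string{"node-a", "node-b", "node-c"})
	fmt.Println(ring.Locate("user:1001"))
}
```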

Quorum

In a distributed environment, a quorum is the minimum number of servers that must successfully execute a distributed operation before the operation can be confirmed as successful.

In Cassandra, to ensure data consistency, each write request can be configured to succeed only if the data has been written to at least a quorum (i.e. a majority) of replica nodes.

For leader election, Chubby uses Paxos, which uses quorum to ensure strong consensus.

Dynamo replicates writes to a sloppy quorum of other nodes in the system, rather than to a strict majority quorum as Paxos does. All read/write operations are performed on the first N healthy nodes in the preference list, which may not always be the first N nodes encountered while traversing the consistent hash ring.
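
A majority-quorum write can be sketched as follows; the Replica interface and the in-memory replicas are illustrative assumptions, not any particular database's API:

```go
// A quorum write sketch: the coordinator sends the write to all replicas
// and reports success once a majority (N/2+1) acknowledge.
package main

import "fmt"

type Replica interface {
	Write(key, value string) bool
}

type memReplica struct {
	up   bool
	data map[string]string
}

func (r *memReplica) Write(key, value string) bool {
	if !r.up {
		return false
	}
	r.data[key] = value
	return true
}

// quorumWrite returns true only if a majority of replicas acked the write.
func quorumWrite(replicas []Replica, key, value string) bool {
	quorum := len(replicas)/2 + 1
	acks := 0
	for _, r := range replicas {
		if r.Write(key, value) {
			acks++
		}
	}
	return acks >= quorum
}

func main() {
	replicas := []Replica{
		&memReplica{up: true, data: map[string]string{}},
		&memReplica{up: true, data: map[string]string{}},
		&memReplica{up: false, data: map[string]string{}}, // one node is down
	}
	fmt.Println(quorumWrite(replicas, "k", "v")) // true: 2 of 3 acked
}
```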

Leader and Follower

To achieve fault tolerance in systems that manage data, data needs to be replicated across multiple servers.

Choose a server in the cluster to be the leader. The leader is responsible for making decisions on behalf of the entire cluster and propagating decisions to all other servers.

In small clusters of three to five nodes, such as systems that implement consensus, leader election can be implemented within the data cluster itself, independent of any external system. Leader election takes place at server startup: each server starts a leader election at startup and tries to elect a leader, and the system does not accept any client requests until a leader has been elected.

Heartbeat

The heartbeat mechanism is used to detect if an existing leader fails so that a new leader election can be initiated.
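
A simple follower-side heartbeat monitor might look like the sketch below; the timeout value and type names are assumptions for illustration:

```go
// Heartbeat-based failure detection: the follower records when the last
// heartbeat arrived and, if none arrives within the timeout, suspects the
// leader and would start a new election.
package main

import (
	"fmt"
	"sync"
	"time"
)

type FollowerMonitor struct {
	mu            sync.Mutex
	lastHeartbeat time.Time
	timeout       time.Duration
}

func (f *FollowerMonitor) OnHeartbeat() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.lastHeartbeat = time.Now()
}

// LeaderSuspected reports whether the leader missed its heartbeat deadline.
func (f *FollowerMonitor) LeaderSuspected() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	return time.Since(f.lastHeartbeat) > f.timeout
}

func main() {
	m := &FollowerMonitor{lastHeartbeat: time.Now(), timeout: 150 * time.Millisecond}
	m.OnHeartbeat()
	time.Sleep(200 * time.Millisecond) // no heartbeats arrive in time
	if m.LeaderSuspected() {
		fmt.Println("leader suspected dead: start a new election")
	}
}
```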

Fencing

In the leader-follower model, when the leader fails, it is impossible to be certain that the leader has actually stopped working. For example, a slow network or a network partition may trigger a new leader election even though the previous leader is still running and believes it is still the active leader.

Fencing refers to placing a fence around a previously active leader so that it cannot access cluster resources and therefore cannot serve any read/write requests.

Use the following two techniques:

  • Resource blocking: the system blocks the previously active leader from accessing the resources needed to perform essential tasks.
  • Node fencing: the system blocks the previously active leader from accessing all resources. A common way to do this is to power off or reset the node.
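
One common way to implement resource blocking is a fencing token, sketched below under the assumption that each newly elected leader receives a monotonically increasing token; the shared resource then rejects requests carrying an older token:

```go
// Fencing with tokens: a shared resource remembers the highest token it has
// seen and rejects writes carrying an older token, so a deposed leader can
// no longer do damage.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type FencedStore struct {
	mu           sync.Mutex
	highestToken int64
	data         map[string]string
}

var ErrFenced = errors.New("request fenced: stale leader token")

// Write accepts the request only if its token is at least as new as any
// token the store has already seen.
func (s *FencedStore) Write(token int64, key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highestToken {
		return ErrFenced
	}
	s.highestToken = token
	s.data[key] = value
	return nil
}

func main() {
	store := &FencedStore{data: map[string]string{}}
	fmt.Println(store.Write(2, "k", "from new leader")) // <nil>
	fmt.Println(store.Write(1, "k", "from old leader")) // request fenced
}
```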

WAL (Write-ahead Log)

Write-ahead logging is a well-established solution to the problem of file system inconsistency in operating systems. Inspired by database management systems, this approach first logs a summary of the operation to be performed before actually applying it on disk. In the event of a crash, the OS simply checks this log and picks up where it left off.
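
The core discipline, log first and then apply, can be sketched in Go; the file name, record format, and key-value store below are illustrative assumptions:

```go
// Write-ahead logging sketch: every update is appended to the log and
// flushed to disk before it is applied to the in-memory state, so a crash
// can be recovered by replaying the log.
package main

import (
	"fmt"
	"os"
)

type KVStore struct {
	wal  *os.File
	data map[string]string
}

// Set logs the operation first, forces it to disk, then applies it.
func (s *KVStore) Set(key, value string) error {
	record := fmt.Sprintf("SET %s %s\n", key, value)
	if _, err := s.wal.WriteString(record); err != nil {
		return err
	}
	if err := s.wal.Sync(); err != nil { // durable before we acknowledge
		return err
	}
	s.data[key] = value
	return nil
}

func main() {
	f, err := os.OpenFile("store.wal", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	store := &KVStore{wal: f, data: map[string]string{}}
	if err := store.Set("user:1", "alice"); err != nil {
		panic(err)
	}
	fmt.Println("applied after logging:", store.data["user:1"])
}
```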

Segmented log

Split the log into multiple smaller files instead of a single large file for easier manipulation.

A single log file can keep growing and become a performance bottleneck when it is read at startup. Older logs need to be cleaned up periodically, and cleanup operations are difficult to perform on a single huge file.

A single log is split into multiple segments, and the log file is rolled over once it reaches a specified size limit. With log segmentation, there must be an easy way to map logical log offsets (or log sequence numbers) to segment files.
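
The offset-to-segment mapping might look like the sketch below; naming each segment file after the first offset it contains mirrors Kafka's convention, and the rest is an illustrative assumption:

```go
// Locating a segment by logical offset: each segment is identified by the
// first offset it contains, and a lookup finds the segment with the greatest
// base offset not larger than the requested offset.
package main

import (
	"fmt"
	"sort"
)

type Segment struct {
	BaseOffset int64
	File       string
}

// segmentFor assumes segments are sorted by BaseOffset in ascending order.
func segmentFor(segments []Segment, offset int64) Segment {
	i := sort.Search(len(segments), func(i int) bool {
		return segments[i].BaseOffset > offset
	})
	if i == 0 {
		return segments[0] // offset precedes retained data; return oldest segment
	}
	return segments[i-1] // the last segment whose base offset <= offset
}

func main() {
	segments := []Segment{
		{0, "00000000000000000000.log"},
		{1000, "00000000000000001000.log"},
		{2000, "00000000000000002000.log"},
	}
	fmt.Println(segmentFor(segments, 1500).File) // 00000000000000001000.log
}
```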

High-water mark

Tracks the last log entry on the leader that has been successfully replicated to a quorum of followers. The index of this entry in the log is known as the high-water mark index. The leader exposes data only up to the high-water mark index.
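
A quorum-based high-water mark can be computed from the offsets each replica has acknowledged, as in the sketch below (an illustration of the idea, not any system's exact rule):

```go
// High-water mark sketch: an entry is safe to expose once a quorum of
// replicas has it, so the high-water mark is the largest index replicated
// to at least a majority.
package main

import (
	"fmt"
	"sort"
)

// highWaterMark takes the highest log index acknowledged by each replica
// (including the leader) and returns the largest index held by a quorum.
func highWaterMark(ackedIndexes []int64) int64 {
	sorted := append([]int64(nil), ackedIndexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1] // the quorum-th highest acknowledged index
}

func main() {
	// Leader has written up to 10; followers have replicated up to 9 and 5.
	fmt.Println(highWaterMark([]int64{10, 9, 5})) // 9
}
```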

Kafka: To handle non-repeatable reads and ensure data consistency, the Kafka broker keeps track of the high-water mark, which is the largest offset of a partition that has been replicated to all in-sync replicas. Consumers can only see messages up to the high-water mark.

Lease

A lease is like a lock, but it works even if the client goes away. A client asks for a lease for a limited period of time, after which the lease expires. If the client wants to extend the lease, it can renew it before it expires.

Chubby clients maintain time-bound session leases with the leader. During this interval, the leader guarantees that the session will not be terminated unilaterally.
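
A time-bound lease with renewal might be sketched like this; the in-memory lease table and the TTL are assumptions for illustration, not Chubby's API:

```go
// Lease sketch: the holder must renew before the expiry time, otherwise the
// lease lapses and another client may acquire it.
package main

import (
	"fmt"
	"sync"
	"time"
)

type Lease struct {
	Holder    string
	ExpiresAt time.Time
}

type LeaseManager struct {
	mu     sync.Mutex
	leases map[string]Lease
	ttl    time.Duration
}

// Acquire grants the lease if it is free, expired, or already held by the caller.
func (m *LeaseManager) Acquire(name, client string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	l, ok := m.leases[name]
	if ok && time.Now().Before(l.ExpiresAt) && l.Holder != client {
		return false // still held by someone else
	}
	m.leases[name] = Lease{Holder: client, ExpiresAt: time.Now().Add(m.ttl)}
	return true
}

// Renew extends the lease only if the caller still holds it (or it expired).
func (m *LeaseManager) Renew(name, client string) bool {
	return m.Acquire(name, client)
}

func main() {
	m := &LeaseManager{leases: map[string]Lease{}, ttl: 2 * time.Second}
	fmt.Println(m.Acquire("master-lock", "client-a")) // true
	fmt.Println(m.Acquire("master-lock", "client-b")) // false, not expired yet
	fmt.Println(m.Renew("master-lock", "client-a"))   // true, extends the lease
}
```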

Gossip protocol

The Gossip protocol is a peer-to-peer communication mechanism in which nodes periodically exchange state information about themselves and other nodes they know of.

Each node initiates a gossip round every second to exchange state information about itself and other nodes with another random node.
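
The heart of a gossip round is merging two version-stamped views of the cluster; in the sketch below (with assumed field names), the entry with the higher version wins for every node:

```go
// One gossip exchange: each node keeps a version-stamped view of the
// cluster, and two views are merged by keeping the newer entry per node.
package main

import "fmt"

type NodeState struct {
	Version int64
	Status  string
}

// merge folds the peer's view into the local view; newer versions win.
func merge(local, remote map[string]NodeState) {
	for node, rs := range remote {
		if ls, ok := local[node]; !ok || rs.Version > ls.Version {
			local[node] = rs
		}
	}
}

func main() {
	a := map[string]NodeState{
		"node-a": {Version: 5, Status: "alive"},
		"node-c": {Version: 1, Status: "alive"},
	}
	b := map[string]NodeState{
		"node-b": {Version: 3, Status: "alive"},
		"node-c": {Version: 2, Status: "suspected"},
	}
	merge(a, b) // one gossip round from b to a
	fmt.Println(a["node-b"].Status, a["node-c"].Status) // alive suspected
}
```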

Phi Accrual Failure Detection

This algorithm uses historical heartbeat information to make the detection threshold adaptive. Instead of telling whether a server is alive or not, a generic accrual failure detector outputs a suspicion level about the server.

Cassandra uses the Phi Accrual Failure Detector algorithm to determine the state of the nodes in the cluster.
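
A much-simplified version of the idea is sketched below: assuming exponentially distributed heartbeat intervals (an assumption for illustration, not Cassandra's exact implementation), the time since the last heartbeat is converted into a suspicion level phi, and the caller compares phi against a threshold instead of a fixed timeout:

```go
// Simplified phi accrual sketch: phi = -log10(P(a heartbeat arrives later)),
// using the observed mean interval and an exponential arrival model.
package main

import (
	"fmt"
	"math"
	"time"
)

// phi grows as the silence gets longer relative to the usual interval.
func phi(sinceLast, meanInterval time.Duration) float64 {
	pLater := math.Exp(-sinceLast.Seconds() / meanInterval.Seconds())
	return -math.Log10(pLater)
}

func main() {
	mean := 1 * time.Second // learned from the heartbeat history
	fmt.Printf("phi after 1s:  %.2f\n", phi(1*time.Second, mean))  // ~0.43, looks alive
	fmt.Printf("phi after 10s: %.2f\n", phi(10*time.Second, mean)) // ~4.34, very suspicious
}
```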

Split brain

A scenario where a distributed system has two or more active leaders is called split-brain.

The split-brain problem can be solved by using the Generation Clock, which is just a monotonically increasing number that indicates the generation of the server.

The generation number is incremented each time a new leader is elected. This means that if the old leader’s generation number was “1”, the new leader’s will be “2”. This generation number is included in every request sent from the leader to other nodes, so nodes can easily tell the real leader apart by simply trusting the leader with the higher generation number.
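
The follower-side check is tiny, as the sketch below shows (type and method names are assumptions):

```go
// Generation clock sketch: every leader request carries the leader's
// generation number, and a node rejects requests whose generation is lower
// than the highest one it has already seen.
package main

import "fmt"

type Follower struct {
	currentGeneration int64
}

// Accept returns false for requests from a stale (deposed) leader.
func (f *Follower) Accept(generation int64) bool {
	if generation < f.currentGeneration {
		return false
	}
	f.currentGeneration = generation
	return true
}

func main() {
	f := &Follower{}
	fmt.Println(f.Accept(1)) // true: old leader, generation 1
	fmt.Println(f.Accept(2)) // true: new leader elected, generation 2
	fmt.Println(f.Accept(1)) // false: the old leader is ignored
}
```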

Kafka: To handle split brain (where multiple controller brokers might otherwise be active at the same time), Kafka uses an “epoch number”, which is simply a monotonically increasing number that represents the server’s generation.

HDFS: ZooKeeper is used to ensure that only one NameNode is active at any one time. The epoch number is maintained as part of each transaction ID to reflect the NameNode’s generation.

Checksum

In a distributed system, when data is moved between components, data fetched from nodes can become corrupted.

Calculate the checksum and store it with the data.

To calculate the checksum, use a cryptographic hash function such as MD5, SHA-1, SHA-256, or SHA-512. A hash function takes input data and produces a fixed-length string (containing letters and numbers); this string is called a checksum.

When the system stores some data, it calculates the checksum of the data and stores the checksum with the data. When the client retrieves data, it verifies that the data received from the server matches the stored checksum. If not, the client may choose to retrieve that data from another replica.
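
With Go's standard library this looks roughly like the following (the sample data is made up):

```go
// Checksum sketch with SHA-256: the checksum travels with the data, and the
// reader recomputes it to detect corruption.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func checksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	data := []byte("hello, replica")
	stored := checksum(data) // stored alongside the data

	// Later, the client verifies what it read back.
	corrupted := []byte("hellO, replica")
	fmt.Println(checksum(data) == stored)      // true: data is intact
	fmt.Println(checksum(corrupted) == stored) // false: fetch from another replica
}
```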

HDFS and Chubby store the checksum of each file along with the data.

CAP theorem

The CAP theorem states that it is impossible for a distributed system to provide all three of the following desirable properties at the same time:

Consistency (C), Availability (A), and Partition Tolerance (P).

According to the CAP theorem, any distributed system needs to choose two out of three properties. The three options are CA, CP, and AP.

Dynamo: In CAP theorem terminology, Dynamo belongs to the category of AP systems designed to achieve high availability at the expense of strong consistency.

BigTable: BigTable is a CP system as far as the CAP theorem is concerned, i.e. it has strictly consistent reads and writes.

PACELC theorem

The PACELC theorem states that in a system that replicates data:

  • If there is a partition (‘P’), the distributed system can trade off availability and consistency (i.e. ‘A’ and ‘C’);
  • Otherwise (‘E’), the system can trade off between latency (‘L’) and consistency (‘C’) when the system is operating normally without partitions.

The first part of the theorem (PAC) is the same as the CAP theorem; ELC is the extension. The whole argument assumes that we maintain high availability through replication. When a partition occurs, the CAP theorem prevails. But when there is no partition, we still have to consider the trade-off between consistency and latency in a replicated system.

Hinted Handoff

If nodes go down, the system retains hints (or notes) for all requests they missed. When the failed nodes recover, requests are forwarded to them based on stored hints.

When a node goes down, the leader writes a hint to a text file on its local disk. The hint contains the data together with information about the node it belongs to. When the leader realizes that a node for which it holds hints has recovered, it forwards the write request for each hint to that node.

Read repair

In a distributed system, data is replicated across multiple nodes, and some nodes may end up with outdated data.

Stale data can be fixed during read operations: at that point we read the data from multiple nodes, compare the returned versions, and find the node that holds stale data. This mechanism is called read repair. Once a node with old data is identified, the read repair operation pushes the newer version of the data to the node holding the older version.

Cassandra and Dynamo use “read repair” to push the latest version of data to nodes with older versions.
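
A coordinator-side sketch of read repair follows; using plain timestamps as versions and in-memory replicas are assumptions for illustration:

```go
// Read repair sketch: read the same key from several replicas, pick the
// value with the newest version, and push it back to any replica that
// returned an older one.
package main

import "fmt"

type Versioned struct {
	Value   string
	Version int64
}

type Replica struct {
	Name string
	Data map[string]Versioned
}

func readRepair(replicas []*Replica, key string) Versioned {
	// Find the newest version among the replicas.
	var newest Versioned
	for _, r := range replicas {
		if v, ok := r.Data[key]; ok && v.Version > newest.Version {
			newest = v
		}
	}
	// Push the newest version to replicas holding stale data.
	for _, r := range replicas {
		if v := r.Data[key]; v.Version < newest.Version {
			r.Data[key] = newest
			fmt.Println("repaired", r.Name)
		}
	}
	return newest
}

func main() {
	replicas := []*Replica{
		{Name: "r1", Data: map[string]Versioned{"k": {"new", 2}}},
		{Name: "r2", Data: map[string]Versioned{"k": {"old", 1}}},
	}
	fmt.Println(readRepair(replicas, "k").Value) // repairs r2, then prints "new"
}
```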

Merkle Trees

Read Repair resolves conflicts when processing read requests. However, if one replica is significantly behind the others, it can take a long time to resolve conflicts.

Replicas can contain large amounts of data. Naively splitting the entire key range and computing checksums for comparison is not practical, because too much data would have to be transferred. Instead, we can use a Merkle tree to compare the replicas of a range.

A Merkle tree is a binary tree of hashes, where each internal node is a hash of its two children, and each leaf node is a hash of a portion of the original data.

Comparing Merkle trees is conceptually simple:

  • Compare the root hashes of the two trees.
  • If they are equal, stop: the replicas are in sync.
  • If they differ, recursively compare the left and right children until the differing leaves, and thus the differing data ranges, are found.
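
Here is a small sketch of building and comparing Merkle trees over fixed data blocks; the block contents and the power-of-two block count are simplifying assumptions:

```go
// Merkle tree sketch: build a hash tree over data blocks on each replica,
// then walk the two trees recursively to find the blocks that differ, so
// only those ranges need to be synchronized.
package main

import (
	"crypto/sha256"
	"fmt"
)

type Node struct {
	Hash        [32]byte
	Left, Right *Node
	BlockIndex  int // valid for leaves only
}

// build creates a Merkle tree over the blocks (block count assumed a power of two).
func build(blocks [][]byte, start int) *Node {
	if len(blocks) == 1 {
		return &Node{Hash: sha256.Sum256(blocks[0]), BlockIndex: start}
	}
	mid := len(blocks) / 2
	l := build(blocks[:mid], start)
	r := build(blocks[mid:], start+mid)
	return &Node{Hash: sha256.Sum256(append(l.Hash[:], r.Hash[:]...)), Left: l, Right: r}
}

// diff reports the block indexes whose hashes differ between the two trees.
func diff(a, b *Node) []int {
	if a.Hash == b.Hash {
		return nil // identical subtrees: stop descending
	}
	if a.Left == nil { // leaf level: this block differs
		return []int{a.BlockIndex}
	}
	return append(diff(a.Left, b.Left), diff(a.Right, b.Right)...)
}

func main() {
	replicaA := [][]byte{[]byte("b0"), []byte("b1"), []byte("b2"), []byte("b3")}
	replicaB := [][]byte{[]byte("b0"), []byte("b1"), []byte("bX"), []byte("b3")}
	fmt.Println(diff(build(replicaA, 0), build(replicaB, 0))) // [2]
}
```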

To achieve anti-entropy and resolve conflicts in the background, Dynamo uses Merkle trees.
