Principle and Implementation of Consistent Hash Algorithm

In a distributed system, the traditional way to map objects to nodes is to take the hash value of the object modulo the number of nodes and assign the object to the node with that number. Under this scheme, when the number of nodes changes, most of the object-to-node mappings become invalid and the affected objects must be migrated. With a consistent hash algorithm, only a small fraction of the mappings become invalid when the number of nodes changes, so the migration cost is also small.

This paper summarizes the algorithm principle and Java implementation of consistent hashing and enumerates its applications.

1 Overview

1.1 Traditional Hash (Hard Hash)
In a distributed system with n nodes, the traditional scheme uses mod(key, n) to map data to nodes.

When the cluster is expanded or shrunk, even by a single node, the mapping becomes mod(key, n+1) or mod(key, n-1), and the mapping of most data becomes invalid.
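
To make the cost of the modulo scheme concrete, the following minimal sketch counts how many keys change node when the node count changes. The class and method names (ModuloRemapDemo, movedKeys) are hypothetical, and plain integers stand in for hash values.

```java
import java.util.stream.IntStream;

// Hypothetical helper for illustration; not part of the original article.
class ModuloRemapDemo {
    // Counts the keys in [0, totalKeys) that land on a different node
    // when the node count changes from oldNodes to newNodes.
    static long movedKeys(int totalKeys, int oldNodes, int newNodes) {
        return IntStream.range(0, totalKeys)
                .filter(key -> Math.floorMod(key, oldNodes) != Math.floorMod(key, newNodes))
                .count();
    }

    public static void main(String[] args) {
        int totalKeys = 10_000;
        long moved = movedKeys(totalKeys, 4, 5);
        System.out.printf("moved %d of %d keys%n", moved, totalKeys);
    }
}
```

For this input, growing from 4 to 5 nodes moves 8,000 of the 10,000 keys: a key keeps its node only when key mod 4 equals key mod 5, which holds for 4 out of every 20 consecutive integers.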

1.2 Consistent Hashing

In 1997, six authors, including David Karger of the Massachusetts Institute of Technology (MIT), published the academic paper “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web”.

It showed that for a hash table with K keywords and n slots (the nodes of a distributed system), adding or removing a slot requires remapping only K/n keywords on average.

1.3 Hash Metrics

A hash algorithm can be evaluated against the following indicators, all of which consistent hashing satisfies:

Balance: The hash addresses of keywords are evenly distributed in the address space so that the address space can be fully utilized,

which is a basic feature of the hash design.

Monotonicity: When the address space grows, a keyword that already has a hash address either keeps that address or moves to one of the newly added addresses; it is not remapped to a different address within the original space.

Likewise, when the address space shrinks, keywords are mapped only to addresses that are still valid.

Simple hash functions such as mod(key, n) often fail to satisfy this property.

Spread: Hashing is often used in distributed environments, where terminals (end users) store their content in different buffers through a hash function.

A terminal may not see all the buffers, only a subset of them. When terminals map content to buffers through hashing, different terminals may therefore see different sets of buffers and compute inconsistent results.

The end result is that the same content is mapped to different buffers by different terminals.

This should be avoided because storing the same content in several buffers reduces the storage efficiency of the system.

Spread is defined as the severity of this effect.

A good hash algorithm should avoid such inconsistencies as far as possible, that is, keep the spread as low as possible.

Load: Load looks at the spread problem from the other side.

Since different terminals may map the same content to different buffers, a particular buffer may also be assigned different content by different terminals.

Like spread, this should be avoided, so a good hash algorithm should keep the load on each buffer as low as possible.

2 Algorithm principle

2.1 Mapping scheme

2.1.1 Public Hash Functions and Hash Rings

Design a hash function Hash(key) whose value range is [0, 2^32).

The hash values are laid out on a ring as shown in the figure: the 12 o’clock position is 0, values increase in the clockwise direction, and the position immediately counterclockwise of 12 o’clock is 2^32-1.
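
The article does not say which hash function to use. As a sketch, and purely as an assumption, CRC32 from the Java standard library can serve as Hash(key), because its result is an unsigned 32-bit value returned as a long and so already lies in [0, 2^32). The class name RingHash is hypothetical; production systems often pick a hash with better distribution.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper for illustration; CRC32 is an assumed choice of hash.
class RingHash {
    // Number of positions on the ring: 2^32.
    static final long RING_SIZE = 1L << 32;

    // Maps a key (object key, IP address, machine name) to a ring position.
    static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // unsigned 32-bit checksum, so 0 <= value < 2^32
    }

    public static void main(String[] args) {
        long position = hash("192.168.0.1");
        System.out.println(position + " is on the ring: " + (position >= 0 && position < RING_SIZE));
    }
}
```

The same function is applied to nodes and objects, which is what places both on one shared ring.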

2.1.2 Node (Node) Mapping to Hash Ring

As shown by the green balls on the hash ring, the IP address or machine name of each of the four nodes Node A/B/C/D is passed through the same Hash() function and mapped onto the hash ring.

2.1.3 Objects are mapped to hash rings

As shown by the yellow balls on the hash ring, the key of each of the four objects Object A/B/C/D is passed through the same Hash() function and mapped onto the hash ring.

2.1.4 Objects are mapped to Nodes

After both objects and nodes are mapped onto the same hash ring, the node for an object is found by starting at the object's position and moving clockwise along the ring; the first node encountered is the node the object maps to.

In the figure, Object A/B/C/D are mapped to Node A/B/C/D respectively.
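
The clockwise lookup can be sketched with a sorted map, where "the first node clockwise" is the smallest node position greater than or equal to the object's position, wrapping to the first node when none is found. This is one common way to do it, not necessarily the implementation the article has in mind. The class HashRing and its methods are hypothetical, CRC32 is an assumed hash, and addNodeAt exists only so the example can use fixed positions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical sketch of a hash ring; names are illustrative only.
class HashRing {
    // Ring position -> node name, kept sorted by position.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Assumed hash function with range [0, 2^32).
    static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Places a node on the ring by hashing its name or address.
    void addNode(String node) {
        ring.put(hash(node), node);
    }

    // Places a node at an explicit position (useful for demonstrations).
    void addNodeAt(long position, String node) {
        ring.put(position, node);
    }

    // Returns the first node at or clockwise after the given position.
    String nodeAt(long position) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(position);
        // No node at or after this position: wrap around past 2^32-1 to 0.
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Maps an object key to its node.
    String nodeFor(String objectKey) {
        return nodeAt(hash(objectKey));
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNodeAt(100L, "Node A");
        ring.addNodeAt(200L, "Node B");
        ring.addNodeAt(300L, "Node C");
        ring.addNodeAt(400L, "Node D");
        System.out.println(ring.nodeAt(150L)); // Node B
        System.out.println(ring.nodeAt(450L)); // Node A (wraps around)
    }
}
```

A TreeMap gives an O(log n) lookup per object. If two nodes hash to the same position, this sketch lets the later one overwrite the earlier one; a real implementation would need to handle that case.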

2.2 Delete node

Real-world scenario: nodes are deleted when the cluster is scaled down, or when some nodes go down.

As shown in the figure below, deleting Node C affects only the objects between Node C and the previous node, Node B (taking clockwise as the forward direction), that is, Object C.

These objects are remapped, according to the rule in 2.1.4, to Node D, the next node after the deleted one.

The mappings of all other objects do not need to be adjusted.
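
The effect of deleting a node can be checked with a small sketch that compares each object's node before and after the deletion. The class NodeRemovalDemo, its methods, and the ring positions 100 to 400 are made up for illustration; only the clockwise rule from 2.1.4 comes from the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demonstration; positions and names are illustrative only.
class NodeRemovalDemo {
    // Clockwise lookup: first node at or after the position, wrapping to the start.
    static String locate(TreeMap<Long, String> ring, long position) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(position);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Returns the object positions whose node changes when the node
    // at nodePosition is deleted. The input ring is left unmodified.
    static List<Long> remappedAfterRemoval(TreeMap<Long, String> ring,
                                           long nodePosition,
                                           List<Long> objectPositions) {
        TreeMap<Long, String> after = new TreeMap<>(ring);
        after.remove(nodePosition);
        List<Long> moved = new ArrayList<>();
        for (long position : objectPositions) {
            if (!locate(ring, position).equals(locate(after, position))) {
                moved.add(position);
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(100L, "Node A");
        ring.put(200L, "Node B");
        ring.put(300L, "Node C");
        ring.put(400L, "Node D");
        // One object just before each node, standing in for Object A/B/C/D.
        List<Long> objects = Arrays.asList(50L, 150L, 250L, 350L);
        // Delete Node C (position 300) and list the objects that moved.
        System.out.println(remappedAfterRemoval(ring, 300L, objects)); // [250]
    }
}
```

Only the object at position 250, which lies between Node B and Node C, changes node, and it moves to Node D. The other three objects keep their mappings.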