Partition is the technique of breaking a large data set into small manageable chunks.

It is important to define partition strategy such that a single piece of information belongs to a single partition only, good partition strategy enables data level parallelism.

Partition is not a new idea; it comes from the RDBMS world where it is used for creating partition indexes.

Atomicity between partitions, especially for writing, is hard to achieve and it requires some tradeoff.

Partition should not be confused with replication, replication is keeping multiple copies of a single partition for high availability or fault tolerance.

Partition + Replication makes any data system fully distributed.

In this post we will look at different approaches to data partitioning. Lets pick email address dataset as a sample data for partition example.

Let's take bill.gates@microsoft.com email as an example.

This email can be partitioned by hash(bill.gates@microsoft.com) or by range(bill.gates@microsoft.com) based on how it is queried.

Hash

Hash one is simplest to understand but when we are in a distributed environment then it is not that simple because nodes that are storing data can go offline or new nodes are added to manage load. Due to dynamic node numbers we don't have a constant denominator to find mod for key.

It is important to understand why hash table type of data structure can't be used for data partitioning in distributed systems.

Classic hash table starts with N capacity and based on load factor(i.e how much it is filled) it will resize and all the keys need to rehash based on new capacity.

This resize is an expensive operation and also during resize the hash table might not be available for some operations.

Resizing a distributed dataset will mean shuffling all the keys and when a dataset contains billions or trillion of keys then it is going to generate a huge load on the system and also making it unavailable for end users.

If we look in terms of BIG O then classic hash table complexity will look something like below.

It is very clear that the Add/Remove node option is going to bring everything to standstill. Some distributed systems like Kafka have fixed no of partition strategies to avoid exactly this problem.

So what can be done in this case ?

Key insight is using a range of hash codes to represent keys and it helps in reducing overhead of key migration when a new node is added or removed.

Sets up nicely for next section

Consistent Hashing

Consistent hashing is now a new idea but it is more than a 2 decade old idea that was mentioned in paper.

This is a special hash function which changes minimally as new nodes are added/removed. In this approach both keys and slots/nodes are hashed using the same function and enables mapping nodes to intervals of keys.

All the nodes are placed in a ring and each node is responsible for range of hash. Due to this range hash property whenever a node is added or removed k/n keys need to be replaced where k is total number of keys and n is number of nodes or slots.

Each node is responsible for a range of keys due to which when a node is added or removed only part of keys needs to be replaced. Let's look at example

Node removed

In this example node B is lost and node A takes up responsibility and now handles range ( 1 to 200). Only keys between 101 to 200 need to move and this is much better than a naïve hash table.

Node Added

New node C.1 is added and it shares responsibility by handling part of keys from C.

Few things are done to balance load

Rebalancing by adjusting ranges.

Under this approach the background process is checking load(i.e. no of keys) on nodes. Any node that is overloaded or underloaded is balanced.

Add virtual nodes.

One of the problem with consistent hash is that some nodes become overload and to manage that it uses concept of virtual nodes, which are replica of nodes. This adds more nodes/slots and improves probability of better load balancing.

"Virtual Node" also gives one nice option to manage nodes with different capacity, if cluster contains node that is 2X bigger than other nodes then it can have more replica compared to others.

Lets look at one simple implementation. Code contains 2 components one is ConsistentHash and Distributed hash using consistent hash.

public class ConsistentHashing<T> {

    private final Function<byte[], Integer> hashFunction;
    private final int replica;
    private final SortedMap<Integer, T> ring = new TreeMap<>();
    private final Function<T, String> nodeKey;


    public void add(T node) {
        for (int c = 0; c < replica; c++) {
            String key = String.format("%s_%s", nodeKey.apply(node), c);
            ring.put(hashFunction.apply(key.getBytes()), node);
        }
    }

    public T findSlot(Object key) {
        int hash = hashFunction.apply(key.toString().getBytes());
        return ring.getOrDefault(hash, findClosestSlot(hash));
    }

    private T findClosestSlot(int hash) {
        SortedMap<Integer, T> tail = ring.tailMap(hash);
        int keyHash = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(keyHash);
    }

}

Distributed Hash

public class DistributedHashTable {
    private final ConsistentHashing<Node> hash;

    public DistributedHashTable(ConsistentHashing<Node> hash) {
        this.hash = hash;
    }

    public void put(Object key, Object value) {
        findSlot(key).put(key, value);
    }

    private Node findSlot(Object key) {
        return hash.findSlot(key);
    }

    public Object get(Object key) {
        return findSlot(key).get(key);
    }
}

Consistent is very popular in many distributed systems like Casandra, DynamoDB,CouchBase, Akka Router, Riak and many more.

Rendezvous Hashing

Rendezvous hashing is alternative to consistent hashing and its more general purpose. This also came at the same time but did not become that popular.

Idea is based on a score that is computed by using node & key, and the node with the highest score is picked up for handling requests. Hash function looks something like hash(node-x,key)

Few things that make this algorithm different

Low overhead

Picking the server is low overhead because it needs very less computation and no coordination is required.

No coordination

This property of the algorithm makes it very unique because all the nodes that are trying to identify a server for request processing will select the same server without communication with other nodes in cluster. Think like no global lock is required to make decision.

Heterogenous Servers

One of the problems with consistent hashing is that some of the servers will be overloaded and to manage that additional replicas/virtual nodes are added and that result in a long list of servers to select from. This also becomes a problem when a cluster contains a server with different compute/storage capacity because now generic replica factor (i.e 3) can't be used.

In this algorithm virtual servers are optional and only required if servers of different capacity are added in node.

Better load balancing

Hash code has random property due to which each server has equal probability of getting picked.

No duplicate work

For eg client request is going to broadcast 10Gb data and doing this multiple times will definitely choke network bandwidth.

This algorithm gives guarantee that each request will only map to a single server with no/less coordination, so duplicate work can be totally avoided and it will result in saving extra network overhead.

Minimal Disruption

This property makes it very good selection especially when cluster is very dynamic. Incase of node failure only keys mapped to failed server needs to redistributed and it becomes very significant overhead when data set is huge.

Code snippet for node selection/assignment

public N assignNode(Object key) {

    long value = Long.MIN_VALUE;
    N node = null;
    for (N n : nodes.values()) {
        long newValue = hash.apply(n, key);
        if (newValue > value) {
            node = n;
            value = newValue;
        }
    }
    return node;
}

Apache Ignite distributed database is based on Rendezvous hashing. Some other applications are cache, monitoring applications & exclusive task processing.

Range

This is one of the simple and yet very powerful way for data partition and relational database has used it for ages.

Idea of range partition comes from how books are arrange in library or the ways words are put in dictionary. One of the important assumption to take advantage of this style is ordering of keys, if keys are random then this scheme is not good fit.

Range implementations can be vary based on need, lets look at some of options.

Static Range

As name suggest , it is static range and never changes over lifecycle of service. Static range works well if data is distributed evenly. Some of the good examples are time series data that can benefit by static ranges where time period ( i.e day/hour/minute) is one bucket. Some implementations adds little bit of dynamic nature to this style by creating new partition based on range pattern like every hour new partition is created.

Real data is skewed, so static range runs in problem when one of the partition becomes hot partition and whole system performance degrades and also not all the resources are utilized properly.

Dynamic Range

Some of the new data systems add this innovation to make the system scalable. Let's look at how it works.

It starts with an empty container with some capacity, in this example let's use 5.

Once the container is full then it splits in half, this process keeps on going. This way of dynamic partition creation handles skewed data and also a very scalable approach.

In the above example adding 9 keys has created 3 nodes, each node containing an equal number of values. At the end partition will looking something like below

Just to highlight again that ordering property is maintained during split process and makes it very good fit for efficient range scan.

Ordered data has many other benefits that can easily utilized query optimizer.

Any discussion on partition can't be completed without talking about request routing, answer to this question depends on the complexity that you want to push to clients.

In this design router has all the details of keys placement and it can route traffic.

Conclusion

Data partition is key for building a distributed system. It has many other challenges like how to rebalance partitions without any downtime, how to process query that needs access(i.e. read and write) to multiple partitions.

Partition replication comes after partition and it is also equally challenging.

Are you ready

Sunday, 4 April 2021

Data partitioning approach.