13 January 2017

Sharding in Distributed Systems


I have mostly written about networking protocols, which is not fair, since my blog's title is Inter-networking and Distributed Systems. By that logic, Distributed Systems should get its 50% share of the blog content. So today I am going to write about one of the most important and most commonly used concepts in Distributed Systems: Sharding.

Sharding in distributed databases



What is Sharding?

Sharding, in the simplest of terms, is a way of distributing data across multiple sets of machines. By splitting a dataset into multiple shards, it becomes possible to store larger datasets simply by adding more machines. The dataset could be anything: a key/value store, a SQL database, BigTable, an in-memory cache, or anything else that represents data in some form.
Shards in distributed databases



Sharding is most commonly realized as horizontal partitioning, where homogeneous rows are split across database storage on different machines. By storing different rows on different groups of machines, performance improves through parallel processing of independent rows. For example, in the figure above, users Kim and Lee are stored on Group 1 while users Park and Nam are stored on Group 2. When I say Group, I mean the set of State Machine Replicas (SMRs) that maintain the same data to provide fault tolerance. Note that the table is split by rows, which means every Group stores all the columns for its own subset of rows.
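To make the row-based split concrete, here is a minimal sketch in Python. Everything in it (the group count, column names, and the hash-based routing rule) is illustrative rather than taken from any particular database, and the hash won't necessarily reproduce the exact Kim/Lee vs. Park/Nam split shown in the figure.

```python
# A rough sketch (not any real database's API) of horizontal partitioning:
# each row is routed to a shard group by hashing its shard key (the user
# name). Every group holds all columns for its own subset of rows.
import hashlib
from collections import defaultdict

NUM_GROUPS = 2

def group_for(user: str) -> int:
    """Pick a shard group for a row from a stable hash of its shard key."""
    digest = hashlib.sha1(user.encode()).hexdigest()
    return int(digest, 16) % NUM_GROUPS

rows = [
    {"user": "Kim",  "age": 26, "city": "Seoul"},
    {"user": "Lee",  "age": 31, "city": "Busan"},
    {"user": "Park", "age": 24, "city": "Incheon"},
    {"user": "Nam",  "age": 29, "city": "Daegu"},
]

groups = defaultdict(list)          # group id -> rows stored in that group
for row in rows:
    groups[group_for(row["user"])].append(row)

for gid in sorted(groups):
    print(f"Group {gid}: {[r['user'] for r in groups[gid]]}")
```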

You can also do vertical partitioning to store some of the heavy columns (like BLOBs) elsewhere. But that is not sharding, so we won't go into the details of Vertical Partitioning.

What is not Sharding?

One of the most popular themes in Distributed Systems is Fault Tolerance. A crucial reason why we need distributed systems is that machine failure is inevitable. Thus, we need to replicate data and computations across multiple machines to make sure we don't lose anything in the event of failure. People familiar with the Distributed Systems literature on fault tolerance will have read about various consensus algorithms that keep the data replicated across multiple nodes in agreement and valid.

Two-Phase Commit (2PC) is one such protocol; it ensures agreement among all participating nodes via a Coordinator. It is used in transactional systems that must guarantee ACID properties, for example an online banking system. I won't go into the details of 2PC in this post. Another protocol people commonly talk about is Paxos, which ensures consensus among a majority of participants. Paxos is a very interesting protocol: it provides the abstraction of a single machine that never fails, by replicating the data store on multiple nodes and running the Paxos protocol among them to reach majority consensus.

PAXOS in Distributed Systems


An interesting thing to note here is that 2PC, Paxos, PBFT and other SMR protocols only provide Fault Tolerance via Replication. All the machines in a consensus group (in short, a quorum) store the same database, which means they perform exactly the same computations on the same dataset. Adding more machines to any of these protocols doesn't increase throughput; in other words, SMR protocols are NOT SCALABLE.

Since SMR only provides fault tolerance, at the cost of more machines, it is definitely not enough for scalable distributed systems that need to store and compute over billions of user records.


How do Distributed Systems implement Sharding?

As I already said, Sharding is distributing data/computations across machines, so what is the easiest way of carrying out this distribution? Let's say you have a data store of 100 key/value pairs and a total of 5 nodes. By simple division, you could store 20 keys per node for equal load distribution. But you can't assign them manually, because your client program would need that mapping every time it accesses the data store.

The simplest algorithm is to take a hash of the key and compute it modulo N, where N is the number of nodes. This way your client application can simply apply the same formula every time it needs to access a key. If you are wondering how nodes identify which keys they should store, simply numbering the nodes 0 through N-1 solves that problem. But this simple approach doesn't hold up: when a node joins or leaves, N changes and almost every key gets remapped to a different node, and keeping the load evenly balanced becomes hard. For that, please refer to the Chord paper. A quick idea of what the Chord protocol does is depicted in the picture further below.
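Before getting to Chord, here is a minimal sketch of the hash-mod-N placement just described, assuming 100 keys, 5 nodes numbered 0 to 4, and SHA-1 as the hash function (all of that is illustrative):

```python
# Hash-mod-N placement: the client recomputes hash(key) % N whenever it
# needs to find the node that stores a key.
import hashlib

NUM_NODES = 5                      # nodes are simply numbered 0..4

def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Node responsible for a key: hash(key) mod N."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Spread 100 keys across the 5 nodes and count how many land on each.
counts = [0] * NUM_NODES
for i in range(100):
    counts[node_for(f"key-{i}")] += 1
print(counts)                      # roughly 20 keys per node

# The catch: bump N from 5 to 6 and most keys move to a new node.
moved = sum(1 for i in range(100)
            if node_for(f"key-{i}", 5) != node_for(f"key-{i}", 6))
print(f"{moved} of 100 keys remapped after adding one node")
```

That remapping cost is exactly what consistent hashing in Chord avoids: when a node joins or leaves the ring, only the keys adjacent to it need to move.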

Chord Peer to Peer protocol


In Chord, nodes are placed in a hash space represented as a ring. Each node's IP is hashed and taken modulo 2^m (the size of the identifier space) to find its position on the ring. Keys are hashed and reduced the same way to determine their positions. Each key is stored either on the node whose identifier matches the key's, or, if no node sits at that position, on its successor (the next node clockwise). In the picture above, there are 3 nodes with identifiers 0, 1 and 3, and 3 keys with identifiers 1, 2 and 6. Key 1 is stored on Node 1, Key 2 on Node 3 (its successor) and Key 6 on Node 0 (its successor, wrapping around the ring).
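As a sanity check on that example, here is a tiny sketch of successor-based placement on the ring (m = 3, so identifiers run from 0 to 7; in real Chord the node and key identifiers would come from hashing node IPs and keys):

```python
# Successor placement on a Chord-style identifier ring of size 2**m.
M = 3
RING_SIZE = 2 ** M                 # identifiers 0..7

nodes = sorted([0, 1, 3])          # node identifiers from the figure
keys = [1, 2, 6]                   # key identifiers from the figure
assert all(i < RING_SIZE for i in nodes + keys)

def successor(key_id: int) -> int:
    """First node at or clockwise after key_id, wrapping around the ring."""
    for node_id in nodes:
        if node_id >= key_id:
            return node_id
    return nodes[0]                # wrapped past the largest node id

for k in keys:
    print(f"key {k} -> node {successor(k)}")
# prints: key 1 -> node 1, key 2 -> node 3, key 6 -> node 0
```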


Is Sharding enough?

I still remember taking the Distributed Systems class in my Master's program at USC, where we studied its various concepts in depth: State Machine Replication, Sharding, MapReduce, Leases, Distributed Locks, In-memory caches and many more. It was during one of the office hours that I clearly understood the need for Replication and Sharding together in Distributed Systems.

A simple analogy:

You are in high school and have been asked to complete a set of hard mathematical problems. Your teacher is a kind one, so she allows a team of 4 members to submit one solution. Note that these questions are tricky and a single miscalculation could ruin everything. So you decide within your team that each of you will solve all the problems individually and then compare results, to make sure that at least 2 of them match. This is exactly what "Replication" does in distributed computing. But there is a catch: you were also told that the team that finishes first will receive 10% extra credit. This is huge and you don't want to miss it.

What do you do? Think, think, think. You decide to use one more idea here: Sharding. You form two groups of two members each, such that each group solves a different set of problems. This still catches single-person errors while roughly doubling your team's throughput. This is exactly how Distributed Systems work: distribute different work to different sets of machines in order to achieve both Fault Tolerance and Scalability.
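As a final sketch, this is roughly how the two ideas compose in practice: sharding picks which group owns a key, and replication means each group is itself several machines holding the same data. The group layout and addresses below are made up purely for illustration.

```python
# Sharding + replication: a key is hashed to a shard group, and every
# group is a set of replicas that all store the same data.
import hashlib

groups = {
    0: ["10.0.0.1", "10.0.0.2", "10.0.0.3"],   # replicas of shard group 0
    1: ["10.0.1.1", "10.0.1.2", "10.0.1.3"],   # replicas of shard group 1
}

def replicas_for(key: str) -> list[str]:
    """Sharding chooses the group; replication gives that group N copies."""
    gid = int(hashlib.sha1(key.encode()).hexdigest(), 16) % len(groups)
    return groups[gid]

# Every key lives on all replicas of exactly one group.
for key in ["user:kim", "user:park"]:
    print(key, "->", replicas_for(key))
```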
Distributed Storage



I know this article might have raised a lot of questions in your mind, and that is good. I had the same feeling when I started learning Distributed Systems, and eventually I found the answers. Feel free to ask questions and I will try my best to answer them. Till then, I will start preparing for my next post.


