11 February 2017

Must-read computer networking research papers


A lot of Quora users ask this question: what are the most influential research papers that contributed to the development and improvement of the Internet and networking technologies? Based on my academic study (material provided by Prof. Minlan Yu and Prof. Wyatt Lloyd), work experience, and research on computer networks, I have created a list to help my readers discover and understand the most significant research contributions. I will keep this list updated as new papers appear at recent conferences.

Instead of providing direct links, I prefer to provide citations, as I consider that the correct way to reference others' work. You can search for each paper by its title in Google or Google Scholar.


Before I get to the networking papers, I would like to introduce my audience to a must-read paper that will help you understand and absorb the contents of research papers effectively. Trust me, you want to read this paper if you want to improve your reading skills.

How to read a paper


1. Keshav, S. "How to read a paper." ACM SIGCOMM Computer Communication Review 37.3 (2007): 83-84.

Let's start with the inception of the Internet. DARPA researchers put together the design of the first wide-area packet network, the ARPANET. I suggest you read the next paper to understand the design decisions behind developing a generic network to be used across multiple communities.


2. Clark, David. "The design philosophy of the DARPA Internet protocols." ACM SIGCOMM Computer Communication Review 18.4 (1988): 106-114.
History of the Internet

After reading paper #2, you will understand the fundamental ideas that shaped the design of the Internet. With this knowledge, you could sketch the design of your own Internet without getting lost in its complex details. A very important decision you must make while designing any networking technology is whether a feature should be implemented end-to-end or hop-by-hop. A good starting point is the next paper, which discusses the end-to-end argument in detail.


3. Saltzer, Jerome H., David P. Reed, and David D. Clark. "End-to-end arguments in system design." ACM Transactions on Computer Systems (TOCS) 2.4 (1984): 277-288.

These papers give you great insight into the design principles of any large-scale network. Now let's focus on the concrete design requirements of a reliable packet-delivery network. There are several issues one must address while designing reliable packet communication, such as packet identification, acknowledgements, and retransmission of lost packets. The authors of the next paper addressed these and much more. Originally published in 1974, it was later republished in ACM SIGCOMM Computer Communication Review in 2005.



4. Cerf, Vinton, and Robert Kahn. "A Protocol for Packet Network Intercommunication." IEEE Transactions on Communications 22.5 (1974): 637-648.


-------
Next, we will look at IP routing protocols, which allow packets to be routed from source to destination via routers. There are mainly two types of routing protocols used within an AS (autonomous system): Distance Vector and Link State routing. To understand these protocols, I recommend reading the Cisco white papers.
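To make the distance-vector idea concrete, here is a minimal sketch of my own (a toy example over a hypothetical three-router topology, not code from any paper or product): each router repeatedly relaxes its distance estimates using the vectors its neighbors advertise.

```python
# Toy distance-vector routing: every router repeatedly updates its
# distance table from the vectors advertised by its direct neighbors.
INF = float("inf")

# hypothetical link costs between directly connected routers
links = {("A", "B"): 1, ("B", "C"): 2, ("A", "C"): 5}

nodes = {"A", "B", "C"}
cost = {}
for (u, v), c in links.items():
    cost[(u, v)] = cost[(v, u)] = c

# dist[u][d] = u's current estimate of its distance to destination d
dist = {u: {d: (0 if u == d else INF) for d in nodes} for u in nodes}

changed = True
while changed:                      # iterate until no estimate improves
    changed = False
    for u in nodes:
        for v in (x for x in nodes if (u, x) in cost):  # u's neighbors
            for d in nodes:
                via_v = cost[(u, v)] + dist[v][d]       # reach d via v
                if via_v < dist[u][d]:
                    dist[u][d] = via_v
                    changed = True

print(dist["A"])   # e.g. A reaches C via B: {'A': 0, 'B': 1, 'C': 3}
```

Real protocols like RIP exchange these vectors periodically over the network and must also handle failures (the classic count-to-infinity problem), which the white papers cover.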



Note that understanding routing is a major milestone. IGP (Interior Gateway Protocol) routing operates within an AS, while EGP (Exterior Gateway Protocol) routing operates across different ASes. An AS is a single administrative unit that controls a set of IP prefixes and sets its own independent routing policy. BGP (Border Gateway Protocol) is the most commonly used protocol for routing across ASes, and coordinating routing across independently managed ASes is the hardest part of the problem. The next paper highlights the issues associated with BGP routing and proposes guidelines that achieve convergence within permissible time.

IGP vs BGP across ASes

6. Gao, Lixin, and Jennifer Rexford. "Stable internet routing without global coordination." ACM SIGMETRICS Performance Evaluation Review 28.1 (2000).
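As a rough illustration of the paper's core guideline, here is a sketch of my own (a simplification; the data layout and function names are hypothetical): an AS prefers routes learned from its customers over those from peers, and peers over providers, and only re-advertises customer-learned routes to everyone.

```python
# Sketch of Gao-Rexford-style BGP policy (simplified illustration).
# Preference order: routes from customers > peers > providers.
PREFERENCE = {"customer": 0, "peer": 1, "provider": 2}

def best_route(candidates):
    """Pick the most preferred route; ties broken by shorter AS path."""
    return min(candidates, key=lambda r: (PREFERENCE[r["learned_from"]],
                                          len(r["as_path"])))

def should_export(route, neighbor_relation):
    """Customer routes go to everyone; peer/provider routes go only
    to customers (never sideways to peers or up to providers)."""
    return route["learned_from"] == "customer" or neighbor_relation == "customer"

candidates = [
    {"prefix": "10.0.0.0/8", "as_path": [64500, 64501], "learned_from": "peer"},
    {"prefix": "10.0.0.0/8", "as_path": [64502, 64503, 64501], "learned_from": "customer"},
]
best = best_route(candidates)
print(best["learned_from"])         # -> customer, despite the longer path
print(should_export(best, "peer"))  # -> True
```

The paper's key result is that if every AS follows preference and export rules of this shape, BGP converges even though each AS sets its policy independently.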

Due to the autonomous nature of BGP, ASes can act maliciously and advertise routing paths that don't exist. There are also other security concerns in BGP arising from local policy selection, which cannot be externally monitored. I recommend reading the next paper, which describes the security loopholes in BGP and the proposed solutions to fix them.



7. Goldberg, Sharon. "Why is it taking so long to secure internet routing?" Communications of the ACM 57.10 (2014): 56-63.

Another important perspective on routing is to look at the safety and liveness properties of routing algorithms. Most routing algorithms favor responsiveness over consistency, which has negative side effects such as routing loops. The next paper is the first to introduce "consensus routing," which addresses these side effects of current routing protocols.


8. John, John P., et al. "Consensus routing: The Internet as a distributed system." Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation. 2008.


------
Now I am going to talk about my favorite topic, TCP, the most crucial component for end-to-end reliable communication, flow control, and congestion control. The idea of TCP was born in paper #4, where Vint Cerf and Bob Kahn discussed a single internetworking protocol combining what later became the separate TCP and IP layers. The most complex of all TCP features is the congestion control mechanism, because end hosts have little or no explicit feedback from the network. I would definitely recommend the next paper for a solid understanding of congestion control.


9. Jacobson, Van. "Congestion avoidance and control." ACM SIGCOMM Computer Communication Review 18.4 (1988).


TCP congestion window in Tahoe and Reno
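To get a feel for the sawtooth pattern in the figure above, here is a toy simulation of my own (heavily simplified; real TCP tracks bytes, timeouts, and duplicate ACKs): slow start grows the window exponentially until it crosses a threshold, congestion avoidance grows it additively, and a loss halves it, Reno-style.

```python
# Toy TCP congestion window: slow start, congestion avoidance (AIMD),
# and Reno-style multiplicative decrease when a loss is detected.
cwnd = 1.0                       # congestion window, in segments
ssthresh = 32.0                  # slow-start threshold
loss_rounds = {20, 40}           # hypothetical RTTs where loss is detected

for rtt in range(1, 51):
    if rtt in loss_rounds:       # loss: halve the window (Reno)
        ssthresh = max(cwnd / 2, 1.0)
        cwnd = ssthresh
    elif cwnd < ssthresh:        # slow start: exponential growth
        cwnd = min(cwnd * 2, ssthresh)
    else:                        # congestion avoidance: +1 segment per RTT
        cwnd += 1
    print(f"RTT {rtt:2d}: cwnd = {cwnd:5.1f}")
```

Tahoe instead drops cwnd all the way back to one segment on every loss, which is exactly the difference the figure illustrates.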

The next set of papers is intended only for readers who want to do research on improved TCP variants. These are advanced research papers targeting TCP performance. I have personally read them and used this knowledge to show off in my technical interviews.


XCP (eXplicit Control Protocol)
10. Katabi, Dina, Mark Handley, and Charlie Rohrs. "Congestion control for high bandwidth-delay product networks." ACM SIGCOMM Computer Communication Review 32.4 (2002): 89-102.


DCTCP 
11. Alizadeh, Mohammad, et al. "Data center TCP (DCTCP)." ACM SIGCOMM Computer Communication Review 40.4 (2010). (See the sketch of its window rule after this list.)


MPTCP
12. Wischik, Damon, et al. "Design, Implementation and Evaluation of Congestion Control for Multipath TCP." NSDI 2011.
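To give a flavor of one of these, here is a sketch of my own of DCTCP's window-adjustment rule as described in paper #11 (the function framing and traffic numbers are hypothetical): the sender maintains a running estimate alpha of the fraction of ECN-marked packets and cuts the window in proportion to it, rather than halving on every congestion signal.

```python
# DCTCP window adjustment (sketch): alpha <- (1 - g)*alpha + g*F, then
# cwnd <- cwnd * (1 - alpha/2) when marks were seen, where F is the
# fraction of ECN-marked packets in the last window of data.
g = 1 / 16                          # EWMA gain suggested in the paper

def on_window_of_acks(cwnd, alpha, marked, total):
    F = marked / total              # observed marking fraction
    alpha = (1 - g) * alpha + g * F
    if marked:
        cwnd *= 1 - alpha / 2       # gentle cut when congestion is mild
    return cwnd, alpha

cwnd, alpha = 100.0, 0.0
cwnd, alpha = on_window_of_acks(cwnd, alpha, marked=2, total=100)
print(f"cwnd = {cwnd:.2f}, alpha = {alpha:.5f}")   # a tiny cut, not 50%
```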

TCP had long dominated systems that require reliable communication until Google recently introduced QUIC, a UDP-based reliable and secure transport protocol. It features improved congestion control and uses forward error correction, giving it significantly better performance than TCP in many scenarios. I highly recommend reading the QUIC design documentation.



------
Next, I would like to talk about the accessories of computer networking, add-on components that greatly improve network performance. A load balancer is one of them; it sits between the router and the organization's servers to balance incoming traffic. As traffic grows into billions of requests per second, providers host a service on multiple endpoints and use load balancers to spread traffic evenly across them for the best performance. Maglev, introduced by Google, is a software load balancer that replaces expensive, inflexible hardware load balancers with distributed software running on commodity servers.


14. Eisenbud, Daniel E., et al. "Maglev: A fast and reliable software network load balancer." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, 2016.
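The heart of Maglev is its consistent-hashing lookup table. Here is a heavily simplified sketch of my own of the table-population idea from the paper (real Maglev adds connection tracking, weights, and much more; the backend names here are made up): each backend derives a permutation of table slots from two hashes, and backends take turns claiming their next preferred empty slot until the table is full.

```python
# Simplified Maglev-style lookup-table population (illustration only).
import hashlib

M = 13                                  # table size; the paper uses a prime

def h(value, seed):
    return int(hashlib.md5(f"{seed}:{value}".encode()).hexdigest(), 16)

def populate(backends):
    offset = {b: h(b, 0) % M for b in backends}
    skip = {b: h(b, 1) % (M - 1) + 1 for b in backends}
    next_idx = {b: 0 for b in backends}
    table, filled = [None] * M, 0
    while filled < M:
        for b in backends:              # backends claim slots in turn
            if filled == M:
                break
            while True:                 # walk b's permutation of slots
                slot = (offset[b] + next_idx[b] * skip[b]) % M
                next_idx[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
    return table

table = populate(["backend-a", "backend-b", "backend-c"])
print(table)   # each backend owns roughly M/3 slots
# a packet's 5-tuple is hashed to an index in this table to pick a backend
```

Because each backend fills the table through its own permutation, adding or removing one backend disturbs only a small fraction of the slots, which is what keeps existing connections stable.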

Network virtualization has become increasingly common in data center networks; it provides centralized control over network services and the underlying network resources with the help of network functions. Besides routing and switching, it also provides L4-L7 services such as access control and load balancing. The decoupling of the control and forwarding planes (the SDN idea) has greatly aided the advancement of network virtualization. The next paper, on Open vSwitch, provides comprehensive details about one of the major breakthroughs in this area.


15. Pfaff, Ben, et al. "The Design and Implementation of Open vSwitch." NSDI 2015.

I think the above material is a good way to start thinking like the designer of an Internet-scale network. You can put yourself in a situation where you are asked to design an alternative Internet from scratch. Networking has been one of the biggest fields of research since its inception, and it will remain so because of its inherent challenges.



13 January 2017

Sharding in Distributed Systems


I have mostly written about networking protocols, which is not fair given that my blog's title is Inter-networking and Distributed Systems. By that logic, distributed systems should get a fair 50% share of the blog's content. So today I am going to write about one of the most important and most commonly used concepts in distributed systems: sharding.

Sharding in distributed databases



What is Sharding?

Sharding, in the simplest of terms, is a way of distributing data across multiple machines. By splitting a dataset into multiple shards, you can store larger datasets simply by adding machines. This dataset could be anything: a key/value store, a SQL database, BigTable, an in-memory cache, or anything else that represents data in some form.
Shards in distributed databases



Sharding is most commonly realized as horizontal partitioning, where rows of the same table are split across databases on different machines. By storing different rows on different groups of machines, performance improves through parallel processing of independent rows. For example, in the figure above, users Kim and Lee are stored on Group 1 while users Park and Nam are stored on Group 2. When I say group, I mean the set of state machine replicas (SMRs) that maintain the same data to handle fault tolerance. Note that the table is split by rows, which means every group stores all the columns for its subset of rows.

You can also do vertical partitioning to store some of the heavy columns (like BLOBs) elsewhere. But that is not sharding, so we won't go into the details of vertical partitioning here.

What is not Sharding?

One of the most popular themes in distributed systems is fault tolerance. A crucial reason we need distributed systems is that machine failure is inevitable, so we must replicate data and computation across multiple machines to make sure we don't lose anything in the event of failures. People who study the distributed systems literature on fault tolerance read about the various consensus algorithms that keep the data replicated across multiple nodes in agreement and valid.

Two-Phase Commit (2PC) is a protocol that ensures agreement among all participating nodes via a coordinator. It is used in transactional systems that must guarantee ACID properties, for example an online banking system. I won't go into the details of 2PC in this blog; the small sketch below conveys the idea. Another common protocol people talk about is Paxos, which ensures consensus among a majority of participants. Paxos is a very interesting protocol: it provides the abstraction of a single machine that never fails by replicating the data store on multiple nodes and running the Paxos protocol among them to reach majority consensus.
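Here is a bare-bones sketch of my own of the coordinator side of 2PC (the class and method names are made up for illustration; a real implementation must also log decisions durably and handle coordinator crashes): the transaction commits only if every participant votes yes in the prepare phase.

```python
# Bare-bones two-phase commit, coordinator side (illustration only).
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes

    def prepare(self):
        # Phase 1: participant makes its work durable, then votes.
        return self.will_vote_yes

    def commit(self):
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: aborted")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if all(votes):                                # phase 2: unanimous -> commit
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:                        # any "no" -> abort everywhere
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("db-1"), Participant("db-2")]))
```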

Paxos in Distributed Systems


An interesting thing to note here is that protocols like 2PC, Paxos, PBFT, and other SMR protocols provide fault tolerance only via replication. All the machines in a consensus group (a quorum, in short) store the same data, which means they perform exactly the same computations on the same dataset. Adding more machines to any of these protocols does not increase throughput; in other words, SMR protocols are NOT SCALABLE.

Since SMR only buys fault tolerance at the cost of extra machines, it is definitely not enough for scalable distributed systems that need to store and compute over billions of user records.


How do Distributed Systems implement Sharding?

As I said, sharding means distributing data and computation across machines, so what is the easiest way of carrying out this distribution? Let's say you have a data store of 100 key/value pairs and a total of 5 nodes. By simple division, you could store 20 keys per node for equal load distribution. But you can't assign them manually, because your client program would need that mapping every time it accesses the data store.

The simplest algorithm is to take a hash of the key and compute it modulo N, where N is the number of nodes; your client application applies this formula every time it needs to access a key. (If you are wondering how nodes know which keys they should store: simply numbering the nodes 0 through N-1 solves that.) But this simple approach doesn't scale: as soon as N changes, almost every key maps to a different node, which ruins load balancing exactly when you grow the cluster. For a scheme that handles this gracefully, please refer to the Chord paper; a quick sketch of the naive scheme and its weakness follows below.
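Here is a tiny sketch of my own of the hash-mod-N scheme (the key names are made up), including a demonstration of the problem: grow from 5 to 6 nodes and most keys move.

```python
# Naive sharding: hash the key, take it modulo the node count.
import hashlib

def shard_for(key, num_nodes):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user:{i}" for i in range(100)]

before = {k: shard_for(k, 5) for k in keys}   # 5 nodes: shards 0..4
after = {k: shard_for(k, 6) for k in keys}    # add just one node

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of 100 keys remapped after adding one node")  # ~83 expected
```

Consistent hashing, which Chord builds on, fixes exactly this: when a node joins or leaves, only the keys adjacent to it on the ring move. A quick idea of what the Chord protocol does is depicted in the picture below: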

Chord peer-to-peer protocol


In Chord, nodes are placed on a ring representing an m-bit identifier space. Each node's IP address is hashed into this space (i.e., taken modulo 2^m) to find its position on the ring, and keys are hashed the same way to determine their positions. Each key is stored on the node with the same identifier, or on that identifier's successor if no node sits exactly there. In the picture above (with m = 3), there are three nodes, 0, 1, and 3, and three keys, 1, 2, and 6. Key 1 is stored on Node 1, Key 2 on Node 3 (its successor), and Key 6 on Node 0 (its successor, wrapping around the ring).
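Here is a minimal sketch of my own of that successor rule (just the placement logic; real Chord also maintains finger tables for O(log N) lookups), reproducing the figure's example:

```python
# Minimal Chord-style key placement: keys and nodes share one m-bit
# identifier ring; each key lives on the first node at or after it.
import bisect

def successor(node_ids, ident):
    """First node clockwise from ident, wrapping around the ring."""
    node_ids = sorted(node_ids)
    i = bisect.bisect_left(node_ids, ident)
    return node_ids[i % len(node_ids)]

nodes = [0, 1, 3]            # node identifiers from the figure (m = 3)
for key in [1, 2, 6]:
    print(f"key {key} -> node {successor(nodes, key)}")
# key 1 -> node 1, key 2 -> node 3, key 6 -> node 0
```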


Is Sharding enough?

I still remember the Distributed Systems class in my Master's program at USC, where we studied its various concepts in depth: state machine replication, sharding, MapReduce, leases, distributed locks, in-memory caches, and many more. It was during one of the office hours that I clearly understood why distributed systems need both replication and sharding together.

A simple analogy:

You are in high school and you have been asked to complete a set of hard mathematical problems. Your teacher is a kind one, so she allows a team of four members to submit one solution. Note that these questions are tricky, and a single miscalculation might ruin everything. So you decide within your team that each of you will solve all the problems individually and then compare results, to make sure that at least two answers match. This is exactly what "replication" does in distributed computing. But there is a catch: you were also told that the team that finishes first will receive 10% extra credit. This is huge, and you don't want to miss it.

What do you do? Think, think, think. You decide to use one more idea here, sharding: you form two groups of two members each, such that each group solves a different half of the problems. This still catches single-person errors, and it roughly doubles your team's throughput. This is exactly how distributed systems work: distribute different work to different sets of machines to achieve both fault tolerance and scalability.
Distributed Storage



I know this article might have raised a lot of questions in your mind, and that is good. I had the same feeling when I started learning distributed systems, and eventually I found the answers. Feel free to ask questions and I will try my best to answer them. Till then, I will start preparing for my next post.


