

Can P4 save Software-Defined Networking?

P4 is now gaining momentum thanks to the engagement of big players such as Google and AT&T. It has the potential to cause a significant change in the industry and finally deliver on the SDN value proposition. I’d like to discuss why.

In summary, P4 pursues three main goals:

  • Reconfigurability
  • Protocol independence
  • Target independence

OpenFlow had its shortcomings: the diversity of implementation strategies somehow evolved into incompatibility between switches. P4’s target independence proposes to solve this by using a compiler to translate P4 code into target-specific switch code, taking each device’s capabilities into account.
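To make target independence a bit more concrete, here is a minimal sketch of what that workflow looks like with the open-source reference toolchain (the p4c compiler plus the bmv2 simple_switch software target). The program name and veth interfaces below are hypothetical, and a hardware target such as Tofino would use its own vendor backend instead of bmv2:

# Compile a (hypothetical) P4 program for the bmv2 reference software switch;
# p4c emits a JSON description of the pipeline that simple_switch can load.
p4c --target bmv2 --arch v1model basic_router.p4

# Load the compiled pipeline, binding two veth interfaces as switch ports 0 and 1.
sudo simple_switch -i 0@veth0 -i 1@veth2 basic_router.json

The point is that basic_router.p4 stays the same; the compiler backend, not the program, absorbs the target-specific details.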


In order to understand how disruptive this is, let’s look at the current state of affairs: commodity silicon vendors such as Broadcom and Mellanox already have APIs to control their switches, and the existence of those APIs has already disrupted the industry by enabling Cumulus, SnapRoute and even Arista. Now, would you prefer that your silicon vendors establish a common interface, or would you rather rewrite software every time you want to test a new switch vendor? The answer is obvious: the first option benefits users and new vendors, the second benefits established vendors. New industry players or adventurous operators could write software on top of P4 and achieve multi-vendor integration at the cost of writing compilers for each vendor they use.

So that’s the big pay-off opportunity: enabling competition, and with it innovation. The challenge is giving vendors an incentive to write the P4 compilers in the first place.

Multi-vendor integration on top of P4 could be game-changing, but the big questions are: “How eager are developers to write P4 software?”, “How much does it cost to hire somebody to do it?” and, additionally, “Who will write the Cisco- or Broadcom-specific P4 compiler code?”

There are endless opportunities. In a parallel universe, AT&T forces Cisco to provide a P4 compiler for its devices; Cisco writes a bad compiler, claims P4 is bad technology, and sells you ACI instead. In a different universe, Barefoot writes a Broadcom compiler and makes sure it works, but then it “wastes” resources promoting a competitor. A little more realistically, SnapRoute or Cumulus could write a P4 compiler for Broadcom Tomahawk and thus be able to run their software on a plethora of existing devices. Even more realistically, Barefoot writes its own compiler for Tofino and keeps selling P4 to a limited niche market.

Now, if Barefoot took on the responsibility of writing P4 compilers for Broadcom and Mellanox silicon, that would translate into huge value for NOS vendors and operators, since they would be able to switch hardware vendors seamlessly. It would only marginally increase adoption of Tofino, though, so the question remains: who would pay for this?

Now how much does it cost to adopt P4?

Before I answer this question, I’m going to call back to a point from when I previously wrote about network disaggregation. I ended that post by asking: “Does OpenFlow effectively lock you in?” The same question now applies to P4.

The question itself is misleading. I’ve heard vendors say, “OpenFlow locks you in, you might as well just buy our SDN.” There is so much wrong with this. OpenFlow isn’t perfect, but it does allow you to adopt software processes and deliver features much faster than your vendor will.

Any choice is a potential barrier and locks you in a little, but what everybody really means by lock-in is hardware lock-in. When you buy a generic x86 computer, you are free to install Ubuntu, Debian, Windows or whatever you’d like. When you buy a PlayStation, you can’t just install Xbox software on it; that’s vendor lock-in. The cost of escaping it is prohibitive, so you’d be better off just buying another appliance.

You could, at almost no cost, run an OpenFlow lab or field trial on Broadcom-based network devices and fall back to Cumulus if it doesn’t fulfill your needs. Unsurprisingly, vendors will claim lab trials aren’t needed because of their product quality, but experience says there will always be a missing feature.

Now for P4. From the adventurous perspective, P4 is great: you just have to write more software to get things done. For everybody else it carries a significant cost: you have to hire premium developers, or Barefoot itself, to do it. That cost is hard to justify when Broadcom plus Big Switch might already give you the tools to improve your current process.

OpenFlow vs P4

OpenFlow turns 10 next year, and a significant amount of resources has gone into testing it. It has been properly commercially supported by Big Switch for 3+ years, if I’m not mistaken. I’d say with some confidence that you could get an OpenFlow solution production-ready within a year. Realistically, could you get P4 deployed in production in a year?

Misconceptions:

  • Will P4 replace OpenFlow? Maybe. P4 offers a different value proposition, and OpenFlow agents can be written on top of P4. Great P4 implementations may eventually make OpenFlow obsolete.
  • Will P4 replace the Broadcom SDK? Same answer: a much better API could be built in P4 on top of it.
  • Will P4 replace OpenNSL? Why not?
  • Will P4 replace NetFlow/sFlow? No. sFlow is a protocol for exporting data from switches; it says little about how you should implement it in the data plane.
  • Will P4 replace Riverbed? No way.
  • Will P4 replace OpenConfig? Nope, they are actually quite complementary.

Thanks for reading the long post. I welcome any thoughts or questions.


TCP BBR Congestion Control on Mininet

In this post, I demonstrate some benefits of using BBR congestion control and illustrate how easy it is to adopt it by using Mininet as an example. I’m excited to share this post with you guys because it’s been a while since I’ve made a tutorial and I love breakthrough innovations like this.

This post is divided into three sections: background on BBR, a quick start, and technical challenges.

Background on BBR

TCP BBR has significantly increased throughput and reduced latency on Google’s internal backbone networks. Here is a quote from a great resource on how it works:

TCP BBR is rate-based rather than window-based; that is, at any one time, TCP BBR sends at a given calculated rate, instead of sending new data in response to each received ACK. In particular, TCP BBR does not directly link the sending of new data to the receipt of ACKs, and so, strictly speaking, is not actually a sliding-windows implementation. Therefore, we cannot properly talk about winsize or cwnd. Instead, we talk about the number of packets in flight, which is the rate times RTTactual, with the understanding that this number may vary with conditions.

Basically, BBR estimates the available bandwidth by keeping track of goodput: if an increase in the sending rate does not increase the observed goodput, it assumes it has reached the available bandwidth. It is reasonably effective at this, and that way it keeps queueing in the network to a minimum.

TCP throughput is inversely proportional to RTT, and most loss-based TCP implementations fill buffers and add delay; as a consequence, TCP by itself struggles to reach full utilization while keeping latency low. BBR changes that, which is why it’s such an impressive accomplishment.
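A back-of-the-envelope way to see the RTT dependence is the classic loss-based throughput approximation from Mathis et al. (not something from the BBR paper itself), where MSS is the maximum segment size and p the packet loss probability:

\text{throughput} \approx \frac{MSS}{RTT} \cdot \frac{C}{\sqrt{p}}, \qquad C \approx 1.22

Any standing queue inflates RTT and directly cuts into that bound, which is exactly the delay that loss-based TCP tends to create and BBR tries to avoid.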

Quick start

Open source is great because it allows innovation to be deployed much faster: BBR is already implemented in the Linux kernel, and with Mininet you can test it right away.

I’m a long-time fan of Stanford’s Reproducing Network Research website, and I leveraged most of the Mininet code for this experiment from there.

Now let’s get to it! This tutorial assumes you have Vagrant and Git installed. If you don’t, don’t panic; follow this link. To start, you will need to set up the VM. I took care of all the dependencies for you; if you want to inspect what I’m doing, take a look at the mininet role in the ansible folder.

git clone https://github.com/castroflavio/bbr-replication/
cd bbr-replication
git checkout vagrant
vagrant up

This should take about 10 minutes to complete. After it’s done, proceed:

vagrant ssh
cd mininet
sudo ./figure5.sh all

After around 30 seconds the experiment should be done and you can exit the VM:

exit
open figure5_mininet/figure5_mininet.png

This should open the following figure (figure5_mininet.png):

The figure compares latency under TCP BBR and TCP CUBIC (lower is better). As you can see, BBR reduces latency from ~150 ms to ~50 ms (66%) in the average case and from ~400 ms to ~50 ms (87%) in the worst case. This is crazy!

Technical challenges

The first technical challenge is finding a Linux kernel that implements BBR; it turns out BBR ships with kernel 4.9 and later, so look out for that. The second challenge was setting up the pacing mechanism BBR relies on; it was mentioned on the CS244 website, but I did not understand it at first.
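If you want to check whether a machine of your own can run BBR outside of the Vagrant VM, a quick sanity check looks roughly like this (module and sysctl names are the standard upstream ones, but your distribution may differ):

uname -r                                             # kernel must be 4.9 or newer
sysctl net.ipv4.tcp_available_congestion_control     # "bbr" should appear in this list
sudo modprobe tcp_bbr                                # if it does not, try loading the module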

BBR requires a mechanism to pace the sending rate, and it leverages the tc (traffic control) subsystem from Linux. I knew about tc, but I didn’t know it was such a powerful tool. After some research on Linux queueing mechanisms, I found that BBR requires the fq (fair queueing) queueing discipline, because fq implements the pacing BBR uses to control the sending rate. It turns out Mininet did not support fq at the time, and I had to change a couple of lines of code to add support for it.
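For reference, outside of Mininet the pacing piece boils down to attaching the fq queueing discipline to the sending interface with tc; eth0 below is just a placeholder for whatever interface your sender actually uses:

sudo tc qdisc replace dev eth0 root fq    # install fair queueing (which implements pacing) at the root
tc -s qdisc show dev eth0                 # confirm fq is attached and inspect its statistics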

Conclusion

TCP has been around for decades, and for decades people have been trying to improve it. TCP’s original congestion control mechanism literally saved the Internet; now I’m going to be bold and say that BBR, by providing nearly “queueless” congestion control, is saving latency-sensitive applications. It really is a big deal. I highly encourage you to try it out; at the very least, check out the following article: Increase your Linux server Internet speed with TCP BBR congestion control.
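If you just want to flip BBR on for a Linux 4.9+ server, the change itself is two sysctls; this is roughly what the article above walks through, and you would persist the values in /etc/sysctl.conf to survive reboots:

sudo sysctl -w net.core.default_qdisc=fq              # pace outgoing packets with fair queueing
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr    # use BBR for new TCP connections
sysctl net.ipv4.tcp_congestion_control                # verify the change took effect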



ESPRESSO – More insights into Google’s SDN

Google recently released a paper detailing how it designed and deployed Espresso, the SDN at the edge of its network. The paper was published at SIGCOMM ’17. In the past, Google’s SDN papers have been very insightful and inspiring, so I had big expectations for this one as well.

In this blog post, I’ll summarize and highlight the main points of the paper. I’ll follow up with some conjectures about what we can take away from it in terms of industry trends and the state of the art in SDN technology.

For reference, Google has released several papers detailing its networking technologies, these are the most important ones:

  • B4 details Google’s SDN WAN – a must-read. It explains how they drastically increased network utilization by means of a global traffic engineering controller.
  • Jupiter Rising details the hardware aspects of Google’s data center networks.
  • BwE explains Google’s bandwidth enforcer, which plays a huge role in traffic engineering.

B4 connects Google’s data centers, while B2 is the public-facing network that connects to ISPs in order to serve end users. Espresso, an SDN infrastructure deployed at the edge of B2, enabled higher network utilization (+13%) and faster roll-out of networking services.


Requirements and Design Principles

The basic networking services provided at the edge are:

  1. Peering – Learning routes by means of BGP
  2. Routing – Forwarding packets based on BGP routes or traffic-engineering (TE) policies
  3. Security – Blocking or allowing packets based on security policies

To design the system, the following requirements were taken into account:

  1. Efficiency – capacity needs to be better utilized and grow cheaply
  2. Interoperability – Espresso needs to connect to diverse environments
  3. Reliability – must be available 99.999% of the time (roughly five minutes of downtime per year)
  4. Incremental Deployment – a green-field-only deployment is not compelling enough
  5. High Feature Velocity

Historically, we have relied on big routers from Juniper or Cisco to meet these requirements. Those routers typically store the full Internet routing table, as well as giant TCAM tables holding all the ACL rules needed to protect traffic from the whole Internet, and they are quite expensive. More importantly, a real software-defined network lets you deliver innovation at the speed of software development rather than the speed of hardware vendors.

Basically, five design principles are applied in order to fulfill those requirements:

  1. Software Programmability – OpenFlow-like Peering fabric
  2. Testability – Loosely coupled components allow software practices to be applied.
  3. Manageability – Large-scale operations must be safe, automated and incremental
  4. Hierarchical control plane – Global and local controllers with different functions allow the system to scale
  5. Fail-static – The data plane keeps the last known good state, so forwarding continues even if the control plane becomes unavailable

Peering Fabric

The Peering Fabric provides the following functions:

  • Tunnels BGP peering traffic to BGP Speakers
  • Tunnels End user requests to TCP reverse proxy hosts
  • Provides IP and MPLS based packet forwarding in the fabric
  • Implements a part of the ACL rules


All the magic happens in the hosts. First, the BGP speakers learn routes from the neighbors and propagate them to the local controller (LC), which then propagates them to the global controller (GC). The GC builds its intent for optimal forwarding of the full Internet routing table and propagates the resulting routes back to the LCs, which install them in the TCP reverse proxy hosts. The same thing happens for security policies.

The BGP speakers are in fact a virtual network function (VNF), that is, a network function implemented on x86 CPUs; routing is a VNF as well, as are the ACLs. Also, notice that the peering fabric itself is not complicated at all: the most-used ACL rules (5%) are there, but the full Internet routing table is not. The hosts make the routing decision and encapsulate the packets, labeling them with the egress switch and egress port of the fabric.

Configuration and Management

The paper mentions that as the LC propagates configuration changes down, it canaries those changes to a subset of nodes and verifies correct behavior before proceeding to wide-scale deployment. The following features are implemented:

  • Big Red Button – the ability to roll back features of the system, tested nightly.
  • Network Telemetry – monitors peering link failure and route withdrawals.
  • Dataplane Probing – End-to-end probes monitor ACL – unclear if OF is used for this

Please refer to the original paper for details. I hope this post is useful to you, and I apologize for any miscommunication. At the end of the day, I’m writing this post for myself more than anything.

Feature and rollout velocity

Google has achieved great results in terms of feature and rollout velocity. Because the system is software-defined, they can leverage their testing and development infrastructure. Over three years, Google has updated Espresso’s control plane >50x more frequently than traditional routers, which would have been impossible without that test infrastructure.

The L2 private connectivity solution for cloud customers was developed and deployed in a few months, without new hardware and without waiting for vendors to deliver new features; again, something unimaginable with legacy network systems. In fact, they state that the same work on the traditional routing platform is still ongoing and has already taken 6x longer.


Traffic Engineering

To date, Espresso carries at least 22% of Google’s outgoing traffic. The global controller allows them to shift traffic from one peering point to another, and the ability to make that choice lets them serve 13% more customers during peaks.

Google caps loss-sensitive traffic to guard against errors in bandwidth estimation. Nonetheless, the GC can push link utilization to almost 100% by filling the remaining capacity with lower-QoS, loss-tolerant traffic.

Conclusion

From the paper: “Espresso decouples complex routing and packet processing functions from the routing hardware. A hierarchical control-plane design and close attention to fault containment for loosely-coupled components underlie a system that is highly responsive, reliable and supports global/centralized traffic optimization. After more than a year of incremental rollout, Espresso supports six times the feature velocity, 75% cost-reduction, many novel features and the exponential capacity growth relative to traditional architectures.”


List of Graduate Networking Readings

This is a list I want to keep for myself and share with others. Soon I’ll put together a compilation of interesting networking readings in a separate post.

Graduate-level networking courses usually don’t have a textbook; instead, they come with long reading lists.
