
What’s left for SDN after the hype?

Say SDN one more time!

The following excerpt is obligatory reading for understanding SDN. It comes from the article “SDN is DevOps for Networking” by Rob Sherwood, written in 2014. If you haven’t read it yet, do it now.

The term SDN was first coined in an MIT Technology Review article by comparing the shift in networking to the shift in radio technology with the advance from software defined radios.

However, the term is perhaps misleading because almost all networking devices contain a mix of hardware and software components. This ambiguity was leveraged and exacerbated by a litany of companies trying to re-brand products under the “SDN” umbrella and effectively join the SDN bandwagon. As a result, much of the technical merit of SDN has been lost in the noise.

The conflict starts with semantics. When researchers refer to SDN, they don’t mean any network defined by software; that’s too simplistic. It refers to a specific group of ideas in computer networking, mostly established around the abstraction of, and separation between, the control and data planes. From that premise, the term has been thoroughly misused in the industry. Further, this separation can be built at different levels of abstraction.

Because of this separation, you can decouple innovation cycles and, for example, optimize your network with a global view. In fact, the most valuable goal of SDN is to decouple innovation cycles: you can improve your network control independently of the underlying technologies. This has significantly improved feature velocity; arguably, we’ve seen more improvement in the industry in the last 10 years than in the previous 30. One of my posts gives some evidence for that.

In that sense, OpenFlow (OF) or P4 provides an abstraction based on a forwarding API, while more traditional network automation builds its abstraction on top of existing device APIs. Different levels of abstraction can therefore deliver on the SDN value propositions to different degrees: some optimization can be achieved without pure SDN, but not all of it.

I’ve seen people call BGP SDN: it runs in software, thus it’s software-defined, right? No. What if I use a route reflector to specify BGP policy from a central point? Okay, now we are talking. A much more thoughtful argument is ‘you need network automation/DevOps, not SDN’. Indeed, for many use cases that’s enough, and that brings me to my next topic:

DevOps for networking

Disclaimer: humanity will agree on the meaning of life before agreeing on the meaning of SDN and DevOps. I’ll take the definition from the article mentioned before: “DevOps infuses traditional server administration with best practices from software engineering, including abstraction, automation, centralization, release management and testing”. I’m pretty sure most people can agree those are desirable practices/outcomes.

Infrastructure as Code (IaC) does give you the opportunity to deliver on most of these:

  • Abstraction
  • Automation
  • Centralization
  • Release management
  • Testing

It provides an abstraction of the control plane, leveraging legacy APIs, that is good enough for a lot of people. Also, it’s much easier to make a case for hiring a couple of developers than for replacing hardware. Network disaggregation at the bare-metal layer has already delivered a lot. Now, can network automation deliver global optimization?

Definitely. That doesn’t mean you should, or that it’s the best or cheapest way. Think about this: to what extent is it worth delivering optimization through automation? What abstractions would make it easier to do so? This becomes an engineering problem in which we evaluate which technology to use to solve the problem.

Traffic Engineering

Take backbone TE, for example: some organizations are perfectly fine running links at 30% average utilization. Some companies are hurt by elephant flows, some are not. Why would you care about TE again? I’m glad I asked; check this post. Jokes aside, a company with effective TE will consistently deliver better user experience at reduced OpEx and CapEx, and eventually those efficiencies will be assimilated by the market. In fact, I’d argue we are starting to observe that. I ask you: which companies are spending the most money expanding their networks? Yes, content providers such as Google and Facebook.

Can we deliver global optimization with RSVP-TE? Sure. How complex is the software system that performs those optimizations? You tell me. What’s the average size of a backbone TE team? I’m sure it’s at least half a dozen CCIEs, a ton of vendors and lots of lower-level support engineers. Most ISPs have not delivered real-time global optimization yet; few of them have even mentioned such optimizations.

Google has publicized some data reporting that its SDN approach supports six times the feature velocity of traditional networking architectures. With debatable logic, one could infer that it’s six times more expensive to deliver TE with traditional architectures.

After more than a year of incremental rollout, Espresso supports six times the feature velocity, 75% cost-reduction, many novel features and exponential capacity growth relative to traditional architectures

How many engineers would you need to develop an OF-based TE system?

One could argue that the underlying technology doesn’t matter as long as the solution is delivered. Fair enough. In that case, I’d ask: which architecture is going to be more extensible and provide the biggest long-term benefit? The one put together with a bunch of hacks, or the one built for this from the ground up?

Conclusion – What’s left after the hype?

The main reasons that started the SDN revolution are still real: the need for reduced networking costs and faster innovation cycles. Pure SDN is not the only way to achieve those: network disaggregation and network automation have delivered on those value propositions, but there’s still plenty of room for improvement. P4 is a great example, as it aims to increase innovation speed on the hardware side of the equation.

It’s no longer a question whether one can reduce costs with new practices in networking. Expensive operations will be replaced piece by piece; it started in the data center, and now the backbone is the new target. Additionally, I speculate that due to 5G, the regulatory framework is going to change a lot in 2019, allowing innovative companies to increase market share significantly. Effective innovation strategies will allow the best ISPs to expand at a much-reduced cost, and the market will see that sooner or later.


Heard about GitOps?

Howdy! This is just a reading recommendation. I recently stumbled upon an article that is INCREDIBLE. Definitely a must-read.

GitOps: A Path to More Self-service IT

I’m just going to paste the best excerpts from the article:

To recap, a GitOps system evolves like this:

  1. Basic — configs in repo as a storage or backup mechanism.
  2. IaC — PRs from within the team trigger only CI-based deployments.
  3. GitOps — PRs from outside the team, pre-vetted PRs, post-merge testing.
  4. Automatic — Eliminate the human checks entirely.

 

GitOps lowers the cost of creating self-service IT systems, enabling self-service operations where previously they could not be justified. It improves the ability to operate the system safely, permitting regular users to make big changes. Safety improves as more tests are added. Security audits become easier as every change is tracked.

Anyway, go read it.


Test-driven Network Automation

It’s been a while. In my last post, I narrated my experience at the NANOG 72 hackathon where I started working on a canarying project. I’m going to dive deeper into the underlying concepts for Test-driven Network Automation.

Why?

Justifiably, there’s currently a big push for Infrastructure as Code (IaC) in networking. IaC is pursued in order to enable modern software processes on infrastructure. The main benefits we are looking for are agility and predictability: agility meaning faster feature-delivery cycles, predictability meaning fewer outages, since automating deployment reduces human mistakes during maintenance. By doing so, you enable your team to collaborate more effectively and compound productivity gains by improving code, ultimately allowing you to run a huge network with a small team.

As a side note, I believe the efficiencies developed in webscale companies like Facebook, Google and Microsoft will be assimilated by the market sooner or later. Current network operations teams at telcos (Verizon, AT&T, Comcast, Charter) are orders of magnitude bigger than the webscalers’ teams. So, ultimately, I believe OpEx pressure will slowly push inefficient practices out of the market.

How?

CI/CD is fairly well defined as a software practice. The question is how we apply it to network automation. The following is a good representation of the process, supplied by Juniper (I think):

  1. Make changes
  2. Pull Request
  3. Peer-review – Automation code is reviewed
  4. Dry-run – Dry-run against a lab or production
  5. Notify results – Config Diffs, Errors?
  6. Approve
  7. Canary changes until the whole system is upgraded or rollback changes

Now, that’s a fair process. The missing part here is test automation. Augmenting the process with test automation allows bugs to be found faster, reducing outages. Networking tests can basically be summarized into 5 categories:

  • Config checks (format)
  • State checks (ARP table entries, routing table, BGP neighbors)
  • L2 reachability health
  • L3 connectivity health
  • Application health

I discuss some of these tests later in this article. The remaining piece is to do the canarying properly, so I’d augment the deployment phase:

  1. Record baseline health-state
  2. Deploy changes to a subset of nodes
  3. Wait/Gather data
  4. Observe alarms
    • After quarantine wait-time has passed, increment change subset and go back to step 2.
    • If alarms are unacceptable, rollback change

In this way, you guarantee that only a subset of your network is affected by possible errors.
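To make that loop concrete, here is a minimal sketch of the augmented deployment phase in Python. The record_baseline, deploy, alarms_ok and rollback callables are hypothetical stand-ins for whatever deployment tool and health checks you actually use (an Ansible run, a probe of BGP state, latency data, and so on), so treat it as pseudocode with running syntax rather than a drop-in implementation.

import time

QUARANTINE_WAIT = 300  # seconds to let a batch "soak" before growing the canary


def canary_deploy(nodes, record_baseline, deploy, alarms_ok, rollback, batch_size=1):
    """Deploy a change in growing batches, rolling back if health degrades.

    The four callables are assumed helpers wrapping your own tooling:
    record_baseline() captures health data, deploy(batch) pushes the change,
    alarms_ok(baseline) compares current health against that baseline, and
    rollback(nodes) reverts everything touched so far.
    """
    baseline = record_baseline()              # 1. record baseline health state
    deployed = []
    remaining = list(nodes)

    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        deploy(batch)                         # 2. deploy changes to a subset of nodes
        deployed.extend(batch)

        time.sleep(QUARANTINE_WAIT)           # 3. wait / gather data
        if not alarms_ok(baseline):           # 4. observe alarms
            rollback(deployed)                #    unacceptable -> roll everything back
            raise RuntimeError("canary failed after %d nodes" % len(deployed))

        batch_size *= 2                       # healthy -> increment the change subset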

Ultimately, application health data should drive this. But usually that data is not easily consumable because of team silos, or it’s simply difficult to get a small set of application-level metrics that deterministically tell you the network has a problem. So we fall back to L3 connectivity. By L3 connectivity we basically mean latency, loss and throughput. The only way to get the actual data is to actively measure it, and the easiest open-source tool out there for doing this programmatically is ToDD.
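If you are not ready to pull in a full measurement framework yet, even a crude active probe gives you latency and loss numbers you can alarm on. The sketch below shells out to ping and parses its summary output; it is Linux-specific (iputils output format) and deliberately strict, so adapt the thresholds and swap in your real prober (ToDD or otherwise).

import re
import subprocess


def probe(host, count=10):
    """Return (loss_percent, avg_rtt_ms) for a burst of ICMP probes.

    Parses iputils ping's summary lines, so it is Linux-specific; replace
    with whatever prober you actually run in production.
    """
    out = subprocess.run(
        ["ping", "-c", str(count), "-q", host],
        capture_output=True, text=True, check=False,
    ).stdout
    loss_m = re.search(r"([\d.]+)% packet loss", out)
    rtt_m = re.search(r"= [\d.]+/([\d.]+)/", out)          # min/avg/max/mdev
    loss = float(loss_m.group(1)) if loss_m else 100.0
    rtt = float(rtt_m.group(1)) if rtt_m else float("inf")
    return loss, rtt


def healthy(hosts, max_loss=0.0, max_rtt_ms=50.0):
    """Deliberately strict: any loss or unexpected latency fails the check."""
    return all(loss <= max_loss and rtt <= max_rtt_ms
               for loss, rtt in map(probe, hosts))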

What could go wrong?

Assessing health state is already a pretty difficult problem. It would be great if we had a set of simple metrics to assert connectivity, but if that were trivial, half of us network engineers wouldn’t have jobs. For example, although a failed ping necessarily means something went wrong, a successful ping isn’t enough to say a change went well. Basically, we either don’t have enough information to assess state properly, or we have so much information that assessing state is hard. I’m unaware of a good solution for handling too much information, but I feel like this would be a good use case for machine learning. All of which is to say that the mechanism chosen to assess health state may well not suffice.

The second thing is that even if your state-assessment mechanisms do suffice, imagine the post-change state is incompatible with the previous state; for example, you are changing your BGP password. In that case, the intermediate steps of the change do not present full connectivity, and canarying doesn’t make much sense. This scenario comes up more often than you would wish, since a lot of network changes exist to fix something.

Another challenge is that sometimes you just can’t replicate the current state of production in your development environment, so you can’t really develop a procedure that executes a change with zero downtime. Imagine, for example, that you developed a change procedure that works in your development environment, but when you push the change to the first subset of switches a loss of redundancy is detected and you abort the change. This reduces the throughput of changes executed by your team. At some point, the risk-acceptance level of the change may need to be reclassified in order for work to get done.

How do I profit from this?

Canarying gives you the opportunity to identify a bug before it compromises your whole network, and it reduces detection time since verification is now automated. Say, for example, you pushed a configuration change with a broken route-map, invalidating some routes to your systems. A good detection system plus a blue/green style of deployment would contain the outage caused by the misconfiguration.

At the end of the day, I believe what determines the productivity of your team is how fast you can find issues. By adopting test-driven practices you reduce detection time, and thus the idle time of your team, improving productivity.


Network automation with Ansible : Canarying / Rolling upgrades

In this post, I describe my experience at the NANOG 72 Hackathon and explain the project I worked on. If you want to see the work, my diffs can be found here.

The Hackathon lasted 8 hours and involved engineers of all levels. It started with a presentation on network automation tools, followed by team-forming: anyone with an idea shouted it out to the group, and teams self-organized. I shared an idea for canarying automated network changes. Let me explain.

A typical challenge in network management is rolling out changes fast and reliably. Infra-as-code already goes a long way toward removing the human element from the equation; if you have no clue what I’m talking about, check this video. Yet your automation has the potential to cause outages if a bug slips through testing. A good principle for automation is canarying: performing changes on a subset of endpoints and verifying the health state of the system before proceeding, to prevent the wide-scale spread of a mistake.

Examples of things that can be canaried are patches, OS updates, critical software updates and config changes. When using the good ol’ CLI, one always checks the state of the system to see if the changes look good; canarying is analogous to that. Say you need to change the routing software on all your switches from Quagga to FRR. Wouldn’t it be great if you could perform those changes fast while still making sure your network is working perfectly? Yes! The alternative is running your scripts like a savage, only to realize later that a typo slipped through code review and blackholed all your data center traffic.

Now, back to the Hackathon. I convinced my team to leverage a Vagrant environment plus Ansible code open-sourced by Cumulus. Leveraging open-source material was a great way to bootstrap development and get to a baseline faster, so we could focus on the canarying solution (that may have been a mistake). It turned out the internet was saturated and it took us more than an hour to download the boxes. At noon I was able to bring up the environment, only to find out that those boxes no longer supported Quagga. That brought us to around 1 PM, and my team was getting a little impatient. Folks were discussing pivot options, but I decided to finish the work by myself. So I deserted my team, or they deserted me; you choose.

Anyway, I changed the Ansible scripts to run FRR and finally had a running topology around 2 PM. After that, I started playing with Ansible and quickly found the feature I needed: Ansible lets you configure how many hosts it runs against at a time with the “serial” field. Set serial to 1 and the play runs against one host at a time. DONE!!! RIGHT? Not yet.

Now it became interesting, and I grabbed a couple of people to discuss how to actually implement this. Basically, the discussion revolved around defining what is a sufficient condition to accurately determine the state of the system:

  1. Ping. Although a failed ping implies a network failure, its success doesn’t mean the change caused no problems.
  2. Route table of the modified switch. That doesn’t guarantee the whole system is in a good state; for example, things could look fine on the switch while a route has been withdrawn on the other side of the network.
  3. Global routing table state. That one is about as right as you can get; the challenge is how to collect that information.
  4. Some folks on the team mentioned they normally have monitoring systems running to detect anomalies and remediate them with scripts. But you don’t know what you don’t know, so I’d advocate for canarying in addition to that.

Okay, so I quickly did #2 and realized it wasn’t good enough. I should have stopped coding there and made a powerpoint, but I decided to keep hacking instead. I wasn’t able to complete #3 and kinda just mumbled the content of this post. It was cool. Overall, it made my Sunday productive and I wouldn’t have done anything like this otherwise.

I finished this project a few weeks later. I wrote a Flask web service that returns the number of working BGP neighbors; from Ansible I can call that API and assess the global state. Let’s go for a tutorial.

Canarying tutorial

To run this you need Vagrant 2.02. To start, bring up the infrastructure:

 git clone https://github.com/cumulusnetworks/cldemo-vagrant
 cd cldemo-vagrant
 sudo vagrant up oob-mgmt-server oob-mgmt-switch leaf01
 sudo vagrant up leaf02 leaf03 leaf04 spine01 spine02 

This should bring up a leaf-spine infrastructure not yet configured. To configure it do the following:

 sudo vagrant ssh oob-mgmt-server
git clone https://github.com/castroflavio/cldemo-roh-ansible
cd cldemo-roh-ansible
ansible-playbook start.yml

The next step is to run the change.yml playbook:

---
- hosts: spines
  user: cumulus
  become: yes
  become_method: sudo
  roles:
    - ifupdown2
    - frr
    - statecheck
  serial: 1
  max_fail_percentage: 50

The magic lies in the serial field, which stipulates that changes are deployed sequentially, 1 node at a time in this case. Whenever a change breaks the network, the statecheck role fails and the play stops:

- uri:
    url: "http://{{ item }}:5000/"
    return_content: yes
  register: webpage
  with_items:    
    - '192.168.0.11'
    - '192.168.0.12'
    - '192.168.0.13'
    - '192.168.0.14'

- debug: var=item.content
  failed_when: item.content|int < 2
  with_items: "{{ webpage.results }}"

It reaches out to every leaf (192.168.0.1*) and gets the number of BGP neighbors through a REST API. The web service code can be found here.
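For reference, the web service can be as small as a single Flask route. The sketch below is my minimal reading of the idea, not the exact code linked above: it assumes vtysh (Quagga/FRR) is available on the leaf and counts how many sessions in “show ip bgp summary” report a received-prefix count, i.e. are established.

# Minimal sketch of the health endpoint: count established BGP sessions.
# Assumes vtysh (Quagga/FRR) is installed locally; the parsing is approximate.
import subprocess

from flask import Flask

app = Flask(__name__)


def established_bgp_neighbors():
    out = subprocess.run(
        ["vtysh", "-c", "show ip bgp summary"],
        capture_output=True, text=True, check=False,
    ).stdout
    count = 0
    for line in out.splitlines():
        fields = line.split()
        # Neighbor rows start with an IP; the last column is either a prefix
        # count (session established) or a state such as Active/Idle.
        if fields and fields[0].count(".") == 3 and fields[-1].isdigit():
            count += 1
    return count


@app.route("/")
def neighbors():
    # The Ansible failed_when check compares this number with the expected count.
    return str(established_bgp_neighbors())


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)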

Thanks for reading the long post; do not hesitate to share any questions or thoughts.


What’s happening in 2018?

I was going through old blog posts and stumbled on this article where I explained my views on the state and future of the networking industry in Jan ’17. In this article, I’d like to do the same for 2018. 2017 was the year of hype; I believe 2018 is going to be a competitive year.

To start, the first specs for 5G are out and spectrum auctions are happening. That means new players are coming and markets will be shaken up; folks with better network management systems, like Google, will definitely make big investments.

Other than the chase for 5G, reducing costs and increasing revenue are still the main drivers of innovation. Nothing changed there, and it won’t change in the next 100 years. In 2017 we saw increasing demand for network innovation from enterprises, driven by the maturation of SDN-like technologies.

The biggest achievement of 2017 was the network disaggregation started by OCP. Broadcom keeps announcing jaw-dropping specs for upcoming chipsets. Still, competition in that area looks healthy, with Mellanox, Cavium and the P4 ecosystem. On the OS side, Cumulus, Big Switch, SnapRoute, Arista and even OpenSwitch offer plenty of choices and are still growing steadily. Additionally, AT&T is putting effort into an open-source OS of its own.

I wish the industry would move towards open-source and extremely low-cost networking solutions. But although folks want costs to come down, that’s not the sole priority. By embracing open source, ISPs expose themselves to risks and extra costs. It’s common for businesses to transfer risk, and they are willing to pay good money for that. Enterprises are just as risk-averse as ISPs.

I think there’s an opportunity for commercial support of open-source solutions; the challenge lies in finding the proper incentives. Entrepreneurs have little motivation to invest in a specific technology like OpenStack or Kubernetes because it’s ephemeral, so their incentive lies in cultivating customer relationships and learning the technology as needed. Further, companies seem to find it hard to hire labor specialized in open-source projects (some would argue corporations are terrible at hiring in general) and end up crossing those options off from the beginning. Winning players like Amazon, Google and Facebook ignore that cost and feast on the benefits by hiring skilled people into agile teams, increasing their learning rate.

I may be wrong here, but I see Juniper making a strong move into software with Contrail and even Junos Space, while Cisco is still Cisco. In other words, the biggest factor ensuring market dominance today is customer relationships rather than winning technology, but technology always wins in the long term. In raw networking specs, I doubt they will catch up to Broadcom. Only time will tell.

All of this leads me to believe 2018 will kick off big changes in the industry. We won’t see these changes show up in financial results until at least 2019, but I bet the winning ISPs of 2020 will be the ones making the right technological choices now. Also, notice that market share in the industry has not changed much in the last 5 years.


MPLS Traffic Engineering – Review

I wanted to review the basics of MPLS and Traffic Engineering (TE), so I went to my favorite networking blog, searched for RSVP, and found a couple of great articles.

Although the articles were incredible and clearly explained the technologies, they also clearly demonstrated how complex ‘legacy’ MPLS technologies are. UPDATE: I recently found out about PacketDesign and got very excited by the material they put out there. Their white paper on MPLS-TE is one of the best pieces I’ve seen on the subject! I urge you to check it out.

This article is divided into 4 sections: First, I mention reasons for MPLS forwarding. Second, I go through some of the motivations behind Traffic Engineering technologies. Then, I briefly explain Segment Routing, and I conclude with a tutorial on how ONOS can achieve TE using an SR SDN application on top of OpenFlow.

Why MPLS at all?

To reduce network state.
Today, the full Internet routing table includes more than 600,000 routes. Routing that by itself is already complicated. Now, if you took different paths for different Classes of Service (CoS), you could easily reach 2M entries. With MPLS, you can aggregate several network prefixes into labels, reducing that state drastically. The articles I mentioned at the beginning go through some of those numbers. A Segment Routing (SR) architecture can reduce this number even further, to the order of the number of network devices. PS: SR can also be achieved with IPv6 encapsulation.
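As a toy illustration of that state reduction (not how any real router implements its label table), the sketch below maps many prefixes to a single label per egress node, so the core only needs one forwarding entry per label instead of one per prefix.

# Toy illustration of MPLS-style state reduction (not a router implementation):
# the edge maps prefixes to the label of their egress node, so the core only
# needs one forwarding entry per label/egress instead of one per prefix.
prefix_to_egress = {
    "203.0.113.0/24": "PE-west",
    "198.51.100.0/24": "PE-west",
    "192.0.2.0/24": "PE-east",
    # ... hundreds of thousands more prefixes live only at the edge
}

egress_to_label = {"PE-west": 16001, "PE-east": 16002}     # one label per egress

core_lfib = {label: "shortest path towards " + node        # core state: per label
             for node, label in egress_to_label.items()}

print(len(prefix_to_egress), "edge prefixes ->", len(core_lfib), "core entries")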

Why Traffic Engineering?

To save money!! $$$$

Diptanshu Singh explains this subject wonderfully, so I urge you to check his article if you need a more detailed explanation.

For instance, say the Comcast network in your neighborhood has 1 Gbps of VoIP and 4 Gbps of data traffic demand. It’s overprovisioned 2:1 (links run at 50% utilization), so its 10G links suffice at the moment. Now suppose traffic increases 20% next year; sustaining this strategy would require an immediate upgrade of the infrastructure.

A Diffserv strategy would change the resource allocation rates: one could instead allocate a 2x overprovision rate for VoIP and 1.2x for data. With next year’s 20% growth, that requires about 2.4 + 5.8 Gbps of capacity (voice: 1 Gbps × 1.2 growth × 2; data: 4 Gbps × 1.2 growth × 1.2). The year after, you would need about 2.9 + 6.9 Gbps, still under 10G. With this approach, Comcast can delay its backbone upgrade by two years and still adhere to the SLAs required for sensitive traffic.

With the first rule, your expansion rate is dictated by total traffic growth, because you must keep overall network utilization low. In the second case, your expansion rate is driven by critical-traffic growth and by the networking equipment life-cycle (at your convenience). Critical traffic here is only a fifth of the total, so your expansion rate can be several times lower if you are willing to run best-effort traffic hotter.
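Here is the arithmetic of the example as a small script, so you can play with the growth and overprovision rates yourself; the demands and rates are the illustrative figures from the paragraphs above, not real Comcast data.

# Worked version of the example above: required capacity under a single
# overprovision rate vs. per-class (Diffserv) overprovision rates.
VOICE, DATA = 1.0, 4.0          # Gbps of demand today
GROWTH = 1.2                    # 20% traffic growth per year
LINK = 10.0                     # Gbps of installed capacity


def uniform(year, rate=2.0):
    """Everything overprovisioned at the same rate (50% max utilization)."""
    return (VOICE + DATA) * GROWTH**year * rate


def diffserv(year, voice_rate=2.0, data_rate=1.2):
    """Sensitive traffic kept at 2x headroom, best-effort allowed to run hotter."""
    return VOICE * GROWTH**year * voice_rate + DATA * GROWTH**year * data_rate


for year in range(3):
    print("year %d: uniform %5.1f Gbps, diffserv %4.1f Gbps, link %.0fG"
          % (year, uniform(year), diffserv(year), LINK))

# year 0: uniform 10.0, diffserv 6.8  -> both fit
# year 1: uniform 12.0, diffserv 8.2  -> uniform needs an upgrade, Diffserv does not
# year 2: uniform 14.4, diffserv 9.8  -> Diffserv still (barely) fits in 10G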

Now you have the opportunity to cut your expansion budget several-fold and invest that money in engineering power. I’m sure that’s what Google saw 10 years ago when it started heavily investing in its networking technology. Bad vendors will often say ‘you don’t need QoS or Traffic Engineering; the problem can be solved with more bandwidth’. That’s a convenient message if you sell bandwidth.

Why Segment Routing?

I wanted to compare legacy technologies (RSVP, LDP) with SR, but I realized that’s pointless. To me, the only reason to use legacy is backward compatibility with existing equipment. Don’t get me wrong, RSVP will get the job done; you may not be able to afford replacing it with SR, or maybe your RSVP infrastructure works perfectly and you already have proper processes in place.

That all said, SR is just simpler and better. To learn more about RSVP check for yourself: http://packetpushers.net/rsvp-te-protocol-deep-dive/. If you know nothing about SR check http://www.segment-routing.net/.

In summary, SR is a network architecture that allows the network core to keep no per-flow state. Rather than forwarding packets only on the IP destination address, they are forwarded based on segment addresses. The network maintains shortest-path forwarding state to each segment, plus backup paths to implement fast reroute. Fast reroute by itself is worth money: SR TI-LFA allows for sub-50ms failure recovery.

Additionally, the architecture allows you to enforce loose source routing. Say, for example, that the OSPF IGP gives you a 40ms path; to steer your VoIP traffic through node 104 instead, you would just change the routing at the edge of the network to include that segment before the final destination.


Tutorial

I already wrote a tutorial on this 2 years ago. I’m just going to highlight the main points.


In this configuration, you have a cluster of 3 ONOS SDN controllers controlling a leaf-spine fabric. The entry nodes do a route lookup and encapsulate the packets with the MPLS label corresponding to the exit node. The packet is then forwarded along the shortest path based on the MPLS label. That’s basic IP forwarding. The cool thing here is the ability to programmatically set up forwarding tunnels.

Let’s say you want all Netflix traffic to go through spine s105, so that all web and voice traffic has the other spines’ worth of bandwidth and thus lower delays. You could establish a tunnel in the following way.

A tunnel is defined as a set of labels describing the path taken by a flow. The following command instantiates a tunnel called FASTPATH through routers 101, 105 and 102, in that order:

onos> srtunnel-add FASTPATH 101,105,102

Then a policy can be applied to a subset of traffic, for example: policy1 = tcp_port=80 >> fwd(FASTPATH)

onos> srpolicy-add p1 1000 10.1.1.1/24 80 10.0.2.2/24 80 TCP TUNNEL_FLOW FASTPATH

These tunnels can be used to enforce TE policies, guarantee SLAs and improve network utilization.

Conclusion

A Segment Routing network combined with a centralized controller for path computation can enable advanced Real-Time traffic engineering capabilities. In this way, Segment Routing is a perfect match for SDN.

The SDN applications have already been developed and made available in open-source projects like ONOS. The Segment Routing app mentioned has evolved into TRELLIS, the networking fabric that supports the CORD project. I urge you to check out their work.

Please reach out to me if you have any questions regarding how one could move forward and implement this.

 


DevOps essentials: Developer Environment

“90% of coding is debugging, the other 10% is writing bugs.”

If you develop any code, you know how true this is. It implies that the speed at which you correct bugs is what really dictates your feature-release velocity. This relates to a critical part of DevOps: CI/CD. Continuous integration allows bugs to be found faster; continuous deployment allows fixes to be released to production faster.

In this post, I talk about the importance of a stable Developer Environment (Dev Env) and present a short tutorial on how to set up a virtual routing environment with open source tools provided by Cumulus.

Why bother at all?

A dev environment for testing your automation scripts is essential for any effective development team. It allows devs to test their code in minutes rather than days, independently of others, increasing their velocity drastically. It also improves collaboration by making sure all devs start from the same stable point.

Here is an example of how bad this can be: Company ABC has only one testing environment in which all testing happens. Say a good scheduling process allows 50% utilization, and each dev takes 1 hour to run a script and verify its outcome plus an additional hour to clean up the environment. A work day has 8 hours, of which 4 are usable, so only 2 tests can be performed per day. In this case, no matter how big your team is, you can only push a total of 2 features or fixes per day. This is pretty much a fixed cap on team productivity.
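The bottleneck math is trivial but worth writing down; here it is as a tiny calculation you can adapt to your own scheduling assumptions (the figures are the hypothetical ones from the example above).

# Shared-lab throughput cap from the example above (hypothetical figures).
hours_per_day = 8
utilization = 0.5                      # what a good scheduler achieves
hours_per_test = 1 + 1                 # run + verify, then clean up

tests_per_day = hours_per_day * utilization / hours_per_test
print(tests_per_day)                   # 2.0 -- the cap, regardless of team size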

Technology choices

I’ve been wanting to set up a virtual environment for a while. At first I tried Mininet; while it suffices for routing with Quagga, it required too much work to set up in a way I could use for testing scripts. The other options were eve-ng (uNetLab), GNS3 and Vagrant. I crossed GNS3 off right away because the overhead and learning curve are significant. If I had to run IOS I’d go for eve-ng, but I don’t. Additionally, the learning curve for Vagrant is shorter, so it’s useful to more people.

Kudos to Cumulus for open-sourcing this. It’s great to see vendors contributing multi-purpose code rather than just taking code from others or promoting their own technologies, and I urge you to check out their work. You can start here.

Quick start

Disclaimer: this code is available on the Cumulus GitHub. Before running this demo, install VirtualBox and Vagrant.

### Bring up the vagrant topology
git clone https://github.com/cumulusnetworks/cldemo-vagrant
cd cldemo-vagrant
vagrant up oob-mgmt-server oob-mgmt-switch leaf01 leaf02 spine01 spine02 server01 server02

### setup oob mgmt server
vagrant ssh oob-mgmt-server

### Run the ROH demo
git clone https://github.com/cumulusnetworks/cldemo-roh-ansible
cd cldemo-roh-ansible
ansible-playbook run-demo.yml

### check reachability of server02 from server01
ssh server01
wget 10.0.0.32
cat index.html

Custom topology

You only need Vagrant, VirtualBox and git to run this tutorial. Vagrant is really simple: once you have a Vagrantfile, you need exactly one command to bring your environment up. The tricky part is building the Vagrantfile; Cumulus provides not only templates but also a tool to create the file from a topology file: the topology converter.

(Figure: default cldemo reference topology)

The figure above shows the default topology; you can change it by editing the topology file and then running the topology converter again:

python topology_converter.py topology.dot

After that, simply run vagrant up and boom:

vagrant up
vagrant ssh oob-mgmt-server

Conclusion

In this post, I talked about the importance of a stable developer environment and how that fits into the DevOps framework. I also gave an example of how to establish an environment.

This links back to the SDN value proposition: the ability to apply software processes to improve your infrastructure. But people have virtualized network devices for years; there’s nothing new there. True, yet a significant share of network operators still doesn’t have a virtual environment to test their systems in. On the other side, with the rise of Open vSwitch and the creation of Mininet, SDN developers started using network emulation to develop SDN systems and have taken it as a given ever since, yielding increased development agility.

This also leads me to some thoughts on how P4 can improve enterprise systems, but I’ll leave that for a future post. Again, please let me know your thoughts on this.


Can P4 save Software-Defined Networking?

P4 is gaining momentum due to the engagement of big players such as Google and AT&T. It has the potential to cause a significant change in the industry and deliver on the SDN value proposition. I’d like to discuss that.

In summary, P4 pursues 3 main goals:

  • Reconfigurability
  • Protocol independence
  • Target independence

OpenFlow had its shortcomings: somehow, diversity of implementation strategies evolved into incompatibility. P4’s target independence proposes to solve this issue by using a compiler to translate P4 code into switch-specific code, taking the target’s capabilities into account.


To understand how disruptive this is, let’s look at the current state of affairs: commodity silicon vendors such as Broadcom and Mellanox already have APIs to control their switches, and the existence of those APIs has already disrupted the industry, enabling Cumulus, SnapRoute and even Arista. Now, would you prefer that your silicon vendors establish a common interface, or would you rather rewrite software every time you want to test a new switch vendor? The answer is obvious: the first option benefits users and new vendors, the second benefits established vendors.

So that’s the big pay-off opportunity: enabling competition, and thus innovation. The challenge here is giving vendors the incentive to write the P4 compilers.

New industry players or adventurous operators, on the other hand, could write software on top of P4 and achieve multi-vendor integration at the cost of writing compilers for each vendor they use. That can be game-changing; the big questions are “How eager are developers to write P4 software?”, “How much does it cost to hire somebody to do it?” and “Who will write Cisco- or Broadcom-specific P4 compiler code?”

There are endless possibilities. In a parallel universe, AT&T forces Cisco to ship a P4 compiler for its devices; Cisco writes a bad compiler, claims P4 is bad technology and sells you ACI instead. In a different universe, Barefoot writes a Broadcom compiler and makes sure it works, but then it “wastes” resources promoting a competitor. A little more realistically, SnapRoute or Cumulus could write a P4 compiler for Broadcom Tomahawk and thus enable their software on a plethora of existing devices. Even more realistically, Barefoot writes its own compiler for Tofino and keeps selling P4 to a limited niche market.

Now, if Barefoot took on the responsibility of writing P4 compilers for Broadcom and Mellanox, that would translate into huge value for NOS vendors and operators, since they would be able to switch vendors seamlessly. But it would only marginally increase adoption of Tofino, so the question remains: who would pay for this?

Now how much does it cost to adopt P4?

Before I answer this question, I’m going to call back to a previous post where I wrote about network disaggregation. I ended it asking: “Does OpenFlow effectively lock you in?” Now the same question applies to P4.

The question itself is misleading. I’ve heard vendors say “OpenFlow locks you in, you might as well just buy our SDN”. There’s just so much wrong with this. OpenFlow isn’t perfect, but it does allow you to adopt software processes to deliver features much faster than your vendor will.

Any choice is a potential barrier and locks you in a little bit, but what everybody means when talking about lock-in is hardware lock-in. When you buy a generic x86 computer, you are free to install Ubuntu, Debian, Windows or whatever you’d like. When you buy a PlayStation, you can’t just install the Xbox software on it; that’s vendor lock-in. The cost of doing so is prohibitive, and you’d be better off just buying another appliance.

You could, at almost no cost, try an OpenFlow lab or field trial on Broadcom-based network devices and fall back to Cumulus if it doesn’t fulfill your needs. Unsurprisingly, the vendors will claim lab trials aren’t needed because of their product quality, but experience says there will always be a missing feature.

Now P4: from the adventurous perspective, P4 is great, you just have to write more software to get it done. For everybody else it has a significant cost: you have to hire premium developers, or Barefoot itself, to do it. That cost is hard to justify when Broadcom plus Big Switch might already give you the tools to improve your current process.

OpenFlow vs P4

OpenFlow will be 10 years old next year, and a significant amount of resources has been put into testing it. It has been (properly) commercially supported by Big Switch for 3+ years, if I’m not mistaken. I’d say with confidence that you could get an OpenFlow solution production-ready in a year. Realistically, could you get P4 deployed in production in a year?

Misconceptions:

  • Will P4 replace OpenFlow? Maybe. P4 offers a different value proposition, and OpenFlow agents may be written on top of P4. Great P4 implementations may eventually render OF obsolete.
  • Will P4 replace the Broadcom SDK? Same answer: someone may write a much better API in P4 on top of theirs.
  • Will P4 replace OpenNSL? Why not?
  • Will P4 replace NetFlow/sFlow? No. sFlow is a protocol for exporting data from the switches; it doesn’t say (much) about how you should implement it in the data plane.
  • Will P4 replace Riverbed? No way.
  • Will P4 replace OpenConfig? Nope, they are actually quite complementary.

Thanks for reading the long post. I welcome any thoughts or questions.


TCP BBR Congestion Control on Mininet

In this post, I demonstrate some benefits of using BBR congestion control and illustrate how easy it is to adopt it by using Mininet as an example. I’m excited to share this post with you guys because it’s been a while since I’ve made a tutorial and I love breakthrough innovations like this.

This post is divided into three sections: Background on BBR, Tutorial and Technical challenges.

Background on BBR

TCP BBR has significantly increased throughput and reduced latency on Google’s internal backbone networks. From this great resource:

TCP BBR is rate-based rather than window-based; that is, at any one time, TCP BBR sends at a given calculated rate, instead of sending new data in response to each received ACK. In particular, TCP BBR does not directly link the sending of new data to the receipt of ACKs, and so, strictly speaking, is not actually a sliding-windows implementation. Therefore, we cannot properly talk about winsize or cwnd. Instead, we talk about the number of packets in flight, which is the rate times RTTactual, with the understanding that this number may vary with conditions.

Basically, BBR estimates bandwidth by keeping track of goodput: if an increase in the sending rate does not increase the observed goodput, it assumes it has found the available bandwidth. It is reasonably effective at doing so, and that way it keeps queueing in the network minimal.
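As a rough mental model (not the kernel implementation, and ignoring BBR’s startup/drain/probe state machine), the core of the estimator is just two windowed filters plus a pacing rate derived from them. The class below is purely illustrative.

from collections import deque

# Rough mental model of BBR's estimator (not the Linux implementation):
# bottleneck bandwidth = windowed max of observed delivery rate,
# propagation RTT      = windowed min of observed RTT,
# and the sender paces at pacing_gain * estimated bandwidth while keeping
# roughly one bandwidth-delay product of data in flight.

class BBRModel:
    def __init__(self, window=10):
        self.rates = deque(maxlen=window)   # delivery-rate samples (bytes/s)
        self.rtts = deque(maxlen=window)    # RTT samples (seconds)

    def on_ack(self, delivered_bytes, interval_s, rtt_s):
        self.rates.append(delivered_bytes / interval_s)
        self.rtts.append(rtt_s)

    @property
    def btl_bw(self):
        return max(self.rates) if self.rates else 0.0

    @property
    def rt_prop(self):
        return min(self.rtts) if self.rtts else float("inf")

    def pacing_rate(self, gain=1.0):
        return gain * self.btl_bw

    def inflight_target(self, gain=1.0):
        return gain * self.btl_bw * self.rt_prop   # ~1 BDP of data in flight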

TCP’s throughput is inversely proportional to RTT, and most TCP implementations add queueing delay of their own; as a consequence, TCP by itself struggles to reach 100% utilization. BBR changes that, which is why it’s such an impressive accomplishment.

Quick start

Open source is great because it allows innovation to be deployed much faster: BBR is already implemented in the Linux kernel, and using Mininet you can test it right away.

I’m a long-time fan of Stanford’s Reproducing Network Research website; I leveraged most of the Mininet code for this experiment from there.

Now let’s get to it!! This tutorial assumes you have Vagrant and git. If you don’t, don’t panic; follow this link. To start, you will need to set up the VM. I took care of all the dependencies for you; if you want to inspect what I’m doing, take a look at the mininet role in the ansible folder.

git clone https://github.com/castroflavio/bbr-replication/
cd bbr-replication
git checkout vagrant
vagrant up

This should take about 10 minutes to complete. After it’s done, proceed:

vagrant ssh
cd mininet
sudo ./figure5.sh all

After around 30 seconds the experiment should be done and you can exit the VM:

exit
open figure5_mininet/figure5_mininet.png

This should open figure5_mininet.png.

The figure compares the latency of TCP BBR and TCP CUBIC (less is better). As you can see, BBR reduces latency from ~150ms to ~50ms (66%) in the average case and from 400ms to 50ms (87%) in the worst case. This is crazy!

Technical challenges

The first technical challenge is finding a Linux kernel that implements BBR; it turns out it landed in 4.9, so look out for that. The second challenge was setting up the BBR pacing mechanism; it was mentioned on the CS244 website, but I did not understand it at first.

BBR requires a mechanism to control the sending rate, and it leverages the tc (traffic control) module from Linux. I knew about tc, but I didn’t know it was such a powerful tool. After some research on Linux queueing mechanisms, I found that BBR requires the fq (fair queueing) queueing discipline, because it uses fq’s pacing to rate-control the sender. It turned out Mininet did not support fq for some reason, and I had to change a couple of lines of code to add support for it.
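If you want to flip a Mininet host over to BBR by hand, something like the sketch below usually does it. It assumes a 4.9+ kernel with the tcp_bbr module available and uses OVSBridge switches so no controller is needed; also note that Mininet hosts share the kernel, so the sysctl is effectively global.

from mininet.net import Mininet
from mininet.node import OVSBridge
from mininet.topo import SingleSwitchTopo


def enable_bbr(host, intf):
    """Switch the stack to BBR and attach the fq qdisc it paces with."""
    host.cmd("sysctl -w net.ipv4.tcp_congestion_control=bbr")   # shared kernel: global
    host.cmd("tc qdisc replace dev %s root fq" % intf)          # pacing qdisc


net = Mininet(topo=SingleSwitchTopo(k=2), switch=OVSBridge, controller=None)
net.start()
h1, h2 = net.get("h1", "h2")
enable_bbr(h1, "h1-eth0")        # the sender side is what matters for BBR
print(h1.cmd("sysctl net.ipv4.tcp_congestion_control"))
net.stop()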

Conclusion

TCP has been around for decades, and for decades people have been trying to improve it. TCP’s congestion control mechanism literally saved the internet back in the day; now I’m gonna be bold and say that BBR, by providing nearly queueless congestion control, is saving latency-sensitive applications. It really is a big deal. I highly encourage you to try it out; the least you should do is check out the following article: Increase your Linux server Internet speed with TCP BBR congestion control.

For future reference:


ESPRESSO – More insights into Google’s SDN

Google recently released a paper detailing how it designed and deployed Espresso, the SDN at the edge of its network. The paper was published at SIGCOMM ’17. In the past, Google’s SDN papers have been very insightful and inspiring, so I had big expectations for this one as well.

In this blog post, I’ll summarize and highlight the main points of the paper, then follow up with some conjectures on what we can take from it in terms of industry trends and the state of the art in SDN technologies.

For reference, Google has released several papers detailing its networking technologies; these are the most important ones:

  • B4 details Google’s SDN WAN – a must-read. It explains how they drastically increased network utilization by means of a global traffic engineering controller.
  • Jupiter Rising details the hardware aspects of Google’s data center networks.
  • BwE explains Google’s bandwidth enforcer, which plays a huge role in traffic engineering.

B4 connects Google’s data centers; B2 is the public-facing network, which connects to ISPs in order to serve end users. Espresso, an SDN infrastructure deployed at the edge of B2, enabled higher network utilization (+13%) and faster roll-out of networking services.


Requirements and Design Principles

The basic networking services provided at the edge are:

  1. Peering – Learning routes by means of BGP
  2. Routing – Forwarding packets based on BGP or TE policies
  3. Security – Blocking or allowing packets based on security policies

To design the system, the following requirements were taken into account:

  1. Efficiency – capacity needs to be better utilized and grow cheaply
  2. Interoperability – Espresso needs to connect to diverse environments
  3. Reliability – must be available 99.999% of the time
  4. Incremental Deployment – green-field deployment only is not compelling enough
  5. High Feature Velocity

Historically, operators have relied on big routers from Juniper or Cisco to meet these requirements. Those routers usually store the full internet routing table, as well as giant TCAM tables for all the ACL rules needed to protect THE WHOLE INTERNET, and they are quite expensive. More importantly, a real software-defined network allows you to deliver innovation at the speed of software development rather than the speed of hardware vendors.

Basically, 5 design principles are applied in order to fulfill those requirements:

  1. Software Programmability – OpenFlow-like Peering fabric
  2. Testability – Loosely coupled components allow software practices to be applied.
  3. Manageability – Large-scale operations must be safe, automated and incremental
  4. Hierarchical control plane – Global and local controllers with different functions allow the system to scale
  5. Fail-static – Data plane maintains the last known good state to prevent failures in case of control plane unavailability

Peering Fabric

The Peering Fabric provides the following functions:

  • Tunnels BGP peering traffic to BGP Speakers
  • Tunnels End user requests to TCP reverse proxy hosts
  • Provides IP and MPLS based packet forwarding in the fabric
  • Implements a part of the ACL rules


All the magic happens in the hosts. First, the BGP speakers learn the neighbors’ routes and propagate them to the local controller (LC), which in turn propagates them to the global controller (GC). The GC then builds its intent for optimal forwarding of the full internet routing table and pushes those routes back to the LCs, which install them in all the TCP reverse-proxy hosts. The same flow applies to security policies.

The BGP speakers are in fact a Virtual Network Function (VNF), that is, a network function implemented on x86 CPUs; routing is a VNF as well, and so are the ACLs. Also, notice that the peering fabric itself is not complicated at all: the most-used ACL rules (5%) are there, but the full internet routing table is not. The hosts make the routing decision and encapsulate the packets, labeling them with the egress switch and egress port of the fabric.
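To make that division of labor concrete, here is a toy sketch of the host-side decision as I read the paper: the controller hierarchy programs each host with a map from prefix to a fabric egress (switch, port), and the host simply looks that up and tags the packet before handing it to the deliberately simple peering fabric. The data structures and names are mine, purely illustrative.

# Toy illustration of Espresso's host-side forwarding decision (names and
# structures are mine, not from the paper): the global/local controllers
# program the host with prefix -> (egress switch, egress port) mappings,
# and the host labels each packet so the peering fabric can stay simple.
import ipaddress

# Programmed by GC -> LC -> host; in reality this covers the full routing table.
egress_map = {
    ipaddress.ip_network("198.51.100.0/24"): ("pf-switch-3", "port-12"),
    ipaddress.ip_network("203.0.113.0/24"): ("pf-switch-1", "port-4"),
}


def pick_egress(dst_ip):
    """Longest-prefix match on the host, returning the fabric label to apply."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [(net, label) for net, label in egress_map.items() if dst in net]
    if not matches:
        raise LookupError("no route programmed for %s" % dst_ip)
    # Most-specific prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(pick_egress("203.0.113.7"))   # -> ('pf-switch-1', 'port-4')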

Configuration and Management

The paper mentions that as the LC propagates configuration changes down, it canaries those changes to a subset of nodes and verifies correct behavior before proceeding to wide-scale deployment. These features are implemented:

  • Big Red Button – the ability to roll back features of the system, tested nightly.
  • Network Telemetry – monitors peering-link failures and route withdrawals.
  • Dataplane Probing – end-to-end probes monitor the ACLs (it’s unclear to me whether OF is used for this).

Please refer to the original paper for details. I hope this post is useful for you, and I apologize for any miscommunication; at the end of the day, I’m writing this post for myself more than anything.

Feature and rollout velocity

Google has achieved great results in terms of feature and rollout velocity. Because Espresso is software-defined, they can leverage their testing and development infrastructure. Over three years, Google has updated Espresso’s control plane >50x more frequently than traditional routers, which would have been impossible without the test infrastructure.

The L2 private connectivity solution for cloud customers was developed and deployed in a few months, without new hardware and without waiting for vendors to deliver new features. Again, something unimaginable with legacy network systems. In fact, they state that the same work on the traditional routing platform is still ongoing and has already taken 6x longer.


Traffic Engineering

To date, Espresso carries at least 22% of Google’s outgoing traffic. The nature of the GC allows them to shift traffic from one peering point to another, and the ability to make that choice by means of Espresso allows them to serve 13% more customers during peaks.

Google caps loss-sensitive traffic to guard against errors in bandwidth estimation. Nonetheless, the GC can push link utilization to almost 100% by filling the remainder with lower-QoS, loss-tolerant traffic.

Conclusion

From the paper: “Espresso decouples complex routing and packet processing functions from the routing hardware. A hierarchical control-plane design and close attention to fault containment for loosely-coupled components underlie a system that is highly responsive, reliable and supports global/centralized traffic optimization. After more than a year of incremental rollout, Espresso supports six times the feature velocity, 75% cost-reduction, many novel features and the exponential capacity growth relative to traditional architectures.”
