Category: Thoughts & theories

What’s left for SDN after the hype?

Say SDN one more time!

The following excerpt is required reading for understanding SDN. It comes from the article “SDN is DevOps for Networking” by Rob Sherwood, written in 2014. If you haven’t read it yet, do it now.

The term SDN was first coined in an MIT Technology Review article by comparing the shift in networking to the shift in radio technology with the advance from software defined radios.

However, the term is perhaps misleading because almost all networking devices contain a mix of hardware and software components. This ambiguity was leveraged and exacerbated by a litany of companies trying to re-brand products under the “SDN” umbrella and effectively join the SDN bandwagon. As a result, much of the technical merit of SDN has been lost in the noise.

The conflict starts with semantics. When researchers refer to SDN, they don’t mean any network defined by software; that’s too simplistic. They mean a specific group of ideas in computer networking, mostly built around the abstraction of, and separation between, the control and data planes. Starting from that premise, the term has been thoroughly misused in the industry. Further, this separation can be built at different levels of abstraction.

Because of this separation, you can decouple innovation cycles and, for example, optimize your network with a global view of it. In fact, the most valuable goal of SDN is to decouple innovation cycles; in other words, you can improve your network control independently of the underlying technologies. This has significantly improved feature velocity in the industry. We’ve seen more improvement in the industry in the last 10 years than in the last 30. One of my posts gives some evidence of that.

In that sense, OpenFlow (OF) and P4 are abstractions built on a forwarding API, while the more traditional approach to network automation is an abstraction built on top of traditional APIs. Different levels of abstraction can therefore deliver on the SDN value propositions to different degrees. Some optimizations can be achieved without pure SDN, but not all.
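To make the "global view" idea concrete, here is a minimal Python sketch of a control plane that sees the whole topology, computes next hops centrally and produces one forwarding table per switch. The topology, the switch names and both function names are made up for illustration; a real controller would push these entries through a forwarding API such as OpenFlow rather than keep them in a dict.

```python
from collections import deque

# Hypothetical four-switch topology; names are illustrative only.
TOPOLOGY = {
    "leaf01": ["spine01", "spine02"],
    "leaf02": ["spine01", "spine02"],
    "spine01": ["leaf01", "leaf02"],
    "spine02": ["leaf01", "leaf02"],
}


def compute_next_hops(topology, destination):
    """Control plane: BFS outward from the destination over the full graph.

    Returns {switch: next_hop_towards_destination} for every reachable switch.
    """
    next_hop = {}
    visited = {destination}
    queue = deque([destination])
    while queue:
        node = queue.popleft()
        for neighbor in topology[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                next_hop[neighbor] = node  # neighbor reaches destination via node
                queue.append(neighbor)
    return next_hop


def program_switches(topology):
    """Build one forwarding table per switch, as a controller would push them."""
    tables = {switch: {} for switch in topology}
    for destination in topology:
        for switch, hop in compute_next_hops(topology, destination).items():
            tables[switch][destination] = hop
    return tables


tables = program_switches(TOPOLOGY)
print(tables["leaf01"]["leaf02"])
```

The switches here only hold the resulting tables (data plane), while all path logic lives in one place (control plane). That is the separation the paragraph above is about, in toy form.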

I’ve seen people call BGP SDN: it has software, thus it’s software-defined, right? No. What if I use a route reflector to specify BGP policy from a central point? Okay, now we are talking. A much more thoughtful argument is ‘you need network automation/DevOps, not SDN’. Indeed, for many use cases that’s enough, and that brings me to my next topic:

DevOps for networking

Disclaimer: humanity will agree on the meaning of life before agreeing on the meaning of SDN and DevOps. I’ll take the meaning from the article mentioned before: “DevOps infuses traditional server administration with best practices from software engineering, including abstraction, automation, centralization, release management and testing”. I’m pretty sure most people can agree those are desirable practices/outcomes.

IaC (infrastructure as code) does give you the opportunity to deliver on most of these:

  • Abstraction
  • Automation
  • Centralization
  • Release management
  • Testing

It provides an abstraction of the control plane that leverages legacy APIs and is good enough for a lot of people. Also, it’s much easier to make a case for hiring a couple of developers than for replacing hardware. Network disaggregation at the bare-metal layer has already delivered a lot. Now, can network automation deliver global optimization?

Definitely. That doesn’t mean you should, or that it’s the best or cheapest way. Think about this: to what extent is it worth delivering optimization using automation? What abstractions would make it easier for you to do so? This now becomes an engineering problem in which we evaluate what technology to use to solve a problem.

Traffic Engineering

For example, take backbone TE. Some organizations are perfectly fine running links at 30% average utilization. Some companies are hurt by elephant flows, some are not. Why would you care about TE again? I’m glad I asked, check this post. Jokes aside, a company with effective TE will consistently deliver a better user experience at reduced cost, both OpEx and CapEx, and eventually those efficiencies will be assimilated by the market. In fact, I’d argue we are starting to observe that. I ask you, which companies are spending the most money expanding their networks? Yes, content providers such as Google and Facebook.

Can we deliver global optimization with RSVP-TE? Sure. How complex is the software system that performs those optimizations? You tell me. What’s the average size of a backbone TE team? I’m sure it’s at least half a dozen CCIEs, a ton of vendors and lots of lower level support engineers. Most ISPs have not delivered on real-time global optimization yet. Not many of them have mentioned these optimizations.

Google has publicized some data reporting that its SDN teams support 6 times the feature velocity of traditional networking architectures. With dubious logic, one could infer that it’s 6 times more expensive to deliver TE with traditional architectures.

After more than a year of incremental rollout, Espresso supports six times the feature velocity, 75% cost-reduction, many novel features and exponential capacity growth relative to traditional architectures

How many engineers would you need to develop an OF based TE system?

One could make an argument that the underlying technology doesn’t matter as long as the solution is delivered. Fair enough. In this case, I’d ask you what architecture is going to be more extensible, providing you the biggest long-term benefit? The one put together with a bunch of hacks or the one built from the ground up?

Conclusion – What’s left after the hype?

The main reasons that started the SDN revolution are still real: the need for reduced networking costs and faster innovation cycles. Pure SDN is not the only way to achieve those, network disaggregation and network automation have delivered on those value propositions, but there’s still plenty of room for improvement. P4 is a great example as it aims to increase innovation speeds on the hardware side of the equation.

It’s no longer a question whether one can reduce costs with new practices in networking. Expensive operations will be replaced piece by piece: it started in the data center, and now the backbone is the new goal. Additionally, I speculate that due to 5G, in 2019, the regulatory framework is going to change a lot, allowing innovative companies to increase market share significantly. Effective innovation strategies will allow the best ISPs to expand at a much-reduced cost and the market will see that sooner or later.

Network automation with Ansible : Canarying / Rolling upgrades

In this post, I describe my experience at the NANOG 72 Hackathon and explain the project I worked on. If you want to see the work, my diffs can be found here.

The Hackathon lasted 8 hours and involved engineers of all levels. It started with a presentation on network automation tools, followed by team forming: anyone with an idea shouted it out to the group, and then teams self-organized. I shared an idea for canarying automated network changes. Let me explain.

A typical challenge in network management is rolling out changes fast and reliably. Infra-as-code already goes a long way toward removing the human element from the equation; if you have no clue what I’m talking about, check this video. Yet your automation has the potential to cause outages if a bug slips through testing. A good principle for automation is canarying. It consists of performing changes on a subset of endpoints and verifying the health of the system before proceeding, to prevent the wide-scale spread of a mistake.

Examples of things that may be canaried are patches, OS updates, critical software updates, config changes, etc. When using the good ol’ CLI, one always checks the state of the system to see if the changes look good. Canarying is analogous to that. Say you need to change the software running on all your switches from Quagga to FRR. Wouldn’t it be great if you could perform those changes fast while still making sure your network is working perfectly? Yes! The alternative to that is running your scripts like a savage only to realize later that a typo slipped through your code review and blackholed all your datacenter traffic.

Now, back to the Hackathon. I convinced my team to leverage a vagrant environment + ansible code open-sourced by Cumulus. Leveraging open-source material was a great way to bootstrap development and get to a baseline faster, then we could focus on the canarying solution (that may have been a mistake). It turns out the internet was saturated and it took us more than an hour to download the boxes. At noon, I was able to bring up the environment only to find out that those boxes no longer supported quagga. That brought us to around 1PM and my team was getting a little impatient. At that time, folks were discussing pivot options but I decided to finish the work by myself. So I deserted my team, or they deserted me, you choose.

Anyway, I changed the ansible scripts to run FRR and finally had a running topology around 2PM. After that, I started playing with ansible and quickly found the feature necessary for my use case. Ansible allows you to configure the number of endpoints it runs against at a time with the field “serial”. Once you set serial to 1, the play runs against one host at a time. DONE!!! RIGHT? Not yet.

Now it became interesting, and I grabbed a couple of people to discuss how to actually implement this. Basically, the discussion revolved around defining a sufficient condition to accurately determine the state of the system.

  1. Ping. Although a failed ping implies network failure, its success doesn’t mean the changes caused no problems.
  2. Route table of the modified switch. That doesn’t guarantee the whole system is in a good state. For example, things could look good on a switch while a route has been withdrawn on the other side of the network.
  3. Global routing table state. That one can only be right; the challenge is how to get that info.
  4. Some folks on the team mentioned that they normally have monitoring systems running to detect anomalies and remediate them with scripts. But you don’t know what you don’t know, so I’d advocate for canarying in addition to that.

Okay, so I quickly did #2 and realized it wasn’t good enough. I should have stopped coding there and made a powerpoint, but I decided to keep hacking instead. I wasn’t able to complete #3 and kinda just mumbled the content of this post. It was cool. Overall, it made my Sunday productive and I wouldn’t have done anything like this otherwise.

I finished this project a few weeks later. I wrote a Flask web service that returns the number of working BGP neighbors; from Ansible I can then call that API and assess the global state. Let’s go for a tutorial.
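The core of such a service can be sketched in a few lines of Python. This is not the actual web service code, just a minimal sketch of the counting step, assuming the routing daemon can hand you its BGP summary as JSON. The field names ("peers", "state"), the sample data and the function name count_established are assumptions for illustration.

```python
import json

# Hypothetical sample shaped like a BGP summary in JSON form. The field names
# ("peers", "state") are an assumption for illustration, not a documented schema.
SAMPLE_SUMMARY = json.dumps({
    "peers": {
        "swp51": {"state": "Established"},
        "swp52": {"state": "Established"},
        "swp53": {"state": "Idle"},
    }
})


def count_established(summary_json):
    """Return how many peers in the summary are in the Established state."""
    peers = json.loads(summary_json).get("peers", {})
    return sum(1 for peer in peers.values() if peer.get("state") == "Established")


print(count_established(SAMPLE_SUMMARY))
```

A web service would then return that number as the response body, so that a client only has to convert the body to an integer and compare it against the expected neighbor count.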

Canarying tutorial

To run this you need Vagrant 2.0.2. To start, bring up the infrastructure:

 git clone https://github.com/cumulusnetworks/cldemo-vagrant
 cd cldemo-vagrant
 sudo vagrant up oob-mgmt-server oob-mgmt-switch leaf01
 sudo vagrant up leaf02 leaf03 leaf04 spine01 spine02 

This should bring up a leaf-spine infrastructure not yet configured. To configure it do the following:

 sudo vagrant ssh oob-mgmt-server
 git clone https://github.com/castroflavio/cldemo-roh-ansible
 cd cldemo-roh-ansible
 ansible-playbook start.yml

The next step is to run the change.yml playbook:

---
- hosts: spines
  user: cumulus
  become: yes
  become_method: sudo
  roles:
    - ifupdown2
    - frr
    - statecheck
  serial: 1
  max_fail_percentage: 50

The magic lies in the serial field, which stipulates that changes are deployed sequentially, 1 node at a time in this case. Whenever a change breaks the network, the statecheck role fails:

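# Ask the web service on every leaf how many BGP neighbors are up.
- uri:
    url: "http://{{ item }}:5000/"
    return_content: yes
  register: webpage
  with_items:
    - '192.168.0.11'
    - '192.168.0.12'
    - '192.168.0.13'
    - '192.168.0.14'

# Fail if any leaf reports fewer than 2 working neighbors.
- debug: var=item.content
  failed_when: item.content|int < 2
  with_items: "{{ webpage.results }}"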

It reaches out to every leaf (192.168.0.1*) and gets the number of BGP neighbors through a REST API. The web service code can be found here.
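The overall canarying logic (change one node, check global health, stop at the first bad sign) can be sketched outside Ansible too. The Python below is a simulation under stated assumptions, not what Ansible runs internally: rolling_update, fake_apply and fake_health are hypothetical names, and the neighbor counts are invented.

```python
def rolling_update(hosts, apply_change, check_health, min_neighbors=2):
    """Apply a change one host at a time, stopping at the first unhealthy check.

    apply_change(host) performs the change; check_health() returns a list of
    neighbor counts, one per monitored leaf. Both are caller-supplied stand-ins.
    """
    updated = []
    for host in hosts:
        apply_change(host)
        updated.append(host)
        counts = check_health()
        if any(count < min_neighbors for count in counts):
            return {"updated": updated, "aborted_at": host}
    return {"updated": updated, "aborted_at": None}


# Simulated run: the change to spine02 breaks a session on one leaf.
state = {"broken": False}


def fake_apply(host):
    if host == "spine02":
        state["broken"] = True


def fake_health():
    return [2, 1, 2, 2] if state["broken"] else [2, 2, 2, 2]


result = rolling_update(["spine01", "spine02"], fake_apply, fake_health)
print(result)
```

In this run the first spine goes through cleanly and the rollout stops right after the second one, which is the behavior you want from a canary: the blast radius is one node, not the whole fabric.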

Thanks for reading the long post, do not hesitate to share any questions and thoughts.

What’s happening in 2018?

I was going through old blog posts and stumbled on this article where I explained my views on the state and future of the Networking Industry in Jan’17. In this article, I’d like to do the same for 2018. 2017 was the year of hype; I believe 2018 is going to be a competitive year.

To start, the first specs for 5G are out and spectrum auctions are happening. That means new players are coming and markets will be shaken up; folks with better network management systems, like Google, will definitely make a huge investment.

Other than the chase for 5G, reducing costs and increasing revenue are still main drivers of innovation. Nothing changed there. It won’t change in the next 100 years. In 2017, we’ve seen an increasing demand for network innovations for Enterprises. This is happening due to the maturation of SDN-like technologies.

The biggest achievement of 2017 is the network disaggregation started by OCP. Broadcom keeps announcing jaw-dropping specs for upcoming chipsets. Still, competition in that area looks healthy, from Mellanox, Cavium, and P4. On the OS side, Cumulus, Big Switch, SnapRoute, Arista and even OpenSwitch offer plenty of choices and are still growing steadily. Additionally, AT&T is putting effort into an open-source OS for themselves.

I wish the industry would move towards open-source and extremely low-cost networking solutions. But although folks want costs to come down, that’s not the sole priority. By embracing open source, ISPs expose themselves to risks and extra costs. It’s common for businesses to transfer risk, and they are willing to pay good money for that. Enterprises are as risk-averse as ISPs.

I think there’s an opportunity for commercial support of open-source solutions. The challenge lies in finding proper incentives. Entrepreneurs have little motivation in investing in a specific technology like OpenStack or Kubernetes because it’s ephemeral, thus their incentive lies in cultivating relationships with the customer and learning the technology as necessary. Further, companies seem to find it hard to hire labor specialized in open-source projects (some would argue Corporations are terrible at hiring in general) and end up crossing those options from the beginning. Winning players like Amazon, Google and Facebook ignore that cost and feast on those benefits by hiring skilled labor running on agile teams, increasing learning rate.

I may be wrong here, but I see Juniper making a strong move into software with Contrail, and even Junos Space, while Cisco is still Cisco. In other words, the biggest factor ensuring market dominance is customer relationships rather than winning technology, but technology always wins in the long term. In raw networking specs, I doubt they will catch up to BCM. Only time will tell.

All of this leads me to believe 2018 will start big changes in the industry. We won’t see these changes transform financial results until at least 2019, but I bet the winning ISPs in 2020 will be the ones who make the right technological choices now. Also, notice that the market share in the industry has not changed much in the last 5 years.

DevOps essentials: Developer Environment

“90% of coding is debugging, the other 10% is writing bugs.”

If you develop any code, you know how true this is. It implies that the speed at which you fix bugs is what really dictates your feature release velocity. This relates to a critical part of DevOps: CI/CD. Continuous integration allows bugs to be found faster; continuous deployment allows fixes to be released to production faster.

In this post, I talk about the importance of a stable Developer Environment (Dev Env) and present a short tutorial on how to set up a virtual routing environment with open source tools provided by Cumulus.

Why bother at all?

A DevEnv to test your automation scripts is essential for any effective development team. It allows devs to test their code in minutes rather than days independently of others, increasing their velocity drastically. It also improves collaboration, by making sure all devs are starting from the same stable point.

Here is an example of how bad this is: Company ABC only has one testing environment in which it performs all testing. Say a good scheduling process allows for 50% utilization, and each dev takes 1 hour to run a script and verify its outcome and an additional hour to clean-up the environment. A work day has 8 hours, where 4 hours are utilized, and thus only 2 tests can be performed per day. In this case, no matter how big your team is, you can only push a total of 2 features or fixes per day. This is pretty much a fixed cap on the team productivity.

Technology choices

I’ve been wanting to set up a virtual environment for a while. At first, I tried using Mininet; while it suffices for routing using Quagga, it required too much work to set up in a way I could use for testing scripts. The other options were eve-ng (UNetLab), GNS3 and Vagrant. I crossed off GNS3 right away because the overhead and learning curve are significant. If I had to run IOS I’d go for eve-ng, but I don’t. Additionally, the learning curve for Vagrant was shorter, so it’s useful to more people.

Kudos to Cumulus for open-sourcing this, it is great to see vendors contributing multi-purpose code rather than just taking code from others or promoting their own technologies and I urge you to check their work. You can start from here.

Quick start

Disclaimer: this code is available on the Cumulus GitHub. Before running this demo, install VirtualBox and Vagrant.

### Bring up the vagrant topology
git clone https://github.com/cumulusnetworks/cldemo-vagrant
cd cldemo-vagrant
vagrant up oob-mgmt-server oob-mgmt-switch leaf01 leaf02 spine01 spine02 server01 server02

### setup oob mgmt server
vagrant ssh oob-mgmt-server

### Run the ROH demo
git clone https://github.com/cumulusnetworks/cldemo-roh-ansible
cd cldemo-roh-ansible
ansible-playbook run-demo.yml

### check reachability of server02 from server01
ssh server01
wget 10.0.0.32
cat index.html

Custom topology

You only need Vagrant, VirtualBox and git to run this tutorial. Vagrant is really simple: once you have a Vagrantfile, you need exactly one line of code to bring your environment up. The tricky part is building the Vagrantfile. Cumulus not only provides the templates but also a tool to create the file from a topology file: the topology converter.

[Figure: cldemo_topology]

The figure above shows the default topology. You can change it by editing the topology file and then running the topology converter again:

python topology_converter.py topology.dot

After that, simply run vagrant up and boom:

vagrant up
vagrant ssh oob-mgmt-server

Conclusion

In this post, I talked about the importance of a stable developer environment and how that fits into the DevOps framework. I also gave an example of how to establish an environment.

This links back to the SDN value proposition: the ability to run software processes to improve your infrastructure. But people have virtualized network devices for years; there’s nothing new there. True. Still, a significant share of network operators do not have a virtual environment in which to test their systems. On the other hand, with the rise of Open vSwitch and the creation of Mininet, SDN developers started using network emulation to develop SDN systems and have always taken this as a given, yielding increased development agility.

This also leads me to some thoughts on how P4 can improve enterprise systems, but I’ll leave that for a future post. Again, please let me know your thoughts on this.

Can P4 save Software-Defined Networking?

Now, P4 is gaining momentum due to engagement of big players such as Google and AT&T. P4 has potential to cause a significant change in the industry and deliver on the SDN value-proposition. I’d like to discuss that.

In summary, P4 has 3 main goals:

  • Reconfigurability
  • Protocol independence
  • Target independence

OpenFlow had its shortcomings: somehow diversity of implementation strategies evolved into incompatibility. P4 target independence proposes to solve this issue using a compiler to translate P4 code into switch code taking into account its capabilities.
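To give a feel for what "target independence" means, here is a toy Python sketch. It is not P4 syntax and not how a real P4 compiler works: the pipeline description, the two targets, their capabilities and the compile_for function are all invented for illustration. The only point is that one description is written once and a per-target backend decides whether and how it fits the hardware.

```python
# One target-independent description of a pipeline: which header fields each
# table matches on and which actions it may apply. All names are illustrative.
PIPELINE = [
    {"table": "l2_forward", "match": ["eth.dst"], "actions": ["forward", "drop"]},
    {"table": "ipv4_route", "match": ["ipv4.dst"], "actions": ["set_next_hop", "drop"]},
]

# Hypothetical per-target capabilities; real targets differ in many more ways.
TARGETS = {
    "target_a": {"max_tables": 4, "fields": {"eth.dst", "ipv4.dst"}},
    "target_b": {"max_tables": 1, "fields": {"eth.dst", "ipv4.dst"}},
}


def compile_for(pipeline, target_name):
    """Toy 'compiler': accept the pipeline only if the target can host it."""
    target = TARGETS[target_name]
    if len(pipeline) > target["max_tables"]:
        raise ValueError(f"{target_name}: needs {len(pipeline)} tables")
    for table in pipeline:
        missing = set(table["match"]) - target["fields"]
        if missing:
            raise ValueError(f"{target_name}: cannot match on {sorted(missing)}")
    return [f"{target_name}:{table['table']}" for table in pipeline]


print(compile_for(PIPELINE, "target_a"))
```

Here target_a accepts the pipeline and target_b rejects it for lack of tables. The program itself never changes; only the backend's verdict does.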

In order to understand how disruptive this is, let’s look at the current state of affairs: commodity silicon vendors such as Broadcom and Mellanox already have an API to control their switches, and the existence of that API alone already disrupted the industry, enabling Cumulus, SnapRoute and even Arista. Now, would you prefer that your silicon vendors established a common interface, or would you rather rewrite software every time you want to test a new switch vendor? The answer is obvious: the first option benefits users and new vendors, the second benefits established vendors.

So, that’s the big pay-off opportunity, enabling competition, thus innovation. The challenge here is to provide vendors the incentives to write the P4 compiler.

New industry players or adventurous operators, on the other hand, would be able to write software on top of P4 and achieve multi-vendor integration at the cost of writing compilers for each vendor they use. That can be game-changing. The big questions are “How eager are developers to write P4 software?”, “How much does it cost to hire somebody to do it?” and, additionally, “Who will write Cisco/Broadcom-specific P4 compiler code?”

There are endless opportunities: in a parallel universe, AT&T forces Cisco to provide a P4 compiler for their devices, Cisco writes a bad compiler, claims it’s bad technology and sells you ACI instead. In a different universe, Barefoot writes a Broadcom compiler and ensures it works, but then it “wastes” some resources promoting a competitor. A little more realistically, SnapRoute or Cumulus could write a P4 compiler for Broadcom Tomahawk, and would thus be able to run their software on a plethora of existing devices. Even more realistically, Barefoot writes their own compiler for Tofino and keeps selling P4 to a limited niche market.

Now, if Barefoot took on the responsibility of writing a P4 compiler for Broadcom and Mellanox, that would translate into huge value for NOS vendors and operators, since they would be able to switch vendors seamlessly. It would only marginally increase adoption of Tofino, so the question remains: who would pay for this?

Now how much does it cost to adopt P4?

Before I answer this question, I’m going to call back to a point I made previously when I wrote about network disaggregation. I ended that post asking: “Does OpenFlow effectively lock you in?”. Now the same question may apply to P4.

The question is misleading by itself. I’ve heard vendors saying “OpenFlow locks you in, you might as well just buy our SDN”. There’s just so much wrong with this. OpenFlow isn’t perfect, but it does allow you to adopt software processes to deliver features much faster than your vendor will.

Any choice is a potential barrier and locks you in a little bit, but what everybody refers to when talking about lock-in is hardware lock-in. When you buy a generic x86 computer, you are free to install Ubuntu, Debian, Windows or whatever you’d like. When you buy a PlayStation, you can’t just install Xbox on it. That’s vendor lock-in: the costs of doing that are prohibitive, and you would be better off just buying another appliance.

You could, at almost no cost, try an OpenFlow lab or field trial on Broadcom-based network devices and fall back to Cumulus if it doesn’t fulfill your needs. Unsurprisingly, the vendors will claim lab trials aren’t needed because of their product quality, but experience tells you there will always be a missing feature.

Now, P4. From the adventurous perspective, P4 is great: you just have to write more software to get it done. For everybody else it has a significant cost: you have to hire premium developers or Barefoot itself to do it. That cost won’t be insignificant, especially when Broadcom + Big Switch might already give you the tools to improve your current process.

OpenFlow vs P4

OpenFlow is going to be 10 years old next year, a significant amount of resources has been put into testing it. It has been (properly) commercially supported by Big Switch for 3+ years if I’m not mistaken. I’d say with certainty that you could get an OpenFlow solution production-ready in a year. Realistically, could you get P4 ready to be deployed in production in a year?

Misconceptions:

  • Will P4 replace OpenFlow? Maybe. P4 offers a different value proposition. OpenFlow agents may be written on top of P4. Great P4 implementations may force OF into being obsolete.
  • Will P4 replace Broadcom SDK? Same answer: someone may write a much better API in P4 on top of theirs.
  • Will P4 replace OpenNSL? Why not?
  • Will P4 replace NetFlow/sFlow? No. sFlow is a protocol to export data from the switches; it does not say (much) about how you should implement it in the dataplane.
  • Will P4 replace Riverbed? No way.
  • Will P4 replace OpenConfig? Nope, they are actually quite complementary.

Thanks for reading the long post. I welcome any thoughts or questions.

Is vendor lock-in really a big deal?

I’ve recently come across a Datanauts podcast episode, “Choosing Your Next Infrastructure” (if you like podcasts, I HIGHLY recommend Packet Pushers; I’m a fan because of their diverse and unbiased content). In this episode, various great considerations on choosing new infrastructure are made, and they do an excellent job of describing the pros and cons of different strategies, but a few points regarding vendor lock-in got me scratching my head. The article “Vendor lock-in the good, the bad and the ugly” does a great job of explaining the overall concept of vendor lock-in.

Additionally, I see it in the following way: some vendors provide hardware and software as integrated solutions, potentially including storage, networking, or computing. Traditional vendors have been doing this for decades, and that’s one part of vendor lock-in: you rely on your vendor to deliver new features, and if they do not deliver, the migration costs would, most of the time, be prohibitive, which is a good enough reason to just pay the same vendor a premium.

During the podcast, the following question was asked: “If you commit to a hyper-converged platform, you are committing to a vendor and thus, in fact, locked-in, is that a big deal?”.

The response was “What’s important is understanding that lock-in is going to happen… and it’s important to choose a vendor that is going to be a good partner for your business… So if you have a very good relationship with a vendor who provides an all-at-once solution, that may be strategic for you, and if you would rather keep the hardware open and have a vendor you trust to give a good software solution, that’s your best path”.

Learning curves and migration costs will always exist. Successful organizations, managers, and architects will minimize those costs while meeting critical requirements. That answer caught my attention because this is not the first time I’ve heard comparisons between hardware lock-in and software lock-in that minimize the cost of hardware lock-in. I’ve heard stronger opinions from hardware vendors before (of course): “hardware locks you in, software locks you in, therefore you might as well lock yourself to the hardware”. That statement is easy to make when you are selling hardware; it’s much harder to justify when you are buying hardware.

I’m not completely opposed to lock-in in order to meet critical requirements, but that decision must be made very carefully and rationally. More often than not, the future cost of the decision is much higher than the initial cost of the whole project. Requirements are uncertain, and they become more dynamic every day.

For example, say that at design time you thought your critical requirement was performance and acquired the best in the industry. A year from now, your solution becomes popular in your organization (because it is so good!), multi-tenancy is now much more important, and you are locked in. Your manager demands multi-tenancy and your sales engineer gladly offers you an add-on contract for whatever price (s)he wishes. The requirement is fulfilled, all parties involved go to dinner at a fancy steakhouse, everybody is happy!

If your organization is mature enough to have a project starting and ending with the exact same requirements, then you definitely should pick vendor lock-in. But if your organization stands in a dynamic environment, external or internal, then you should always maximize choice and minimize barriers to change in order to meet ever-changing requirements.

I’m a firm believer that competition and choice ultimately drive innovation, thus in order to consistently deliver innovative solutions one must be open to competition. I’d argue that computers only are what they are now because of choice. And personal computers can be a nice example. One can choose between AMD or Intel processors, OSX, Windows or Linux. At the end of the day, lots of people will buy a solid computer integrated by Microsoft or Apple, but in the long run, the most innovative solutions and sometimes cost-effective solutions are the build-your-own type.

More than that, at the end of the day, a well-put-together gaming setup is much more exciting than a boring MacBook, just as Facebook’s or Google’s chassis switches are more exciting than an expensive Juniper router.

 

Network Disaggregation – The holy grail?

TL;DR: Yes

The networking industry has seen more innovation in the last decade than in the last 30 years. The popularization of the SDN concept and the release of OpenFlow 1.0 pretty much ignited a flame present in every operator’s mind: the fear of vendor lock-in.

It was common for operators to rely solely on a single vendor every time a new feature was needed. Let’s say Joe has decided your network now needs to be monitored using a specific monitoring protocol, xFlow for illustration. Then, because you only use vendor A gear, you would have to ask your vendor to add that feature to your software stack. Your sales engineer would then have to convince his developers that this is a critical feature, and that feature would then have to go through the full QA hardening pipeline to make sure it doesn’t break any of the 400 protocols present in the OS of your network. That process easily took years. It still takes a few years for the unfortunate souls that choose to be locked into a specific vendor.

OpenFlow became popular as a promise to bring innovation to the industry and solve the multi-vendor integration problem by providing a standard interface for programming the network. As I mentioned in my last post, while it has brought innovation to the industry, for lack of a strong standardization process it failed to achieve vendor integration, and the demand for an escape route from vendor lock-in remained.

In 2011, a few smart minds in the industry (Facebook, Arista, Rackspace) started the Open Compute Project as an initiative to open up hardware design, having in mind that there’s already so much innovation in the software layer of computation. The idea quickly expanded to networking gear, and a trend of disaggregation between NOS (network operating system) and hardware started. Hardware vendors such as Broadcom and Mellanox started working on their own abstractions for a hardware programming interface. That abstraction layer allowed a lot of good innovation, and that’s where the Open Networking concept started.

With a common interface to the hardware established, several NOS vendors have emerged and in fact disaggregated the network. This naturally allows for faster development cycles, since it decouples software development cycles from hardware development cycles. NOS vendors can focus on software instead of hardware specifics, and the resulting diversity of vendors increases the speed of innovation.

Let me give you a couple of examples. Say you convinced your manager to buy Open Networking gear based on Broadcom chips (for example) and you went for a “traditional” vendor, say, Dell. Three years later, Broadcom comes up with a next-generation chip. You could (1) choose to keep using Dell and upgrade the gear with no need to change any management systems. Alternatively, (2) let’s say Dell’s features didn’t keep up with your expectations; then you could replace it with Arista, or even Cumulus Linux, in order to experiment with completely new paradigms and finally deploy xFlow. In another scenario, let’s say Mellanox’s next-generation hardware performs much better; then you could again choose to keep using the Dell OS and smoothly upgrade your hardware for an optimal cost.

Traditionally, vendor lock-in makes you pay for decades for a non-optimal decision. Network disaggregation makes your decisions lighter, allowing you to quickly rethink your strategy and cheaply pivot if necessary.

Choice is extremely powerful. In college, I remember being amazed by the power of MIMO communications: embracing path diversity and the ability to “choose” the best path increases the capacity of a channel almost linearly. Network disaggregation gives you the same power, the power of choice.

Now, let me address a few misconceptions I’ve seen around:

  • Is network disaggregation SDN? No.
  • Can SDN be achieved through network disaggregation? Yes, ultimately network disaggregation accelerates innovation.
  • Does OpenFlow effectively lock you in to a vendor?

That’s a good one and I’m going to answer it in a future post.

Don’t hesitate to reach out to me with any questions.

 


Has OpenFlow failed? – Challenges and implementations

In truth, very few vendors have successfully implemented the full capabilities of OpenFlow. OpenFlow gives programmers a great deal of flexibility, and it’s hard to make the hardware cope with that much power. Only a few vendors, such as NoviFlow, Corsa and Barefoot, are able to deliver ASICs that programmable.

The reason for that comes from the nature of match tables. A match table is implemented in memory. In a match table, we match on a field, say the MAC address, and we take an action, say forward the packet to port 1. The complexity comes when we want to match on multiple fields. Say we have a MAC table with N addresses and an IP table with M addresses. The total size of the flow tables (memory) is M + N. Now if we want to execute the match in a single table, the size of that table rises to M*N in the worst case, because every combination of MAC and IP address may need its own entry. Now imagine matching on many fields at the same time.
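The arithmetic above can be made concrete with a small sketch. This is only an illustration of the worst-case counting argument, not a model of any real switch; the function names are made up for this example.

```python
from itertools import product
from math import prod


def separate_tables_size(n_mac: int, m_ip: int) -> int:
    """Entries needed when MAC and IP rules live in two independent tables."""
    return n_mac + m_ip


def single_table_size(field_sizes: list[int]) -> int:
    """Worst-case entries when all fields are matched in one table.

    Every combination of field values may need its own entry, so the
    size is the product of the number of values per field.
    """
    return prod(field_sizes)


def cross_product_rules(macs: list[str], ips: list[str]) -> list[tuple[str, str]]:
    """Enumerate the (MAC, IP) match keys a single combined table would hold."""
    return list(product(macs, ips))


if __name__ == "__main__":
    n, m = 1000, 500
    print("two tables:  ", separate_tables_size(n, m))
    print("single table:", single_table_size([n, m]))
    # Adding a third field with 100 values multiplies the worst case again.
    print("three fields:", single_table_size([n, m, 100]))
```

With 1000 MAC addresses and 500 IP addresses, two tables need 1500 entries while a single combined table can need up to 500000, and each additional match field multiplies that worst case again.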

The multi-table aspect of OpenFlow was introduced in version 1.1 and is most widely implemented as of version 1.3. It addresses the scalability problem of flow tables. But now the challenge is how to provide a standard API via OpenFlow when different vendors have different table patterns.

The answer is we don’t. Rather, we adapt our OpenFlow code to each vendor in order to achieve our forwarding objective. Say we want to do L3 forwarding (which means match on the IP address, then modify the L2 addresses and forward to port N): one vendor might have put the modify action in the IP table, while another vendor might have grouped all the actions in a group action later in the pipeline.

OpenFlow became popular as a promise to bring innovation to the industry, much as the x86 API brought innovation to computers. In truth, interoperability between vendors via OpenFlow has been rare, precisely because vendors have different implementations of OpenFlow. We’ve seen vertical stacks of software deliver SDN capabilities, but we haven’t seen interoperable solutions yet.

Last time I checked, ONOS, a great SDN controller, provided an abstraction over OpenFlow via the FlowObjective primitive. Basically, an objective is defined, and then the OpenFlow drivers map that objective to the hardware implementation. What that gives you is the ability to have one controller controlling multiple vendors. Vendors still need to write code as drivers, but developers only have to write their software once. Again the power of abstraction shows itself. There may be others out there, but I’m aware of a couple of OpenFlow fabric solutions, such as BigSwitch and Trellis (used in the CORD project), that have been successfully deployed as stable solutions.
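To make the idea of an objective plus per-vendor drivers more tangible, here is a minimal sketch in Python. It is not the ONOS API (ONOS is written in Java, and its real FlowObjective classes look different); every class, field and vendor name below is hypothetical, and the two pipelines are simplified versions of the two L3 forwarding layouts described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L3ForwardObjective:
    """What the application wants, independent of any table layout."""
    dst_prefix: str
    new_src_mac: str
    new_dst_mac: str
    out_port: int


@dataclass
class Rule:
    """A simplified flow or group entry (hypothetical representation)."""
    table: str
    match: dict
    actions: list


class VendorADriver:
    """Hypothetical pipeline: rewrite and output happen in the IP table."""

    def translate(self, obj: L3ForwardObjective) -> list[Rule]:
        return [Rule("ip", {"ipv4_dst": obj.dst_prefix}, [
            ("set_eth_src", obj.new_src_mac),
            ("set_eth_dst", obj.new_dst_mac),
            ("output", obj.out_port),
        ])]


class VendorBDriver:
    """Hypothetical pipeline: the IP table only points at a group entry."""

    def __init__(self) -> None:
        self._next_group_id = 1

    def translate(self, obj: L3ForwardObjective) -> list[Rule]:
        group_id = self._next_group_id
        self._next_group_id += 1
        return [
            Rule("ip", {"ipv4_dst": obj.dst_prefix}, [("group", group_id)]),
            Rule("group", {"group_id": group_id}, [
                ("set_eth_src", obj.new_src_mac),
                ("set_eth_dst", obj.new_dst_mac),
                ("output", obj.out_port),
            ]),
        ]


def install(obj: L3ForwardObjective, drivers: dict) -> dict:
    """Translate one objective into vendor-specific rules for every switch."""
    return {name: driver.translate(obj) for name, driver in drivers.items()}


if __name__ == "__main__":
    objective = L3ForwardObjective(
        "10.0.1.0/24", "00:00:00:00:00:01", "00:00:00:00:00:02", 3)
    drivers = {"switch-a": VendorADriver(), "switch-b": VendorBDriver()}
    for name, rules in install(objective, drivers).items():
        print(name, rules)
```

The application code only ever builds the objective; the vendor-specific knowledge about which table holds the rewrite actions stays inside each driver.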

OpenFlow is not the answer to all your networking problems. The perfect abstraction for networking is the answer, but it does not exist. OpenFlow definitely succeeded in bringing innovation to the networking industry. A few vendors like BigSwitch have built incredible solutions, and the Open Networking Foundation has merged with ON.Lab, which may bring some more energy towards standardization of the protocol. Support from vendors has slowed down as vendors started generalizing the SDN definition; I will write more about that.


Network Automation vs Software Defined Network – Ansible vs Openflow

At Verizon, we are moving towards automating network configuration and provisioning. To me the goals for this move can be summarized as:

  • Maintenance cost reduction
  • More agile deployment processes

I come from an OpenFlow SDN background, where changes to the network can be immediate. In the real world, changes to the network require human approval and human intervention to be deployed, which results in a turnaround of 1-2 weeks. It’s really hard for me to tolerate how readily this delay is accepted with legacy systems.

I’m very interested in identifying where automation of legacy systems offers a real benefit over OpenFlow networks and vice versa. My experience tells me the biggest paradigm shift comes from the users. If the network operator is used to the OpenFlow paradigm and has software development skills, pretty much anything can be done. On the other hand, when the network operator comes from a classical Cisco network engineer background, even incremental changes to the network, as advocated by network automation gurus, can be challenging.

So far, my only experience with network automation is Ansible. A great positive factor for Ansible is its learning curve: it is very easy to try. Right now, I’m intrigued by the testing of Ansible code. Refactoring variables consists of a project-wide find and replace, and it’s not yet intuitive to me how Ansible code can be continuously tested and deployed. Quoting Uncle Ben: “with great power comes great responsibility”. Ansible does give you the opportunity to mess things up really well.

That’s where my bias towards OpenFlow comes in: successful OF projects, like ONOS, have been tested for a couple of years now and are quite mature for open source projects. As mentioned in my last article, to me it all comes down to the skill set companies want to cherish. It’s easy to leverage network engineering expertise plus some Python scripting capabilities to work on network automation, but I bet you won’t get great code quality out of that.

Another option is to leverage great software development skills to make sure you do get the code quality. But then what I would advocate is to take that great software developer and put them to work on a real SDN system, with real software challenges, where the opportunity for gain is incredible.

OpenFlow has an inherent disadvantage, which is the requirement for extra hardware support. Successful OF deployments have either been performed with new gear or have used hybrid deployment strategies, which can be complex. So, if you want to improve current deployments, OpenFlow won’t be your pick.

I’m still skeptical about the value of network automation beyond enabling incremental adoption of new technology. In other words, its main advantage is that it’s easy to sell.


What’s going on?

In January, I started working for Verizon as a DevOps Engineer with a focus on network engineering. I’ve been working with SDN for about 2 years, and my last experience was at the Open Networking Lab, a research lab that pioneered SDN research, in collaboration with AT&T.

In this article, instead of describing a technology as I usually do in this blog, I’ll try to summarize my thoughts on where this industry is going.

Every day it’s clearer to me that innovation in service providers is driven by two factors: pressure to reduce acquisition and operational costs, and increasing pressure to deliver new services fast, which BTW happens in order to generate new sources of revenue.

Most service providers are trying to leverage open hardware from OCP and open source technology in order to achieve those goals. The “open” alternative is quite cost-effective compared to current legacy solutions. At the same time, it offers the opportunity to be at the edge of technology development; that is to say, open technologies accelerate innovation cycles significantly. The disaggregation of network devices has played a tremendous role in enabling innovation as well.

There are challenges in achieving those goals. Acquisition costs are definitely the most compelling point of open technologies. The delivery of open source solutions, on the other hand, is where the risk lies. If you are used to open source, you know that bugs are just part of your life. There’s a 9 in 10 chance that at least one of your critical features won’t be supported natively by the available open source solutions.

To cope with that, I believe service providers should invest in acquiring diverse talent or in training their own staff.

The truth is change is inevitable: you either hop on the boat and deliver reduced costs or new services, or you will be left behind. We’ve started to see evidence of this happening with big vendors, and I believe the pattern will repeat with providers.

In the next posts, I’ll try to comment on what is going on with vendors, or write a follow-up with my thoughts on the costs, risks and benefits of this search for innovation.
