

Network automation with Ansible: Canarying / Rolling upgrades

In this post, I describe my experience at the NANOG 72 Hackathon and explain the project I worked on. If you want to see the work, my diffs can be found here.

The Hackathon lasted 8 hours and involved engineers of all levels. It started with a presentation on network automation tools, followed by team-forming: anyone with an idea shouted it out to the group, and teams self-organized. I shared an idea for canarying automated network changes. Let me explain.

A typical challenge in network management is rolling out changes quickly and reliably. Infrastructure-as-code already goes a long way toward removing the human element from the equation (if you have no clue what I’m talking about, check this video). Yet your automation can still cause outages if a bug slips through testing. A good principle here is canarying: perform changes on a small subset of endpoints and verify the health of the system before proceeding, so a mistake cannot spread network-wide.

Examples of things that can be canaried include patches, OS updates, critical software updates, and config changes. When using the good ol’ CLI, one always checks the state of the system to see if the changes look good; canarying is analogous to that. Say you need to change the routing software on all your switches from Quagga to FRR. Wouldn’t it be great if you could perform those changes fast while still making sure your network is working perfectly? Yes! The alternative is running your scripts like a savage, only to realize later that a typo slipped through code review and blackholed all your datacenter traffic.

Now, back to the Hackathon. I convinced my team to leverage a Vagrant environment plus Ansible code open-sourced by Cumulus. Leveraging open-source material was a great way to bootstrap development and get to a baseline faster, so we could focus on the canarying solution (that may have been a mistake). It turns out the internet was saturated and it took us more than an hour to download the boxes. At noon, I was able to bring up the environment only to find out that those boxes no longer supported Quagga. That brought us to around 1 PM and my team was getting a little impatient. Folks started discussing pivot options, but I decided to finish the work by myself. So I deserted my team, or they deserted me, you choose.

Anyway, I changed the Ansible scripts to run FRR and finally had a running topology around 2 PM. After that, I started playing with Ansible and quickly found the feature I needed: Ansible lets you control how many endpoints a play runs against with the “serial” field. Set serial to 1 and the play runs against one host at a time. DONE!!! RIGHT? Not yet.
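
As a quick aside, serial also accepts percentages or a list of batch sizes, so you can start with a canary and ramp up. A minimal sketch (the host group and role here are placeholders, not code from the repo):

- hosts: spines
  serial:
    - 1        # canary: one switch first
    - "50%"    # then half of the remaining hosts
    - "100%"   # then everything else
  roles:
    - frr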

Now it became interesting, and I grabbed a couple of people to discuss how to actually implement this. The discussion basically revolved around what constitutes a sufficient condition to accurately determine the state of the system.

  1. Ping: although a failed ping implies a network failure, a successful ping doesn’t mean the changes caused no problems.
  2. Route table of the modified switch: that doesn’t guarantee the whole system is in a good state. For example, things could look fine on the switch you just changed while a route was withdrawn on the other side of the network (a rough sketch of such a local check follows this list).
  3. Global routing table state: this one is reliable; the challenge is how to actually collect that information.
  4. Some folks on the team mentioned they normally have monitoring systems running to detect anomalies and remediate them with scripts. But you don’t know what you don’t know, so I’d advocate for canarying in addition to that.
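
For reference, here’s a rough sketch of what a local check in the spirit of #2 could look like: pull the switch’s own route table from FRR and fail if it looks suspiciously small. The vtysh command, the JSON handling, and the threshold of 10 are illustrative assumptions, not code from the repo.

# Sketch: fail the play if the local route table shrank unexpectedly.
- name: collect the local route table from FRR
  command: vtysh -c "show ip route json"
  register: route_raw
  changed_when: false

- name: fail if the local route table looks too small
  assert:
    that:
      - (route_raw.stdout | from_json) | length >= 10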

Okay, so I quickly did #2 and realized it wasn’t good enough. I should have stopped coding there and made a PowerPoint, but I decided to keep hacking instead. I wasn’t able to complete #3 and kind of just mumbled the content of this post. It was cool. Overall, it made my Sunday productive and I wouldn’t have done anything like this otherwise.

I finished this project a few weeks later. I wrote a Flask web service that returns the number of working BGP neighbors, so from Ansible I can call that API and assess the global state. Let’s walk through a tutorial.
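
The actual service lives in the repo linked below; this is just a minimal sketch of the idea, assuming the neighbor count is pulled from FRR’s vtysh JSON output (the parsing details and field names are assumptions and may vary by FRR version):

from flask import Flask
import json
import subprocess

app = Flask(__name__)

@app.route("/")
def bgp_neighbor_count():
    # Ask FRR for its BGP summary and count established sessions.
    out = subprocess.check_output(
        ["vtysh", "-c", "show ip bgp summary json"]).decode()
    summary = json.loads(out)
    peers = summary.get("ipv4Unicast", {}).get("peers", {})
    established = [p for p in peers.values()
                   if p.get("state") == "Established"]
    return str(len(established))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

The statecheck tasks shown further down simply compare that number against the expected neighbor count.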

Canarying tutorial

To run this you need Vagrant 2.0.2. To start, bring up the infrastructure:

 git clone https://github.com/cumulusnetworks/cldemo-vagrant
 cd cldemo-vagrant
 sudo vagrant up oob-mgmt-server oob-mgmt-switch leaf01
 sudo vagrant up leaf02 leaf03 leaf04 spine01 spine02 

This should bring up a leaf-spine infrastructure that is not yet configured. To configure it, run the following:

 sudo vagrant ssh oob-mgmt-server
 git clone https://github.com/castroflavio/cldemo-roh-ansible
 cd cldemo-roh-ansible
 ansible-playbook start.yml

The next step is to run the change.yml playbook:

---
- hosts: spines
  user: cumulus
  become: yes
  become_method: sudo
  roles:
    - ifupdown2
    - frr
    - statecheck
  serial: 1
  max_fail_percentage: 50

The magic lies in the serial field, which stipulates that changes are deployed sequentially, one node at a time in this case. With serial set to 1, each batch is a single host, so one failure is 100% of the batch, exceeds the max_fail_percentage of 50, and aborts the play before the next switch is touched. Whenever the post-change check fails, the statecheck role fails the play:

# Query the state-check web service on every leaf
- uri:
    url: "http://{{ item }}:5000/"
    return_content: yes
  register: webpage
  with_items:
    - '192.168.0.11'
    - '192.168.0.12'
    - '192.168.0.13'
    - '192.168.0.14'

# Fail if any leaf reports fewer than 2 working BGP neighbors
- debug: var=item.content
  failed_when: item.content|int < 2
  with_items: "{{ webpage.results }}"

The statecheck role reaches out to every leaf (192.168.0.1*) and gets its number of BGP neighbors through a REST API. The web service code can be found here.
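
To watch the canary in action, run the playbook from the cldemo-roh-ansible directory on the oob-mgmt-server:

 ansible-playbook change.yml

If you break BGP on the first spine on purpose, the play should stop after that first batch instead of rolling the change out to the second spine.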

Thanks for reading the long post; don’t hesitate to share any questions and thoughts.


What’s happening in 2018?

I was going through old blog posts and stumbled on this article, where I explained my views on the state and future of the networking industry back in January 2017. In this article, I’d like to do the same for 2018. 2017 was the year of hype; I believe 2018 is going to be a competitive year.

To start, the first specs for 5G are out and spectrum auctions are happening. That means new players are coming and markets will be shaken up; folks with better network management systems, like Google, will definitely make a huge investment.

Other than the chase for 5G, reducing costs and increasing revenue are still the main drivers of innovation. Nothing changed there, and it won’t change in the next 100 years. In 2017, we saw increasing demand for network innovation from enterprises, driven by the maturation of SDN-like technologies.

The biggest achievement of 2017 was the network disaggregation started by OCP. Broadcom keeps announcing jaw-dropping specs for upcoming chipsets, and competition in that area looks healthy, with Mellanox, Cavium, and P4. On the OS side, Cumulus, Big Switch, SnapRoute, Arista and even OpenSwitch offer plenty of choices and are still growing steadily. Additionally, AT&T is putting effort into an open-source OS of its own.

I wish the industry would move towards open-source and extremely low-cost networking solutions. But although folks want costs to come down, that’s not the sole priority. By embracing open source, ISPs expose themselves to risks and extra costs. It’s common for businesses to transfer risk, and they are willing to pay good money for that. Enterprises are as risk-averse as ISPs.

I think there’s an opportunity for commercial support of open-source solutions. The challenge lies in finding the proper incentives. Entrepreneurs have little motivation to invest in a specific technology like OpenStack or Kubernetes because it’s ephemeral, so their incentive lies in cultivating customer relationships and learning the technology as necessary. Further, companies seem to find it hard to hire people specialized in open-source projects (some would argue corporations are terrible at hiring in general) and end up crossing those options off from the beginning. Winning players like Amazon, Google and Facebook ignore that cost and feast on the benefits by hiring skilled people into agile teams, which increases their learning rate.

I may be wrong here, but I see Juniper making a strong move into software with Contrail and even Junos Space, while Cisco is still Cisco. In other words, the biggest factor ensuring market dominance today is customer relationships rather than winning technology, even though technology always wins in the long term. In raw networking specs, I doubt they will catch up to Broadcom. Only time will tell.

All of this leads me to believe 2018 will kick off big changes in the industry. We won’t see these changes show up in financial results until at least 2019, but I bet the winning ISPs in 2020 will be the ones making the right technological choices now. Also, notice that market share in the industry has not changed much in the last 5 years.
