Network automation with Ansible : Canarying / Rolling upgrades

In this post, I describe my experience at the NANOG 72 Hackathon and explain the project I worked on. If you want to see the work, my diffs can be found here.

The Hackathon lasted 8 hours and it involved engineers of all levels. It started with a presentation on network automation tools. Followed by team-forming: anyone with an idea shouted it out to the group and then teams were self-organized. I shared an idea for canarying automated network changes. Let me explain.

A typical challenge on network management is rolling out changes fast and reliably. Infra-as-code already goes a long way removing the human element from the equation, if you have no clue about what I’m talking about check this video. Yet, your automation has the potential to cause outages if a bug slips through testing. A good principle for automation is canarying. It consists of performing changes on a subset of endpoints and verifying the health state of the system before proceeding to prevent the wide-scale spread of a mistake.

Examples of things that may be canaried are patches, OS updates, critical software updates, config changes and etc. When using the good’ole CLI one always checks the state of the system to see if the changes look good. Canarying is analogous to that. Say you need to change the software running on all your switches from quagga to FRR. Wouldn’t it be great if you could perform those changes fast while still making sure your network is working perfectly? Yes! The alternative to that is running your scripts like a savage only to realize later that a typo slipped through your code review and blackholed all your datacenter traffic.

Now, back to the Hackathon. I convinced my team to leverage a vagrant environment + ansible code open-sourced by Cumulus. Leveraging open-source material was a great way to bootstrap development and get to a baseline faster, then we could focus on the canarying solution (that may have been a mistake). It turns out the internet was saturated and it took us more than an hour to download the boxes. At noon, I was able to bring up the environment only to find out that those boxes no longer supported quagga. That brought us to around 1PM and my team was getting a little impatient. At that time, folks were discussing pivot options but I decided to finish the work by myself. So I deserted my team, or they deserted me, you choose.

Anyway, I changed the ansible scripts to run FRR and finally had a running topology around 2PM. After that, I started playing with ansible and quickly found the feature necessary for my use case. Ansible allows you to configure the number of endpoints it runs against with the field “serial”. Once you set serial to 1 the scripts will run one at a time. DONE!!! RIGHT? Not yet.

Now it became interesting, and I grabbed a couple people to discuss how to actually implement this. And basically, the discussion ran around defining what is sufficient condition to accurately determine the state of the system.

Ping. Although a failed ping implies network failure, its success doesn’t mean there are no problems changes.
Route table of the modified switch: That doesn’t guarantee the whole system is in a positive state. For example, things could look good on a switch but a route could have been withdrawn on the other side of the network.
Global routing table state. That one can only be right, the challenge is how can you get that info.
Some folks in the team mentioned normally they have monitoring systems running to detect anomalies and remediate them with scripts. But you don’t know what you don’t know. So I’d advocate for canarying additionally to that.

Okay, so I quickly did #2 and realized it wasn’t good enough. I should have stopped coding there and made a powerpoint, but I decided to keep hacking instead. I wasn’t able to complete #3 and kinda just mumbled the content of this post. It was cool. Overall, it made my Sunday productive and I wouldn’t have done anything like this otherwise.

I finished this project a few weeks later. I wrote a flask web service that returns the number of working BGP neighbors, then on Ansible I can call that API and assess the global state. Let’s go for a tutorial.

Canarying tutorial

To run this you need vagrant 2.02. To start bring up the infrastructure:

 git clone https://github.com/cumulusnetworks/cldemo-vagrant
 cd cldemo-vagrant
 sudo vagrant up oob-mgmt-server oob-mgmt-switch leaf01
 sudo vagrant up leaf02 leaf03 leaf04 spine01 spine02

This should bring up a leaf-spine infrastructure not yet configured. To configure it do the following:

 sudo vagrant ssh oob-mgmt-server
git clone https://github.com/castroflavio/cldemo-roh-ansible
cd cldemo-roh-ansible
ansible-playbook start.yml

The next step is to run the change.yml playbook:

---
- hosts: spines
  user: cumulus
  become: yes
  become_method: sudo
  roles:
    - ifupdown2
    - frr
    - statecheck
  serial: 1
  max_fail_percentage: 50

The magic lies on the serial field, which stipulates changes need to be deployed sequentially, 1 node at a time in this case. Whenever a change fails the statecheck playbook will fail:

- uri:
    url: "http://{{ item }}:5000/"
    return_content: yes
  register: webpage
  with_items:    
    - '192.168.0.11'
    - '192.168.0.12'
    - '192.168.0.13'
    - '192.168.0.14'

- debug: var=item.content
  failed_when: item.content|int < 2
  with_items: "{{ webpage.results }}"

It reaches out to every leaf(192.168.0.1*) and gets the number of BGP neighbors through a REST API. The Web service code can be found here.

Thanks for reading the long post, do not hesitate to share any questions and thoughts.

Network automation with Ansible : Canarying / Rolling upgrades

Canarying tutorial

Related

Be First to Comment

Leave a ReplyCancel reply