It’s been a while. In my last post, I narrated my experience at the NANOG 72 hackathon where I started working on a canarying project. I’m going to dive deeper into the underlying concepts for Test-driven Network Automation.
Justifiably there’s currently a big push for Infra-as-Code (IAC) in networking. IAC is pursued in order to enable modern software processes on infrastructure. The main benefits we are looking for is agility and predictability. Agility meaning faster feature delivery cycles. Predictability meaning reduction of outages: by automating deployment we can reduce human mistakes during maintenance. By doing so, you enable your team to collaborate more effectively and compound their productivity gains by improving code, ultimately allowing you to run a huge network with a small team.
As a side note, I believe those efficiencies developed in webscale companies like Facebook, Google, Microsoft will be assimilated into the markets sooner or later. Current network operation teams in TELCOs ( Verizon, AT&T, Comcast, Charter ) are orders of magnitude bigger than Webscale’s teams. So, ultimately I believe OpEx will slowly push inefficient practices out of the markets.
- Make changes
- Pull Request
- Peer-review – Automation code is reviewed
- Dry-run – Dry-run against a lab or production
- Notify results – Config Diffs, Errors?
- Canary changes until the whole system is upgraded or rollback changes
Now, that’s a fair process, The missing part here is test automation. Augmenting this process with test automation allows bugs to be found faster, reducing outages. The networking tests can be basically summarized into 5 categories.
- Config checks ( format )
- State check ( ARP table entries, Routing table, BGP neighbors )
- L2 reachability health
- L3 connectivity health
- Application health
I discuss some of this tests later in this article. Now the remaining thing is to do the canarying properly. Thus I’d augment the deployment phase:
- Record baseline health-state
- Deploy changes to a subset of nodes
- Wait/Gather data
- Observe alarms
- After quarantine wait-time has passed, increment change subset and go back to step 2.
- If alarms are unacceptable, rollback change
In this way, you guarantee that only a subset of your network is affected by possible errors.
Ultimately, Application health data should drive this. But usually, that data is nor easily consumable because team silos, or it’s simply difficult to get a small set of application-level metrics that deterministically tell the network has a problem. So, we revert back to L3 connectivity. Now, speaking of L3 connectivity we basically mean latency, loss, and throughput. The only way to get the actual data is by actively measuring it, the easiest open-source tool out there to do this programmatically is Todd.
What could go wrong?
Assessing health-state is already a pretty difficult problem. It would be great if we had a set of simple metrics to assert connectivity, but if that was trivial half of us network engineers wouldn’t have jobs. For example, although a ping test failure necessarily means you did something wrong, ping success doesn’t suffice to say a change went successfully. Basically, we either don’t have enough information to assess the state properly or we have so much info that assessing state is hard. I’m unaware of a solution to handle too much info, but I feel like this would be a good use case for machine learning. That’s all to say that the mechanism chosen to assess health state may likely not suffice.
The second thing is that even if the mechanisms to assess state you have do suffice, imagine your change next state is incompatible with your previous state, for example, you are changing your BGP password. In that case, your change intermediate steps do not present full connectivity. Canarying doesn’t make much sense in those scenarios. This scenario comes more often than you would wish since a lot of network changes exist to fix something.
Another challenge is that sometimes you just can’t replicate the current state of production in your development environment, that way you can’t really develop a procedure that executes a change in zero downtime. Imagine for example, you developed a change procedure that works in your development environment but when you push the change to the first subset of switches, a failure in redundancy is detected, and you abort the change. This reduces the throughput of changes executed by your team. There’s a point where the risk-acceptance level of the change may need to be reclassified in order for work to be done.
How do I profit from this?
Canarying gives you the opportunity to identify a bug before compromising your whole network. And it reduces detection time as verification is now an automated procedure. Let’s for example, you pushed a configuration change with a broken route-map for example, invalidating some routes to your systems. A good detection system plus a blue/green type of deployment would contain the outage caused by misconfiguration.
At the end of the day, I believe, what determines the productivity of your team is how fast you can find issues. By adopting test-driven practices you reduce your detection time, and thus reduce idle time of your team, improving productivity.