
Net Automated Posts


Test-driven Network Automation

It’s been a while. In my last post, I narrated my experience at the NANOG 72 hackathon where I started working on a canarying project. I’m going to dive deeper into the underlying concepts for Test-driven Network Automation.

Why?

Justifiably, there's currently a big push for Infrastructure as Code (IaC) in networking. IaC is pursued in order to enable modern software practices on infrastructure. The main benefits we are looking for are agility and predictability: agility meaning faster feature delivery cycles, and predictability meaning fewer outages, since automating deployment reduces human mistakes during maintenance. By doing so, you enable your team to collaborate more effectively and compound its productivity gains by improving code, ultimately allowing you to run a huge network with a small team.

As a side note, I believe the efficiencies developed at webscale companies like Facebook, Google, and Microsoft will be assimilated by the market sooner or later. Current network operations teams at telcos (Verizon, AT&T, Comcast, Charter) are orders of magnitude bigger than the webscalers' teams, so ultimately I believe OpEx pressure will slowly push inefficient practices out of the market.

How?

CI/CD is fairly well defined as a software practice; the question is how we apply it to network automation. The following is a good representation of the process, supplied by Juniper (I think); a sketch of what such a pipeline could look like follows the list:

  1. Make changes
  2. Pull Request
  3. Peer-review – Automation code is reviewed
  4. Dry-run – Dry-run against a lab or production
  5. Notify results – Config Diffs, Errors?
  6. Approve
  7. Canary changes until the whole system is upgraded or rollback changes
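To make this more concrete, here is a minimal sketch of how steps 2–6 could be wired up, using GitLab CI purely as an example (any CI system would do). The stage names, playbook paths, and the use of ansible-lint are my own assumptions for illustration, not part of the original process:

# .gitlab-ci.yml (sketch; file names and playbooks are hypothetical)
stages: [lint, dry-run, deploy]

lint:
  stage: lint
  script: ansible-lint playbooks/                                # basic config/format checks on the PR

dry-run:
  stage: dry-run
  script: ansible-playbook playbooks/deploy.yml --check --diff   # produce config diffs and surface errors

deploy:
  stage: deploy
  when: manual                                                   # a human approves before anything rolls out
  script: ansible-playbook playbooks/deploy.yml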

Now, that's a fair process. The missing part here is test automation. Augmenting the process with test automation allows bugs to be found faster, reducing outages. Networking tests can basically be summarized into 5 categories (a sketch of a simple state check follows the list):

  • Config checks (format)
  • State checks (ARP table entries, routing table, BGP neighbors)
  • L2 reachability health
  • L3 connectivity health
  • Application health
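As an example of the state-check category, the sketch below asserts that each leaf still has at least as many established BGP sessions as a baseline recorded in inventory. It assumes FRR-based devices reachable over SSH and a hypothetical expected_bgp_neighbors variable; the exact JSON keys depend on the FRR release, so treat it as an illustration rather than a drop-in test:

---
# state-check.yml (sketch; variable names and JSON keys are assumptions)
- hosts: leaves
  gather_facts: no
  become: yes
  tasks:
    - name: Grab the BGP summary as JSON
      command: vtysh -c "show bgp summary json"
      register: bgp_raw
      changed_when: false

    - name: Fail if fewer sessions are Established than the recorded baseline
      assert:
        that:
          - >-
            (bgp_raw.stdout | from_json).ipv4Unicast.peers
            | dict2items
            | selectattr('value.state', 'equalto', 'Established')
            | list | length >= expected_bgp_neighbors | int
        fail_msg: "BGP session count below baseline on {{ inventory_hostname }}"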

I discuss some of these tests later in this article. The remaining piece is doing the canarying properly, so I'd augment the deployment phase:

  1. Record baseline health-state
  2. Deploy changes to a subset of nodes
  3. Wait/Gather data
  4. Observe alarms
    • After quarantine wait-time has passed, increment change subset and go back to step 2.
    • If alarms are unacceptable, roll back the change

In this way, you guarantee that only a subset of your network is affected by possible errors.
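Ansible can express the "increment the change subset" part of this loop natively: serial accepts a list of batch sizes, and a failed batch stops the play before the next, larger batch starts. The role names below are hypothetical and automatic rollback is not shown; this is only a sketch of the rollout mechanics:

---
# canary-deploy.yml (sketch; role names are hypothetical)
- hosts: leaves
  serial:
    - 1                    # canary a single node first
    - "25%"                # then a quarter of the fleet
    - "100%"               # then all remaining nodes
  max_fail_percentage: 0   # any failed host aborts the rollout
  roles:
    - deploy_change        # push the configuration change
    - statecheck           # assert health before the next batch proceeds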

Ultimately, application health data should drive this. But usually that data is not easily consumable because of team silos, or it's simply difficult to get a small set of application-level metrics that deterministically tell you the network has a problem. So we fall back to L3 connectivity, by which we basically mean latency, loss, and throughput. The only way to get the actual data is to actively measure it, and the easiest open-source tool out there to do this programmatically is ToDD.

What could go wrong?

Assessing health state is already a pretty difficult problem. It would be great if we had a set of simple metrics to assert connectivity, but if that were trivial, half of us network engineers wouldn't have jobs. For example, although a ping failure necessarily means something is wrong, a successful ping isn't enough to say a change went through successfully. Basically, we either don't have enough information to assess the state properly, or we have so much information that assessing state is hard. I'm unaware of a good solution for handling too much information, but I feel like this would be a good use case for machine learning. That's all to say that the mechanism chosen to assess health state may well not suffice.

The second thing is that even if the mechanisms you have to assess state do suffice, imagine that the post-change state is incompatible with the previous state; for example, you are changing your BGP password. In that case, the intermediate steps of the change do not present full connectivity, and canarying doesn't make much sense. This scenario comes up more often than you would wish, since a lot of network changes exist to fix something.

Another challenge is that sometimes you just can't replicate the current state of production in your development environment, so you can't really develop a procedure that executes the change with zero downtime. Imagine, for example, that you developed a change procedure that works in your development environment, but when you push the change to the first subset of switches, a failure in redundancy is detected and you abort the change. This reduces the throughput of changes executed by your team. There's a point where the risk-acceptance level of the change may need to be reclassified in order for the work to get done.

How do I profit from this?

Canarying gives you the opportunity to identify a bug before it compromises your whole network, and it reduces detection time since verification is now an automated procedure. Say, for example, you pushed a configuration change with a broken route-map, invalidating some routes to your systems. A good detection system plus a blue/green type of deployment would contain the outage caused by the misconfiguration.

At the end of the day, I believe what determines the productivity of your team is how fast you can find issues. By adopting test-driven practices you reduce your detection time, and thus your team's idle time, improving productivity.


Create your own virtual lab with Netsim Tools – Installation

How do you know your change won't cause problems in the network? Traditional network engineering relies on robust change review processes, CCIE-level knowledge, and smart hands to ensure that. More modern approaches aim to solve this problem with a DevOps mindset that includes multiple layers of testing.

In networking, it's very expensive to have a production-like environment for testing. One possible approach is to use some type of virtual networking lab. We've been playing around with an open-source tool called netsim-tools; below we give an overview of the tool and describe the installation process.

Netsim-tools brings IaC to your networking labs. Instead of wasting time creating lab topologies in a GUI and configuring boring details, you start with a lab preconfigured according to your specifications. Netsim-tools allows you to (a minimal topology sketch follows this list):

  • Describe high-level lab topology in YAML format without worrying about the specific implementation details
  • Use the same lab topology with multiple virtualization providers (Virtualbox, KVM/libvirt, Docker containers)
  • Create Vagrant configuration files and Ansible inventory from the lab topology
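To give you a taste of that YAML, here is a minimal sketch of a two-node topology. The device type, module, and node names are arbitrary choices for illustration; check the netsim-tools documentation for the exact syntax supported by your release:

# topology.yml (sketch; device, module and node names are arbitrary)
provider: libvirt          # or virtualbox / clab
defaults:
  device: cumulus          # default NOS for every node
module: [ ospf ]           # configuration modules to enable
nodes: [ r1, r2 ]
links:
  - r1-r2                  # a point-to-point link between the two nodes

Depending on the release, something like netlab create followed by vagrant up, or the newer netlab up, turns this file into the Vagrant configuration, the Ansible inventory, and a running lab.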

In a future post we will create a full topology; for now, let's focus on the installation procedure. Netsim-tools supports, among others, the following network operating systems:

  • Arista vEOS
  • Cisco IOSv
  • Cisco CSR 1000v
  • Cisco Nexus 9300v
  • Cumulus Linux
  • Cumulus Linux 5.0 (NVUE)
  • Fortinet FortiOS
  • FRR 7.5.0
  • Generic Linux host
  • Juniper vSRX 3.0
  • Mikrotik CHR RouterOS
  • Nokia SR Linux
  • Nokia SR OS
  • VyOS
  • Dell OS10

Read more about supported devices in:
netsim-tools.readthedocs

Installation

Now, let's install netsim-tools. The install process is automated; you can copy and paste the script below on your Ubuntu 20.04 machine, but some requirements are indispensable:

Server requirements:
• 4 CPUs
• 8 GB MEM
• 64 GB Disk
• Ansible
• Python3 with pip

Please read the comments and adjust the script to your needs. If you have a fresh server, or you can remove old versions of Python and Ansible, that is probably the smoothest path.

Installation with Ansible:

# you can copy/paste and run all this text.
# you can choose not to update your system and Python,
# but we recommend doing it to avoid trouble with Ansible

# if you can update your system, uncomment this step
# apt update && apt -y upgrade && apt -y autoremove

# if you can update your Python, uncomment the two steps below
# apt remove -y python3 ansible ansible-core 
# apt install python3-pip

# adjust the "ADJUSTING TIME TO AVOID ISSUES" step to your timezone if the script does not work

# Update your pip and install Ansible to run the playbook
pip install --upgrade pip
pip install ansible ansible-core

# create the playbook in the netsim-install.pb file
cat > netsim-install.pb << EOF
---
- hosts: localhost
  gather_facts: no
  tasks:
    - name: "[ UPDATING SYSTEM PACKAGES ]"
      apt:
        update_cache: yes
        upgrade: full
        autoremove: yes
      register: aptout

    - name: "[ INSTALLING DEPENDENCIES OF NETLAB ]"
      apt:
        name: "{{ pkgs }}"
        state: present
      vars:
        pkgs:
          - python3-pip
          - vagrant-libvirt

    - name: "[ UPDATING PIP ]"
      shell:
        cmd: "pip install --upgrade pip"

    - name: "[ INSTALLING NETSIM-TOOL WITH PIP  ]"
      pip:
        name: "netsim-tools"

    - name: "[ INSTALLING NETLAB PACKAGES ]"
      shell:
        cmd: "netlab install -y ubuntu ansible libvirt containerlab"

    - name: "[ ADJUSTING TIME TO AVOID ISSUES ]"
      shell:
        cmd: "timedatectl set-timezone $(cat /etc/timezone) && timedatectl --adjust-system-clock"

    - name: "[ TESTING KVM LIBVIRT SUPPORT ]"
      shell:
        cmd: "kvm-ok"
      register: kvmout
      ignore_errors: True

    - debug:
        msg: "{{ kvmout }}"
EOF
# run the playbook and, if it passes, run the netlab test
ansible-playbook ./netsim-install.pb && netlab test libvirt

Hopefully you reach this point and the test run completes cleanly. When the netlab test script finishes, the lab is automatically destroyed, showing that all the requirements are in place. Well, that's all; have a good run and enjoy!

Sources:
https://github.com/ipspace/netsim-tools
https://netsim-tools.readthedocs.io


Beginning of a new cycle

Hi everybody, I just want to write a little bit about my recent career updates and what I've been up to. In the last couple of years I've developed a growing interest in finance and investments, which led me to take a Machine Learning for Trading course on Coursera and then read a lot on the subject. Special mention goes to the book Dao of Capital, which is definitely one of my top 5 favorite books.

Since I was spending so much time reading about finance, I decided to take classes part-time on the subject. So far I have taken a stochastic processes for finance class at Georgia Tech and an Advanced Machine Learning class at NYU as a visiting student; these classes have been absolutely amazing!

Anyway, at this point I'm not sure I'd transition my career to finance, but I definitely want to learn more about it, so I decided to apply for networking jobs at finance companies. By extreme luck I was able to interview for a position I think I'm a perfect fit for. Every time I talked to an engineer at this company I became more interested in the opportunity. I finally got an offer and was thrilled to join the Systems Dev team at Hudson River Trading (HRT) as a network automation engineer.

For those who are not familiar, a quick Wikipedia search will tell you that "HRT is a quantitative trading firm, and more specifically a high-frequency trading (HFT) firm. According to the Wall Street Journal, it is responsible for about 5% of all stock trading in the United States."

I started only a week ago, but I'm absolutely excited because I already see state-of-the-art, bleeding-edge technologies left and right. Their engineering environment is one of the most complex I have ever been exposed to, and the technical challenges are amazing. And when I say complex, I don't mean process complexity or people complexity; I mean technical complexity, which is every engineer's paradise.

In many ways this opportunity is perfect for me, since it allows me to progress in my career as a network automation engineer while learning more about the finance industry. To top it off, I'm also working with amazing engineers and managers. I'll try to write more about that later.

Anyway, that’s all folks! Ciao!


End of a Cycle

Last Friday was my last day working as a Network Automation Engineer at Cargill. Words can't describe how grateful I am for spending 2020 with them. For those who don't know, according to Forbes, Cargill is the 2nd largest private US company, primarily in the food and agriculture industry. I was asked many times: why would they need network automation engineers?

Cargill runs an immense network and as every other network, it needs to be managed properly. I really cherished the opportunities I got to get exposed to technologies I wasn’t super familiar with such as Jenkins and Batfish while, at the same time, mentoring other engineers on Ansible implementations.

I was really impressed with the overall integrity and high level of excellence at Cargill. Even though they are not a tech company, their container platform is the best I've worked with so far. Additionally, they are really good at agile practices, ensuring developers deliver their best. More than anything, I cherish the incredible connections I made while working there, which I hope to keep for life.

This event and other things got me thinking. This year it will be 5 years since I got my Master's degree in Computer Science and 7 since I made my permanent move to the USA. Many things have happened in my life recently that deserve appreciation.

In 2014, I obtained my Bachelor of Science degree in Telecommunication Networks Engineering in Brazil. It took me six years to get that degree because I almost gave up on it in my 5th year. That degree, plus other experiences, gave me a very strong quantitative background. I did not get exposed to as many technologies as I would have liked, but I was exposed to many fundamental problems that are extremely relevant to engineering. In hindsight, I value that experience way more than I did a few years ago. More than that, I met some of the smartest individuals of my life during that time.

During my Master's in the US I dove deep into the research area of Software-Defined Networking. Initially through Georgia Tech, I had the opportunity to meet individuals I admire a lot and once thought of as legends, such as Vint Cerf, Bob Kahn, and Larry Peterson. It was mind-boggling to learn these guys were real people. Georgia Tech was the environment in which I was the most productive and focused in my life so far. It's a tough school, but every minute I spent there was worth it. While at GT I also worked for ONF; that experience could fill a whole new blog post, so I'll just admit that I'm neglecting it here. Work-wise, ONF has been the most significant job experience for me so far.

After that, I directed my career toward network automation and worked for 3 companies: Verizon, PayPal, and Cargill. Right now I feel like I owe Verizon, PayPal, and ONF their own dedicated posts; I might write another one soon. For now, I'm just going to say that I'm quite happy with where all those experiences brought me and I'm very grateful for everything I've learned so far.

I’m in the last steps of figuring out where I’m going next. I’ll keep you all posted.

Cheers!


Where to place Ansible variables?

In this post, I want to discuss a couple of the choices you have when it comes to placing Ansible variables. The docs on variable precedence list 22 places where variables can be defined and the order of precedence the program applies.

Does that mean you should use all of them? Nope. Although each one of them has its own use case, I usually vouch for implementation simplicity: less is better.

Here is the list I’d recommend:

  • role defaults
  • inventory group_vars
  • inventory host_vars
  • role vars

I usually do not use the other types of variables and I would not recommend doing so. I reiterate that they can all be useful, but, in my opinion, relying mostly on your inventory variables reduces complexity significantly, which translates into time savings when troubleshooting, among other things.
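As a minimal sketch (variable values and file names are made up), here is how the same variable resolves across the four recommended locations, listed from lowest to highest precedence:

# roles/frr/defaults/main.yml -- role defaults: lowest precedence, meant to be overridden
frr_asn: 65000

# inventory/group_vars/leaf.yml -- overrides role defaults for every host in the group
frr_asn: 65010

# inventory/host_vars/leaf01.yml -- overrides group_vars for this one host
frr_asn: 65011

# roles/frr/vars/main.yml -- role vars: wins over all of the above,
# so reserve it for values users should not override
frr_asn: 65020

If a value ever surprises you, there are only four files to check, which is exactly the point.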

I remember many times when I spent days working on a bug that was introduced by a misplaced variable.

Think of it this way: with every additional layer, you are adding more opportunities for the system to break. For example, one time when we were writing playbooks for upgrading the OS on a networking device, we stipulated that some tasks of the playbook should use a different password, so we hardcoded those variables as task vars. For some reason, in production someone was supplying the password variables at a higher precedence and thus breaking the code. Now, I could argue that the person supplying the password was making a mistake, but in hindsight I'd say the mistake was allowing this complexity to exist at all.

So if you have N layers, you can have O(N^2) possible conflicts, which means troubleshooting can be a nightmare. Experienced Ansible developers know how to navigate this, but why even bother if you can prevent it?

Again, let's say you wrote your code so that it uses block variables, but then a user wants to override some behavior by setting variables with include_vars. That could put you in uncharted territory, meaning the code was never tested for that use case…

Role defaults vs Role vars

One pattern I've been using when writing roles is using role defaults as an example of how to configure the role, similar to what is used here:

---
# defaults file for ansible-frr
frr_daemons:
  bgpd: false
  isisd: false
  ldpd: false
  nhrpd: false
  ospf6d: false
  ospfd: false
  pimd: false
  ripd: false
  ripngd: false
  zebra: true

I like to use role vars when something must be hardcoded, such as your data model schema. Let's say the dictionary frr_daemons must have an entry for bgpd. Then I like to put an entry in vars that explicitly defines that, and I also add a task at the beginning of my role to check that the inputs are given (a sketch of such a check follows the snippet below):

---
# Must have children in variables
children:
  frr_daemons: [ 'bgpd' ]

This can get out of hand quickly, but it usually works out.
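A minimal version of that input check, hardcoding the requirement rather than reading it from the children map above, could look like this (a sketch, not the exact task I use):

---
# tasks/main.yml (sketch): fail fast if the caller forgot a required entry
- name: Check that frr_daemons defines the required children
  assert:
    that:
      - frr_daemons is defined
      - "'bgpd' in frr_daemons"
    fail_msg: "frr_daemons must contain an entry for bgpd"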

Conclusion

In summary, I believe you can place your Ansible variables in the 4 locations listed below, and if you avoid placing variables anywhere else you'll save yourself a lot of troubleshooting time. I'd love to hear about a use case this approach doesn't cover.

  • role defaults
  • inventory group_vars
  • inventory host_vars
  • role vars

Please comment below, or PM me on the networktocode folks' Slack; my username is castroflavio.


Tutorial – What is Infrastructure as Code?

Howdy!! There's so much network automation content out there that I've been feeling like there's not much value in blogging about it. This is the first post of a series with some theory and lab-like tutorials, plus code examples, covering what I believe to be the meat and bones of Infrastructure as Code (IaC).

In the last decade, the industry has seen a strong push to adopt software practices in network engineering. SOME COMPANIES have been successful at that, but I don't think that's for everyone. Software development is hard; cost-effective software development is an especially rare beast when you are paying 300K per engineer annually. Okay, so what?

I claim it's far more feasible to take a shorter step and implement successful DevOps practices in your networking team than it is to FULLY run your network team like a software team. Infra-as-Code comes in as an enabler for that goal.

Why Infra-as-Code(IaC)?

Infrastructure as Code can simplify and accelerate your infrastructure provisioning process, help you avoid mistakes and comply with policies, keep your environments consistent, and save your company a lot of time and money. Your engineers can be more productive and focus on higher-value tasks. And you can better serve your customers.

https://www.thorntech.com/2018/01/infrastructureascodebenefits/

If IaC isn’t something you’re doing now, maybe it’s time to start!

I’m SOLD! HOW can I get started?

I'm glad I asked! This article defines 6 best practices to get the most out of IaC:

  1. Codify everything
  2. Document as little as possible
  3. Maintain version control
  4. Continuously test, integrate, and deploy
  5. Make your infrastructure code modular
  6. Make your infrastructure immutable (when possible)

So, this is our lab curriculum so far (the numbers map to the best practices above):
1. Codify everything: How to use Ansible to automate network provisioning
3. Maintain version control: Git tutorial with Ansible
4. Continuously test, integrate, and deploy: CI/CD with GitLab
4.1 Continuous testing with Batfish and Robot Framework

Let me know how this sounds, and if you find this post in the future but I haven't followed up on it, please leave a comment and I'll do my best to get it done.


Product Management Bootcamp

What's going on? While working on datacenter network automation at PayPal this past year, I noticed how much more effective some development teams are when they have a good product manager, and that piqued my curiosity. I realized I always spend a good amount of time asking why work should be done before doing it, and that is one of the characteristics of a product manager.

Last week, I joined a Bootcamp on Product Management at General Assembly. My goal in joining this Bootcamp is to dive into the subject and understand how product management brings value to the team. Along the way, I hope to acquire some skills to help me be more effective at my job.

In this article, I'm going to list the reading recommendations for the first week of the bootcamp and write a short summary of each.

Here is the list of articles:

Good PM/Bad PM by Ben Horowitz and David Weiden – 3.5 stars

Good article that does a good job of densely describing what a good product manager does, and cautions about what a bad product manager does as well. Highlights of a good PDM:

  • Clearly defines requirements. The definition is based on research, information, and a logical, transparent process. It empowers engineering to fill the technical gaps; rather than pushing a vision downwards, the PDM builds the vision by gathering information informally from engineering.
  • Knows what it takes to make the product successful and defines that in writing.
  • Doesn't rest until the product vision is consistent across all teams.

What’s Your Problem (Parts 1 and 2 only) by Matt Lavoie – 5 stars

Incredible article that focuses on explaining why defining a problem statement is crucial for the success of an endeavor, rather than jumping into solutions as we engineers usually do. I like that it's a fun read, yet concise and effective in getting its point across. Highlights:

  • With a problem statement, there is no feature creep. There's a problem and a measurable outcome. If we believe something will get us to that outcome and we can create an experiment to prove it, we should work on it.
  • By not taking a moment to identify the problem, your implementation won't be as successful as it could be.
  • Outcome-driven teams know when they are successful because they measure their output against the desired outcome.

The Five Components of a Good Hypothesis by Teresa Torres – 4 stars

Solid read that describes why one should define good hypotheses for one's work to be effective. I actually like the format that she criticizes in a later article, How to Improve Your Experiment Design (And Build Trust in Your Product Experiments), which is the Lean Startup format:

We believe [this capability]
Will result in [this outcome]
We will have confidence to proceed when [we see these measurable signals]

I like it better because it's action-driven. I can see that, depending on the problem, her hypothesis template would be more accurate, but the Lean Startup one is lean: it only has the essentials to get you moving.

A Product Manager’s Job by Josh Elman – 3 stars

Good read, but honestly unimpressive. It just expands on his vision of the PDM's job: "Help your team (and company) ship the right product to your users", which is true. I think it still needs insights on how that can be achieved for the information to be relevant to me.

Product Management vs. Product Marketing by Marty Cagan – 4.5 stars

Great read; it describes what the author thinks should be the two complementary roles needed to launch a product. The author states that the two roles are often assigned to the same person, even though people usually focus on one aspect or the other, and that often creates a gap.

Product Management vs. Project Management by Marty Cagan (5/10)

Good read but unimpressive.

That's all for now, folks. I'm excited, and I'll try to write a blog post weekly about this new experience.


What’s left for SDN after the hype?

Say SDN one more time!

The following excerpt is obligatory reading for understanding SDN. It comes from the article "SDN is DevOps for Networking" by Rob Sherwood, written in 2014. If you haven't read it yet, do it now.

The term SDN was first coined in an MIT Technology Review article by comparing the shift in networking to the shift in radio technology with the advance from software defined radios.

However, the term is perhaps misleading because almost all networking devices contain a mix of hardware and software components. This ambiguity was leveraged and exacerbated by a litany of companies trying to re-brand products under the “SDN” umbrella and effectively join the SDN bandwagon. As a result, much of the technical merit of SDN has been lost in the noise.

The conflict starts with semantics. When researchers refer to SDN, they don't mean any network defined by software; that's too simplistic. They mean a specific group of ideas in computer networking, mostly established around the abstraction and separation of the control and data planes. From that premise, the term has been thoroughly misused in the industry. Further, there are different levels of abstraction this separation can be built upon.

Because of this separation, you can decouple innovation cycles and, for example, optimize your network with a global view. In fact, the most valuable goal of SDN is to decouple innovation cycles; in other words, you can improve your network control independently of the underlying technologies. This has significantly improved feature velocity in the industry: we've seen more improvement in the last 10 years than in the 30 before that. One of my posts gives some evidence of that.

In that sense, OpenFlow or P4 is an abstraction based on a forwarding API, while the more traditional network automation approach is an abstraction built on top of traditional device APIs. Different levels of abstraction can deliver on the SDN value propositions to different degrees: some optimization can be achieved without pure SDN, but not all of it.

I've seen people call BGP SDN: it has software, thus it's software-defined, right? No. What if I use a route reflector to specify BGP policy from a central point? Okay, now we are talking. A much more thoughtful argument is 'you need network automation/DevOps, not SDN'. Indeed, for many use cases that's enough, which brings me to my next topic:

DevOps for networking

Disclaimer: humanity will agree on the meaning of life before agreeing on the meaning of SDN and DevOps. I'll take the meaning from the article mentioned above: "DevOps infuses traditional server administration with best practices from software engineering, including abstraction, automation, centralization, release management and testing". I'm pretty sure most people can agree those are desirable practices and outcomes.

IaC gives you the opportunity to deliver on most of these:

  • Abstraction
  • Automation
  • Centralization
  • Release management
  • Testing

It provides an abstraction of the control plane, leveraging legacy APIs, that is good enough for a lot of people. Also, it's much easier to make a case for hiring a couple of developers than for replacing hardware. Network disaggregation at the bare-metal layer has already delivered a lot. Now, can network automation deliver global optimization?

Definitely. That doesn't mean you should, or that it's the best or cheapest way. Think about this: to what extent is it worth delivering optimization through automation? What abstractions would make it easier to do so? This becomes an engineering problem in which we evaluate which technology to use to solve the problem at hand.

Traffic Engineering

For example, take backbone TE: some organizations are perfectly fine running links at 30% average utilization. Some companies are hurt by elephant flows, some are not. Why would you care about TE again? I'm glad I asked; check this post. Jokes aside, a company with effective TE will consistently deliver a better user experience at reduced OpEx and CapEx, and eventually those efficiencies will be assimilated by the market. In fact, I'd argue we are starting to observe that. I ask you: which companies are spending the most money expanding their networks? Yes, content providers such as Google and Facebook.

Can we deliver global optimization with RSVP-TE? Sure. How complex is the software system that performs those optimizations? You tell me. What's the average size of a backbone TE team? I'm sure it's at least half a dozen CCIEs, a ton of vendors, and lots of lower-level support engineers. Most ISPs have not delivered real-time global optimization yet; few of them have even mentioned these optimizations.

Google has publicized some data, reporting that its SDN approach supports 6 times the feature velocity of traditional networking architectures. With dubious logic, one could infer that it's 6 times more expensive to deliver TE with traditional architectures.

After more than a year of incremental rollout, Espresso supports six times the feature velocity, 75% cost-reduction, many novel features and exponential capacity growth relative to traditional architectures

How many engineers would you need to develop an OpenFlow-based TE system?

One could argue that the underlying technology doesn't matter as long as the solution is delivered. Fair enough. In that case, I'd ask: which architecture is going to be more extensible and provide the biggest long-term benefit, the one put together with a bunch of hacks or the one built from the ground up?

Conclusion – What’s left after the hype?

The main reasons that started the SDN revolution are still real: the need for reduced networking costs and faster innovation cycles. Pure SDN is not the only way to achieve those; network disaggregation and network automation have delivered on those value propositions, but there's still plenty of room for improvement. P4 is a great example, as it aims to increase innovation speed on the hardware side of the equation.

It's no longer a question whether one can reduce costs with new practices in networking. Expensive operations will be replaced piece by piece; it started in the data center, and now the backbone is the new goal. Additionally, I speculate that, due to 5G, the regulatory framework is going to change a lot in 2019, allowing innovative companies to increase market share significantly. Effective innovation strategies will allow the best ISPs to expand at a much-reduced cost, and the market will see that sooner or later.


Heard about GitOps?

Howdy! This is just a reading recommendation. I recently stumbled upon an article that is INCREDIBLE. Definitely a must-read.

GitOps: A Path to More Self-service IT

I’m just going to paste the best excerpts from the article:

To recap, a GitOps system evolves like this:

  1. Basic — configs in repo as a storage or backup mechanism.
  2. IaC — PRs from within the team trigger only CI-based deployments.
  3. GitOps — PRs from outside the team, pre-vetted PRs, post-merge testing.
  4. Automatic — Eliminate the human checks entirely.

 

GitOps lowers the cost of creating self-service IT systems, enabling self-service operations where previously they could not be justified. It improves the ability to operate the system safely, permitting regular users to make big changes. Safety improves as more tests are added. Security audits become easier as every change is tracked.

Anyway, go read it.


Network automation with Ansible : Canarying / Rolling upgrades

In this post, I describe my experience at the NANOG 72 Hackathon and explain the project I worked on. If you want to see the work, my diffs can be found here.

The Hackathon lasted 8 hours and involved engineers of all levels. It started with a presentation on network automation tools, followed by team forming: anyone with an idea shouted it out to the group and teams self-organized. I shared an idea for canarying automated network changes. Let me explain.

A typical challenge in network management is rolling out changes fast and reliably. Infra-as-code already goes a long way toward removing the human element from the equation; if you have no clue what I'm talking about, check this video. Yet your automation has the potential to cause outages if a bug slips through testing. A good principle for automation is canarying: performing changes on a subset of endpoints and verifying the health state of the system before proceeding, to prevent the wide-scale spread of a mistake.

Examples of things that may be canaried are patches, OS updates, critical software updates, config changes, etc. When using the good ol' CLI, one always checks the state of the system to see if the changes look good; canarying is analogous to that. Say you need to change the routing software on all your switches from Quagga to FRR. Wouldn't it be great if you could perform those changes fast while still making sure your network is working perfectly? Yes! The alternative is running your scripts like a savage, only to realize later that a typo slipped through your code review and blackholed all your datacenter traffic.

Now, back to the Hackathon. I convinced my team to leverage a Vagrant environment plus Ansible code open-sourced by Cumulus. Leveraging open-source material was a great way to bootstrap development and get to a baseline faster, so we could focus on the canarying solution (that may have been a mistake). It turns out the internet was saturated and it took us more than an hour to download the boxes. At noon, I was able to bring up the environment, only to find out that those boxes no longer supported Quagga. That brought us to around 1 PM, and my team was getting a little impatient. At that point, folks were discussing pivot options, but I decided to finish the work by myself. So I deserted my team, or they deserted me, you choose.

Anyway, I changed the Ansible scripts to run FRR and finally had a running topology around 2 PM. After that, I started playing with Ansible and quickly found the feature necessary for my use case: Ansible allows you to configure the number of endpoints it runs against with the "serial" field. Once you set serial to 1, the scripts run against one node at a time. DONE!!! RIGHT? Not yet.

Now it became interesting, and I grabbed a couple of people to discuss how to actually implement this. Basically, the discussion revolved around defining what is a sufficient condition to accurately determine the state of the system.

  1. Ping. Although a failed ping implies a network failure, its success doesn't mean the change caused no problems.
  2. Route table of the modified switch: that doesn't guarantee the whole system is in a good state. For example, things could look fine on the switch while a route was withdrawn on the other side of the network.
  3. Global routing table state. That one is actually correct; the challenge is how to get that information.
  4. Some folks in the team mentioned that they normally have monitoring systems running to detect anomalies and remediate them with scripts. But you don't know what you don't know, so I'd advocate for canarying in addition to that.

Okay, so I quickly did #2 and realized it wasn't good enough. I should have stopped coding there and made a PowerPoint, but I decided to keep hacking instead. I wasn't able to complete #3 and kind of just mumbled the content of this post. It was cool. Overall, it made my Sunday productive, and I wouldn't have done anything like this otherwise.

I finished this project a few weeks later. I wrote a Flask web service that returns the number of working BGP neighbors; then, from Ansible, I can call that API and assess the global state. Let's go through the tutorial.

Canarying tutorial

To run this you need Vagrant 2.0.2. To start, bring up the infrastructure:

 git clone https://github.com/cumulusnetworks/cldemo-vagrant
 cd cldemo-vagrant
 sudo vagrant up oob-mgmt-server oob-mgmt-switch leaf01
 sudo vagrant up leaf02 leaf03 leaf04 spine01 spine02 

This should bring up a leaf-spine infrastructure that is not yet configured. To configure it, do the following:

 sudo vagrant ssh oob-mgmt-server
git clone https://github.com/castroflavio/cldemo-roh-ansible
cd cldemo-roh-ansible
ansible-playbook start.yml

The next step is to run the change.yml playbook:

---
- hosts: spines
  user: cumulus
  become: yes
  become_method: sudo
  roles:
    - ifupdown2
    - frr
    - statecheck
  serial: 1
  max_fail_percentage: 50

The magic lies in the serial field, which stipulates that changes are deployed sequentially, one node at a time in this case. Whenever a change causes a problem, the statecheck role will fail:

- uri:
    url: "http://{{ item }}:5000/"
    return_content: yes
  register: webpage
  with_items:    
    - '192.168.0.11'
    - '192.168.0.12'
    - '192.168.0.13'
    - '192.168.0.14'

- debug: var=item.content
  failed_when: item.content|int < 2
  with_items: "{{ webpage.results }}"

It reaches out to every leaf (192.168.0.1*) and gets the number of BGP neighbors through a REST API. The web service code can be found here.

Thanks for reading this long post, and do not hesitate to share any questions and thoughts.
