ONUG Digital Live 2020

by Nick Lippis

June 23, 2020

Automation in the Age of the Remote Workforce: Terracon and Gluware Show the Way

Throughout the years at ONUG, Gluware has consistently shown how it partners with customers. It is unique in that its customers advocate for Gluware in public, which is a difficult internal process to undergo. But time and time again, Gluware’s customers do. They do it because they are delighted with the outcome, because Gluware delivers what it promises.

Jamie Hughes, Infrastructure Architect at Terracon, delivered a keynote during ONUG Digital Live this past May with Jeff Gray, Gluware CEO. Automation is a top-five issue across all ONUG Community members. Gluware is the leader in Intelligent Network Automation, delivering an Intent-Based orchestration engine that empowers Network Operations to successfully automate and orchestrate mission-critical networks at scale. Gluware is a proven automation platform at some of the largest corporate brands, such as Mastercard and Merck.

In this keynote, Jamie presents the new reality that the pandemic brought us, and according to ONUG Community members, this way of working is the new norm. To support a highly distributed workforce, infrastructure needs to be elastic. Terracon, being an engineering firm, is constantly adding and subtracting sites. Automating elastic infrastructure of this kind is something nearly all large enterprises are developing strategies to address.

As businesses reassess and rationalize their current projects, processes, real estate and IT spend, two things come into focus: 1) the value of buyer and supplier partnerships is fundamentally important as these relationships are changing, and 2) businesses’ digital transformation projects and staffing are winning funding battles by far. In short, now more than ever, IT buyers and suppliers are tightly reliant upon each other to solve problems as corporations quickly downscale real estate and serve customers, partners and suppliers digitally.

Terracon and Gluware show how good partners work with each other to deliver a positive outcome. Terracon needed a solution to automate remote site deployment and change management that is multi-vendor, leaves an audit trail for compliance assurance, assures the remote site is configured correctly and keeps track of inventory. Jamie and his team chose Gluware to automate their remote workforce elastic infrastructure, and in the process, they saved time and money and gained better security. But the biggest payoff is that Terracon is more digital, agile and flexible in serving its customers, suppliers and partners.

Nick Lippis; ONUG Co-Founder/Co-Chair

Presented by:

Jeff Gray, Gluware CEO and Co-Founder
Jamie Hughes, Terracon Infrastructure Architect

Jeff Gray, Gluware CEO & Co-Founder

Welcome, everyone, to the Gluware Terracon keynote, “Automating Business Continuity: Network Life Before, During and After COVID-19.”

I’m Jeff Gray, CEO and Co-founder of Gluware, and I’m excited to introduce Jamie Hughes of Terracon.

We have a very relevant and timely case study that Jamie will be sharing today. Jamie is one of the most talented and capable network architects that I’ve had the pleasure to come across. He found a critical issue in his company about a year ago and he sought to find an automation platform that would work for his needs and for Terracon’s needs. And he got his hands dirty; he got very involved; he made his decision and he executed. And Terracon is leading the industry with their automation efforts and reaping the benefits. Now, Terracon was reaping the benefits before the pandemic, but now that the pandemic has hit, they’re paying off even more, and Jamie will discuss that.

It is my honor to introduce Jamie Hughes of Terracon to present this keynote.

Jamie Hughes

Thanks for that introduction, Jeff. This is Jamie Hughes with Terracon.

We’re an engineering firm that’s employee-owned. We have more than 5000 people across the United States, serving all 50 states. Our growth patterns are 50% through acquisition and 50% organic. One of the challenges that presents is that we’re always spinning up new sites, spinning down sites, and moving, so there’s a lot of change, and that leads into why we were looking to automate our network.

I’m fairly new at Terracon, so one of the things I was trying to get to know is, “How does the company work?” We operate in all 50 states, and one of the challenges that presents is that we don’t have IT staff at those sites. They’re more centrally located in the Kansas City area. We don’t have remote hands at all the sites, so we need to perform all those functions remotely.

Some of the challenges that we were presented with are:

Inventory management. How do we know how many switches, routers and firewall devices there are and where they’re located? Understanding all those pieces.
The other issue is security response. How do we get faster at those things? How do we know if we really are vulnerable? Do we really need to patch the device? Then there are inconsistent configurations. When we looked, we were doing it all manually. Where are those errors, and where are the human issues involved with that?
And then also change deployment. At this point, when we started this journey, we were applying everything manually. So basically, we were just throwing resources at it to solve the problem.

Some of the top requirements we had:

How do we get to where we can automate the operating systems? When we started out, we didn’t feel comfortable with all those remote sites. If we load the wrong firmware on a device, that’s going to cause a truck roll—site’s going to be down for hours. That’s a huge problem. So, how do we correct that?
We needed something that has multi-vendor support. Most of our equipment is Cisco but we do have some other pieces of equipment. How do we support all of those?
The compliance audit—I want to be able to take my configuration and make sure—does this match what we’re expecting, and does it check all the security boxes that we’re expecting, making sure we’re protecting the company.
Inventory management
Configuration changes on the audits. What are we changing in the environment and why are we changing these things? We wanted a tool that can help us understand what’s happening in our environment as it’s operating and that leverages our current skill set. It’s easy to say that we’re going to go do a bunch of programming and move into software networking. That’s one of the things that’s much harder to do.

At the start of this, these are a lot of the issues that Terracon was facing:

95% of our changes were performed manually.
70% of our policy violations were human error. Somebody would make mistakes, or they’d be done differently. You have two people and they make the same change and a lot of times you’ll get two different results.
The operation expense—we were spending 60% of our time on troubleshooting those human errors—things that we found in the environment.
We had a lot of different OS versions.
20% of our QoS policy was ineffective, so that means a fifth of my sites effectively didn’t have QoS applied to them.
Vendor vulnerabilities. It would sometimes take us a considerable amount of time to remediate those issues, and you’d have that window of insecurity.
Manual changes and errors causing outages. Some of those human errors, if they were bad enough, would cause a poor end-user experience. How do we resolve those things?
Identify NIST security violations and configurations. So basically, taking a framework and being able to apply that against our environment and check: are we meeting our own security standards?

Terracon’s Evaluation-Decision Process. We looked at several different things:

DIY (Do-It-Yourself) was pretty much out the window from the start.
One of the reasons was that we just don’t have the resources to apply to new skills like Python, databases, all the platforms, and the ongoing maintenance. When a new piece of hardware is delivered, we need to QA it to onboard it. We just don’t have the resources to perform that kind of task.

Then we looked at other tool sets that can help us do this automation. A lot of them were limited to one vendor.
They lacked the validation that we would need to build to make sure we did all these changes correctly.

That led us to Gluware.

One of the very nice things about it is that it has a lot of prepackaged changes in there. That really helps with scaling up and making some of those little changes, like fixing NTP here or changing syslog there. They have a lot of examples in there that make it easy to build off of. That really accelerates time-to-delivery over something built with a DIY approach: instead of sitting there trying to figure out how to script, I’m already automating the environment.
The other nice thing is basically I can take my configurations—those base configs, the ones that are security hardened—and I can split those up into modules in the tool and then deploy them out to the environment.
One of the really impressive things about the solution was that no other solution we tried really gave me that intent-based, declarative approach. What does that mean? It allows me to say, “Here’s the configuration I want on the device.” It will actually go look and compare it and say, “Well, here’s your configuration, and there are all these other lines.” I’m going to call that cruft (unnecessary configuration) that’s just left over. It will actually remove that from the device for me. So now the device has a very pristine standard, and I can guarantee that throughout the environment. To me, that was huge.
Support and service. As I’ve worked with Gluware, one of the fantastic things (and this is really a big negative against the DIY approach) came up when we had just migrated to a new switch hardware platform. The vendor changed the configuration syntax when we upgraded the operating system. We were able to open a ticket with Gluware. It took a week or so, and they turned around a fix, and we’re moving along now. My guys didn’t have to put any resources into that. It was just fantastic.
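
The declarative idea Jamie describes (state the configuration you want, and let the tool work out what to add and what leftover cruft to remove) can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not how Gluware works internally: the function name plan_changes is hypothetical, configs are treated as flat lists of lines, and removal is modeled with a Cisco-style "no" prefix, which does not hold for every command or vendor.

```python
def plan_changes(running, intended):
    """Compare a running config with the intended one.

    Returns (to_add, to_remove): lines that are missing from the device,
    and negation commands for leftover lines ("cruft") that are not part
    of the intended config. Assumes flat, line-based configs.
    """
    running_set = set(running)
    intended_set = set(intended)
    # Lines we want but the device does not have yet.
    to_add = [line for line in intended if line not in running_set]
    # Lines on the device that the intended config does not mention.
    to_remove = [f"no {line}" for line in running if line not in intended_set]
    return to_add, to_remove


# Hypothetical example data.
running = [
    "ntp server 10.0.0.1",
    "logging host 10.0.0.9",
    "snmp-server community public RO",
]
intended = [
    "ntp server 10.0.0.1",
    "logging host 10.0.0.50",
]

to_add, to_remove = plan_changes(running, intended)
print("add:", to_add)
print("remove:", to_remove)
```

The point of the sketch is the shape of the approach: the operator supplies only the end state, and both the additions and the cleanup fall out of the comparison.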

How does Gluware work?

Basically, it’s broken up into multiple parts. The nice thing about this is that you can take your own current configuration: what your standard is, or where you want to be if you haven’t met that standard yet. No coding is required, so you don’t have to learn Python or all that fancy stuff. We use the CLI that we want from other devices to allow us to do that.

The pre and post checks. The nice thing is you can actually take your configuration and ask, “What is this change going to do when I apply it?” You can actually go out and have it do a live connect to the real production device. It won’t make any changes. It just pulls down the configuration, takes the changes you’re looking to make, and then spits out what config changes are required for that device.
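
A pre-check of this kind is essentially a dry run: read the current configuration, compute what would change, and apply nothing. A minimal sketch in Python using the standard library's difflib is shown here. The function name preview_change and the sample configs are hypothetical, and the step of actually retrieving the running config from a device is left out; the sketch assumes it has already been pulled down as text.

```python
import difflib


def preview_change(running_config, candidate_config):
    """Return a unified diff of what would change, without applying anything.

    Both arguments are plain config text. An empty string means the
    candidate matches what is already on the device.
    """
    diff = difflib.unified_diff(
        running_config.splitlines(),
        candidate_config.splitlines(),
        fromfile="running",
        tofile="candidate",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical example data.
running_config = "hostname site-42\nntp server 10.0.0.1\n"
candidate_config = "hostname site-42\nntp server 10.0.0.2\n"

preview = preview_change(running_config, candidate_config)
print(preview)
```

Lines prefixed with "-" would be removed and lines prefixed with "+" would be added, which gives a reviewer the same kind of before-and-after view before anything touches production.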

What’s the result going to look like when you automate a policy at scale? That piece is fantastic because now, if we do one of those acquisitions and I’m going to add five new routers, I can just take my QoS policy and push that configuration out to those five new sites. It really sped up our deployment of moves and changes in our operation.

The other fantastic thing is that it works in brownfield. The problem with a lot of the tools I looked at was that almost all of them required you to start with a rebuild of the entire configuration, and nobody’s network looks like that.

Some of the initial use-cases and quick wins.

The device manager was a huge one. We were able to get a lot of our devices in there and get inventory on what versions of software they’re running, and that was really helpful. Do they have SmartNet on them? Do we need to renew that this year or next year? When is it expiring?

Configuration drift and audit. To me, that was a really powerful component. What it let us do was use that resource to go and find out what kind of errors we have in the environment. I noticed that, for example, our operations team was spending a lot of time troubleshooting access lists. So one of the first pieces we automated was the access list: do we have a golden standard, and how can we push it out to devices and waste less of people’s time?

The OS manager. One of the fantastic things about this is that we had a lot of different operating system versions in our environment, and this has really helped us cut that down. As we deploy those new sites, we have a new standard, and it’s really easy to do. The nice thing is that the operations teams are willing to do that because now they literally hook a device up, push the configuration out to upgrade the operating system, and away we go. It’s really simple.

The most powerful component of this system is the Config Modeling. We can take those pieces of the CLI config we have and break them down into their component parts. The great thing about this is it really makes it easy to get into the solution and start automating things and feel like you’re getting somewhere. But you’re not trying to model the whole config. Once you can break it down into its component pieces, you can really move that bar forward in your environment.
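
To make the idea of breaking CLI config into component parts concrete, here is a small Python sketch. It is an assumption-laden illustration rather than Gluware's actual modeling format: the names MODULES, ROLE_MODULES and render_config, the device roles, and the sample config lines are all hypothetical.

```python
# Hypothetical module library: each module is a small, reusable slice of CLI config.
MODULES = {
    "ntp": ["ntp server 10.0.0.1"],
    "syslog": ["logging host 10.0.0.50"],
    "acl_mgmt": [
        "ip access-list standard MGMT",
        " permit 10.0.0.0 0.0.0.255",
    ],
}

# Hypothetical mapping of device roles to the modules they should carry.
ROLE_MODULES = {
    "branch_router": ["ntp", "syslog", "acl_mgmt"],
    "branch_switch": ["ntp", "syslog"],
}


def render_config(role, modules=MODULES, role_modules=ROLE_MODULES):
    """Assemble the config lines for a role from its modules, in order."""
    lines = []
    for name in role_modules[role]:
        lines.extend(modules[name])
    return lines


router_config = render_config("branch_router")
switch_config = render_config("branch_switch")
print("\n".join(router_config))
```

Because each role is just a list of module names, a team can start with one small module (an access list, say) without modeling the whole config, and a later change to a module is picked up by every role that includes it the next time the config is rendered.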

Completed automation projects.

Over the last year, these are several of the projects we’ve worked on automating in our environment. One of the first things—we took some little piece—was fixing our access list. Some of the bigger projects we worked on were redoing our QoS. Since we were doing all this stuff manually, it hadn’t been looked at in a while. We were missing applications in there—things have changed or moved in and weren’t getting classified correctly any longer.

First, we sat down and figured out, “Where do we want to be?” And then we took it from there: “Okay, here’s where we want to be.” We built a policy in the tool, pushed that out, and then, since that was one of the first major things we had tried, we went back and audited ourselves: “What does this look like?” I was really impressed by the precision we got from that automation exercise. Before, QoS wouldn’t function correctly at 20% of our sites. After doing this, we literally audited every device, and they were all correct. It was just fantastic!

The other piece was the site-to-site VPN failover for MPLS. What was happening was that when our MPLS circuits went down, our employees lost connectivity to the data center. In this case, we were able to use automation to speed up those deployments and get that out faster. Our employees, instead of twiddling their thumbs when a circuit goes down, would fail over to the internet circuit or VPN tunnel and get back to our data center. It’s a huge win for us. It took outages where some of the sites could have been down for hours and made it so they didn’t even notice.

The other really nice piece is the vulnerability remediation. A lot of times, we get a security vulnerability, and it says you have to be on a specific operating system version and you also need to have these commands or not have these commands on your device. With Gluware, I can take that inventory management and find out how many devices are on that version of code. And then I can run an audit against something like, “Do they have this command line in there?” It makes it really easy to figure out if I have devices that are vulnerable and if I need to do something. Or maybe I don’t need to do anything at all.
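
The two-step check described here (first narrow the inventory to the affected OS versions, then look for a specific command in each config) can be sketched in Python as follows. Everything in the sketch is assumed for illustration: the Device record, the find_vulnerable function, the version strings and the triggering command are hypothetical and are not taken from any real advisory or from Gluware's API.

```python
from dataclasses import dataclass


@dataclass
class Device:
    hostname: str
    os_version: str
    config: str


def find_vulnerable(devices, affected_versions, trigger_command):
    """Return hostnames that run an affected version AND carry the command."""
    vulnerable = []
    for device in devices:
        if device.os_version not in affected_versions:
            continue  # Not on an affected release, so the advisory does not apply.
        config_lines = [line.strip() for line in device.config.splitlines()]
        if trigger_command in config_lines:
            vulnerable.append(device.hostname)
    return vulnerable


# Hypothetical inventory and advisory details.
inventory = [
    Device("site-01-rtr", "15.2(4)", "hostname site-01-rtr\nip http server\n"),
    Device("site-02-rtr", "15.2(4)", "hostname site-02-rtr\nno ip http server\n"),
    Device("site-03-rtr", "16.9(1)", "hostname site-03-rtr\nip http server\n"),
]

exposed = find_vulnerable(inventory, {"15.2(4)"}, "ip http server")
print(exposed)
```

In this made-up inventory only one of the three devices meets both conditions, which mirrors the outcome Jamie mentions: sometimes the audit shows that little or nothing needs to be done.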

And then the firewall deployment. It’s one of the last things we’ve been working on here. Some of my team members have had some time during the pandemic, and they actually went in and automated the full firewall deployment, so now we can take one of those devices and completely automate the deployment of it, which is a huge win for us.

Speaking of that, the pandemic presented several challenges.

Some of those were our inability to travel, changed network patterns now that users aren’t at the sites and are working from home, and supply chain challenges.

So, automation helped us in several ways with this.

Normally, when we were deploying a new site for another company we bought, we needed to travel. We would send our employees out to those sites to spin up the network equipment. Well, now we’re unable to travel, so how do we resolve that? Because that work just doesn’t stop.

What we started doing with Gluware is pushing the configuration out through the out-of-band modem. It allowed us to switch our strategy. Instead of shipping all the items to our corporate office, we could dropship the equipment out there, use the out-of-band modems, push those configurations out, get the device online and then upgrade its operating system right at the site. Then we can just hire local support to go out and cable it up for us.

The other thing it helped with was the change in traffic patterns. We found that some of our QoS wasn’t as effective or needed to be put in different places. For example, we had a huge expansion of people working remotely from home, so now we needed QoS on our internet edge. We were able to take those modular policies in the tool and, with just a couple of hours of modeling and changing them to fit our environment for the internet edge, push that out. Then the nice thing is, later on, we could tweak that for our data center. Now, with that being modular, I can go add an application, and I know that when I push it out to all my data center devices, they’re automatically going to have that updated config on there.

The other thing it allowed us to do was change the routing in the environment. Now that people are home, in a lot of cases the internet circuit had more upload bandwidth than the MPLS circuit. We can flip the routing around on the fly to direct traffic in the most efficient manner.

And then here are some of the benefits Terracon realized in our automation environment.

From a security aspect, we’re able to turn around device upgrades much faster. For outage reduction, those changes allow our MPLS to fail over to the internet circuit automatically. Cutting down on human error really reduced our major outages drastically. Tying into that was the efficiency piece: we spent 40% less time troubleshooting human error.
And then for business continuity, the automation really helped us in closing that gap for the pandemic. We got several kudos from our co-workers. They talk to their friends and family, and some of them are struggling with the demands of a remote workforce. They might not have been prepared. I’d say we weren’t 100% prepared, but we were able to quickly go in and make changes on the fly to get that remote workforce up and running quickly.
And then the agility piece. Now, if we want to make changes to something like our QoS policy, it takes a couple of hours, whereas before we would spend 100 hours to go and change the QoS on all the devices. It’s a huge reduction in the amount of time it takes for us to roll things out.
Insights and understanding. Using that audit has really helped us understand where to spend time: “Where’s the Operations Group spending their time? Where should we focus on automating, or what things should we make a priority? What has the biggest impact on the company?” It really helps a lot with figuring out the things we should be automating.

Some of the lessons that Terracon learned from this were:

The first point is to work with the Operations Group to figure out where the big pain points are and which pieces we should be working on automating first.
The second point is to align your initiatives with strategic projects. Take those pain points, take your upcoming projects, and decide what makes sense to automate in your environment.
And then the new benchmark for NetOps. One of the things automation does is bring us a lot closer to being a software-defined network. It’s bridged that gap a little bit for us. Without having to completely retool everything, we’re able to take our existing brownfield environment and make a lot of improvements.
The business case ROI. Looking back over the journey, it’s really easy to see where we got a lot of value out of doing this automation journey.

And with that, I’d like to turn it over to Jeff Gray again.

Jeff Gray

Jamie, thank you so much for sharing your story with the ONUG community. Because of your vision, Terracon is reaping a tremendous amount of benefits, and you were very early, and you’ve executed further than many companies out there in the industry. I want to thank you for your partnership, and I want to thank you once again for sharing the story.

And with the benefits that Jamie and other customers have realized, we have decided to invest in others during this pandemic.

I have two announcements to make here at ONUG Digital Live. One is that Gluware has now partnered with Microsoft and is now available in the Microsoft Azure Marketplace. The other is that we are now delivering a 30-Day Free Pilot-to-Production Trial Offer.

This is much more than just testing software. This is approximately a $25,000 value, and it includes software, support, design, and training for qualified customers. We want to be able to share the benefits of automation in the same way that Terracon tested and rolled to production. Now, you can do this much faster because, within Microsoft Azure, you can download Gluware directly into your Azure tenant, spin up in minutes, and get automated. We want to invest to support that because it delivers a lot of benefits for customers, especially in a time like this. Please use the lower right-hand corner to apply for our business continuity offer. Gluware will work with you, support you and partner with you.

And with that, thank you for your time and stay safe.
