ONUG Working Groups: The AIOPS and Observability Working Group

by Guest Author

November 2, 2020

ONUG’s Working Groups are having a great impact on the tech industry in numerous ways. The AIOPS and Observability Working Group is composed of core members Tim Van Herck of VMware, Russ Currie of Netscout, and Ted Turner of Kentik. ONUG recently produced a webinar to review the progress each of the working groups has made. The following summarizes some of the main activities of this working group.

Main Objectives

Van Herck reviewed the group’s main objective, which is to leverage ML/AI to help businesses answer questions that require data that is distributed across different cloud and equipment vendors. “Our goal is to correlate all this data together, so you have a single insight into what’s going on,” Van Herck explained. “The biggest difference is that we want to leave the data in place, avoiding the use of large data lakes and warehouses, which are expensive to maintain.”

These changes will ultimately lead to shortened troubleshooting time. If someone calls in with a support question, support must rapidly get to the components within that user’s workflow, redirect the attention of the engineer, and avoid a recurrence of the problem moving forward. Rather than being a one-off troubleshooting effort, each problem becomes a learning experience that accelerates future troubleshooting or avoids the same problem altogether.

Use Case

Van Herck used a common example to illustrate. “What’s wrong with my Zoom call?” is a question tech support frequently gets. Low network visibility makes it a challenge to identify the problem. He explained that if you are at a branch location, you have more control. You can answer questions such as “Is there an overlay network issue with the VPN, SD-WAN or cloud security tunnel?” or “Is the router running out of resources?” However, as you get closer to the user, visibility becomes even harder to achieve. It’s hard to know what the Wi-Fi signal looks like, and it’s difficult to identify endpoint device issues with CPU, storage, or memory.

“As a support engineer, you need a lot more visibility. When a call comes in, you need to get your bearings.” Support must be able to identify the issue within the network quickly. The questions support needs to answer span a variety of domains and data formats, which makes visibility even more complex.

Data Framework

“Extract meaningful correlations to accelerate troubleshooting.” That’s how Van Herck explained the purpose of the framework developed by the working group. The framework segments the workflow so that each part can be troubleshot easily. The three main layers are:

Data Source layer, including wireless and wired access vendors, SD-WAN vendors, internet providers, and application vendors.
Cloud Storage layer, which examines how to store the data so that it can easily be passed to the next layer.
Data Abstraction layer, which receives data through AI and ML installments.

This framework makes it possible to quickly answer questions such as “Why is my Zoom not working?” Support can trace the problem all the way from the user’s laptop to an access point, through the routing and switching environment, and on to the internet connection or the application itself.

The AIOPS group is specifically focused on the more complex failure scenarios, including suboptimal signal strength, interface errors on a switch, or minor packet loss on the network. In isolation, none of these would cause a low-quality Zoom call, but combined they can significantly impair application delivery. That’s what the group is really focused on: correlating all these parts of the network to provide a clear view, enabling support to diagnose and fix problems quickly.
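To make the idea of compounding minor impairments concrete, here is a minimal Python sketch. The metric names, thresholds, and the "two or more domains" rule are assumptions chosen for illustration. The working group did not publish a scoring method in this summary.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One observation from one domain (hypothetical schema)."""
    source: str                    # e.g. "wifi", "switch", "wan"
    metric: str
    value: float
    warn_at: float                 # assumed threshold, not a vendor default
    higher_is_worse: bool = True

    def degraded(self) -> bool:
        if self.higher_is_worse:
            return self.value >= self.warn_at
        return self.value <= self.warn_at


def assess(signals: list[Signal], min_domains: int = 2) -> str:
    """Flag a path when minor issues show up in several domains at once."""
    domains = {s.source for s in signals if s.degraded()}
    if len(domains) >= min_domains:
        return "correlated-impairment"
    if domains:
        return "isolated-minor"
    return "healthy"


# Three individually minor issues along one user's path.
signals = [
    Signal("wifi", "rssi_dbm", -72.0, warn_at=-70.0, higher_is_worse=False),
    Signal("switch", "crc_errors_per_min", 4.0, warn_at=1.0),
    Signal("wan", "packet_loss_pct", 0.5, warn_at=0.3),
]
print(assess(signals))
```

A real correlation engine would weigh each signal rather than count domains, but the sketch shows why each data source has to be visible at the same time.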

Proof of Concept

The group did an early proof of concept of their framework in 2019. Their test case involved a Mist access point that was connected to an SD-WAN edge, causing an issue with the Zoom application for an end-user. Both resource orchestrators were streaming data into an AWS S3 bucket. Next, the data went to the visualization layer and was read by AWS Redshift. The PoC followed these steps:

Bucket Access Policy: This detailed how to stream the data into the S3 buckets. Van Herck showed a slide that featured sample code, outlining how to give the AIOPS vendor access to data via Mist.
Redshift/Spectrum External Tables: This file describes the data schema. Van Herck demonstrated how to create a table that would be part of a data catalog provided by the cloud company. Ideally, this catalog should be an industry standard.
Correlation through Data Visualization Model: Van Herck’s slide illustrated how support can follow the workflow all the way through the infrastructure while maintaining all relevant information.
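The slides themselves are not reproduced here, so the following Python sketch only approximates the first two steps. It builds a read-only S3 bucket policy for another AWS account and a Redshift Spectrum external table definition over Parquet files. The bucket name, account ID, schema name, and column list are all placeholders, not values from the PoC.

```python
import json

# Placeholder identifiers for illustration only.
BUCKET = "example-telemetry-bucket"
VENDOR_ACCOUNT_ID = "111122223333"


def read_only_bucket_policy(bucket: str, account_id: str) -> dict:
    """Build an S3 bucket policy that lets another account list and read."""
    principal = {"AWS": f"arn:aws:iam::{account_id}:root"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowVendorList",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",          # applies to the bucket
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "AllowVendorRead",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",           # applies to objects
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


def external_table_ddl(schema: str, table: str, columns: dict, location: str) -> str:
    """Render a CREATE EXTERNAL TABLE statement for Parquet data in S3."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )


policy = read_only_bucket_policy(BUCKET, VENDOR_ACCOUNT_ID)
ddl = external_table_ddl(
    "telemetry",                                   # hypothetical external schema
    "wifi_client_stats",                           # hypothetical table
    {
        "event_time": "TIMESTAMP",
        "client_mac": "VARCHAR(17)",
        "ap_name": "VARCHAR(64)",
        "rssi_dbm": "INTEGER",
    },
    f"s3://{BUCKET}/wifi/",
)
print(json.dumps(policy, indent=2))
print(ddl)
```

The external schema has to exist in Redshift before the table statement will run. Because the table only points at files in S3, this approach matches the stated goal of leaving the data in place.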

Focus Going Forward

It’s critical that the group has a good understanding of where all data is coming from. Especially as they focus on correlation, they must be confident in the reliability of the data. Van Herck detailed four main areas to understand for each data source.

Data Inventory, including volume, frequency, fidelity, and scope.
Access Protocol, including REST API, Object Store, and Kafka Pipeline.
Data Format, including self-describing, CSV, and Parquet.
Time Sync, including NTP and synthetic events.
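One way to capture these four areas is a small inventory record per data source. The Python sketch below is an assumption about how such a record could look. The field names, enum values, and sanity checks are illustrative and were not defined by the working group.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class AccessProtocol(Enum):
    REST_API = "rest_api"
    OBJECT_STORE = "object_store"
    KAFKA_PIPELINE = "kafka_pipeline"


class DataFormat(Enum):
    SELF_DESCRIBING = "self_describing"
    CSV = "csv"
    PARQUET = "parquet"


class TimeSync(Enum):
    NTP = "ntp"
    SYNTHETIC_EVENTS = "synthetic_events"


@dataclass(frozen=True)
class DataSourceDescriptor:
    """Hypothetical inventory record for one data source."""
    name: str
    scope: str                 # what the source covers, e.g. "branch Wi-Fi"
    records_per_day: int       # volume
    interval_seconds: int      # frequency of new samples
    fidelity: str              # e.g. "raw" or "aggregated"
    protocol: AccessProtocol
    data_format: DataFormat
    time_sync: TimeSync

    def problems(self) -> list[str]:
        """Return basic sanity issues; an empty list means none were found."""
        issues = []
        if self.records_per_day <= 0:
            issues.append("volume must be positive")
        if self.interval_seconds <= 0:
            issues.append("frequency interval must be positive")
        if not self.scope:
            issues.append("scope is empty")
        return issues


wifi = DataSourceDescriptor(
    name="wifi-telemetry",
    scope="branch Wi-Fi clients",
    records_per_day=500_000,
    interval_seconds=60,
    fidelity="raw",
    protocol=AccessProtocol.OBJECT_STORE,
    data_format=DataFormat.PARQUET,
    time_sync=TimeSync.NTP,
)
print(wifi.problems())
```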

The group will also focus on data security and privacy, including the following areas.

Data ownership: Permissions, GDPR, Data erasure
Data storage: Encryption, Retention, Cost
Sensitive data: PII, Anonymize, Pseudonymize, Retain relation, At the source, At ingestion
Data access: RBAC, Obfuscate, Compartmentalize, Access logs
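The "Pseudonymize" and "Retain relation" items can be illustrated with keyed hashing: the same identifier always maps to the same token, so records from different sources can still be joined without exposing the original value. This Python sketch uses HMAC-SHA256 from the standard library. The key handling and record layout are assumptions, and a production design would need proper key management.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Replace an identifier with a stable, keyed token (HMAC-SHA256)."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


# Assumption: the key is kept by the data owner and never shared downstream.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical records from two different sources about the same user.
wifi_record = {"user": "Alice@example.com", "rssi_dbm": -72}
app_record = {"user": "alice@example.com", "jitter_ms": 35}

for record in (wifi_record, app_record):
    record["user"] = pseudonymize(record["user"], SECRET_KEY)

# The PII is gone, but the relation between the two records is retained.
print(wifi_record["user"] == app_record["user"])
```

The same function can run at the source or at ingestion, which are the two placement options listed above.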

The next area of focus is correlation and baselining. “We need an entity relation that includes all the various data sources we ingest,” explained Van Herck. “We think this is a job best done by the enterprise administrator.” This would include some impedance matching at times as well. Lastly, the group will focus on root cause analysis. That includes data expansion, root cause, predictive analytics, and self-healing.
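An entity relation of this kind can be pictured as a lookup table that maps each source's own identifier onto a shared entity. The Python sketch below assumes a hand-maintained mapping and made-up record fields. It is an illustration of the idea, not the group's data model.

```python
from collections import defaultdict

# Hypothetical mapping maintained by the enterprise administrator:
# (source, source-specific identifier) -> shared entity.
ENTITY_MAP = {
    ("wifi", "aa:bb:cc:dd:ee:ff"): "user:alice",
    ("sdwan", "10.1.20.15"): "user:alice",
    ("app", "alice@example.com"): "user:alice",
    ("wifi", "11:22:33:44:55:66"): "user:bob",
}


def correlate(records, entity_map):
    """Group records from different sources under their shared entity."""
    grouped = defaultdict(list)
    for record in records:
        entity = entity_map.get((record["source"], record["id"]))
        if entity is not None:      # unmapped identifiers are skipped
            grouped[entity].append(record)
    return dict(grouped)


records = [
    {"source": "wifi", "id": "aa:bb:cc:dd:ee:ff", "rssi_dbm": -72},
    {"source": "sdwan", "id": "10.1.20.15", "packet_loss_pct": 0.5},
    {"source": "app", "id": "alice@example.com", "jitter_ms": 35},
    {"source": "wifi", "id": "11:22:33:44:55:66", "rssi_dbm": -55},
]

by_entity = correlate(records, ENTITY_MAP)
print(len(by_entity["user:alice"]))
```

Each source identifies the same user differently (MAC address, IP address, email), which is one plausible reading of the "impedance matching" mentioned above.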

Van Herck summarized the next steps in these five bullets:

Drive to a production-grade PoC.
Add new vendors to the collective of data sources and AI/ML engines.
Formalize data formats and transports.
Engage an ML engine to produce baselines.
Engage an AI engine for advanced correlation.
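As a rough picture of what "producing baselines" means, the Python sketch below uses a plain mean and standard deviation rather than an ML engine. The sample values, the metric, and the z-score threshold are assumptions for illustration only.

```python
from statistics import mean, stdev


def baseline(samples):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, samples, z_threshold=3.0):
    """Flag a value that deviates strongly from the learned baseline."""
    mu, sigma = baseline(samples)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical round-trip latency history in milliseconds.
latency_history = [20, 22, 19, 21, 20, 23, 21, 20]

print(is_anomalous(45, latency_history))
print(is_anomalous(22, latency_history))
```

An ML engine would presumably account for time of day, seasonality, and relationships between metrics, which a single static threshold cannot do.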

“Calling all vendors, IT executives and leadership!” was Van Herck’s concluding comment.

Get Involved

ONUG Working Groups are catalysts for change. Join the movement. Learn more about our different groups here. Contact us to learn more about the ONUG Community.
