Observability, User Experience, and Work from Home: A Case Study

by Babak Roushanaee

September 28, 2020

Observability and Site Reliability Engineering (SRE) were both born out of best operations practices at hyperscale companies in the past decade and quickly spread through the DevOps and Cloud-Native communities. Today, they are also being adopted by some large enterprises moving to the cloud, a move that the COVID-19 pandemic has accelerated along with expanded Work from Home (WFH).

Observability positions itself as a superset of past monitoring practices. Where legacy monitoring tracks the health of a system, observability is about the care architects and developers give to building a system that is resilient and easy to operate by minimizing the number of unknown unknowns. DevOps, microservices, and refactoring applications for the cloud have provided ample opportunity for this.

As a result, much of the observability literature revolves around distributed tracing and logs. WFH, on the other hand, shines a light on the importance of network or wire data and resource metrics that inform user experience and pinpoint bottlenecks. In the rest of this blog we examine remote work technologies to show how observability architectures are always a contextual balance between (network) metrics, traces, and logs.

Virtual Private Network (VPN) – Figure 1 shows a typical enterprise network in which the headquarters, several smaller data centers, and remote locations are connected via an SD-WAN or an MPLS mesh network. The network at HQ provides redundant connectivity to the internet, co-lo, and cloud service providers. The HQ Network as well as remote branches house primary and backup VPN concentrators, which are located between edge and aggregation networks. VPN concentrators can optionally be placed in the DMZ or co-lo. WFH users connect to the closest VPN concentrator using GeoDNS to consume UC&C, VDI, and business services. A split-DNS policy allows users to access cloud without routing through the VPN.

Figure 1: Typical network connectivity for remote branch and work-from-home employees.

IT may also layer in a combination of secure access service edge (SASE), zero-trust network access (ZTNA), multi-factor authentication (MFA), cloud access security broker (CASB), and data-loss prevention (DLP) in conjunction with the VPN. All of this critical infrastructure is built on commercial solutions, which eliminates the opportunity to monitor it with code traces. These components do generate logs. However, building an observability solution from asynchronous log events here requires up-front expense and expertise as well as ongoing Opex. Turning log levels too high will also adversely impact the performance of each component. In contrast, network metadata and device resource utilization metrics shine here: they report user experience efficiently while tracking bottlenecks along the path.
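As a rough illustration of how per-segment metrics can both summarize user experience and point at a bottleneck, consider the minimal Python sketch below. The segment names, the latency and utilization figures, and the 0.8 saturation threshold are all hypothetical values chosen for the example; they are not taken from any product or from a real deployment.

```python
from dataclasses import dataclass


@dataclass
class SegmentMetric:
    segment: str        # hypothetical name of a hop on the WFH path
    latency_ms: float   # latency contributed by this segment
    utilization: float  # resource utilization, from 0.0 to 1.0


def end_to_end_latency(metrics):
    """Sum the per-segment latencies as a simple user-experience proxy."""
    return sum(m.latency_ms for m in metrics)


def find_bottleneck(metrics):
    """Return the segment that contributes the most latency."""
    return max(metrics, key=lambda m: m.latency_ms)


def saturated(metrics, threshold=0.8):
    """List segments whose utilization is at or above the assumed threshold."""
    return [m.segment for m in metrics if m.utilization >= threshold]


# Illustrative measurements only.
path = [
    SegmentMetric("home_isp", 18.0, 0.35),
    SegmentMetric("vpn_concentrator", 95.0, 0.92),
    SegmentMetric("aggregation_network", 4.0, 0.40),
    SegmentMetric("business_service", 22.0, 0.55),
]

print(end_to_end_latency(path))
print(find_bottleneck(path).segment)
print(saturated(path))
```

In this made-up data set the VPN concentrator is both the largest latency contributor and the only saturated resource, which is the kind of correlation between wire data and resource metrics that the paragraph above describes.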

Virtual Desktop Infrastructure (VDI) – VDI is a kind of middleware that allows applications and desktops running in a data center or cloud to be mirrored on a remote user’s computer by sending keyboard input, mouse movements, and screen updates back and forth. VDI solutions are complex service chains that are notoriously sensitive to capacity constraints and network response time. A user’s attempt to open a remote Citrix desktop connection, for example, sets in motion a series of interactions between Access Gateway, LDAP, StoreFront, Desktop Delivery Controller, Domain Controller, License Manager, configuration database, Virtual Desktop server, and the storage that holds the user’s profile, all to grant the user a VDI session customized to his or her profile.

Figure 2: Typical Citrix VDI infrastructure providing desktops to remote users.

Like the VPN alternative, VDI solutions are composed entirely of self-contained commercial products, which again eliminates the opportunity for code tracing or APM solutions. And again, while logging may be possible, the pesky performance problems that plague VDI deployments are hard to monitor and troubleshoot with logs. In contrast, network metadata and resource utilization metrics again provide the necessary insight.
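One way to picture this for a VDI login is to compare the response time observed on the wire for each tier of the service chain against a baseline. The Python sketch below assumes such per-tier timings are already available; the tier list reuses names from the Citrix example above, while the millisecond values and the 2x tolerance are purely illustrative assumptions.

```python
def slow_tiers(observed_ms, baseline_ms, tolerance=2.0):
    """Return tiers whose observed response time exceeds tolerance x baseline.

    Tiers without a baseline are skipped rather than guessed at.
    """
    return [
        tier
        for tier, ms in observed_ms.items()
        if tier in baseline_ms and ms > tolerance * baseline_ms[tier]
    ]


# Hypothetical baseline response times per tier, in milliseconds.
baseline = {
    "Access Gateway": 20.0,
    "LDAP": 15.0,
    "StoreFront": 40.0,
    "Desktop Delivery Controller": 60.0,
    "License Manager": 10.0,
}

# Hypothetical timings observed during one slow login.
observed = {
    "Access Gateway": 22.0,
    "LDAP": 180.0,
    "StoreFront": 45.0,
    "Desktop Delivery Controller": 70.0,
    "License Manager": 12.0,
}

print(slow_tiers(observed, baseline))
```

With these example numbers only the LDAP tier stands out, so the slow login can be attributed to one link in the chain without instrumenting any of the commercial components.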

Conclusion

Large enterprises moving to cloud have thousands of applications that utilize various shared infrastructure. The SRE book states:

Your monitoring system should address two questions: what’s broken, and why? …“What” versus “why” is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.

“Writing good monitoring” is certainly a must when writing code. But large enterprises are not in the business of writing replacements for commercial monitoring solutions. Perhaps better phrasing would be “identifying key metrics and architecting a monitoring solution”. For observability to continue its frictionless adoption by large enterprises, it is incumbent upon us, the observability practitioners, to balance the use of the tools in our arsenal. Years of advancement in commercial wire data, synthetic transaction technology, and resource monitoring solutions have been about “maximum signal and minimum noise”, which makes them core building blocks of any observability architecture.

Author's Bio

Babak Roushanaee

Director, Enterprise Strategy at NETSCOUT
Babak Roushanaee is Director, Enterprise Strategy at NETSCOUT and a contributing member of ONUG Observability and AIOps working groups. He has more than 25 years of industry experience in architecture, delivery, and operations oversight of CIO dashboards, line of business end-to-end monitoring solutions, performance engineering, and capacity planning at large financials.
