Out-of-band monitoring techniques are necessary for AI clusters to provide trustworthy inferences. Out-of-band solutions provide latency analysis, reduce points of failure, and place no additional burden on the network. [1] For AI clusters, the result of high latency is erroneous inferences.
AI clusters are nodal networks of GPUs that store and process inferences from machine learning models. [2] High latency during AI cluster inference is enough to produce incorrect inferences. When the network is slow and inconsistent, the nodes of an AI cluster can return incomplete calculations for a request. The inference stage needs “milliseconds or faster processing speeds”. [3] The demands of AI algorithms are “very compute and memory intensive, executed at remote servers in large-scale datacenters,” with multiple cascading algorithms. [4] Latency therefore determines the truthfulness of the information. In-band monitoring techniques are not suitable for AI clusters, where any added latency can result in erroneous calculations.
In-band monitoring techniques produce multiple points of failure. Agent-based EDR tools are integrated inside the endpoint. This deep integration with the endpoint and its core functionality puts the endpoint at risk of failure when a kernel driver error occurs. An agent-based EDR tool is therefore an additional point of failure within the endpoint. The risk of failure exists not only for the individual endpoint but also for the entire organization that deploys these devices. [5]
Out-of-band solutions provide remote monitoring that is separated from the endpoint nodes of the network. The monitoring and recording system operates outside the network. Traffic is mirrored to the monitoring device instead of being captured within the network. In this way, an out-of-band monitoring technique removes critical points of failure. It does not require endpoint agents and therefore poses no risk to the endpoints and their core functionality. The result is a recording and monitoring system that does not interfere with the network. An out-of-band solution is necessary to retrieve key metrics of AI clusters’ latency.
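As a rough illustration of how latency can be measured from mirrored traffic alone, the following Python sketch pairs each request with its response using capture timestamps taken on the monitoring device. The `MirroredPacket` record, its fields, and the `passive_latencies` function are hypothetical names chosen for this example. The sketch also assumes that every request and its response share an identifier visible in the mirrored copy, which may not hold for every protocol.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class MirroredPacket:
    """One record seen by the monitoring device (hypothetical schema)."""
    timestamp_ms: float  # capture time on the monitoring device
    request_id: str      # identifier shared by a request and its response (assumed)
    is_response: bool    # False for a request, True for its response


def passive_latencies(packets: Iterable[MirroredPacket]) -> Dict[str, float]:
    """Return the request-to-response latency in ms for each completed request.

    Only mirrored copies are read; nothing is installed on or sent to
    the endpoints being measured.
    """
    pending: Dict[str, float] = {}    # request_id -> time the request was seen
    latencies: Dict[str, float] = {}  # request_id -> measured latency
    for pkt in sorted(packets, key=lambda p: p.timestamp_ms):
        if not pkt.is_response:
            # Keep the first sighting so retransmissions do not reset the clock.
            pending.setdefault(pkt.request_id, pkt.timestamp_ms)
        elif pkt.request_id in pending:
            latencies[pkt.request_id] = pkt.timestamp_ms - pending.pop(pkt.request_id)
        # A response with no matching request is ignored.
    return latencies
```

Requests that never receive a response are left out of the result; counting them separately gives a simple availability signal.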
Out-of-band solutions are ideal for AI clusters, which require low latency to sustain their throughput. AI clusters are complex and perform many crucial operations each day. [6] In addition to solutions that accelerate AI operations, AI clusters need monitoring and recording systems that are passive, place no additional burden on the endpoints, reduce risk, and do not increase latency. Such a solution is necessary because of the correlation between AI clusters’ latency and the trustworthiness of the inference for the end-user.
An out-of-band monitoring and recording system determines the cause of increases in latency, which can include congestion, suboptimal operations, or an unavailable AI cluster. The system can focus on the KPIs that may explain inconsistent latency. The solution is ideal for AI clusters because it provides security without interference, reduces maintenance, and decreases points of failure. All of these result in optimized latency and throughput.
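The latency KPIs mentioned above could be summarized along the following lines. This is a minimal Python sketch, not a description of any particular product. The function names, the choice of KPIs (median, 99th percentile, and the ratio of unanswered requests), and the thresholds are all assumptions made for illustration. Attributing a flag to a specific cause such as congestion would need further evidence from the recorded traffic.

```python
import math
from statistics import median
from typing import Dict, List, Optional, Sequence, Tuple


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample set."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_kpis(
    latencies_ms: Sequence[float],
    unanswered: int,
    budget_ms: float,                  # assumed latency budget for inference
    max_unanswered_ratio: float = 0.1, # assumed tolerance for missing responses
) -> Tuple[Dict[str, Optional[float]], List[str]]:
    """Summarize passively measured latencies and flag budget violations."""
    total = len(latencies_ms) + unanswered
    if total == 0:
        raise ValueError("no requests observed")
    kpis: Dict[str, Optional[float]] = {
        "median_ms": median(latencies_ms) if latencies_ms else None,
        "p99_ms": percentile(latencies_ms, 99) if latencies_ms else None,
        "unanswered_ratio": unanswered / total,
    }
    flags: List[str] = []
    if kpis["unanswered_ratio"] > max_unanswered_ratio:
        flags.append("possible unavailability")
    if kpis["p99_ms"] is not None and kpis["p99_ms"] > budget_ms:
        flags.append("tail latency over budget")
    return kpis, flags
```

Reporting the tail (99th percentile) alongside the median matters here because a few slow requests can exceed the budget while the median still looks healthy.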
Network monitoring and recording systems that provide cybersecurity and measure AI clusters’ latency, workloads, and traffic are necessary for AI clusters to produce correct inferences. These tools are necessary because latency determines how much trust the end-user places in the inference. Incorrect inferences can lead to mistrust among end-users, who rely on the truthfulness of the calculations. Other components that determine end-user trust are security and privacy. End-user trust is low overall: Pew Research reports that more than half of Americans are more concerned than excited about AI in daily life. [7] For the end-user, a breach of privacy and security can result in a human rights violation. [8] Network monitoring tools also provide the measurements needed to assess the security and privacy of AI clusters.
Technology leaders need to understand that out-of-band monitoring techniques are necessary for determining the key metrics of AI clusters and their performance. Out-of-band solutions are ideal for AI clusters, which place some of the highest demands on optimized latency and network speed. While AI infrastructure is still in its early development stage, now is the right time to adopt out-of-band network monitoring solutions and avoid the added risks produced by in-band monitoring solutions.