This is the second post in a series about the challenges of network monitoring in a telecommuting world. Part I focused on the 'Challenges of Network Monitoring When Everyone Is At Home' and Part III looks at 'Streaming Cloud Workflows & Monitoring'.
In the first post of this series, we talked about how the global pandemic has been a forcing function for businesses to support a growing telecommuting workforce. Working remotely has been gaining traction for years, but many businesses were still just easing into it before the pandemic. Shelter-in-place orders required businesses to deal with it immediately, en masse, and address critical functions, such as network monitoring for video streaming, which were usually accomplished in a physical room: the Network Operations Center (NOC).
Forcing businesses to address questions like “how do I monitor my streaming workflow when my engineers don’t have access to the NOC?” actually exposes another critical issue with the current situation: reactivity.
Remote network monitoring is a serious business consideration. Much like disaster recovery and business continuity planning, enabling hardware and software behind the corporate firewall to be monitored remotely requires a lot of thought. It’s not something you can just decide on overnight. Unfortunately, that’s exactly what’s happened: many businesses needing to remotely monitor streaming services simply did what they had to. But, now that we all have a little breathing room, it’s time to put some serious thought into developing a proper remote-monitoring strategy.
The first step is to listen to the people who monitor network equipment in the NOC. What do they need to monitor remotely? What makes sense? What kind of data do they need to see and how frequently? Answering these and other questions will help direct decisions into how to enable remote monitoring.
One of the most critical questions to answer is, “what needs to be monitored?” Although it’s easy to just enable everything, that may overwhelm the people who need to make decisions based on the data. All the widgets and graphs in the world won’t help you make sense of things. Keep your network engineers and operations folks focused on what needs to be monitored, not what can be monitored.
The second step is to determine who needs to see what data. This is heavily influenced by the first step, but here the goal is to organize the monitoring requirements communicated by network engineers into policies. It doesn’t make sense to expose monitoring data from every network component to every network engineer; that can quickly become a security issue. You need to clearly define roles, assign those roles to engineers, and group network equipment or services underneath each role.
Once you understand the critical elements to monitor and who should have remote access to that data, you need to establish how to act on it. Set thresholds. Determine measurements. Create boundaries around your data which relate to your business goals.
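To make the role idea concrete, here is a minimal sketch of how roles, engineers, and equipment groups might be expressed. The role names, engineer names, and component names are all hypothetical, and a real deployment would most likely rely on the access controls built into its monitoring tool or identity provider rather than hand-rolled code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    components: frozenset[str]  # equipment or services this role may view


# Hypothetical roles and component names, for illustration only.
ROLES = {
    "encoding-ops": Role("encoding-ops", frozenset({"encoder-01", "encoder-02", "packager-01"})),
    "delivery-ops": Role("delivery-ops", frozenset({"origin-01", "cdn-edge"})),
}

# Hypothetical assignments: each engineer holds one or more roles.
ASSIGNMENTS = {
    "alice": {"encoding-ops"},
    "bob": {"encoding-ops", "delivery-ops"},
}


def can_view(engineer: str, component: str) -> bool:
    """Return True if any of the engineer's roles covers the component."""
    return any(
        component in ROLES[role_name].components
        for role_name in ASSIGNMENTS.get(engineer, set())
    )


def visible_components(engineer: str) -> set[str]:
    """Return every component the engineer may monitor across all roles."""
    visible: set[str] = set()
    for role_name in ASSIGNMENTS.get(engineer, set()):
        visible |= ROLES[role_name].components
    return visible
```

An engineer with no assignment sees nothing by default, which keeps the policy closed unless a role explicitly opens it.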
Above all else, keep-it-simple-stupid (KISS). Think of a traffic light. Monitoring data from network systems should be as simple as green equals everything is working fine (within the established KPI) and red equals an error (deviation from KPI).
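The threshold and traffic-light ideas can be sketched in a few lines. The metric names and KPI limits below are invented for illustration; the real values would come from your own business goals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    minimum: float
    maximum: float

    def status(self, value: float) -> str:
        """Green when the value sits inside the KPI range, red otherwise."""
        return "green" if self.minimum <= value <= self.maximum else "red"


# Hypothetical KPIs; the limits are assumptions, not recommendations.
THRESHOLDS = {
    "rebuffer_ratio_pct": Threshold("rebuffer_ratio_pct", 0.0, 1.0),
    "startup_time_s": Threshold("startup_time_s", 0.0, 3.0),
}


def evaluate(readings: dict[str, float]) -> dict[str, str]:
    """Map each known metric reading to a traffic-light colour."""
    return {
        name: THRESHOLDS[name].status(value)
        for name, value in readings.items()
        if name in THRESHOLDS
    }


def overall(statuses: dict[str, str]) -> str:
    """The whole view is red as soon as any single metric is red."""
    return "red" if "red" in statuses.values() else "green"
```

For example, `evaluate({"rebuffer_ratio_pct": 0.4, "startup_time_s": 4.2})` marks the rebuffer ratio green and the startup time red, so the overall light is red.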
As you are building out your strategy and setting access policies, you’ll also need to think about how data can be collected from behind the firewall and stored outside it. Keeping your monitoring secure, when it’s not hidden within a NOC deep in the building, is critical and involves a host of considerations including user authentication (even if employing a VPN), data encryption (especially if using the cloud), and more. Collecting data from remote systems, often not exposed to the internet, may require APIs and other middleware that are, again, security concerns. The simple fact is that remotely monitoring network resources that would not normally be available outside the corporate firewall, such as those in a streaming video workflow, requires an approach that balances the needs of network engineers used to the NOC against corporate IT policies and overall security.
Once you have established your monitoring strategy, there are some logistical concerns such as, “where are you going to store the data?” Although many network services, including hardware, are much smarter now, capable of throwing off lots of operational bits and bytes, you need a way to collect it in an organized manner.
There are numerous log-analysis tools, such as Splunk and New Relic, that can collect data from various network components. Interfacing through APIs, these services can combine hardware data, software data, and even cloud-based services data into a single data pool. That’s a great way to bring everything together. Of course, there’s not much actionable insight from such a large data pool, but at least it’s in one place. And, given that many of those tools, like Splunk, can be accessed remotely, it provides a tangible way for operations engineers to access the data they need.
Part of that monitoring strategy must take into account several critical steps beyond just collecting and storing the data.
With collected data stored in a relational database, it’s critical to normalize it. In many cases, the data may not be particularly useful in raw format, especially when it needs to be correlated. Or, it’s possible that different systems performing similar functions may be reporting data in different ways. Regardless, you must implement rules and policies for how data is to be transformed after storage. This transformation will ultimately make it more useful when visualized.
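As a rough illustration of normalization, suppose two encoders report the same measurement in different shapes: one in kilobits per second with a Unix timestamp, the other in bits per second with an ISO 8601 timestamp. Both payload formats and all field names here are assumptions made up for the example.

```python
from datetime import datetime, timezone


def normalize_encoder_a(raw: dict) -> dict:
    """Hypothetical source A: bitrate in kbps, time as Unix epoch seconds."""
    return {
        "component": raw["host"],
        "metric": "bitrate_bps",
        "value": raw["bitrate_kbps"] * 1000,
        "timestamp": datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
    }


def normalize_encoder_b(raw: dict) -> dict:
    """Hypothetical source B: bitrate in bps, time as ISO 8601 with a UTC offset."""
    return {
        "component": raw["device"],
        "metric": "bitrate_bps",
        "value": raw["bps"],
        "timestamp": datetime.fromisoformat(raw["time"]).astimezone(timezone.utc),
    }


a = normalize_encoder_a({"host": "enc-a", "bitrate_kbps": 4500, "ts": 1700000000})
b = normalize_encoder_b({"device": "enc-b", "bps": 4500000, "time": "2023-11-14T22:13:20+00:00"})
```

After this step both records share one schema, one unit, and one time zone, so they can be compared or correlated directly.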
One of the key steps in making network monitoring data actionable and meaningful is to correlate it. How is the data collected from one system impacted by other systems upstream? Correlating data, to identify the relationships between different network components, enables decision making. Identifying the patterns and seeing the connections is a massive step in ensuring network operations personnel can actually do their job.
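One simple way to test whether an upstream metric moves together with a downstream one is a Pearson correlation over aligned samples. This is only a sketch of one possible technique, not a description of how any particular monitoring product correlates data, and the sample values are invented.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented samples taken at the same six points in time.
origin_latency_ms = [40, 42, 41, 95, 120, 43]
player_rebuffer_pct = [0.2, 0.3, 0.2, 1.9, 2.6, 0.3]

r = pearson(origin_latency_ms, player_rebuffer_pct)
```

A coefficient close to 1 suggests the two metrics rise and fall together, which is a hint worth investigating rather than proof that one causes the other.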
But data correlation doesn’t mean much if it’s still just a bunch of numbers. Visualization makes sense of those correlations, especially when that visualization is built on a foundation of KISS. It should tell a clear story so when someone looks at it, they can say, “oh, yeah, I see where there’s a problem.” If you force network operations to spend too long working out from the visualization where the problem is and what’s wrong, that is valuable time that could have been spent solving the problem.
So now you have data being collected, access policies enabled, correlation and visualization producing a simple view...but it doesn’t do you any good if all of that isn’t available remotely. Whatever tool you decide upon for visualization, it needs to be accessible from the outside. That means that it needs access to the datastore into which all the monitoring data is being collected. And in addition to being accessible, that monitoring solution should be responsive. Yes, the current situation means most people are working from home. But you aren’t creating and enacting your remote monitoring strategy for today, you are doing it for tomorrow. And tomorrow, people may be out and about. You need to empower your remote network engineers and operations personnel with tools that work on the devices they have at their fingertips...like their mobile phones.
Committing your strategy to paper, or the whiteboard, is one thing. But just having a strategy won’t do you any good unless you can implement it effectively and efficiently. We’ve pulled together six phases of implementation you can follow for a clear path from strategy to action.
Many of the problems which crop up in implementing a strategy result from a simple adage: not seeing the forest for the trees. It’s easy to get caught up in the details, such as which element is being monitored and who has access, but focusing solely on each tree may be too myopic. You need to see the entire monitoring approach/strategy and its implementation from a high-level view.
Once you have that 30,000-foot perspective, you need to understand how everything fits together. This phase is directly related to correlating data. You need to see how data flows through your entire workflow, through every piece you have selected for monitoring. This way, you’ll have a clear picture of the flow of potential errors when correlating and visualizing the data.
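Mapping how data flows through the workflow can be sketched as a small dependency graph. The component names and their ordering below are hypothetical; the point is only that once the upstream relationships are written down, the most upstream failing component can be picked out automatically.

```python
# Hypothetical workflow: each component maps to the components that feed it.
UPSTREAM = {
    "encoder": [],
    "packager": ["encoder"],
    "origin": ["packager"],
    "cdn": ["origin"],
    "player": ["cdn"],
}


def likely_root_causes(statuses: dict[str, str]) -> list[str]:
    """Return red components that have no red component anywhere upstream."""

    def has_red_upstream(component: str, seen: set[str]) -> bool:
        for parent in UPSTREAM.get(component, []):
            if parent in seen:
                continue  # already explored and found clean
            seen.add(parent)
            if statuses.get(parent) == "red" or has_red_upstream(parent, seen):
                return True
        return False

    return sorted(
        component
        for component, status in statuses.items()
        if status == "red" and not has_red_upstream(component, set())
    )


statuses = {
    "encoder": "green",
    "packager": "red",
    "origin": "red",
    "cdn": "green",
    "player": "red",
}
suspects = likely_root_causes(statuses)
```

In this example the packager is the only red component with nothing red above it, so it is the first place to look even though the origin and player are also reporting errors.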
When businesses are reactive, as streaming platforms had to be to keep monitoring their workflows once network operations personnel were working from home, the tendency is to implement something that “just works.” But that solution may not be the best long-term, and there may not have been time to do a proper bakeoff or proof of concept (PoC). With a little breathing room, it’s crucial to take the necessary time to assess the options and select the appropriate monitoring solution for each component which needs to be watched.
As you evaluate different monitoring solutions, you will need to address a fundamental question: “should I utilize hardware monitors or software monitors?” In our experience, software monitors have several significant advantages over hardware monitors:
Remotely monitoring an entire streaming workflow, which may include hardware and software behind the firewall as well as cloud-based services, is complicated to do correctly. Yes, you can grab data from workflow components and people can muddle through it, eventually figuring out how to relate it to other components, but that is not a tenable long-term solution. With time on your side to strategize and implement more slowly, and more diligently, it behooves you to do so in stages. Don’t try to tackle everything at once. Be agile. Address each component in your big picture as an individual project. Yes, you may find that a best-of-breed monitoring solution targeted at one component also works for another, but don’t go into the implementation of your remote monitoring strategy with that in mind.
Although you will have spent time on creating your strategy and implementing it piece-by-piece, you will need a more holistic approach to visualization. You can’t have multiple monitoring tools each requiring their own dashboard. That defeats the purpose of remote monitoring in which operations personnel aren’t in front of dozens of screens to see everything at once. Whatever visualization tool you decide upon, it will behoove you to strategize about it first: what needs to be displayed, how are relationships between data sources accounted for, and, of course, KISS. Visualization is really where the rubber meets the road. However you articulate the presentation of all the monitoring data, it has to be done in such a way that makes it simple and easy to identify root causes and troubleshoot issues down to the streaming workflow component and user session.
Cutting across your strategy and implementation is one overriding principle: remote monitoring is not a one-and-done activity. Just because you strategize, select which systems need to be monitored remotely, build policies, implement, and visualize doesn’t mean you are finished. There are always elements of that strategy or implementation to improve. New monitoring technologies. New visualization approaches. Embracing continuous improvement for your remote monitoring strategy, whether reactively or proactively, will pay off in the long run by ensuring your operations personnel and network engineers have the tools they need to shorten troubleshooting and root-cause analysis time, both of which can significantly impact viewer satisfaction and subscriber health.
Don’t forget to subscribe to this blog series so that you won’t miss the next installment that talks about the role the cloud plays in remote monitoring!