Addressing the Challenges of Network Monitoring When Everyone Is At Home
This is the first post in a series about the challenges of network monitoring in a telecommuting world. The next post will address how to virtually monitor streaming infrastructure that was never meant to be monitored outside the NOC.
Even before the current situation of shelter-in-place, many tech teams were already geographically dispersed. Network engineers, software developers, and even some operations people were working from home (or the local coffee shop). Perhaps not all the time, but for at least a portion of it, they were capable of carrying out their work from chairs and monitors in home offices, through cloud resources or VPN connections.
Although telecommuting is a growing trend, adoption has been gradual, with businesses easing into the support of a remote workforce. For example, between 2005 and 2017, the number of people telecommuting in the United States increased 159%. And during that time, businesses were learning how to support employees working away from the office. Not just the workers themselves, but the actual work. It’s much easier, perhaps, to support someone who is developing software, or creating technical documentation, than someone who needs access to equipment and machines not connected to the outside world. When the work has few dependencies, it’s much easier to support telecommuting.
COVID-19, though, has forced businesses to contend with telecommuting much faster. People used to working every day in a structured office environment have suddenly been thrust into remote working as offices shut down around the world. The current situation has sped up the distributed workforce timetable, exposing the weaknesses in how some work was being carried out in the first place. Consider network management, such as in a Network Operations Center (NOC). Traditionally, the monitoring of servers and software related to content delivery had been carried out in a single room, with network operations engineers staring into walls of monitors. And because it’s always been done like that, little attention had been paid to the big question:
“How can we monitor everything if we can’t be in the room?”
When the current situation is resolved, and things settle into a new normal, it will be difficult to predict what will happen to telecommuting. It’s likely businesses will recognize the benefits, both from a cost and well-being perspective, and embrace remote working. But if that happens, then businesses must also rethink how they accomplish some work, like network monitoring, so it can be addressed by resources regardless of where they are in the world.
Why Is Monitoring Network Infrastructure Remotely So Difficult?
Under normal circumstances, network engineers in streaming operations work side by side in a physical room: the Network Operations Center, or NOC. Screens displaying details about system and server health, streaming content, and network details fill the walls. Phone lines wait tensely for customers to ring with delivery problems. All in all, it’s a collaborative environment where everyone works together to triage issues and ensure the best possible viewing experience. And therein lies the problem. Many of those systems and servers involved in the streaming workflow (content origin, caches, encoders, transcoders, packagers, and all the in-between network gear) might only be reachable from within the corporate network, from within the physical room.
Of course, many businesses have provisions for outside workers to access network resources. But the problem with connecting through a VPN isn’t only the availability of the connection itself. More importantly, it’s replicating the “wall of screens” within the room. Assuming there are a dozen screens (and in a big NOC, there are usually more), how would it be possible to replicate that at home? Part of monitoring the network isn’t just seeing the data flowing across the screens; it’s seeing the monitor for one system right next to another, seeing all the data at once. A network engineer working from home, even with three screens, would still have to toggle between systems, and while toggling to one, they might miss something critical on another.
What Does the Future of Network Operations Look Like?
If the current situation is providing an opportunity to re-imagine how the network is monitored, then what operations engineers need is a way not only to access data for all the systems within the streaming workflow when they are outside the physical room, but also to see it all together. Let’s take that a step further and imagine such a solution. First, this solution would be a system of monitors that are not constrained by the physical network. These software-based monitors could be deployed within the network and beacon data outside, to a cloud service. Second, the solution would need to be accessible from any device, whether a laptop or phone. It shouldn’t matter. Anything with a web browser should be able to pull up the cloud service and look at the data as it comes in. Finally, the solution needs to consolidate all of those screens into something meaningful. Remember, part of the difficulty in monitoring the network resources in a streaming workflow from outside the network is replicating all of the screens. What this solution must do is piece all the data together in a way that simulates the engineer’s ability to look at a wall of monitors and identify the issue causing viewer problems.
Although that might seem like a tall order, it’s not.
Pulling the NOC Into the Cloud
Monitoring single, discrete systems in a streaming workflow isn’t the most efficient method of tracking issues. In fact, it’s pretty daunting to piece together the root cause from data collected across multiple systems. Those network operations folks, though, sitting in the NOC, are magicians when it comes to finding problems. Still, any speed and efficiency they could gain from easier and faster root cause analysis only benefits the business, because problems get solved before viewers cancel their subscriptions. To develop a solution for remote network operations, the data from the systems has to be available outside of the corporate network. The best way to accomplish that is through the cloud. By enabling each component of the workflow, whether physical or software, to beacon data to the cloud, the fundamental issue of data accessibility has been addressed. Of course, security must be a priority, but once the data and access to it are protected, the first major hurdle has been tackled. Now that there’s a steady stream of data from streaming components (both inside and outside the network), cloud-based software and processes can act on it. The data can be sorted, collated, aggregated, and compared. Intelligent systems can be used to identify patterns or relationships (something that, in the physical room with all the screens, would have had to be done visually).
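To make the beaconing idea a little more concrete, here is a minimal sketch in Python. It assumes each workflow component reports a small record with a component name, a viewing session ID, a timestamp, and a status. The Beacon type, its field names, and the "ok"/"error" vocabulary are all hypothetical and are not taken from any particular product. The sketch shows only the serialization of a record and one simple aggregation; it leaves out the actual transport to a cloud service and the security that would have to surround it.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class Beacon:
    """Hypothetical record a workflow component sends to the cloud."""
    component: str    # assumed names, e.g. "origin", "cache", "encoder"
    session_id: str   # viewing session the event belongs to
    timestamp: float  # seconds since the epoch
    status: str       # assumed vocabulary: "ok" or "error"


def encode_beacon(beacon: Beacon) -> str:
    """Serialize a beacon as JSON, ready to be sent to a cloud endpoint."""
    return json.dumps(asdict(beacon), sort_keys=True)


def error_counts(beacons: List[Beacon]) -> Dict[str, int]:
    """Aggregate the number of error beacons reported per component."""
    counts = Counter(b.component for b in beacons if b.status == "error")
    return dict(counts)
```

Once records from every component land in one place, a per-component count like this is the kind of comparison that would otherwise be made by eye across several screens.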
Imagine it...a NOC in your pocket
When the data is in the cloud, the ideal solution would be to visualize it within a browser window. And, because it’s browser based, it could be accessed from the comfort of the couch, the passenger seat of a car, or even the beach. The power of the NOC in the palm of your hand. But just displaying data in the browser isn’t that exciting. It still requires visual acuity to find the pattern. So take this solution a step further. What if the data was pieced together along a timeline, so that a single viewing session (where an error had occurred) could be traced back all the way to the origin? It would be like a DVR for network operations. Now that would be powerful!
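The timeline idea can also be sketched in a few lines of Python. This is an illustration under assumptions, not a description of any real product: the Event fields and the "error" status value are hypothetical, and treating the earliest error in a session as the likely starting point is a simplification of real root cause analysis.

```python
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    """Hypothetical event reported by one component for one session."""
    timestamp: float  # seconds since the epoch
    component: str    # assumed names, e.g. "origin", "cache", "player"
    session_id: str   # viewing session the event belongs to
    status: str       # assumed vocabulary: "ok" or "error"


def session_timeline(events: List[Event], session_id: str) -> List[Event]:
    """Return the events of one viewing session, oldest first."""
    matching = (e for e in events if e.session_id == session_id)
    return sorted(matching, key=lambda e: e.timestamp)


def first_error(timeline: List[Event]) -> Optional[Event]:
    """Return the earliest error in an ordered timeline, or None."""
    return next((e for e in timeline if e.status == "error"), None)
```

Replaying a session this way is what makes the DVR comparison apt: the engineer can step from the viewer's error back through the cache and encoder toward the origin, in the order things actually happened.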
A Global Experiment is Underway
If nothing else, the current situation provides a unique opportunity: a look into the future of how network infrastructure, such as the systems involved in the streaming video workflow, can be monitored effectively outside of the physical operations center, as well as a look into how stressed the network can become and where the breaking points might be. The kind of sustained traffic that is being pushed through the internet right now as a result of mass telecommuting and mass streaming is like a giant load test. The current circumstances can help us understand not only the stress on the internet and on streaming platforms, but also how NOCs might be re-imagined in the future to monitor the very systems being stressed right now.
Don’t miss the next instalment in this series that explores the first major element of solving remote network operations: getting the data from the systems into the cloud. The second post will dig into how a cloud-based monitoring approach can empower operations engineers to see inside the network...from wherever they are.