How to Fix the Root Cause Analysis Problem in Streaming Video

An operations engineer performing root cause analysis and fixing errors

Your subscribers are churning right now. You probably just don’t know why, and that’s a much bigger problem. Of course, you might be able to hazard a guess by looking at one of the dozen monitors in the NOC and dig in deeper by checking various dashboards and data elements. Without the right monitoring setup, however, this takes far too much time. Plus, performance issues with streaming services are often the result of multiple factors, requiring correlation between several data sources. Unfortunately, that need to correlate data sources is exactly what complicates root cause analysis, and the resulting delays end up frustrating your viewers.

In this article, we look at why root cause analysis is critical for streaming operators, the main challenges that come with it, and how to overcome them.

Why the first 30 seconds of your root cause analysis is so critical

Studies have shown that the longer performance issues persist, the more likely viewers are to abandon the content. Of course, abandonment doesn’t immediately equate to churn. However, for ad-supported streaming services, any content abandonment hurts revenue. Plus, abandonment caused by performance-related issues will, sooner or later, inevitably lead to churn as subscribers become more and more frustrated and sign up elsewhere.

That’s why getting to the root cause of a technical issue quickly is so important. Failing to identify, diagnose, and fix performance-related problems typically results in a ripple effect: viewers having an issue with one piece of content may have issues with others, compounding their frustration, driving up abandonment rates, and eventually leading them to cancel their subscriptions. What’s more, angry customers aren’t shy about voicing their opinions online or to friends, which makes acquiring new viewers more difficult too.

Most streaming providers are aware of this, so what’s holding them back from speeding up root cause analysis? In the following section, we look at that in more detail.

The “roots” of the root cause analysis problem

Infographic: the main root cause analysis challenges

Operational efficiency in streaming video services is driven by two primary metrics: Mean Time to Diagnose (MTTD) and Mean Time to Repair (MTTR). As the names imply, they represent the average time it takes to diagnose a problem (MTTD) and to resolve it (MTTR).

For the most part, operations or network engineers can drive down MTTR through improved expertise and skill. Of course, there are some external elements, such as a third-party provider (e.g., a CDN), that can slow down resolution. For MTTD, though, there are many factors that affect the score and stretch those first 30 seconds into 30 minutes, 30 hours, or more. Here’s an overview of the most common challenges.

Too much data: you can't see the wood for the trees

Operations engineers don’t work with a dearth of information; they are inundated with data from throughout the workflow. Encoders churn it out. Cloud storage providers generate it. Content delivery networks are swimming in it. Even the player produces dozens of data points. In fact, there is so much data coming into the NOC, with so many possible points of correlation, that making sense of the telemetry becomes almost impossible.

Slow data puts you on the back foot in the race against time

Sometimes engineers have access to an incredible amount of data, but it doesn’t come in fast enough. If a refresh happens only every 30 minutes, MTTD is naturally going to increase, even if the root cause analysis itself would take just a few minutes with up-to-date data.

Data from third-party providers creates added complexity

When part of the streaming workflow isn’t under direct control, it creates an additional obstacle to root cause analysis: getting access to the data. If operations engineers only get the data through a proprietary visualisation tool, it can be difficult to correlate it with other sources. What’s more, if the third-party provider isn’t transparent about how data elements in the tool are calculated, using that data in root cause analysis becomes even more complicated.

Fragmentation of data tools blurs your vision & costs time

“Troubleshooting can take days or even months because data is fragmented across different sources, often without access to data.”

Technical Operations Manager at a major US streaming provider

One of the problems with so much data is that it’s often provided through different tools. So, not only do operations engineers need to sift through a ton of data, but they also need to bounce around between dashboards.

Increased post-processing drains your engineers’ critical time

When multiple data sources and tools need to be consulted, correlation is nearly impossible, especially if third-party providers don’t offer programmatic access. This means engineers must manually connect data points and calculate values each time, slowing down diagnosis.

Plain old human error due to a lack of integration & automation

The combination of too much data and too many tools invites a very obvious outcome: human error. When operations engineers have to correlate the data themselves, from dozens of sources across an equal number of dashboards, the opportunity for error increases dramatically. And while identifying the root cause is the end goal, identifying the wrong root cause only adds delays in dealing with quality of experience issues.

“Human misguidance or error frequently creates long delays in root cause analysis, as there is simply too much data to parse manually.”

Video Delivery Associate Director at a leading APAC telecommunications company

In the highly competitive market of streaming services, where consumers only have so much money to spend, operators cannot afford to extend the time it takes to find root causes of viewing issues. Rather, they need to figure out how to shrink the amount of time it takes to identify the issue using all the data they have and, more importantly, to fix it quickly.

How to improve root cause analysis to reduce MTTD & boost QoS

Resolving performance- or quality-related issues necessitates quickly driving down that mean time to diagnose. In other words, getting to the root cause faster. Doing so requires addressing those “roots”.

Connect the dots & simplify visualisation to avoid data overload

Operations teams can reduce the amount of data they need to deal with at first glance by connecting and correlating sources behind the dashboard. That way, the visualisation tool reflects the key indicator of a specific performance or quality aspect, not just a bunch of numbers on a graph.
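As a simplified illustration, the sketch below shows how two raw feeds, CDN error counts and player rebuffer events, could be correlated behind the dashboard into a single delivery-health indicator. The metric names, weights, and inputs are assumptions for the example, not a prescribed formula:

```python
# Minimal sketch: correlate two raw feeds into one top-level delivery-health
# indicator. Field names and weights are illustrative assumptions.

def delivery_health(cdn_requests: int, cdn_errors: int,
                    play_attempts: int, rebuffer_events: int) -> float:
    """Return a 0-100 score combining CDN error rate and rebuffer ratio."""
    error_rate = cdn_errors / cdn_requests if cdn_requests else 0.0
    rebuffer_ratio = rebuffer_events / play_attempts if play_attempts else 0.0
    # Weighted blend: penalise delivery errors slightly more than rebuffering.
    penalty = 0.6 * error_rate + 0.4 * rebuffer_ratio
    return round(max(0.0, 100.0 * (1.0 - penalty)), 1)

# Example: 2% CDN errors and 5% of sessions rebuffering -> one number to watch.
print(delivery_health(cdn_requests=10_000, cdn_errors=200,
                      play_attempts=4_000, rebuffer_events=200))
```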

Acquire data more quickly

When data is controlled by third-party vendors, it can be difficult to get them to speed up collection. One solution is to ask for the raw data, before any metric calculations are applied, via programmatic means. Another effective approach is to use a third-party data provider, like Datazoom, which can collect the data more quickly and feed it to whatever tools you need. Regardless, you should look for every opportunity to speed up data acquisition.

Integrate third-party data sources via APIs

Unfortunately, there is no way to eliminate the reliance on third-party providers. Commercial CDNs, for example, will always be needed to ensure scalability for streaming services. However, operations teams can insist that third-party providers give them raw data accessible via APIs. By programmatically pulling third-party data into a common visualisation tool, post-processing can be done automatically before the data is ever looked at, ensuring it’s standardised, normalised, and correlated.
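As a rough example of what that programmatic access could look like, the sketch below pulls raw error logs from a hypothetical CDN endpoint and reshapes them into a common structure before they ever reach a dashboard. The URL, parameters, and field names are invented for illustration and will differ per provider:

```python
# Sketch of pulling raw third-party data over HTTP and normalising it before it
# reaches the dashboard. The endpoint, token, and field names are hypothetical.
import requests

CDN_LOG_API = "https://api.example-cdn.com/v1/raw-logs"  # hypothetical endpoint

def fetch_cdn_edge_errors(token: str, start: str, end: str) -> list[dict]:
    response = requests.get(
        CDN_LOG_API,
        headers={"Authorization": f"Bearer {token}"},
        params={"from": start, "to": end, "status_class": "5xx"},
        timeout=30,
    )
    response.raise_for_status()
    # Standardise the vendor's records into the fields our own tooling expects.
    return [
        {
            "timestamp": record["ts"],
            "pop": record["edge_location"],
            "status": record["http_status"],
            "asset": record["request_path"],
        }
        for record in response.json()["records"]
    ]
```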

Overcome data fragmentation by pulling all data into one visualisation tool

Although there will always be a need for dashboards specific to certain data sources (analysing encoder data, for example, is fundamentally different from analysing CDN data), key metrics should be pulled into a single visualisation tool. Employing a service like Datadog or Looker as a top-level dashboard allows operations engineers to create a single view that reflects the most important indicators for common root cause analysis.
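For instance, a derived top-level indicator could be pushed into Datadog with its Python client. This is only a sketch: the metric name, tag values, and score are our own assumptions rather than a prescribed schema:

```python
# Sketch: push a derived top-level indicator into a single dashboarding tool,
# here Datadog via its Python client (pip install datadog). Metric name and
# tags are illustrative assumptions, not a prescribed schema.
import time
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

api.Metric.send(
    metric="streaming.delivery_health",   # e.g. the composite score from above
    points=[(int(time.time()), 97.0)],
    tags=["cdn:primary", "region:eu-west"],
    type="gauge",
)
```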

Build a data processing layer into your analytics stack

Rather than requiring operations engineers to figure out the correlation between data points or the calculation of a metric, build that logic into the streaming tech stack. Then, when the data comes in, it’s automatically scrubbed, normalised, and standardised according to your internal data dictionary. As a result, the data that hits your single visualisation tool doesn’t need further correlation; it already communicates what the engineers need it to.
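Here is a minimal sketch of such a processing layer. The vendor field names and the data dictionary itself are illustrative assumptions; the point is that every record is renamed to canonical fields before engineers ever see it:

```python
# Sketch of a processing layer that maps vendor-specific field names onto an
# internal data dictionary before the data reaches the visualisation tool.
# The mappings below are invented for illustration.

DATA_DICTIONARY = {
    "cdn_a": {"edge_location": "pop", "http_status": "status", "bytes_sent": "bytes"},
    "cdn_b": {"server_region": "pop", "status_code": "status", "body_size": "bytes"},
}

def normalise(source: str, record: dict) -> dict:
    """Rename a vendor record's fields to the canonical names we chart."""
    mapping = DATA_DICTIONARY[source]
    return {canonical: record[vendor] for vendor, canonical in mapping.items()}

# Two vendors, one shape: both records can now be correlated directly.
print(normalise("cdn_a", {"edge_location": "LHR", "http_status": 504, "bytes_sent": 0}))
print(normalise("cdn_b", {"server_region": "LHR", "status_code": 200, "body_size": 1048576}))
```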

Limit the opportunity for human error by improving data visualisation

The idea behind connecting all the data sources and exposing top-level data points in a single dashboard is to make it easier for engineers in the NOC to identify performance or quality issues quickly. When engineers don’t have to interpret data across multiple dashboards or data sources, there is less of a chance of them making a mistake or miscalculation and chasing issues that aren’t really related to the root cause. An excellent way to limit this potential for human error is to colour-code top-level data elements according to thresholds. Then, when engineers look at the tool and see CDN performance in the yellow or red, they can immediately drill down.
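A small sketch of that colour-coding logic is shown below. The threshold values are purely illustrative and would normally live in dashboard or alerting configuration:

```python
# Sketch: colour-code a top-level metric against thresholds so engineers see
# green/yellow/red at a glance. The example thresholds are assumptions.

THRESHOLDS = {
    "cdn_error_rate": (0.01, 0.05),   # yellow above 1%, red above 5%
    "rebuffer_ratio": (0.02, 0.08),   # yellow above 2%, red above 8%
}

def status(metric: str, value: float) -> str:
    warn, crit = THRESHOLDS[metric]
    if value >= crit:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"

print(status("cdn_error_rate", 0.02))   # -> "yellow": worth drilling down
print(status("rebuffer_ratio", 0.10))   # -> "red": immediate investigation
```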

The key to root cause analysis: master your data, but don’t be mastered by it

Streaming is evolving rapidly, with new technologies, new ways to architect services, and new stacks, which also means the data is changing. More data is coming from the software and hardware within the streaming technology stack than ever before. However, if streaming operators don’t put a strategy in place to deal with that data, the proliferation causes more harm than good, as operations and network engineers waste time in fruitless efforts to resolve the issues their subscribers are complaining about.

One strategy is to employ a monitoring harness to handle all of the data. Doing so ensures a single visualisation tool, one dashboard that doesn’t change despite new or evolving data sources. This also means that the tool can be optimised a single time, focusing on visual indicators (such as green, yellow and red) with clear, actionable buttons to help engineers perform root cause analysis with just a couple of clicks.

At the end of the day, regardless of what operations engineers do, MTTD needs to go down or abandonment and churn will only continue to rise.

Don’t waste any more time: address your root cause analysis problem now by downloading our Monitoring Harness White Paper.

Get Whitepaper