The world is undoubtedly going through an extremely difficult time in the wake of the COVID-19 outbreak, which is causing unimaginable suffering to so many people. It is also, perhaps more than ever before, highlighting the essential role of the Internet as the means for people to keep in touch with loved ones and stay updated on the latest developments, and for some sectors to stay afloat as one legislator after another takes action to foster, incentivize or enforce social distancing in the hope of reducing the spread of the virus.
Our CEO, Staffan Göjeryd, wrote the Connecting the World @Home blog post a few weeks ago on how we’re tackling the situation and humbly recognizing the critical role our Internet Backbone has in society with direct customers spread across more than 120 countries and accounting for more than 60% of the global routes – connected via 300+ POPs.
The purpose of this post is to share, transparently and in a bit more detail, some of the learnings we’ve gathered during this difficult time, as well as how we’re addressing the short-term network challenges that arise.
Changing Trends and Seasonality
At a high level there are two overarching trends to consider: the change in how traffic is distributed over a period and, more importantly, the change in peak times, since peaks are what networks are capacity planned for. Adjusted for seasonality and/or trend, statistically significant changes can be observed when looking at daily and weekly resolutions respectively. As time moves on, there’s no reason to think this won’t be an obvious seasonal anomaly on both a monthly and yearly basis as well.
Without a regional and time zone context, the data contains a lot of unnecessary noise. We’ve seen the same trend throughout Asia, now in full swing in Europe with North America closely behind. Focusing on Europe, and recognizing that parts of the noise can be attributed to both intra-day seasonality and trend changes, below is a chart illustrating the percentage change per hour of the day between the average of all Mondays in February 2020 and Monday the 23rd of March 2020. All timestamps for the remainder of this post are in CET unless otherwise explicitly specified.
Some visually noticeable patterns, also verified in a larger sample, emerge:
- 1. The exact time of peak isn’t moving significantly, whereas the near-peak (90th percentile) periods themselves are getting longer, with a much more even distribution throughout the day
- 2. Daytime, ranging from somewhere around 08:00 to 17:00, represents the largest growth, consistent with what’s to be expected with large portions of the world working and consuming content from home
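To make the hour-of-day comparison described above concrete, here is a minimal sketch of how such a percentage change could be computed. The function name and the traffic values are assumptions made up for illustration; they are not real measurements and not a description of the tooling behind the chart.

```python
from statistics import mean


def hourly_pct_change(baseline_days, target_day):
    """Percentage change per hour of day between the mean of several
    baseline days and a single target day.

    baseline_days: list of 24-element lists (e.g. one per Monday in February)
    target_day:    24-element list (e.g. Monday the 23rd of March)
    """
    changes = []
    for hour in range(24):
        base = mean(day[hour] for day in baseline_days)
        changes.append(100.0 * (target_day[hour] - base) / base)
    return changes


# Hypothetical traffic levels in arbitrary units, NOT real data.
feb_mondays = [
    [50 + 2 * h for h in range(24)],
    [54 + 2 * h for h in range(24)],
]
march_23 = [78 + 2 * h for h in range(24)]

pct = hourly_pct_change(feb_mondays, march_23)
for hour in (0, 12, 23):
    print(f"{hour:02d}:00  {pct[hour]:+.1f}%")
```

In this synthetic example the absolute increase is the same for every hour, so the relative growth is largest in the hours that had the lowest baseline, which is the same kind of effect as low-peak hours growing the most.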
Staying on intra-day movements and correlating this pattern against categories of clients and the sum of their traffic to and from AS1299, there’s a noticeable positive correlation with the category of clients whose primary service is Video Conferencing. Naturally, this is one of the client groups expected to have a major upswing as a result of people working from home. Using the same Monday samples as before, and again comparing the February average with Monday the 23rd of March 2020, it becomes evident that this category too sees a huge increase in both total volume and traffic during the traditional low-peak hours.
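As a rough illustration of the kind of correlation mentioned above, the sketch below computes a Pearson correlation coefficient between an overall hourly growth pattern and the growth of each client category. Pearson is an assumption here (the post does not say which correlation measure was used), and the category names and numbers are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical growth (percent) for a handful of hours, NOT real data.
overall_growth = [10, 12, 30, 45, 50, 40, 20, 15]
category_growth = {
    "video_conferencing": [20, 24, 60, 90, 100, 80, 40, 30],
    "other": [15, 14, 12, 13, 14, 12, 15, 13],
}

scores = {name: pearson(overall_growth, series)
          for name, series in category_growth.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

A coefficient close to 1 only says the two patterns move together hour by hour; it does not by itself show that one category is driving the overall change.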
Moving into the weekly resolution, and again without even decomposing the timeseries data into seasonal, trend and residual components, it becomes visually obvious how the pattern for all days in the third week of March looks much like what Sunday evenings used to, in terms of both amplitude and the more evenly distributed tops. Also worth noting is that the February data, from a seasonality perspective, already seems to be skewed toward the ongoing trend changes when compared to earlier months.
With the weekly seasonal component removed, the peak utilization (95th percentile here for noise reduction) has increased by roughly 35% over the course of March 2020, which is somewhat representative of what the rest of the industry seems to be experiencing. It is also roughly what we expected the full year-on-year growth to be for this network segment, now to be implemented within a couple of weeks. I’ll come back to the short-term challenges associated with that.
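For readers who want to reproduce a similar figure on their own data, the following is a simplified sketch. It compares the 95th percentile of one full week of utilization samples against another full week, which is a crude stand-in for removing the weekly seasonal component since each week contains every weekday once. The nearest-rank percentile method, the function names and the synthetic samples are all assumptions made for the example.

```python
from math import ceil


def percentile_nearest_rank(samples, p):
    """p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def weekly_p95_growth(first_week, last_week):
    """Percentage growth in 95th percentile utilization between two weeks."""
    start = percentile_nearest_rank(first_week, 95)
    end = percentile_nearest_rank(last_week, 95)
    return 100.0 * (end - start) / start


# Synthetic utilization samples in arbitrary units, NOT real data.
first_week = [float(x) for x in range(1, 101)]
last_week = [x * 135 / 100 for x in range(1, 101)]

growth = weekly_p95_growth(first_week, last_week)
print(f"p95 growth: {growth:.1f}%")
```

Using a high percentile rather than the absolute maximum keeps a single short spike from dominating the result, which is the noise-reduction effect referred to above.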
A Robust Traffic Demand Modelling Process
Although it would probably require a post of its own to do our capacity planning process justice, I’d like to cover the main building blocks and rationale behind the setup. How it maps into a wider context and flow can be seen in Figure 4.
For the purpose of the actual capacity planning, we make use of a mix of home-grown software, commercial software and several open-source projects such as Facebook Prophet and pmacct.
While we recognize there are situations and circumstances in which Prophet isn’t the perfect tool for forecasting, it offsets some of the complexity associated with conventional ARIMA models, for which the end-user (Network Planner or Engineer) almost needs to be a full-blown data scientist to understand which knobs to adjust in order to decompose any given timeseries data.
At a high level, the process takes pre-structured and timestamped data about the network as input. The data is enriched and linked so that it can be aggregated into any view and dimension the consumer wishes. At this stage of the process, all computations and analysis are performed, which in turn becomes the foundation on which all forecasting is based. Because both the technical and commercial artifacts of each component are modelled, it provides a robust output on how both capital and operational expenditures will be impacted over time and location (device, POP, region). Over the course of a year, executing well ahead of time based on the forecast has also reduced the number of customer orders requiring buildout by 40%, with the trend poised to continue throughout 2020.
The marketing message of the month seems to revolve around how networks cope in a fully operational state, which isn’t typically what they’re built and dimensioned for in the first place and thus makes for an equally poor metric now. What is more useful is understanding whether the network can cope during outages, with the most common requirement being the ability to handle any single failure. We model and measure this for every hour of the day in three different setups, namely:
Retrospective Model: Historical view of “Traffic at Risk” taking into account any ongoing failures at the time of the auto-discovered snapshot. It is used to map SRLG failure trends and to verify that outages in the network are accurately represented in the simulations
Reference Model: Models the network in a fully operational state and is used to run what-if scenarios with regard to topology or metric changes, the addition of new devices and the impact of planned maintenance
Forward-looking Model: Essentially a copy of the reference model but including all committed augmentations, so that these are taken into account when adding new capacity to the network
Hence, we can immediately identify where new hotspots have emerged should we have single or dual failures – measured in the form of “Traffic at Risk” per time period and device, network role, SRLG and/or region.
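To illustrate the idea of “Traffic at Risk” under single failures, here is a deliberately small sketch. It assumes each demand has one primary and one backup path and counts as at risk any traffic that cannot be routed or that exceeds a link’s capacity after rerouting. The topology, the routing rule and all names are hypothetical, and real simulations are far more detailed than this.

```python
def traffic_at_risk(capacity, demands, failed_link=None):
    """Traffic that is unroutable or exceeds link capacity for one scenario.

    capacity:    dict mapping link name -> capacity
    demands:     list of (volume, primary_path, backup_path), where each
                 path is a list of link names
    failed_link: name of the failed link, or None for the no-failure case
    """
    load = {link: 0.0 for link in capacity if link != failed_link}
    at_risk = 0.0
    for volume, primary, backup in demands:
        # Use the first path that avoids the failed link, if any.
        path = next((p for p in (primary, backup) if failed_link not in p), None)
        if path is None:
            at_risk += volume  # no surviving path at all
            continue
        for link in path:
            load[link] += volume
    # Simplification: overload is summed per link, so traffic crossing
    # several overloaded links would be counted more than once.
    at_risk += sum(max(0.0, load[link] - capacity[link]) for link in load)
    return at_risk


# Hypothetical three-node triangle, NOT a real topology.
capacity = {"A-B": 100, "A-C": 100, "B-C": 100}
demands = [
    (60, ["A-B"], ["A-C", "B-C"]),
    (70, ["A-C"], ["A-B", "B-C"]),
    (30, ["B-C"], ["A-B", "A-C"]),
]

baseline = traffic_at_risk(capacity, demands)
results = {link: traffic_at_risk(capacity, demands, link) for link in capacity}
print(baseline, results)
```

Running the same loop per hour of the day, with that hour’s demand volumes, gives a per-time-period view; grouping links by shared risk would extend it from single links toward SRLG failures.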
How are the short-term network challenges being addressed?
Traffic has always followed the previously mentioned weekly seasonal pattern, with occasional outliers, where the highest load on the network per continent occurs on Sunday evenings. The fact that every day of the week now looks like a Sunday in the weekly resolution doesn’t make that much of a difference per se. What does present a temporary challenge, however, lies within the daily pattern. While the headroom implemented has certainly absorbed the vast majority of the growth attributed to changing behaviors due to COVID-19, there are a couple of new phenomena associated with this. There are also regional discrepancies, which the team is working around the clock to address using the previously explained robust modelling process, taking all relevant failure scenarios into account. A somewhat clunky proxy for the regional differences is the tier 1-3 POP growth rates for March 2020, illustrated in Figure 5 using SMA3 smoothing to reduce the impact of outliers.
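For readers unfamiliar with the term, SMA3 is a simple moving average over three data points. The sketch below assumes a trailing window (each output is the mean of a point and the two before it); whether Figure 5 uses a trailing or centered window is not stated, and the growth values are invented.

```python
def sma(values, window=3):
    """Trailing simple moving average; returns len(values) - window + 1 points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical daily growth rates in percent with one outlier, NOT real data.
growth = [10.0, 12.0, 50.0, 14.0, 15.0]
smoothed = sma(growth, window=3)
print([round(v, 2) for v in smoothed])
```

The outlier is still visible after smoothing, but its effect is spread across three points instead of appearing as a single spike.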
As the near peak-hours for every continent start earlier, the US afternoons and the evenings in Europe now contribute significantly to each other’s peaks. The model cannot predict future changes in seasonal patterns or creative third-party routing interventions in a time when everyone’s scrambling for available capacity. Even if it could, it’d still require one round of buildout in order to adjust to the new normal. To get a historical perspective over the last six months, Figure 6 depicts a view of the number of network buildouts currently being expedited for backbone purposes.
Here follows a short summary of findings and intended takeaways – all for the period of March 2020 unless explicitly specified:
- Overall traffic volumes have risen by >50% with major regional differences, but mostly visible during non-peak hours as a result of people staying at home
- As of March 29 (Sunday), peak traffic levels are up around 35% depending on continent
- On average, POPs have grown by 20.5% – with 208% and -56% respectively making the extremes and highlighting the regional differences
- Seasonality observations per resolution:
  - Daily: Traditional low-peak hours have grown the most and the near-peak hours are much more evenly distributed – contributing to longer periods of high utilization and inter-continental overspill
  - Weekly: All days of the week now look very similar to what Sundays used to do
- Certain client categories such as Video Conferencing are up more than 400% in total volume
- Although lockdown measures are stabilizing in Europe, traffic is continuing to grow, albeit at a lower rate, but still far more than normal monthly seasonality would suggest
Rest assured, there will be several further learnings from this very unusual pattern of demand for Internet usage. I will get back to that in later posts.
Head of Network Engineering & Architecture