Cloud Insight Index

About #

Cloud Insight Index is a for-fun project that builds a database of cloud provider incident data for Amazon Web Services (AWS), Google Cloud Platform (GCP), and Azure. I work on it in my free time, and it is not affiliated with any of the providers.

The dataset and source code for Cloud Insight Index will not be made public any time soon.

Data Sources #

The data comes from publicly available incident reports published by each provider (GCP, AWS, Azure), so it is only as reliable as the provider’s report.

Model Biases #

Each provider defines incidents differently and communicates about them with a different level of detail.

Cloud Insight Index defines an Incident as a cloud provider reporting that a service is impacted in a region. This diverges from provider definitions. A provider may say a single incident impacts many services, but Cloud Insight Index counts each impacted service in each region as a separate Incident, all part of the same Event. The model is deliberately biased toward what customers actually experience: customers of a service do not care about the underlying cause of the issue. See also https://www.whoownsmyavailability.com/

An example (truncated from GCP incident VuCtCwkRXueAyusvrXfG):

Description: We are experiencing an issue with Cloud Memorystore, Google Compute Engine, Google Cloud Composer, Google Cloud Networking, Cloud Filestore, Google App Engine, Google Kubernetes Engine, Apigee, Google Cloud SQL, Google Cloud Dataflow beginning at Monday, 2023-06-12 17:15 US/Pacific.

Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1) …

This is a single Event. The description lists 10 services, and the truncated region list above shows 8 regions. Counting only those, that’s 80 Incidents according to Cloud Insight Index.

This model “penalizes” outages with a large blast radius.
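The counting rule can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code (which is not public): the `Incident` type and `expand_event` function are hypothetical names, and the input is a small subset of the services and regions from the example above.

```python
from itertools import product
from typing import NamedTuple


class Incident(NamedTuple):
    """One impacted service in one region (hypothetical type for illustration)."""
    event_id: str
    service: str
    region: str


def expand_event(event_id, services, regions):
    """Expand one provider Event into one Incident per (service, region) pair."""
    return [Incident(event_id, s, r) for s, r in product(services, regions)]


# A small subset of the GCP example, chosen only to keep the output short.
services = ["Google Compute Engine", "Google Cloud SQL"]
regions = ["asia-east1", "asia-east2", "asia-northeast1"]

incidents = expand_event("VuCtCwkRXueAyusvrXfG", services, regions)

# The Event's blast radius is the number of Incidents it expands into.
blast_radius = len(incidents)
print(blast_radius)
```

Two services across three regions expand into six Incidents, which is why Events that touch many services in many regions dominate the counts.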

Event Start and End Times #

Cloud Insight Index determines an event’s start and end time from the first of these sources that yields a value, in order of precedence:

1. Provider’s “start” and “end” time attributes in their event data
2. Fuzzy string matching in the event summary for timestamps
3. Large Language Model parsing the incident report to determine start/end time
4. Provider’s publish time of the first and last message of the event
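The precedence above is a fallback chain: try each source in order and keep the first value found. The sketch below shows that pattern for start times only. Everything in it is an assumption made for illustration: the event dictionary layout (`start`, `summary`, `messages`, `published`), the function names, and the simple regular expression standing in for fuzzy matching. The LLM step is a stub that always returns nothing, and time zones are ignored.

```python
import re
from datetime import datetime
from typing import Callable, List, Optional

Resolver = Callable[[dict], Optional[datetime]]

# Stand-in for fuzzy matching: only finds "YYYY-MM-DD HH:MM" timestamps.
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}")


def provider_start_attribute(event: dict) -> Optional[datetime]:
    """Source 1: the provider's own start attribute, if the event has one."""
    return event.get("start")


def summary_timestamp(event: dict) -> Optional[datetime]:
    """Source 2: the first timestamp found in the event summary."""
    match = _TIMESTAMP.search(event.get("summary", ""))
    if match is None:
        return None
    return datetime.strptime(match.group(), "%Y-%m-%d %H:%M")


def llm_parsed_start(event: dict) -> Optional[datetime]:
    """Source 3: placeholder for LLM parsing; no model is called here."""
    return None


def first_publish_time(event: dict) -> Optional[datetime]:
    """Source 4: publish time of the earliest message on the event."""
    messages = event.get("messages", [])
    return min(m["published"] for m in messages) if messages else None


START_RESOLVERS: List[Resolver] = [
    provider_start_attribute,
    summary_timestamp,
    llm_parsed_start,
    first_publish_time,
]


def resolve_start(event: dict) -> Optional[datetime]:
    """Return the start time from the highest-precedence source that has one."""
    for resolver in START_RESOLVERS:
        value = resolver(event)
        if value is not None:
            return value
    return None


event = {
    "summary": "Incident began at 2023-07-29 22:55 and ended at 2023-08-29 22:00",
    "messages": [{"published": datetime(2023, 7, 30, 1, 0)}],
}
print(resolve_start(event))
```

Here the event has no start attribute, so the summary timestamp wins over the later publish time. An end-time chain would look the same, with the last timestamp and the last message instead of the first.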

2023 #

Which provider had the most incidents? #

GCP. See Model Biases above for how Events and Incidents are counted and how that skews this result.

Which event had the largest blast radius per provider? #

GCP LnvJwfYu3TCyUrcrP7yf, 468 impacted service/regions.
Azure XMGF-5Z0, 171 impacted service/regions.
AWS Lambda Service Event, 103 impacted service/regions.

Which geographies were impacted the most? #

GCP us-central1 (270 impacted service/regions)
Azure us-east-3 (48 impacted service/regions)
AWS us-east-1 (125 impacted service/regions)

What month had the most incidents, regardless of provider? #

June.

I was surprised by the data! I expected January to be the top month. My thinking was that cloud providers would release less aggressively during the holidays, leaving a large backlog of undeployed changes queued for January.

Month	Incident Count
01	481
02	770
03	715
04	635
05	328
06	1339
07	589
08	368
09	473
10	344
11	791
12	352
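To put the table in perspective, the short Python sketch below recomputes the busiest month, its share of the year's total, and where January ranks. The counts are copied from the table; the variable names are only illustrative.

```python
# Incident counts per month in 2023, copied from the table above.
monthly_incidents = {
    "01": 481, "02": 770, "03": 715, "04": 635, "05": 328, "06": 1339,
    "07": 589, "08": 368, "09": 473, "10": 344, "11": 791, "12": 352,
}

total = sum(monthly_incidents.values())
busiest = max(monthly_incidents, key=monthly_incidents.get)
share = monthly_incidents[busiest] / total

# Months ordered from most to fewest incidents.
ranking = sorted(monthly_incidents, key=monthly_incidents.get, reverse=True)
january_rank = ranking.index("01") + 1

print(f"Busiest month: {busiest} ({monthly_incidents[busiest]} of {total}, {share:.1%})")
print(f"January rank: {january_rank} of {len(ranking)}")
```

By this count June accounts for a bit under a fifth of the year's Incidents, and January sits in the middle of the ranking rather than at the top.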

What was each provider’s longest outage? #

The numbers here aren’t exact. See Event Start and End Times above for the methodology. Providers don’t always make this data easily available.

Hours	Provider
52	AWS
32	Azure
720	GCP

Not every provider makes it easy to directly link to an outage.

Longest AWS event in 2023

This obscure ARN is the only identifier I can find for this event:

arn:aws:health:us-east-1::event/S3_REPLICATION_TIME_CONTROL/AWS_S3_REPLICATION_TIME_CONTROL_OPERATIONAL_ISSUE/AWS_S3_REPLICATION_TIME_CONTROL_OPERATIONAL_ISSUE_229B4_6E9B3927D03

Summary:

Between October 17 11:00 PM PDT and October 20 3:40 AM PDT, S3 Replication Time Control (RTC) experienced delayed data replication from S3 buckets in the US-EAST-1 Region to US-EAST-1 itself and other AWS regions….

Longest Azure event in 2023

Azure Event VN11-JD8

Summary:

Between 20:19 UTC on 7 February 2023 and 04:30 UTC on 9 February 2023, a subset of customers with workloads hosted in the Southeast Asia and East Asia regions experienced difficulties accessing and managing resources deployed in these regions.

Longest GCP event in 2023

vwCTc91NtshzQyDzUnnc

Summary:

Chronicle customers in all regions using the SENTINEL_EDR default parser (product source: “SentinelOne EDR”) may have incorrect process enrichment results.

and

Incident began at 2023-07-29 22:55 and ended at 2023-08-29 22:00 (all times are US/Pacific).
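As a rough cross-check, the durations can be recomputed from the windows quoted in the three summaries. This is a sketch that assumes each quoted window is the one to measure. It uses naive datetimes, which is safe here only because each window starts and ends in the same time zone with no daylight saving change in between.

```python
from datetime import datetime

# Start and end of each window as quoted in the provider summaries above.
windows = {
    "AWS": (datetime(2023, 10, 17, 23, 0), datetime(2023, 10, 20, 3, 40)),   # PDT
    "Azure": (datetime(2023, 2, 7, 20, 19), datetime(2023, 2, 9, 4, 30)),    # UTC
    "GCP": (datetime(2023, 7, 29, 22, 55), datetime(2023, 8, 29, 22, 0)),    # US/Pacific
}


def duration_hours(start: datetime, end: datetime) -> float:
    """Length of a window in hours."""
    return (end - start).total_seconds() / 3600


for provider, (start, end) in windows.items():
    print(f"{provider}: {duration_hours(start, end):.1f} hours")
```

The AWS and Azure windows land close to the 52 and 32 hours in the table. The GCP window as quoted works out to roughly 743 hours rather than 720, so the table's figure presumably came from a different start/end source or from rounding, in line with the caveat that these numbers aren't exact.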

What’s next? #

~~Finished: Incident Durations~~

Service Maturity, Service Investments

What’s the relationship between the service’s age and incidents? What’s the relationship between a service’s feature press releases and its incidents?

Security Events, Conferences, Feature launches

Is there a relationship between provider incidents and security disclosures? I would enjoy seeing data to support or refute that.
Is there a relationship between provider conferences and incidents? (e.g., Google I/O, AWS re:Invent, Microsoft Ignite)
Is there a relationship between feature launches and incidents?

What questions would you ask if you had this information? #