We’ve experienced a degraded performance incident over the past couple of days that affected view counts and statistics, and reduced the performance of the helpdesk in general, for one of our shards. We want to share more details about the incident and how we’re planning to prevent these types of events in the future.
Timeline (time is in PST):
- June 27 (Saturday night) 00:00 - scheduled maintenance to upgrade one of our databases started, and ended successfully at 1AM without problems. We performed all the checks we thought were necessary to make sure the database cluster was in a consistent state.
- June 28 (Sunday afternoon) 14:00 - we discover that the Postgres replicas are in an inconsistent state due to a missing file. It's unclear how this happened, and we're still investigating the root cause.
- June 28 (Sunday afternoon) 15:00 - after failing to recover the missing file, we decided to rebuild the replicas from scratch, estimated to finish Monday, June 29 between 8 and 9AM - just in time for our peak helpdesk activity. The database is 9.1 TB, which is why the rebuild takes so long.
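To put the 9.1 TB figure in perspective, here is a back-of-the-envelope sketch of why a full rebuild takes most of a day. The ~150 MB/s sustained copy rate is an illustrative assumption, not a measured number from our infrastructure:

```python
# Rough estimate of full replica rebuild time for a 9.1 TB database.
# The sustained throughput below is a hypothetical figure for illustration.
DB_SIZE_BYTES = 9.1e12            # 9.1 TB
THROUGHPUT_BYTES_PER_S = 150e6    # assumed ~150 MB/s sustained copy rate

seconds = DB_SIZE_BYTES / THROUGHPUT_BYTES_PER_S
hours = seconds / 3600
print(f"Estimated rebuild time: {hours:.1f} hours")  # roughly 17 hours
```

Any network hiccup during a copy of this length forces a restart, which is exactly what happened here.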
- June 29 (Monday morning) 7:00 - Google Cloud Networking experiences an incident in our cluster's zone (us-east1-c), which puts our Kubernetes cluster, including our database, into a failed state; the replica rebuild is aborted.
- June 29 (Monday afternoon) 15:00 - we manage to recover our cluster and restart the replica rebuild, now estimated to take at least 24 hours because of the increased load.
- June 30 (Tuesday afternoon) 15:10 - full recovery of the replicas and return to normal operations. At this point the us-main-east1-c cluster had been running at degraded capacity (poor performance, statistics and view counts disabled) since June 28 at 15:00.
Next steps:
- Highest priority: split this shard's database into smaller shards to improve performance and reduce recovery times.
- Medium priority: run all our services across multiple availability zones. For the record: this is the first time in the 4 years we've been running on GCP that us-east1-c networking has gone down completely. It's an event we need to prepare for, even though it should be very rare - and the longer we delay, the bigger the impact will be.
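As a rough illustration of the sharding direction, routing could key on a stable identifier such as an account ID. All names and values here are hypothetical; this is a sketch of the general technique, not our actual routing code:

```python
# Minimal sketch of deterministic shard routing by account ID.
# NUM_SHARDS is a hypothetical value; real shard counts and key choice
# would depend on data distribution.
import hashlib

NUM_SHARDS = 8

def shard_for(account_id: int) -> int:
    """Map an account to a shard. Uses sha256 so the mapping is stable
    across processes (unlike Python's built-in hash, which is salted)."""
    digest = hashlib.sha256(str(account_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Each smaller shard holds only a fraction of the data, so rebuilding one
# replica takes roughly 1/NUM_SHARDS of the time a monolithic rebuild does.
```

The point of the split is exactly the recovery-time benefit noted above: a failure in one shard affects fewer customers, and its replicas can be rebuilt far faster than a single 9.1 TB database.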
We’re also hiring more site reliability engineers to work on the above problems and more.
I sincerely apologize for the trouble this has caused - we’re doing everything we can to prevent incidents like this in the future, and we take the availability of our systems seriously. Thank you again for your understanding.