We’ve experienced a degraded performance incident over the past couple of days that affected view counts and statistics, and reduced the performance of the helpdesk in general, for one of our shards. We want to share more details about the incident and how we’re planning to prevent these types of events in the future.
Timeline (time is in PST):
- June 27 (Saturday night) 00:00 - scheduled maintenance to upgrade one of our databases started, and ended successfully at 1AM without problems. We performed all the checks we thought were necessary to make sure the database cluster was in a consistent state.
- June 28 (Sunday afternoon) 14:00 - we discover that the Postgres replicas are in an inconsistent state due to a missing file. It's unclear how this happened, and we're still investigating the root cause.
- June 28 (Sunday afternoon) 15:00 - after failing to recover the missing file, we decided to rebuild the replicas from scratch, estimated to finish Monday, June 29 between 8 and 9AM - just in time for our peak helpdesk activity. The database is 9.1 TB, which is why the rebuild takes so long.
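To put the 9.1 TB figure in perspective, here is a back-of-the-envelope sketch of why a full rebuild takes most of a day. The ~150 MB/s sustained copy rate is an illustrative assumption, not a measured number from our infrastructure:

```python
# Rough estimate of full replica rebuild time for a 9.1 TB database.
# The sustained throughput below is a hypothetical figure for illustration.
DB_SIZE_BYTES = 9.1e12            # 9.1 TB
THROUGHPUT_BYTES_PER_S = 150e6    # assumed ~150 MB/s sustained copy rate

seconds = DB_SIZE_BYTES / THROUGHPUT_BYTES_PER_S
hours = seconds / 3600
print(f"Estimated rebuild time: {hours:.1f} hours")  # roughly 17 hours
```

Any network hiccup during a copy of this length forces a restart, which is exactly what happened here.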
- June 29 (Monday morning) 7:00 - Google Cloud Networking experiences an incident in our cluster's zone (us-east1-c), which puts our Kubernetes cluster, including our database, into a failed state; the replica rebuild is aborted.
- June 29 (Monday afternoon) 15:00 - we manage to recover our cluster and restart the replica rebuild, now estimated to take at least 24 hours because of the increased load.
- June 30 (Tuesday afternoon) 15:10 - full recovery of the replicas and return to normal operations. At this point the us-main-east1-c cluster had been running at degraded capacity (poor performance, statistics and view counts disabled) since June 28 at 15:00.
Next steps:
- Highest priority: split this shard's database into smaller shards to improve performance and reduce recovery times.
- Medium priority: run all our services across multiple availability zones. For the record: this is the first time in the 4 years we've been running on GCP that us-east1-c networking has gone down completely. It's an event we need to prepare for, even though it should be very rare - and the longer we delay, the bigger the impact will be.
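As a rough illustration of the sharding direction, routing could key on a stable identifier such as an account ID. All names and values here are hypothetical; this is a sketch of the general technique, not our actual routing code:

```python
# Minimal sketch of deterministic shard routing by account ID.
# NUM_SHARDS is a hypothetical value; real shard counts and key choice
# would depend on data distribution.
import hashlib

NUM_SHARDS = 8

def shard_for(account_id: int) -> int:
    """Map an account to a shard. Uses sha256 so the mapping is stable
    across processes (unlike Python's built-in hash, which is salted)."""
    digest = hashlib.sha256(str(account_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Each smaller shard holds only a fraction of the data, so rebuilding one
# replica takes roughly 1/NUM_SHARDS of the time a monolithic rebuild does.
```

The point of the split is exactly the recovery-time benefit noted above: a failure in one shard affects fewer customers, and its replicas can be rebuilt far faster than a single 9.1 TB database.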
We’re also hiring more site reliability engineers to work on the above problems and more.
I sincerely apologize for the trouble this has caused - we’re doing everything we can to prevent incidents like this in the future, and we take the availability of our systems seriously. Thank you again for your understanding.