Helpdesk is not accessible for some users
Incident Report for Gorgias
Postmortem

Incident Timeline

Starting at 2:22 PM UTC, we noticed a major performance degradation in our primary database in one of our datacenters. A fifth of our customers started to experience very long page loads.

At 2:37 PM UTC, we decided to shut down the database to run an emergency maintenance operation, hoping it would restore performance. We put the whole cluster into maintenance mode and displayed the maintenance page we usually use for planned maintenance, although this one was not planned.

The database maintenance operation (technically called “vacuum”) took a long time because of the large amount of data. The initial settings were also suboptimal, so we canceled it midway twice, after estimating that each of the first two attempts would take too long.

At 4:28 PM UTC, after the maintenance operation had made enough progress (it runs in several “phases”), we progressively restored services, starting with the non-client-facing workloads.

At 5:25 PM UTC, after we observed that database performance was nominal, we removed the maintenance page to let customers access their Helpdesk.

At 5:28 PM UTC, the DB maintenance operation finished.

After that, we closely monitored the platform and its response times and concluded that the service was fully restored.

Incident Impact

For 15 minutes, a fifth of our customers experienced very slow Helpdesk page loads. Then, for the same customers, the Helpdesk was completely unavailable for 2 hours 48 minutes.
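
The impact durations follow directly from the timeline above. As a quick check, here is a minimal sketch that recomputes them from the reported UTC timestamps; the date is taken from the status updates, and the variable and helper names are illustrative only.

```python
from datetime import datetime, timedelta

# Incident date, taken from the status update timestamps (May 17, 2022).
DAY = "2022-05-17"


def at(hhmm: str) -> datetime:
    """Build a datetime on the incident day from a 24-hour UTC time."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def fmt(delta: timedelta) -> str:
    """Format a duration as 'XhYYmin', or 'Ymin' when under an hour."""
    minutes = int(delta.total_seconds()) // 60
    hours, mins = divmod(minutes, 60)
    return f"{hours}h{mins:02d}min" if hours else f"{mins}min"


# Timestamps from the incident timeline (all UTC).
degradation_start = at("14:22")         # performance degradation noticed
maintenance_start = at("14:37")         # database shut down, maintenance page shown
maintenance_page_removed = at("17:25")  # customers regain access

slow_period = maintenance_start - degradation_start
unavailable_period = maintenance_page_removed - maintenance_start

print("Slow page loads:", fmt(slow_period))
print("Unavailable:", fmt(unavailable_period))
```

The unavailability window is measured up to the removal of the maintenance page, not the end of the maintenance operation itself, which finished three minutes later.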

Action Items

  • Run more frequent DB maintenance operations to keep a constant level of performance without having to shut down the database. Our current maintenance procedures did not scale with the growth of our system. We’re revising them to support the current load, and we’ll schedule regular reviews to revise them again as the system grows.
  • Change the maintenance page message so that it does not say the maintenance was planned when it was not.
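
As an illustration of the first action item, one way to run maintenance more often is to flag tables before they accumulate too much dead data. The sketch below is hypothetical and is not our actual procedure: it assumes a PostgreSQL database (this report does not name the database engine), the thresholds are made-up values, and the table names in the usage example are invented.

```python
from dataclasses import dataclass

# Assumption: PostgreSQL. pg_stat_user_tables is its standard statistics view.
# The query is shown for context only and is not executed in this sketch.
DEAD_TUPLE_QUERY = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
"""


@dataclass
class TableStats:
    name: str
    live_rows: int
    dead_rows: int


def tables_needing_vacuum(stats, max_dead_ratio=0.1, min_dead_rows=1000):
    """Return names of tables whose share of dead rows exceeds the limits.

    Both limits are hypothetical defaults chosen for illustration.
    Tables are returned worst first (highest dead-row ratio).
    """
    flagged = []
    for table in stats:
        total = table.live_rows + table.dead_rows
        # Skip empty tables and tables with too few dead rows to matter.
        if total == 0 or table.dead_rows < min_dead_rows:
            continue
        ratio = table.dead_rows / total
        if ratio > max_dead_ratio:
            flagged.append((ratio, table.name))
    return [name for _, name in sorted(flagged, reverse=True)]


# Usage with invented numbers; real values would come from the query above.
sample = [
    TableStats("tickets", live_rows=900_000, dead_rows=300_000),
    TableStats("users", live_rows=50_000, dead_rows=200),
    TableStats("events", live_rows=1_000_000, dead_rows=50_000),
    TableStats("messages", live_rows=400_000, dead_rows=100_000),
]
print(tables_needing_vacuum(sample))
```

Running a check like this on a schedule would let maintenance target a few tables at a time while the database stays online, instead of requiring one long operation.
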
Posted May 18, 2022 - 12:17 PDT

Resolved
This incident has been resolved.
Posted May 17, 2022 - 10:51 PDT
Update
We are continuing to investigate this issue.

We're waiting on the completion of a maintenance task that should allow us to restore service.
Posted May 17, 2022 - 09:13 PDT
Update
We're still investigating this incident.
We hope to get this resolved in the next hour.

We will share further updates as soon as available.
Posted May 17, 2022 - 08:12 PDT
Investigating
We are currently investigating this issue.
Posted May 17, 2022 - 07:35 PDT
This incident affected: Helpdesk Clusters (us-east1-2607).