The Gorgias platform is completely unavailable for a portion of our customers
Incident Report for Gorgias
Postmortem

Yesterday, May 17th, we had an incident from 12:05 PM to 1:10 PM PDT that took down the Helpdesk system and API in our us-east4 cluster. The underlying cause was the loss of one of our main message queue systems in this region. The queue system hit an edge case in which all of its nodes were down and unable to restart. Once the problem was identified, our engineers worked to restore the queue system to service.

We use this queue system for asynchronous message processing in our API and backend. While it was down, no new messages could be processed (although they continued to flow in and were saved in other queues), and the API could not respond to calls. The queue system took an hour to return to full operation, followed by some additional time to replay the messages received during the outage.
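To illustrate the buffer-then-replay behavior described above, here is a minimal Python sketch. It is not our production code: the class `BufferedPipeline`, its attributes, and the in-memory `deque` are hypothetical stand-ins for the intake queues and the main queue system.

```python
from collections import deque


class BufferedPipeline:
    """Hypothetical sketch: an intake buffer in front of a main queue that may be down."""

    def __init__(self):
        self.intake = deque()       # stands in for the "other queues" that kept accepting messages
        self.processed = []         # messages handled once the main queue system is available
        self.main_queue_up = True   # assumption: a simple flag models queue availability

    def receive(self, message):
        # Always save the inbound message first, so nothing is lost during an outage.
        self.intake.append(message)
        if self.main_queue_up:
            self.drain()

    def drain(self):
        # Replay buffered messages in arrival order while the main queue is up.
        while self.intake and self.main_queue_up:
            self.processed.append(self.intake.popleft())


pipeline = BufferedPipeline()
pipeline.receive("m1")           # processed immediately
pipeline.main_queue_up = False   # outage begins
pipeline.receive("m2")           # saved, not processed
pipeline.receive("m3")           # saved, not processed
pipeline.main_queue_up = True    # queue system restored
pipeline.drain()                 # replay everything received during the outage
```

In this sketch, messages that arrive during the outage stay in the intake buffer and are processed in their original order after recovery, which is why processing was delayed rather than lost.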

Customers on this cluster were unable to log in and access their Helpdesk, and message processing was delayed. We understand that this is unacceptable and that every second of uptime on our platform is critical to our customers' business.

What are our future mitigation plans?

We are actively working on improving the resiliency of our queue system deployments by:

  • Fine-tuning configuration
  • Installing upgrades for stability improvements
  • Adding more metrics and alarms
  • Preparing more training exercises for on-call staff to improve incident resolution times

On behalf of the team at Gorgias, please accept our sincerest apology for this incident. We’ll be working hard to improve stability and earn the trust you put in us.

The SRE team @ Gorgias.

Posted May 18, 2021 - 18:27 PDT

Resolved
This incident has been resolved.
Posted May 17, 2021 - 14:57 PDT
Monitoring
A fix has been implemented and we are monitoring the results. We are currently importing and sending messages that were received and sent during the incident. No message will be lost.
Posted May 17, 2021 - 13:14 PDT
Investigating
The Gorgias platform is completely unavailable for a portion of our customers. We're currently investigating.
Posted May 17, 2021 - 12:30 PDT
This incident affected: Helpdesk (REST API, Web App, Mobile Apps) and Helpdesk Clusters (us-east4-65cd).