Yesterday, May 17th, we had an incident from 12:05 PM to 1:10 PM PST during which the Helpdesk system and API were down in our us-east4 cluster. The outage was caused by the loss of one of our main message queue systems in that region: the queue system was stuck in an edge case where all of its nodes were down and unable to restart. Once the problem was identified, our engineers worked hard to restore the queue system to service.
Because we use this queue system to handle asynchronous message processing in our API and backend, its loss meant that no new messages could be processed (although they continued to flow in and were saved in other queues) and no API calls could be answered. The queue system took an hour to return to full operation, and then some additional time to replay the messages received during the outage.
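Conceptually, this buffer-and-replay behavior works like the sketch below. The names here (`primary_publish`, `FALLBACK_DIR`) are illustrative placeholders, not our actual implementation:

```python
import json
import time
from pathlib import Path

# Illustrative sketch only: a durable local buffer that absorbs messages
# while the primary queue is down, then replays them after recovery.
FALLBACK_DIR = Path("/var/spool/fallback-queue")  # assumed durable location
PRIMARY_UP = False  # flip to True once the broker is restored

def primary_publish(message: dict) -> None:
    """Stand-in for the real queue client; fails while the broker is down."""
    if not PRIMARY_UP:
        raise ConnectionError("primary queue unavailable")
    print("published:", message)

def publish(message: dict) -> None:
    """Try the primary queue; on failure, persist to the fallback buffer."""
    try:
        primary_publish(message)
    except ConnectionError:
        FALLBACK_DIR.mkdir(parents=True, exist_ok=True)
        (FALLBACK_DIR / f"{time.time_ns()}.json").write_text(json.dumps(message))

def replay_fallback() -> None:
    """After recovery, drain buffered messages in arrival order."""
    for path in sorted(FALLBACK_DIR.glob("*.json")):
        primary_publish(json.loads(path.read_text()))
        path.unlink()  # delete only after a successful re-publish
```

In this pattern, incoming messages keep landing in the fallback buffer for the duration of the outage, which is why recovery required extra time after the broker itself was healthy again.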
For customers on this cluster, the effect was an inability to log in and access their Helpdesk, along with delayed message processing. We understand this is unacceptable and that every second of uptime on our platform is critical to our customers' business.
We are actively working on improving the resiliency of our queue system deployments.
On behalf of the team at Gorgias, please accept our sincerest apologies for this incident. We will be working hard to improve the stability of our platform and to earn the trust you put in us.
The SRE team @ Gorgias.