Partial outage in one of our clusters
Incident Report for Gorgias
Postmortem

Due to increased traffic, one of our clusters became unresponsive. Specifically, a queue system that we use to keep Gorgias in sync with Shopify, Facebook, and other services went down and was left in a state we could not recover it from. Unfortunately, we had to remove the entire queue storage and rebuild it from scratch, this time with a lot more resources to handle the growth in traffic that we’re experiencing.

Fortunately, we have a way to replay all of the events that arrived during the downtime (Facebook, Shopify, live chat, etc.), so that’s what we did. There should be no missing updates or messages from the downtime or before it.

We’re also taking steps to set up more monitoring and alerts to prevent this kind of outage in the future.

Posted May 26, 2020 - 11:25 PDT

Resolved
This incident has been resolved.
Posted May 26, 2020 - 11:19 PDT
Monitoring
We've implemented a fix and we're replaying past updates from external services. No data should be lost in the process.
Posted May 26, 2020 - 11:09 PDT
Identified
One of our clusters is currently experiencing downtime. We're aware of the problem and working on mitigation.
Posted May 26, 2020 - 11:02 PDT
This incident affected: Helpdesk (REST API, Web App, Mobile Apps).