Partial outage in one of our clusters

Incident Report for Gorgias

Postmortem

Due to increased traffic, one of our clusters became unresponsive. Specifically, a queue system that we use to keep Gorgias in sync with Shopify, Facebook, etc. went down and ended up in a state we could not recover. Unfortunately, we had to remove the entire queue storage and rebuild it from scratch, this time with a lot more resources to handle the growth in traffic that we’re experiencing.

Fortunately, we have a way to replay all of the events that arrived during the downtime (Facebook, Shopify, live chat, etc.), and that’s what we did. There should be no missing updates or messages from the downtime or before it.

We’re also taking steps to set up more monitoring and alerts to prevent this kind of outage in the future.

Posted May 26, 2020 - 18:25 UTC

Resolved

This incident has been resolved.

Posted May 26, 2020 - 18:19 UTC

Monitoring

We've implemented a fix and we're replaying past updates from external services. No data should be lost in the process.

Posted May 26, 2020 - 18:09 UTC

Identified

One of our clusters is currently experiencing downtime. We're aware of the problem and are working on mitigation.

Posted May 26, 2020 - 18:02 UTC

This incident affected: Helpdesk (REST API, Web App, Mobile Apps).