Due to increased traffic, one of our clusters became unresponsive. Specifically, the queue system that we’re using to keep Gorgias in sync with Shopify, Facebook, etc. went down and was left in a broken state. Unfortunately, we had to remove the entire queue storage and rebuild it from scratch, this time with a lot more resources to handle the growth in traffic that we’re experiencing.
Fortunately, we have a way to replay all of the events that arrived during the downtime (Facebook, Shopify, live chat, etc.), so that’s what we did. There should be no missing updates or messages from during the downtime or before it.
We’re also taking steps to set up more monitoring and alerts to prevent this kind of outage in the future.