Delayed messages and customer data creation in the Helpdesk
Incident Report for Gorgias
Postmortem

What appears to have been a larger-than-usual burst of incoming requests from one of our partners exhausted the CPUs on all of our incoming workers. The saturated workers could no longer respond to health checks, which in turn caused the system to kill and respawn them.

With workers either running at 100% capacity or being killed by the orchestrator, the entire incoming service was practically unusable.

We fixed the issue by scaling up the number of workers and adjusting the parameters used to check their health. We plan to make these changes permanent, overprovisioning the incoming subsystem so that it handles large bursts better.
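
To illustrate why health-check parameters matter when workers are CPU-bound, here is a minimal sketch in Python. It is not our orchestrator's actual logic: the function name `should_restart`, the `timeout_s` and `failure_threshold` parameters, and the latency values are all hypothetical, chosen only to show how a stricter probe timeout can turn a slow but alive worker into one that gets restarted.

```python
from typing import Sequence


def should_restart(
    latencies_s: Sequence[float], timeout_s: float, failure_threshold: int
) -> bool:
    """Return True if the probe fails `failure_threshold` times in a row.

    A probe counts as failed when the worker takes longer than `timeout_s`
    to answer. This mirrors a common liveness-check policy; it is an
    assumption, not a description of any specific orchestrator.
    """
    consecutive_failures = 0
    for latency in latencies_s:
        if latency > timeout_s:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
        else:
            # Any successful probe resets the failure streak.
            consecutive_failures = 0
    return False


# Hypothetical probe latencies (seconds) for a worker pinned at 100% CPU.
burst_latencies = [0.2, 1.5, 2.4, 2.8, 1.9, 0.4]

# A strict timeout restarts the worker in the middle of the burst.
strict = should_restart(burst_latencies, timeout_s=1.0, failure_threshold=3)

# A more tolerant timeout lets the same worker ride out the burst.
relaxed = should_restart(burst_latencies, timeout_s=3.0, failure_threshold=3)
```

With these made-up numbers, the strict setting triggers a restart and the relaxed one does not. Restarting a saturated worker removes capacity exactly when it is needed most, which is why loosening the check and adding workers go together.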

Posted Feb 24, 2022 - 08:57 PST

Resolved
This incident has been resolved.
Posted Feb 23, 2022 - 14:56 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 23, 2022 - 14:52 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Feb 23, 2022 - 14:27 PST
Update
We are continuing to investigate this issue.
Posted Feb 23, 2022 - 14:23 PST
Investigating
We are currently investigating this issue.
Posted Feb 23, 2022 - 14:23 PST
This incident affected: Helpdesk Integrations (Email, Mailgun inbound email, Gmail, Outlook, Live Chat, Smooch Core API, Facebook Posts & Comments, Instagram comments, Shopify integration, Facebook Messenger, ReCharge integration, Native Phone, Instagram Direct Messages, Yotpo).