Delayed messages and customer data creation in the Helpdesk

Incident Report for Gorgias

Postmortem

What appears to have been a larger-than-usual burst of incoming requests from one of our partners exhausted the CPU on all of our incoming workers. The workers became unable to respond to health checks, which in turn caused the orchestrator to kill and respawn them.

With workers either running at 100% CPU or being killed by the orchestrator, the entire incoming service was practically unusable.

We fixed the issue by scaling up the number of workers and adjusting the parameters used to check their health. We plan to make these changes permanent, overprovisioning the incoming subsystem so that it handles large bursts better.
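
To illustrate why health check parameters matter under load, here is a minimal sketch of a liveness-style check. It is not the actual orchestrator configuration: the function name `would_restart`, the probe response times, and the timeout and threshold values are all hypothetical and chosen only for illustration.

```python
def would_restart(response_times, timeout, failure_threshold):
    """Return True if a worker would be killed by a liveness-style check.

    A probe fails when the worker takes longer than `timeout` seconds to
    answer; the worker is restarted after `failure_threshold` consecutive
    failures. All names and values here are hypothetical.
    """
    consecutive_failures = 0
    for elapsed in response_times:
        if elapsed > timeout:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
        else:
            # Any successful probe resets the failure streak.
            consecutive_failures = 0
    return False


# Hypothetical probe response times (seconds) for a CPU-saturated worker
# that answers slowly during a burst and then recovers.
burst = [0.2, 3.5, 4.0, 3.8, 0.9, 0.3]

strict = would_restart(burst, timeout=1.0, failure_threshold=3)
relaxed = would_restart(burst, timeout=5.0, failure_threshold=3)
print(strict, relaxed)  # prints "True False"
```

With the strict timeout, a worker that is merely slow is killed mid-burst, which removes capacity exactly when it is needed most. A more lenient timeout lets the worker ride out the burst, at the cost of detecting a genuinely dead worker more slowly.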

Posted Feb 24, 2022 - 16:57 UTC

Resolved

This incident has been resolved.

Posted Feb 23, 2022 - 22:56 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 23, 2022 - 22:52 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 23, 2022 - 22:27 UTC

Update

We are continuing to investigate this issue.

Posted Feb 23, 2022 - 22:23 UTC

Investigating

We are currently investigating this issue.

Posted Feb 23, 2022 - 22:23 UTC

This incident affected: Helpdesk Integrations (Email, Mailgun inbound email, Gmail, Outlook, Live Chat, Smooch Core API, Facebook Posts & Comments, Instagram comments, Shopify integration, Facebook Messenger, ReCharge integration, Native Phone, Instagram Direct Messages, Yotpo).