Helpdesk unreachable

Incident Report for Gorgias

Postmortem

Incident Timeline

Starting at 4:28pm UTC, we noticed a degradation in performance or downtime for some of our customers, depending on the datacenter.

At 4:35pm UTC, we decided to revert our last deployment, causing downtimes to end.

At 4:45pm UTC, the deployment revert was finished and performance was back to normal.

After investigation, we found out that the root cause of the issue was a change introduced in the latest deployment, causing our load balancers to stop routing traffic to the helpdesk. The issue was not caught during pre-deployment because of configuration differences between the deployment health checks and production checks.

Incident Impact

For up to 20 minutes, some of our customers experienced degraded Helpdesk performance. For other customers, the Helpdesk was completely unavailable for up to 10 minutes.

Action Items

Ensure the deployment and production health checks are in sync
Add more checks on our pre-deployments to detect issues before they reach the production

Posted Jun 30, 2022 - 04:58 PDT

Resolved

An invalid network configuration has been applied on our infrastructure resulting in an issue routing traffic. This issue was identified and reverted immediately. More details will be communicated later on.

Posted Jun 29, 2022 - 09:30 PDT