Increased latency and error rate

Incident Report for Gorgias

Postmortem

Yesterday one of our regions experienced high latency and an elevated error rate for about an hour. Database connection usage in the region had been running quite high during peak times, and yesterday it reached a boiling point: the available connections were exhausted and held by our application. We have connection pools and proxies to prevent this situation, but an edge case in our timeouts caused a lockup.

Although we were able to free the connection resources that were temporarily stuck, they had been stuck long enough that considerable back pressure had built up, along with a backlog of messages waiting to be processed. Working through that backlog took another hour before performance returned to acceptable levels.
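For readers less familiar with this failure mode, the following is a minimal, illustrative sketch of a bounded connection pool with an acquisition timeout. It is not our production code: the names `BoundedPool`, `PoolExhausted`, `make_conn`, and `acquire_timeout`, as well as the pool size and timeout values, are assumptions made for the example. It shows the general idea that a caller who cannot get a connection within the timeout fails fast instead of waiting indefinitely.

```python
import queue
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the acquire timeout."""


class BoundedPool:
    """Hypothetical fixed-size pool; connections are handed out from a queue."""

    def __init__(self, make_conn, size, acquire_timeout):
        self._free = queue.Queue(maxsize=size)
        self._acquire_timeout = acquire_timeout
        for _ in range(size):
            self._free.put(make_conn())

    @contextmanager
    def connection(self):
        # Wait a bounded amount of time for a free connection.
        try:
            conn = self._free.get(timeout=self._acquire_timeout)
        except queue.Empty:
            raise PoolExhausted("no free connection") from None
        try:
            yield conn
        finally:
            # Always return the connection, even if the caller raised.
            self._free.put(conn)


# Illustrative values only: two connections, 50 ms acquire timeout.
pool = BoundedPool(make_conn=object, size=2, acquire_timeout=0.05)

with pool.connection() as a, pool.connection() as b:
    # Both connections are held, so a third caller times out.
    try:
        with pool.connection():
            third_acquired = True
    except PoolExhausted:
        third_acquired = False

# Once the holders release their connections, acquisition works again.
with pool.connection() as c:
    reacquired = True
```

In a sketch like this, a lockup can still occur if callers hold connections without ever releasing them, which is why timeouts on both acquisition and use matter.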

Learnings:

We need to spot problems of this sort faster, so that we can respond sooner and shorten the time it takes to recover completely.
We are now looking at ways to measure capacity limits at multiple levels of our infrastructure within each region, so that we can continue to offer steady performance during peak hours.
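As a rough illustration of what measuring a capacity limit can look like, here is a small sketch that compares active database connections against the configured maximum and flags when utilization crosses a warning threshold. The function name `connection_headroom`, the 80% threshold, and the sample numbers are hypothetical and chosen only for the example; they are not our actual limits or alerting rules.

```python
def connection_headroom(active, max_connections, warn_ratio=0.8):
    """Return (utilization, should_warn) for a connection limit.

    All parameters are illustrative assumptions, not real settings.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    utilization = active / max_connections
    return utilization, utilization >= warn_ratio


# Hypothetical readings: quiet period vs. peak period.
quiet_util, quiet_warn = connection_headroom(active=50, max_connections=100)
peak_util, peak_warn = connection_headroom(active=90, max_connections=100)
```

A check of this kind could be applied at each layer that has its own limit (application pool, proxy, database), so that the tightest limit is visible before it is reached.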

Actions:

Last night we performed maintenance on our database to increase the maximum number of available connections, to stop the system from getting stuck under high load.
We are continuing to roll out performance improvements in the application daily.

We are sorry for any inconvenience this incident has caused. Please know that we’re working hard every day to provide a performant solution your business can rely on.

Thanks

The SRE Team @ Gorgias

Posted Jul 22, 2021 - 20:50 UTC

Resolved

This incident has been resolved.

Posted Jul 21, 2021 - 21:53 UTC

Update

We are continuing to monitor the situation. Our short-term fixes have helped the platform regain stability, and we are planning further improvements this evening. Thank you for your patience.

Posted Jul 21, 2021 - 19:09 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jul 21, 2021 - 17:56 UTC

Investigating

We are currently investigating an increase in latency and error rate in our us-east1 region.

Posted Jul 21, 2021 - 17:13 UTC

This incident affected: Helpdesk (REST API, Web App, Mobile Apps) and Helpdesk Clusters (us-east1-2607).