Database issues

Incident Report for Gorgias

Postmortem

As you know, we had some stability issues with the Gorgias helpdesk over the past couple of days. The issue is now resolved, but it’s worth giving more detail about what happened and what we’re doing to prevent it in the future.

This past Sunday (June 7th) at 0:45 AM PST we started a scheduled maintenance to upgrade to the latest stable version of Postgres 12. Postgres is one of our main stores of state. Major version upgrades in Postgres require some downtime, and at our scale they take about 1 hour in total.

Before doing this upgrade, we had tested it a couple of times against a backup image, and everything worked as expected in our tests.

So what happened?

When we started the actual upgrade on one of our clusters (a cluster represents a group of helpdesk accounts), we encountered a bug mentioned here that occurs only in a specific scenario, one we had not accounted for in the real upgrade. That led to a failed upgrade of our main database and to failed database replicas, which we use for features like view counts, statistics, etc. We managed to recover the main database without data loss, but the bad news is that it takes a very long time to recover lost replicas, so we were running under capacity for about 19 hours on Monday until we recovered them and re-enabled all the affected features.

Mitigation

Given this new information and our ability to reproduce the failed upgrade, we changed the way we test major version upgrades so that the test matches the real environment as closely as possible. If there’s a problem with an upgrade, we’re now more likely to find it before we do the maintenance. We still want to upgrade to version 12 because it brings good performance gains that will make the helpdesk faster for everybody.

We also hired one more person on the SRE team (they started this Monday) and a Postgres expert to help us prevent future issues with our database and infrastructure in general. We plan on hiring more people to work on these issues before the end of the year. Special thanks to Laurenz Albe from Cybertec for helping us during this time.

Conclusion

Downtime is really painful, especially during peak times on a Monday. The team and I realize this, and we worked around the clock to fix the incident as fast as we could. We also understand that if you cannot communicate with customers you’re losing business, and that’s really bad, especially in the current economic climate. I apologize for the trouble caused. We’re dedicating a big part of the next quarter to stability improvements and putting a lot more effort into addressing the current issues. Stability is not only about availability, but also about fixing bugs and improving performance.

We’re also offering a 15% discount for this month to customers who were affected by this partial outage.

Once again, I apologize for the trouble this incident caused, and I hope that we still have your trust in the future of our platform.

Posted Jun 10, 2020 - 18:16 UTC

Resolved

This incident has been resolved.

Posted Jun 07, 2020 - 10:21 UTC

Update

We are continuing to monitor for any further issues.

Posted Jun 07, 2020 - 10:18 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jun 07, 2020 - 10:17 UTC

Investigating

The platform is currently down due to a database issue. We are investigating and will keep you updated.

Posted Jun 07, 2020 - 08:58 UTC