Starting at 2:22 PM UTC, we noticed a major performance degradation on our primary database in one of our datacenters. A fifth of our customers started to experience very long page loads.
At 2:37 PM UTC we decided to shut down the database and run an emergency maintenance operation in the hope of restoring performance. We put the whole cluster into maintenance mode and displayed the maintenance page we normally use for planned maintenance, even though this one was unplanned.
The database maintenance operation (technically called a “vacuum”) took a long time because of the large amount of data involved, and our initial settings were suboptimal. We cancelled it midway twice, after estimating that the first two attempts would take too long to complete.
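For context, “vacuum” is PostgreSQL terminology, and PostgreSQL reports a running vacuum’s current phase and heap progress through the pg_stat_progress_vacuum system view, which is how this kind of “will it finish in time” estimate can be made. The sketch below is a minimal, hypothetical monitoring script, not our actual tooling; the connection string and the choice of the psycopg2 driver are illustrative assumptions.

```python
import time
import psycopg2  # assumed driver; any PostgreSQL client works the same way

# Hypothetical connection string -- substitute real credentials.
DSN = "host=db-primary dbname=helpdesk user=ops"

def watch_vacuum(poll_seconds: int = 30) -> None:
    """Poll pg_stat_progress_vacuum and print per-table phase and progress."""
    conn = psycopg2.connect(DSN)
    conn.autocommit = True  # each poll reads fresh progress data
    try:
        with conn.cursor() as cur:
            while True:
                cur.execute(
                    "SELECT relid::regclass, phase, "
                    "       heap_blks_scanned, heap_blks_total "
                    "FROM pg_stat_progress_vacuum"
                )
                rows = cur.fetchall()
                if not rows:
                    print("no VACUUM currently running")
                    return
                for table, phase, scanned, total in rows:
                    pct = 100.0 * scanned / total if total else 0.0
                    print(f"{table}: {phase} -- {pct:.1f}% of heap scanned")
                time.sleep(poll_seconds)
    finally:
        conn.close()

if __name__ == "__main__":
    watch_vacuum()
```

The phase column reports steps such as “scanning heap”, “vacuuming indexes” and “vacuuming heap”, which is the kind of phase information referred to below.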
At 4:28 PM UTC, once the maintenance operation had made enough progress (it runs in several “phases”), we progressively restored services, starting with non-client-facing workloads.
At 5:25 PM UTC, after observing that database performance was nominal, we removed the maintenance page to let customers access their Helpdesk.
At 5:28 PM UTC, the database maintenance operation finished.
After that, we closely monitored the platform and its response times and concluded that service was fully restored.
For 15 minutes, a fifth of our customers experienced very slow Helpdesk page loads. Then, for those same customers, the Helpdesk was completely unavailable for 2 hours and 48 minutes.