As part of Gorgias's continuous effort to proactively keep systems secure and up to date, the engineering team performed a scheduled update to the component responsible for seamless communication between the Web App (the interface agents use daily) and our backend servers.
This component's main responsibility is the live update of information such as Ticket counts, Agent typing activity, Agent Ticket presence, Agent availability, and Ticket routing and assignment.
These routine updates contribute to our commitment to data security and risk prevention. Our customers' data security is one of Gorgias's main priorities and the reason we maintain SOC 2 compliance and perform regular audits.
On July 19, 2023, the update was performed on a subset of servers first. After careful monitoring and testing, it was extended to all customers.
On July 20, 2023, we observed higher traffic due to the update release, a common occurrence. We continued monitoring for any anomalies, as it typically takes some time for a new app version to see adoption across all customers. We expected the traffic to eventually plateau.
Some users reported temporary issues, but these were resolved through simple actions such as refreshing the page, clearing the cache, or closing tabs.
On July 21, 2023, we saw an influx of customer reports indicating issues with the Helpdesk, and it became apparent that a deeper investigation was needed. Reverting the update was risky and could have created even more disruption, so we opted to fix the version in use.
An engineering task force managed the situation, monitoring all apps and services without interruption between July 23 and July 25, 2023. During this time, the team was tasked with mitigating any issues, manually adjusting the connection load, and identifying the source of the increased traffic. The mitigations alleviated some pressure on the system and allowed Agents to keep using Helpdesk features, although some remained in a degraded state. Meanwhile, we continued searching for a fix for the root cause.
On July 26, 2023, the Gorgias engineering team met with the maintainers of the component library that had been upgraded. It became evident that the library contained changes that had been neither documented nor correctly evaluated. Thanks to their collaboration, the Gorgias team was finally able to identify the root causes. Hotfixes were applied and deployed immediately. This alleviated the issues for the majority of customers.
Traffic remained high on July 27, 2023, and we took steps to block traffic originating from a select number of accounts that were using outdated versions of our web app. Affected customers were notified, and this helped bring down the system pressure even further.
On July 31, 2023, the main author of the third-party component library shared a patch after identifying issues.
On August 1, 2023, the web app was patched and was no longer putting our backend infrastructure at risk.
On August 2, 2023, the team identified the remaining cause of some customer reports that were still coming in. It was fixed, and within a couple of minutes all issues were resolved and services were restored.
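For approximately eight days, our customers experienced visible issues with live updates related to the features described earlier: Ticket counts, Agent typing activity, Agent Ticket presence, Agent availability, and Ticket routing and assignment.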
During the first two days, users with more than one Gorgias tab open or who didn’t refresh their page were not seeing updates to the above features.
During the following four days, users observed high latency when receiving notifications or were affected by intermittent issues.
Issues during the last two days were primarily observed by users who had Gorgias tabs open when resuming work after their computers had gone to sleep.
No data was lost during this incident.