View counts, agent status and general notification issues
Incident Report for Gorgias
Postmortem

Incident Context

As part of Gorgias's continuous effort to proactively keep systems secure and up to date, the engineering team performed a scheduled update to the component responsible for seamless communication between the Web App (the interface agents use daily) and our backend servers.

The main responsibility of this component is the live update of information such as Ticket counts, Agent typing activity, Agent Ticket presence, Agent availability, and Ticket routing - assignment.

These routine updates contribute to our commitment to data security and risk prevention. Our customers' data security is one of the main priorities of Gorgias and the reason we comply with SOC-2 certification and perform regular audits.

Incident Timeline

On July 19, 2023, the update was performed on a subset of servers first. After careful monitoring and testing, it was extended to all customers.

On July 20, 2023, we observed higher traffic due to the update release, a common occurrence. We continued monitoring for any anomalies, as it typically takes some time for a new app version to see adoption across all customers. We expected the traffic to eventually plateau.

Some users reported temporary issues, but these were resolved through simple actions such as refreshing the page, clearing the cache, or closing tabs.

On July 21, 2023, we saw an influx of customer reports indicating issues with the Helpdesk, and it became apparent that a deeper investigation was needed. Reverting the update was risky and could have created even more disruption, so we opted to fix the version in use.

An engineering task force managed the situation, monitoring all apps and services without interruption between July 23 and July 25, 2023. During this time, the team was tasked with mitigating any issues, manually adjusting connection load, and identifying the source of the increased traffic. The mitigations alleviated some pressure on the system and allowed Agents to keep using Helpdesk features, although some remained in a degraded state. Meanwhile, we were still searching for a fix that addressed the root cause.

On July 26, 2023, the Gorgias engineering team met with the maintainers of the component library that had been upgraded. It became evident that the library contained changes that had been neither documented nor correctly evaluated. Thanks to their collaboration, the Gorgias team was finally able to identify the root causes. Hotfixes were applied and deployed immediately. This alleviated the issues for the majority of customers.

Traffic remained high on July 27, 2023, and we took steps to block traffic originating from a select number of accounts that were using outdated versions of our web app. Affected customers were notified, and this helped bring down the system pressure even further.
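For illustration only, blocking connections from outdated clients can be sketched as a version check performed when a client connects. This is a minimal sketch under stated assumptions, not the actual Gorgias implementation: the version format, the minimum version, and the names `parse_version` and `should_accept_connection` are all hypothetical.

```python
from typing import Tuple

# Hypothetical minimum web app version allowed to open a live connection.
MIN_SUPPORTED_VERSION: Tuple[int, int, int] = (2, 5, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a dotted version string such as "2.5.1" into a tuple of ints.

    Missing components are padded with zeros so that "2.5" equals "2.5.0".
    Raises ValueError if any component is not an integer.
    """
    parts = [int(part) for part in text.strip().split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def should_accept_connection(
    client_version: str,
    minimum: Tuple[int, int, int] = MIN_SUPPORTED_VERSION,
) -> bool:
    """Return True if the client is recent enough to connect.

    Clients reporting an unparseable version are treated as outdated.
    """
    try:
        return parse_version(client_version) >= minimum
    except ValueError:
        return False
```

A rejected client would then be asked to reload, in the same spirit as the guidance given to affected accounts.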

On July 31, 2023, the main author of the third-party component library shared a patch after identifying issues.

On August 1, 2023, the web app was patched and was no longer putting our backend infrastructure at risk.

On August 2, 2023, the team identified the remaining cause of some customer reports that were still coming in. It was fixed, and within a couple of minutes all issues were resolved and services were restored.

Incident impact

For approximately eight days, our customers experienced visible issues with live updates related to:

  • Agent Ticket presence
  • Agent typing activity
  • Ticket counts
  • Agent availability
  • Ticket routing - assignment

During the first two days, users with more than one Gorgias tab open or who didn’t refresh their page were not seeing updates to the above features.

During the following four days, users observed high latency when receiving notifications or experienced intermittent issues.

Issues during the last two days were primarily observed by users who had Gorgias tabs open when resuming work after their computers had gone to sleep.

No data was lost during this incident.

Action items

  • Document everything discovered regarding this component to make sure risks are well-known and identified.
  • Reduce the impact radius of changes that may have a non-trivial rollback by rolling out slowly and applying blue/green techniques when possible.
  • Harden our Helpdesk front-end to deal with back-end upgrades, and improve error messaging to notify impacted users.
  • Increase components’ overall performance and observability.
  • Increase the update frequency of critical back-end components to reduce risk due to accumulated changes.
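As a rough illustration of the "rolling out slowly" action item, a gradual rollout can be sketched by assigning each account to a stable bucket and enabling a change only for buckets below a configured percentage. This is a generic sketch, not a description of Gorgias's tooling: the account identifiers and the names `rollout_bucket` and `is_in_rollout` are hypothetical, and blue/green deployment is not covered here.

```python
import hashlib


def rollout_bucket(account_id: str) -> int:
    """Map an account to a stable bucket in the range 0..99.

    Hashing keeps the assignment deterministic, so an account stays in
    the same bucket across requests and across processes.
    """
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_in_rollout(account_id: str, percent: int) -> bool:
    """Return True if the account falls within the first `percent` buckets."""
    return rollout_bucket(account_id) < percent
```

Because buckets are stable, raising the percentage only ever adds accounts to the rollout, and lowering it back removes the most recently added ones first, which keeps the affected population small if a change has to be paused.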
Posted Aug 22, 2023 - 17:41 PDT

Resolved
We can confirm that the changes released during this week yielded the expected results and that we are observing normal behavior of the following features:

- Agent activity
- Agent availability
- Ticket assignment
- Agent collision
- View counts updates
- New tickets and ticket reply notification

We sincerely apologize for the disruption these issues caused to our users' day-to-day work. The engineering team has been working at full capacity to understand these issues and solve them in a timely manner. We will follow up with a customer-facing post-mortem detailing the root cause analysis and the actions we've taken to prevent the issues from occurring again.

Users still experiencing similar issues are kindly asked to proceed by closing all Gorgias tabs, re-opening one, and reaching out to our support if the issues persist.
Any users constantly experiencing a red banner are encouraged to upgrade their browser to the latest version. If the banner persists, please reach out to our support team.

We want to again apologize sincerely for the disruptions the issues caused.

Thank you for your patience, your feedback, and your continued trust.
Posted Aug 04, 2023 - 14:06 PDT
Update
We have been monitoring the situation for the past 24 hours and our metrics indicate that the situation should be back to normal for the following features:
- Agent activity
- Agent availability
- Ticket assignment
- Agent collision
- View counts updates
- New tickets and ticket reply notification

We will continue monitoring for another 24 hours before resolving the incident.

We kindly ask agents still experiencing issues to close all Gorgias tabs and re-open one and to reach out to our support if the issues persist.
Posted Aug 03, 2023 - 14:39 PDT
Update
We shipped one more fix that specifically addresses the "Agent collision" and "Agent typing activity" issues, and we are observing the results. The expected result is that agents no longer work on the same ticket without knowing.

Yesterday's changes show that view counts are now consistently up to date, and so are other automatic notifications and updates within the web app.

We are closely monitoring feedback and metrics on our end to confirm that the two previous fixes solve all the issues encountered with the web app.
Posted Aug 02, 2023 - 12:24 PDT
Update
We recently deployed a new version of our web app that aims to address the most common issues with the "Agent activity", "View count" and "Assignment" features, and we are observing the results.

We are closely monitoring the situation and feedback received to ensure the fastest resolution of the above issues.

Disconnections should now be even rarer, and when one occurs a red banner should be displayed to make it explicit.

We still recommend that affected agents reload their page, or close all their tabs if reloading doesn't help. This is also the way to guarantee that the new version of the web app is used.
Posted Aug 01, 2023 - 10:41 PDT
Update
We continue to receive reports from some accounts that agents occasionally experience disconnects and see the banner "Not connected to live updates". This banner means that live connection features are not working properly for that agent.
Our team continues working on a fix for this problem, and we aim to release it as soon as it is ready.

Recommendations for the agents continue to be the same until the incident is fully resolved:
- If an agent gets this banner and it doesn't disappear within 20 seconds, they should reload the page. If the agent has multiple tabs open, reloading the page might not help, and they would need to close the other tabs and reload the last one.
- Until the incident is fully resolved, we kindly recommend that agents use a single Gorgias tab so that they are alerted to any disconnect and can reconnect immediately. A good indicator of a healthy live connection is that the ticket counts for views in the left sidebar are updated regularly.
Posted Jul 31, 2023 - 14:20 PDT
Update
We resolved the main problem that was affecting agents, and we are now continuously making sure our servers can handle all the connections. We are still proactively reaching out to accounts if they start experiencing a spike of issues due to stale versions of the web app. Our engineering team is working around the clock to make sure this issue doesn't regress.

We have reports from some accounts that agents occasionally experience disconnects and see the banner "Not connected to live updates". This banner means that live connection features are not working properly for that agent.
Our team is actively working on a solution for this problem and plans to release a fix at the beginning of next week.

The following workaround is recommended to make sure agents are actively connected to live updates:
- If an agent gets this banner and it doesn't disappear within 20 seconds, they should reload the page. If the agent has multiple tabs open, reloading the page might not help, and they would need to close the other tabs and reload the last one.
- Until the incident is fully resolved, we kindly recommend that agents use a single Gorgias tab so that they are alerted to any disconnect and can reconnect immediately. A good indicator of a healthy live connection is that the ticket counts for views in the left sidebar are updated regularly.
Posted Jul 28, 2023 - 12:28 PDT
Update
We continue monitoring the live connections state.
So far we see no impact on end users for the majority of accounts, except in cases where stale versions of the web app fail to connect to the new live connection.

If an agent has a stale version of the web app, a banner with the following text appears at the top of the page 30 seconds after the agent opens the page:
"Your Helpdesk is using a stale live notifications connection. Please close all the Gorgias tabs(or browser copies) to make sure your Gorgias application is using the latest version and try to open a new page again."
To make sure the agent uses the latest version, please follow the instructions on the banner.
If after following instructions the banner doesn't disappear, please reach out to our support team for further assistance.

For some accounts where we see a high impact from stale versions of the web app, the engineering team is contacting the accounts directly to make sure the adoption of the new version proceeds smoothly and to prevent further impact.
Posted Jul 27, 2023 - 15:41 PDT
Update
We have made updates to our application on both the server side and the client side. The majority of live connections have stabilized, and we continue to monitor their state.

Meanwhile, some accounts may still experience issues with the web app due to a stale version of the Front-End live messaging connection for certain agents. This can result in view count discrepancies, live agent activity problems, phone call issues, auto-assignment not working properly, or notifications not popping up.

To resolve this, please follow these steps:
- Close all Gorgias tabs or copies of your browser.
- Reopen the page to ensure a proper live connection.
- Check that view counts on the left side have started appearing.
Posted Jul 26, 2023 - 20:59 PDT
Update
We have limited the impact of this issue and the current situation is stable for most users. We are actively working on resolving this for our customers that are still impacted.

Affected users are still advised to close all Gorgias tabs (or restart their browser).
Posted Jul 26, 2023 - 08:54 PDT
Update
We are still monitoring the adoption of a new Web app version and are doing everything we can to accelerate it.
Affected users are still advised to close all Gorgias tabs (or restart their browser).
Posted Jul 21, 2023 - 16:19 PDT
Monitoring
We have observed issues with automatic refresh on the web app for users with multiple Gorgias tabs open.
To resolve these issues, affected users are advised to close all Gorgias tabs (or restart their browser).
Posted Jul 20, 2023 - 15:17 PDT
This incident affected: Helpdesk Clusters (us-east1-635c, us-east4-65cd, us-east1-2607, aus-southeast1-fcb9, europe-west3-86c1, us-central1-d8ff, us-east4-5f09, europe-west1-c511, us-central1-c433) and Helpdesk (Web App).