What appears to have been a larger-than-usual burst of incoming requests from one of our partners saturated the CPUs on all of our incoming workers. The workers could no longer respond to health checks, which in turn caused the orchestrator to kill and respawn them.
With workers either running at 100% capacity or being killed by the orchestrator, the entire incoming service was practically unusable.
We fixed the issue by scaling up the number of workers and adjusting the parameters used to check their health. We plan to make these changes permanent, overprovisioning the incoming subsystem so that it handles large bursts better.
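
To illustrate why the health-check parameters matter, here is a minimal sketch of how a restart decision might be made. It is not our orchestrator's actual logic: the function `should_restart`, the parameters `timeout_s` and `failure_threshold`, and all of the latency numbers are hypothetical and chosen only for illustration. The sketch assumes a worker is restarted once a given number of consecutive health probes exceed a timeout.

```python
from typing import Sequence


def should_restart(
    probe_latencies_s: Sequence[float],
    timeout_s: float,
    failure_threshold: int,
) -> bool:
    """Return True if `failure_threshold` consecutive probes exceed `timeout_s`.

    Hypothetical model of a health check; real orchestrators may differ.
    """
    consecutive_failures = 0
    for latency in probe_latencies_s:
        if latency > timeout_s:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
        else:
            # A successful probe resets the failure streak.
            consecutive_failures = 0
    return False


# Made-up probe response times (seconds) for a worker that is CPU-saturated
# for three probes in a row and then recovers.
burst = [0.2, 3.0, 4.0, 3.5, 0.3]

# Strict settings: the slow-but-alive worker gets killed mid-burst.
strict = should_restart(burst, timeout_s=1.0, failure_threshold=2)

# More lenient settings: the same worker survives the burst.
lenient = should_restart(burst, timeout_s=5.0, failure_threshold=3)

print(strict, lenient)
```

Under these assumed numbers, the strict settings restart a worker that is busy rather than dead, which removes capacity exactly when it is most needed. Looser settings trade slower detection of genuinely dead workers for stability under load.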