Note - Because several consecutive API Monitoring incidents were related to the migration, all of them are covered in this single report.
Incident details with time:
- Incident 1 - 05/26/2020 10:01PM PDT to 05/27/2020 01:40AM PDT => Post migration extended window
- Incident 2.1 - 05/27/2020 07:31AM PDT to 05/27/2020 12:20PM PDT => Degraded Performance
- Incident 2.2 - 05/27/2020 07:31AM PDT to 05/27/2020 01:55PM PDT => US-Virginia Degraded Performance
- Incident 3.1 - 05/27/2020 05:54PM PDT to 05/28/2020 05:22AM PDT => Partial Outage
- Incident 3.2 - 05/28/2020 09:34AM PDT to 05/29/2020 05:52PM PDT => Slowness and tuning
- Incident 4 - 05/29/2020 08:00PM PDT to 05/29/2020 08:15PM PDT => Redis Node failure
Problem symptoms:
- Post migration, tests ran slowly in most locations, especially US-Virginia
- Occasional random test failures with System Error in most of the test locations
- Observed intermittent “read timeout” exceptions when persisting test runs
- Slowness in data loading in UI Dashboards
Root cause of the problem:
- Incident 1 - The migration window was extended due to an unforeseen issue with a few of our Redis migrations. As a result, a large backlog of unprocessed events built up within the system.
- Incident 2.x - When system services reconnected after the Redis migration issue was resolved, the existing event backlog, coupled with cache misses in the cache layer, drove an abnormally high volume of read and write operations on the DB layer, which exhausted its available Input/Output Operations Per Second (IOPS).
- Incident 3.x - When system services reconnected after the Redis migration issue was resolved, the existing event backlog caused a high volume of keys to be added and updated on the Redis shards, which exhausted compute resources on the Redis instances. To mitigate this, we moved the impacted Redis shards to nodes with more processing capacity.
- Incident 3.x - We identified an index that appeared to be causing additional read operations on the DB. We applied a hotfix to remove this index and restarted the impacted microservice(s). The hotfix and restart released the spiked read and write connections held between the application and DB layers, after which the DB layer returned to typical connection patterns, with Input/Output Operations Per Second usage well within the desired threshold.
- Incident 2.x/3.x - We also triaged the test slowness and traced it to the agents going through a network address translation (NAT) layer that did not perform well under the load and could not scale. Once we removed this bottleneck layer from the agents' path, tests ran at the expected velocity again.
- Incident 1/2.x - Because we updated our DNS records, most agents (both Cloud Agents and on-premises Remote Agents) started failing tests: they kept serving DNS queries from their local cache until the cache TTL expired. We restarted our internal agents to flush the cache, but most remote agents were not restarted and were impacted.
- Incident 3.x - Intermittent latency spikes were also observed on the DB layer, and their timing matched the windows in which the application layer saw read/write timeout exceptions. The issue was later narrowed down to one of the MongoDB shards, which has since been repaired and reattached clean.
- Incident 4 - A Redis primary node failed due to a GCP node correction. We raised this with GCP, who described it as a one-off or rare occurrence that they would address, and who suggested moving Redis to Memorystore.
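The DNS caching behavior behind Incident 1/2.x can be illustrated with a minimal sketch. This is not the agents' actual resolver; the class name `TtlDnsCache`, the injected `lookup` function, and the 300-second TTL used in the usage note are all hypothetical and chosen only for illustration. It shows why a client keeps using the old address after a record change until the TTL expires or the cache is flushed (which is what an agent restart achieves).

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachedRecord:
    address: str
    expires_at: float  # clock value after which the record is considered stale


class TtlDnsCache:
    """Toy resolver cache (hypothetical): serves a stored answer until its TTL elapses."""

    def __init__(
        self,
        lookup: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup      # stands in for a real upstream DNS query
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the behavior can be tested
        self._records: Dict[str, CachedRecord] = {}

    def resolve(self, host: str) -> str:
        now = self._clock()
        record = self._records.get(host)
        if record is not None and now < record.expires_at:
            # Cache hit: the answer may be outdated if the record changed upstream.
            return record.address
        # Cache miss or expired entry: ask upstream and cache the fresh answer.
        address = self._lookup(host)
        self._records[host] = CachedRecord(address, now + self._ttl)
        return address

    def flush(self) -> None:
        """Drop all cached answers, as a process restart would."""
        self._records.clear()
```

With a 300-second TTL, a record changed upstream at t=0 is still resolved to the old address at t=100; only calling `flush()` or waiting past t=300 returns the new address.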
What we are doing to avoid recurrences in the future:
This effort involved migrating 20+ TB of data and 40+ microservices, along with the complete platform architecture, from AWS to GCP. The migration was a one-time activity and will not recur. Most of the issues identified were one-time tuning and settling of the system on the new platform, covering network, database, and infrastructure fine-tuning.
- Incident 2.x/3.x - We analyzed all the DB queries, identified the slow-running ones, and adjusted indexes to speed them up.
- Incident 2.x/3.x - We narrowed down the latency issues on the DB layer to one shard, which has been repaired, tuned, and reattached to the cluster. We also disabled flow control, which alleviated the situation by letting secondaries lag if needed; because they caught up quickly, this caused no further issues and write throughput stabilized.
- Incident 2.x/3.x - We also applied a hotfix to remove an incorrect index that appeared to be causing additional read operations on the DB.
- Incident 1/4 - We moved the Redis shards that had exhausted their compute resources to nodes with more processing capacity. As a long-term plan, we are also working on moving all Redis shards to a managed “Memorystore” instance pool with HA built in.
- Incident 2.x/3.x - We revisited our internal agents’ topology and removed the non-performant network address translation (NAT) layer, so agents now connect directly to the targets.
- Incident 2.x - We mitigated issues related to mixed-cloud (AWS & GCP) agent workloads by consolidating them into single-cloud (GCP) workloads wherever possible.
- Incident 1/2.x - We reached out to customers and asked them to restart their remote agents.
- All Incidents - We are working on putting an enhanced “Site Reliability Engineering” plan and processes in place.
- All Incidents - We deployed stringent infrastructure and service monitoring and alerting with appropriate thresholds, and we will continually review and analyze those metrics to evolve and mature the setup. This monitoring will cover not only core services and DBs but also the global locations where tests run.
- Incident 4 - We are working on moving Redis to GCP Memorystore.
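As an illustration of the threshold-based alerting mentioned in the "All Incidents" monitoring item, here is a minimal sketch of one way a sustained-breach check could work. It is not our actual monitoring configuration: the metric name `db_iops_utilization_pct`, the 70/90 threshold values, and the three-sample window are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Threshold:
    metric: str      # hypothetical metric name, for illustration only
    warning: float   # level at which a warning is raised
    critical: float  # level at which a critical alert is raised


def evaluate(threshold: Threshold, samples: Iterable[float], sustained: int = 3) -> str:
    """Return 'critical', 'warning', or 'ok' based on the last `sustained` samples.

    Requiring several consecutive breaches avoids alerting on a single spike.
    """
    recent: List[float] = list(samples)[-sustained:]
    if len(recent) < sustained:
        return "ok"  # not enough data to judge a sustained breach
    if all(value >= threshold.critical for value in recent):
        return "critical"
    if all(value >= threshold.warning for value in recent):
        return "warning"
    return "ok"
```

For example, with `Threshold("db_iops_utilization_pct", warning=70, critical=90)`, three consecutive samples at or above 90 evaluate to a critical state, while a single spike followed by lower readings does not trigger an alert.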