API Monitoring (Runscope) Planned Maintenance Window on May 26, 2020
Scheduled Maintenance Report for BlazeMeter
Postmortem

Note - because several subsequent API Monitoring incidents were migration related, all of them are covered in this single report.

Incident details and times:

  • Incident 1 - 05/26/2020 10:01PM PDT to 05/27/2020 01:40AM PDT => Post-migration extended window
  • Incident 2.1 - 05/27/2020 07:31AM PDT to 05/27/2020 12:20PM PDT => Degraded Performance
  • Incident 2.2 - 05/27/2020 07:31AM PDT to 05/27/2020 01:55PM PDT => US-Virginia Degraded Performance
  • Incident 3.1 - 05/27/2020 05:54PM PDT to 05/28/2020 05:22AM PDT => Partial Outage
  • Incident 3.2 - 05/28/2020 09:34AM PDT to 05/29/2020 05:52PM PDT => Slowness and tuning
  • Incident 4 - 05/29/2020 08:00PM to 08:15PM PDT => Redis node failure

Problem symptoms:

  1. Post migration, tests were running slowly in most locations, especially US Virginia
  2. Occasional random test failures with “System Error” in most test locations
  3. Intermittent “read timeout” exceptions observed when persisting test runs
  4. Slow data loading in UI dashboards

Root causes of the problem:

  1. Incident 1 - The migration window was extended due to an unforeseen issue with a few of our Redis migrations. This caused a large backlog of unprocessed events to build up within the system.
  2. Incident 2.x - When system services reconnected after the Redis migration issue was resolved, the pre-existing event backlog, combined with cache-layer misses, drove an abnormally high volume of read and write operations to the DB layer and exhausted its available Input/Output Operations Per Second (IOPS). A sketch of this cache-miss effect follows this list.
  3. Incident 3.x - When system services reconnected after the Redis migration issue was resolved, the pre-existing event backlog caused a high volume of keys to be added and updated on the Redis shards, which exhausted compute resources on the Redis instances. To mitigate this, we moved the impacted Redis shards to nodes with more processing capacity.
  4. Incident 3.x - We identified an index that appeared to be causing additional read operations on the DB. We applied a hotfix to remove this index and restarted the impacted microservice(s). The hotfix and restart released the read and write connection spikes previously held by the DB layer, which then returned to normal connection patterns with IOPS usage well within the desired threshold.
  5. Incident 2.x/3.x - We also triaged the test slowness and found that agents were routing through a network address translation (NAT) layer that could not keep up with the load and did not scale. Once we removed this bottleneck from the agents' path, tests returned to running at the expected velocity.
  6. Incident 1/2.x - Because we updated our DNS records, most agents - both Cloud Agents and on-premises Remote Agents - continued serving DNS answers from their local cache until the cache TTL expired, which caused test failures. We restarted our internal agents to flush their caches, but most remote agents were not restarted and remained impacted; see the DNS TTL sketch after this list.
  7. Incident 3.x - There were also intermittent latency spikes on the DB layer that matched the time windows of our application-layer read/write timeout exceptions. This issue was later narrowed down to one of the MongoDB shards; that shard has since been repaired and attached back cleanly.
  8. Incident 4 - There was a Redis primary node failure due to a GCP node correction. We raised this with GCP; they described it as a rare, one-off occurrence, said they would take care of it, and suggested moving Redis to Memorystore.
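
To illustrate the cache-miss effect described in item 2, here is a minimal cache-aside sketch in Python. The client setup, key names, and the fetch_test_run/db names are hypothetical and are not the actual service code; the point is that a cold cache after a migration sends every read to the DB until the cache is repopulated.

    import redis  # assumes the redis-py client package

    # Hypothetical client and key naming, for illustration only.
    cache = redis.Redis(host="localhost", port=6379)

    def fetch_test_run(db, run_id):
        """Cache-aside read: try Redis first, fall back to the DB on a miss."""
        cached = cache.get("test_run:" + run_id)
        if cached is not None:
            return cached  # warm cache: no DB read needed

        # Cold cache (e.g. right after a migration): every call lands here,
        # so the DB absorbs the full read volume until keys are repopulated.
        record = db.find_one({"_id": run_id})  # hypothetical DB lookup (costs IOPS)
        cache.set("test_run:" + run_id, record["raw"], ex=300)  # refill, 5-minute TTL
        return record["raw"]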

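For item 6, the sketch below uses the dnspython library (an assumed tooling choice; the hostname is a placeholder) to inspect the TTL on a DNS answer. An agent that keeps serving a cached answer after a DNS cutover will point tests at the old infrastructure until that TTL expires or the agent is restarted and its cache flushed.

    import dns.resolver  # assumes the dnspython package (pip install dnspython)

    HOSTNAME = "api.example.com"  # placeholder; substitute the real endpoint

    resolver = dns.resolver.Resolver()
    answer = resolver.resolve(HOSTNAME, "A")  # dnspython >= 2.0; older versions use resolver.query()

    # The TTL is how long an agent may reuse this answer from its cache.
    # After a cutover, agents that are not restarted keep using the old
    # addresses until roughly this many seconds have elapsed.
    print("records:", [rr.to_text() for rr in answer])
    print("advertised TTL (seconds):", answer.rrset.ttl)
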
What we are doing to avoid recurrences in the future:

This migration involved moving 20+ TB of data, 40+ microservices, and the complete platform architecture from AWS to GCP. It was a one-time activity and will not recur. Most of the issues identified were one-time tuning and settling of the system on the new platform, covering network, database, and infrastructure fine-tuning.

  1. Incident 2.x/3.x - We analyzed all DB queries, identified the slow-running ones, and adjusted indexes to make them faster (see the pymongo sketch after this list).
  2. Incident 2.x/3.x - We narrowed the DB-layer latency issues down to one shard; that shard has been repaired, tuned, and attached back to the cluster. We also disabled flow control, which alleviated the situation by letting secondaries lag if needed; because they caught up quickly, this caused no further issues and write throughput stabilized.
  3. Incident 2.x/3.x - We also applied a hotfix to remove an incorrect index that appeared to be causing additional read operations on the DB.
  4. Incident 1/4 - We moved the Redis shards that had exhausted their compute resources to nodes with more processing capacity. As a long-term plan, we are also working on moving all Redis shards to a managed instance pool (Memorystore) with HA built in.
  5. Incident 2.x/3.x - We revisited our internal agents' topology and removed the non-performant network address translation layer so agents connect directly to their targets.
  6. Incident 2.x - We mitigated issues related to mixed-cloud (AWS & GCP) agent workloads by consolidating them into single-cloud (GCP) workloads wherever possible.
  7. Incident 1/2.x - We reached out to customers and asked them to restart their remote agents.
  8. All incidents - We are putting an enhanced Site Reliability Engineering plan and processes in place.
  9. All incidents - We deployed stringent infrastructure and service monitoring and alerting with appropriate thresholds, with constant review and analysis of those metrics to evolve and mature it. This monitoring covers not only core services and DBs but also the global locations used for running tests.
  10. Incident 4 - We are working on moving Redis to GCP Memorystore.
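
As a minimal sketch of the index-tuning and flow-control steps above (items 1-3), the snippet below uses pymongo; the connection string, database, collection, filter fields, and index names are hypothetical and not taken from the actual deployment.

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    runs = client["monitoring"]["test_runs"]           # hypothetical DB and collection

    # 1. Inspect a slow query's plan to see whether it is scanning the collection.
    plan = runs.find({"bucket_id": "abc", "started_at": {"$gte": 0}}).explain()
    print(plan["queryPlanner"]["winningPlan"].get("stage"))  # e.g. COLLSCAN vs IXSCAN

    # 2. Drop an index that causes extra reads and add one matching the query shape.
    # runs.drop_index("bad_index_name")                # hypothetical index name
    runs.create_index([("bucket_id", ASCENDING), ("started_at", ASCENDING)])

    # 3. Temporarily disable replica-set flow control (MongoDB 4.2+), allowing
    #    secondaries to lag instead of throttling writes on the primary.
    client.admin.command({"setParameter": 1, "enableFlowControl": False})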
Posted Jun 17, 2020 - 06:54 PDT

Completed
Our maintenance is complete. We have successfully migrated API Monitoring to Google Cloud Platform. Our Illinois and Texas locations are not yet available but should be soon. Thank you for your patience!
Posted May 27, 2020 - 00:58 PDT
Update
We’re getting closer. The BlazeMeter API Monitoring (Runscope) services are being restarted and system testing is in progress. Please refrain from running tests and/or making any modifications as we test. We will communicate when the system is back to 100% and ready for use. Thank you.
Posted May 27, 2020 - 00:12 PDT
Update
API Monitoring is now functioning. We are working to bring up all our locations.
Posted May 27, 2020 - 00:00 PDT
Update
Everything is going to be 200 OK, but our engineering team requires some additional time. We are extending the maintenance window until 1:00am PDT. We will provide an update on progress by 12:00AM PDT. Thank you for your patience.
Posted May 26, 2020 - 22:52 PDT
Update
Apologies, it’s taking a little longer than expected. We are working to complete our maintenance window as soon as possible and expect to have our service restored by 11:00pm PDT.
Posted May 26, 2020 - 21:54 PDT
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted May 26, 2020 - 18:00 PDT
Update
We will be undergoing scheduled maintenance during this time.
Posted May 14, 2020 - 10:29 PDT
Scheduled
We have a planned maintenance window for the API Monitoring component (formerly known as Runscope) of the BlazeMeter Continuous Testing Platform on May 26th, 2020, from 6pm to 10pm PDT (0100 - 0500 UTC).

During this time:

  • Tests under the API Monitoring tab will not be executed in any way (schedules, Trigger URLs, or API calls)
  • The API Monitoring tab will not be available
  • The API Monitoring API will not be available

This will only impact the API Monitoring component. The Functional, Performance, and Mock Services components of the BlazeMeter platform will continue to run.

Why is this happening?

We are migrating our infrastructure from Amazon Web Services to Google Cloud Platform. Our team has been working on this to provide a better experience for our users and to continue improving the integration of API Monitoring with the overall BlazeMeter platform.

What's going to happen to my API Monitoring tests during the maintenance window?

A few minutes after the maintenance window starts, we will stop scheduled tests from executing. Any Trigger URLs and API calls will return a 503 status code.
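
For clients that call Trigger URLs programmatically, the following is a rough sketch (using the Python requests library; the URL and retry values are placeholders, not values prescribed by BlazeMeter) of how a caller could back off while the 503 responses last:

    import time
    import requests

    TRIGGER_URL = "https://api.runscope.com/radar/REPLACE_ME/trigger"  # placeholder trigger URL

    def trigger_with_retry(url, attempts=5, backoff_seconds=60):
        """Call a Trigger URL, backing off while maintenance returns 503."""
        for attempt in range(attempts):
            resp = requests.get(url, timeout=30)
            if resp.status_code != 503:
                return resp  # normal response once the maintenance window ends
            time.sleep(backoff_seconds * (attempt + 1))  # simple linear backoff
        return resp

    print(trigger_with_retry(TRIGGER_URL).status_code)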

Is there anything I need to do?

There is no action you need to take regarding your tests. Once our maintenance window is complete and we have finished migrating our infrastructure, any tests you had scheduled will resume running.

There also won't be any changes necessary to Remote Radar Agent configuration files or to API calls made to the API Monitoring API.

If you would like to stay up to date on the status of the maintenance window, please visit our status page at status.blazemeter.com.

We understand that this downtime can be disruptive to teams, and we sincerely apologize for the inconvenience. We're doing everything we can to help mitigate any risks associated with this infrastructure change, and reduce the impact for our customers.

If you have any questions or concerns about this, please reach out to us by replying to this email.

Thank you,

The BlazeMeter Team
Posted May 13, 2020 - 11:57 PDT
This scheduled maintenance affected: API Monitoring (Runscope).