Issue with API Monitoring Tests and API

Incident Report for BlazeMeter

Postmortem

Incident details with time:

06/30/2020 11:29AM PDT to 12:00PM PDT => 3 internal incidents were triggered by our SaaS Monitoring for drop in API traffic and Test runs:

‌

Problem symptoms:

Observed “max number of clients limit reached” exceptions and “health-check” failures in our internal Calculon service
Observed queue spikes resulting in Tests getting queued up
Observed drops in Transactions Per Second (TPS) for APIs initially, and later observed the same system wide

‌

Root cause of the problem

The drop in API TPS followed by the “max number of clients reached” exception and Calculon microservice health check failures, enabled us to quickly narrow down the issue at hand to one of the Redis DBs “Redis-calculon”. As we observed that this Redis DB maxed out its clients connections two times during this window, we suspected any of the Redis COMMAND(s) came into this Redis server during that window might have been BLOCKED for longer time (and hence this client connection spikes) due to multiple reasons including but not limited to – a) any foreign script got triggered outside of the application context that runs any BLOCKING commands b) IOPS and other kernel, N/W & H/W limitations/issues on that node, if any c) connection contentions at application clients itself.
So, to mitigate the connection contentions and threshold limit of the impacted Redis instance and to be able to bring back the calculon microservice to a healthy state, we did a failover of this Redis server and updated application clients and relinquished all its inflight connections to point to the newly failed over instance (the new master).
Immediately within a few secs of failover, the runscope Calculon microservice became healthy and API and other services TPS builds returned back to normal. Also tests execution picked up to its normal velocity.
We connected the impacted Redis instance (old master) as slave to the new master instance and enabled debug to triage the issue further – Upon further triaging on this impacted instance, we observed that there was a housekeeping script which otherwise would normally be triggered on the slave instances only, got triggered accidently on this master at the wrong time window at ~ 11:15AM PDT on 06/30 – we normally run these housekeeping scripts during a dedicated maintenance schedules which are carefully chosen hand-picked time windows with low traffic and low impact to the systems.

‌

What we are doing to avoid recurrences in future:

Automate the instances details update process upon any addition/deletion of DB hosts to/from the production cluster – especially Redis as it normally involves frequent failovers and removals.
Revisit and re-approve instance access controls with more stringent reviews and approvals.
Enable housekeeping/profiles jobs via Rundeck with proper access and approvals

Posted Jul 07, 2020 - 08:19 PDT

Resolved

We have confirmed that things are back to normal in API Monitoring. We are closing this incident.

Posted Jun 30, 2020 - 13:48 PDT

Monitoring

API Monitoring should be functioning normally now. We are monitoring to see that it is stable.

Posted Jun 30, 2020 - 12:25 PDT

Update

We have identified the issue and will provide an update shortly.

Posted Jun 30, 2020 - 12:12 PDT

Investigating

We are aware of an issue with API Monitoring functionality including running tests, scheduled tests, and accessing via API. We are investigating and will provide an update as soon as possible.

Posted Jun 30, 2020 - 12:00 PDT

This incident affected: API Monitoring (Runscope).