Network connectivity issues

Incident Report for BlazeMeter

Postmortem

Problem:

  1. BlazeMeter’s DataService, which is responsible for collecting test results and load engine status, experienced networking issues on the underlying Kubernetes cluster.
  2. The networking issue caused DNS resolution to fail, resulting in problems accessing the database and other services.
  3. As a result, customers experienced problems running tests and viewing their results.

Root cause of the problem:

  1. The initial connectivity issue was caused by the Kubernetes cluster network layer intermittently failing to resolve DNS entries (roughly once every several requests).
  2. Our attempt to resolve this issue failed because of a problem with the cluster setup script.
  3. The code deployed to restore the malfunctioning cluster was not identical to the original code.
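
An intermittent resolution failure like the one described in item 1 can be hard to spot with a single lookup. The following is a minimal sketch, not the tooling used during the incident, of how a failure rate could be measured by repeating a lookup many times. The function name `dns_failure_rate`, the `resolve` parameter, and the host name in the usage comment are illustrative assumptions; resolver caching on the host may also hide some failures.

```python
import socket
from typing import Callable


def dns_failure_rate(
    hostname: str,
    attempts: int = 20,
    resolve: Callable[..., object] = socket.getaddrinfo,
) -> float:
    """Return the fraction of lookups for `hostname` that failed.

    `resolve` defaults to socket.getaddrinfo; it is a parameter only so
    that a fake resolver can be substituted when testing (an assumption
    of this sketch, not something taken from the incident report).
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            resolve(hostname, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            failures += 1
    return failures / attempts


# Hypothetical usage: alert if more than 5% of lookups fail.
# rate = dns_failure_rate("database.internal.example", attempts=50)
# if rate > 0.05:
#     print(f"DNS failure rate too high: {rate:.0%}")
```

A rate well above zero on an otherwise healthy host would point to the network layer rather than to the service being resolved.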

Action Taken:

  1. Created a standby K8S cluster with dedicated DNS host names, so that traffic can be routed there if a cluster issue cannot be fixed (Complete)
  2. Improved the release process to make sure the release branch always contains the latest production code (Complete)
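
The report does not say how traffic is switched to the standby cluster, and the switch may well be a manual decision. Purely as an illustration of the idea in item 1, the sketch below picks between two dedicated host names based on a simple TCP reachability check. The names `tcp_reachable` and `choose_cluster`, the port, and the host names in the usage comment are all hypothetical.

```python
import socket
from typing import Callable


def tcp_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False


def choose_cluster(
    primary: str,
    standby: str,
    is_healthy: Callable[[str], bool] = tcp_reachable,
) -> str:
    """Return the primary host name if it is healthy, otherwise the standby."""
    return primary if is_healthy(primary) else standby


# Hypothetical usage with made-up host names:
# target = choose_cluster("primary.k8s.example", "standby.k8s.example")
```

A single reachability check is a deliberately simple health signal; a real failover decision would likely need repeated checks and human confirmation.
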
Posted Dec 23, 2019 - 06:21 PST

Resolved

System is back online and we are monitoring the results.
Some customers may still experience issues.
We will post progress updates and analyze the incident.
Posted Dec 05, 2019 - 17:54 PST

Update

The cluster is back and processing data, but retained data is now causing high load on the database.
Posted Dec 05, 2019 - 15:25 PST

Update

We continue to perform tests and are seeing the expected results.
Next update will be by 2:45PM PST.
Posted Dec 05, 2019 - 13:44 PST

Update

The VPN issue is confirmed fixed.
Mock Services is operational again, and we are monitoring its health.
Testing of test execution functionality has resumed.
Next update will be by 1:30PM PST.
Posted Dec 05, 2019 - 12:53 PST

Update

Testing found that one of the new VPN tunnels created for inter-service communication is not working properly. The team is working on a fix.
Next update will be by 12:45PM PST.
Posted Dec 05, 2019 - 11:43 PST

Update

Cluster testing still in progress.
Next update will be by 11:45AM PST.
Posted Dec 05, 2019 - 10:49 PST

Update

We have restored the failed Kubernetes cluster and are testing it now.
Next update will be by 10:45AM PST.
Posted Dec 05, 2019 - 09:47 PST

Identified

One of our production Kubernetes clusters is having network connectivity issues. We are working on restoring the cluster.
This sporadically affects the ability to run tests and see results.
Posted Dec 05, 2019 - 05:08 PST

Investigating

We are currently investigating this issue.
Posted Dec 05, 2019 - 01:50 PST
This incident affected: Mock Services.