Coordinating k6 runners on Kubernetes

My team is preparing our company to acquire a new customer who, in the initial stages, will be 5x bigger than our current biggest customer. To do this, we had to rewrite our performance tests from Gatling to k6, improve the reporting, metrics and scalability of our whole infrastructure, and tune a set of microservices.

To test our infrastructure we had to scale up our perf test runners too, so we developed a set of containerised performance tests and ran our performance test pack on dedicated Kubernetes nodes. While slowly ramping up to our desired traffic we hit the limit of what we could run on a single node. Each dedicated node has 16 vCPUs and 27GiB of memory; each pod is able to process about 15000 req/s from our test pack, after which it runs out of memory and CPU usage is too high to reliably tell whether our performance runner is becoming the bottleneck. We needed a way to scale up performance tests in a cloud native environment behind a VPN, so k6 Cloud wasn't an option.

We had to develop a coordinated way to run our perf tests at a scale that would allow us to hit 72000 req/s almost instantly. Scaling up pods is obviously a solved problem in Kubernetes, but coordinating the startup of pods isn't a common thing. As we scale up the number of runner pods it takes anywhere from several seconds to, at worst, 10 minutes before a pod is allocated on a new node; we often have to wait for the AWS node autoscaler to kick in. This turned out to be a blocker, as some of the tests in the pack run for 47 seconds and others for 250 seconds. We had to make sure all tests start at the same second.

We have a Jenkins pipeline (we're going to call it the "Coordinator") that sets up a performance testing environment, schedules tests, analyses results, etc.; that part isn't covered here.

Our mesobenchmark approach is as follows:

  • Coordinator pipeline that deploys a performance test environment
    • New independent environment for each performance test run
  • We deploy a Kubernetes Job
    • It is important that it's a Job, not a Deployment or StatefulSet, as we only want each pod to run once
  • We deploy the Job at the desired number of pods (a deployment sketch follows this list)
    • Each pod generates the same traffic, configured using env vars
  • We wait for the pods to be deployed and ready before starting all tests
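As an illustration, deploying the Job at a given pod count can be done with a parameterised manifest. This is a minimal sketch; k6-job.template.yaml, PARALLELISM and TARGET_RPS_PER_POD are hypothetical names, not our exact setup:

# Render a Job manifest template with the desired pod count and per-pod
# traffic settings, then apply it to the test namespace.
export PARALLELISM=4
export TARGET_RPS_PER_POD=18000
envsubst < k6-job.template.yaml | kubectl apply -n perf-test-namespace -f -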

k6 supports starting tests in a paused mode out of the box; it's as simple as k6 run --paused. The k6 runner will then wait for a remote call to start the test:

k6 run --paused \
    main.js
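
Under the hood, k6 serves a REST API while running (on localhost:6565 by default), and the pause/resume/status subcommands talk to it. Assuming the default address, you can sanity-check that a runner is parked before resuming it:

k6 status --address localhost:6565   # should report paused: true while waiting
k6 resume --address localhost:6565   # starts the test immediately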

Then we have a shell script that runs on the Coordinator and waits for the pods to be in a desired state. The script accepts the following parameters:

  • namespace
  • pod name (which can be a regexp or just the job name)
  • desired state
  • number of pods
  • target container (the container inside the pod to exec into)

The Coordinator must have access to kubectl for the script to work correctly; in particular, it needs permission to list pods and to exec into them, which you can verify with the standard kubectl auth checks:
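kubectl auth can-i list pods -n perf-test-namespace
kubectl auth can-i create pods/exec -n perf-test-namespace

With those in place, the script looks as follows: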

#!/bin/bash

# Wait until NUMBER_OF_PODS pods matching PODNAME reach WANTED_STATUS,
# then resume the paused k6 runner in TARGET_CONTAINER of each pod.
NAMESPACE=$1         # namespace the job runs in
PODNAME=$2           # pod name (regexp or just the job name)
WANTED_STATUS=$3     # desired pod status, e.g. Init:2/3 or Running
NUMBER_OF_PODS=$4    # how many pods must reach that status
TARGET_CONTAINER=$5  # container to exec into

POD_NAMES=""
READY_PODS_COUNT=0

while [ "$READY_PODS_COUNT" -lt "$NUMBER_OF_PODS" ]
do
    READY_PODS_COUNT=$(kubectl get -n "$NAMESPACE" pods | grep "$PODNAME" | grep -c "$WANTED_STATUS")
    POD_NAMES=$(kubectl get -n "$NAMESPACE" pods | grep "$PODNAME" | grep "$WANTED_STATUS" | awk '{print $1}')
    echo "Pod $PODNAME readiness $READY_PODS_COUNT/$NUMBER_OF_PODS in $WANTED_STATUS"
    sleep 5
done

echo "All pods in required status: $POD_NAMES"

# While the correct number of pods are in the desired state,
# it sometimes takes 1-3 seconds for k6 to preallocate all VUs.
# The sleep below is to prevent race conditions.
sleep 5

for pod in $POD_NAMES
do
    kubectl exec -n "$NAMESPACE" "$pod" -c "$TARGET_CONTAINER" -- k6 resume
done

The Coordinator executes the script as below:

./start_runners.sh perf-test-namespace k6-job Init:2/3 4 z-warmup
./start_runners.sh perf-test-namespace k6-job Running 4 main

Our performance test Job has an init stage called z-warmup. It runs exactly the same code as the main container, just at 25% of the target load, to warm up JVMs, CPUs and DB caches. When the warmup stage completes, the same pods wait for the main container to start, so both need to be started individually, on demand, but only when Kubernetes reports them to be ready.
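
As an illustration of how a single image can serve both stages, the load can be scaled through a script env var. LOAD_FACTOR is a hypothetical name here, but k6's -e flag for passing env vars into scripts is standard:

k6 run --paused -e LOAD_FACTOR=0.25 main.js   # z-warmup init container, 25% load
k6 run --paused -e LOAD_FACTOR=1.0 main.js    # main container, full load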

When all 4 pods with k6-job in the name are in the Init:2/3 state (two of the pod's three init containers have completed, meaning z-warmup is the one currently running), the script executes the k6 resume call, starting the tests. When the warmup stage completes, the Coordinator repeats the process, but this time waits for pods to be in the Running state, which is the stage at which the main container generates 100% of our target load.
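
For reference, this is roughly what the script greps for while the warmup stage runs (pod names and ages are made up; the column layout is standard kubectl get pods output):

$ kubectl get -n perf-test-namespace pods
NAME           READY   STATUS     RESTARTS   AGE
k6-job-4mh5x   0/1     Init:2/3   0          2m
k6-job-9kd2p   0/1     Init:2/3   0          2m
k6-job-f7s8q   0/1     Init:2/3   0          2m
k6-job-tw6zn   0/1     Init:2/3   0          3m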

This process allows us to coordinate the start of any number of pods to generate any volume of traffic we want, with all tests starting at the same second.