Resource allocation strategies

This page lists the deployment strategies I use to run JVM workloads on Kubernetes.

Below you will find three sections describing common ways to deploy JVM workloads. The practices are listed from the most to the least expensive to run, but each strategy comes with its own drawbacks.

The described practices are meant to be “realistic”, cost-optimised ways to run Kubernetes deployments. I favour predictable utilisation over squeezing out maximum load. There are other strategies that argue for never setting CPU limits at all.

When discussing allocated resources, we usually talk about the cost-efficiency of the hardware we allocate, and that’s my base assumption in this post. I think it’s better to invest in a horizontal or vertical autoscaler than to overallocate hardware and leave CPU limits unset: that way your costs are predictable, don’t depend on node sizes and scale linearly with your traffic. When I worked on video encoding in K8s, it definitely made sense to overallocate and set no CPU (and GPU) limits, but I don’t think it’s the best option for businesses that mostly run HTTP servers.
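
If the goal is for costs to track traffic, a horizontal autoscaler is the lever I would reach for first. As a rough sketch (the deployment name and thresholds here are made up for illustration), a CPU-based HorizontalPodAutoscaler can be created imperatively:

# scale my-service between 3 and 12 replicas, targeting ~60% average CPU utilisation
kubectl autoscale deployment my-service --cpu-percent=60 --min=3 --max=12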

Note: CPU cores on a node are a shared resource and allocations are measured in CPU time, not in CPU cycles or cores. When multiple pods are scheduled on the same node, they share CPU time as divided up by the Linux kernel scheduler, most likely the Completely Fair Scheduler (CFS).
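
To see the raw numbers the kernel works with, you can read the CPU controller files from inside a container. A minimal sketch, assuming a cgroup v2 host (paths and file names differ on cgroup v1):

# --cpus=2 should map to a CFS quota of 200000us per 100000us period,
# so this is expected to print "200000 100000"
podman run --memory=1g --cpus=2 -ti openjdk:21-jdk-slim-buster cat /sys/fs/cgroup/cpu.max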

Note: the JVM uses container limits, not requests, to tune its ergonomic features. The self-tuning GC, memory buffers and thread executor pools are sized to fully utilise the limits of the container. Some examples are shown below.
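
You can also ask the JVM what it has detected before trusting the ergonomics. On Linux JDKs with container support this should print an “Operating System Metrics” section with the effective CPU count and memory limit read from the cgroup:

# inspect the limits the JVM detected for this container
podman run --memory=1g --cpus=2 -ti openjdk:21-jdk-slim-buster java -XshowSettings:system -version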

1. Low utilisation

The first strategy is characterised by the lowest and most stable response times, the highest throughput and the best reliability, but it comes at the highest monetary cost.

In this deployment strategy we set all Kubernetes deployment resources at the same level. The CPU request is the same as the CPU limit, and the same applies to memory:

        resources:
          requests:
            cpu: "4000m"
            memory: "4Gi"
          limits:
            cpu: "4000m"
            memory: "4Gi"
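
With requests equal to limits for every container, Kubernetes assigns the pod the Guaranteed QoS class, which can be verified on a running pod (the pod name below is a placeholder):

# expected to print "Guaranteed" when requests == limits for all containers
kubectl get pod my-service-7d4b9c6f5-abcde -o jsonpath='{.status.qosClass}'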

In this strategy we want to establish the peak load (e.g. in req/s) that a single pod can handle. This practice provides us with a deterministic horizontal scaling model. Most of the time the pods will run at low utilisation, but when a spike of requests happens, either naturally or due to escalating failures, the pod will have guaranteed resources to process the incoming traffic. Noisy neighbours aren’t a big problem, as contention is limited to the Linux kernel and I/O.
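
One way to find that peak is to load test a single pod directly while watching its CPU utilisation and response times. A sketch using wrk (the URL, port and parameters are hypothetical, and any load testing tool will do):

# increase -c until p99 latency or the error rate becomes unacceptable; that's the pod's peak
wrk -t4 -c128 -d60s --latency http://10.0.12.34:8080/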

All running services will be over-provisioned most of the time, but under stress they are the least likely to fail because of:

  • spike in number of req/s
  • longer response times of dependencies
  • increased CPU utilisation on the same node
  • fast memory allocation rate on the same node

A failing node will also cause fewer disruptions, because fewer pods fit on a single node. Just make sure that pods of the same deployment don’t all land on one node, as that would create a single point of failure and a complete outage of the service.

Each pod has its own guaranteed CPU and memory, which is very important for most ergonomic features of the JVM and JDK.

Only requests are guaranteed by the scheduler, so setting the limits equal to the requests ensures the JVM can actually use what the limits advertise whenever it needs to.
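
If you prefer not to rely purely on ergonomics, the sizing can also be pinned explicitly to the guaranteed resources. A minimal sketch for the 4 CPU / 4Gi pod above (app.jar is a placeholder and the percentages are illustrative):

# size the heap relative to the container memory limit and tell the JVM
# to size GC, JIT and common pool threads for the 4 guaranteed CPUs
java -XX:InitialRAMPercentage=50 -XX:MaxRAMPercentage=75 -XX:ActiveProcessorCount=4 -jar app.jar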

2. Balanced utilisation

This strategy is characterised by low latency and medium throughput. Hardware utilisation is nondeterministic, as it depends on which pods end up scheduled on the same node.

In this strategy deployments are configured to run with no CPU limits, but requests are lower than in the previous example:

        resources:
          requests:
            cpu: "2000m"
            memory: "4Gi"
          limits:
            memory: "4Gi"

This allows Kubernetes to schedule more pods on a single node, but without guarantees that CPU cycles will be available when they are needed, as pods serving the same functionality might start competing for the same CPU time. It gets even worse if multiple pods of the same deployment are running on a single machine: they will compete with each other, increasing response times and hardware utilisation while reducing reliability.

When a single node holds only pods of different deployments, each of them might find spare CPU cycles to expand into, allowing this strategy to act like a vertical scale-up without actually scaling up vertically.

This strategy isn’t great for JVMs, because with no CPU limit set the ergonomic tuning is driven by what the node exposes. The JVM will think it’s running on a bigger machine than it effectively is, which is likely to cause longer GC pauses.

Consider this output:

podman run --memory=1g --cpus=2 -ti openjdk:21-jdk-slim-buster java -XX:+UseG1GC -XX:+PrintFlagsFinal | grep Threads
     uint ConcGCThreads                            = 1                                         {product} {ergonomic}
     uint G1ConcRefinementThreads                  = 2                                         {product} {ergonomic}
     uint ParallelGCThreads                        = 2                                         {product} {default}
     bool UseDynamicNumberOfCompilerThreads        = true                                      {product} {default}
     bool UseDynamicNumberOfGCThreads              = true                                      {product} {default}
podman run --memory=1g --cpus=7 -ti openjdk:21-jdk-slim-buster java -XX:+UseG1GC -XX:+PrintFlagsFinal | grep Threads
     uint ConcGCThreads                            = 2                                         {product} {ergonomic}
     uint G1ConcRefinementThreads                  = 7                                         {product} {ergonomic}
     uint ParallelGCThreads                        = 7                                         {product} {default}
     bool UseDynamicNumberOfCompilerThreads        = true                                      {product} {default}
     bool UseDynamicNumberOfGCThreads              = true                                      {product} {default}

During GC cycles G1 (and likewise the Parallel collector) will start 7 GC threads, one for each visible CPU (in this case; it isn’t a general rule), but the CPU time might not be available, because other pods are also allowed to exercise those same CPUs. The situation gets even worse if your CPU supports hyper-threading: physically one core that acts like two.
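
One way to keep the ergonomics in line with what is actually requested, rather than with the node, is to override the CPU count the JVM sees. A sketch following the same pattern as the output above (the value 2 matches the CPU request, not the node):

# force GC and compiler thread sizing for 2 CPUs even though more cores are visible
podman run --memory=1g --cpus=7 -ti openjdk:21-jdk-slim-buster \
  java -XX:ActiveProcessorCount=2 -XX:+UseG1GC -XX:+PrintFlagsFinal -version | grep Threads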

3. Higher utilisation, reduced reliability

The cheapest strategy is exactly what it says it is, likely several times cheaper to run than the first strategy. A node will be able to fit many more pods, but their response times will be higher and vary more. Hardware utilisation is very likely to be high, and a disruption on a single node will bring down more pods, causing more widespread outages. The effect of noisy neighbours can be multiplied.

        resources:
          requests:
            cpu: "500m"
            memory: "2Gi"
          limits:
            cpu: "2000m"
            memory: "2Gi"

The JVM will think it’s running with 2 CPUs and tune its ergonomics accordingly, but only 0.5 CPU is guaranteed. Overall P(99) latency will be the worst in this strategy due to competition for CPU time, an increased number of context switches and saturation of network sockets.
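
When latency degrades in this strategy, it is worth checking whether pods are being throttled against their CPU limit. A minimal check from inside a running pod, assuming a cgroup v2 node and a placeholder deployment name (field names differ on cgroup v1):

# nr_throttled and throttled_usec grow every time the pod exhausts its CFS quota
kubectl exec deploy/my-service -- cat /sys/fs/cgroup/cpu.stat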

Single points of failure are even more fragile in this setup.

All the other drawbacks from strategy 2 also apply here.