< home

CPUManagerPolicy Under the Hood

WARNING: There are known issues with this method that might cause disruptions in some edge cases and are yet to be fixed. Here are the bugs that are being tracked.

According to the documentation, when the static CPUManager policy is used, the cpuset cgroup controller assigns specific cores to containers with guaranteed CPU resources. This gives the container exclusive access to those cores, preventing interference from other processes, reducing noisy-neighbor issues, and also ensuring the container itself is not a noisy neighbor. First, a look at the section where this configuration is defined in kubelet-config.json.

[root@ip-w-x-y-z ~]# cat /etc/kubernetes/kubelet/kubelet-config.json
...
  "cpuManagerPolicy": "static",
  "cpuManagerPolicyOptions": {
    "full-pcpus-only": "true"
  },
...
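The full-pcpus-only option makes the static policy allocate only whole physical cores, so a pinned container never shares an SMT sibling thread with another workload. (On the Graviton r7g nodes used here each vCPU is its own physical core, so any integer allocation qualifies.) A minimal sketch of the alignment check, using a made-up 4-core/8-thread SMT topology rather than a real node's:

```python
# Hypothetical topology: physical core id -> its SMT sibling logical CPUs.
# This is an illustrative x86-style layout, not read from a real node.
TOPOLOGY = {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}

def is_full_pcpus(cpuset):
    """Return True if the cpuset covers only whole physical cores."""
    cpus = set(cpuset)
    for siblings in TOPOLOGY.values():
        taken = cpus & set(siblings)
        # Taking some-but-not-all siblings would share a physical core.
        if taken and taken != set(siblings):
            return False
    return True

print(is_full_pcpus([0, 4]))  # whole core 0 -> True
print(is_full_pcpus([0, 1]))  # two half cores -> False
```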

This can be verified by inspecting the processes in question. In this case the process we are concerned about is clickhouse; we can use the crictl ps command to get the container ID of the process.

First, let’s look at a node where CPUManager is not enabled and get the container ID of clickhouse.

[root@ip-a-b-c-d ~]# crictl ps --name clickhouse
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID              POD
29ebb158a0a16       2ac35038b7301       3 hours ago         Running             clickhouse-backup         0                   21cea07c4139a       chi-clickhouse-vcputest-0-0-0
0921f5a6a8ab2       639814aaef121       3 hours ago         Running             clickhouse                0                   21cea07c4139a       chi-clickhouse-vcputest-0-0-0

Notice that 0921f5a6a8ab2 is the ID of the clickhouse container.

[root@ip-a-b-c-d ~]# crictl inspect 0921f5a6a8ab2 | jq .status.resources.linux
{
  "cpuPeriod": "100000",
  "cpuQuota": "500000",
  "cpuShares": "5120",
  "cpusetCpus": "",
  "cpusetMems": "",
  "hugepageLimits": [],
  "memoryLimitInBytes": "8589934592",
  "memorySwapLimitInBytes": "8589934592",
  "oomScoreAdj": "-997",
  "unified": {}
}

Notice in this case, where CPUManager is not enabled, the cpusetCpus field is empty. This means that the container is not restricted to specific cores and can use any available CPU. Also, note the contents of cpu_manager_state.

[root@ip-a-b-c-d ~]# cat /var/lib/kubelet/cpu_manager_state | jq .
{
  "policyName": "none",
  "defaultCpuSet": "",
  "checksum": 1353318690
}
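Even without pinning, the CFS fields in the crictl output encode the pod’s CPU requests and limits: cpuQuota divided by cpuPeriod gives the limit in CPUs, and cpuShares is roughly 1024 per requested CPU. A quick sanity check on the numbers from the output above:

```python
# Values from the crictl inspect output above.
cpu_period = 100_000  # microseconds per CFS period
cpu_quota = 500_000   # runtime allowed per period (hard cap)
cpu_shares = 5_120    # relative weight; ~1024 per requested CPU

limit_cpus = cpu_quota / cpu_period  # CPU limit
request_cpus = cpu_shares / 1024     # CPU request

print(limit_cpus, request_cpus)  # 5.0 5.0
```

Requests equal limits at 5 CPUs, which is what makes this pod Guaranteed; under the none policy it is still throttled to 5 CPUs’ worth of time, but floats across all cores.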

Now, let’s look at a node where CPUManager is enabled and inspect the clickhouse container there.

[root@ip-w-x-y-z ~]# crictl ps --name clickhouse
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
e166cc07430cd       2ac35038b7301       5 hours ago         Running             clickhouse-backup   0                   4f28ab9656be0       chi-clickhouse-pcputest-0-0-0
27f87a744a759       639814aaef121       5 hours ago         Running             clickhouse          0                   4f28ab9656be0       chi-clickhouse-pcputest-0-0-0

Here, 27f87a744a759 is our container of interest.

[root@ip-w-x-y-z ~]# crictl inspect 27f87a744a759 | jq .status.resources.linux
{
  "cpuPeriod": "100000",
  "cpuQuota": "500000",
  "cpuShares": "5120",
  "cpusetCpus": "1-5",
  "cpusetMems": "",
  "hugepageLimits": [],
  "memoryLimitInBytes": "8589934592",
  "memorySwapLimitInBytes": "8589934592",
  "oomScoreAdj": "-997",
  "unified": {}
}

Notice in this case, where CPUManager is enabled, the cpusetCpus field is set to 1-5. This means that the container is restricted to cores 1-5, and those cores are dedicated to it. The cpu_manager_state reflects the same assignment.

[root@ip-w-x-y-z ~]# cat /var/lib/kubelet/cpu_manager_state | jq .
{
  "policyName": "static",
  "defaultCpuSet": "0,6-7",
  "entries": {
    "d8d68b09-ce62-455c-88ba-57050e4169f0": {
      "clickhouse": "1-5"
    }
  },
  "checksum": 2344224312
}
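Under the static policy, the exclusive sets in entries and the defaultCpuSet should partition the node’s CPUs. A small sketch that expands the cpuset strings from the state above and checks they don’t overlap (the 8 CPUs total match the r7g.2xlarge used here):

```python
import json

# The cpu_manager_state content shown above.
state = json.loads("""{
  "policyName": "static",
  "defaultCpuSet": "0,6-7",
  "entries": {"d8d68b09-ce62-455c-88ba-57050e4169f0": {"clickhouse": "1-5"}},
  "checksum": 2344224312
}""")

def expand(cpuset):
    """Expand a cpuset string like '0,6-7' into a set of CPU ids."""
    cpus = set()
    for part in cpuset.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))
        elif part:
            cpus.add(int(part))
    return cpus

default = expand(state["defaultCpuSet"])
exclusive = set()
for containers in state["entries"].values():
    for cpuset in containers.values():
        exclusive |= expand(cpuset)

print(sorted(default))      # [0, 6, 7]
print(sorted(exclusive))    # [1, 2, 3, 4, 5]
print(default & exclusive)  # set() -- exclusive cores leave the shared pool
```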

What is /var/lib/kubelet/cpu_manager_state?

The /var/lib/kubelet/cpu_manager_state file is a key component of Kubernetes’ CPU management system. By inspecting this file, you can verify that CPU allocations are being handled correctly and consistently. When combined with containerd’s implementation of CPU pinning via cgroups, this ensures that your Pods receive the exclusive CPU resources they require, especially when using the full-pcpus-only policy option. This file also persists the state of CPU allocations across kubelet restarts, keeping CPU assignments consistent.

Verifying pcpu is not shared

To verify the CPUs provided to the guaranteed pod’s clickhouse container, we did the following experiment:

  1. Created the ClickHouse pod with guaranteed resources
  2. Created a burstable pod running stress-ng container utilizing cores beyond the limit set

In this case the expected behavior is that the burstable pod’s stress-ng container will not be able to utilize the CPU cores allocated to the guaranteed pod’s clickhouse container.

Here is the /var/lib/kubelet/cpu_manager_state on that node.

[root@ip-w-x-y-z ~]# cat /var/lib/kubelet/cpu_manager_state | jq .
{
  "policyName": "static",
  "defaultCpuSet": "0,4-7",
  "entries": {
    "54b3c938-b248-4373-992d-adec7ea2692a": {
      "clickhouse": "1-3"
    }
  },
  "checksum": 1600324503
}

And here is the pod YAML for the stress-ng pod.

apiVersion: v1
kind: Pod
metadata:
  name: stress-ng
  namespace: stress-test
spec:
  nodeSelector:
      node.kubernetes.io/instance-type: r7g.2xlarge
      karpenter.sh/nodepool: arm-cpumanager-nodepool
      kubernetes.io/hostname: ip-w-x-y-z.ec2.internal
  containers:
  - name: stress-ng
    image: alpine:latest
    command:
    - sh
    - -c
    - apk add --no-cache stress-ng && stress-ng --cpu 4 --timeout 300s
    resources:
      limits:
        cpu: 3
        memory: 8Gi
      requests:
        cpu: 2
        memory: 8Gi
  restartPolicy: Never
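The static policy only gives exclusive cores to containers in Guaranteed pods that request integer CPUs; this stress-ng pod requests 2 CPUs but limits 3, so it is Burstable and stays on the shared pool. A much-simplified sketch of the QoS classification (the real kubelet logic examines every container and resource in the pod):

```python
def qos_class(requests, limits):
    """Simplified QoS: Guaranteed iff requests == limits for all resources;
    BestEffort iff neither is set; otherwise Burstable."""
    if not requests and not limits:
        return "BestEffort"
    if requests == limits:
        return "Guaranteed"
    return "Burstable"

# Resources from the stress-ng pod spec above.
print(qos_class({"cpu": 2, "memory": "8Gi"},
                {"cpu": 3, "memory": "8Gi"}))  # Burstable
```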

Observing with htop, we noticed that the stress-ng container was not able to utilize the CPU cores allocated to the guaranteed pod’s clickhouse container.

Stress-NG Outside PCPU

We also flipped the experiment to check whether a container inside a guaranteed pod uses only the CPUs it is provided via cpuset. So we ran stress-ng as a guaranteed pod.

apiVersion: v1
kind: Pod
metadata:
  name: stress-ng
  namespace: stress-test
spec:
  nodeSelector:
      node.kubernetes.io/instance-type: r7g.2xlarge
      karpenter.sh/nodepool: arm-cpumanager-nodepool
      kubernetes.io/hostname: ip-w-x-y-z.ec2.internal
  containers:
  - name: stress-ng
    image: alpine:latest
    command:
    - sh
    - -c
    - apk add --no-cache stress-ng && stress-ng --cpu 4 --timeout 300s
    resources:
      limits:
        cpu: 3
        memory: 4Gi
      requests:
        cpu: 3
        memory: 4Gi
  restartPolicy: Never

Let’s check the CPU assignment for this container.

[root@ip-w-x-y-z ~]# cat /var/lib/kubelet/cpu_manager_state | jq .
{
  "policyName": "static",
  "defaultCpuSet": "0,4-7",
  "entries": {
    "c2748791-0c91-4c75-9adb-cbc9d7cd1855": {
      "stress-ng": "1-3"
    }
  },
  "checksum": 1477588470
}
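Besides cpu_manager_state, the effective affinity of any process can be read from the Cpus_allowed_list field in /proc/&lt;pid&gt;/status. A sketch that parses that field from a sample line (for the pinned stress-ng workers on this node the live value would read 1-3):

```python
# A sample line as it would appear in /proc/<pid>/status on the node
# for a process pinned to cores 1-3 (illustrative, not captured live).
sample = "Cpus_allowed_list:\t1-3\n"

def allowed_cpus(status_text):
    """Extract the Cpus_allowed_list value from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return line.split(":", 1)[1].strip()
    return None

print(allowed_cpus(sample))  # 1-3
```

On the node itself, the same check is `grep Cpus_allowed_list /proc/<pid>/status` against the stress-ng worker PIDs.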

Here is the htop output; notice that only the assigned CPUs are used.

Stress-NG Inside PCPU