This repository contains a Proof of Concept (POC) designed to demonstrate and resolve severe performance degradation and Out-of-Memory (OOM) fragmentation for C++ High-Performance Computing (HPC) applications running on OpenShift.
In addition, refer to the OpenShift CPU Pinning & Performance Benchmark Report document, which describes the benchmark approach.
The core of the problem is the "Container CPU Illusion." In a pre-Kubernetes environment, applications relied on standard OS interfaces like std::thread::hardware_concurrency() or sysconf(_SC_NPROCESSORS_ONLN) to dictate thread pool sizes. Inside a container, these functions pierce the container abstraction and return the physical host's total core count rather than the pod's cgroup limits.
On modern, dense OpenShift nodes, this illusion causes massive thread hyper-inflation. An application limited to 2 CPUs might detect 64 physical cores and spawn hundreds of threads. This leads to catastrophic Completely Fair Scheduler (CFS) throttling, severe L3 cache invalidation, and OOM crashes driven by memory arena fragmentation.
This POC addresses the issue at two levels:
- Application Layer (Primary): A C++ drop-in solution that accurately parses Linux cgroup v1/v2 boundaries to safely scale thread pools, combined with an advanced hooking technique for legacy libraries.
- Infrastructure Layer (Secondary): OpenShift cluster tuning using the static CPU Manager policy and NUMA alignment to achieve absolute bare-metal performance.
├── deployment.yaml # Kubernetes deployment configuring Guaranteed QoS and glibc limits
├── Dockerfile # Multi-stage build for the POC and offline stress-ng installation
├── epel-release-latest-9.noarch.rpm # EPEL repository for UBI 9
├── hpc-kubeletconfig.yaml # Kubelet configuration to enforce static CPU manager policy
├── hpc-mcp.yaml # Custom MachineConfigPool for HPC worker isolation
├── hpc-profile.yaml # PerformanceProfile for CPU reservation, isolation, and NUMA alignment
├── Judy-1.0.5-28.fc36.x86_64.rpm # stress-ng dependency (Fedora 36)
├── hpc_app # Compiled C++ binary (entrypoint)
├── poc.cpp # C++ source code with cgroup parsing
├── hpc_adv_app # Compiled C++ binary (may run from pod)
├── pocadv.cpp # C++ source code with cgroup parsing and sysconf hooking
└── stress-ng-0.14.02-1.fc36.x86_64.rpm # stress-ng binary for UBI 9 / RHEL 9 (Fedora 36)
The primary solution to the CPU illusion is to replace naive hardware detection with a container-aware algorithm. The provided poc.cpp directly interrogates the Linux cgroup pseudo-filesystem to mathematically deduce the exact CPU limits imposed by OpenShift.
The code uses a multi-tiered heuristic approach:
- Cgroup v2: It attempts to read /sys/fs/cgroup/cpu.max, dividing the quota (first field) by the period (second field) to get the core limit. These fields are the v2 equivalents of cfs_quota_us and cfs_period_us.
- Cgroup v1 (Fallback): If v2 is absent, it reads /sys/fs/cgroup/cpu/cpu.cfs_quota_us and /sys/fs/cgroup/cpu/cpu.cfs_period_us.
- OS Affinity (Fallback): It checks sched_getaffinity to see if the container runtime has actively pinned the process to specific cores.
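The two fallback tiers described above can be sketched as follows. This is an illustration under stated assumptions, not the code in poc.cpp: the helper names cpus_from_quota, read_ll, cgroup_v1_cpu_limit and affinity_cpu_count are hypothetical, and a return value of -1 is used here to mean "no limit detected, try the next tier".

```cpp
#include <sched.h>
#include <cmath>
#include <fstream>
#include <string>

// Convert a CFS quota/period pair into a whole-core limit, rounding up.
// Returns -1 when no quota applies (cgroup v1 reports an unlimited quota as -1).
int cpus_from_quota(long long quota_us, long long period_us) {
    if (quota_us <= 0 || period_us <= 0) return -1;
    return static_cast<int>(std::ceil(static_cast<double>(quota_us) / period_us));
}

// Read one integer from a pseudo-file; returns fallback if it is missing or unreadable.
long long read_ll(const std::string& path, long long fallback) {
    std::ifstream f(path);
    long long v = 0;
    return (f >> v) ? v : fallback;
}

// Tier 2: cgroup v1 CFS quota and period files.
int cgroup_v1_cpu_limit() {
    return cpus_from_quota(read_ll("/sys/fs/cgroup/cpu/cpu.cfs_quota_us", -1),
                           read_ll("/sys/fs/cgroup/cpu/cpu.cfs_period_us", -1));
}

// Tier 3: number of CPUs in the calling thread's affinity mask.
int affinity_cpu_count() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    return CPU_COUNT(&set);
}
```

Rounding up means a fractional limit such as 1.5 CPUs yields 2 threads. Rounding down would be the more conservative choice if throttling matters more than utilisation.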
Code Highlight: Core Detection Logic:
    // Tier 1a: Parse cgroup v2 cpu.max for CFS quotas
    std::ifstream cpu_max_file("/sys/fs/cgroup/cpu.max");
    if (cpu_max_file.is_open()) {
        std::string quota_str, period_str;
        cpu_max_file >> quota_str >> period_str;
        // "max" means no quota is set; skip so the next tier can run.
        // Both fields must be present, otherwise std::stoll would throw.
        if (quota_str != "max" && !quota_str.empty() && !period_str.empty()) {
            long long quota = std::stoll(quota_str);
            long long period = std::stoll(period_str);
            if (period > 0) return static_cast<int>(std::ceil(static_cast<double>(quota) / period));
        }
    }

Compilation command:

    g++ -std=c++11 -O3 poc.cpp -o hpc_app

While updating your own thread pool logic is straightforward, many C++ applications rely on third-party libraries that suffer from the same CPU illusion. A prominent example is older versions of the Google gRPC C++ library (e.g., v1.10.1).
As thoroughly documented by the open-source community, gRPC dynamically scales its gpr_executor using the host's physical core count, following the rule 2 * sysconf(_SC_NPROCESSORS_ONLN). In OpenShift, this results in a sudden explosion of hundreds (or even thousands) of executor threads, causing heavy lock contention and a death spiral of latency.
Note: Issue #7747 and Issue #16557
For production systems where upgrading these older libraries is not possible, the POC implements a highly effective workaround: C++ Symbol Interposition.
By defining our own global sysconf function and forwarding to the real one through dlsym(RTLD_NEXT, "sysconf"), we intercept the hardware query before it reaches libc, feeding the legacy library the dynamically parsed cgroup limit instead.
Code Highlight: The sysconf Override
    #include <dlfcn.h>
    #include <unistd.h>

    // Intercept sysconf queries globally
    extern "C" {
    long sysconf(int name) {
        if (name == _SC_NPROCESSORS_ONLN) {
            static int cached_cpu_limit = -1;
            static bool initialized = false;
            // Parse /sys/fs/cgroup/cpu.max exactly once to save I/O overhead
            if (!initialized) {
                cached_cpu_limit = detect_container_cpu_limit();
                initialized = true;
            }
            // Trick gRPC/legacy libraries into seeing only the pod's limit
            if (cached_cpu_limit > 0) {
                return cached_cpu_limit;
            }
        }
        // Fallback to the real libc sysconf for all other queries
        static long (*real_sysconf)(int) = nullptr;
        if (!real_sysconf) {
            real_sysconf = (long (*)(int))dlsym(RTLD_NEXT, "sysconf");
        }
        return real_sysconf(name);
    }
    }
C++ Compilation and gRPC Hooking Requirements

The pocadv.cpp file contains a custom extern "C" { long sysconf(int name) } block. This leverages dlsym(RTLD_NEXT) to intercept hardware queries made by the gRPC library, feeding it the detected OpenShift cgroup limit instead.

Crucial Linking Rules:
- You must compile this dynamically. Fully static binaries (-static) do not invoke the dynamic linker at runtime, which will cause RTLD_NEXT to fail and crash the application.
- You must link the dynamic linking library (-ldl).

Compilation Command:

    g++ -std=c++11 -O3 pocadv.cpp -o hpc_adv_app -ldl -pthread

(If using CMake, ensure ${CMAKE_DL_LIBS} is added to your target_link_libraries).
The included Dockerfile utilizes a Red Hat Universal Base Image (UBI 9). It bypasses subscription requirements by installing the necessary stress-ng RPMs and EPEL repositories locally from the project tree.
Build and Push:
    podman build -t docker.io/<your-username>/hpc-poc:v1 .
    podman push docker.io/<your-username>/hpc-poc:v1

If advanced profiling tools like perf are unavailable inside the container, you can extract native thread behavior, CPU affinity, and scheduler wait times directly from the Linux /proc filesystem.
Exec into your running pod (oc exec -it <pod-name> -- /bin/bash) and run the following bash loop to generate a clean thread diagnostic report:
    # PID of the process to inspect. 1 is normally the container entrypoint;
    # /proc/self would report on this interactive shell instead of the application.
    pid=1
    for tid in /proc/$pid/task/*; do
        name=$(awk '/^Name:/ {print $2}' "$tid/status" 2>/dev/null)
        state=$(awk '/^State:/ {print $2, $3}' "$tid/status" 2>/dev/null)
        runtime=$(awk '/se\.sum_exec_runtime/ {print $3}' "$tid/sched" 2>/dev/null)
        if [ -n "$name" ]; then
            echo "TID: $(basename "$tid") | Name: $name | State: $state | CPU Time: $runtime"
        fi
    done

Note: Use grep "^Cpus_allowed_list:" /proc/$pid/task/*/status to verify if OpenShift has successfully pinned the threads to specific cores.
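The Cpus_allowed_list field uses the kernel's range-list notation (for example 4-9 or 0-3,8,10-11), so counting the pinned cores requires expanding the ranges. The following is a minimal sketch of such a parser. The function name count_cpus_in_list is hypothetical and not part of the POC sources, and the input is assumed to be well-formed kernel output.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

// Count the CPUs named in a kernel range list such as "0-3,8,10-11".
// Assumes well-formed input as printed in /proc/<pid>/status.
int count_cpus_in_list(const std::string& list) {
    int total = 0;
    std::stringstream ss(list);
    std::string item;
    while (std::getline(ss, item, ',')) {
        // Drop whitespace such as the trailing newline of the status line.
        item.erase(std::remove_if(item.begin(), item.end(),
                                  [](unsigned char c) { return std::isspace(c) != 0; }),
                   item.end());
        if (item.empty()) continue;
        std::size_t dash = item.find('-');
        if (dash == std::string::npos) {
            total += 1;                                   // single CPU, e.g. "8"
        } else {
            int lo = std::stoi(item.substr(0, dash));     // range start
            int hi = std::stoi(item.substr(dash + 1));    // range end (inclusive)
            if (hi >= lo) total += hi - lo + 1;
        }
    }
    return total;
}
```

For a pod pinned to six exclusive cores, a value such as 4-9 expands to a count of 6, which should match the pod's CPU limit.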
In the deployment examples below, the pod is configured to request exactly 6 CPUs. This number is not arbitrary. When configuring OpenShift for High-Performance Computing, you cannot allocate 100% of a node's physical cores to a workload.
To provide reliable scheduling and prevent node resource overcommitment, OpenShift subtracts a reserved portion of CPU and memory from the node's total capacity to calculate the "Allocatable" resources. These reservations (system-reserved and kube-reserved) ensure that the operating system, kubelet, CRI-O runtime, and network routing remain stable under heavy load.
For instance, on a modern node with 16 physical cores, reserving 4 cores for platform overhead leaves exactly 12 allocatable cores for user workloads. Requesting 6 CPUs per pod lets two pods (2 x 6 = 12) fit exactly within the node's allocatable capacity without starving the OpenShift platform components.
Note: For detailed calculations, refer to "Allocating resources for nodes in an OpenShift Container Platform cluster" in the OpenShift documentation.
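The sizing arithmetic above can be written down as a small sketch. The helper names allocatable_cpus and pods_that_fit are hypothetical. The model deliberately ignores memory and any other resource, and treats the reservation as a whole number of cores.

```cpp
// Allocatable CPUs = node capacity minus the cores reserved for the platform.
int allocatable_cpus(int capacity, int reserved) {
    return capacity > reserved ? capacity - reserved : 0;
}

// Number of identically sized pods that fit into the allocatable CPUs.
int pods_that_fit(int allocatable, int cpus_per_pod) {
    return cpus_per_pod > 0 ? allocatable / cpus_per_pod : 0;
}
```

With the example figures from the text, allocatable_cpus(16, 4) gives 12 and pods_that_fit(12, 6) gives 2. A request of 7 CPUs per pod would leave room for only one pod and strand 5 cores.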
First, we establish a performance baseline on standard worker nodes where threads float across all available cores, inducing context-switching overhead.
- Label the target nodes:

      oc label node <worker-node-1> hpc-dedicated=true
      oc label node <worker-node-2> hpc-dedicated=true

- Deploy the application:
  Ensure deployment.yaml sets both CPU requests and limits to 6.

      oc apply -f deployment.yaml

- Execute the Baseline Test:
  Exec into the pod, ensure you are in the /tmp directory, and run the matrix math stressor:

      oc exec -it <pod-name> -- /bin/bash
      cd /tmp
      stress-ng --cpu 6 --cpu-method matrixprod --timeout 60s --metrics-brief

- Record the Results:
  Note the bogo ops/s output. This represents the application's throughput under standard CFS scheduling. Read the full OpenShift CPU Pinning & Performance Benchmark Report document.
We will now use the Node Tuning Operator to isolate the CPUs, apply the static CPU manager policy, and enforce NUMA alignment.
Note: For detailed calculations, refer to the following resources:
- Allocating specific CPUs for nodes in a cluster
- Using CPU Manager and Topology Manager
- Scheduling NUMA-aware workloads
- Sample Performance Profile
- Assign the HPC role to the dedicated nodes:

      oc label node -l hpc-dedicated=true node-role.kubernetes.io/hpc=""

- Apply the Custom Machine Config Pool (MCP):

      oc apply -f hpc-mcp.yaml

- Apply the Performance Profile & Kubelet Config:
  This reserves cores 0-3 for the OS/Master processes and isolates the remaining cores for the HPC workload.

      oc apply -f hpc-profile.yaml
      oc apply -f hpc-kubeletconfig.yaml

- Wait for Reboots:
  The Machine Config Operator will drain and reboot the nodes. Wait until the hpc pool is fully updated:

      oc get mcp hpc -w

Once the nodes reboot, OpenShift will recognize the pod's Guaranteed QoS class (requests == limits = 6) and permanently pin the container to 6 isolated hardware cores.
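The condition under which the static CPU Manager policy grants exclusive cores can be expressed as a small check. This is a sketch of the CPU side of the rule only. The function name gets_exclusive_cpus is hypothetical, and Guaranteed QoS additionally requires memory requests to equal memory limits, which is not modelled here.

```cpp
// CPU quantities in millicores, as Kubernetes expresses them (1000m = 1 CPU).
// Under the static policy, a container receives exclusive cores only when its
// CPU request equals its limit and the value is a whole number of CPUs.
bool gets_exclusive_cpus(long request_millicores, long limit_millicores) {
    return request_millicores > 0 &&
           request_millicores == limit_millicores &&
           request_millicores % 1000 == 0;
}
```

A request and limit of 6000m qualifies. A value of 6500m, or a request lower than the limit, leaves the container in the shared pool, where its threads float across the non-exclusive cores.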
- Verify the Hardware Lock:
  Check the application logs to see the output from our C++ sysconf override probe:

      oc logs deployment/hpc-poc

  It should successfully report OpenShift/K8s 'static' policy (Exclusive Pinned Cores) and an affinity mask of exactly 6 cores.

- Execute the Final Benchmark:
  Exec into the pod again and run the exact same test:

      oc exec -it <pod-name> -- /bin/bash
      cd /tmp
      stress-ng --cpu 6 --cpu-method matrixprod --timeout 60s --metrics-brief

- Compare Results:
  Compare the new bogo ops/s metric against the untuned baseline. The throughput should be significantly higher due to the elimination of CPU cache invalidation, context-switching, and CFS throttling. Refer to the OpenShift CPU Pinning & Performance Benchmark Report document.