Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Minor
Description
Some (non-Lustre) filesystems consume 100% of the CPU cycles on one or more cores, busy-waiting by polling for event completion, and run at a high scheduling priority (not sure if "real time" or not). They also apparently configure the CPU scheduler to deny scheduling of other processes on some CPU cores.
The ptlrpcd threads handle RPC sending and receiving. They are normally distributed evenly across cores and bound to each NUMA domain to minimize cross-CPU memory traffic when a well-distributed application workload (e.g. a multi-threaded computational job) is running on the system and allocating and dirtying data pages evenly across all of the NUMA domains. In some cases, where the number of cores is larger than the number of active application threads, it is advantageous for ptlrpcd threads on other CPU cores to take over the RPC processing, in order to offload CPU-intensive tasks like checksums, compression, and encryption onto cores that are otherwise under-utilized.
At no time do ptlrpcd (or other Lustre service) threads exclusively utilize or busy-wait on CPU cores, nor do they prevent application threads from using those cores when they are not actively processing requests on behalf of the application.
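For reference, the CPU binding of the ptlrpcd threads can be inspected from userspace with standard tools; a minimal sketch (thread names and the exact affinity output vary by Lustre version and distribution):

# list ptlrpcd threads and the CPUs they are permitted to run on
for pid in $(pgrep ptlrpcd); do
    printf '%-20s ' "$(ps -o comm= -p "$pid")"
    taskset -pc "$pid"
done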
However, if ptlrpcd threads are started on cores in a NUMA node and then try to process RPCs, they can stall when threads on that NUMA domain cannot be scheduled for lengthy periods of time. This causes intermittently laggy RPC handling when those threads are processing a time-sensitive RPC.
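The offending busy-waiting threads, and the cores they occupy, can usually be identified with ps; an illustrative example (the column selection is only a suggestion, and the thread names depend on the other filesystem):

# show scheduling class, RT priority, last CPU, and CPU usage of the busiest threads
ps -eLo pid,tid,cls,rtprio,psr,pcpu,comm --sort=-pcpu | head -20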
To work around this issue, we used lscpu to determine the NUMA configuration of the CPUs installed and then created a CPT configuration that avoided scheduling the ptlrpcd threads on cores that had been taken over by the other filesystem:
# lscpu | grep NUMA
NUMA:
  NUMA node(s):        2
  NUMA node0 CPU(s):   0-63,128-191
  NUMA node1 CPU(s):   64-127,192-255
In the /etc/modprobe.d/lustre.conf file, the following lines were added to restrict the Lustre CPU Partition Table (CPT) to the last 8 cores (of 64) in each of the 4 NUMA nodes, avoiding the other filesystem that was heating up the first two cores on each of the NUMA nodes:
options libcfs cpu_npartitions=4
options libcfs cpu_pattern="0[56-63] 1[120-127] 2[184-191] 3[248-255]"
options ptlrpcd max_ptlrpcds=64
That allows those threads to run on 32 different cores, with a maximum of 16 threads (max_ptlrpcds=64 divided across the 4 partitions) running on the 8 cores in each NUMA node.
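Once the modules are reloaded, the resulting layout can be sanity-checked; a rough sketch, assuming the usual module-parameter and LNet debugfs paths (these may differ by Lustre version):

# confirm the module options took effect (paths are assumptions; adjust as needed)
cat /sys/module/libcfs/parameters/cpu_npartitions
cat /sys/module/libcfs/parameters/cpu_pattern
# per-CPT core map as seen by LNet
cat /sys/kernel/debug/lnet/cpu_partition_table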
However, this is only a workaround, as specifying cpu_pattern and cpu_npartitions is relatively complex and CPU-specific, and the values likely need to differ between systems within the same cluster. It would be better to have a more flexible mechanism to avoid this issue.
One option is to add an exclude pattern to the libcfs cpu_pattern option that avoids the specified cores when configuring the CPT map, something like the following to exclude two cores in each of the two NUMA nodes:
options libcfs cpu_pattern="X[0-1] X[64-65]"
That allows a relatively simple (and mostly universal) option to avoid e.g. core0 and core1 on all machines, without having to know the full NUMA configuration details of each one. To exclude entire NUMA nodes, a syntax like the following could be used:
options libcfs cpu_pattern="N X[0-1]"
which would mean "exclude all of the cores in NUMA node0 and node1", to be aligned with the "N 0[0-1]" definition, which means "include all of the cores in NUMA node0 and node1 into CPT0".
To exclude specific cores in each NUMA node, an option like the following could be used:
options libcfs cpu_pattern="N C[0-1]"
to exclude the first two cores on each NUMA domain. The meaning of "X" and "C" would be identical if "N" is not specified. Possibly it also makes sense to allow "N C[-2,-1]" to exclude the last two cores on each NUMA node, in case that is needed at some point?
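For illustration, the proposed exclude forms discussed above could be summarized as follows; all of these are hypothetical, since none of this syntax is implemented yet:

# exclude absolute cores 0-1 and 64-65:
options libcfs cpu_pattern="X[0-1] X[64-65]"
# exclude all of the cores in NUMA node0 and node1:
options libcfs cpu_pattern="N X[0-1]"
# exclude cores 0-1 within every NUMA node:
options libcfs cpu_pattern="N C[0-1]"
# possibly: exclude the last two cores of every NUMA node:
options libcfs cpu_pattern="N C[-2,-1]"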
Having an exclude list for cores would also be an easy way to reserve CPU cores for userspace threads running on server nodes (e.g. HA (Corosync/Pacemaker), monitoring, logging, sshd, etc.).
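As a complementary step on the userspace side (independent of Lustre, shown only as an illustration; the unit name and core list are assumptions), the reserved cores could then be handed to the HA and monitoring daemons via systemd CPU affinity:

# /etc/systemd/system/corosync.service.d/cpuaffinity.conf (illustrative)
[Service]
CPUAffinity=0 1 64 65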
A further improvement would be to dynamically detect when the CPU scheduler has been configured to avoid scheduling processes on a particular core, and/or to dynamically detect when ptlrpcd is unable to be scheduled on a core, and then avoid using that core entirely (probably with a console message to that effect), similar to CPU hot-unplug. Dynamic exclusion/load detection is more complex to implement, but it would avoid the need to statically configure nodes at all, and would work around the breakage introduced by other filesystems.
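Part of the static-exclusion case could already be detected from existing kernel interfaces; a small sketch (sysfs paths as on recent kernels; the files are empty when no cores are isolated):

# cores that the scheduler keeps out of the general scheduling domains
cat /sys/devices/system/cpu/isolated
cat /sys/devices/system/cpu/nohz_full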