Details
Bug
Resolution: Fixed
Minor
Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.5
None
lustre 2.1.3
bullxlinux 6.1 (based on redhat 6.1)
machine with 32 CPUs
3
6141
Description
Stopping the ptlrpcd threads takes more than one second per thread. On hardware with a large number of cores, this makes unloading the ptlrpc module take a few minutes.
    # lscpu | grep "^CPU(s)"
    CPU(s): 32
    # ps -ef | grep ptlrpcd
    root 7301 2 0 10:58 ? 00:00:00 [ptlrpcd_rcv]
    root 7302 2 0 10:58 ? 00:00:00 [ptlrpcd_0]
    root 7303 2 0 10:58 ? 00:00:00 [ptlrpcd_1]
    root 7304 2 0 10:58 ? 00:00:00 [ptlrpcd_2]
    root 7305 2 0 10:58 ? 00:00:00 [ptlrpcd_3]
    root 7306 2 0 10:58 ? 00:00:00 [ptlrpcd_4]
    root 7307 2 0 10:58 ? 00:00:00 [ptlrpcd_5]
    root 7308 2 0 10:58 ? 00:00:00 [ptlrpcd_6]
    root 7309 2 0 10:58 ? 00:00:00 [ptlrpcd_7]
    root 7310 2 0 10:58 ? 00:00:00 [ptlrpcd_8]
    root 7311 2 0 10:58 ? 00:00:00 [ptlrpcd_9]
    root 7312 2 0 10:58 ? 00:00:00 [ptlrpcd_10]
    root 7313 2 0 10:58 ? 00:00:00 [ptlrpcd_11]
    root 7314 2 0 10:58 ? 00:00:00 [ptlrpcd_12]
    root 7315 2 0 10:58 ? 00:00:00 [ptlrpcd_13]
    root 7316 2 0 10:58 ? 00:00:00 [ptlrpcd_14]
    root 7317 2 0 10:58 ? 00:00:00 [ptlrpcd_15]
    root 7318 2 0 10:58 ? 00:00:00 [ptlrpcd_16]
    root 7319 2 0 10:58 ? 00:00:00 [ptlrpcd_17]
    root 7320 2 0 10:58 ? 00:00:00 [ptlrpcd_18]
    root 7321 2 0 10:58 ? 00:00:00 [ptlrpcd_19]
    root 7322 2 0 10:58 ? 00:00:00 [ptlrpcd_20]
    root 7323 2 0 10:58 ? 00:00:00 [ptlrpcd_21]
    root 7324 2 0 10:58 ? 00:00:00 [ptlrpcd_22]
    root 7325 2 0 10:58 ? 00:00:00 [ptlrpcd_23]
    root 7326 2 0 10:58 ? 00:00:00 [ptlrpcd_24]
    root 7327 2 0 10:58 ? 00:00:00 [ptlrpcd_25]
    root 7328 2 0 10:58 ? 00:00:00 [ptlrpcd_26]
    root 7329 2 0 10:58 ? 00:00:00 [ptlrpcd_27]
    root 7330 2 0 10:58 ? 00:00:00 [ptlrpcd_28]
    root 7331 2 0 10:58 ? 00:00:00 [ptlrpcd_29]
    root 7332 2 0 10:58 ? 00:00:00 [ptlrpcd_30]
    root 7333 2 0 10:58 ? 00:00:00 [ptlrpcd_31]
    # time modprobe -r ptlrpc
    real 1m7.204s
    user 0m0.000s
    sys 0m0.030s
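As a rough check of the per-thread figure, assume (a simplification, not something the measurement proves) that the whole `modprobe -r` wall-clock time is spent stopping the 33 ptlrpcd threads listed above (`ptlrpcd_rcv` plus `ptlrpcd_0` to `ptlrpcd_31`) one after another. The average cost per thread is then about two seconds:

$$
t_{\text{thread}} \approx \frac{67.204\,\text{s}}{33} \approx 2.04\,\text{s}
$$

Note that `sys` is only 0.030 s, so the threads spend almost all of that time waiting rather than computing.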
Hello Gregoire, I agree with Andreas: there must be something else to explain the rmmod taking more than one minute.
By the way, the ptlrpcd threads, or at least groups of them, should execute in parallel on multiple cores (depending on their scheduling policy), so the time you measure should be the maximum over all execution sets, which looks too long to me.
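Whether the total is the maximum or the sum of the per-thread times depends on how shutdown is sequenced, not only on whether the threads can run on separate cores. The following user-space sketch uses Python threads as a stand-in for ptlrpcd threads. It is not Lustre code; `SHUTDOWN_DELAY`, `start_workers`, `stop_serially` and `stop_in_two_phases` are hypothetical names, and whether ptlrpcd shutdown in this Lustre version stops threads one at a time is an assumption, not something established here.

```python
import threading
import time

# Hypothetical per-thread shutdown latency, in seconds (stand-in for the
# time a ptlrpcd thread needs to notice the stop request and exit).
SHUTDOWN_DELAY = 0.05


def worker(stop: threading.Event) -> None:
    stop.wait()                 # idle until asked to stop
    time.sleep(SHUTDOWN_DELAY)  # simulate the latency before the thread exits


def start_workers(n: int) -> list[tuple[threading.Event, threading.Thread]]:
    """Start n workers, each with its own stop flag."""
    workers = []
    for _ in range(n):
        ev = threading.Event()
        t = threading.Thread(target=worker, args=(ev,))
        t.start()
        workers.append((ev, t))
    return workers


def stop_serially(workers) -> float:
    """Signal one thread and wait for it before the next: total ~ sum of delays."""
    start = time.monotonic()
    for ev, t in workers:
        ev.set()
        t.join()
    return time.monotonic() - start


def stop_in_two_phases(workers) -> float:
    """Signal every thread first, then join them all: total ~ max of delays."""
    start = time.monotonic()
    for ev, _ in workers:
        ev.set()
    for _, t in workers:
        t.join()
    return time.monotonic() - start


if __name__ == "__main__":
    n = 8
    serial = stop_serially(start_workers(n))
    two_phase = stop_in_two_phases(start_workers(n))
    print(f"serial: {serial:.2f}s, two-phase: {two_phase:.2f}s")
```

With 33 threads and a per-thread delay of about 2 s, the serial pattern would predict roughly the 67 s measured in the description, while the two-phase pattern would predict about 2 s regardless of how many threads there are.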
Can you reproduce the problem, making sure that full debug traces are enabled beforehand and that the rmmod timing period is delimited with BEGIN/END markers?
A ps output showing the ptlrpcd pids would be helpful too.