Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.5.0, Lustre 2.4.2
- None
- 3
- 12210
Description
We have some Lustre clients where hyperthreading is toggled on and off, possibly on a per-job basis. The admins are seeing streams of scary messages from Lustre on the console:
2013-12-02 09:58:29 LNet: 5546:0:(linux-cpu.c:1035:cfs_cpu_notify()) Lustre: can't support CPU hotplug well now, performance and stability could be impacted[CPU 40 notify: 3]
2013-12-02 09:58:29 LNet: 5546:0:(linux-cpu.c:1035:cfs_cpu_notify()) Skipped 30 previous similar messages
2013-12-02 09:58:29 Booting Node 0 Processor 40 APIC 0x1
2013-12-02 09:58:30 microcode: CPU40 sig=0x206f2, pf=0x4, revision=0x37
2013-12-02 09:58:30 platform microcode: firmware: requesting intel-ucode/06-2f-02
2013-12-02 09:58:30 Booting Node 0 Processor 41 APIC 0x3
The above message is not acceptable. Please fix.
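For reference, taking a hyperthread sibling offline and bringing it back online through the standard Linux sysfs hotplug interface is enough to fire CPU-hotplug notifiers such as cfs_cpu_notify() and reproduce the message stream. A minimal sketch (CPU number 40 is taken from the log above; the writes require root, so they are guarded here):

```shell
# Inspect which CPUs are currently online (standard Linux sysfs interface).
cat /sys/devices/system/cpu/online

# Offline then re-online one hyperthread sibling; each transition invokes
# registered CPU-hotplug notifiers in loaded modules (root required).
cpu=/sys/devices/system/cpu/cpu40/online
if [ -w "$cpu" ]; then
    echo 0 > "$cpu"
    echo 1 > "$cpu"
fi
```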
Further, when I went to look into how this CPU partitions code worked, I wound up mighty confused. For instance, on a node with 4 sockets and 10 cores per socket, I see this:
/proc/sys/lnet$ cat cpu_partition_table
0 : 0 1 2 3 4
1 : 5 6 7 8 9
2 : 10 11 12 13 14
3 : 15 16 17 18 19
4 : 20 21 22 23 24
5 : 25 26 27 28 29
6 : 30 31 32 33 34
7 : 35 36 37 38 39
Why are there two partitions per socket? Is this by design, or a bug?
What is going to happen when hyperthreading is enabled, and there are 80 "cpus" suddenly available?
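As a possible workaround while this is investigated, the partition layout can be pinned explicitly via the libcfs module parameters cpu_npartitions and cpu_pattern (documented in the Lustre manual), so it does not change when the CPU count does. A hedged sketch that writes a modprobe fragment; the values assume the 4-socket, 10-core node above, and the file path is illustrative:

```shell
# Sketch: pin libcfs CPU partitions via a modprobe fragment
# (must be in place before the Lustre modules load).
cat <<'EOF' > /tmp/lustre-cpt.conf
# Force one CPU partition per socket instead of the default estimate:
options libcfs cpu_npartitions=4
# Or bind partitions to explicit CPU lists with cpu_pattern, e.g.:
# options libcfs cpu_pattern="0[0-9] 1[10-19] 2[20-29] 3[30-39]"
EOF
cat /tmp/lustre-cpt.conf
```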