[LU-12352] libcfs crashes with certain cpu_npartitions values Created: 29/May/19  Updated: 22/Oct/20  Resolved: 04/Jun/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.6

Type: Bug Priority: Minor
Reporter: Andrew Perepechko Assignee: Andrew Perepechko
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Due to a bug in the code, libcfs will crash if the number of online CPUs is not evenly divisible by the number of CPU partitions. Based on the checks in cfs_cpt_table_create(), it appears that the original intent was to push the remaining CPUs into the initial partitions.
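
The distribution that appears to have been intended can be sketched in a few lines of C. This is an illustration only, not the libcfs code and not the patch: the function names are hypothetical, CPUs are assumed to be numbered consecutively from 0, and any NUMA or core topology that the real code may take into account is ignored.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of libcfs): number of CPUs that
 * partition `cpt` receives when `ncpu` CPUs are spread over `ncpt`
 * partitions and the remainder is given, one CPU each, to the
 * lowest-numbered partitions.
 */
static inline int cpus_for_partition(int ncpu, int ncpt, int cpt)
{
	assert(ncpt > 0 && cpt >= 0 && cpt < ncpt);

	int base = ncpu / ncpt;	/* CPUs every partition gets */
	int rem = ncpu % ncpt;	/* leftover CPUs to hand out */

	return base + (cpt < rem ? 1 : 0);
}

/*
 * Hypothetical helper: index of the first CPU in partition `cpt`,
 * assuming CPUs are assigned to partitions in ascending order.
 */
static inline int first_cpu_of_partition(int ncpu, int ncpt, int cpt)
{
	assert(ncpt > 0 && cpt >= 0 && cpt < ncpt);

	int base = ncpu / ncpt;
	int rem = ncpu % ncpt;

	/* Each earlier partition holds `base` CPUs, plus one extra
	 * for every earlier partition that absorbed a leftover CPU. */
	return cpt * base + (cpt < rem ? cpt : rem);
}
```

With 4 CPUs and 3 partitions, this gives partition 0 two CPUs (0 and 1) and partitions 1 and 2 one CPU each (2 and 3), so no partition ends up with zero CPUs as long as ncpu >= ncpt.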

A simple reproducer on a system whose number of CPUs is not a multiple of 3:

insmod libcfs.ko cpu_pattern="" cpu_npartitions=3
[112628.427628] LNetError: 14786:0:(libcfs_cpu.c:770:cfs_cpt_choose_ncpus()) ASSERTION( number > 0 ) failed: 
[112628.427862] LNetError: 14786:0:(libcfs_cpu.c:770:cfs_cpt_choose_ncpus()) LBUG
[112628.428073] Pid: 14786, comm: insmod 3.10.0-693.21.1.x3.1.10.x86_64 #1 SMP Wed Nov 14 12:16:53 CST 2018
[112628.428082] Call Trace:
[112628.428180]  [<ffffffff8103a212>] save_stack_trace_tsk+0x22/0x40
[112628.428198]  [<ffffffffc067d7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[112628.428231]  [<ffffffffc067d87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[112628.428261]  [<ffffffffc069137a>] cfs_cpt_choose_ncpus+0x81a/0x820 [libcfs]
[112628.428294]  [<ffffffffc06915ba>] cfs_cpt_table_create+0x23a/0x8d0 [libcfs]
[112628.428325]  [<ffffffffc0691d4b>] cfs_cpu_init+0xbb/0xb70 [libcfs]
[112628.428356]  [<ffffffffc06df031>] libcfs_init+0x31/0x1000 [libcfs]
[112628.428388]  [<ffffffff810020ea>] do_one_initcall+0xba/0x240
[112628.428400]  [<ffffffff81104424>] load_module+0x1f84/0x2a10
[112628.428413]  [<ffffffff81105066>] SyS_finit_module+0xa6/0xd0
[112628.428423]  [<ffffffff816c1715>] system_call_fastpath+0x1c/0x21
[112628.428436]  [<ffffffffffffffff>] 0xffffffffffffffff
[112628.428469] Kernel panic - not syncing: LBUG
[112628.428572] CPU: 3 PID: 14786 Comm: insmod Tainted: G           OE  ------------   3.10.0-693.21.1.x3.1.10.x86_64 #1
[112628.428782] Hardware name:                  /D525MWV, BIOS MWPNT10N.86A.0083.2011.0524.1600 05/24/2011
[112628.428970] Call Trace:
[112628.429046]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[112628.429049]  [<ffffffff816a8634>] panic+0xe8/0x21f
[112628.429049]  [<ffffffffc067d8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[112628.429049]  [<ffffffffc069137a>] cfs_cpt_choose_ncpus+0x81a/0x820 [libcfs]
[112628.429049]  [<ffffffffc06915ba>] cfs_cpt_table_create+0x23a/0x8d0 [libcfs]
[112628.429049]  [<ffffffffc06df000>] ? 0xffffffffc06defff
[112628.429049]  [<ffffffffc0691d4b>] cfs_cpu_init+0xbb/0xb70 [libcfs]
[112628.429049]  [<ffffffffc06df000>] ? 0xffffffffc06defff
[112628.429049]  [<ffffffffc06df031>] libcfs_init+0x31/0x1000 [libcfs]
[112628.429049]  [<ffffffff810020ea>] do_one_initcall+0xba/0x240
[112628.429049]  [<ffffffff81104424>] load_module+0x1f84/0x2a10
[112628.429049]  [<ffffffff813523e0>] ? ddebug_proc_write+0xf0/0xf0
[112628.429049]  [<ffffffff816c514a>] ? ftrace_graph_caller+0x5a/0x85
[112628.429049]  [<ffffffff81100a83>] ? copy_module_from_fd.isra.42+0x53/0x150
[112628.429049]  [<ffffffff81105066>] SyS_finit_module+0xa6/0xd0
[112628.429049]  [<ffffffff816c1715>] system_call_fastpath+0x1c/0x21
[112628.429049]  [<ffffffff816c1661>] ? system_call_after_swapgs+0xae/0x146

A fix will be uploaded shortly.



 Comments   
Comment by Gerrit Updater [ 29/May/19 ]

Andrew Perepechko (c17827@cray.com) uploaded a new patch: https://review.whamcloud.com/34991
Subject: LU-12352 libcfs: crashes with certain cpu part numbers
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 481632a1ff40201998fd564fbf811303c2535b93

Comment by Andrew Perepechko [ 29/May/19 ]

With the fix:

[root@panda-testbox libcfs]# insmod libcfs.ko cpu_pattern="" cpu_npartitions=3
[root@panda-testbox libcfs]# cat /sys/kernel/debug/lnet/cpu_partition_table
0       : 0 1
1       : 2
2       : 3

Comment by Gerrit Updater [ 04/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34991/
Subject: LU-12352 libcfs: crashes with certain cpu part numbers
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e33e3da58972a811e6eafc479f95f6df2baf4b9b

Comment by Peter Jones [ 04/Jun/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 20/Mar/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37994
Subject: LU-12352 libcfs: crashes with certain cpu part numbers
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 4583a0b67d198cf53c6b715e5decd31944fd66b0

Comment by Gerrit Updater [ 22/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37994/
Subject: LU-12352 libcfs: crashes with certain cpu part numbers
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 6677b0ad0db9d4b826d77a768bf561bfe6533ffe
