[LU-6285] Assert fails in staging client module crashes kernel if CPUMASK_OFFSTACK set Created: 25/Feb/15  Updated: 10/Aug/16  Resolved: 27/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.8.0, Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Tyson Whitehead Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Attachments: HTML File bad     HTML File config    
Issue Links:
Blocker
Related
is related to LU-4011 problems with upstream lustre client ... Closed
is related to LU-8492 ptlrpc: Correctly calculate hrp->hrp_... Resolved
is related to LU-6215 Sync Lustre external tree with lustre... Resolved
Epic/Theme: staging
Severity: 3
Rank (Obsolete): 17617

 Description   

Enabling CONFIG_CPUMASK_OFFSTACK in stock kernel 3.18.0 causes the staging ptlrpc module to emit the message

LustreError: 1203:0:(service.c:2796:ptlrpc_hr_init()) ASSERTION( hrp->hrp_nthrs > 0 ) failed:

followed by a backtrace and kernel lockup upon loading. I'll attach my dmesg dump and the .config file I used. I picked version 2.4.0 above as there doesn't seem to be any way to indicate the staging client version.



 Comments   
Comment by Oleg Drokin [ 26/Feb/15 ]

Thank you for the report.
I think I traced this to a kernel bug: cpumask copy does not copy enough bits and cpu_weight counts too many, so garbage in the variables causes thousands of "phantom" cpus to be detected.
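
The mismatch described above can be modelled in userspace. This is only a sketch under stated assumptions: the `sim_*` names, the 64-bit mask width, and the 8-CPU runtime count are all hypothetical and are not kernel APIs or values taken from the report. It shows how copying fewer bits than are later counted lets leftover stack contents inflate the CPU count.

```c
#include <limits.h>
#include <string.h>

/* Hypothetical stand-in for the compile-time maximum number of CPUs. */
#define SIM_NR_CPUS 64

/* Hypothetical userspace model of a CPU mask; not the kernel's cpumask_t. */
struct sim_cpumask {
	unsigned char bits[SIM_NR_CPUS / CHAR_BIT];
};

/* Copy only the first nbits bits (rounded up to whole bytes). */
static void sim_mask_copy(struct sim_cpumask *dst,
			  const struct sim_cpumask *src, unsigned int nbits)
{
	memcpy(dst->bits, src->bits, (nbits + CHAR_BIT - 1) / CHAR_BIT);
}

/* Count the set bits among the first nbits bits. */
static unsigned int sim_mask_weight(const struct sim_cpumask *m,
				    unsigned int nbits)
{
	unsigned int i, w = 0;

	for (i = 0; i < nbits; i++)
		if (m->bits[i / CHAR_BIT] & (1u << (i % CHAR_BIT)))
			w++;
	return w;
}

/* Reproduce the mismatch: copy 8 bits, then count all SIM_NR_CPUS bits. */
static unsigned int sim_phantom_weight(void)
{
	struct sim_cpumask src, dst;

	memset(&src, 0, sizeof(src));
	src.bits[0] = 0x03;			/* CPUs 0 and 1 are siblings */
	memset(&dst, 0xff, sizeof(dst));	/* simulated stack garbage */

	sim_mask_copy(&dst, &src, 8);		/* assumed runtime CPU count: 8 */
	return sim_mask_weight(&dst, SIM_NR_CPUS);
}
```

In this model, counting over the same 8 bits that were copied gives 2, while counting over all 64 bits also picks up the 56 garbage bits left in the destination.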

I submitted a bug report upstream with a couple of proposed patches, and hopefully that will be taken care of: https://lkml.org/lkml/2015/2/26/29

Comment by Tyson Whitehead [ 26/Feb/15 ]

Wow. That's great! Thanks for the very quick turnaround.

We are really looking forward to being able to use the latest Fedora and Ubuntu releases as lustre clients.

Comment by Oleg Drokin [ 27/Feb/15 ]

You can also use this as a workaround (and a minor performance optimization):

diff --git a/drivers/staging/lustre/lustre/ptlrpc/service.c b/drivers/staging/lustre/lustre/ptlrpc/service.c
index 635b12b..4a27c79 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/service.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/service.c
@@ -2752,7 +2752,6 @@ int ptlrpc_start_thread(struct ptlrpc_service_part *svcpt, int wait)
 
 int ptlrpc_hr_init(void)
 {
-	cpumask_t			mask;
 	struct ptlrpc_hr_partition	*hrp;
 	struct ptlrpc_hr_thread		*hrt;
 	int				rc;
@@ -2770,8 +2769,7 @@ int ptlrpc_hr_init(void)
 
 	init_waitqueue_head(&ptlrpc_hr.hr_waitq);
 
-	cpumask_copy(&mask, topology_thread_cpumask(0));
-	weight = cpus_weight(mask);
+	weight = cpus_weight(*topology_thread_cpumask(0));
 
 	cfs_percpt_for_each(hrp, i, ptlrpc_hr.hr_partitions) {
 		hrp->hrp_cpt = i;

I'll send a separate patch for this to Greg, but who knows when it'll actually make it to Fedora.

Comment by Gerrit Updater [ 27/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/13904
Subject: LU-6285 libcfs: Do not unnecessarily copy cpumask
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2651de49d3e54db84f50285eb55a900b4a96ca1e

Comment by Gerrit Updater [ 27/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/13905
Subject: LU-6285 ptlrpc: Do not recalculate siblings of CPU 0 in a loop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ff0c3c553b42e699a2960cf2788732fbedc3f485

Comment by Gerrit Updater [ 02/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/13925
Subject: LU-6285 ptlrpc: Get rid of cpus_* calls as deprecated
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c79914c41d71d95b95bdd237537d450984ab8894

Comment by Gerrit Updater [ 02/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/13926
Subject: LU-6285 libcfs: get rid of deprecated cpumask function usage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 14f60e7f68c969f0a9aa0cf35510216b1019bef3

Comment by Gerrit Updater [ 03/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/13954
Subject: LU-6285: o2iblnd: Do not use cpus_weight, it's deprecated
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e0da4f48095531ca651cdcbc8bca9f22d6cc5860

Comment by Gerrit Updater [ 06/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13904/
Subject: LU-6285 libcfs: Do not unnecessarily copy cpumask
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b4f41e5fef3ff644f9adb95921329ef59e1e3e74

Comment by Gerrit Updater [ 06/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13905/
Subject: LU-6285 ptlrpc: Do not recalculate siblings of CPU 0 in a loop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0eb4582d87e32dd3e5491e13ba659e625624bfe7

Comment by Gerrit Updater [ 06/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13954/
Subject: LU-6285: o2iblnd: Do not use cpus_weight, it's deprecated
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b95db057d0501fb19f807cddf3a8ba3f7f47cb1a

Comment by Gerrit Updater [ 06/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13925/
Subject: LU-6285 ptlrpc: Get rid of cpus_* calls as deprecated
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 61787e1cea610ba38ba917b73db0d43589c029df

Comment by Gerrit Updater [ 01/May/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13926/
Subject: LU-6285 libcfs: get rid of deprecated cpumask function usage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3b3233792869e706fe1ebfb6605d93fbc0d0d63c

Comment by James A Simmons [ 24/Jun/15 ]

All the patches have landed. We can close this ticket.

Comment by Tyson Whitehead [ 24/Jun/15 ]

Excellent! Thanks everyone.

Comment by James A Simmons [ 26/Jun/15 ]

Please close this ticket

Comment by Amir Shehata (Inactive) [ 23/Mar/16 ]

There is still an issue which could cause the assert.

cpu_pattern can specify exactly 1 CPU in a partition, e.g.
"0[0]". That means CPT0 will contain only CPU 0. If CPU 0 has
hyperthreading enabled, this combination would result in

weight = cfs_cpu_ht_nsiblings(0);
hrp->hrp_nthrs = cfs_cpt_weight(ptlrpc_hr.hr_cpt_table, i);
hrp->hrp_nthrs /= weight;

evaluating to 0, where

cfs_cpt_weight(ptlrpc_hr.hr_cpt_table, i) == 1
weight == 2

Therefore, only divide by weight if

hrp->hrp_nthrs >= weight

This will avoid the assert:

LASSERT(hrp->hrp_nthrs > 0);
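
The guard proposed above can be sketched as a standalone helper. This is a minimal illustration, not the landed patch: the function names `hr_threads_unguarded` and `hr_threads_guarded` are hypothetical, and the plain `int` parameters stand in for the values returned by `cfs_cpt_weight()` and `cfs_cpu_ht_nsiblings(0)`.

```c
/*
 * Hypothetical helpers illustrating the hrp_nthrs calculation.
 * cpt_weight   stands in for cfs_cpt_weight(ptlrpc_hr.hr_cpt_table, i)
 * ht_siblings  stands in for cfs_cpu_ht_nsiblings(0)
 */

/* Original behaviour: integer division can truncate to 0. */
static int hr_threads_unguarded(int cpt_weight, int ht_siblings)
{
	return cpt_weight / ht_siblings;
}

/* Guarded behaviour: divide only when the result stays at least 1. */
static int hr_threads_guarded(int cpt_weight, int ht_siblings)
{
	int nthrs = cpt_weight;

	if (ht_siblings > 0 && nthrs >= ht_siblings)
		nthrs /= ht_siblings;

	return nthrs;
}
```

With a partition weight of 1 and 2 hyperthread siblings, the unguarded form yields 0 and would trip the assertion, while the guarded form keeps 1 thread. Larger partitions, such as a weight of 8 with 2 siblings, give 4 in both forms.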

Comment by Gerrit Updater [ 23/Mar/16 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: http://review.whamcloud.com/19106
Subject: LU-6285 ptlrpc: Correctly calculate hrp->hrp_nthrs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2d6210842775e4fa7b0c7a6bce1dde8e948e56c9

Comment by Peter Jones [ 27/Jul/16 ]

Bulk of work landed for 2.9. Amir, please open a new ticket to track the landing of http://review.whamcloud.com/#/c/19106/
