[LU-9843]  LNetError: 57600:0:(linux-cpu.c:572:cfs_cpt_spread_node()) LBUG Created: 07/Aug/17  Updated: 10/Oct/19  Resolved: 10/Oct/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Mahmoud Hanafi
Assignee: Sonia Sharma (Inactive)
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

SLES11SP4 with lustre 2.7.3


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Client hit LBUG.

[1498857336.755629] LNetError: 57600:0:(linux-cpu.c:572:cfs_cpt_spread_node()) LBUG
[1498857336.779630] Pid: 57600, comm: lfs
[1498857336.779630] LNetError: 57594:0:(linux-cpu.c:572:cfs_cpt_spread_node()) LBUG
[1498857336.779630] LNetError: 57588:0:(linux-cpu.c:572:cfs_cpt_spread_node()) LBUG
[1498857336.779630] Pid: 57594, comm: lfs
[1498857336.779630] 
[1498857336.779630] Call Trace:
[1498857336.779630] Pid: 57588, comm: lfs
[1498857336.779630] 
[1498857336.779630] Call Trace:
7 out of 8 cpus in kdb, waiting for the rest, timeout in 10 second(s)
[1498857336.779630] [<ffffffff81004b35>] dump_trace+0x75/0x300
[1498857336.779630] [<ffffffff81004b35>] dump_trace+0x75/0x300
[1498857336.779630] [<ffffffffa09df82a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
[1498857336.779630] [<ffffffffa09df82a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
[1498857336.779630] [<ffffffffa09dfd5e>] lbug_with_loc+0x3e/0xb0 [libcfs]
[1498857336.779630] [<ffffffffa09dfd5e>] lbug_with_loc+0x3e/0xb0 [libcfs]
[1498857336.779630] [<ffffffffa09e18a6>] cfs_cpt_spread_node+0xf6/0x130 [libcfs]
[1498857336.779630] [<ffffffffa09e18a6>] cfs_cpt_spread_node+0xf6/0x130 [libcfs]
[1498857336.779630] [<ffffffffa09e0488>] cfs_cpt_malloc+0x18/0x40 [libcfs]
[1498857336.779630] [<ffffffffa09e0488>] cfs_cpt_malloc+0x18/0x40 [libcfs]
[1498857336.779630] [<ffffffffa0d1d60b>] ptlrpc_prep_set+0x4b/0x310 [ptlrpc]
[1498857336.779630] [<ffffffffa0d1d60b>] ptlrpc_prep_set+0x4b/0x310 [ptlrpc]
[1498857336.779630] [<ffffffffa0d238ac>] ptlrpc_queue_wait+0x3c/0x220 [ptlrpc]
[1498857336.779630] [<ffffffffa0d238ac>] ptlrpc_queue_wait+0x3c/0x220 [ptlrpc]
[1498857336.779630] [<ffffffffa1086993>] osc_quotactl+0xf3/0x360 [osc]
[1498857336.779630] [<ffffffffa1086993>] osc_quotactl+0xf3/0x360 [osc]
[1498857336.779630] [<ffffffffa0ecb8ea>] lov_quotactl+0x38a/0x930 [lov]
[1498857336.779630] [<ffffffffa0ecb8ea>] lov_quotactl+0x38a/0x930 [lov]
[1498857336.779630] [<ffffffffa0f2f4d9>] obd_quotactl+0xb9/0x340 [lustre]
[1498857336.779630] [<ffffffffa0f2f4d9>] obd_quotactl+0xb9/0x340 [lustre]
[1498857336.779630] [<ffffffffa0f3552b>] quotactl_ioctl+0x100b/0x15b0 [lustre]
[1498857336.779630] [<ffffffffa0f3552b>] quotactl_ioctl+0x100b/0x15b0 [lustre]
[1498857336.779630] [<ffffffffa0f37a48>] ll_dir_ioctl+0x1908/0x62f0 [lustre]
[1498857336.779630] [<ffffffffa0f37a48>] ll_dir_ioctl+0x1908/0x62f0 [lustre]
[1498857336.779630] [<ffffffff8117117b>] do_vfs_ioctl+0x8b/0x3b0
[1498857336.779630] [<ffffffff8117117b>] do_vfs_ioctl+0x8b/0x3b0
[1498857336.779630] [<ffffffff81171541>] sys_ioctl+0xa1/0xb0
[1498857336.779630] [<ffffffff81171541>] sys_ioctl+0xa1/0xb0
[1498857336.779630] [<ffffffff81483972>] system_call_fastpath+0x16/0x1b
[1498857336.779630] [<ffffffff81483972>] system_call_fastpath+0x16/0x1b
[1498857336.779630] [<00007fffed63f9a7>] 0x7fffed63f9a7
[1498857336.779630] [<00007fffed63f9a7>] 0x7fffed63f9a7
[1498857336.779630] 
[1498857336.779630] 
[1498857336.779630] Kernel panic - not syncing: LBUG
[1498857336.779630] Pid: 57594, comm: lfs Tainted: P ENX 3.0.101-100.1.20170523-nasa #1
[1498857336.779630] Call Trace:
[1498857336.779630] [<ffffffff81004b35>] dump_trace+0x75/0x300
[1498857336.779630] [<ffffffff814786b3>] dump_stack+0x69/0x6f
[1498857336.779630] [<ffffffff8147876f>] panic+0xb6/0x224
[1498857336.779630] [<ffffffffa09dfdc3>] lbug_with_loc+0xa3/0xb0 [libcfs]
[1498857336.779630] [<ffffffffa09e18a6>] cfs_cpt_spread_node+0xf6/0x130 [libcfs]
[1498857336.779630] [<ffffffffa09e0488>] cfs_cpt_malloc+0x18/0x40 [libcfs]
[1498857336.779630] [<ffffffffa0d1d60b>] ptlrpc_prep_set+0x4b/0x310 [ptlrpc]
[1498857336.779630] [<ffffffffa0d238ac>] ptlrpc_queue_wait+0x3c/0x220 [ptlrpc]
[1498857336.779630] [<ffffffffa1086993>] osc_quotactl+0xf3/0x360 [osc]
[1498857336.779630] [<ffffffffa0ecb8ea>] lov_quotactl+0x38a/0x930 [lov]
[1498857336.779630] [<ffffffffa0f2f4d9>] obd_quotactl+0xb9/0x340 [lustre]
[1498857336.779630] [<ffffffffa0f3552b>] quotactl_ioctl+0x100b/0x15b0 [lustre]
[1498857336.779630] [<ffffffffa0f37a48>] ll_dir_ioctl+0x1908/0x62f0 [lustre]
[1498857336.779630] [<ffffffff8117117b>] do_vfs_ioctl+0x8b/0x3b0
[1498857336.779630] [<ffffffff81171541>] sys_ioctl+0xa1/0xb0
[1498857336.779630] [<ffffffff81483972>] system_call_fastpath+0x16/0x1b
[1498857336.779630] [<00007fffed63f9a7>] 0x7fffed63f9a7
All cpus are now in kdb


 Comments   
Comment by Peter Jones [ 08/Aug/17 ]

Sonia

Can you please advise?

Thanks

Peter

Comment by Sonia Sharma (Inactive) [ 09/Aug/17 ]

Hi Mahmoud,
Can we have access to vmcore and vmlinux for this?

Thanks!

Comment by Mahmoud Hanafi [ 09/Aug/17 ]

Unfortunately, we were not able to get a crash dump. The only other clue I have is that I believe this occurred when one of the OSTs hit a bitmap error and was remounted read-only.

Comment by Sonia Sharma (Inactive) [ 09/Aug/17 ]

Can you please tell us where you got the build for SLES11SP4 with Lustre 2.7.3, or did you build it yourself?
I could not find a branch with tag 2.7.3. Is it a tag in an FE branch?

Comment by Peter Jones [ 09/Aug/17 ]

Sonia

NASA have their own distribution based on the 2.7 FE branch. They will need to grant you access to it on github

Peter

Comment by Jay Lan (Inactive) [ 09/Aug/17 ]

Hi Sonia,

If you give me your login ID at github.com I can add you to the list with access permission to our FE git repo.

Comment by Sonia Sharma (Inactive) [ 09/Aug/17 ]

Hi Jay
My login-id for github.com is soniash24@gmail.com (username-Sonia241087)

Comment by Jay Lan (Inactive) [ 09/Aug/17 ]

Hi Sonia,
Have you received the invitation from GitHub to join "NASA Earth Exchange (NEX)"? The Lustre FE (lustre-nas-fe) repo is housed under NEX. You will join as a member of the Lustre team.

Comment by Sonia Sharma (Inactive) [ 09/Aug/17 ]

Hi Jay
Yes I am able to access the repo. Thanks

Comment by Sonia Sharma (Inactive) [ 11/Aug/17 ]

Can we get information on how many NUMA nodes and CPU cores there are? Also, what is the value of MAX_NUMNODES (from /proc/self/status, field Mems_allowed)?
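
For reference, a minimal Python sketch of how the Mems_allowed field could be read on a client. The helper name parse_mems_allowed is hypothetical, and it assumes the field is printed as comma-separated hex groups whose total width rounds MAX_NUMNODES up, so the reported width is only an upper bound on MAX_NUMNODES, not its exact value.

```python
import re


def parse_mems_allowed(status_text):
    """Return (mask_bits, allowed_nodes) from the text of /proc/<pid>/status.

    mask_bits is the printed width of the mask (4 bits per hex digit), which
    is assumed to round MAX_NUMNODES up. allowed_nodes lists the node ids
    whose bit is set in the mask.
    """
    match = re.search(r"^Mems_allowed:\s*([0-9a-fA-F,]+)\s*$",
                      status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Mems_allowed field found")

    hex_digits = match.group(1).replace(",", "")
    mask_bits = 4 * len(hex_digits)
    value = int(hex_digits, 16)

    # Bit 0 is the rightmost bit of the printed mask.
    allowed_nodes = [bit for bit in range(mask_bits) if (value >> bit) & 1]
    return mask_bits, allowed_nodes
```

On a live Linux client this could be called as `parse_mems_allowed(open("/proc/self/status").read())`, and `os.cpu_count()` gives the number of logical CPUs the interpreter sees.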

Comment by Sonia Sharma (Inactive) [ 11/Aug/17 ]

Hi Mahmoud,

Remounting the OST might have exposed a bug in the cfs_cpt_spread_node() code, which is why it hit the LBUG. Here is what we think happened: when the OST was remounted, it called ost_setup(), which in turn called cfs_cpt_nodemask() and cfs_cpt_set_node(), which might change the mask. This would expose a race condition in which the mask changes after the rotor has been calculated but before the rotor is checked, causing the LBUG to be hit.
I will push a patch to branch b2_7_fe to fix this issue in the code.
However, the 2.10 release already has the fix, along with a series of other patches related to the CPT rework. So it might be a good idea to update to the 2.10 release.

Thanks
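
The suspected race can be illustrated with a simplified Python model. This is a sketch of the idea only, not the libcfs source: the function names, the `mutate` hook that stands in for a concurrent mask update, and the snapshot variant are all assumptions made for illustration, and the snapshot variant is not necessarily what the patch does.

```python
class RotorExhausted(Exception):
    """Stands in for the LBUG hit when the walk falls off the end of the mask."""


def spread_node(mask, rotor, mutate=None):
    """Pick the rotor-th node from `mask` (a set of node ids), racy version.

    `mutate` is a hypothetical hook simulating another thread changing the
    mask between the weight calculation and the walk over the mask.
    """
    weight = len(mask)          # weight taken from the live mask
    assert weight > 0
    rotor %= weight             # rotor is only valid for this weight

    if mutate is not None:
        mutate(mask)            # the mask changes underneath us

    for node in sorted(mask):   # walk the (possibly changed) mask
        if rotor == 0:
            return node
        rotor -= 1

    raise RotorExhausted("rotor is larger than the current mask")


def spread_node_snapshot(mask, rotor, mutate=None):
    """Same selection, but performed on a private copy of the mask."""
    snapshot = set(mask)
    weight = len(snapshot)
    assert weight > 0
    rotor %= weight

    if mutate is not None:
        mutate(mask)            # changes to the live mask no longer matter

    for node in sorted(snapshot):
        if rotor == 0:
            return node
        rotor -= 1

    raise RotorExhausted("unreachable for a non-empty snapshot")
```

In this model, `spread_node({0, 1, 2, 3}, 7, mutate=lambda m: m.discard(3))` raises RotorExhausted because the rotor was reduced against four nodes but only three remain when the walk starts, while the snapshot variant with the same arguments still returns a node.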

Comment by Mahmoud Hanafi [ 17/Aug/17 ]

The 2.10 release is a longer-term option. For now we will need a 2.7 FE patch.

Comment by Sonia Sharma (Inactive) [ 17/Aug/17 ]

I have pushed this patch but it still needs to be reviewed and landed.
https://review.whamcloud.com/28538

Comment by Jay Lan (Inactive) [ 22/Aug/17 ]

Hi Sonia, there were comments on #28538 from your reviewer. Should I ignore the comments and pick up your patchset #1?

Comment by Sonia Sharma (Inactive) [ 22/Aug/17 ]

Hi Jay,

I need to revise the patch, but I was waiting for feedback on the comments from one more reviewer (Dmitry), as he made the major changes related to this fix on master.
I will revise it soon and upload a new patch, which should fix this particular issue in any case.

Comment by Peter Jones [ 22/Aug/17 ]

Sonia

Dmitry is out of the office this week so I recommend refreshing with Amir's comments and then getting Dmitry's input upon his return

Peter

Comment by Sonia Sharma (Inactive) [ 22/Aug/17 ]

Sure. I refreshed the patch per Amir's comments.

Thanks

Comment by Jay Lan (Inactive) [ 29/Aug/17 ]

Does 2.10.0 need this patch? Thanks!

Comment by Sonia Sharma (Inactive) [ 29/Aug/17 ]

2.10.0 doesn't need this patch, as it includes a series of patches related to the CPT rework that already incorporate this fix.

Comment by Jay Lan (Inactive) [ 30/Aug/17 ]

Sorry, I forgot to ask whether the patch is needed for 2.9.0. We have Lustre clients running 2.7.3, 2.9.0 and 2.10.0. Please advise. Thanks!

Comment by Sonia Sharma (Inactive) [ 31/Aug/17 ]

Yes, 2.9.0 would also require this patch.

Comment by Mahmoud Hanafi [ 10/Oct/19 ]

Please close; we are no longer running 2.7.

Comment by Peter Jones [ 10/Oct/19 ]

ok - thanks
