[LU-9843] LNetError: 57600:0:(linux-cpu.c:572:cfs_cpt_spread_node()) LBUG Created: 07/Aug/17 Updated: 10/Oct/19 Resolved: 10/Oct/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Mahmoud Hanafi | Assignee: | Sonia Sharma (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
SLES11SP4 with lustre 2.7.3 |
||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Client hit LBUG.
[1498857336.755629] LNetError: 57600:0:(linux-cpu.c:572:cfs_cpt_spread_node()) LBUG
[1498857336.779630] Pid: 57600, comm: lfs
[1498857336.779630] LNetError: 57594:0:(linux-cpu.c:572:cfs_cpt_spread_node()) LBUG
[1498857336.779630] LNetError: 57588:0:(linux-cpu.c:572:cfs_cpt_spread_node()) LBUG
[1498857336.779630] Pid: 57594, comm: lfs
[1498857336.779630]
[1498857336.779630] Call Trace:
[1498857336.779630] Pid: 57588, comm: lfs
[1498857336.779630]
[1498857336.779630] Call Trace:
7 out of 8 cpus in kdb, waiting for the rest, timeout in 10 second(s)
[1498857336.779630] [<ffffffff81004b35>] dump_trace+0x75/0x300
[1498857336.779630] [<ffffffff81004b35>] dump_trace+0x75/0x300
[1498857336.779630] [<ffffffffa09df82a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
[1498857336.779630] [<ffffffffa09df82a>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
[1498857336.779630] [<ffffffffa09dfd5e>] lbug_with_loc+0x3e/0xb0 [libcfs]
[1498857336.779630] [<ffffffffa09dfd5e>] lbug_with_loc+0x3e/0xb0 [libcfs]
[1498857336.779630] [<ffffffffa09e18a6>] cfs_cpt_spread_node+0xf6/0x130 [libcfs]
[1498857336.779630] [<ffffffffa09e18a6>] cfs_cpt_spread_node+0xf6/0x130 [libcfs]
[1498857336.779630] [<ffffffffa09e0488>] cfs_cpt_malloc+0x18/0x40 [libcfs]
[1498857336.779630] [<ffffffffa09e0488>] cfs_cpt_malloc+0x18/0x40 [libcfs]
[1498857336.779630] [<ffffffffa0d1d60b>] ptlrpc_prep_set+0x4b/0x310 [ptlrpc]
[1498857336.779630] [<ffffffffa0d1d60b>] ptlrpc_prep_set+0x4b/0x310 [ptlrpc]
[1498857336.779630] [<ffffffffa0d238ac>] ptlrpc_queue_wait+0x3c/0x220 [ptlrpc]
[1498857336.779630] [<ffffffffa0d238ac>] ptlrpc_queue_wait+0x3c/0x220 [ptlrpc]
[1498857336.779630] [<ffffffffa1086993>] osc_quotactl+0xf3/0x360 [osc]
[1498857336.779630] [<ffffffffa1086993>] osc_quotactl+0xf3/0x360 [osc]
[1498857336.779630] [<ffffffffa0ecb8ea>] lov_quotactl+0x38a/0x930 [lov]
[1498857336.779630] [<ffffffffa0ecb8ea>] lov_quotactl+0x38a/0x930 [lov]
[1498857336.779630] [<ffffffffa0f2f4d9>] obd_quotactl+0xb9/0x340 [lustre]
[1498857336.779630] [<ffffffffa0f2f4d9>] obd_quotactl+0xb9/0x340 [lustre]
[1498857336.779630] [<ffffffffa0f3552b>] quotactl_ioctl+0x100b/0x15b0 [lustre]
[1498857336.779630] [<ffffffffa0f3552b>] quotactl_ioctl+0x100b/0x15b0 [lustre]
[1498857336.779630] [<ffffffffa0f37a48>] ll_dir_ioctl+0x1908/0x62f0 [lustre]
[1498857336.779630] [<ffffffffa0f37a48>] ll_dir_ioctl+0x1908/0x62f0 [lustre]
[1498857336.779630] [<ffffffff8117117b>] do_vfs_ioctl+0x8b/0x3b0
[1498857336.779630] [<ffffffff8117117b>] do_vfs_ioctl+0x8b/0x3b0
[1498857336.779630] [<ffffffff81171541>] sys_ioctl+0xa1/0xb0
[1498857336.779630] [<ffffffff81171541>] sys_ioctl+0xa1/0xb0
[1498857336.779630] [<ffffffff81483972>] system_call_fastpath+0x16/0x1b
[1498857336.779630] [<ffffffff81483972>] system_call_fastpath+0x16/0x1b
[1498857336.779630] [<00007fffed63f9a7>] 0x7fffed63f9a7
[1498857336.779630] [<00007fffed63f9a7>] 0x7fffed63f9a7
[1498857336.779630]
[1498857336.779630]
[1498857336.779630] Kernel panic - not syncing: LBUG
[1498857336.779630] Pid: 57594, comm: lfs Tainted: P ENX 3.0.101-100.1.20170523-nasa #1
[1498857336.779630] Call Trace:
[1498857336.779630] [<ffffffff81004b35>] dump_trace+0x75/0x300
[1498857336.779630] [<ffffffff814786b3>] dump_stack+0x69/0x6f
[1498857336.779630] [<ffffffff8147876f>] panic+0xb6/0x224
[1498857336.779630] [<ffffffffa09dfdc3>] lbug_with_loc+0xa3/0xb0 [libcfs]
[1498857336.779630] [<ffffffffa09e18a6>] cfs_cpt_spread_node+0xf6/0x130 [libcfs]
[1498857336.779630] [<ffffffffa09e0488>] cfs_cpt_malloc+0x18/0x40 [libcfs]
[1498857336.779630] [<ffffffffa0d1d60b>] ptlrpc_prep_set+0x4b/0x310 [ptlrpc]
[1498857336.779630] [<ffffffffa0d238ac>] ptlrpc_queue_wait+0x3c/0x220 [ptlrpc]
[1498857336.779630] [<ffffffffa1086993>] osc_quotactl+0xf3/0x360 [osc]
[1498857336.779630] [<ffffffffa0ecb8ea>] lov_quotactl+0x38a/0x930 [lov]
[1498857336.779630] [<ffffffffa0f2f4d9>] obd_quotactl+0xb9/0x340 [lustre]
[1498857336.779630] [<ffffffffa0f3552b>] quotactl_ioctl+0x100b/0x15b0 [lustre]
[1498857336.779630] [<ffffffffa0f37a48>] ll_dir_ioctl+0x1908/0x62f0 [lustre]
[1498857336.779630] [<ffffffff8117117b>] do_vfs_ioctl+0x8b/0x3b0
[1498857336.779630] [<ffffffff81171541>] sys_ioctl+0xa1/0xb0
[1498857336.779630] [<ffffffff81483972>] system_call_fastpath+0x16/0x1b
[1498857336.779630] [<00007fffed63f9a7>] 0x7fffed63f9a7
All cpus are now in kdb |
| Comments |
| Comment by Peter Jones [ 08/Aug/17 ] |
|
Sonia, can you please advise? Thanks, Peter |
| Comment by Sonia Sharma (Inactive) [ 09/Aug/17 ] |
|
Hi Mahmoud, Thanks! |
| Comment by Mahmoud Hanafi [ 09/Aug/17 ] |
|
Unfortunately, we were not able to get a crash dump. The only other clue I have is that I believe this occurred when one of the OSTs hit a bitmap error and was remounted read-only. |
| Comment by Sonia Sharma (Inactive) [ 09/Aug/17 ] |
|
Can you please tell us where you got the build for SLES11SP4 with Lustre 2.7.3, or did you build it yourself? |
| Comment by Peter Jones [ 09/Aug/17 ] |
|
Sonia, NASA have their own distribution based on the 2.7 FE branch. They will need to grant you access to it on GitHub. Peter |
| Comment by Jay Lan (Inactive) [ 09/Aug/17 ] |
|
Hi Sonia, If you give me your login ID at github.com I can add you to the list with access permission to our FE git repo. |
| Comment by Sonia Sharma (Inactive) [ 09/Aug/17 ] |
|
Hi Jay |
| Comment by Jay Lan (Inactive) [ 09/Aug/17 ] |
|
Hi Sonia, |
| Comment by Sonia Sharma (Inactive) [ 09/Aug/17 ] |
|
Hi Jay |
| Comment by Sonia Sharma (Inactive) [ 11/Aug/17 ] |
|
Can we get information on how many NUMA nodes and CPU cores there are? Also, what is the value of MAX_NUMNODES (from /proc/self/status, field Mems_allowed)? |
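One way to answer the Mems_allowed question is sketched below. It assumes the usual layout of that field, a comma-separated list of hexadecimal words with the lowest-numbered nodes in the rightmost word. The helper names parse_mems_allowed and read_mems_allowed are hypothetical, and the reported width is only an upper bound on MAX_NUMNODES because hex digits round the bit count up to a multiple of four.

```python
from typing import List, Tuple


def parse_mems_allowed(mask: str) -> Tuple[int, List[int]]:
    """Return (mask width in bits, allowed node IDs) for a Mems_allowed value."""
    # Drop the comma separators; what remains is one big hexadecimal number.
    hex_digits = "".join(mask.strip().split(","))
    width_bits = 4 * len(hex_digits)
    value = int(hex_digits, 16)
    # Bit N set means NUMA node N is allowed for this process.
    nodes = [bit for bit in range(width_bits) if (value >> bit) & 1]
    return width_bits, nodes


def read_mems_allowed(path: str = "/proc/self/status") -> Tuple[int, List[int]]:
    """Read the Mems_allowed field from a status file (Linux only)."""
    with open(path) as status:
        for line in status:
            # Match the exact key so that Mems_allowed_list is skipped.
            if line.startswith("Mems_allowed:"):
                return parse_mems_allowed(line.split(":", 1)[1])
    raise LookupError("Mems_allowed not found in " + path)
```

On a client, calling read_mems_allowed() with no arguments reports the mask width and the node IDs the calling process may allocate from.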
| Comment by Sonia Sharma (Inactive) [ 11/Aug/17 ] |
|
Hi Mahmoud, Remounting the OST might have exposed a bug in cfs_cpt_spread_node(), which would explain the LBUG. Here is what we think happened: when the OST was remounted, it called ost_setup(), which in turn called cfs_cpt_nodemask() and cfs_cpt_set_node(), which might change the node mask. This would expose a race condition in which the mask changes after the rotor has been calculated but before the rotor is checked, causing the LBUG to be hit. Thanks |
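The suspected race can be illustrated with a small model. This is a simplified sketch, not the Lustre source: it assumes the function reduces a rotor modulo the number of nodes in the mask and then walks the mask until the rotor reaches zero. The names spread_node and spread_node_snapshot are hypothetical, and passing two separate views of the mask stands in for a concurrent update between the two steps.

```python
def spread_node(mask_at_weight_time, mask_at_scan_time, rotor):
    """Pick a node round-robin; the two mask arguments model a racing update."""
    weight = len(mask_at_weight_time)
    if weight == 0:
        raise ValueError("empty node mask")
    # Step 1: the rotor is reduced using the weight of the mask as first seen.
    rotor %= weight
    # Step 2: the mask is walked again; it may have shrunk in the meantime.
    for node in sorted(mask_at_scan_time):
        if rotor == 0:
            return node
        rotor -= 1
    # Falling off the end of the walk corresponds to the LBUG in this model.
    raise AssertionError("LBUG: rotor ran past the end of the node mask")


def spread_node_snapshot(mask, rotor):
    """Same selection, but both steps use one private snapshot of the mask."""
    snapshot = frozenset(mask)
    return spread_node(snapshot, snapshot, rotor)
```

With a stable mask of four nodes and a rotor of 6, the model returns node 2. If the mask shrinks to two nodes after the rotor has been reduced to 3, the walk ends without a match, which is the failure described above. Taking a single snapshot (or holding a lock across both steps) closes that window in the model; whether the landed patch takes this approach is not stated in this ticket.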
| Comment by Mahmoud Hanafi [ 17/Aug/17 ] |
|
2.10 release is a longer term option. For now we will need a 2.7fe patch.
|
| Comment by Sonia Sharma (Inactive) [ 17/Aug/17 ] |
|
I have pushed this patch but it still needs to be reviewed and landed. |
| Comment by Jay Lan (Inactive) [ 22/Aug/17 ] |
|
Hi Sonia, there were comments on #28538 from your reviewer. Should I ignore the comments and pick up your patchset #1? |
| Comment by Sonia Sharma (Inactive) [ 22/Aug/17 ] |
|
Hi Jay, I need to revise the patch but I was waiting for one more reviewer's (Dmitry's) feedback on the comments as he had done the major changes related to this fix on master. |
| Comment by Peter Jones [ 22/Aug/17 ] |
|
Sonia, Dmitry is out of the office this week, so I recommend refreshing with Amir's comments and then getting Dmitry's input upon his return. Peter |
| Comment by Sonia Sharma (Inactive) [ 22/Aug/17 ] |
|
Sure. I refreshed the patch per Amir's comments. Thanks |
| Comment by Jay Lan (Inactive) [ 29/Aug/17 ] |
|
Does 2.10.0 need this patch? Thanks! |
| Comment by Sonia Sharma (Inactive) [ 29/Aug/17 ] |
|
2.10.0 doesn't need this patch as it had a series of patches related to cpt rework which already incorporated this fix. |
| Comment by Jay Lan (Inactive) [ 30/Aug/17 ] |
|
Sorry I forgot to ask if the patch is needed for 2.9.0. We have lustre clients running 2.7.3, 2.9.0 and 2.10.0. Please advise. Thanks! |
| Comment by Sonia Sharma (Inactive) [ 31/Aug/17 ] |
|
Yes 2.9.0 would also require this patch. |
| Comment by Mahmoud Hanafi [ 10/Oct/19 ] |
|
Please close; we are no longer running 2.7. |
| Comment by Peter Jones [ 10/Oct/19 ] |
|
ok - thanks |