[LU-11593] LFSCK crashed all OSS servers Created: 01/Nov/18 Updated: 14/Jun/21 |
|
| Status: | In Progress |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Manish (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS Linux release 7.5.1804 (Core) |
||
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi, this is Manish from NASA. We were getting some hangs on the ls command, so we started running the "lfsck" command on all OSS nodes to clear up stale entries, and that crashed all the servers after a long lfsck run.
PID: 25536 TASK: ffff881bc2abaf70 CPU: 9 COMMAND: "lfsck"
#0 [ffff88151015fbc8] machine_kexec at ffffffff8105b64b
#1 [ffff88151015fc28] __crash_kexec at ffffffff81105342
#2 [ffff88151015fcf8] panic at ffffffff81689aad
#3 [ffff88151015fd78] lbug_with_loc at ffffffffa08938cb [libcfs]
#4 [ffff88151015fd98] lfsck_layout_slave_prep at ffffffffa0cfa6fd [lfsck]
#5 [ffff88151015fdf0] lfsck_master_engine at ffffffffa0cd4624 [lfsck]
#6 [ffff88151015fec8] kthread at ffffffff810b1131
#7 [ffff88151015ff50] ret_from_fork at ffffffff816a14f7
I will upload the complete crash dump to the FTP site soon. Thank you, Manish |
| Comments |
| Comment by Peter Jones [ 01/Nov/18 ] |
|
Lai, could you please advise? Thanks, Peter |
| Comment by Andreas Dilger [ 01/Nov/18 ] |
|
The lfsck_layout_slave_prep() function has two different LASSERT() checks in it, but it isn't clear from the above stack trace which one was triggered. Are there some lines on the console before the stack trace that show what the actual problem was?
Based on the lfsck_layout_slave_prep() function this was in, it looks like the crash happened at the start of the MDS->OSS layout scanning phase, so LFSCK may have already repaired the corrupted LMA structures on the MDS. If the previous LFSCK run repaired most of the issues with the files, it is not strictly necessary to continue with LFSCK before returning the filesystem to service.
You might consider running a "find -uid 0" (or similar, perhaps in parallel across each top-level directory from a separate client) on the filesystem to ensure that there are no files that cause the MDS or a client to crash when accessed; cleaning up orphan OST objects is probably a secondary concern at this point. Lustre clients handle corrupt/unknown file layouts and missing OST objects fairly well, returning an error when such files are accessed. |
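A minimal sketch of the parallel scan suggested above, assuming the filesystem is mounted at /mnt/lustre on a spare client (the mount point and log paths are placeholders, not from this ticket):
# Stat every file under each top-level directory in parallel; the point is
# simply to access every inode, so any entry with a bad layout reports an
# error here rather than surprising users or a server later.
for d in /mnt/lustre/*/; do
    ( find "$d" -uid 0 > /dev/null 2> "/tmp/scan-$(basename "$d").err" ) &
done
wait
# A non-empty .err file records the paths that returned errors when accessed.
for f in /tmp/scan-*.err; do
    [ -s "$f" ] && echo "access errors recorded in $f"
done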
| Comment by Manish (Inactive) [ 01/Nov/18 ] |
|
Here is the stack trace from vmcore-dmesg that I see on one of the nodes.
[216597.870804] LustreError: 29043:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) ASSERTION( !llsd->llsd_rbtree_valid ) failed:
[216597.870893] LustreError: 29045:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) ASSERTION( !llsd->llsd_rbtree_valid ) failed:
[216597.870896] LustreError: 29045:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) LBUG
[216597.870897] Pid: 29045, comm: lfsck 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
[216597.870898] Call Trace:
[216597.870916] [<ffffffff8103a1f2>] save_stack_trace_tsk+0x22/0x40
[216597.870926] [<ffffffffa086d7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[216597.870931] [<ffffffffa086d87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[216597.870954] [<ffffffffa10e06fd>] lfsck_layout_slave_prep+0x51d/0x590 [lfsck]
[216597.870961] [<ffffffffa10ba624>] lfsck_master_engine+0x184/0x1360 [lfsck]
[216597.870965] [<ffffffff810b1131>] kthread+0xd1/0xe0
[216597.870968] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
[216597.870989] [<ffffffffffffffff>] 0xffffffffffffffff
[216597.870990] Kernel panic - not syncing: LBUG
[216597.870992] CPU: 11 PID: 29045 Comm: lfsck Tainted: G OE ------------ 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1
[216597.870994] Call Trace:
[216597.870999] [<ffffffff8168f4b8>] dump_stack+0x19/0x1b
[216597.871003] [<ffffffff81689aa2>] panic+0xe8/0x21f
[216597.871008] [<ffffffffa086d8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[216597.871017] [<ffffffffa10e06fd>] lfsck_layout_slave_prep+0x51d/0x590 [lfsck]
[216597.871025] [<ffffffffa10ba624>] lfsck_master_engine+0x184/0x1360 [lfsck]
[216597.871032] [<ffffffffa10ba4a0>] ? lfsck_master_oit_engine+0x1190/0x1190 [lfsck]
[216597.871034] [<ffffffff810b1131>] kthread+0xd1/0xe0
[216597.871036] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
[216597.871038] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
[216597.871039] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
[216597.871698] LustreError: 29049:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) ASSERTION( !llsd->llsd_rbtree_valid ) failed:
[216597.871700] LustreError: 29049:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) LBUG
[216597.871701] Pid: 29049, comm: lfsck 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
[216597.871702] Call Trace:
[216597.871710] [<ffffffff8103a1f2>] save_stack_trace_tsk+0x22/0x40
[216597.871716] [<ffffffffa086d7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[216597.871721] [<ffffffffa086d87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[216597.871733] [<ffffffffa10e06fd>] lfsck_layout_slave_prep+0x51d/0x590 [lfsck]
[216597.871741] [<ffffffffa10ba624>] lfsck_master_engine+0x184/0x1360 [lfsck]
[216597.871743] [<ffffffff810b1131>] kthread+0xd1/0xe0
[216597.871745] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
I hope this helps; I have already uploaded the complete stack trace to the FTP site. Thank you, Manish
|
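For reference, the failed check, ASSERTION( !llsd->llsd_rbtree_valid ), appears to fire when the slave-side layout scan is prepared while its object rbtree is already marked valid, i.e. when a layout scan on that OST was still set up from an earlier start. A minimal sketch of how the per-target layout LFSCK state might be inspected before starting another scan; the parameter paths below are the usual locations to my knowledge, but may differ between releases:
# On each OSS, show the slave-side layout LFSCK state (status, phase, checkpoints):
lctl get_param -n obdfilter.*.lfsck_layout
# On the MDS(es), show the master-side layout LFSCK state:
lctl get_param -n mdd.*.lfsck_layout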
| Comment by Manish (Inactive) [ 02/Nov/18 ] |
|
Hi Lai,
Just wanted to follow up: are there any updates on this issue, since we will be getting the patch from the other ticket? Thank you, Manish |
| Comment by Lai Siyao [ 05/Nov/18 ] |
|
I'm still reviewing the code to understand this, and it doesn't look to be the same issue as |
| Comment by Lai Siyao [ 06/Nov/18 ] |
|
Manish, do you remember how you started 'lfsck' on the servers? Did you run 'lfsck' only on the OSSes, and not on the MDS? And what was the exact command? |
| Comment by Mahmoud Hanafi [ 06/Nov/18 ] |
|
lctl lfsck_start -o -r
This is likely related to https://jira.whamcloud.com/browse/LU-11625 |
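For comparison, a hedged sketch of driving the layout scan from the MDS rather than per-OSS; <fsname> is a placeholder and the option set follows the lctl lfsck_start usage I'm familiar with, so it may differ on 2.10.x:
# Start a layout scan from the MDS so it coordinates the OSTs:
#   -t layout : run the layout component only
#   -A        : start on all MDTs
#   -o        : handle orphan OST objects
#   -r        : reset and scan from the beginning
lctl lfsck_start -M <fsname>-MDT0000 -t layout -A -o -r
# Stop it on all targets if it needs to be interrupted:
lctl lfsck_stop -M <fsname>-MDT0000 -A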
| Comment by Lai Siyao [ 26/Nov/18 ] |
|
Mahmoud, does your system have multiple MDTs? Did you also run 'lctl lfsck_start -o -r' on the MDS? |
| Comment by Julien Wallior [ 14/Jun/21 ] |
|
We just hit this one on 2.10.8. We have 2 MDSes and 2 MDTs, and we are running 1 MDT on each MDS. |