[LU-11593] LFSCK crashed all OSS servers Created: 01/Nov/18  Updated: 14/Jun/21

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Manish (Inactive) Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS Linux release 7.5.1804 (Core)
Kernel Version 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105
e2fsprogs-libs-1.44.3.wc1-0.el7.x86_64
e2fsprogs-1.44.3.wc1-0.el7.x86_64
e2fsprogs-static-1.44.3.wc1-0.el7.x86_64
e2fsprogs-devel-1.44.3.wc1-0.el7.x86_64
e2fsprogs-debuginfo-1.44.3.wc1-0.el7.x86_64


Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Hi

This is Manish from NASA. We were seeing hangs on the ls command, so we started running the "lfsck" command on all OSS nodes to clear up stale entries, and after a long run lfsck crashed all of the servers.

 

PID: 25536  TASK: ffff881bc2abaf70  CPU: 9   COMMAND: "lfsck"
 #0 [ffff88151015fbc8] machine_kexec at ffffffff8105b64b
 #1 [ffff88151015fc28] __crash_kexec at ffffffff81105342
 #2 [ffff88151015fcf8] panic at ffffffff81689aad
 #3 [ffff88151015fd78] lbug_with_loc at ffffffffa08938cb [libcfs]
 #4 [ffff88151015fd98] lfsck_layout_slave_prep at ffffffffa0cfa6fd [lfsck]
 #5 [ffff88151015fdf0] lfsck_master_engine at ffffffffa0cd4624 [lfsck]
 #6 [ffff88151015fec8] kthread at ffffffff810b1131
 #7 [ffff88151015ff50] ret_from_fork at ffffffff816a14f7

 

I will upload the complete crash dump to the FTP site soon.

Thank You,

                   Manish



 Comments   
Comment by Peter Jones [ 01/Nov/18 ]

Lai

Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 01/Nov/18 ]

The lfsck_layout_slave_prep() function has two different LASSERT() checks in it, but it isn't clear from the above stack trace which one was triggered. Are there any lines on the console before the stack trace that show what the actual problem was? Since the crash is in lfsck_layout_slave_prep(), it looks like it happened at the start of the MDS->OSS layout scanning phase, so LFSCK may have already repaired the corrupted LMA structures on the MDS.
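
For reference, the assertion that fired is normally printed as a LustreError line immediately before the "LBUG" message, so a quick grep of the saved dmesg from the crash dump should show it (the path below is only an example and depends on the kdump configuration):

  # show the ASSERTION line(s) preceding each LBUG in the saved console log
  # (/var/crash/... is the usual kdump location on RHEL/CentOS, adjust as needed)
  grep -B 2 'LBUG' /var/crash/*/vmcore-dmesg.txt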

If the previous LFSCK run has repaired most of the issues with the files, it is not strictly necessary to continue with LFSCK before returning the filesystem to service. You might consider running a "find -uid 0" (or similar, maybe in parallel across each top-level directory from a separate client) on the filesystem to ensure that there are no files that cause the MDS or client to crash when accessed, but cleaning up orphan OST objects is probably a secondary concern at this point. Lustre clients handle corrupt/unknown file layouts and missing OST objects fairly well (returning an error when such files are accessed).
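
A minimal sketch of that kind of scan, assuming a client mount point of /mnt/lustre and 8 parallel jobs (both are just examples; the point is simply to stat every file and surface any that hang or return errors when accessed):

  # from a Lustre client: walk each top-level directory in parallel
  cd /mnt/lustre                     # example client mount point
  ls -d */ | xargs -n 1 -P 8 -I{} find {} -uid 0 -ls > /dev/null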

Comment by Manish (Inactive) [ 01/Nov/18 ]

Here is the stack trace from vmcore-dmesg that I see on one of the nodes.

  

[216597.870804] LustreError: 29043:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) ASSERTION( !llsd->llsd_rbtree_valid ) failed:
[216597.870893] LustreError: 29045:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) ASSERTION( !llsd->llsd_rbtree_valid ) failed:
[216597.870896] LustreError: 29045:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) LBUG
[216597.870897] Pid: 29045, comm: lfsck 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
[216597.870898] Call Trace:
[216597.870916]  [<ffffffff8103a1f2>] save_stack_trace_tsk+0x22/0x40
[216597.870926]  [<ffffffffa086d7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[216597.870931]  [<ffffffffa086d87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[216597.870954]  [<ffffffffa10e06fd>] lfsck_layout_slave_prep+0x51d/0x590 [lfsck]
[216597.870961]  [<ffffffffa10ba624>] lfsck_master_engine+0x184/0x1360 [lfsck]
[216597.870965]  [<ffffffff810b1131>] kthread+0xd1/0xe0
[216597.870968]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
[216597.870989]  [<ffffffffffffffff>] 0xffffffffffffffff
[216597.870990] Kernel panic - not syncing: LBUG
[216597.870992] CPU: 11 PID: 29045 Comm: lfsck Tainted: G           OE  ------------   3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1
[216597.870994] Call Trace:
[216597.870999]  [<ffffffff8168f4b8>] dump_stack+0x19/0x1b
[216597.871003]  [<ffffffff81689aa2>] panic+0xe8/0x21f
[216597.871008]  [<ffffffffa086d8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[216597.871017]  [<ffffffffa10e06fd>] lfsck_layout_slave_prep+0x51d/0x590 [lfsck]
[216597.871025]  [<ffffffffa10ba624>] lfsck_master_engine+0x184/0x1360 [lfsck]
[216597.871032]  [<ffffffffa10ba4a0>] ? lfsck_master_oit_engine+0x1190/0x1190 [lfsck]
[216597.871034]  [<ffffffff810b1131>] kthread+0xd1/0xe0
[216597.871036]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
[216597.871038]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
[216597.871039]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
[216597.871698] LustreError: 29049:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) ASSERTION( !llsd->llsd_rbtree_valid ) failed:
[216597.871700] LustreError: 29049:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) LBUG
[216597.871701] Pid: 29049, comm: lfsck 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
[216597.871702] Call Trace:
[216597.871710]  [<ffffffff8103a1f2>] save_stack_trace_tsk+0x22/0x40
[216597.871716]  [<ffffffffa086d7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[216597.871721]  [<ffffffffa086d87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[216597.871733]  [<ffffffffa10e06fd>] lfsck_layout_slave_prep+0x51d/0x590 [lfsck]
[216597.871741]  [<ffffffffa10ba624>] lfsck_master_engine+0x184/0x1360 [lfsck]
[216597.871743]  [<ffffffff810b1131>] kthread+0xd1/0xe0
[216597.871745]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0

 

I hope this helps. I have already uploaded the complete stack trace to the FTP site.

Thank You,

                  Manish

Comment by Manish (Inactive) [ 02/Nov/18 ]

Hi Lai,

 

Just wanted to follow up: are there any updates on this issue? We will be getting the patch from the other ticket, LU-11584, for lfsck, and if this issue can be addressed in that patch as well, it would be nice to avoid one more outage because of this bug.

Thank You,

                  Manish

Comment by Lai Siyao [ 05/Nov/18 ]

I'm still reviewing the code to understand this, and it doesn't look to be the same issue as LU-11584.

Comment by Lai Siyao [ 06/Nov/18 ]

Manish, do you remember how you started 'lfsck' on the servers? Did you only run 'lfsck' on all of the OSSs, but not the MDS? And what was the exact command?

Comment by Mahmoud Hanafi [ 06/Nov/18 ]

lctl lfsck_start -o -r

This is likely related to https://jira.whamcloud.com/browse/LU-11625

Comment by Lai Siyao [ 26/Nov/18 ]

Mahmoud, does your system have multiple MDTs? Did you also run 'lctl lfsck_start -o -r' on the MDS?

Comment by Julien Wallior [ 14/Jun/21 ]

We just hit this one on 2.10.8.
I was running `lctl lfsck_start -M MDT-0000 -o -r` on MDS0.

We have 2 MDSs and 2 MDTs, and we are running 1 MDT on each MDS.
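
For anyone hitting the same thing, this is roughly how LFSCK is driven per MDT in a two-MDT setup like the one described above; the fsname "lustre" and the status check are illustrative, not the exact commands from this ticket:

  # start orphan-handling LFSCK on each MDT (run on the MDS that hosts it)
  # "lustre" is an example fsname, substitute your own
  lctl lfsck_start -M lustre-MDT0000 -o -r
  lctl lfsck_start -M lustre-MDT0001 -o -r
  # check progress of the layout phase on an MDS
  lctl get_param -n mdd.lustre-MDT0000.lfsck_layout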
