[LU-15886] remove unreasonable assertions in LFSCK code Created: 25/May/22  Updated: 20/Jul/22  Resolved: 20/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Lai Siyao Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15868 LFSCK fix inconsistencies in director... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LFSCK should not assert on on-disk data: any kind of corruption is possible there, so such conditions should be handled as errors. Removing these assertions avoids needless LBUG crashes in LFSCK.
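
As a rough illustration of the idea (not the actual patch), the sketch below validates a value read from disk and returns an error instead of asserting. The struct, the magic value, and the function names (demo_ea_header, DEMO_EA_MAGIC, demo_ea_header_check, demo_scan) are hypothetical and do not correspond to real Lustre symbols. An assertion is kept only for a caller-contract invariant, which is an assumption about where assertions remain reasonable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical on-disk header; NOT the real Lustre link EA layout. */
struct demo_ea_header {
	uint32_t magic;
	uint32_t reccount;
};

/* Hypothetical magic value, chosen only for this sketch. */
#define DEMO_EA_MAGIC 0x44454D4Fu

/*
 * Validate a header that was read from disk.
 * Returns 0 if it looks sane, -EINVAL if it is corrupt.
 */
static int demo_ea_header_check(const struct demo_ea_header *hdr)
{
	/* In-memory caller contract: a NULL pointer here is a code bug. */
	assert(hdr != NULL);

	/* On-disk content: may be arbitrarily corrupt, so never assert. */
	if (hdr->magic != DEMO_EA_MAGIC) {
		fprintf(stderr, "demo: bad magic %#x, skipping object\n",
			(unsigned int)hdr->magic);
		return -EINVAL;
	}
	if (hdr->reccount == 0) {
		fprintf(stderr, "demo: empty record list, skipping object\n");
		return -EINVAL;
	}
	return 0;
}

/*
 * Scan a batch of headers and return how many were skipped as corrupt.
 * The scan continues past bad entries instead of stopping.
 */
static size_t demo_scan(const struct demo_ea_header *hdrs, size_t count)
{
	size_t skipped = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		if (demo_ea_header_check(&hdrs[i]) != 0)
			skipped++;
	}
	return skipped;
}
```

With this shape, a corrupt entry is logged and skipped, and the scan goes on to the next object rather than taking the whole node down.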



 Comments   
Comment by Lai Siyao [ 25/May/22 ]

Crashes are seen like this:

[708827.866619] LustreError: 4172:0:(lfsck_lib.c:1639:lfsck_instance_cleanup()) ASSERTION( lfsck->li_obj_dir == ((void *)0) ) failed: 
[708827.870508] LustreError: 4172:0:(lfsck_lib.c:1639:lfsck_instance_cleanup()) LBUG
[708827.872606] Pid: 4172, comm: umount 3.10.0-1160.49.1.el7_lustre.ddn16.x86_64 #1 SMP Mon Dec 20 11:42:01 PST 2021
[708827.872607] Call Trace:
[708827.872623] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[708827.872628] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[708827.872637] [<0>] lfsck_instance_cleanup+0x658/0x730 [lfsck]
[708827.872643] [<0>] lfsck_degister+0x43/0x50 [lfsck]
[708827.872651] [<0>] mdd_process_config+0x16a/0x5f0 [mdd]
[708827.872666] [<0>] mdt_stack_fini+0x2c2/0xca0 [mdt]
[708827.872673] [<0>] mdt_device_fini+0x34b/0x930 [mdt]
[708827.872698] [<0>] class_cleanup+0x9b8/0xc50 [obdclass]
[708827.872713] [<0>] class_process_config+0x65c/0x2830 [obdclass]
[708827.872728] [<0>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[708827.872745] [<0>] server_put_super+0xa35/0x1150 [obdclass]
[708827.872748] [<0>] generic_shutdown_super+0x6d/0x100
[708827.872750] [<0>] kill_anon_super+0x12/0x20
[708827.872764] [<0>] lustre_kill_super+0x32/0x50 [obdclass]
[708827.872765] [<0>] deactivate_locked_super+0x4e/0x70
[708827.872766] [<0>] deactivate_super+0x46/0x60
[708827.872768] [<0>] cleanup_mnt+0x3f/0x80
[708827.872770] [<0>] __cleanup_mnt+0x12/0x20
[708827.872774] [<0>] task_work_run+0xbb/0xe0
[708827.872776] [<0>] do_notify_resume+0xa5/0xc0
[708827.872778] [<0>] int_signal+0x12/0x17
[708827.872795] [<0>] 0xfffffffffffffffe
[708827.872797] Kernel panic - not syncing: LBUG

and

[10089.987070] Lustre: vriprod1-OST000f: deleting orphan objects from 0x0:10180632 to 0x0:10182049
[10090.183768] LustreError: 29027:0:(lfsck_namespace.c:5896:lfsck_namespace_scan_local_lpf_one()) ASSERTION( dt_object_exists(child) ) failed: 
[10090.185375] LustreError: 29027:0:(lfsck_namespace.c:5896:lfsck_namespace_scan_local_lpf_one()) LBUG
[10090.186495] Pid: 29027, comm: lfsck_namespace 3.10.0-1160.49.1.el7_lustre.ddn16.x86_64 #1 SMP Mon Dec 20 11:42:01 PST 2021
[10090.186497] Call Trace:
[10090.186526] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[10090.186532] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[10090.186571] [<0>] lfsck_namespace_scan_local_lpf_one+0xa39/0xdf0 [lfsck]
[10090.186580] [<0>] lfsck_namespace_scan_local_lpf+0x59c/0x970 [lfsck]
[10090.186592] [<0>] lfsck_namespace_assistant_handler_p2+0x682/0xa80 [lfsck]
[10090.186600] [<0>] lfsck_assistant_engine+0xfb1/0x20a0 [lfsck]
[10090.186604] [<0>] kthread+0xd1/0xe0
[10090.186607] [<0>] ret_from_fork_nospec_begin+0x7/0x21
[10090.186632] [<0>] 0xfffffffffffffffe
[10090.186633] Kernel panic - not syncing: LBUG
Comment by Gerrit Updater [ 25/May/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47447
Subject: LU-15886 lfsck: remove unreasonable assertions
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1b456a63dae6a4a48920a5bcbd562fb24f2455e6

Comment by Lai Siyao [ 29/May/22 ]

Another crash was hit:

[  420.060499] Lustre: vriprod1-OST007f: deleting orphan objects from 0x1300000404:364617 to 0x1300000404:364961
[  420.061574] Lustre: vriprod1-OST0019: deleting orphan objects from 0x118000040e:3594584 to 0x118000040e:3595009
[  426.029391] LustreError: 1129:0:(lfsck_namespace.c:3340:lfsck_namespace_linkea_clear_overflow()) ASSERTION( ldata->ld_leh->leh_reccount > 0 ) failed: 
[  426.034333] LustreError: 1129:0:(lfsck_namespace.c:3340:lfsck_namespace_linkea_clear_overflow()) LBUG
[  426.037651] Pid: 1129, comm: lfsck_namespace 3.10.0-1160.49.1.el7_lustre.ddn16.x86_64 #1 SMP Mon Dec 20 11:42:01 PST 2021
[  426.037653] Call Trace:
[  426.037697] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[  426.037703] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[  426.037753] [<0>] lfsck_namespace_linkea_clear_overflow.isra.66+0x390/0x4d3 [lfsck]
[  426.037771] [<0>] lfsck_namespace_double_scan_one+0x1b2/0x15a0 [lfsck]
[  426.037787] [<0>] lfsck_namespace_double_scan_one_trace_file+0x3ba/0x7d0 [lfsck]
[  426.037800] [<0>] lfsck_namespace_assistant_handler_p2+0x6e0/0xa80 [lfsck]
[  426.037814] [<0>] lfsck_assistant_engine+0xfb1/0x20a0 [lfsck]
[  426.037818] [<0>] kthread+0xd1/0xe0
[  426.037822] [<0>] ret_from_fork_nospec_begin+0x7/0x21
[  426.037898] [<0>] 0xfffffffffffffffe
[  426.037900] Kernel panic - not syncing: LBUG
[  426.040053] CPU: 8 PID: 1129 Comm: lfsck_namespace Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.49.1.el7_lustre.ddn16.x86_64 #1
[  426.044558] Hardware name: DDN SFA400NVXE, BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[  426.047323] Call Trace:
[  426.049339]  [<ffffffff88584539>] dump_stack+0x19/0x1b
[  426.051794]  [<ffffffff8857e241>] panic+0xe8/0x21f
[  426.053797]  [<ffffffffc0c8d8fb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[  426.056032]  [<ffffffffc16b12fd>] lfsck_namespace_linkea_clear_overflow.isra.66+0x390/0x4d3 [lfsck]
[  426.059110]  [<ffffffffc167eb72>] lfsck_namespace_double_scan_one+0x1b2/0x15a0 [lfsck]
[  426.061930]  [<ffffffffc168031a>] lfsck_namespace_double_scan_one_trace_file+0x3ba/0x7d0 [lfsck]
[  426.064345]  [<ffffffffc16840d0>] lfsck_namespace_assistant_handler_p2+0x6e0/0xa80 [lfsck]
[  426.066723]  [<ffffffffc10e6087>] ? ptlrpc_set_destroy+0x1f7/0x460 [ptlrpc]
[  426.069097]  [<ffffffff88026ae6>] ? kfree+0x106/0x140
[  426.071256]  [<ffffffffc10e6087>] ? ptlrpc_set_destroy+0x1f7/0x460 [ptlrpc]
[  426.073723]  [<ffffffffc1666a81>] lfsck_assistant_engine+0xfb1/0x20a0 [lfsck]
[  426.076042]  [<ffffffff88589df0>] ? __schedule+0x320/0x680
[  426.078232]  [<ffffffff87edadf0>] ? wake_up_state+0x20/0x20
[  426.080655]  [<ffffffffc1665ad0>] ? lfsck_master_engine+0x1360/0x1360 [lfsck]
[  426.082720]  [<ffffffff87ec5e61>] kthread+0xd1/0xe0
[  426.084783]  [<ffffffff87ec5d90>] ? insert_kthread_work+0x40/0x40
[  426.086902]  [<ffffffff88596ddd>] ret_from_fork_nospec_begin+0x7/0x21
[  426.088888]  [<ffffffff87ec5d90>] ? insert_kthread_work+0x40/0x40
Comment by Gerrit Updater [ 18/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47447/
Subject: LU-15886 lfsck: remove unreasonable assertions
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b52b52c2d142cec15ae35e91f878d1063c094bc4

Comment by Peter Jones [ 20/Jul/22 ]

Landed for 2.16
