remove unreasonable assertions in LFSCK code

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.16.0

    Description

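      LFSCK should not assert on on-disk data: any kind of corruption is possible there, so it must be handled as an error rather than treated as a programming bug. Removing these assertions avoids needless crashes in LFSCK.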
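
The idea can be illustrated with a minimal sketch. It is not the code from the patch: `struct disk_hdr`, `DISK_HDR_MAGIC`, and `disk_hdr_check()` are hypothetical names invented for illustration. It assumes a header read from disk that carries a magic number and a record count, and shows the check returning a negative errno where an assertion would otherwise panic the node.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical on-disk header; NOT the real Lustre link EA layout. */
struct disk_hdr {
	uint32_t magic;    /* expected to equal DISK_HDR_MAGIC */
	uint32_t reccount; /* number of records that follow the header */
};

/* Hypothetical magic value, chosen only for this example. */
#define DISK_HDR_MAGIC 0x4C46534Bu

/*
 * Validate a header that was read from disk.
 *
 * The content is untrusted: corruption can produce any value, so a bad
 * value is reported to the caller instead of being asserted on.
 * Returns 0 if the header is usable, -EINVAL on a bad magic, and
 * -ERANGE if the record count is zero.
 */
static int disk_hdr_check(const struct disk_hdr *hdr)
{
	if (hdr->magic != DISK_HDR_MAGIC)
		return -EINVAL;

	if (hdr->reccount == 0)
		return -ERANGE;

	return 0;
}
```

A caller would check the return value and skip or repair the damaged object, then continue scanning. Assertions remain appropriate for invariants the code itself maintains in memory, just not for values read from disk.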

          Activity

            [LU-15886] remove unreasonable assertions in LFSCK code
            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47447/
            Subject: LU-15886 lfsck: remove unreasonable assertions
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b52b52c2d142cec15ae35e91f878d1063c094bc4

            laisiyao Lai Siyao added a comment -

            Another crash hit:

            [  420.060499] Lustre: vriprod1-OST007f: deleting orphan objects from 0x1300000404:364617 to 0x1300000404:364961
            [  420.061574] Lustre: vriprod1-OST0019: deleting orphan objects from 0x118000040e:3594584 to 0x118000040e:3595009
            [  426.029391] LustreError: 1129:0:(lfsck_namespace.c:3340:lfsck_namespace_linkea_clear_overflow()) ASSERTION( ldata->ld_leh->leh_reccount > 0 ) failed: 
            [  426.034333] LustreError: 1129:0:(lfsck_namespace.c:3340:lfsck_namespace_linkea_clear_overflow()) LBUG
            [  426.037651] Pid: 1129, comm: lfsck_namespace 3.10.0-1160.49.1.el7_lustre.ddn16.x86_64 #1 SMP Mon Dec 20 11:42:01 PST 2021
            [  426.037653] Call Trace:
            [  426.037697] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
            [  426.037703] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [  426.037753] [<0>] lfsck_namespace_linkea_clear_overflow.isra.66+0x390/0x4d3 [lfsck]
            [  426.037771] [<0>] lfsck_namespace_double_scan_one+0x1b2/0x15a0 [lfsck]
            [  426.037787] [<0>] lfsck_namespace_double_scan_one_trace_file+0x3ba/0x7d0 [lfsck]
            [  426.037800] [<0>] lfsck_namespace_assistant_handler_p2+0x6e0/0xa80 [lfsck]
            [  426.037814] [<0>] lfsck_assistant_engine+0xfb1/0x20a0 [lfsck]
            [  426.037818] [<0>] kthread+0xd1/0xe0
            [  426.037822] [<0>] ret_from_fork_nospec_begin+0x7/0x21
            [  426.037898] [<0>] 0xfffffffffffffffe
            [  426.037900] Kernel panic - not syncing: LBUG
            [  426.040053] CPU: 8 PID: 1129 Comm: lfsck_namespace Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.49.1.el7_lustre.ddn16.x86_64 #1
            [  426.044558] Hardware name: DDN SFA400NVXE, BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
            [  426.047323] Call Trace:
            [  426.049339]  [<ffffffff88584539>] dump_stack+0x19/0x1b
            [  426.051794]  [<ffffffff8857e241>] panic+0xe8/0x21f
            [  426.053797]  [<ffffffffc0c8d8fb>] lbug_with_loc+0x9b/0xa0 [libcfs]
            [  426.056032]  [<ffffffffc16b12fd>] lfsck_namespace_linkea_clear_overflow.isra.66+0x390/0x4d3 [lfsck]
            [  426.059110]  [<ffffffffc167eb72>] lfsck_namespace_double_scan_one+0x1b2/0x15a0 [lfsck]
            [  426.061930]  [<ffffffffc168031a>] lfsck_namespace_double_scan_one_trace_file+0x3ba/0x7d0 [lfsck]
            [  426.064345]  [<ffffffffc16840d0>] lfsck_namespace_assistant_handler_p2+0x6e0/0xa80 [lfsck]
            [  426.066723]  [<ffffffffc10e6087>] ? ptlrpc_set_destroy+0x1f7/0x460 [ptlrpc]
            [  426.069097]  [<ffffffff88026ae6>] ? kfree+0x106/0x140
            [  426.071256]  [<ffffffffc10e6087>] ? ptlrpc_set_destroy+0x1f7/0x460 [ptlrpc]
            [  426.073723]  [<ffffffffc1666a81>] lfsck_assistant_engine+0xfb1/0x20a0 [lfsck]
            [  426.076042]  [<ffffffff88589df0>] ? __schedule+0x320/0x680
            [  426.078232]  [<ffffffff87edadf0>] ? wake_up_state+0x20/0x20
            [  426.080655]  [<ffffffffc1665ad0>] ? lfsck_master_engine+0x1360/0x1360 [lfsck]
            [  426.082720]  [<ffffffff87ec5e61>] kthread+0xd1/0xe0
            [  426.084783]  [<ffffffff87ec5d90>] ? insert_kthread_work+0x40/0x40
            [  426.086902]  [<ffffffff88596ddd>] ret_from_fork_nospec_begin+0x7/0x21
            [  426.088888]  [<ffffffff87ec5d90>] ? insert_kthread_work+0x40/0x40
            
            gerrit Gerrit Updater added a comment -

            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47447
            Subject: LU-15886 lfsck: remove unreasonable assertions
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1b456a63dae6a4a48920a5bcbd562fb24f2455e6

            laisiyao Lai Siyao added a comment -

            Crashes are seen like this:

            [708827.866619] LustreError: 4172:0:(lfsck_lib.c:1639:lfsck_instance_cleanup()) ASSERTION( lfsck->li_obj_dir == ((void *)0) ) failed: 
            [708827.870508] LustreError: 4172:0:(lfsck_lib.c:1639:lfsck_instance_cleanup()) LBUG
            [708827.872606] Pid: 4172, comm: umount 3.10.0-1160.49.1.el7_lustre.ddn16.x86_64 #1 SMP Mon Dec 20 11:42:01 PST 2021
            [708827.872607] Call Trace:
            [708827.872623] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
            [708827.872628] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [708827.872637] [<0>] lfsck_instance_cleanup+0x658/0x730 [lfsck]
            [708827.872643] [<0>] lfsck_degister+0x43/0x50 [lfsck]
            [708827.872651] [<0>] mdd_process_config+0x16a/0x5f0 [mdd]
            [708827.872666] [<0>] mdt_stack_fini+0x2c2/0xca0 [mdt]
            [708827.872673] [<0>] mdt_device_fini+0x34b/0x930 [mdt]
            [708827.872698] [<0>] class_cleanup+0x9b8/0xc50 [obdclass]
            [708827.872713] [<0>] class_process_config+0x65c/0x2830 [obdclass]
            [708827.872728] [<0>] class_manual_cleanup+0x1c6/0x710 [obdclass]
            [708827.872745] [<0>] server_put_super+0xa35/0x1150 [obdclass]
            [708827.872748] [<0>] generic_shutdown_super+0x6d/0x100
            [708827.872750] [<0>] kill_anon_super+0x12/0x20
            [708827.872764] [<0>] lustre_kill_super+0x32/0x50 [obdclass]
            [708827.872765] [<0>] deactivate_locked_super+0x4e/0x70
            [708827.872766] [<0>] deactivate_super+0x46/0x60
            [708827.872768] [<0>] cleanup_mnt+0x3f/0x80
            [708827.872770] [<0>] __cleanup_mnt+0x12/0x20
            [708827.872774] [<0>] task_work_run+0xbb/0xe0
            [708827.872776] [<0>] do_notify_resume+0xa5/0xc0
            [708827.872778] [<0>] int_signal+0x12/0x17
            [708827.872795] [<0>] 0xfffffffffffffffe
            [708827.872797] Kernel panic - not syncing: LBUG
            

            and

            [10089.987070] Lustre: vriprod1-OST000f: deleting orphan objects from 0x0:10180632 to 0x0:10182049
            [10090.183768] LustreError: 29027:0:(lfsck_namespace.c:5896:lfsck_namespace_scan_local_lpf_one()) ASSERTION( dt_object_exists(child) ) failed: 
            [10090.185375] LustreError: 29027:0:(lfsck_namespace.c:5896:lfsck_namespace_scan_local_lpf_one()) LBUG
            [10090.186495] Pid: 29027, comm: lfsck_namespace 3.10.0-1160.49.1.el7_lustre.ddn16.x86_64 #1 SMP Mon Dec 20 11:42:01 PST 2021
            [10090.186497] Call Trace:
            [10090.186526] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
            [10090.186532] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [10090.186571] [<0>] lfsck_namespace_scan_local_lpf_one+0xa39/0xdf0 [lfsck]
            [10090.186580] [<0>] lfsck_namespace_scan_local_lpf+0x59c/0x970 [lfsck]
            [10090.186592] [<0>] lfsck_namespace_assistant_handler_p2+0x682/0xa80 [lfsck]
            [10090.186600] [<0>] lfsck_assistant_engine+0xfb1/0x20a0 [lfsck]
            [10090.186604] [<0>] kthread+0xd1/0xe0
            [10090.186607] [<0>] ret_from_fork_nospec_begin+0x7/0x21
            [10090.186632] [<0>] 0xfffffffffffffffe
            [10090.186633] Kernel panic - not syncing: LBUG
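
The second trace fails on an assertion that a child object exists. As a hedged sketch of the alternative (not taken from the patch; `struct obj`, `scan_one_child()`, and the `dangling` counter are hypothetical), a scanner can treat a name entry whose target is missing as an inconsistency to count and skip:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an object located through a name entry. */
struct obj {
	bool exists; /* false when the entry points at a missing object */
};

/*
 * Handle one child found during a directory scan.
 *
 * A name entry can outlive its target when the disk is corrupt, so a
 * missing child is an expected input, not a programming error.
 * Returns 1 if the child was processed, 0 if it was skipped as
 * dangling, and -EINVAL if an argument is NULL.
 */
static int scan_one_child(const struct obj *child, unsigned int *dangling)
{
	if (child == NULL || dangling == NULL)
		return -EINVAL;

	if (!child->exists) {
		/* Record the inconsistency and let the scan continue. */
		(*dangling)++;
		return 0;
	}

	/* Real per-object checks would run here. */
	return 1;
}
```

Counting the skipped entries keeps the inconsistency visible in the scan result without taking the server down.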
            

            People

              laisiyao Lai Siyao
              laisiyao Lai Siyao