Lustre / LU-11516

ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.10.5
    • Labels: None
    • Environment: x86_64, zfs, DNE, centos 7.5
    • Severity: 1

    Description

      Hi,

      After a namespace lfsck today, and not being able to run a production system due to the "stale" directory problems documented in LU-11418, I had to reboot the MDS.

      Unfortunately, I can no longer mount the filesystem - the MDS LBUGs before the MDTs are fully up.

      I've tried mounting with -o skip_lfsck, but it doesn't help.
      With the patch from LU-11418 I get nothing at all on the serial console - just a complete freeze of some sort.
      Without that patch, I get the LBUG below.
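
      For context, the assertion in the summary comes from lu_object_attr() in lustre/include/lu_object.h, which requires the LOHA_EXISTS flag to be set in the object header before the attributes are read; the orphan-cleanup thread trips it when it processes an orphan entry whose backing object no longer exists. A minimal standalone model of that check (the names loosely mirror the real structures, but this is not the Lustre source):

        /* Standalone model of the failing check -- not Lustre source. */
        #include <assert.h>
        #include <stdio.h>

        #define LOHA_EXISTS 0x01              /* "object exists" flag (modelled) */

        struct lu_object_header { unsigned int loh_attr; };
        struct lu_object        { struct lu_object_header *lo_header; };

        static unsigned int lu_object_attr(const struct lu_object *o)
        {
                /* This is the LASSERT that turns into the LBUG below when
                 * orphan cleanup finds an entry with no live object behind it. */
                assert((o->lo_header->loh_attr & LOHA_EXISTS) != 0);
                return o->lo_header->loh_attr;
        }

        int main(void)
        {
                struct lu_object_header hdr = { .loh_attr = 0 };  /* EXISTS not set */
                struct lu_object obj = { .lo_header = &hdr };

                printf("attr = %#x\n", lu_object_attr(&obj));     /* assert fires */
                return 0;
        }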

      ...
      2018-10-14 04:14:50 [  144.564183] LNet: Using FMR for registration
      2018-10-14 04:14:50 [  144.565919] LNetError: 104:0:(o2iblnd_cb.c:2299:kiblnd_passive_connect()) Can't accept conn from 192.168.44.102@o2ib44 on NA (ib0:0:192.168.44.22): bad dst nid 192.168.44.22@o2ib44
      2018-10-14 04:14:50 [  144.639679] LNet: Added LNI 192.168.44.22@o2ib44 [128/2048/0/180]
      2018-10-14 04:14:50 [  144.693364] LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.140@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
      2018-10-14 04:14:50 [  144.748380] Lustre: dagg-MDT0000: Not available for connect from 192.168.44.155@o2ib44 (not set up)
      2018-10-14 04:14:50 [  144.850890] Lustre: dagg-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      2018-10-14 04:14:50 [  144.893093] Lustre: dagg-MDT0000: Will be in recovery for at least 2:30, or until 130 clients reconnect
      2018-10-14 04:14:50 [  144.903035] Lustre: dagg-MDT0000: Connection restored to  (at 192.168.44.160@o2ib44)
      2018-10-14 04:14:50 [  144.904095] LNet: 18002:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 192.168.44.34@o2ib44: connected
      2018-10-14 04:14:50 [  144.922881] Lustre: Skipped 2 previous similar messages
      2018-10-14 04:14:50 [  145.197706] LustreError: 137-5: dagg-MDT0001_UUID: not available for connect from 192.168.44.135@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
      2018-10-14 04:14:50 [  145.216609] LustreError: Skipped 100 previous similar messages
      2018-10-14 04:14:51 [  145.405422] Lustre: dagg-MDT0000: Connection restored to e995292c-7ea9-b4b9-296b-7fc43479891e (at 192.168.44.204@o2ib44)
      2018-10-14 04:14:51 [  145.416862] Lustre: Skipped 45 previous similar messages
      2018-10-14 04:14:51 [  145.749608] LustreError: 11-0: dagg-MDT0000-osp-MDT0001: operation mds_connect to node 0@lo failed: rc = -114
      2018-10-14 04:14:51 [  145.797392] Lustre: dagg-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      2018-10-14 04:14:52 [  146.517192] Lustre: dagg-MDT0001: Will be in recovery for at least 2:30, or until 130 clients reconnect
      2018-10-14 04:14:52 [  146.517895] Lustre: dagg-MDT0000: Connection restored to  (at 192.168.44.145@o2ib44)
      2018-10-14 04:14:52 [  146.517896] Lustre: Skipped 17 previous similar messages
      2018-10-14 04:14:52 [  146.648101] LustreError: 137-5: dagg-MDT0002_UUID: not available for connect from 192.168.44.155@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
      2018-10-14 04:14:52 [  146.667038] LustreError: Skipped 27 previous similar messages
      2018-10-14 04:14:53 [  147.742695] LustreError: 11-0: dagg-MDT0000-osp-MDT0002: operation mds_connect to node 0@lo failed: rc = -114
      2018-10-14 04:14:53 [  147.809966] Lustre: dagg-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      2018-10-14 04:14:53 [  147.906112] Lustre: dagg-MDT0002: Will be in recovery for at least 2:30, or until 130 clients reconnect
      2018-10-14 04:14:54 [  148.619484] Lustre: dagg-MDT0002: Connection restored to 4043286f-fc2d-0b38-4c8b-0a509fa70b9d (at 192.168.44.150@o2ib44)
      2018-10-14 04:14:54 [  148.631061] Lustre: Skipped 13 previous similar messages
      2018-10-14 04:14:58 [  152.622612] Lustre: dagg-MDT0001: Connection restored to 4313929d-144f-a365-76bf-0741877d11de (at 192.168.44.182@o2ib44)
      2018-10-14 04:14:58 [  152.634166] Lustre: Skipped 167 previous similar messages
      2018-10-14 04:14:58 [  153.193285] Lustre: dagg-MDT0000: Recovery already passed deadline 2:21. It is due to DNE recovery failed/stuck on the 2 MDT(s): 0001 0002. Please wait until all MDTs recovered or abort the recovery by force.
      2018-10-14 04:14:59 [  153.938830] Lustre: dagg-MDT0000: Recovery already passed deadline 2:21. It is due to DNE recovery failed/stuck on the 2 MDT(s): 0001 0002. Please wait until all MDTs recovered or abort the recovery by force.
      2018-10-14 04:15:09 [  163.702354] Lustre: dagg-MDT0002: Connection restored to 66a65625-28da-d50f-c83b-b2f538a83827 (at 192.168.44.195@o2ib44)
      2018-10-14 04:15:09 [  163.713974] Lustre: Skipped 135 previous similar messages
      2018-10-14 04:15:25 [  179.961770] Lustre: dagg-MDT0000: Connection restored to d2eebaed-e98f-5c92-c1f5-5b272ce31e82 (at 10.8.49.221@tcp201)
      2018-10-14 04:15:25 [  179.973196] Lustre: Skipped 38 previous similar messages
      2018-10-14 04:15:31 [  185.747958] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b825e9f2d00 x1613591234569248/t0(0) o601->dagg-MDT0000-lwp-OST0001_UUID@192.168.44.31@o2ib44:652/0 lens 336/0 e 0 to 0 dl 1539450937 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      2018-10-14 04:15:31 [  185.776641] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 2 previous similar messages
      2018-10-14 04:15:36 [  190.915287] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b825ec6e300 x1613592493156656/t0(0) o601->dagg-MDT0000-lwp-OST000b_UUID@192.168.44.36@o2ib44:657/0 lens 336/0 e 0 to 0 dl 1539450942 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      2018-10-14 04:15:36 [  190.944086] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 4 previous similar messages
      2018-10-14 04:15:38 [  192.736539] LustreError: 42042:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b2334027200 x1613591230816544/t0(0) o601->dagg-MDT0000-lwp-OST0004_UUID@192.168.44.33@o2ib44:659/0 lens 336/0 e 0 to 0 dl 1539450944 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      2018-10-14 04:15:38 [  192.765302] LustreError: 42042:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 3 previous similar messages
      2018-10-14 04:15:40 [  194.809319] LustreError: 42037:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b825e9cc200 x1613592493190864/t0(0) o601->dagg-MDT0000-lwp-OST0007_UUID@192.168.44.34@o2ib44:661/0 lens 336/0 e 0 to 0 dl 1539450946 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      2018-10-14 04:15:40 [  194.838140] LustreError: 42037:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 8 previous similar messages
      2018-10-14 04:15:43 [  197.681575] Lustre: dagg-MDT0000: Recovery over after 0:53, of 130 clients 130 recovered and 0 were evicted.
      2018-10-14 04:15:43 [  197.691192] Lustre: 20075:0:(llog.c:572:llog_process_thread()) dagg-MDT0000-osp-MDT0001: invalid length 0 in llog [0x1:0x19a42:0x2]record for index 0/4
      2018-10-14 04:15:43 [  197.691198] LustreError: 20075:0:(lod_dev.c:420:lod_sub_recovery_thread()) dagg-MDT0000-osp-MDT0001 getting update log failed: rc = -22
      2018-10-14 04:15:43 [  197.748062] LustreError: 43000:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: 
      2018-10-14 04:15:43 [  197.762055] LustreError: 43000:0:(lu_object.h:862:lu_object_attr()) LBUG
      2018-10-14 04:15:43 [  197.769608] Pid: 43000, comm: orph_cleanup_da 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
      2018-10-14 04:15:43 [  197.780185] Call Trace:
      2018-10-14 04:15:43 [  197.783478]  [<ffffffffc06f07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      2018-10-14 04:15:43 [  197.790871]  [<ffffffffc06f087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      2018-10-14 04:15:43 [  197.797889]  [<ffffffffc0b722dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
      2018-10-14 04:15:43 [  197.805872]  [<ffffffffc0b72761>] orph_key_test_and_del+0x431/0xd30 [mdd]
      2018-10-14 04:15:43 [  197.813494]  [<ffffffffc0b73617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
      2018-10-14 04:15:43 [  197.821009]  [<ffffffff8d2bb621>] kthread+0xd1/0xe0
      2018-10-14 04:15:43 [  197.826702]  [<ffffffff8d9205dd>] ret_from_fork_nospec_begin+0x7/0x21
      2018-10-14 04:15:43 [  197.833942]  [<ffffffffffffffff>] 0xffffffffffffffff
      2018-10-14 04:15:43 [  197.839723] Kernel panic - not syncing: LBUG
      2018-10-14 04:15:43 [  197.844759] CPU: 9 PID: 43000 Comm: orph_cleanup_da Tainted: P           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
      2018-10-14 04:15:43 [  197.856369] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.4.8 05/21/2018
      2018-10-14 04:15:43 [  197.864605] Call Trace:
      2018-10-14 04:15:43 [  197.867810]  [<ffffffff8d90e84e>] dump_stack+0x19/0x1b
      2018-10-14 04:15:43 [  197.873696]  [<ffffffff8d908b50>] panic+0xe8/0x21f
      2018-10-14 04:15:43 [  197.879223]  [<ffffffffc06f08cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      2018-10-14 04:15:43 [  197.886124]  [<ffffffffc0b722dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
      2018-10-14 04:15:43 [  197.893965]  [<ffffffffc11c4439>] ? lod_trans_create+0x39/0x50 [lod]
      2018-10-14 04:15:43 [  197.901017]  [<ffffffffc0b72761>] orph_key_test_and_del+0x431/0xd30 [mdd]
      2018-10-14 04:15:43 [  197.908491]  [<ffffffffc0b73617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
      2018-10-14 04:15:43 [  197.915874]  [<ffffffffc0b73060>] ? orph_key_test_and_del+0xd30/0xd30 [mdd]
      2018-10-14 04:15:43 [  197.923506]  [<ffffffff8d2bb621>] kthread+0xd1/0xe0
      2018-10-14 04:15:43 [  197.929044]  [<ffffffff8d2bb550>] ? insert_kthread_work+0x40/0x40
      2018-10-14 04:15:43 [  197.935789]  [<ffffffff8d9205dd>] ret_from_fork_nospec_begin+0x7/0x21
      2018-10-14 04:15:43 [  197.942869]  [<ffffffff8d2bb550>] ? insert_kthread_work+0x40/0x40
      2018-10-14 04:15:43 [  197.949604] Kernel Offset: 0xc200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      

      Our standard server Lustre is 2.10.4 plus these patches:

        lu10683-lu11093-checksum-overquota-gerrit32788-1fb85e7e.patch
        lu10988-lfsck2-gerrit32522-21d33c11.patch
        lu11074-mdc-xattr-gerrit32739-dea1cde9.patch
        lu11107-xattr-gerrit32753-c96a8f08.patch
        lu11111-lfsck-gerrit32796-693fe452.patch
        lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
        lu11062-stacktrace-gerrit32972-7232c445.patch
        lu11301-stuckMdtThreads2-c43baa1c.patch
        lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff
      

      But it also looks like the same LBUG when I boot into a testing server image with 2.10.5 + these patches:

        lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
        lu11111-lfsck-gerrit32796-693fe452.ported.patch    # fixed to apply cleanly
        lu11201-lfsckDoesntFinish-gerrit33078-4829fb05.patch
        lu11301-stuckMdtThreads2-c43baa1c.patch
        lu11418-hungMdtZfs-gerrit33248-eaa3c60d.diff     # doesn't apply cleanly. needs lustre/include/obd_support.h edit
        lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff
      
      2018-10-14 04:45:14 [  256.907785] LustreError: 46668:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
      2018-10-14 04:45:14 [  256.921784] LustreError: 46668:0:(lu_object.h:862:lu_object_attr()) LBUG
      2018-10-14 04:45:14 [  256.929337] Pid: 46668, comm: orph_cleanup_da 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
      2018-10-14 04:45:14 [  256.939901] Call Trace:
      2018-10-14 04:45:14 [  256.943185]  [<ffffffffc08ae7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      2018-10-14 04:45:14 [  256.950584]  [<ffffffffc08ae87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      2018-10-14 04:45:14 [  256.957618]  [<ffffffffc1abf2dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
      2018-10-14 04:45:14 [  256.965586]  [<ffffffffc1abf761>] orph_key_test_and_del+0x431/0xd30 [mdd]
      2018-10-14 04:45:14 [  256.973192]  [<ffffffffc1ac0617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
      2018-10-14 04:45:14 [  256.980697]  [<ffffffffb4cbb621>] kthread+0xd1/0xe0
      2018-10-14 04:45:14 [  256.986380]  [<ffffffffb53205dd>] ret_from_fork_nospec_begin+0x7/0x21
      2018-10-14 04:45:14 [  256.993618]  [<ffffffffffffffff>] 0xffffffffffffffff
      2018-10-14 04:45:14 [  256.999378] Kernel panic - not syncing: LBUG
      2018-10-14 04:45:14 [  257.004389] CPU: 5 PID: 46668 Comm: orph_cleanup_da Tainted: P           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
      2018-10-14 04:45:14 [  257.015980] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.4.8 05/21/2018
      2018-10-14 04:45:14 [  257.024200] Call Trace:
      2018-10-14 04:45:14 [  257.027392]  [<ffffffffb530e84e>] dump_stack+0x19/0x1b
      2018-10-14 04:45:14 [  257.033268]  [<ffffffffb5308b50>] panic+0xe8/0x21f
      2018-10-14 04:45:14 [  257.038783]  [<ffffffffc08ae8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      2018-10-14 04:45:14 [  257.045676]  [<ffffffffc1abf2dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
      2018-10-14 04:45:14 [  257.053528]  [<ffffffffc1a24439>] ? lod_trans_create+0x39/0x50 [lod]
      2018-10-14 04:45:14 [  257.060588]  [<ffffffffc1abf761>] orph_key_test_and_del+0x431/0xd30 [mdd]
      2018-10-14 04:45:14 [  257.068071]  [<ffffffffc1ac0617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
      2018-10-14 04:45:14 [  257.075460]  [<ffffffffc1ac0060>] ? orph_key_test_and_del+0xd30/0xd30 [mdd]
      2018-10-14 04:45:14 [  257.083100]  [<ffffffffb4cbb621>] kthread+0xd1/0xe0
      2018-10-14 04:45:14 [  257.088650]  [<ffffffffb4cbb550>] ? insert_kthread_work+0x40/0x40
      2018-10-14 04:45:14 [  257.095401]  [<ffffffffb53205dd>] ret_from_fork_nospec_begin+0x7/0x21
      2018-10-14 04:45:14 [  257.102490]  [<ffffffffb4cbb550>] ? insert_kthread_work+0x40/0x40
      2018-10-14 04:45:14 [  257.109241] Kernel Offset: 0x33c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      

      cheers,
      robin

      Attachments

        Issue Links

          Activity

            [LU-11516] ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: LBUG
            pjones Peter Jones added a comment -

            The patch removing the assert has landed for 2.12. The other issues are being tracked under different tickets.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33368/
            Subject: LU-11516 mdd: do not assert on missing orphan
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5d89450b462f76fe2a377e7253595a162dae309b

            scadmin SC Admin added a comment -

            Hi Andreas,

            MDS conman log for warble2 attached.
            warble2 is the MDS that has all 3 DNE dagg MDTs on it. The other MDS (warble1) has the MGS and non-DNE MDTs from 3 smaller filesystems.

            I can't see a specific orphan message at mount, though we do have some ongoing orphan-related messages.

            There's a bunch of known corruption in the filesystem, mostly from all the DNE MDT hangs. We have been trying to lfsck since July to try to fix some of this. One lfsck succeeded this weekend, but then we ran into this problem, and we still have others.

            cheers,
            robin


            adilger Andreas Dilger added a comment -

            Sorry for the patch spam on this ticket; I pushed the b2_10 version of the patch to the master branch and had some trouble fixing it.

            Glad that this got your system back working.

            It would be useful to see the error messages reported on the console when the MDS starts up. Is there just a single message about a bad object in the orphan list, or are there many of them?

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33368
            Subject: LU-11516 mdd: do not assert on missing orphan
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 41bb8c3049360ed53c7f8974251911d8717abbdf

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33366
            Subject: LU-11516 mdd: do not assert on missing orphan
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: aa82c8ede4fb0661671542e084ac4a0654efb7d5

            scadmin SC Admin added a comment -

            Thanks Andreas, that's done the trick. The filesystem is back up.


            adilger Andreas Dilger added a comment -

            It isn't quite clear what the root of the problem is, but I pushed a patch that will hopefully get you past the LASSERT() failing during orphan cleanup. That is just for files that were open-unlinked and are now being deleted, so it isn't a reason to stop the mount of the MDS.

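            A minimal, self-contained sketch of the "check and skip instead of assert" approach described above (hypothetical code only; it is not the actual change in https://review.whamcloud.com/33368 and does not use the real mdd function signatures):

                /* Hypothetical sketch: treat an orphan entry whose backing object
                 * no longer exists as already cleaned up, instead of asserting. */
                #include <errno.h>
                #include <stdio.h>

                #define LOHA_EXISTS 0x01

                struct lu_object_header { unsigned int loh_attr; };
                struct lu_object        { struct lu_object_header *lo_header; };

                static int orph_declare_delete_sketch(const struct lu_object *o)
                {
                        if ((o->lo_header->loh_attr & LOHA_EXISTS) == 0) {
                                fprintf(stderr, "orphan object missing, skipping entry\n");
                                return -ENOENT;   /* caller drops the stale PENDING entry */
                        }
                        /* ... the normal delete declaration would follow here ... */
                        return 0;
                }

                int main(void)
                {
                        struct lu_object_header gone = { .loh_attr = 0 };
                        struct lu_object obj = { .lo_header = &gone };

                        printf("rc = %d\n", orph_declare_delete_sketch(&obj));  /* rc = -2 */
                        return 0;
                }
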
            pjones Peter Jones added a comment -

            Lai

            Can you please advise here?

            Peter


            People

              Assignee: laisiyao Lai Siyao
              Reporter: scadmin SC Admin
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: