  Lustre / LU-11516

ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: LBUG


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.10.5
    • Labels: None
    • Environment: x86_64, zfs, DNE, centos 7.5
    • Severity: 1

    Description

      Hi,

      After a namespace lfsck today, and not being able to run a production system due to the "stale" directory problems documented in LU-11418, I had to reboot the MDS.
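
      (For reference, the lfsck was started with something like the following - exact flags from memory:)

        # start a namespace lfsck on the primary MDT; -A extends it to all MDTs
        lctl lfsck_start -t namespace -A -M dagg-MDT0000
        # progress was watched with
        lctl get_param -n mdd.dagg-MDT0000.lfsck_namespace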

      Unfortunately I can no longer mount the filesystem - the MDS LBUGs before the MDTs are fully up.

      I've tried mounting with -o skip_lfsck (see the sketch below), but it doesn't help.
      With the patch from LU-11418 I get nothing at all on the serial console - just a complete freeze of some sort.
      Without that patch, I get the LBUG below.
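
      (The skip_lfsck attempt was along these lines - the ZFS pool/dataset name here is illustrative, not our real one:)

        # mount a ZFS-backed MDT with automatic lfsck resume disabled
        mount -t lustre -o skip_lfsck dagg-mdt0/mdt0 /mnt/mdt0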

      ...
      2018-10-14 04:14:50 [  144.564183] LNet: Using FMR for registration
      2018-10-14 04:14:50 [  144.565919] LNetError: 104:0:(o2iblnd_cb.c:2299:kiblnd_passive_connect()) Can't accept conn from 192.168.44.102@o2ib44 on NA (ib0:0:192.168.44.22): bad dst nid 192.168.44.22@o2ib44
      2018-10-14 04:14:50 [  144.639679] LNet: Added LNI 192.168.44.22@o2ib44 [128/2048/0/180]
      2018-10-14 04:14:50 [  144.693364] LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.140@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
      2018-10-14 04:14:50 [  144.748380] Lustre: dagg-MDT0000: Not available for connect from 192.168.44.155@o2ib44 (not set up)
      2018-10-14 04:14:50 [  144.850890] Lustre: dagg-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      2018-10-14 04:14:50 [  144.893093] Lustre: dagg-MDT0000: Will be in recovery for at least 2:30, or until 130 clients reconnect
      2018-10-14 04:14:50 [  144.903035] Lustre: dagg-MDT0000: Connection restored to  (at 192.168.44.160@o2ib44)
      2018-10-14 04:14:50 [  144.904095] LNet: 18002:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 192.168.44.34@o2ib44: connected
      2018-10-14 04:14:50 [  144.922881] Lustre: Skipped 2 previous similar messages
      2018-10-14 04:14:50 [  145.197706] LustreError: 137-5: dagg-MDT0001_UUID: not available for connect from 192.168.44.135@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
      2018-10-14 04:14:50 [  145.216609] LustreError: Skipped 100 previous similar messages
      2018-10-14 04:14:51 [  145.405422] Lustre: dagg-MDT0000: Connection restored to e995292c-7ea9-b4b9-296b-7fc43479891e (at 192.168.44.204@o2ib44)
      2018-10-14 04:14:51 [  145.416862] Lustre: Skipped 45 previous similar messages
      2018-10-14 04:14:51 [  145.749608] LustreError: 11-0: dagg-MDT0000-osp-MDT0001: operation mds_connect to node 0@lo failed: rc = -114
      2018-10-14 04:14:51 [  145.797392] Lustre: dagg-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      2018-10-14 04:14:52 [  146.517192] Lustre: dagg-MDT0001: Will be in recovery for at least 2:30, or until 130 clients reconnect
      2018-10-14 04:14:52 [  146.517895] Lustre: dagg-MDT0000: Connection restored to  (at 192.168.44.145@o2ib44)
      2018-10-14 04:14:52 [  146.517896] Lustre: Skipped 17 previous similar messages
      2018-10-14 04:14:52 [  146.648101] LustreError: 137-5: dagg-MDT0002_UUID: not available for connect from 192.168.44.155@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
      2018-10-14 04:14:52 [  146.667038] LustreError: Skipped 27 previous similar messages
      2018-10-14 04:14:53 [  147.742695] LustreError: 11-0: dagg-MDT0000-osp-MDT0002: operation mds_connect to node 0@lo failed: rc = -114
      2018-10-14 04:14:53 [  147.809966] Lustre: dagg-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      2018-10-14 04:14:53 [  147.906112] Lustre: dagg-MDT0002: Will be in recovery for at least 2:30, or until 130 clients reconnect
      2018-10-14 04:14:54 [  148.619484] Lustre: dagg-MDT0002: Connection restored to 4043286f-fc2d-0b38-4c8b-0a509fa70b9d (at 192.168.44.150@o2ib44)
      2018-10-14 04:14:54 [  148.631061] Lustre: Skipped 13 previous similar messages
      2018-10-14 04:14:58 [  152.622612] Lustre: dagg-MDT0001: Connection restored to 4313929d-144f-a365-76bf-0741877d11de (at 192.168.44.182@o2ib44)
      2018-10-14 04:14:58 [  152.634166] Lustre: Skipped 167 previous similar messages
      2018-10-14 04:14:58 [  153.193285] Lustre: dagg-MDT0000: Recovery already passed deadline 2:21. It is due to DNE recovery failed/stuck on the 2 MDT(s): 0001 0002. Please wait until all MDTs recovered or abort the recovery by force.
      2018-10-14 04:14:59 [  153.938830] Lustre: dagg-MDT0000: Recovery already passed deadline 2:21. It is due to DNE recovery failed/stuck on the 2 MDT(s): 0001 0002. Please wait until all MDTs recovered or abort the recovery by force.
      2018-10-14 04:15:09 [  163.702354] Lustre: dagg-MDT0002: Connection restored to 66a65625-28da-d50f-c83b-b2f538a83827 (at 192.168.44.195@o2ib44)
      2018-10-14 04:15:09 [  163.713974] Lustre: Skipped 135 previous similar messages
      2018-10-14 04:15:25 [  179.961770] Lustre: dagg-MDT0000: Connection restored to d2eebaed-e98f-5c92-c1f5-5b272ce31e82 (at 10.8.49.221@tcp201)
      2018-10-14 04:15:25 [  179.973196] Lustre: Skipped 38 previous similar messages
      2018-10-14 04:15:31 [  185.747958] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b825e9f2d00 x1613591234569248/t0(0) o601->dagg-MDT0000-lwp-OST0001_UUID@192.168.44.31@o2ib44:652/0 lens 336/0 e 0 to 0 dl 1539450937 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      2018-10-14 04:15:31 [  185.776641] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 2 previous similar messages
      2018-10-14 04:15:36 [  190.915287] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b825ec6e300 x1613592493156656/t0(0) o601->dagg-MDT0000-lwp-OST000b_UUID@192.168.44.36@o2ib44:657/0 lens 336/0 e 0 to 0 dl 1539450942 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      2018-10-14 04:15:36 [  190.944086] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 4 previous similar messages
      2018-10-14 04:15:38 [  192.736539] LustreError: 42042:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b2334027200 x1613591230816544/t0(0) o601->dagg-MDT0000-lwp-OST0004_UUID@192.168.44.33@o2ib44:659/0 lens 336/0 e 0 to 0 dl 1539450944 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      2018-10-14 04:15:38 [  192.765302] LustreError: 42042:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 3 previous similar messages
      2018-10-14 04:15:40 [  194.809319] LustreError: 42037:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b825e9cc200 x1613592493190864/t0(0) o601->dagg-MDT0000-lwp-OST0007_UUID@192.168.44.34@o2ib44:661/0 lens 336/0 e 0 to 0 dl 1539450946 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      2018-10-14 04:15:40 [  194.838140] LustreError: 42037:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 8 previous similar messages
      2018-10-14 04:15:43 [  197.681575] Lustre: dagg-MDT0000: Recovery over after 0:53, of 130 clients 130 recovered and 0 were evicted.
      2018-10-14 04:15:43 [  197.691192] Lustre: 20075:0:(llog.c:572:llog_process_thread()) dagg-MDT0000-osp-MDT0001: invalid length 0 in llog [0x1:0x19a42:0x2]record for index 0/4
      2018-10-14 04:15:43 [  197.691198] LustreError: 20075:0:(lod_dev.c:420:lod_sub_recovery_thread()) dagg-MDT0000-osp-MDT0001 getting update log failed: rc = -22
      2018-10-14 04:15:43 [  197.748062] LustreError: 43000:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: 
      2018-10-14 04:15:43 [  197.762055] LustreError: 43000:0:(lu_object.h:862:lu_object_attr()) LBUG
      2018-10-14 04:15:43 [  197.769608] Pid: 43000, comm: orph_cleanup_da 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
      2018-10-14 04:15:43 [  197.780185] Call Trace:
      2018-10-14 04:15:43 [  197.783478]  [<ffffffffc06f07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      2018-10-14 04:15:43 [  197.790871]  [<ffffffffc06f087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      2018-10-14 04:15:43 [  197.797889]  [<ffffffffc0b722dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
      2018-10-14 04:15:43 [  197.805872]  [<ffffffffc0b72761>] orph_key_test_and_del+0x431/0xd30 [mdd]
      2018-10-14 04:15:43 [  197.813494]  [<ffffffffc0b73617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
      2018-10-14 04:15:43 [  197.821009]  [<ffffffff8d2bb621>] kthread+0xd1/0xe0
      2018-10-14 04:15:43 [  197.826702]  [<ffffffff8d9205dd>] ret_from_fork_nospec_begin+0x7/0x21
      2018-10-14 04:15:43 [  197.833942]  [<ffffffffffffffff>] 0xffffffffffffffff
      2018-10-14 04:15:43 [  197.839723] Kernel panic - not syncing: LBUG
      2018-10-14 04:15:43 [  197.844759] CPU: 9 PID: 43000 Comm: orph_cleanup_da Tainted: P           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
      2018-10-14 04:15:43 [  197.856369] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.4.8 05/21/2018
      2018-10-14 04:15:43 [  197.864605] Call Trace:
      2018-10-14 04:15:43 [  197.867810]  [<ffffffff8d90e84e>] dump_stack+0x19/0x1b
      2018-10-14 04:15:43 [  197.873696]  [<ffffffff8d908b50>] panic+0xe8/0x21f
      2018-10-14 04:15:43 [  197.879223]  [<ffffffffc06f08cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      2018-10-14 04:15:43 [  197.886124]  [<ffffffffc0b722dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
      2018-10-14 04:15:43 [  197.893965]  [<ffffffffc11c4439>] ? lod_trans_create+0x39/0x50 [lod]
      2018-10-14 04:15:43 [  197.901017]  [<ffffffffc0b72761>] orph_key_test_and_del+0x431/0xd30 [mdd]
      2018-10-14 04:15:43 [  197.908491]  [<ffffffffc0b73617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
      2018-10-14 04:15:43 [  197.915874]  [<ffffffffc0b73060>] ? orph_key_test_and_del+0xd30/0xd30 [mdd]
      2018-10-14 04:15:43 [  197.923506]  [<ffffffff8d2bb621>] kthread+0xd1/0xe0
      2018-10-14 04:15:43 [  197.929044]  [<ffffffff8d2bb550>] ? insert_kthread_work+0x40/0x40
      2018-10-14 04:15:43 [  197.935789]  [<ffffffff8d9205dd>] ret_from_fork_nospec_begin+0x7/0x21
      2018-10-14 04:15:43 [  197.942869]  [<ffffffff8d2bb550>] ? insert_kthread_work+0x40/0x40
      2018-10-14 04:15:43 [  197.949604] Kernel Offset: 0xc200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
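
      (For completeness: the "abort the recovery by force" hint in the log presumably corresponds to something like the following, using the MDTs reported as stuck:)

        # force-abort recovery on the stuck MDTs
        lctl --device dagg-MDT0001 abort_recovery
        lctl --device dagg-MDT0002 abort_recovery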
      

      Our standard server Lustre is 2.10.4 plus these patches:

        lu10683-lu11093-checksum-overquota-gerrit32788-1fb85e7e.patch
        lu10988-lfsck2-gerrit32522-21d33c11.patch
        lu11074-mdc-xattr-gerrit32739-dea1cde9.patch
        lu11107-xattr-gerrit32753-c96a8f08.patch
        lu11111-lfsck-gerrit32796-693fe452.patch
        lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
        lu11062-stacktrace-gerrit32972-7232c445.patch
        lu11301-stuckMdtThreads2-c43baa1c.patch
        lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff
      

      But I also see what looks like the same LBUG when I boot into a testing server image with 2.10.5 + these patches:

        lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
        lu11111-lfsck-gerrit32796-693fe452.ported.patch    # fixed to apply cleanly
        lu11201-lfsckDoesntFinish-gerrit33078-4829fb05.patch
        lu11301-stuckMdtThreads2-c43baa1c.patch
        lu11418-hungMdtZfs-gerrit33248-eaa3c60d.diff     # doesn't apply cleanly. needs lustre/include/obd_support.h edit
        lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff
      
      2018-10-14 04:45:14 [  256.907785] LustreError: 46668:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
      2018-10-14 04:45:14 [  256.921784] LustreError: 46668:0:(lu_object.h:862:lu_object_attr()) LBUG
      2018-10-14 04:45:14 [  256.929337] Pid: 46668, comm: orph_cleanup_da 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
      2018-10-14 04:45:14 [  256.939901] Call Trace:
      2018-10-14 04:45:14 [  256.943185]  [<ffffffffc08ae7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      2018-10-14 04:45:14 [  256.950584]  [<ffffffffc08ae87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      2018-10-14 04:45:14 [  256.957618]  [<ffffffffc1abf2dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
      2018-10-14 04:45:14 [  256.965586]  [<ffffffffc1abf761>] orph_key_test_and_del+0x431/0xd30 [mdd]
      2018-10-14 04:45:14 [  256.973192]  [<ffffffffc1ac0617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
      2018-10-14 04:45:14 [  256.980697]  [<ffffffffb4cbb621>] kthread+0xd1/0xe0
      2018-10-14 04:45:14 [  256.986380]  [<ffffffffb53205dd>] ret_from_fork_nospec_begin+0x7/0x21
      2018-10-14 04:45:14 [  256.993618]  [<ffffffffffffffff>] 0xffffffffffffffff
      2018-10-14 04:45:14 [  256.999378] Kernel panic - not syncing: LBUG
      2018-10-14 04:45:14 [  257.004389] CPU: 5 PID: 46668 Comm: orph_cleanup_da Tainted: P           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
      2018-10-14 04:45:14 [  257.015980] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.4.8 05/21/2018
      2018-10-14 04:45:14 [  257.024200] Call Trace:
      2018-10-14 04:45:14 [  257.027392]  [<ffffffffb530e84e>] dump_stack+0x19/0x1b
      2018-10-14 04:45:14 [  257.033268]  [<ffffffffb5308b50>] panic+0xe8/0x21f
      2018-10-14 04:45:14 [  257.038783]  [<ffffffffc08ae8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      2018-10-14 04:45:14 [  257.045676]  [<ffffffffc1abf2dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
      2018-10-14 04:45:14 [  257.053528]  [<ffffffffc1a24439>] ? lod_trans_create+0x39/0x50 [lod]
      2018-10-14 04:45:14 [  257.060588]  [<ffffffffc1abf761>] orph_key_test_and_del+0x431/0xd30 [mdd]
      2018-10-14 04:45:14 [  257.068071]  [<ffffffffc1ac0617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
      2018-10-14 04:45:14 [  257.075460]  [<ffffffffc1ac0060>] ? orph_key_test_and_del+0xd30/0xd30 [mdd]
      2018-10-14 04:45:14 [  257.083100]  [<ffffffffb4cbb621>] kthread+0xd1/0xe0
      2018-10-14 04:45:14 [  257.088650]  [<ffffffffb4cbb550>] ? insert_kthread_work+0x40/0x40
      2018-10-14 04:45:14 [  257.095401]  [<ffffffffb53205dd>] ret_from_fork_nospec_begin+0x7/0x21
      2018-10-14 04:45:14 [  257.102490]  [<ffffffffb4cbb550>] ? insert_kthread_work+0x40/0x40
      2018-10-14 04:45:14 [  257.109241] Kernel Offset: 0x33c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      

      cheers,
      robin

            People

              Assignee: Lai Siyao (laisiyao)
              Reporter: SC Admin (scadmin)