[LU-11516] ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: LBUG Created: 13/Oct/18  Updated: 11/Jul/19  Resolved: 02/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: SC Admin (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

x86_64, zfs, DNE, centos 7.5


Attachments: File conman-warble2.log    
Issue Links:
Related
is related to LU-9818 replay-single test 29: lu_object_attr... Resolved
is related to LU-11111 crash doing LFSCK: orph_index_insert(... Resolved
is related to LU-8013 racer test_1: (lu_object.h:862:lu_obj... Resolved
is related to LU-11418 hung threads on MDT and MDT won't umount Resolved
Severity: 1
Rank (Obsolete): 9223372036854775807

Description

Hi,

After a namespace lfsck today, and still not being able to run a production system due to the "stale" directory problems documented in LU-11418, I had to reboot the MDS.

Unfortunately, I can now no longer mount the filesystem - the MDS LBUGs before the MDTs are fully up.

I've tried mounting with -o skip_lfsck, but it doesn't help.
With the patch from LU-11418 I get nothing at all on the serial console - just a complete freeze of some sort.
Without that patch, I get the LBUG below.

...
2018-10-14 04:14:50 [  144.564183] LNet: Using FMR for registration
2018-10-14 04:14:50 [  144.565919] LNetError: 104:0:(o2iblnd_cb.c:2299:kiblnd_passive_connect()) Can't accept conn from 192.168.44.102@o2ib44 on NA (ib0:0:192.168.44.22): bad dst nid 192.168.44.22@o2ib44
2018-10-14 04:14:50 [  144.639679] LNet: Added LNI 192.168.44.22@o2ib44 [128/2048/0/180]
2018-10-14 04:14:50 [  144.693364] LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.140@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
2018-10-14 04:14:50 [  144.748380] Lustre: dagg-MDT0000: Not available for connect from 192.168.44.155@o2ib44 (not set up)
2018-10-14 04:14:50 [  144.850890] Lustre: dagg-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
2018-10-14 04:14:50 [  144.893093] Lustre: dagg-MDT0000: Will be in recovery for at least 2:30, or until 130 clients reconnect
2018-10-14 04:14:50 [  144.903035] Lustre: dagg-MDT0000: Connection restored to  (at 192.168.44.160@o2ib44)
2018-10-14 04:14:50 [  144.904095] LNet: 18002:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 192.168.44.34@o2ib44: connected
2018-10-14 04:14:50 [  144.922881] Lustre: Skipped 2 previous similar messages
2018-10-14 04:14:50 [  145.197706] LustreError: 137-5: dagg-MDT0001_UUID: not available for connect from 192.168.44.135@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
2018-10-14 04:14:50 [  145.216609] LustreError: Skipped 100 previous similar messages
2018-10-14 04:14:51 [  145.405422] Lustre: dagg-MDT0000: Connection restored to e995292c-7ea9-b4b9-296b-7fc43479891e (at 192.168.44.204@o2ib44)
2018-10-14 04:14:51 [  145.416862] Lustre: Skipped 45 previous similar messages
2018-10-14 04:14:51 [  145.749608] LustreError: 11-0: dagg-MDT0000-osp-MDT0001: operation mds_connect to node 0@lo failed: rc = -114
2018-10-14 04:14:51 [  145.797392] Lustre: dagg-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
2018-10-14 04:14:52 [  146.517192] Lustre: dagg-MDT0001: Will be in recovery for at least 2:30, or until 130 clients reconnect
2018-10-14 04:14:52 [  146.517895] Lustre: dagg-MDT0000: Connection restored to  (at 192.168.44.145@o2ib44)
2018-10-14 04:14:52 [  146.517896] Lustre: Skipped 17 previous similar messages
2018-10-14 04:14:52 [  146.648101] LustreError: 137-5: dagg-MDT0002_UUID: not available for connect from 192.168.44.155@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
2018-10-14 04:14:52 [  146.667038] LustreError: Skipped 27 previous similar messages
2018-10-14 04:14:53 [  147.742695] LustreError: 11-0: dagg-MDT0000-osp-MDT0002: operation mds_connect to node 0@lo failed: rc = -114
2018-10-14 04:14:53 [  147.809966] Lustre: dagg-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
2018-10-14 04:14:53 [  147.906112] Lustre: dagg-MDT0002: Will be in recovery for at least 2:30, or until 130 clients reconnect
2018-10-14 04:14:54 [  148.619484] Lustre: dagg-MDT0002: Connection restored to 4043286f-fc2d-0b38-4c8b-0a509fa70b9d (at 192.168.44.150@o2ib44)
2018-10-14 04:14:54 [  148.631061] Lustre: Skipped 13 previous similar messages
2018-10-14 04:14:58 [  152.622612] Lustre: dagg-MDT0001: Connection restored to 4313929d-144f-a365-76bf-0741877d11de (at 192.168.44.182@o2ib44)
2018-10-14 04:14:58 [  152.634166] Lustre: Skipped 167 previous similar messages
2018-10-14 04:14:58 [  153.193285] Lustre: dagg-MDT0000: Recovery already passed deadline 2:21. It is due to DNE recovery failed/stuck on the 2 MDT(s): 0001 0002. Please wait until all MDTs recovered or abort the recovery by force.
2018-10-14 04:14:59 [  153.938830] Lustre: dagg-MDT0000: Recovery already passed deadline 2:21. It is due to DNE recovery failed/stuck on the 2 MDT(s): 0001 0002. Please wait until all MDTs recovered or abort the recovery by force.
2018-10-14 04:15:09 [  163.702354] Lustre: dagg-MDT0002: Connection restored to 66a65625-28da-d50f-c83b-b2f538a83827 (at 192.168.44.195@o2ib44)
2018-10-14 04:15:09 [  163.713974] Lustre: Skipped 135 previous similar messages
2018-10-14 04:15:25 [  179.961770] Lustre: dagg-MDT0000: Connection restored to d2eebaed-e98f-5c92-c1f5-5b272ce31e82 (at 10.8.49.221@tcp201)
2018-10-14 04:15:25 [  179.973196] Lustre: Skipped 38 previous similar messages
2018-10-14 04:15:31 [  185.747958] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b825e9f2d00 x1613591234569248/t0(0) o601->dagg-MDT0000-lwp-OST0001_UUID@192.168.44.31@o2ib44:652/0 lens 336/0 e 0 to 0 dl 1539450937 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:31 [  185.776641] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 2 previous similar messages
2018-10-14 04:15:36 [  190.915287] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b825ec6e300 x1613592493156656/t0(0) o601->dagg-MDT0000-lwp-OST000b_UUID@192.168.44.36@o2ib44:657/0 lens 336/0 e 0 to 0 dl 1539450942 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:36 [  190.944086] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 4 previous similar messages
2018-10-14 04:15:38 [  192.736539] LustreError: 42042:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b2334027200 x1613591230816544/t0(0) o601->dagg-MDT0000-lwp-OST0004_UUID@192.168.44.33@o2ib44:659/0 lens 336/0 e 0 to 0 dl 1539450944 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:38 [  192.765302] LustreError: 42042:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 3 previous similar messages
2018-10-14 04:15:40 [  194.809319] LustreError: 42037:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@ffff8b825e9cc200 x1613592493190864/t0(0) o601->dagg-MDT0000-lwp-OST0007_UUID@192.168.44.34@o2ib44:661/0 lens 336/0 e 0 to 0 dl 1539450946 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:40 [  194.838140] LustreError: 42037:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 8 previous similar messages
2018-10-14 04:15:43 [  197.681575] Lustre: dagg-MDT0000: Recovery over after 0:53, of 130 clients 130 recovered and 0 were evicted.
2018-10-14 04:15:43 [  197.691192] Lustre: 20075:0:(llog.c:572:llog_process_thread()) dagg-MDT0000-osp-MDT0001: invalid length 0 in llog [0x1:0x19a42:0x2]record for index 0/4
2018-10-14 04:15:43 [  197.691198] LustreError: 20075:0:(lod_dev.c:420:lod_sub_recovery_thread()) dagg-MDT0000-osp-MDT0001 getting update log failed: rc = -22
2018-10-14 04:15:43 [  197.748062] LustreError: 43000:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: 
2018-10-14 04:15:43 [  197.762055] LustreError: 43000:0:(lu_object.h:862:lu_object_attr()) LBUG
2018-10-14 04:15:43 [  197.769608] Pid: 43000, comm: orph_cleanup_da 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
2018-10-14 04:15:43 [  197.780185] Call Trace:
2018-10-14 04:15:43 [  197.783478]  [<ffffffffc06f07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2018-10-14 04:15:43 [  197.790871]  [<ffffffffc06f087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2018-10-14 04:15:43 [  197.797889]  [<ffffffffc0b722dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:15:43 [  197.805872]  [<ffffffffc0b72761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:15:43 [  197.813494]  [<ffffffffc0b73617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:15:43 [  197.821009]  [<ffffffff8d2bb621>] kthread+0xd1/0xe0
2018-10-14 04:15:43 [  197.826702]  [<ffffffff8d9205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:15:43 [  197.833942]  [<ffffffffffffffff>] 0xffffffffffffffff
2018-10-14 04:15:43 [  197.839723] Kernel panic - not syncing: LBUG
2018-10-14 04:15:43 [  197.844759] CPU: 9 PID: 43000 Comm: orph_cleanup_da Tainted: P           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
2018-10-14 04:15:43 [  197.856369] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.4.8 05/21/2018
2018-10-14 04:15:43 [  197.864605] Call Trace:
2018-10-14 04:15:43 [  197.867810]  [<ffffffff8d90e84e>] dump_stack+0x19/0x1b
2018-10-14 04:15:43 [  197.873696]  [<ffffffff8d908b50>] panic+0xe8/0x21f
2018-10-14 04:15:43 [  197.879223]  [<ffffffffc06f08cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
2018-10-14 04:15:43 [  197.886124]  [<ffffffffc0b722dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:15:43 [  197.893965]  [<ffffffffc11c4439>] ? lod_trans_create+0x39/0x50 [lod]
2018-10-14 04:15:43 [  197.901017]  [<ffffffffc0b72761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:15:43 [  197.908491]  [<ffffffffc0b73617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:15:43 [  197.915874]  [<ffffffffc0b73060>] ? orph_key_test_and_del+0xd30/0xd30 [mdd]
2018-10-14 04:15:43 [  197.923506]  [<ffffffff8d2bb621>] kthread+0xd1/0xe0
2018-10-14 04:15:43 [  197.929044]  [<ffffffff8d2bb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:15:43 [  197.935789]  [<ffffffff8d9205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:15:43 [  197.942869]  [<ffffffff8d2bb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:15:43 [  197.949604] Kernel Offset: 0xc200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
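
For reference, the assertion that fires is the LOHA_EXISTS check in lu_object_attr() (lu_object.h:862). Roughly, abridged from the 2.10 tree with comments added - the orphan-cleanup thread asks for the attributes of an object whose backing object no longer exists:

  static inline __u32 lu_object_attr(const struct lu_object *o)
  {
          /* callers may only read attributes of objects that exist
           * on disk; orphan cleanup reaches this for an orphan-index
           * entry whose backing object is already gone */
          LASSERT(((o)->lo_header->loh_attr & LOHA_EXISTS) != 0);
          return o->lo_header->loh_attr;
  }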

Our standard server Lustre is 2.10.4 plus these patches:

  lu10683-lu11093-checksum-overquota-gerrit32788-1fb85e7e.patch
  lu10988-lfsck2-gerrit32522-21d33c11.patch
  lu11074-mdc-xattr-gerrit32739-dea1cde9.patch
  lu11107-xattr-gerrit32753-c96a8f08.patch
  lu11111-lfsck-gerrit32796-693fe452.patch
  lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
  lu11062-stacktrace-gerrit32972-7232c445.patch
  lu11301-stuckMdtThreads2-c43baa1c.patch
  lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff

The same LBUG also appears when I boot into a testing server image with 2.10.5 plus these patches:

  lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
  lu11111-lfsck-gerrit32796-693fe452.ported.patch    # fixed to apply cleanly
  lu11201-lfsckDoesntFinish-gerrit33078-4829fb05.patch
  lu11301-stuckMdtThreads2-c43baa1c.patch
  lu11418-hungMdtZfs-gerrit33248-eaa3c60d.diff     # doesn't apply cleanly. needs lustre/include/obd_support.h edit
  lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff

2018-10-14 04:45:14 [  256.907785] LustreError: 46668:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
2018-10-14 04:45:14 [  256.921784] LustreError: 46668:0:(lu_object.h:862:lu_object_attr()) LBUG
2018-10-14 04:45:14 [  256.929337] Pid: 46668, comm: orph_cleanup_da 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
2018-10-14 04:45:14 [  256.939901] Call Trace:
2018-10-14 04:45:14 [  256.943185]  [<ffffffffc08ae7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2018-10-14 04:45:14 [  256.950584]  [<ffffffffc08ae87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2018-10-14 04:45:14 [  256.957618]  [<ffffffffc1abf2dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:45:14 [  256.965586]  [<ffffffffc1abf761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:45:14 [  256.973192]  [<ffffffffc1ac0617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:45:14 [  256.980697]  [<ffffffffb4cbb621>] kthread+0xd1/0xe0
2018-10-14 04:45:14 [  256.986380]  [<ffffffffb53205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:45:14 [  256.993618]  [<ffffffffffffffff>] 0xffffffffffffffff
2018-10-14 04:45:14 [  256.999378] Kernel panic - not syncing: LBUG
2018-10-14 04:45:14 [  257.004389] CPU: 5 PID: 46668 Comm: orph_cleanup_da Tainted: P           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
2018-10-14 04:45:14 [  257.015980] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.4.8 05/21/2018
2018-10-14 04:45:14 [  257.024200] Call Trace:
2018-10-14 04:45:14 [  257.027392]  [<ffffffffb530e84e>] dump_stack+0x19/0x1b
2018-10-14 04:45:14 [  257.033268]  [<ffffffffb5308b50>] panic+0xe8/0x21f
2018-10-14 04:45:14 [  257.038783]  [<ffffffffc08ae8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
2018-10-14 04:45:14 [  257.045676]  [<ffffffffc1abf2dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:45:14 [  257.053528]  [<ffffffffc1a24439>] ? lod_trans_create+0x39/0x50 [lod]
2018-10-14 04:45:14 [  257.060588]  [<ffffffffc1abf761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:45:14 [  257.068071]  [<ffffffffc1ac0617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:45:14 [  257.075460]  [<ffffffffc1ac0060>] ? orph_key_test_and_del+0xd30/0xd30 [mdd]
2018-10-14 04:45:14 [  257.083100]  [<ffffffffb4cbb621>] kthread+0xd1/0xe0
2018-10-14 04:45:14 [  257.088650]  [<ffffffffb4cbb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:45:14 [  257.095401]  [<ffffffffb53205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:45:14 [  257.102490]  [<ffffffffb4cbb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:45:14 [  257.109241] Kernel Offset: 0x33c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

cheers,
robin



Comments
Comment by Peter Jones [ 13/Oct/18 ]

Lai

Can you please advise here?

Peter

Comment by Andreas Dilger [ 13/Oct/18 ]

It isn't quite clear what the root of the problem is, but I pushed a patch that will hopefully get you past the LASSERT() failing during orphan cleanup. That cleanup only handles files that were open-unlinked and are now being deleted, so a failure there isn't a reason to stop the MDS from mounting.
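
A rough sketch of the idea, not the verbatim patch (see https://review.whamcloud.com/33368 for the real change): before touching the attributes of an orphan's backing object, check whether it still exists, and skip the stale orphan-index entry instead of letting the LOHA_EXISTS LASSERT panic the MDS. mdd_object_exists() is the existing mdd helper; exact placement and error handling differ in the real patch:

  /* simplified sketch of the guard in the mdd orphan-cleanup path */
  if (!mdd_object_exists(obj))
          /* stale orphan entry: backing object already destroyed;
           * skip it instead of tripping the LASSERT in lu_object_attr() */
          return -ENOENT;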

Comment by SC Admin (Inactive) [ 14/Oct/18 ]

Thanks Andreas, that's done the trick. The filesystem is back up.

Comment by Gerrit Updater [ 14/Oct/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33366
Subject: LU-11516 mdd: do not assert on missing orphan
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: aa82c8ede4fb0661671542e084ac4a0654efb7d5

Comment by Gerrit Updater [ 14/Oct/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33368
Subject: LU-11516 mdd: do not assert on missing orphan
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 41bb8c3049360ed53c7f8974251911d8717abbdf

Comment by Andreas Dilger [ 15/Oct/18 ]

Sorry for the patch spam on this ticket - I pushed the b2_10 version of the patch to the master branch and had some trouble fixing it.

Glad that this got your system back working.

It would be useful to see the error messages reported on the console when the MDS starts up. Is there just a single message about a bad object in the orphan list, or are there many of them?

Comment by SC Admin (Inactive) [ 15/Oct/18 ]

Hi Andreas,

MDS conman log for warble2 attached.
warble2 is the MDS that has all 3 DNE dagg MDTs on it. The other MDS (warble1) has the MGS and the non-DNE MDTs from 3 smaller filesystems.

I can't see a specific orphan message at mount, though we do have some ongoing orphan-related messages.

There's a bunch of known corruption in the filesystem, mostly from all the DNE MDT hangs. We have been trying to run lfsck since July to fix some of this. One lfsck succeeded this weekend, but then we ran into this problem, and we still have others.

cheers,
robin

Comment by Gerrit Updater [ 02/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33368/
Subject: LU-11516 mdd: do not assert on missing orphan
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5d89450b462f76fe2a377e7253595a162dae309b

Comment by Peter Jones [ 02/Nov/18 ]

The patch removing the assert has landed for 2.12. The other issues are being tracked under separate tickets.
