[LU-11516] ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: LBUG Created: 13/Oct/18 Updated: 11/Jul/19 Resolved: 02/Nov/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.5 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | SC Admin (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | x86_64, zfs, DNE, centos 7.5 |
| Attachments: | |
| Issue Links: | |
| Severity: | 1 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi, after a namespace lfsck today, and not being able to run a production system due to the "stale" directory problems documented in , unfortunately now I can no longer mount the filesystem - the MDS LBUGs before the MDTs are fully up. I've tried mounting with -o skip_lfsck, but it doesn't help.
...
2018-10-14 04:14:50 [ 144.564183] LNet: Using FMR for registration
2018-10-14 04:14:50 [ 144.565919] LNetError: 104:0:(o2iblnd_cb.c:2299:kiblnd_passive_connect()) Can't accept conn from 192.168.44.102@o2ib44 on NA (ib0:0:192.168.44.22): bad dst nid 192.168.44.22@o2ib44
2018-10-14 04:14:50 [ 144.639679] LNet: Added LNI 192.168.44.22@o2ib44 [128/2048/0/180]
2018-10-14 04:14:50 [ 144.693364] LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.140@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
2018-10-14 04:14:50 [ 144.748380] Lustre: dagg-MDT0000: Not available for connect from 192.168.44.155@o2ib44 (not set up)
2018-10-14 04:14:50 [ 144.850890] Lustre: dagg-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
2018-10-14 04:14:50 [ 144.893093] Lustre: dagg-MDT0000: Will be in recovery for at least 2:30, or until 130 clients reconnect
2018-10-14 04:14:50 [ 144.903035] Lustre: dagg-MDT0000: Connection restored to (at 192.168.44.160@o2ib44)
2018-10-14 04:14:50 [ 144.904095] LNet: 18002:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 192.168.44.34@o2ib44: connected
2018-10-14 04:14:50 [ 144.922881] Lustre: Skipped 2 previous similar messages
2018-10-14 04:14:50 [ 145.197706] LustreError: 137-5: dagg-MDT0001_UUID: not available for connect from 192.168.44.135@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
2018-10-14 04:14:50 [ 145.216609] LustreError: Skipped 100 previous similar messages
2018-10-14 04:14:51 [ 145.405422] Lustre: dagg-MDT0000: Connection restored to e995292c-7ea9-b4b9-296b-7fc43479891e (at 192.168.44.204@o2ib44)
2018-10-14 04:14:51 [ 145.416862] Lustre: Skipped 45 previous similar messages
2018-10-14 04:14:51 [ 145.749608] LustreError: 11-0: dagg-MDT0000-osp-MDT0001: operation mds_connect to node 0@lo failed: rc = -114
2018-10-14 04:14:51 [ 145.797392] Lustre: dagg-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
2018-10-14 04:14:52 [ 146.517192] Lustre: dagg-MDT0001: Will be in recovery for at least 2:30, or until 130 clients reconnect
2018-10-14 04:14:52 [ 146.517895] Lustre: dagg-MDT0000: Connection restored to (at 192.168.44.145@o2ib44)
2018-10-14 04:14:52 [ 146.517896] Lustre: Skipped 17 previous similar messages
2018-10-14 04:14:52 [ 146.648101] LustreError: 137-5: dagg-MDT0002_UUID: not available for connect from 192.168.44.155@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
2018-10-14 04:14:52 [ 146.667038] LustreError: Skipped 27 previous similar messages
2018-10-14 04:14:53 [ 147.742695] LustreError: 11-0: dagg-MDT0000-osp-MDT0002: operation mds_connect to node 0@lo failed: rc = -114
2018-10-14 04:14:53 [ 147.809966] Lustre: dagg-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
2018-10-14 04:14:53 [ 147.906112] Lustre: dagg-MDT0002: Will be in recovery for at least 2:30, or until 130 clients reconnect
2018-10-14 04:14:54 [ 148.619484] Lustre: dagg-MDT0002: Connection restored to 4043286f-fc2d-0b38-4c8b-0a509fa70b9d (at 192.168.44.150@o2ib44)
2018-10-14 04:14:54 [ 148.631061] Lustre: Skipped 13 previous similar messages
2018-10-14 04:14:58 [ 152.622612] Lustre: dagg-MDT0001: Connection restored to 4313929d-144f-a365-76bf-0741877d11de (at 192.168.44.182@o2ib44)
2018-10-14 04:14:58 [ 152.634166] Lustre: Skipped 167 previous similar messages
2018-10-14 04:14:58 [ 153.193285] Lustre: dagg-MDT0000: Recovery already passed deadline 2:21. It is due to DNE recovery failed/stuck on the 2 MDT(s): 0001 0002. Please wait until all MDTs recovered or abort the recovery by force.
2018-10-14 04:14:59 [ 153.938830] Lustre: dagg-MDT0000: Recovery already passed deadline 2:21. It is due to DNE recovery failed/stuck on the 2 MDT(s): 0001 0002. Please wait until all MDTs recovered or abort the recovery by force.
2018-10-14 04:15:09 [ 163.702354] Lustre: dagg-MDT0002: Connection restored to 66a65625-28da-d50f-c83b-b2f538a83827 (at 192.168.44.195@o2ib44)
2018-10-14 04:15:09 [ 163.713974] Lustre: Skipped 135 previous similar messages
2018-10-14 04:15:25 [ 179.961770] Lustre: dagg-MDT0000: Connection restored to d2eebaed-e98f-5c92-c1f5-5b272ce31e82 (at 10.8.49.221@tcp201)
2018-10-14 04:15:25 [ 179.973196] Lustre: Skipped 38 previous similar messages
2018-10-14 04:15:31 [ 185.747958] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff8b825e9f2d00 x1613591234569248/t0(0) o601->dagg-MDT0000-lwp-OST0001_UUID@192.168.44.31@o2ib44:652/0 lens 336/0 e 0 to 0 dl 1539450937 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:31 [ 185.776641] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 2 previous similar messages
2018-10-14 04:15:36 [ 190.915287] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff8b825ec6e300 x1613592493156656/t0(0) o601->dagg-MDT0000-lwp-OST000b_UUID@192.168.44.36@o2ib44:657/0 lens 336/0 e 0 to 0 dl 1539450942 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:36 [ 190.944086] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 4 previous similar messages
2018-10-14 04:15:38 [ 192.736539] LustreError: 42042:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff8b2334027200 x1613591230816544/t0(0) o601->dagg-MDT0000-lwp-OST0004_UUID@192.168.44.33@o2ib44:659/0 lens 336/0 e 0 to 0 dl 1539450944 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:38 [ 192.765302] LustreError: 42042:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 3 previous similar messages
2018-10-14 04:15:40 [ 194.809319] LustreError: 42037:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff8b825e9cc200 x1613592493190864/t0(0) o601->dagg-MDT0000-lwp-OST0007_UUID@192.168.44.34@o2ib44:661/0 lens 336/0 e 0 to 0 dl 1539450946 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:40 [ 194.838140] LustreError: 42037:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 8 previous similar messages
2018-10-14 04:15:43 [ 197.681575] Lustre: dagg-MDT0000: Recovery over after 0:53, of 130 clients 130 recovered and 0 were evicted.
2018-10-14 04:15:43 [ 197.691192] Lustre: 20075:0:(llog.c:572:llog_process_thread()) dagg-MDT0000-osp-MDT0001: invalid length 0 in llog [0x1:0x19a42:0x2]record for index 0/4
2018-10-14 04:15:43 [ 197.691198] LustreError: 20075:0:(lod_dev.c:420:lod_sub_recovery_thread()) dagg-MDT0000-osp-MDT0001 getting update log failed: rc = -22
2018-10-14 04:15:43 [ 197.748062] LustreError: 43000:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
2018-10-14 04:15:43 [ 197.762055] LustreError: 43000:0:(lu_object.h:862:lu_object_attr()) LBUG
2018-10-14 04:15:43 [ 197.769608] Pid: 43000, comm: orph_cleanup_da 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
2018-10-14 04:15:43 [ 197.780185] Call Trace:
2018-10-14 04:15:43 [ 197.783478] [<ffffffffc06f07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2018-10-14 04:15:43 [ 197.790871] [<ffffffffc06f087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2018-10-14 04:15:43 [ 197.797889] [<ffffffffc0b722dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:15:43 [ 197.805872] [<ffffffffc0b72761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:15:43 [ 197.813494] [<ffffffffc0b73617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:15:43 [ 197.821009] [<ffffffff8d2bb621>] kthread+0xd1/0xe0
2018-10-14 04:15:43 [ 197.826702] [<ffffffff8d9205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:15:43 [ 197.833942] [<ffffffffffffffff>] 0xffffffffffffffff
2018-10-14 04:15:43 [ 197.839723] Kernel panic - not syncing: LBUG
2018-10-14 04:15:43 [ 197.844759] CPU: 9 PID: 43000 Comm: orph_cleanup_da Tainted: P OE ------------ 3.10.0-862.9.1.el7.x86_64 #1
2018-10-14 04:15:43 [ 197.856369] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.4.8 05/21/2018
2018-10-14 04:15:43 [ 197.864605] Call Trace:
2018-10-14 04:15:43 [ 197.867810] [<ffffffff8d90e84e>] dump_stack+0x19/0x1b
2018-10-14 04:15:43 [ 197.873696] [<ffffffff8d908b50>] panic+0xe8/0x21f
2018-10-14 04:15:43 [ 197.879223] [<ffffffffc06f08cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
2018-10-14 04:15:43 [ 197.886124] [<ffffffffc0b722dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:15:43 [ 197.893965] [<ffffffffc11c4439>] ? lod_trans_create+0x39/0x50 [lod]
2018-10-14 04:15:43 [ 197.901017] [<ffffffffc0b72761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:15:43 [ 197.908491] [<ffffffffc0b73617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:15:43 [ 197.915874] [<ffffffffc0b73060>] ? orph_key_test_and_del+0xd30/0xd30 [mdd]
2018-10-14 04:15:43 [ 197.923506] [<ffffffff8d2bb621>] kthread+0xd1/0xe0
2018-10-14 04:15:43 [ 197.929044] [<ffffffff8d2bb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:15:43 [ 197.935789] [<ffffffff8d9205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:15:43 [ 197.942869] [<ffffffff8d2bb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:15:43 [ 197.949604] Kernel Offset: 0xc200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Our standard server Lustre is 2.10.4 plus these patches:
lu10683-lu11093-checksum-overquota-gerrit32788-1fb85e7e.patch
lu10988-lfsck2-gerrit32522-21d33c11.patch
lu11074-mdc-xattr-gerrit32739-dea1cde9.patch
lu11107-xattr-gerrit32753-c96a8f08.patch
lu11111-lfsck-gerrit32796-693fe452.patch
lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
lu11062-stacktrace-gerrit32972-7232c445.patch
lu11301-stuckMdtThreads2-c43baa1c.patch
lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff

but it also looks like the same LBUG when I boot into a testing server image with 2.10.5 + these patches:
lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
lu11111-lfsck-gerrit32796-693fe452.ported.patch  # fixed to apply cleanly
lu11201-lfsckDoesntFinish-gerrit33078-4829fb05.patch
lu11301-stuckMdtThreads2-c43baa1c.patch
lu11418-hungMdtZfs-gerrit33248-eaa3c60d.diff  # doesn't apply cleanly. needs lustre/include/obd_support.h edit
lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff

2018-10-14 04:45:14 [ 256.907785] LustreError: 46668:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
2018-10-14 04:45:14 [ 256.921784] LustreError: 46668:0:(lu_object.h:862:lu_object_attr()) LBUG
2018-10-14 04:45:14 [ 256.929337] Pid: 46668, comm: orph_cleanup_da 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
2018-10-14 04:45:14 [ 256.939901] Call Trace:
2018-10-14 04:45:14 [ 256.943185] [<ffffffffc08ae7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2018-10-14 04:45:14 [ 256.950584] [<ffffffffc08ae87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2018-10-14 04:45:14 [ 256.957618] [<ffffffffc1abf2dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:45:14 [ 256.965586] [<ffffffffc1abf761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:45:14 [ 256.973192] [<ffffffffc1ac0617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:45:14 [ 256.980697] [<ffffffffb4cbb621>] kthread+0xd1/0xe0
2018-10-14 04:45:14 [ 256.986380] [<ffffffffb53205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:45:14 [ 256.993618] [<ffffffffffffffff>] 0xffffffffffffffff
2018-10-14 04:45:14 [ 256.999378] Kernel panic - not syncing: LBUG
2018-10-14 04:45:14 [ 257.004389] CPU: 5 PID: 46668 Comm: orph_cleanup_da Tainted: P OE ------------ 3.10.0-862.9.1.el7.x86_64 #1
2018-10-14 04:45:14 [ 257.015980] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.4.8 05/21/2018
2018-10-14 04:45:14 [ 257.024200] Call Trace:
2018-10-14 04:45:14 [ 257.027392] [<ffffffffb530e84e>] dump_stack+0x19/0x1b
2018-10-14 04:45:14 [ 257.033268] [<ffffffffb5308b50>] panic+0xe8/0x21f
2018-10-14 04:45:14 [ 257.038783] [<ffffffffc08ae8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
2018-10-14 04:45:14 [ 257.045676] [<ffffffffc1abf2dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:45:14 [ 257.053528] [<ffffffffc1a24439>] ? lod_trans_create+0x39/0x50 [lod]
2018-10-14 04:45:14 [ 257.060588] [<ffffffffc1abf761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:45:14 [ 257.068071] [<ffffffffc1ac0617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:45:14 [ 257.075460] [<ffffffffc1ac0060>] ? orph_key_test_and_del+0xd30/0xd30 [mdd]
2018-10-14 04:45:14 [ 257.083100] [<ffffffffb4cbb621>] kthread+0xd1/0xe0
2018-10-14 04:45:14 [ 257.088650] [<ffffffffb4cbb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:45:14 [ 257.095401] [<ffffffffb53205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:45:14 [ 257.102490] [<ffffffffb4cbb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:45:14 [ 257.109241] Kernel Offset: 0x33c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

cheers, |
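The assertion in both traces fires in the lu_object_attr() accessor (lu_object.h:862 in this build): object attributes may only be read once the object is known to exist, i.e. once LOHA_EXISTS is set in its lu_object_header, and the orphan-cleanup thread reached the accessor with an object for which that bit was clear. The following is a minimal, self-contained toy model of that check, with all names and definitions invented locally for illustration (it is not copied from the Lustre source tree):

/* Toy model of the check behind the LBUG above.  The names mirror the Lustre
 * ones (loh_attr, LOHA_EXISTS, lu_object_attr) but everything here is defined
 * locally for illustration only. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define LOHA_EXISTS 0x01u   /* hypothetical "object exists on disk" bit */

struct lu_object_header { uint32_t loh_attr; };
struct lu_object        { struct lu_object_header *lo_header; };

/* Stand-in for LASSERT(); in the kernel this panics the node with LBUG. */
static uint32_t lu_object_attr(const struct lu_object *o)
{
        assert((o->lo_header->loh_attr & LOHA_EXISTS) != 0);
        return o->lo_header->loh_attr;
}

int main(void)
{
        struct lu_object_header hdr = { .loh_attr = 0 };  /* EXISTS bit clear */
        struct lu_object obj = { .lo_header = &hdr };

        /* Orphan cleanup called the accessor on an object whose backing
         * on-disk object was gone, so the assertion fired -> kernel panic. */
        printf("attr = %u\n", lu_object_attr(&obj));      /* aborts here */
        return 0;
}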
| Comments |
| Comment by Peter Jones [ 13/Oct/18 ] |
|
Lai, can you please advise here? Peter |
| Comment by Andreas Dilger [ 13/Oct/18 ] |
|
It isn't quite clear what the root cause of the problem is, but I pushed a patch that will hopefully get you past the LASSERT() failure during orphan cleanup. That cleanup only handles files that were open-unlinked and are now being deleted, so it isn't a reason to stop the MDS from mounting. |
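For illustration only, here is a sketch of the general shape such a fix can take; this is an assumption about the approach, not the text of the patches at https://review.whamcloud.com/33366 or https://review.whamcloud.com/33368. Instead of asserting, the orphan-cleanup path can notice that the object behind an orphan index entry no longer exists, log it, and skip that entry so the MDS mount proceeds. All functions and types below (object_exists, orphan_cleanup_one, struct object) are invented for this example and are not the real mdd/lod functions from the stack trace:

/* Hypothetical sketch: skip missing orphan objects instead of asserting. */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LOHA_EXISTS 0x01u

struct object { uint32_t attr; const char *fid; };

static bool object_exists(const struct object *o)
{
        return (o->attr & LOHA_EXISTS) != 0;
}

/* Before: an assert on object_exists() would panic the node.
 * After: warn and skip, letting the caller continue with the next entry. */
static int orphan_cleanup_one(struct object *o)
{
        if (!object_exists(o)) {
                fprintf(stderr, "orphan entry %s: object no longer exists, "
                        "skipping\n", o->fid);
                return -ENOENT;
        }
        /* ... declare and delete the orphan index entry as before ... */
        return 0;
}

int main(void)
{
        struct object gone = { .attr = 0, .fid = "[0x200000bd0:0x1:0x0]" };
        struct object live = { .attr = LOHA_EXISTS, .fid = "[0x200000bd0:0x2:0x0]" };

        orphan_cleanup_one(&gone);   /* skipped with a warning */
        orphan_cleanup_one(&live);   /* cleaned up normally */
        return 0;
}

The key design point is that a missing orphan object is recoverable noise at mount time rather than a consistency violation worth panicking over, which matches the comment above: these entries only represent open-unlinked files awaiting deletion.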
| Comment by SC Admin (Inactive) [ 14/Oct/18 ] |
|
Thanks Andreas, that's done the trick. The filesystem is back up. |
| Comment by Gerrit Updater [ 14/Oct/18 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33366 |
| Comment by Gerrit Updater [ 14/Oct/18 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33368 |
| Comment by Andreas Dilger [ 15/Oct/18 ] |
|
Sorry for the patch spam on this ticket; I pushed the b2_10 version of the patch to the master branch and had some trouble fixing that up. Glad that this got your system back working. It would be useful to see the error messages reported on the console when the MDS starts up. Is there just a single message about a bad object in the orphan list, or are there many of them? |
| Comment by SC Admin (Inactive) [ 15/Oct/18 ] |
|
Hi Andreas, the MDS conman log for warble2 is attached. I can't see a specific orphan message at mount, but we do have some ongoing orphan-related messages. There's a bunch of known corruption in the filesystem, mostly from all the DNE MDT hangs. We have been trying to lfsck since July to fix some of this; one lfsck succeeded this weekend, but then we ran into this problem, and we still have others. cheers, |
| Comment by Gerrit Updater [ 02/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33368/ |
| Comment by Peter Jones [ 02/Nov/18 ] |
|
The patch removing the assert has landed for 2.12. The other issues are being tracked under separate tickets. |