Details
Type: Bug
Resolution: Fixed
Priority: Minor
Affects Version/s: Lustre 2.10.5
Fix Version/s: None
Environment: x86_64, zfs, DNE, centos 7.5
Severity: 1
Description
Hi,
After a namespace lfsck today, and not being able to run a production system due to the "stale" directory problems documented in LU-11418, I had to reboot the MDS.
Unfortunately I can now no longer mount the filesystem - the MDS LBUGs before the MDTs are fully up.
I've tried mounting with -o skip_lfsck, but it doesn't help.
With the patch from LU-11418 I get nothing at all on the serial console - just a complete freeze of some sort.
Without that patch, I get the LBUG below.
...
2018-10-14 04:14:50 [ 144.564183] LNet: Using FMR for registration
2018-10-14 04:14:50 [ 144.565919] LNetError: 104:0:(o2iblnd_cb.c:2299:kiblnd_passive_connect()) Can't accept conn from 192.168.44.102@o2ib44 on NA (ib0:0:192.168.44.22): bad dst nid 192.168.44.22@o2ib44
2018-10-14 04:14:50 [ 144.639679] LNet: Added LNI 192.168.44.22@o2ib44 [128/2048/0/180]
2018-10-14 04:14:50 [ 144.693364] LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.140@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
2018-10-14 04:14:50 [ 144.748380] Lustre: dagg-MDT0000: Not available for connect from 192.168.44.155@o2ib44 (not set up)
2018-10-14 04:14:50 [ 144.850890] Lustre: dagg-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
2018-10-14 04:14:50 [ 144.893093] Lustre: dagg-MDT0000: Will be in recovery for at least 2:30, or until 130 clients reconnect
2018-10-14 04:14:50 [ 144.903035] Lustre: dagg-MDT0000: Connection restored to (at 192.168.44.160@o2ib44)
2018-10-14 04:14:50 [ 144.904095] LNet: 18002:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 192.168.44.34@o2ib44: connected
2018-10-14 04:14:50 [ 144.922881] Lustre: Skipped 2 previous similar messages
2018-10-14 04:14:50 [ 145.197706] LustreError: 137-5: dagg-MDT0001_UUID: not available for connect from 192.168.44.135@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
2018-10-14 04:14:50 [ 145.216609] LustreError: Skipped 100 previous similar messages
2018-10-14 04:14:51 [ 145.405422] Lustre: dagg-MDT0000: Connection restored to e995292c-7ea9-b4b9-296b-7fc43479891e (at 192.168.44.204@o2ib44)
2018-10-14 04:14:51 [ 145.416862] Lustre: Skipped 45 previous similar messages
2018-10-14 04:14:51 [ 145.749608] LustreError: 11-0: dagg-MDT0000-osp-MDT0001: operation mds_connect to node 0@lo failed: rc = -114
2018-10-14 04:14:51 [ 145.797392] Lustre: dagg-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
2018-10-14 04:14:52 [ 146.517192] Lustre: dagg-MDT0001: Will be in recovery for at least 2:30, or until 130 clients reconnect
2018-10-14 04:14:52 [ 146.517895] Lustre: dagg-MDT0000: Connection restored to (at 192.168.44.145@o2ib44)
2018-10-14 04:14:52 [ 146.517896] Lustre: Skipped 17 previous similar messages
2018-10-14 04:14:52 [ 146.648101] LustreError: 137-5: dagg-MDT0002_UUID: not available for connect from 192.168.44.155@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server.
2018-10-14 04:14:52 [ 146.667038] LustreError: Skipped 27 previous similar messages
2018-10-14 04:14:53 [ 147.742695] LustreError: 11-0: dagg-MDT0000-osp-MDT0002: operation mds_connect to node 0@lo failed: rc = -114
2018-10-14 04:14:53 [ 147.809966] Lustre: dagg-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
2018-10-14 04:14:53 [ 147.906112] Lustre: dagg-MDT0002: Will be in recovery for at least 2:30, or until 130 clients reconnect
2018-10-14 04:14:54 [ 148.619484] Lustre: dagg-MDT0002: Connection restored to 4043286f-fc2d-0b38-4c8b-0a509fa70b9d (at 192.168.44.150@o2ib44)
2018-10-14 04:14:54 [ 148.631061] Lustre: Skipped 13 previous similar messages
2018-10-14 04:14:58 [ 152.622612] Lustre: dagg-MDT0001: Connection restored to 4313929d-144f-a365-76bf-0741877d11de (at 192.168.44.182@o2ib44)
2018-10-14 04:14:58 [ 152.634166] Lustre: Skipped 167 previous similar messages
2018-10-14 04:14:58 [ 153.193285] Lustre: dagg-MDT0000: Recovery already passed deadline 2:21. It is due to DNE recovery failed/stuck on the 2 MDT(s): 0001 0002. Please wait until all MDTs recovered or abort the recovery by force.
2018-10-14 04:14:59 [ 153.938830] Lustre: dagg-MDT0000: Recovery already passed deadline 2:21. It is due to DNE recovery failed/stuck on the 2 MDT(s): 0001 0002. Please wait until all MDTs recovered or abort the recovery by force.
2018-10-14 04:15:09 [ 163.702354] Lustre: dagg-MDT0002: Connection restored to 66a65625-28da-d50f-c83b-b2f538a83827 (at 192.168.44.195@o2ib44)
2018-10-14 04:15:09 [ 163.713974] Lustre: Skipped 135 previous similar messages
2018-10-14 04:15:25 [ 179.961770] Lustre: dagg-MDT0000: Connection restored to d2eebaed-e98f-5c92-c1f5-5b272ce31e82 (at 10.8.49.221@tcp201)
2018-10-14 04:15:25 [ 179.973196] Lustre: Skipped 38 previous similar messages
2018-10-14 04:15:31 [ 185.747958] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff8b825e9f2d00 x1613591234569248/t0(0) o601->dagg-MDT0000-lwp-OST0001_UUID@192.168.44.31@o2ib44:652/0 lens 336/0 e 0 to 0 dl 1539450937 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:31 [ 185.776641] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 2 previous similar messages
2018-10-14 04:15:36 [ 190.915287] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff8b825ec6e300 x1613592493156656/t0(0) o601->dagg-MDT0000-lwp-OST000b_UUID@192.168.44.36@o2ib44:657/0 lens 336/0 e 0 to 0 dl 1539450942 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:36 [ 190.944086] LustreError: 42034:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 4 previous similar messages
2018-10-14 04:15:38 [ 192.736539] LustreError: 42042:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff8b2334027200 x1613591230816544/t0(0) o601->dagg-MDT0000-lwp-OST0004_UUID@192.168.44.33@o2ib44:659/0 lens 336/0 e 0 to 0 dl 1539450944 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:38 [ 192.765302] LustreError: 42042:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 3 previous similar messages
2018-10-14 04:15:40 [ 194.809319] LustreError: 42037:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff8b825e9cc200 x1613592493190864/t0(0) o601->dagg-MDT0000-lwp-OST0007_UUID@192.168.44.34@o2ib44:661/0 lens 336/0 e 0 to 0 dl 1539450946 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2018-10-14 04:15:40 [ 194.838140] LustreError: 42037:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 8 previous similar messages
2018-10-14 04:15:43 [ 197.681575] Lustre: dagg-MDT0000: Recovery over after 0:53, of 130 clients 130 recovered and 0 were evicted.
2018-10-14 04:15:43 [ 197.691192] Lustre: 20075:0:(llog.c:572:llog_process_thread()) dagg-MDT0000-osp-MDT0001: invalid length 0 in llog [0x1:0x19a42:0x2]record for index 0/4
2018-10-14 04:15:43 [ 197.691198] LustreError: 20075:0:(lod_dev.c:420:lod_sub_recovery_thread()) dagg-MDT0000-osp-MDT0001 getting update log failed: rc = -22
2018-10-14 04:15:43 [ 197.748062] LustreError: 43000:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
2018-10-14 04:15:43 [ 197.762055] LustreError: 43000:0:(lu_object.h:862:lu_object_attr()) LBUG
2018-10-14 04:15:43 [ 197.769608] Pid: 43000, comm: orph_cleanup_da 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
2018-10-14 04:15:43 [ 197.780185] Call Trace:
2018-10-14 04:15:43 [ 197.783478] [<ffffffffc06f07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2018-10-14 04:15:43 [ 197.790871] [<ffffffffc06f087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2018-10-14 04:15:43 [ 197.797889] [<ffffffffc0b722dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:15:43 [ 197.805872] [<ffffffffc0b72761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:15:43 [ 197.813494] [<ffffffffc0b73617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:15:43 [ 197.821009] [<ffffffff8d2bb621>] kthread+0xd1/0xe0
2018-10-14 04:15:43 [ 197.826702] [<ffffffff8d9205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:15:43 [ 197.833942] [<ffffffffffffffff>] 0xffffffffffffffff
2018-10-14 04:15:43 [ 197.839723] Kernel panic - not syncing: LBUG
2018-10-14 04:15:43 [ 197.844759] CPU: 9 PID: 43000 Comm: orph_cleanup_da Tainted: P OE ------------ 3.10.0-862.9.1.el7.x86_64 #1
2018-10-14 04:15:43 [ 197.856369] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.4.8 05/21/2018
2018-10-14 04:15:43 [ 197.864605] Call Trace:
2018-10-14 04:15:43 [ 197.867810] [<ffffffff8d90e84e>] dump_stack+0x19/0x1b
2018-10-14 04:15:43 [ 197.873696] [<ffffffff8d908b50>] panic+0xe8/0x21f
2018-10-14 04:15:43 [ 197.879223] [<ffffffffc06f08cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
2018-10-14 04:15:43 [ 197.886124] [<ffffffffc0b722dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:15:43 [ 197.893965] [<ffffffffc11c4439>] ? lod_trans_create+0x39/0x50 [lod]
2018-10-14 04:15:43 [ 197.901017] [<ffffffffc0b72761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:15:43 [ 197.908491] [<ffffffffc0b73617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:15:43 [ 197.915874] [<ffffffffc0b73060>] ? orph_key_test_and_del+0xd30/0xd30 [mdd]
2018-10-14 04:15:43 [ 197.923506] [<ffffffff8d2bb621>] kthread+0xd1/0xe0
2018-10-14 04:15:43 [ 197.929044] [<ffffffff8d2bb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:15:43 [ 197.935789] [<ffffffff8d9205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:15:43 [ 197.942869] [<ffffffff8d2bb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:15:43 [ 197.949604] Kernel Offset: 0xc200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
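For anyone decoding the trace: the assertion at lu_object.h:862 requires the object's LOHA_EXISTS flag to be set before lu_object_attr() will return its attributes, so the orph_cleanup thread is apparently looking up an orphan entry whose backing object was never marked as existing. Below is a minimal standalone C model of just that check; the structures, the flag value and main() are simplified stand-ins of my own, with only the asserted condition taken from the console log above, so don't read it as the real lu_object.h code.

/* Standalone model of the failed check, for illustration only; the real
 * structures and LOHA_EXISTS value live in lustre/include/lu_object.h
 * and are more involved than this sketch. */
#include <assert.h>
#include <stdio.h>

#define LOHA_EXISTS 0x01u        /* "object exists on disk" flag (illustrative value) */

struct lu_object_header { unsigned int loh_attr; };
struct lu_object        { struct lu_object_header *lo_header; };

/* Mirrors the asserted condition from the console log:
 * ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 */
static unsigned int lu_object_attr(const struct lu_object *o)
{
        assert((o->lo_header->loh_attr & LOHA_EXISTS) != 0);
        return o->lo_header->loh_attr;
}

int main(void)
{
        struct lu_object_header hdr = { .loh_attr = 0 };  /* LOHA_EXISTS never set */
        struct lu_object obj = { .lo_header = &hdr };

        /* Asking for attributes of an object that was never marked as
         * existing trips the assert, i.e. the same condition the MDS
         * LBUGs on during orphan cleanup above. */
        printf("attr = %u\n", lu_object_attr(&obj));
        return 0;
}

Compiled and run, the model aborts on the assert, which is the condition the kernel reports as the LBUG here.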
Our standard server Lustre is 2.10.4 plus these patches:
lu10683-lu11093-checksum-overquota-gerrit32788-1fb85e7e.patch
lu10988-lfsck2-gerrit32522-21d33c11.patch
lu11074-mdc-xattr-gerrit32739-dea1cde9.patch
lu11107-xattr-gerrit32753-c96a8f08.patch
lu11111-lfsck-gerrit32796-693fe452.patch
lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
lu11062-stacktrace-gerrit32972-7232c445.patch
lu11301-stuckMdtThreads2-c43baa1c.patch
lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff
But it also looks like the same LBUG when I boot into a testing server image with 2.10.5 plus these patches:
lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
lu11111-lfsck-gerrit32796-693fe452.ported.patch   # fixed to apply cleanly
lu11201-lfsckDoesntFinish-gerrit33078-4829fb05.patch
lu11301-stuckMdtThreads2-c43baa1c.patch
lu11418-hungMdtZfs-gerrit33248-eaa3c60d.diff      # doesn't apply cleanly; needs a lustre/include/obd_support.h edit
lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff
2018-10-14 04:45:14 [ 256.907785] LustreError: 46668:0:(lu_object.h:862:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
2018-10-14 04:45:14 [ 256.921784] LustreError: 46668:0:(lu_object.h:862:lu_object_attr()) LBUG
2018-10-14 04:45:14 [ 256.929337] Pid: 46668, comm: orph_cleanup_da 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
2018-10-14 04:45:14 [ 256.939901] Call Trace:
2018-10-14 04:45:14 [ 256.943185] [<ffffffffc08ae7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2018-10-14 04:45:14 [ 256.950584] [<ffffffffc08ae87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2018-10-14 04:45:14 [ 256.957618] [<ffffffffc1abf2dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:45:14 [ 256.965586] [<ffffffffc1abf761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:45:14 [ 256.973192] [<ffffffffc1ac0617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:45:14 [ 256.980697] [<ffffffffb4cbb621>] kthread+0xd1/0xe0
2018-10-14 04:45:14 [ 256.986380] [<ffffffffb53205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:45:14 [ 256.993618] [<ffffffffffffffff>] 0xffffffffffffffff
2018-10-14 04:45:14 [ 256.999378] Kernel panic - not syncing: LBUG
2018-10-14 04:45:14 [ 257.004389] CPU: 5 PID: 46668 Comm: orph_cleanup_da Tainted: P OE ------------ 3.10.0-862.9.1.el7.x86_64 #1
2018-10-14 04:45:14 [ 257.015980] Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 1.4.8 05/21/2018
2018-10-14 04:45:14 [ 257.024200] Call Trace:
2018-10-14 04:45:14 [ 257.027392] [<ffffffffb530e84e>] dump_stack+0x19/0x1b
2018-10-14 04:45:14 [ 257.033268] [<ffffffffb5308b50>] panic+0xe8/0x21f
2018-10-14 04:45:14 [ 257.038783] [<ffffffffc08ae8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
2018-10-14 04:45:14 [ 257.045676] [<ffffffffc1abf2dd>] orph_declare_index_delete+0x40d/0x460 [mdd]
2018-10-14 04:45:14 [ 257.053528] [<ffffffffc1a24439>] ? lod_trans_create+0x39/0x50 [lod]
2018-10-14 04:45:14 [ 257.060588] [<ffffffffc1abf761>] orph_key_test_and_del+0x431/0xd30 [mdd]
2018-10-14 04:45:14 [ 257.068071] [<ffffffffc1ac0617>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
2018-10-14 04:45:14 [ 257.075460] [<ffffffffc1ac0060>] ? orph_key_test_and_del+0xd30/0xd30 [mdd]
2018-10-14 04:45:14 [ 257.083100] [<ffffffffb4cbb621>] kthread+0xd1/0xe0
2018-10-14 04:45:14 [ 257.088650] [<ffffffffb4cbb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:45:14 [ 257.095401] [<ffffffffb53205dd>] ret_from_fork_nospec_begin+0x7/0x21
2018-10-14 04:45:14 [ 257.102490] [<ffffffffb4cbb550>] ? insert_kthread_work+0x40/0x40
2018-10-14 04:45:14 [ 257.109241] Kernel Offset: 0x33c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
cheers,
robin