  Lustre / LU-12360

Can't restart filesystem (2.12) even with abort_recov

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: Lustre 2.12.0
    • Environment: CentOS 7.6

    Description

      Using 2.12.0 + patches, after running for a few days without major issues, we just had a series of events leading to the filesystem going down. It started with an OSS crash on fir-io4-s2:

      [411912.582805] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3305472 GRANT, real grant 0
      [411912.597102] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 258 previous similar messages
      [412523.291449] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3923968 GRANT, real grant 0
      [412523.305649] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 179 previous similar messages
      [413133.047652] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 4132864 GRANT, real grant 0
      [413133.061828] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 209 previous similar messages
      [413733.569845] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3780608 GRANT, real grant 0
      [413733.583920] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 206 previous similar messages
      [414347.351179] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3911680 GRANT, real grant 0
      [414347.365413] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 197 previous similar messages
      [414687.733602] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) ) failed: 
      [414687.748464] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) LBUG
      [414687.755956] Pid: 108071, comm: ldlm_cn01_010 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [414687.766217] Call Trace:
      [414687.768773]  [<ffffffffc0d0f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [414687.775434]  [<ffffffffc0d0f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [414687.781736]  [<ffffffffc15d636c>] osd_object_release+0x7c/0x80 [osd_ldiskfs]
      [414687.788910]  [<ffffffffc0e936a0>] lu_object_put+0x190/0x3d0 [obdclass]
      [414687.795582]  [<ffffffffc1750b56>] ofd_lvbo_update+0x5e6/0xdf0 [ofd]
      [414687.801980]  [<ffffffffc10a2c89>] ldlm_request_cancel+0x3c9/0x740 [ptlrpc]
      [414687.809008]  [<ffffffffc10a77da>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
      [414687.815859]  [<ffffffffc10a7ac8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [414687.822885]  [<ffffffffc10d8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [414687.830693]  [<ffffffffc10dc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [414687.837115]  [<ffffffffb74c1c31>] kthread+0xd1/0xe0
      [414687.842115]  [<ffffffffb7b74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [414687.848677]  [<ffffffffffffffff>] 0xffffffffffffffff
      [414687.853780] Kernel panic - not syncing: LBUG

      Shortly after restarting the OSS, an MDS, fir-md1-s2 (hosting MDT0001 and MDT0003), crashed:

      [2687144.065647] Lustre: fir-MDT0003: Client ddea348b-e5a4-5330-325a-755d459e8dda (at 10.9.107.57@o2ib4) reconnecting
      [2687144.076003] Lustre: Skipped 1 previous similar message
      [2689300.253654] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) ASSERTION( o->opo_reserved == 0 ) failed: 
      [2689300.264453] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) LBUG
      [2689300.271942] Pid: 124673, comm: mdt00_044 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [2689300.281942] Call Trace:
      [2689300.284583]  [<ffffffffc0a397cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [2689300.291332]  [<ffffffffc0a3987c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [2689300.297719]  [<ffffffffc15aecc5>] osp_declare_create+0x5a5/0x5b0 [osp]
      [2689300.304462]  [<ffffffffc14f639f>] lod_sub_declare_create+0xdf/0x210 [lod]
      [2689300.311466]  [<ffffffffc14f5904>] lod_qos_prep_create+0x15d4/0x1890 [lod]
      [2689300.318467]  [<ffffffffc14d6bba>] lod_declare_instantiate_components+0x9a/0x1d0 [lod]
      [2689300.326511]  [<ffffffffc14e94d5>] lod_declare_layout_change+0xb65/0x10f0 [lod]
      [2689300.333948]  [<ffffffffc155b102>] mdd_declare_layout_change+0x62/0x120 [mdd]
      [2689300.341211]  [<ffffffffc1563e52>] mdd_layout_change+0x882/0x1000 [mdd]
      [2689300.347964]  [<ffffffffc13ca327>] mdt_layout_change+0x337/0x430 [mdt]
      [2689300.354636]  [<ffffffffc13d243e>] mdt_intent_layout+0x7ee/0xcc0 [mdt]
      [2689300.361292]  [<ffffffffc13cfa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      [2689300.367949]  [<ffffffffc0d6dd46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [2689300.374893]  [<ffffffffc0d96707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [2689300.382189]  [<ffffffffc0e1d6e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [2689300.388531]  [<ffffffffc0e2473a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [2689300.395657]  [<ffffffffc0dc8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [2689300.403563]  [<ffffffffc0dcc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [2689300.410077]  [<ffffffff9e6c1c31>] kthread+0xd1/0xe0
      [2689300.415164]  [<ffffffff9ed74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [2689300.421813]  [<ffffffffffffffff>] 0xffffffffffffffff
      [2689300.427016] Kernel panic - not syncing: LBUG
      

      And now I am not able to restart the filesystem, even with -o abort_recov everywhere. The OSTs restart, but then MDT0000 won't: it keeps trying to perform a recovery.
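
      For reference, this is roughly how the remount was being attempted; a minimal sketch, with the device path assumed for illustration:

        # skip client recovery when bringing the target back up
        # (device path below is hypothetical; ours are multipath dm devices)
        mount -t lustre -o abort_recov /dev/mapper/mdt0 /mnt/fir/mdt0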

      May 29 15:59:43 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
      May 29 15:59:49 fir-md1-s1 kernel: Lustre: DEBUG MARKER: Wed May 29 15:59:49 2019
      May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): recovery complete
      May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      May 29 16:00:25 fir-md1-s1 kernel: Lustre: MGS: Connection restored to MGC10.0.10.51@o2ib7_0 (at 0@lo)
      May 29 16:00:26 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 1495a992-a81e-7de9-50a8-88376d880671 (at 10.8.12.32@o2ib6)
      May 29 16:00:26 fir-md1-s1 kernel: Lustre: Skipped 39 previous similar messages
      May 29 16:00:27 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 0e359090-33c6-a6a0-317c-1b138ecd9b87 (at 10.9.105.16@o2ib4)
      May 29 16:00:27 fir-md1-s1 kernel: Lustre: Skipped 72 previous similar messages
      May 29 16:00:29 fir-md1-s1 kernel: Lustre: MGS: Connection restored to d6e5f2ed-c606-14e7-85a0-13787ff0467b (at 10.9.105.69@o2ib4)
      May 29 16:00:29 fir-md1-s1 kernel: Lustre: Skipped 35 previous similar messages
      May 29 16:00:33 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 2996d077-09af-23fe-fe18-ce9b6359bb8e (at 10.8.21.35@o2ib6)
      May 29 16:00:33 fir-md1-s1 kernel: Lustre: Skipped 263 previous similar messages
      May 29 16:00:43 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 545852b2-0fa4-d28f-21da-1dc3deae2f8b (at 10.9.115.1@o2ib4)
      May 29 16:00:43 fir-md1-s1 kernel: Lustre: Skipped 177 previous similar messages
      May 29 16:01:10 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 98b9580a-8059-3273-b7c7-9f03a2d4d87f (at 10.9.112.7@o2ib4)
      May 29 16:01:10 fir-md1-s1 kernel: Lustre: Skipped 6 previous similar messages
      May 29 16:01:41 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
      May 29 16:01:42 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 4030ea6e-e97d-b5db-fa13-470ec3d51337 (at 10.8.1.8@o2ib6)
      May 29 16:01:42 fir-md1-s1 kernel: Lustre: Skipped 40 previous similar messages
      May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
      May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): recovery complete
      May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
      May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.8.31.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:24 fir-md1-s1 kernel: Lustre: fir-MDT0000: Not available for connect from 10.8.18.10@o2ib6 (not set up)
      May 29 16:02:24 fir-md1-s1 kernel: Lustre: Skipped 3 previous similar messages
      May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.17.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:24 fir-md1-s1 kernel: LustreError: Skipped 4 previous similar messages
      May 29 16:02:26 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.102.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:26 fir-md1-s1 kernel: LustreError: Skipped 1 previous similar message
      May 29 16:02:28 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.0.10.106@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:28 fir-md1-s1 kernel: LustreError: Skipped 13 previous similar messages
      May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
      May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 2 seconds
      May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
      May 29 16:02:33 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.0.10.3@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:33 fir-md1-s1 kernel: LustreError: Skipped 12 previous similar messages
      May 29 16:02:38 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 1 seconds
      May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDD0000: changelog on
      May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(mdt_handler.c:6447:mdt_iocontrol()) fir-MDT0000: Aborting recovery for device
      May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0000: Aborting recovery
      May 29 16:02:38 fir-md1-s1 kernel: Lustre: 20890:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      May 29 16:02:41 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.9.108.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:41 fir-md1-s1 kernel: LustreError: Skipped 29 previous similar messages
      May 29 16:02:44 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (86 recovered, 49 in progress, and 0 evicted) already passed deadline 8:07
      May 29 16:02:46 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 98410362-8850-0375-cdb0-3dd7ebb1ce58 (at 10.9.107.28@o2ib4)
      May 29 16:02:46 fir-md1-s1 kernel: Lustre: Skipped 1171 previous similar messages
      May 29 16:02:57 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.1.6@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:57 fir-md1-s1 kernel: LustreError: Skipped 803 previous similar messages
      May 29 16:03:02 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:03:12 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (802 recovered, 565 in progress, and 0 evicted) already passed deadline 8:34
      May 29 16:03:27 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:03:30 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.20.22@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:03:30 fir-md1-s1 kernel: LustreError: Skipped 2086 previous similar messages
      May 29 16:03:37 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 8:59
      May 29 16:04:02 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:24
      May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
      May 29 16:04:27 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:50
      May 29 16:04:34 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.8.19@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:04:34 fir-md1-s1 kernel: LustreError: Skipped 3903 previous similar messages
      May 29 16:04:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
      May 29 16:04:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      May 29 16:04:38 fir-md1-s1 kernel: mount.lustre    D ffff946ebeba9040     0 20624  20623 0x00000082
      May 29 16:04:38 fir-md1-s1 kernel: Call Trace:
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c167c49>] schedule+0x29/0x70
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27
      May 29 16:04:52 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:15
      May 29 16:05:17 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:40
      May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 2 previous similar messages
      May 29 16:06:06 fir-md1-s1 kernel: LNet: Service thread pid 20981 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      May 29 16:06:06 fir-md1-s1 kernel: Pid: 20981, comm: mdt00_026 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      May 29 16:06:06 fir-md1-s1 kernel: Call Trace:
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1573bf8>] osp_precreate_reserve+0x2e8/0x800 [osp]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc15688b9>] osp_declare_create+0x199/0x5b0 [osp]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ee39f>] lod_sub_declare_create+0xdf/0x210 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e656e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e950e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ed62d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18eddd5>] lod_prepare_create+0x215/0x2e0 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ddcee>] lod_declare_striped_create+0x1ee/0x980 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e5058>] lod_declare_xattr_set+0x138/0x1230 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc193e065>] mdo_declare_xattr_set+0x65/0x150 [mdd]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1947af0>] mdd_declare_create+0x550/0xcb0 [mdd]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc194b6e7>] mdd_create+0x827/0x1440 [mdd]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc181c018>] mdt_reint_open+0x1aa8/0x28b0 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc180eed3>] mdt_reint_rec+0x83/0x210 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17ed143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17f9982>] mdt_intent_open+0x82/0x350 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17f7a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0e79d46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ea2707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0f296e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0f3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ed4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ed863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffff9bac1c31>] kthread+0xd1/0xe0
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffff9c174c24>] ret_from_fork_nospec_begin+0xe/0x21
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      May 29 16:06:06 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1559171166.20981
      May 29 16:06:07 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 11:30
      May 29 16:06:07 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
      May 29 16:06:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
      May 29 16:06:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      May 29 16:06:38 fir-md1-s1 kernel: mount.lustre    D ffff946ebeba9040     0 20624  20623 0x00000082
      May 29 16:06:38 fir-md1-s1 kernel: Call Trace:
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c167c49>] schedule+0x29/0x70
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27

      The mount commands are stuck for all 4 MDTs.

      While I'm writing this, we're trying to restart everything with all LNet routers stopped, to avoid any recovery involving our compute nodes on Sherlock. My hope is that after the imperative recovery window times out, the MDTs will actually mount. Any ideas are welcome. Thanks!
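
      For anyone watching along, the recovery state can be polled on the MDS; a minimal sketch (the target name pattern matches our filesystem):

        # same data as /proc/fs/lustre/mdt/<target>/recovery_status
        lctl get_param mdt.fir-MDT*.recovery_status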

    Activity

            [LU-12360] Can't restart filesystem (2.12) even with abort_recov

            pfarrell Patrick Farrell (Inactive) added a comment -

            Stephane,

            Since this is a duplicate of LU-11761, I'm going to resolve it as such, and you can track progress there.

            sthiell Stephane Thiell added a comment -

            Thanks Lai!
            laisiyao Lai Siyao added a comment -

            The "lctl abort_recov" hang issue is tracked in LU-11761, and there is a patch in review.
            The MDS crash on "osp_declare_create()) ASSERTION( o->opo_reserved == 0 )" is tracked in LU-11967.
            I'm looking into the OSS crash "osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) )" now.

            sthiell Stephane Thiell added a comment -

            Ah yes, whoops, sorry. Thanks for checking with Patrick. Let me know how I can help next time it happens, for example by taking a crash dump when the recovery is stuck at 0s, or something else.
            Perhaps it's DNE related. It has never happened on MDT0000 as far as I can remember.
            Thanks again.
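
            (For reference, forcing a crash dump at that moment could be done via sysrq, assuming kdump is configured on the MDS; a minimal sketch:)

              # enable sysrq, then trigger a panic so kdump captures a vmcore
              echo 1 > /proc/sys/kernel/sysrq
              echo c > /proc/sysrq-trigger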
            pjones Peter Jones added a comment -

            > Peter, could we increase the severity?

            Well, not as such, it's already at the highest severity level, but we can increase the priority at which this is being worked. I'll try to connect with Patrick to discuss next steps.

            sthiell Stephane Thiell added a comment -

            The only way to put it back is to remove the routers; recovery then does complete, but all clients get evicted:

            [root@fir-md1-s1 ~]# lctl route_list
            [root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
            status: RECOVERING
            recovery_start: 1560885207
            time_remaining: 56
            connected_clients: 1399/1400
            req_replay_clients: 1395
            lock_repay_clients: 1395
            completed_clients: 4
            evicted_clients: 0
            replayed_requests: 0
            queued_requests: 0
            next_transno: 344373951730
            
            
            [root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
            status: RECOVERING
            recovery_start: 1560885207
            time_remaining: 2
            connected_clients: 1399/1400
            req_replay_clients: 1395
            lock_repay_clients: 1395
            completed_clients: 4
            evicted_clients: 0
            replayed_requests: 0
            queued_requests: 0
            next_transno: 344373951730
            
            [root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
            status: RECOVERING
            recovery_start: 1560885207
            time_remaining: 0
            connected_clients: 1399/1400
            req_replay_clients: 1395
            lock_repay_clients: 1395
            completed_clients: 4
            evicted_clients: 0
            replayed_requests: 0
            queued_requests: 0
            next_transno: 344373951730
            
            [root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
            status: RECOVERING
            recovery_start: 1560885207
            time_remaining: 59
            connected_clients: 1399/1400
            req_replay_clients: 1395
            lock_repay_clients: 1395
            completed_clients: 4
            evicted_clients: 1
            replayed_requests: 0
            queued_requests: 0
            next_transno: 344373951730
            
            
            [root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
            status: RECOVERING
            recovery_start: 1560885207
            time_remaining: 2
            connected_clients: 1399/1400
            req_replay_clients: 1395
            lock_repay_clients: 1395
            completed_clients: 4
            evicted_clients: 1
            replayed_requests: 0
            queued_requests: 0
            next_transno: 344373951730
            [root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
            status: COMPLETE
            recovery_start: 1560885207
            recovery_duration: 361
            completed_clients: 4/1400
            replayed_requests: 0
            last_transno: 344373951729
            VBR: ENABLED
            IR: DISABLED
            
            

            sthiell Stephane Thiell added a comment -

            This time it's fir-MDT0002:

            [root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0000/recovery_status
            status: COMPLETE
            recovery_start: 1560883302
            recovery_duration: 115
            completed_clients: 1400/1400
            replayed_requests: 0
            last_transno: 403936479430
            VBR: DISABLED
            IR: DISABLED
            [root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
            status: RECOVERING
            recovery_start: 1560884046
            time_remaining: 0
            connected_clients: 1400/1400
            req_replay_clients: 53
            lock_repay_clients: 219
            completed_clients: 1181
            evicted_clients: 0
            replayed_requests: 1
            queued_requests: 52
            next_transno: 344373951742
            

            sthiell Stephane Thiell added a comment -

            We're hitting this endless MDT recovery issue again. Peter, could we increase the severity?
            sthiell Stephane Thiell added a comment - edited

            Following an MDS LBUG on Fir this weekend (LU-11967, ASSERTION( o->opo_reserved == 0 )), it took us a few hours to put the filesystem back online because the recovery for fir-MDT0001 wouldn't finish. Nor would it finish when using -o abort_recov or lctl ... abort_recovery. This is very concerning. Is it a current limitation of DNE? Here are the logs showing the first restart after the crash on fir-md1-s2, where MDT0001 and MDT0003 are hosted. You can see that MDT0003 does finish its recovery but MDT0001 does not:

            [Sat Jun 15 09:15:33 2019][28357.705433] Lustre: fir-MDT0003: Denying connection for new client fc9cbc0d-41e6-18a0-ddfe-91c390cc7652(at 10.9.108.7@o2ib4), waiting for 1397 known clients (1308 recovered, 68 in progress, and 0 evicted) already passed deadline 4:27^M
            [Sat Jun 15 09:15:33 2019][28357.726111] Lustre: Skipped 143 previous similar messages^M
            [Sat Jun 15 09:16:06 2019][28390.961578] Lustre: fir-MDT0003: recovery is timed out, evict stale exports^M
            [Sat Jun 15 09:16:06 2019][28390.968822] Lustre: fir-MDT0003: disconnecting 21 stale clients^M
            [Sat Jun 15 09:16:07 2019][28392.314659] Lustre: fir-MDT0003-osp-MDT0001: Connection restored to 10.0.10.52@o2ib7 (at 0@lo)^M
            [Sat Jun 15 09:16:07 2019][28392.317401] Lustre: fir-MDT0003: Recovery over after 5:01, of 1397 clients 1376 recovered and 21 were evicted.^M
            [Sat Jun 15 09:16:07 2019][28392.333296] Lustre: Skipped 4 previous similar messages^M
            [Sat Jun 15 09:16:08 2019][28392.559315] LustreError: 11-0: fir-MDT0000-lwp-MDT0003: operation quota_acquire to node 10.0.10.51@o2ib7 failed: rc = -11^M
            [Sat Jun 15 09:16:08 2019][28392.570287] LustreError: Skipped 184 previous similar messages^M
            [Sat Jun 15 09:16:15 2019][28400.177816] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
            [Sat Jun 15 09:16:15 2019][28400.185066] Lustre: fir-MDT0001: disconnecting 21 stale clients^M
            [Sat Jun 15 09:16:22 2019][28407.240015] Lustre: 98362:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply^M
            [Sat Jun 15 09:16:22 2019][28407.240015]   req@ffff9ee5238fef00 x1631565401531536/t0(0) o101->f226e60d-c642-72ed-451c-28866e7cb4cc@10.9.103.26@o2ib4:27/0 lens 584/3264 e 1 to 0 dl 1560615387 ref 2 fl Interpret:/0/0 rc 0/0^M
            [Sat Jun 15 09:17:08 2019][28453.018851] LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.8.8.37@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.^M
            [Sat Jun 15 09:19:38 2019][28602.454583] LustreError: 99405:0:(mdt_handler.c:6447:mdt_iocontrol()) fir-MDT0001: Aborting recovery for device^M
            [Sat Jun 15 09:19:38 2019][28602.464690] LustreError: 99405:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0001: Aborting recovery^M
            [Sat Jun 15 09:19:51 2019][28615.499651] Lustre: fir-MDT0001: Denying connection for new client 304180e1-aa68-a4a4-ed4c-9536f53351a5(at 10.8.1.21@o2ib6), waiting for 1397 known clients (1324 recovered, 52 in progress, and 21 evicted) already passed deadline 7:35^M
            [Sat Jun 15 09:19:51 2019][28615.520312] Lustre: Skipped 168 previous similar messages^M
            [Sat Jun 15 09:19:54 2019][28618.365990] Lustre: fir-MDT0003: haven't heard from client bea0167a-f4ba-4bef-afde-9099abf4f13e (at 10.8.8.37@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9ee52f691c00, cur 1560615594 expire 1560615444 last 1560615367^M
            

            shine (a tool that parses /proc or /sys) indicated that the recovery was running, but at the end "0s" of recovery time remained and it stayed like that.

            Second attempt:

            [Sat Jun 15 10:04:00 2019][ 1090.090394] LustreError: 19769:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1560618150, 90s ago), entering recovery for MGS@10.0.10.51@o2ib7 ns: MGC10.0.10.51@o2ib7 lock: ffffa0c3998306c0/0x90311c0f977a450c lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x16a9799edea112f1 expref: -99 pid: 19769 timeout: 0 lvb_type: 0^M
            [Sat Jun 15 10:04:00 2019][ 1090.127635] LustreError: 20798:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffffa0c3ace14b40) refcount nonzero (1) after lock cleanup; forcing cleanup.^M
            [Sat Jun 15 10:04:08 2019][ 1097.501491] Lustre: 19974:0:(nodemap_member.c:122:nm_member_add()) export ffffa0c3a8a46c00 bc38572c-4dfd-e060-15f5-2bafa5ab8152 already hashed, failed to add to nodemap default already member of default^M
            [Sat Jun 15 10:04:18 2019][ 1107.244494] Lustre: fir-MDT0001: Recovery already passed deadline 4:55, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.^M
            [Sat Jun 15 10:04:18 2019][ 1107.260822] Lustre: Skipped 1691 previous similar messages^M
            [Sat Jun 15 10:04:22 2019][ 1111.536233] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
            [Sat Jun 15 10:04:22 2019][ 1111.543495] Lustre: fir-MDT0001: disconnecting 4 stale clients^M
            [Sat Jun 15 10:04:48 2019][ 1137.522495] LNetError: 7364:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)^M
            [Sat Jun 15 10:04:48 2019][ 1137.547954] Lustre: 20053:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (100:14s); client may timeout.  req@ffffa0c398b97500 x1634475767709184/t0(0) o38-><?>@<?>:0/0 lens 520/192 e 0 to 0 dl 1560618274 ref 1 fl Complete:/0/0 rc -19/-19^M
            [Sat Jun 15 10:05:16 2019][ 1165.524022] LNetError: 19577:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)^M
            [Sat Jun 15 10:05:22 2019][ 1171.697797] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
            [Sat Jun 15 10:05:22 2019][ 1171.705051] Lustre: fir-MDT0001: disconnecting 1 stale clients^M
            [Sat Jun 15 10:05:27 2019][ 1176.524359] LNetError: 19571:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)^M
            [Sat Jun 15 10:05:36 2019][ 1185.892184] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail^M
            [Sat Jun 15 10:05:36 2019][ 1185.906011] LustreError: 19769:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1560618246, 90s ago), entering recovery for MGS@10.0.10.51@o2ib7 ns: MGC10.0.10.51@o2ib7 lock: ffffa0c399831d40/0x90311c0f977a46be lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x16a9799edea13a12 expref: -99 pid: 19769 timeout: 0 lvb_type: 0^M
            [Sat Jun 15 10:05:36 2019][ 1185.943292] LustreError: 20818:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffffa0c3ace15200) refcount nonzero (1) after lock cleanup; forcing cleanup.^M
            [Sat Jun 15 10:05:53 2019][ 1202.674655] Lustre: fir-MDT0003: recovery is timed out, evict stale exports^M
            [Sat Jun 15 10:05:53 2019][ 1202.681944] Lustre: fir-MDT0003: disconnecting 6 stale clients^M
            [Sat Jun 15 10:06:01 2019][ 1210.995013] LustreError: 11-0: fir-OST000c-osc-MDT0001: operation ost_connect to node 10.0.10.103@o2ib7 failed: rc = -16^M
            [Sat Jun 15 10:06:01 2019][ 1211.005897] LustreError: Skipped 239 previous similar messages^M
            [Sat Jun 15 10:06:16 2019][ 1225.525191] LNet: 19565:0:(o2iblnd_cb.c:1484:kiblnd_reconnect_peer()) Abort reconnection of 10.0.10.211@o2ib7: accepting^M
            [Sat Jun 15 10:06:22 2019][ 1231.859449] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
            [Sat Jun 15 10:06:22 2019][ 1231.866717] Lustre: fir-MDT0001: disconnecting 3 stale clients^M
            [Sat Jun 15 10:06:37 2019][ 1247.069499] LustreError: 20372:0:(ldlm_lib.c:2840:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffffa0b37b72ad00 x1636333431914752/t0(323735050166) o101->871fc187-81d5-1f51-125e-0c60fb0a4b64@10.9.0.61@o2ib4:7/0 lens 1168/0 e 0 to 0 dl 1560618427 ref 2 fl Interpret:/6/ffffffff rc 0/-1^M
            [Sat Jun 15 10:06:45 2019][ 1254.237810] LustreError: 20837:0:(ldlm_lib.c:2840:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffffa0a3b37b7c50 x1635228444455280/t0(323735075475) o4->5856c966-b502-541a-bf79-fec68258d993@10.9.101.29@o2ib4:21/0 lens 488/0 e 0 to 0 dl 1560618411 ref 2 fl Interpret:/6/ffffffff rc 0/-1^M
            [Sat Jun 15 10:06:45 2019][ 1254.264895] LustreError: 20837:0:(ldlm_lib.c:2840:target_queue_recovery_request()) Skipped 1 previous similar message^M
            [Sat Jun 15 10:06:52 2019][ 1262.030167] LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.9.104.62@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.^M
            [Sat Jun 15 10:06:52 2019][ 1262.047534] LustreError: Skipped 5389 previous similar messages^M
            [Sat Jun 15 10:06:53 2019][ 1262.836300] Lustre: fir-MDT0003: recovery is timed out, evict stale exports^M
            [Sat Jun 15 10:06:53 2019][ 1262.843565] Lustre: fir-MDT0003: disconnecting 2 stale clients^M
            [Sat Jun 15 10:06:57 2019][ 1266.977269] LustreError: 20837:0:(ldlm_lib.c:2840:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffffa0a3b37b7850 x1634768088300608/t0(323735062221) o4->1ceec6a5-3003-d3c3-15d1-22344d794be9@10.9.101.12@o2ib4:3/0 lens 2472/0 e 0 to 0 dl 1560618423 ref 2 fl Interpret:/6/ffffffff rc 0/-1^M
            [Sat Jun 15 10:07:00 2019][ 1269.537317] LustreError: 20234:0:(mdt_io.c:470:mdt_preprw_write()) fir-MDT0001: WRITE IO to missing obj [0x24002b8db:0xd1e5:0x0]: rc = -2^M
            [Sat Jun 15 10:07:06 2019][ 1275.966688] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail^M
            

            Third attempt:

            [Sat Jun 15 10:58:43 2019][ 1499.322692] Lustre: fir-MDT0003: Recovery already passed deadline 3:27, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.^M
            [Sat Jun 15 10:58:43 2019][ 1499.339032] Lustre: Skipped 1608 previous similar messages^M
            [Sat Jun 15 10:58:45 2019][ 1500.942918] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
            [Sat Jun 15 10:58:45 2019][ 1500.950104] Lustre: fir-MDT0001: disconnecting 5 stale clients^M
            [Sat Jun 15 10:58:46 2019][ 1502.433377] LustreError: 20301:0:(ldlm_lib.c:2840:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff9f36063f6900 x1635095002346752/t0(323735097417) o36->40afa9c7-3abc-7aee-4e4a-0babffe2282c@10.8.17.22@o2ib6:16/0 lens 488/0 e 0 to 0 dl 1560621556 ref 2 fl Interpret:/6/ffffffff rc 0/-1^M
            [Sat Jun 15 10:59:24 2019][ 1540.003218] LNetError: 19036:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)^M
            [Sat Jun 15 10:59:31 2019][ 1547.229133] LNet: 19024:0:(o2iblnd_cb.c:1484:kiblnd_reconnect_peer()) Abort reconnection of 10.0.10.203@o2ib7: accepting^M
            [Sat Jun 15 10:59:45 2019][ 1561.104563] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
            [Sat Jun 15 10:59:45 2019][ 1561.111750] Lustre: fir-MDT0001: disconnecting 2 stale clients^M
            [Sat Jun 15 10:59:59 2019][ 1574.528934] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail^M
            [Sat Jun 15 10:59:59 2019][ 1574.542768] LustreError: 19913:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1560621509, 90s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9f46262d5100/0x7228bba165f851da lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x130a68c4135d486d expref: -99 pid: 19913 timeout: 0 lvb_type: 0^M
            [Sat Jun 15 11:00:05 2019][ 1580.605107] Lustre: 21053:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply^M
            [Sat Jun 15 11:00:05 2019][ 1580.605107]   req@ffff9f4624f54200 x1636429568502512/t0(0) o400->fir-MDT0003-lwp-OST0014_UUID@10.0.10.103@o2ib7:10/0 lens 224/0 e 0 to 0 dl 1560621610 ref 2 fl New:H/0/ffffffff rc 0/-1^M
            [Sat Jun 15 11:00:06 2019][ 1581.605138] Lustre: 21069:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply^M
            [Sat Jun 15 11:00:06 2019][ 1581.605138]   req@ffff9f3605abf500 x1636429568502160/t0(0) o400->fir-MDT0001-lwp-OST000a_UUID@10.0.10.101@o2ib7:11/0 lens 224/0 e 0 to 0 dl 1560621611 ref 2 fl New:H/0/ffffffff rc 0/-1^M
            [Sat Jun 15 11:00:06 2019][ 1581.633344] Lustre: 21069:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 11 previous similar messages^M
            [Sat Jun 15 11:00:07 2019][ 1582.607185] Lustre: 21069:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply^M
            [Sat Jun 15 11:00:07 2019][ 1582.607185]   req@ffff9f35ef72a700 x1636429568502448/t0(0) o400->fir-MDT0003-lwp-OST001d_UUID@10.0.10.106@o2ib7:12/0 lens 224/0 e 0 to 0 dl 1560621612 ref 2 fl New:H/0/ffffffff rc 0/-1^M
            [Sat Jun 15 11:00:07 2019][ 1582.635386] Lustre: 21069:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 47 previous similar messages^M
            [Sat Jun 15 11:00:10 2019][ 1586.227260] LNet: 19024:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.209@o2ib7: 0 seconds^M
            [Sat Jun 15 11:00:13 2019][ 1589.230231] LNet: 19024:0:(o2iblnd_cb.c:1484:kiblnd_reconnect_peer()) Abort reconnection of 10.0.10.212@o2ib7: accepting^M
            [Sat Jun 15 11:00:16 2019][ 1592.081411] Lustre: fir-MDT0003: recovery is timed out, evict stale exports^M
            [Sat Jun 15 11:00:16 2019][ 1592.088644] Lustre: fir-MDT0003: disconnecting 2 stale clients^M
            [Sat Jun 15 11:00:19 2019][ 1595.329629] Lustre: fir-MDT0003: Export ffff9f4636332800 already connecting from 10.8.19.1@o2ib6^M
            [Sat Jun 15 11:00:22 2019][ 1598.243240] Lustre: fir-MDT0003: Export ffff9f463ec5a000 already connecting from 10.8.10.9@o2ib6^M
            

            The only solution we found to put the filesystem back online was to get rid of the LNet routers, mount the robinhood client (on the same fabric, so no routers involved) to start the recovery, and then wait for the recovery to time out.
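
            For reference, "getting rid of the routers" amounts to disabling LNet routing; a sketch, assuming lnetctl is available (the gateway NID below is one of our router NIDs from the logs):

              # on each LNet router: stop forwarding between fabrics
              lnetctl set routing 0

              # or, on the servers: drop the route to the client fabric
              lnetctl route del --net o2ib4 --gateway 10.0.10.211@o2ib7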

            I also noticed an interesting thing: if you umount the MDT while it is in recovery, it seems to complete its recovery just before the umount, but unfortunately that's not enough to make it work the next time we try to mount it:

            [Sat Jun 15 11:15:24 2019][ 2499.726764] Lustre: Failing over fir-MDT0001^M
            [Sat Jun 15 11:15:24 2019][ 2499.769929] LustreError: 21982:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0001: Aborting recovery^M
            [Sat Jun 15 11:15:24 2019][ 2499.870073] LNetError: 19039:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 223 previous similar messages^M
            [Sat Jun 15 11:15:27 2019][ 2502.826656] LustreError: 23345:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9f46266b2780) refcount nonzero (1) after lock cleanup; forcing cleanup.^M
            [Sat Jun 15 11:15:27 2019][ 2502.846197] LustreError: 23345:0:(ldlm_resource.c:1146:ldlm_resource_complain()) Skipped 3 previous similar messages^M
            [Sat Jun 15 11:15:27 2019][ 2502.856890] Lustre: Failing over fir-MDT0003^M
            [Sat Jun 15 11:15:33 2019][ 2508.542569] LustreError: 20306:0:(lod_qos.c:213:lod_statfs_and_check()) fir-MDT0001-mdtlov: statfs: rc = -108^M
            [Sat Jun 15 11:15:33 2019][ 2508.552687] Lustre: 20306:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (380:920s); client may timeout.  req@ffff9f4625a7dd00 x1636333431914752/t0(323735050166) o101->871fc187-81d5-1f51-125e-0c60fb0a4b64@10.9.0.61@o2ib4:13/0 lens 1168/560 e 14 to 0 dl 1560621613 ref 1 fl Complete:/4/0 rc -19/-19^M
            [Sat Jun 15 11:15:33 2019][ 2508.552911] LNet: Service thread pid 20316 completed after 1289.78s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).^M
            [Sat Jun 15 11:15:33 2019][ 2508.552913] LNet: Skipped 326 previous similar messages^M
            [Sat Jun 15 11:15:33 2019][ 2508.552983] Lustre: 20165:0:(ldlm_lib.c:2048:target_recovery_overseer()) fir-MDT0001 recovery is aborted by hard timeout^M
            [Sat Jun 15 11:15:33 2019][ 2508.552987] Lustre: 20165:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery^M
            [Sat Jun 15 11:15:33 2019][ 2508.553002] Lustre: 20165:0:(ldlm_lib.c:1617:abort_req_replay_queue()) @@@ aborted:  req@ffff9f263f547500 x1631570405023808/t0(323735068936) o101->e70978d5-4079-f244-8878-d79e9ac6d1e1@10.8.28.5@o2ib6:11/0 lens 1152/0 e 26 to 0 dl 1560621911 ref 1 fl Complete:/4/ffffffff rc 0/-1^M
            [Sat Jun 15 11:15:33 2019][ 2508.553321] LustreError: 20165:0:(ldlm_lib.c:1638:abort_lock_replay_queue()) @@@ aborted:  req@ffff9f263071a100 x1631563950806768/t0(0) o101->aebbd095-b50a-7733-a754-e1423fab741f@10.8.18.10@o2ib6:1/0 lens 328/0 e 27 to 0 dl 1560621931 ref 1 fl Complete:/40/ffffffff rc 0/-1^M
            [Sat Jun 15 11:15:33 2019][ 2508.583130] Lustre: 20165:0:(ldlm_lib.c:2554:target_recovery_thread()) too long recovery - read logs^M
            [Sat Jun 15 11:15:33 2019][ 2508.583179] LustreError: dumping log to /tmp/lustre-log.1560622533.20165^M
            [Sat Jun 15 11:15:33 2019][ 2508.583180] Lustre: fir-MDT0001: Recovery over after 21:48, of 1368 clients 0 recovered and 1368 were evicted.^M
            [Sat Jun 15 11:15:33 2019][ 2508.611905] Lustre: 20165:0:(mdt_handler.c:6544:mdt_postrecov()) fir-MDT0001: auto trigger paused LFSCK failed: rc = -6^M
            [Sat Jun 15 11:15:33 2019][ 2508.710964] Lustre: 20306:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 9072 previous similar messages^M
            [Sat Jun 15 11:15:35 2019][ 2511.232379] Lustre: server umount fir-MDT0001 complete^M
            

            Again, you will see in the logs that I tried several times to abort the recovery (using lctl), but this only printed a message and seemed to do nothing.
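
            For the record, the aborts were issued with the standard lctl device command, along these lines (fir-MDT0001 being the stuck target from the logs above):

              lctl --device fir-MDT0001 abort_recovery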

            Hi Patrick,

            The vmcore-dmesg.txt for that OSS is attached as fir-io4-s2-2019-05-29-12-51-55-vmcore-dmesg.txt

            I also attached the output of foreach bt as fir-io4-s2-2019-05-29-12-51-55-foreach-bt.txt

            The crash dump is available on the FTP as LU-12360-fir-io4-s2-2019-05-29-12-51-55-vmcore

            Let me know what you think...
            Thanks!
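
            P.S. In case it helps, this is roughly how the attached files can be regenerated with the crash utility (the vmlinux and vmcore paths are illustrative):

              crash /usr/lib/debug/lib/modules/3.10.0-957.1.3.el7_lustre.x86_64/vmlinux \
                    /var/crash/<date>/vmcore
              crash> log > vmcore-dmesg.txt        # kernel ring buffer from the dump
              crash> foreach bt > foreach-bt.txt   # stack trace of every task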

            OK, great - We'll look at this on a non-emergency basis then.  Thanks, and sorry for the trouble.

            People

              Assignee: pjones Peter Jones
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 8
