Lustre / LU-12360

Can't restart filesystem (2.12) even with abort_recov

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Component/s: None
    • Environment: CentOS 7.6
    • Severity: 1

    Description

      Using 2.12.0 + patches, after running a few days without major issues, we just had a series of events leading to the filesystem going down. It started with an OSS crash on fir-io4-s2:

      [411912.582805] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3305472 GRANT, real grant 0
      [411912.597102] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 258 previous similar messages
      [412523.291449] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3923968 GRANT, real grant 0
      [412523.305649] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 179 previous similar messages
      [413133.047652] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 4132864 GRANT, real grant 0
      [413133.061828] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 209 previous similar messages
      [413733.569845] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3780608 GRANT, real grant 0
      [413733.583920] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 206 previous similar messages
      [414347.351179] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3911680 GRANT, real grant 0
      [414347.365413] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 197 previous similar messages
      [414687.733602] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) ) failed: 
      [414687.748464] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) LBUG
      [414687.755956] Pid: 108071, comm: ldlm_cn01_010 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [414687.766217] Call Trace:
      [414687.768773]  [<ffffffffc0d0f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [414687.775434]  [<ffffffffc0d0f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [414687.781736]  [<ffffffffc15d636c>] osd_object_release+0x7c/0x80 [osd_ldiskfs]
      [414687.788910]  [<ffffffffc0e936a0>] lu_object_put+0x190/0x3d0 [obdclass]
      [414687.795582]  [<ffffffffc1750b56>] ofd_lvbo_update+0x5e6/0xdf0 [ofd]
      [414687.801980]  [<ffffffffc10a2c89>] ldlm_request_cancel+0x3c9/0x740 [ptlrpc]
      [414687.809008]  [<ffffffffc10a77da>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
      [414687.815859]  [<ffffffffc10a7ac8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [414687.822885]  [<ffffffffc10d8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [414687.830693]  [<ffffffffc10dc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [414687.837115]  [<ffffffffb74c1c31>] kthread+0xd1/0xe0
      [414687.842115]  [<ffffffffb7b74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [414687.848677]  [<ffffffffffffffff>] 0xffffffffffffffff
      [414687.853780] Kernel panic - not syncing: LBUG
      Shortly after restarting the OSS, an MDS, fir-md1-s2 (MDT0001 + MDT0003), crashed:

      [2687144.065647] Lustre: fir-MDT0003: Client ddea348b-e5a4-5330-325a-755d459e8dda (at 10.9.107.57@o2ib4) reconnecting
      [2687144.076003] Lustre: Skipped 1 previous similar message
      [2689300.253654] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) ASSERTION( o->opo_reserved == 0 ) failed: 
      [2689300.264453] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) LBUG
      [2689300.271942] Pid: 124673, comm: mdt00_044 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [2689300.281942] Call Trace:
      [2689300.284583]  [<ffffffffc0a397cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [2689300.291332]  [<ffffffffc0a3987c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [2689300.297719]  [<ffffffffc15aecc5>] osp_declare_create+0x5a5/0x5b0 [osp]
      [2689300.304462]  [<ffffffffc14f639f>] lod_sub_declare_create+0xdf/0x210 [lod]
      [2689300.311466]  [<ffffffffc14f5904>] lod_qos_prep_create+0x15d4/0x1890 [lod]
      [2689300.318467]  [<ffffffffc14d6bba>] lod_declare_instantiate_components+0x9a/0x1d0 [lod]
      [2689300.326511]  [<ffffffffc14e94d5>] lod_declare_layout_change+0xb65/0x10f0 [lod]
      [2689300.333948]  [<ffffffffc155b102>] mdd_declare_layout_change+0x62/0x120 [mdd]
      [2689300.341211]  [<ffffffffc1563e52>] mdd_layout_change+0x882/0x1000 [mdd]
      [2689300.347964]  [<ffffffffc13ca327>] mdt_layout_change+0x337/0x430 [mdt]
      [2689300.354636]  [<ffffffffc13d243e>] mdt_intent_layout+0x7ee/0xcc0 [mdt]
      [2689300.361292]  [<ffffffffc13cfa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      [2689300.367949]  [<ffffffffc0d6dd46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [2689300.374893]  [<ffffffffc0d96707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [2689300.382189]  [<ffffffffc0e1d6e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [2689300.388531]  [<ffffffffc0e2473a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [2689300.395657]  [<ffffffffc0dc8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [2689300.403563]  [<ffffffffc0dcc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [2689300.410077]  [<ffffffff9e6c1c31>] kthread+0xd1/0xe0
      [2689300.415164]  [<ffffffff9ed74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [2689300.421813]  [<ffffffffffffffff>] 0xffffffffffffffff
      [2689300.427016] Kernel panic - not syncing: LBUG
      

      And now I am not able to restart the filesystem, even with -o abort_recov everywhere. The OSTs restart, but then MDT0 doesn't want to start; it keeps wanting to perform a recovery.
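      For reference, the restart attempts look roughly like this; the device and mount point paths below are placeholders, not our exact ones:

      # Mount the MDT with recovery aborted (paths are placeholders):
      mount -t lustre -o abort_recov /dev/mapper/fir-mdt0 /mnt/fir-MDT0000

      # Equivalent abort once a target is already mounted and stuck recovering:
      lctl --device fir-MDT0000 abort_recovery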

      May 29 15:59:43 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
      May 29 15:59:49 fir-md1-s1 kernel: Lustre: DEBUG MARKER: Wed May 29 15:59:49 2019
      May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): recovery complete
      May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      May 29 16:00:25 fir-md1-s1 kernel: Lustre: MGS: Connection restored to MGC10.0.10.51@o2ib7_0 (at 0@lo)
      May 29 16:00:26 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 1495a992-a81e-7de9-50a8-88376d880671 (at 10.8.12.32@o2ib6)
      May 29 16:00:26 fir-md1-s1 kernel: Lustre: Skipped 39 previous similar messages
      May 29 16:00:27 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 0e359090-33c6-a6a0-317c-1b138ecd9b87 (at 10.9.105.16@o2ib4)
      May 29 16:00:27 fir-md1-s1 kernel: Lustre: Skipped 72 previous similar messages
      May 29 16:00:29 fir-md1-s1 kernel: Lustre: MGS: Connection restored to d6e5f2ed-c606-14e7-85a0-13787ff0467b (at 10.9.105.69@o2ib4)
      May 29 16:00:29 fir-md1-s1 kernel: Lustre: Skipped 35 previous similar messages
      May 29 16:00:33 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 2996d077-09af-23fe-fe18-ce9b6359bb8e (at 10.8.21.35@o2ib6)
      May 29 16:00:33 fir-md1-s1 kernel: Lustre: Skipped 263 previous similar messages
      May 29 16:00:43 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 545852b2-0fa4-d28f-21da-1dc3deae2f8b (at 10.9.115.1@o2ib4)
      May 29 16:00:43 fir-md1-s1 kernel: Lustre: Skipped 177 previous similar messages
      May 29 16:01:10 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 98b9580a-8059-3273-b7c7-9f03a2d4d87f (at 10.9.112.7@o2ib4)
      May 29 16:01:10 fir-md1-s1 kernel: Lustre: Skipped 6 previous similar messages
      May 29 16:01:41 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
      May 29 16:01:42 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 4030ea6e-e97d-b5db-fa13-470ec3d51337 (at 10.8.1.8@o2ib6)
      May 29 16:01:42 fir-md1-s1 kernel: Lustre: Skipped 40 previous similar messages
      May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
      May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): recovery complete
      May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
      May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.8.31.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:24 fir-md1-s1 kernel: Lustre: fir-MDT0000: Not available for connect from 10.8.18.10@o2ib6 (not set up)
      May 29 16:02:24 fir-md1-s1 kernel: Lustre: Skipped 3 previous similar messages
      May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.17.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:24 fir-md1-s1 kernel: LustreError: Skipped 4 previous similar messages
      May 29 16:02:26 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.102.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:26 fir-md1-s1 kernel: LustreError: Skipped 1 previous similar message
      May 29 16:02:28 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.0.10.106@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:28 fir-md1-s1 kernel: LustreError: Skipped 13 previous similar messages
      May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
      May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 2 seconds
      May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
      May 29 16:02:33 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.0.10.3@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:33 fir-md1-s1 kernel: LustreError: Skipped 12 previous similar messages
      May 29 16:02:38 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 1 seconds
      May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDD0000: changelog on
      May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(mdt_handler.c:6447:mdt_iocontrol()) fir-MDT0000: Aborting recovery for device
      May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0000: Aborting recovery
      May 29 16:02:38 fir-md1-s1 kernel: Lustre: 20890:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      May 29 16:02:41 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.9.108.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:41 fir-md1-s1 kernel: LustreError: Skipped 29 previous similar messages
      May 29 16:02:44 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (86 recovered, 49 in progress, and 0 evicted) already passed deadline 8:07
      May 29 16:02:46 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 98410362-8850-0375-cdb0-3dd7ebb1ce58 (at 10.9.107.28@o2ib4)
      May 29 16:02:46 fir-md1-s1 kernel: Lustre: Skipped 1171 previous similar messages
      May 29 16:02:57 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.1.6@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:57 fir-md1-s1 kernel: LustreError: Skipped 803 previous similar messages
      May 29 16:03:02 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:03:12 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (802 recovered, 565 in progress, and 0 evicted) already passed deadline 8:34
      May 29 16:03:27 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:03:30 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.20.22@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:03:30 fir-md1-s1 kernel: LustreError: Skipped 2086 previous similar messages
      May 29 16:03:37 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 8:59
      May 29 16:04:02 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:24
      May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
      May 29 16:04:27 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:50
      May 29 16:04:34 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.8.19@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:04:34 fir-md1-s1 kernel: LustreError: Skipped 3903 previous similar messages
      May 29 16:04:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
      May 29 16:04:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      May 29 16:04:38 fir-md1-s1 kernel: mount.lustre    D ffff946ebeba9040     0 20624  20623 0x00000082
      May 29 16:04:38 fir-md1-s1 kernel: Call Trace:
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c167c49>] schedule+0x29/0x70
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27
      May 29 16:04:52 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:15
      May 29 16:05:17 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:40
      May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 2 previous similar messages
      May 29 16:06:06 fir-md1-s1 kernel: LNet: Service thread pid 20981 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      May 29 16:06:06 fir-md1-s1 kernel: Pid: 20981, comm: mdt00_026 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      May 29 16:06:06 fir-md1-s1 kernel: Call Trace:
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1573bf8>] osp_precreate_reserve+0x2e8/0x800 [osp]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc15688b9>] osp_declare_create+0x199/0x5b0 [osp]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ee39f>] lod_sub_declare_create+0xdf/0x210 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e656e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e950e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ed62d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18eddd5>] lod_prepare_create+0x215/0x2e0 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ddcee>] lod_declare_striped_create+0x1ee/0x980 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e5058>] lod_declare_xattr_set+0x138/0x1230 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc193e065>] mdo_declare_xattr_set+0x65/0x150 [mdd]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1947af0>] mdd_declare_create+0x550/0xcb0 [mdd]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc194b6e7>] mdd_create+0x827/0x1440 [mdd]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc181c018>] mdt_reint_open+0x1aa8/0x28b0 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc180eed3>] mdt_reint_rec+0x83/0x210 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17ed143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17f9982>] mdt_intent_open+0x82/0x350 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17f7a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0e79d46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ea2707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0f296e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0f3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ed4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ed863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffff9bac1c31>] kthread+0xd1/0xe0
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffff9c174c24>] ret_from_fork_nospec_begin+0xe/0x21
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      May 29 16:06:06 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1559171166.20981
      May 29 16:06:07 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 11:30
      May 29 16:06:07 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
      May 29 16:06:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
      May 29 16:06:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      May 29 16:06:38 fir-md1-s1 kernel: mount.lustre    D ffff946ebeba9040     0 20624  20623 0x00000082
      May 29 16:06:38 fir-md1-s1 kernel: Call Trace:
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c167c49>] schedule+0x29/0x70
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27
      Mount commands are stuck for all 4 MDTs.

      While I'm writing this, we're trying to restart everything with all LNet routers stopped, to avoid any recovery involving our compute nodes on Sherlock. My hope is that after the recovery window times out, the MDTs will actually mount. Any ideas are welcome. Thanks!
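      In the meantime we're watching recovery progress with something like this (the target pattern is just an example):

      # Poll the recovery status of all local MDTs every 10 seconds:
      watch -n 10 "lctl get_param mdt.fir-MDT*.recovery_status"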

    Activity

            [LU-12360] Can't restart filesystem (2.12) even with abort_recov

            We hit crash LU-11967 on an MDS last night, and then we had the same issue again: the MDT recovery could never complete…

            [root@fir-md1-s2 127.0.0.1-2019-08-09-23:48:13]# cat /proc/fs/lustre/mdt/fir-MDT0001/recovery_status
            status: RECOVERING
            recovery_start: 1565445991
            time_remaining: 0
            connected_clients: 1388/1399
            req_replay_clients: 165
            lock_repay_clients: 247
            completed_clients: 1141
            evicted_clients: 11
            replayed_requests: 10
            queued_requests: 164
            next_transno: 342221616219
            
            [root@fir-md1-s2 127.0.0.1-2019-08-09-23:48:13]# cat /proc/fs/lustre/mdt/fir-MDT0003/recovery_status
            status: RECOVERING
            recovery_start: 1565445894
            time_remaining: 0
            connected_clients: 1390/1399
            req_replay_clients: 44
            lock_repay_clients: 137
            completed_clients: 1253
            evicted_clients: 9
            replayed_requests: 2
            queued_requests: 43
            next_transno: 477760981504
            

            A dump of the tasks, taken while the recovery was stuck with time_remaining at 0, is available in fir-md1-s2-20190910-endlessrecov.sysrq-t.log, in the hope it can provide more insight. Thanks!
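            (For completeness, the task dump was captured with the standard sysrq trigger, roughly:)

            # Dump all task stacks to the kernel ring buffer, then save them:
            echo t > /proc/sysrq-trigger
            dmesg > fir-md1-s2-20190910-endlessrecov.sysrq-t.log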

            sthiell Stephane Thiell added a comment

            Hi Patrick,

            Any progress on this? We're still very concerned by this issue and I liked your idea.

            Thanks!

            sthiell Stephane Thiell added a comment

            We're going to use LU-12485 for the assertion.

            pfarrell Patrick Farrell (Inactive) added a comment

            Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35360
            Subject: LU-12360 osd-ldiskfs: don't assert on 0-nlink inode
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 413beb61df0471abab2e0b1a0d1b54943b63c8e4

            gerrit Gerrit Updater added a comment

            Thanks, Stephane.

            The issue (roughly) appears to be this. (I ran into this while trying to figure out how to reproduce this or LU-11761.)

            These are more rough notes for myself, sorry; they will not be terribly intelligible.

            MDT0000 is failing to finish recovery because it cannot abort recovery. It has told the various recovery threads that they need to stop, but there's another thread stuck, and it's waiting on that thread:

            May 29 16:06:06 fir-md1-s1 kernel: LNet: Service thread pid 20981 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            May 29 16:06:06 fir-md1-s1 kernel: Pid: 20981, comm: mdt00_026 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            May 29 16:06:06 fir-md1-s1 kernel: Call Trace:
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1573bf8>] osp_precreate_reserve+0x2e8/0x800 [osp]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc15688b9>] osp_declare_create+0x199/0x5b0 [osp]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ee39f>] lod_sub_declare_create+0xdf/0x210 [lod]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e656e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e950e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ed62d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18eddd5>] lod_prepare_create+0x215/0x2e0 [lod]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ddcee>] lod_declare_striped_create+0x1ee/0x980 [lod]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e5058>] lod_declare_xattr_set+0x138/0x1230 [lod]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc193e065>] mdo_declare_xattr_set+0x65/0x150 [mdd]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1947af0>] mdd_declare_create+0x550/0xcb0 [mdd]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc194b6e7>] mdd_create+0x827/0x1440 [mdd]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc181c018>] mdt_reint_open+0x1aa8/0x28b0 [mdt]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc180eed3>] mdt_reint_rec+0x83/0x210 [mdt]
            May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17ed143>] mdt_reint_internal+0x6e3/0xaf0 [mdt] 

            This is the thread in question. I can't quite walk this back to what is blocking the recovery process, but I'm almost certain that recovery is unable to complete because this thread is unable to exit. The issue is that even if the thread is woken up (which is the intended behavior during recovery), it's not going to complete, because of this check:

            static int osp_precreate_ready_condition(const struct lu_env *env,
                                                     struct osp_device *d)
            {
                    if (d->opd_pre_recovering)
                            return 0;

            So I believe even if the precreate thread is getting woken up, it's still not going to exit from here.
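            One way to confirm this would be to dump the kernel stack of the stuck service thread after the abort; pid 20981 here is just the one from the trace above:

            # Dump the kernel stack of the stuck mdt thread:
            cat /proc/20981/stack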

            I've got an idea for a patch for this, but no clear idea of how to reproduce the problem.  I'll see about putting something together.

            pfarrell Patrick Farrell (Inactive) added a comment
            sthiell Stephane Thiell added a comment - edited

            Patrick, I don't think I noticed the 100% CPU for this. Probably not a duplicate issue then, indeed. Thanks!


            Stephane, Lai,

            I think this is a different abort_recov issue than LU-11761.

            I don't see any evidence of the fld_client issue, which manifested itself as threads like lod0002_rec0001 spinning at 100% CPU.
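            A quick way to check for such a thread would be something like:

            # List the busiest threads; a stuck fld_client loop would show a
            # lod*_rec* thread pinned near 100% CPU (illustrative only):
            ps -eLo pid,tid,pcpu,comm --sort=-pcpu | head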

            Stephane,

            You didn't see anything like that this time, correct?

            More in a minute.

            pfarrell Patrick Farrell (Inactive) added a comment

            Stephane,

            Since this is a duplicate of LU-11761, I'm going to resolve it as such, and you can track progress there.

            pfarrell Patrick Farrell (Inactive) added a comment

            Thanks Lai!

            sthiell Stephane Thiell added a comment
            laisiyao Lai Siyao added a comment

            The "lctl abort_recov" hang issue is tracked in LU-11761, and there is a patch in review.
            The MDS crash on "osp_declare_create()) ASSERTION( o->opo_reserved == 0 )" is tracked in LU-11967.
            I'm looking into the OSS crash "osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) )" now.


            People

              Assignee: pjones Peter Jones
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 8