  Lustre / LU-12360

Can't restart filesystem (2.12) even with abort_recov


    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.12.0
    • Fix Version/s: None
    • Labels:
      None
    • Environment:
      CentOS 7.6
    • Severity:
      1

      Description

Using 2.12.0 + patches, after running for a few days without major issues, we just had a series of events that brought the filesystem down. It started with an OSS crash on fir-io4-s2:

      [411912.582805] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3305472 GRANT, real grant 0
      [411912.597102] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 258 previous similar messages
      [412523.291449] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3923968 GRANT, real grant 0
      [412523.305649] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 179 previous similar messages
      [413133.047652] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 4132864 GRANT, real grant 0
      [413133.061828] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 209 previous similar messages
      [413733.569845] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3780608 GRANT, real grant 0
      [413733.583920] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 206 previous similar messages
      [414347.351179] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3911680 GRANT, real grant 0
      [414347.365413] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 197 previous similar messages
      [414687.733602] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) ) failed: 
      [414687.748464] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) LBUG
      [414687.755956] Pid: 108071, comm: ldlm_cn01_010 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [414687.766217] Call Trace:
      [414687.768773]  [<ffffffffc0d0f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [414687.775434]  [<ffffffffc0d0f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [414687.781736]  [<ffffffffc15d636c>] osd_object_release+0x7c/0x80 [osd_ldiskfs]
      [414687.788910]  [<ffffffffc0e936a0>] lu_object_put+0x190/0x3d0 [obdclass]
      [414687.795582]  [<ffffffffc1750b56>] ofd_lvbo_update+0x5e6/0xdf0 [ofd]
      [414687.801980]  [<ffffffffc10a2c89>] ldlm_request_cancel+0x3c9/0x740 [ptlrpc]
      [414687.809008]  [<ffffffffc10a77da>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
      [414687.815859]  [<ffffffffc10a7ac8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [414687.822885]  [<ffffffffc10d8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [414687.830693]  [<ffffffffc10dc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [414687.837115]  [<ffffffffb74c1c31>] kthread+0xd1/0xe0
      [414687.842115]  [<ffffffffb7b74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [414687.848677]  [<ffffffffffffffff>] 0xffffffffffffffff
      [414687.853780] Kernel panic - not syncing: LBUG

      Shortly after restarting the OSS, an MDS, fir-md1-s2 (serving MDT1 + MDT3), crashed:

      [2687144.065647] Lustre: fir-MDT0003: Client ddea348b-e5a4-5330-325a-755d459e8dda (at 10.9.107.57@o2ib4) reconnecting
      [2687144.076003] Lustre: Skipped 1 previous similar message
      [2689300.253654] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) ASSERTION( o->opo_reserved == 0 ) failed: 
      [2689300.264453] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) LBUG
      [2689300.271942] Pid: 124673, comm: mdt00_044 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [2689300.281942] Call Trace:
      [2689300.284583]  [<ffffffffc0a397cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [2689300.291332]  [<ffffffffc0a3987c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [2689300.297719]  [<ffffffffc15aecc5>] osp_declare_create+0x5a5/0x5b0 [osp]
      [2689300.304462]  [<ffffffffc14f639f>] lod_sub_declare_create+0xdf/0x210 [lod]
      [2689300.311466]  [<ffffffffc14f5904>] lod_qos_prep_create+0x15d4/0x1890 [lod]
      [2689300.318467]  [<ffffffffc14d6bba>] lod_declare_instantiate_components+0x9a/0x1d0 [lod]
      [2689300.326511]  [<ffffffffc14e94d5>] lod_declare_layout_change+0xb65/0x10f0 [lod]
      [2689300.333948]  [<ffffffffc155b102>] mdd_declare_layout_change+0x62/0x120 [mdd]
      [2689300.341211]  [<ffffffffc1563e52>] mdd_layout_change+0x882/0x1000 [mdd]
      [2689300.347964]  [<ffffffffc13ca327>] mdt_layout_change+0x337/0x430 [mdt]
      [2689300.354636]  [<ffffffffc13d243e>] mdt_intent_layout+0x7ee/0xcc0 [mdt]
      [2689300.361292]  [<ffffffffc13cfa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      [2689300.367949]  [<ffffffffc0d6dd46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [2689300.374893]  [<ffffffffc0d96707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [2689300.382189]  [<ffffffffc0e1d6e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [2689300.388531]  [<ffffffffc0e2473a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [2689300.395657]  [<ffffffffc0dc8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [2689300.403563]  [<ffffffffc0dcc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [2689300.410077]  [<ffffffff9e6c1c31>] kthread+0xd1/0xe0
      [2689300.415164]  [<ffffffff9ed74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [2689300.421813]  [<ffffffffffffffff>] 0xffffffffffffffff
      [2689300.427016] Kernel panic - not syncing: LBUG
      

      And now I am not able to restart the filesystem, even with -o abort_recov everywhere. The OSTs restart, but MDT0 refuses to come back up: it keeps trying to perform a recovery.
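      For reference, the restart attempts are ordinary Lustre target mounts with the abort option; a sketch of what we are running (the device path and mount point below are placeholders, not our actual paths):

      ```shell
      # Start the MDT while asking it to skip the recovery window:
      # -o abort_recov should evict the known clients instead of waiting
      # for them all to reconnect and replay their requests.
      # /dev/mapper/fir-mdt0 and /mnt/fir-mdt0 are illustrative paths.
      mount -t lustre -o abort_recov /dev/mapper/fir-mdt0 /mnt/fir-mdt0
      ```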

      May 29 15:59:43 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
      May 29 15:59:49 fir-md1-s1 kernel: Lustre: DEBUG MARKER: Wed May 29 15:59:49 2019
      May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): recovery complete
      May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      May 29 16:00:25 fir-md1-s1 kernel: Lustre: MGS: Connection restored to MGC10.0.10.51@o2ib7_0 (at 0@lo)
      May 29 16:00:26 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 1495a992-a81e-7de9-50a8-88376d880671 (at 10.8.12.32@o2ib6)
      May 29 16:00:26 fir-md1-s1 kernel: Lustre: Skipped 39 previous similar messages
      May 29 16:00:27 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 0e359090-33c6-a6a0-317c-1b138ecd9b87 (at 10.9.105.16@o2ib4)
      May 29 16:00:27 fir-md1-s1 kernel: Lustre: Skipped 72 previous similar messages
      May 29 16:00:29 fir-md1-s1 kernel: Lustre: MGS: Connection restored to d6e5f2ed-c606-14e7-85a0-13787ff0467b (at 10.9.105.69@o2ib4)
      May 29 16:00:29 fir-md1-s1 kernel: Lustre: Skipped 35 previous similar messages
      May 29 16:00:33 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 2996d077-09af-23fe-fe18-ce9b6359bb8e (at 10.8.21.35@o2ib6)
      May 29 16:00:33 fir-md1-s1 kernel: Lustre: Skipped 263 previous similar messages
      May 29 16:00:43 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 545852b2-0fa4-d28f-21da-1dc3deae2f8b (at 10.9.115.1@o2ib4)
      May 29 16:00:43 fir-md1-s1 kernel: Lustre: Skipped 177 previous similar messages
      May 29 16:01:10 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 98b9580a-8059-3273-b7c7-9f03a2d4d87f (at 10.9.112.7@o2ib4)
      May 29 16:01:10 fir-md1-s1 kernel: Lustre: Skipped 6 previous similar messages
      May 29 16:01:41 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
      May 29 16:01:42 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 4030ea6e-e97d-b5db-fa13-470ec3d51337 (at 10.8.1.8@o2ib6)
      May 29 16:01:42 fir-md1-s1 kernel: Lustre: Skipped 40 previous similar messages
      May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
      May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): recovery complete
      May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
      May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.8.31.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:24 fir-md1-s1 kernel: Lustre: fir-MDT0000: Not available for connect from 10.8.18.10@o2ib6 (not set up)
      May 29 16:02:24 fir-md1-s1 kernel: Lustre: Skipped 3 previous similar messages
      May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.17.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:24 fir-md1-s1 kernel: LustreError: Skipped 4 previous similar messages
      May 29 16:02:26 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.102.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:26 fir-md1-s1 kernel: LustreError: Skipped 1 previous similar message
      May 29 16:02:28 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.0.10.106@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:28 fir-md1-s1 kernel: LustreError: Skipped 13 previous similar messages
      May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
      May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 2 seconds
      May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
      May 29 16:02:33 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.0.10.3@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:33 fir-md1-s1 kernel: LustreError: Skipped 12 previous similar messages
      May 29 16:02:38 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 1 seconds
      May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDD0000: changelog on
      May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(mdt_handler.c:6447:mdt_iocontrol()) fir-MDT0000: Aborting recovery for device
      May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0000: Aborting recovery
      May 29 16:02:38 fir-md1-s1 kernel: Lustre: 20890:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      May 29 16:02:41 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.9.108.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:41 fir-md1-s1 kernel: LustreError: Skipped 29 previous similar messages
      May 29 16:02:44 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (86 recovered, 49 in progress, and 0 evicted) already passed deadline 8:07
      May 29 16:02:46 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 98410362-8850-0375-cdb0-3dd7ebb1ce58 (at 10.9.107.28@o2ib4)
      May 29 16:02:46 fir-md1-s1 kernel: Lustre: Skipped 1171 previous similar messages
      May 29 16:02:57 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.1.6@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:02:57 fir-md1-s1 kernel: LustreError: Skipped 803 previous similar messages
      May 29 16:03:02 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:03:12 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (802 recovered, 565 in progress, and 0 evicted) already passed deadline 8:34
      May 29 16:03:27 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:03:30 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.20.22@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:03:30 fir-md1-s1 kernel: LustreError: Skipped 2086 previous similar messages
      May 29 16:03:37 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 8:59
      May 29 16:04:02 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:24
      May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
      May 29 16:04:27 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:50
      May 29 16:04:34 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.8.19@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
      May 29 16:04:34 fir-md1-s1 kernel: LustreError: Skipped 3903 previous similar messages
      May 29 16:04:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
      May 29 16:04:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      May 29 16:04:38 fir-md1-s1 kernel: mount.lustre    D ffff946ebeba9040     0 20624  20623 0x00000082
      May 29 16:04:38 fir-md1-s1 kernel: Call Trace:
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c167c49>] schedule+0x29/0x70
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
      May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27
      May 29 16:04:52 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:15
      May 29 16:05:17 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:40
      May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
      May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 2 previous similar messages
      May 29 16:06:06 fir-md1-s1 kernel: LNet: Service thread pid 20981 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      May 29 16:06:06 fir-md1-s1 kernel: Pid: 20981, comm: mdt00_026 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      May 29 16:06:06 fir-md1-s1 kernel: Call Trace:
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1573bf8>] osp_precreate_reserve+0x2e8/0x800 [osp]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc15688b9>] osp_declare_create+0x199/0x5b0 [osp]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ee39f>] lod_sub_declare_create+0xdf/0x210 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e656e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e950e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ed62d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18eddd5>] lod_prepare_create+0x215/0x2e0 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ddcee>] lod_declare_striped_create+0x1ee/0x980 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e5058>] lod_declare_xattr_set+0x138/0x1230 [lod]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc193e065>] mdo_declare_xattr_set+0x65/0x150 [mdd]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1947af0>] mdd_declare_create+0x550/0xcb0 [mdd]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc194b6e7>] mdd_create+0x827/0x1440 [mdd]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc181c018>] mdt_reint_open+0x1aa8/0x28b0 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc180eed3>] mdt_reint_rec+0x83/0x210 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17ed143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17f9982>] mdt_intent_open+0x82/0x350 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17f7a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0e79d46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ea2707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0f296e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0f3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ed4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ed863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffff9bac1c31>] kthread+0xd1/0xe0
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffff9c174c24>] ret_from_fork_nospec_begin+0xe/0x21
      May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      May 29 16:06:06 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1559171166.20981
      May 29 16:06:07 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 11:30
      May 29 16:06:07 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
      May 29 16:06:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
      May 29 16:06:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      May 29 16:06:38 fir-md1-s1 kernel: mount.lustre    D ffff946ebeba9040     0 20624  20623 0x00000082
      May 29 16:06:38 fir-md1-s1 kernel: Call Trace:
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c167c49>] schedule+0x29/0x70
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
      May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27

      The mount commands are stuck for all 4 MDTs.

      While I'm writing this, we're trying to restart everything with all LNet routers stopped, to avoid any recovery involving our compute nodes on Sherlock. My hope is that once the imperative recovery window times out, the MDTs will actually mount. Any ideas are welcome. Thanks!
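      In case it helps, the recovery state that the "Denying connection" messages report can also be inspected and aborted at runtime; a sketch using standard lctl commands (target names are this filesystem's, and I'm assuming 2.12 behaves as the manual describes):

      ```shell
      # Show per-MDT recovery status: state, completed/connected client
      # counts, and time remaining -- the same numbers quoted in the
      # "waiting for 1403 known clients" console messages above.
      lctl get_param mdt.fir-MDT*.recovery_status

      # Abort recovery on an already-mounted target that is stuck,
      # without unmounting and remounting with -o abort_recov:
      lctl --device fir-MDT0000 abort_recovery
      ```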

        Attachments

        1. fir-io4-s2-2019-05-29-12-51-55-foreach-bt.txt
          1.05 MB
          Stephane Thiell
        2. fir-io4-s2-2019-05-29-12-51-55-vmcore-dmesg.txt
          836 kB
          Stephane Thiell
        3. fir-md1-s1_noabort_rtrdown.log
          218 kB
          Stephane Thiell
        4. fir-md1-s1.log
          227 kB
          Stephane Thiell
        5. fir-md1-s2_noabort_rtrdown.log
          211 kB
          Stephane Thiell
        6. fir-md1-s2.log
          230 kB
          Stephane Thiell
        7. fir-md1-s2-20190910-endlessrecov.sysrq-t.log
          1.46 MB
          Stephane Thiell


              People

              Assignee:
              pjones Peter Jones
              Reporter:
              sthiell Stephane Thiell
              Votes:
              0
              Watchers:
              8
