Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Affects Version: Lustre 2.12.0
- Environment: CentOS 7.6
Description
Using 2.12.0 + patches, after running for a few days without major issues, we just had a series of events leading to the filesystem going down. It started with an OSS crash on fir-io4-s2:
[411912.582805] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3305472 GRANT, real grant 0
[411912.597102] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 258 previous similar messages
[412523.291449] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3923968 GRANT, real grant 0
[412523.305649] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 179 previous similar messages
[413133.047652] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 4132864 GRANT, real grant 0
[413133.061828] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 209 previous similar messages
[413733.569845] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3780608 GRANT, real grant 0
[413733.583920] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 206 previous similar messages
[414347.351179] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3911680 GRANT, real grant 0
[414347.365413] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 197 previous similar messages
[414687.733602] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) ) failed:
[414687.748464] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) LBUG
[414687.755956] Pid: 108071, comm: ldlm_cn01_010 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[414687.766217] Call Trace:
[414687.768773] [<ffffffffc0d0f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[414687.775434] [<ffffffffc0d0f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[414687.781736] [<ffffffffc15d636c>] osd_object_release+0x7c/0x80 [osd_ldiskfs]
[414687.788910] [<ffffffffc0e936a0>] lu_object_put+0x190/0x3d0 [obdclass]
[414687.795582] [<ffffffffc1750b56>] ofd_lvbo_update+0x5e6/0xdf0 [ofd]
[414687.801980] [<ffffffffc10a2c89>] ldlm_request_cancel+0x3c9/0x740 [ptlrpc]
[414687.809008] [<ffffffffc10a77da>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
[414687.815859] [<ffffffffc10a7ac8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[414687.822885] [<ffffffffc10d8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[414687.830693] [<ffffffffc10dc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[414687.837115] [<ffffffffb74c1c31>] kthread+0xd1/0xe0
[414687.842115] [<ffffffffb7b74c24>] ret_from_fork_nospec_begin+0xe/0x21
[414687.848677] [<ffffffffffffffff>] 0xffffffffffffffff
[414687.853780] Kernel panic - not syncing: LBUG
Shortly after we restarted the OSS, an MDS, fir-md1-s2 (serving MDT1 + MDT3), crashed:
[2687144.065647] Lustre: fir-MDT0003: Client ddea348b-e5a4-5330-325a-755d459e8dda (at 10.9.107.57@o2ib4) reconnecting
[2687144.076003] Lustre: Skipped 1 previous similar message
[2689300.253654] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) ASSERTION( o->opo_reserved == 0 ) failed:
[2689300.264453] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) LBUG
[2689300.271942] Pid: 124673, comm: mdt00_044 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[2689300.281942] Call Trace:
[2689300.284583] [<ffffffffc0a397cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[2689300.291332] [<ffffffffc0a3987c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[2689300.297719] [<ffffffffc15aecc5>] osp_declare_create+0x5a5/0x5b0 [osp]
[2689300.304462] [<ffffffffc14f639f>] lod_sub_declare_create+0xdf/0x210 [lod]
[2689300.311466] [<ffffffffc14f5904>] lod_qos_prep_create+0x15d4/0x1890 [lod]
[2689300.318467] [<ffffffffc14d6bba>] lod_declare_instantiate_components+0x9a/0x1d0 [lod]
[2689300.326511] [<ffffffffc14e94d5>] lod_declare_layout_change+0xb65/0x10f0 [lod]
[2689300.333948] [<ffffffffc155b102>] mdd_declare_layout_change+0x62/0x120 [mdd]
[2689300.341211] [<ffffffffc1563e52>] mdd_layout_change+0x882/0x1000 [mdd]
[2689300.347964] [<ffffffffc13ca327>] mdt_layout_change+0x337/0x430 [mdt]
[2689300.354636] [<ffffffffc13d243e>] mdt_intent_layout+0x7ee/0xcc0 [mdt]
[2689300.361292] [<ffffffffc13cfa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[2689300.367949] [<ffffffffc0d6dd46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[2689300.374893] [<ffffffffc0d96707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[2689300.382189] [<ffffffffc0e1d6e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[2689300.388531] [<ffffffffc0e2473a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[2689300.395657] [<ffffffffc0dc8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2689300.403563] [<ffffffffc0dcc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[2689300.410077] [<ffffffff9e6c1c31>] kthread+0xd1/0xe0
[2689300.415164] [<ffffffff9ed74c24>] ret_from_fork_nospec_begin+0xe/0x21
[2689300.421813] [<ffffffffffffffff>] 0xffffffffffffffff
[2689300.427016] Kernel panic - not syncing: LBUG
And now I am not able to restart the filesystem, even with -o abort_recov everywhere. The OSTs restart, but then MDT0 won't: it keeps trying to perform recovery.
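For reference, this is roughly what we are running to skip recovery at mount time (the device path below is a placeholder, not our actual one):

```shell
# Mount an MDT with recovery aborted, so it should evict stale clients
# instead of waiting for them to reconnect.
# /dev/mapper/mdt0 and the mount point are illustrative only.
mount -t lustre -o abort_recov /dev/mapper/mdt0 /mnt/fir-MDT0000
```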
May 29 15:59:43 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
May 29 15:59:49 fir-md1-s1 kernel: Lustre: DEBUG MARKER: Wed May 29 15:59:49 2019
May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): recovery complete
May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
May 29 16:00:25 fir-md1-s1 kernel: Lustre: MGS: Connection restored to MGC10.0.10.51@o2ib7_0 (at 0@lo)
May 29 16:00:26 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 1495a992-a81e-7de9-50a8-88376d880671 (at 10.8.12.32@o2ib6)
May 29 16:00:26 fir-md1-s1 kernel: Lustre: Skipped 39 previous similar messages
May 29 16:00:27 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 0e359090-33c6-a6a0-317c-1b138ecd9b87 (at 10.9.105.16@o2ib4)
May 29 16:00:27 fir-md1-s1 kernel: Lustre: Skipped 72 previous similar messages
May 29 16:00:29 fir-md1-s1 kernel: Lustre: MGS: Connection restored to d6e5f2ed-c606-14e7-85a0-13787ff0467b (at 10.9.105.69@o2ib4)
May 29 16:00:29 fir-md1-s1 kernel: Lustre: Skipped 35 previous similar messages
May 29 16:00:33 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 2996d077-09af-23fe-fe18-ce9b6359bb8e (at 10.8.21.35@o2ib6)
May 29 16:00:33 fir-md1-s1 kernel: Lustre: Skipped 263 previous similar messages
May 29 16:00:43 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 545852b2-0fa4-d28f-21da-1dc3deae2f8b (at 10.9.115.1@o2ib4)
May 29 16:00:43 fir-md1-s1 kernel: Lustre: Skipped 177 previous similar messages
May 29 16:01:10 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 98b9580a-8059-3273-b7c7-9f03a2d4d87f (at 10.9.112.7@o2ib4)
May 29 16:01:10 fir-md1-s1 kernel: Lustre: Skipped 6 previous similar messages
May 29 16:01:41 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
May 29 16:01:42 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 4030ea6e-e97d-b5db-fa13-470ec3d51337 (at 10.8.1.8@o2ib6)
May 29 16:01:42 fir-md1-s1 kernel: Lustre: Skipped 40 previous similar messages
May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): recovery complete
May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.8.31.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:24 fir-md1-s1 kernel: Lustre: fir-MDT0000: Not available for connect from 10.8.18.10@o2ib6 (not set up)
May 29 16:02:24 fir-md1-s1 kernel: Lustre: Skipped 3 previous similar messages
May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.17.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:24 fir-md1-s1 kernel: LustreError: Skipped 4 previous similar messages
May 29 16:02:26 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.102.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:26 fir-md1-s1 kernel: LustreError: Skipped 1 previous similar message
May 29 16:02:28 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.0.10.106@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:28 fir-md1-s1 kernel: LustreError: Skipped 13 previous similar messages
May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 2 seconds
May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
May 29 16:02:33 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.0.10.3@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:33 fir-md1-s1 kernel: LustreError: Skipped 12 previous similar messages
May 29 16:02:38 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 1 seconds
May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDT0000: Imperative Recovery not enabled, recovery window 300-900
May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDD0000: changelog on
May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(mdt_handler.c:6447:mdt_iocontrol()) fir-MDT0000: Aborting recovery for device
May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0000: Aborting recovery
May 29 16:02:38 fir-md1-s1 kernel: Lustre: 20890:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
May 29 16:02:41 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.9.108.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:41 fir-md1-s1 kernel: LustreError: Skipped 29 previous similar messages
May 29 16:02:44 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (86 recovered, 49 in progress, and 0 evicted) already passed deadline 8:07
May 29 16:02:46 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 98410362-8850-0375-cdb0-3dd7ebb1ce58 (at 10.9.107.28@o2ib4)
May 29 16:02:46 fir-md1-s1 kernel: Lustre: Skipped 1171 previous similar messages
May 29 16:02:57 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.1.6@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:57 fir-md1-s1 kernel: LustreError: Skipped 803 previous similar messages
May 29 16:03:02 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:03:12 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (802 recovered, 565 in progress, and 0 evicted) already passed deadline 8:34
May 29 16:03:27 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:03:30 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.20.22@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:03:30 fir-md1-s1 kernel: LustreError: Skipped 2086 previous similar messages
May 29 16:03:37 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 8:59
May 29 16:04:02 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:24
May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
May 29 16:04:27 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:50
May 29 16:04:34 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.8.19@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:04:34 fir-md1-s1 kernel: LustreError: Skipped 3903 previous similar messages
May 29 16:04:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
May 29 16:04:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 16:04:38 fir-md1-s1 kernel: mount.lustre D ffff946ebeba9040 0 20624 20623 0x00000082
May 29 16:04:38 fir-md1-s1 kernel: Call Trace:
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9c167c49>] schedule+0x29/0x70
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
May 29 16:04:38 fir-md1-s1 kernel: [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27
May 29 16:04:52 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:15
May 29 16:05:17 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:40
May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 2 previous similar messages
May 29 16:06:06 fir-md1-s1 kernel: LNet: Service thread pid 20981 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 29 16:06:06 fir-md1-s1 kernel: Pid: 20981, comm: mdt00_026 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
May 29 16:06:06 fir-md1-s1 kernel: Call Trace:
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc1573bf8>] osp_precreate_reserve+0x2e8/0x800 [osp]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc15688b9>] osp_declare_create+0x199/0x5b0 [osp]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc18ee39f>] lod_sub_declare_create+0xdf/0x210 [lod]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc18e656e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc18e950e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc18ed62d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc18eddd5>] lod_prepare_create+0x215/0x2e0 [lod]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc18ddcee>] lod_declare_striped_create+0x1ee/0x980 [lod]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc18e5058>] lod_declare_xattr_set+0x138/0x1230 [lod]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc193e065>] mdo_declare_xattr_set+0x65/0x150 [mdd]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc1947af0>] mdd_declare_create+0x550/0xcb0 [mdd]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc194b6e7>] mdd_create+0x827/0x1440 [mdd]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc181c018>] mdt_reint_open+0x1aa8/0x28b0 [mdt]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc180eed3>] mdt_reint_rec+0x83/0x210 [mdt]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc17ed143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc17f9982>] mdt_intent_open+0x82/0x350 [mdt]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc17f7a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc0e79d46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc0ea2707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc0f296e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc0f3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc0ed4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffc0ed863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffff9bac1c31>] kthread+0xd1/0xe0
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffff9c174c24>] ret_from_fork_nospec_begin+0xe/0x21
May 29 16:06:06 fir-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
May 29 16:06:06 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1559171166.20981
May 29 16:06:07 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 11:30
May 29 16:06:07 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
May 29 16:06:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
May 29 16:06:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 16:06:38 fir-md1-s1 kernel: mount.lustre D ffff946ebeba9040 0 20624 20623 0x00000082
May 29 16:06:38 fir-md1-s1 kernel: Call Trace:
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9c167c49>] schedule+0x29/0x70
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
May 29 16:06:38 fir-md1-s1 kernel: [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27
The mount commands are stuck for all 4 MDTs.
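In case it helps, this is roughly how we are watching recovery state on the MDS while the mounts hang (standard lctl parameters; sketch only):

```shell
# Show recovery progress for every local MDT: status, connected/completed
# client counts, and time remaining in the recovery window.
lctl get_param mdt.*.recovery_status

# Recovery can also be aborted on an already-mounted target, e.g.:
#   lctl --device fir-MDT0000 abort_recovery
```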
While I'm writing this, we're trying to restart everything with all LNet routers stopped, to avoid any recovery involving our compute nodes on Sherlock. My hope is that once the recovery window times out, the MDTs will actually mount. Any idea is welcome. Thanks!