[LU-12360] Can't restart filesystem (2.12) even with abort_recov Created: 29/May/19  Updated: 25/Sep/19

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.6


Attachments: Text File fir-io4-s2-2019-05-29-12-51-55-foreach-bt.txt     Text File fir-io4-s2-2019-05-29-12-51-55-vmcore-dmesg.txt     Text File fir-md1-s1.log     Text File fir-md1-s1_noabort_rtrdown.log     Text File fir-md1-s2-20190910-endlessrecov.sysrq-t.log     Text File fir-md1-s2.log     Text File fir-md1-s2_noabort_rtrdown.log    
Issue Links:
Related
is related to LU-11761 blocked MDT mount and high cpu usage ... Resolved
is related to LU-12485 (osd_handler.c:2146:osd_object_releas... Resolved
Severity: 1
Rank (Obsolete): 9223372036854775807

 Description   

Using 2.12.0 + patches, after running for a few days without major issues, we just had a series of events that took the filesystem down. It started with an OSS crash on fir-io4-s2:

[411912.582805] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3305472 GRANT, real grant 0
[411912.597102] LustreError: 128678:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 258 previous similar messages
[412523.291449] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3923968 GRANT, real grant 0
[412523.305649] LustreError: 87611:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 179 previous similar messages
[413133.047652] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 4132864 GRANT, real grant 0
[413133.061828] LustreError: 1860:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 209 previous similar messages
[413733.569845] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 7ccedfb6-dc58-53e5-7474-adae274cd5d7 claims 3780608 GRANT, real grant 0
[413733.583920] LustreError: 1546:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 206 previous similar messages
[414347.351179] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0028: cli 100744eb-5315-bfcb-8c97-cf1e0209fc7b claims 3911680 GRANT, real grant 0
[414347.365413] LustreError: 87688:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 197 previous similar messages
[414687.733602] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) ) failed: 
[414687.748464] LustreError: 108071:0:(osd_handler.c:2133:osd_object_release()) LBUG
[414687.755956] Pid: 108071, comm: ldlm_cn01_010 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[414687.766217] Call Trace:
[414687.768773]  [<ffffffffc0d0f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[414687.775434]  [<ffffffffc0d0f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[414687.781736]  [<ffffffffc15d636c>] osd_object_release+0x7c/0x80 [osd_ldiskfs]
[414687.788910]  [<ffffffffc0e936a0>] lu_object_put+0x190/0x3d0 [obdclass]
[414687.795582]  [<ffffffffc1750b56>] ofd_lvbo_update+0x5e6/0xdf0 [ofd]
[414687.801980]  [<ffffffffc10a2c89>] ldlm_request_cancel+0x3c9/0x740 [ptlrpc]
[414687.809008]  [<ffffffffc10a77da>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
[414687.815859]  [<ffffffffc10a7ac8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[414687.822885]  [<ffffffffc10d8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[414687.830693]  [<ffffffffc10dc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[414687.837115]  [<ffffffffb74c1c31>] kthread+0xd1/0xe0
[414687.842115]  [<ffffffffb7b74c24>] ret_from_fork_nospec_begin+0xe/0x21
[414687.848677]  [<ffffffffffffffff>] 0xffffffffffffffff
[414687.853780] Kernel panic - not syncing: LBUG

 
Shortly after restarting the OSS, one MDS, fir-md1-s2 (serving MDT0001 and MDT0003), crashed:

[2687144.065647] Lustre: fir-MDT0003: Client ddea348b-e5a4-5330-325a-755d459e8dda (at 10.9.107.57@o2ib4) reconnecting
[2687144.076003] Lustre: Skipped 1 previous similar message
[2689300.253654] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) ASSERTION( o->opo_reserved == 0 ) failed: 
[2689300.264453] LustreError: 124673:0:(osp_object.c:1458:osp_declare_create()) LBUG
[2689300.271942] Pid: 124673, comm: mdt00_044 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[2689300.281942] Call Trace:
[2689300.284583]  [<ffffffffc0a397cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[2689300.291332]  [<ffffffffc0a3987c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[2689300.297719]  [<ffffffffc15aecc5>] osp_declare_create+0x5a5/0x5b0 [osp]
[2689300.304462]  [<ffffffffc14f639f>] lod_sub_declare_create+0xdf/0x210 [lod]
[2689300.311466]  [<ffffffffc14f5904>] lod_qos_prep_create+0x15d4/0x1890 [lod]
[2689300.318467]  [<ffffffffc14d6bba>] lod_declare_instantiate_components+0x9a/0x1d0 [lod]
[2689300.326511]  [<ffffffffc14e94d5>] lod_declare_layout_change+0xb65/0x10f0 [lod]
[2689300.333948]  [<ffffffffc155b102>] mdd_declare_layout_change+0x62/0x120 [mdd]
[2689300.341211]  [<ffffffffc1563e52>] mdd_layout_change+0x882/0x1000 [mdd]
[2689300.347964]  [<ffffffffc13ca327>] mdt_layout_change+0x337/0x430 [mdt]
[2689300.354636]  [<ffffffffc13d243e>] mdt_intent_layout+0x7ee/0xcc0 [mdt]
[2689300.361292]  [<ffffffffc13cfa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[2689300.367949]  [<ffffffffc0d6dd46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[2689300.374893]  [<ffffffffc0d96707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[2689300.382189]  [<ffffffffc0e1d6e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[2689300.388531]  [<ffffffffc0e2473a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[2689300.395657]  [<ffffffffc0dc8d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2689300.403563]  [<ffffffffc0dcc63c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[2689300.410077]  [<ffffffff9e6c1c31>] kthread+0xd1/0xe0
[2689300.415164]  [<ffffffff9ed74c24>] ret_from_fork_nospec_begin+0xe/0x21
[2689300.421813]  [<ffffffffffffffff>] 0xffffffffffffffff
[2689300.427016] Kernel panic - not syncing: LBUG

And now I am not able to restart the filesystem, even with -o abort_recov everywhere. The OSTs restart, but MDT0000 won't: it keeps trying to perform a recovery.
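
For reference, recovery abort was attempted roughly as follows (the device path is only an example; our MDTs are dm-multipath devices):

mount -t lustre -o abort_recov /dev/mapper/<mdt-device> /mnt/fir-mdt0    # mount with recovery aborted
lctl --device fir-MDT0000 abort_recovery                                 # abort recovery on an already-mounted target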

May 29 15:59:43 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
May 29 15:59:49 fir-md1-s1 kernel: Lustre: DEBUG MARKER: Wed May 29 15:59:49 2019
May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): recovery complete
May 29 16:00:25 fir-md1-s1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
May 29 16:00:25 fir-md1-s1 kernel: Lustre: MGS: Connection restored to MGC10.0.10.51@o2ib7_0 (at 0@lo)
May 29 16:00:26 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 1495a992-a81e-7de9-50a8-88376d880671 (at 10.8.12.32@o2ib6)
May 29 16:00:26 fir-md1-s1 kernel: Lustre: Skipped 39 previous similar messages
May 29 16:00:27 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 0e359090-33c6-a6a0-317c-1b138ecd9b87 (at 10.9.105.16@o2ib4)
May 29 16:00:27 fir-md1-s1 kernel: Lustre: Skipped 72 previous similar messages

May 29 16:00:29 fir-md1-s1 kernel: Lustre: MGS: Connection restored to d6e5f2ed-c606-14e7-85a0-13787ff0467b (at 10.9.105.69@o2ib4)
May 29 16:00:29 fir-md1-s1 kernel: Lustre: Skipped 35 previous similar messages
May 29 16:00:33 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 2996d077-09af-23fe-fe18-ce9b6359bb8e (at 10.8.21.35@o2ib6)
May 29 16:00:33 fir-md1-s1 kernel: Lustre: Skipped 263 previous similar messages
May 29 16:00:43 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 545852b2-0fa4-d28f-21da-1dc3deae2f8b (at 10.9.115.1@o2ib4)
May 29 16:00:43 fir-md1-s1 kernel: Lustre: Skipped 177 previous similar messages

May 29 16:01:10 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 98b9580a-8059-3273-b7c7-9f03a2d4d87f (at 10.9.112.7@o2ib4)
May 29 16:01:10 fir-md1-s1 kernel: Lustre: Skipped 6 previous similar messages
May 29 16:01:41 fir-md1-s1 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
May 29 16:01:42 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 4030ea6e-e97d-b5db-fa13-470ec3d51337 (at 10.8.1.8@o2ib6)
May 29 16:01:42 fir-md1-s1 kernel: Lustre: Skipped 40 previous similar messages
May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): recovery complete
May 29 16:02:23 fir-md1-s1 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.8.31.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:24 fir-md1-s1 kernel: Lustre: fir-MDT0000: Not available for connect from 10.8.18.10@o2ib6 (not set up)
May 29 16:02:24 fir-md1-s1 kernel: Lustre: Skipped 3 previous similar messages
May 29 16:02:24 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.17.10@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:24 fir-md1-s1 kernel: LustreError: Skipped 4 previous similar messages
May 29 16:02:26 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.102.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:26 fir-md1-s1 kernel: LustreError: Skipped 1 previous similar message
May 29 16:02:28 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.0.10.106@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:28 fir-md1-s1 kernel: LustreError: Skipped 13 previous similar messages
May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:02:29 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 2 seconds
May 29 16:02:31 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
May 29 16:02:33 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.0.10.3@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:33 fir-md1-s1 kernel: LustreError: Skipped 12 previous similar messages
May 29 16:02:38 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 1 seconds
May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDT0000: Imperative Recovery not enabled, recovery window 300-900
May 29 16:02:38 fir-md1-s1 kernel: Lustre: fir-MDD0000: changelog on
May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(mdt_handler.c:6447:mdt_iocontrol()) fir-MDT0000: Aborting recovery for device
May 29 16:02:38 fir-md1-s1 kernel: LustreError: 20624:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0000: Aborting recovery
May 29 16:02:38 fir-md1-s1 kernel: Lustre: 20890:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
May 29 16:02:41 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.9.108.30@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:41 fir-md1-s1 kernel: LustreError: Skipped 29 previous similar messages
May 29 16:02:44 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (86 recovered, 49 in progress, and 0 evicted) already passed deadline 8:07
May 29 16:02:46 fir-md1-s1 kernel: Lustre: fir-MDT0000: Connection restored to 98410362-8850-0375-cdb0-3dd7ebb1ce58 (at 10.9.107.28@o2ib4)
May 29 16:02:46 fir-md1-s1 kernel: Lustre: Skipped 1171 previous similar messages
May 29 16:02:57 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.1.6@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:02:57 fir-md1-s1 kernel: LustreError: Skipped 803 previous similar messages
May 29 16:03:02 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:03:12 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (802 recovered, 565 in progress, and 0 evicted) already passed deadline 8:34
May 29 16:03:27 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:03:30 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.20.22@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:03:30 fir-md1-s1 kernel: LustreError: Skipped 2086 previous similar messages
May 29 16:03:37 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 8:59
May 29 16:04:02 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:24
May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:04:17 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
May 29 16:04:27 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 9:50
May 29 16:04:34 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0002_UUID: not available for connect from 10.8.8.19@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
May 29 16:04:34 fir-md1-s1 kernel: LustreError: Skipped 3903 previous similar messages
May 29 16:04:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
May 29 16:04:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 16:04:38 fir-md1-s1 kernel: mount.lustre    D ffff946ebeba9040     0 20624  20623 0x00000082
May 29 16:04:38 fir-md1-s1 kernel: Call Trace:
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c167c49>] schedule+0x29/0x70
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
May 29 16:04:38 fir-md1-s1 kernel:  [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27
May 29 16:04:52 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:15
May 29 16:05:17 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 10:40
May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.52@o2ib7: 0 seconds
May 29 16:05:32 fir-md1-s1 kernel: LNet: 20145:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 2 previous similar messages
May 29 16:06:06 fir-md1-s1 kernel: LNet: Service thread pid 20981 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 29 16:06:06 fir-md1-s1 kernel: Pid: 20981, comm: mdt00_026 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
May 29 16:06:06 fir-md1-s1 kernel: Call Trace:
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1573bf8>] osp_precreate_reserve+0x2e8/0x800 [osp]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc15688b9>] osp_declare_create+0x199/0x5b0 [osp]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ee39f>] lod_sub_declare_create+0xdf/0x210 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e656e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e950e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ed62d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18eddd5>] lod_prepare_create+0x215/0x2e0 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ddcee>] lod_declare_striped_create+0x1ee/0x980 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e5058>] lod_declare_xattr_set+0x138/0x1230 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc193e065>] mdo_declare_xattr_set+0x65/0x150 [mdd]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1947af0>] mdd_declare_create+0x550/0xcb0 [mdd]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc194b6e7>] mdd_create+0x827/0x1440 [mdd]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc181c018>] mdt_reint_open+0x1aa8/0x28b0 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc180eed3>] mdt_reint_rec+0x83/0x210 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17ed143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17f9982>] mdt_intent_open+0x82/0x350 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17f7a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0e79d46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ea2707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0f296e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0f3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ed4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc0ed863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffff9bac1c31>] kthread+0xd1/0xe0
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffff9c174c24>] ret_from_fork_nospec_begin+0xe/0x21
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
May 29 16:06:06 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1559171166.20981
May 29 16:06:07 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client c51c4531-a6f0-1587-ed6e-533795d105f6(at 10.8.23.12@o2ib6), waiting for 1403 known clients (818 recovered, 580 in progress, and 0 evicted) already passed deadline 11:30
May 29 16:06:07 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
May 29 16:06:38 fir-md1-s1 kernel: INFO: task mount.lustre:20624 blocked for more than 120 seconds.
May 29 16:06:38 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 16:06:38 fir-md1-s1 kernel: mount.lustre    D ffff946ebeba9040     0 20624  20623 0x00000082
May 29 16:06:38 fir-md1-s1 kernel: Call Trace:
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c167c49>] schedule+0x29/0x70
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c165721>] schedule_timeout+0x221/0x2d0
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bad67c2>] ? default_wake_function+0x12/0x20
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bac2d2b>] ? autoremove_wake_function+0x2b/0x40
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c167ffd>] wait_for_completion+0xfd/0x140
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bad67b0>] ? wake_up_state+0x20/0x20
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0e8139d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0e81448>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc17f4868>] mdt_iocontrol+0x558/0xb00 [mdt]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0cb2766>] server_start_targets+0x1c66/0x2a20 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c86b10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c7fea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0cb45ec>] server_fill_super+0x10cc/0x1890 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c89828>] lustre_fill_super+0x328/0x950 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c89500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc452cf>] mount_nodev+0x4f/0xb0
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffffc0c81908>] lustre_mount+0x38/0x60 [obdclass]
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc45e4e>] mount_fs+0x3e/0x1b0
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc639e7>] vfs_kern_mount+0x67/0x110
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc6600f>] do_mount+0x1ef/0xce0
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc3e2aa>] ? __check_object_size+0x1ca/0x250
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc1c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9bc66e43>] SyS_mount+0x83/0xd0
May 29 16:06:38 fir-md1-s1 kernel:  [<ffffffff9c174ddb>] system_call_fastpath+0x22/0x27

 

The mount commands are stuck for all 4 MDTs.

While I'm writing this, we're trying to restart everything with all LNet routers stopped, to avoid any recovery involving our compute nodes on Sherlock. My hope is that after the recovery window times out, the MDTs will actually mount. Any ideas are welcome. Thanks!



 Comments   
Comment by Stephane Thiell [ 29/May/19 ]

Posted logs from trying to mount the MDTs with abort_recov and routers down; it didn't work:

fir-md1-s1.log

fir-md1-s2.log

Comment by Stephane Thiell [ 30/May/19 ]

Now, with routers down and without abort_recov, the mount commands seem to have completed after the recovery timeout. We'll restart the routers.

Comment by Patrick Farrell (Inactive) [ 30/May/19 ]

Stephane,

What's your current status?  Hit me on Skype if you're still having issues and let's try to troubleshoot.

Comment by Patrick Farrell (Inactive) [ 30/May/19 ]

Stephane,

Once you're back to normal, do you have a crash dump for that OSS crash?  Or at least dmesg from that OSS?

Comment by Stephane Thiell [ 30/May/19 ]

Hi Patrick! Thanks! It is back online. We restarted the routers after the filesystem; one of the routers then generated some RDMA errors, so we simply shut it down (there are enough of them anyway), and now I think it's stable. Scary though.

 

fir-io3-s2: May 29 17:09:24 fir-io3-s2 kernel: LNetError: 98291:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.10.210@o2ib7 (0): c: 0, oc: 2, rc: 8

I can provide more logs showing the origin of the issue, but I can't do it right now; I'll be online later if needed. Thanks!

Comment by Patrick Farrell (Inactive) [ 30/May/19 ]

OK, great - We'll look at this on a non-emergency basis then.  Thanks, and sorry for the trouble.

Comment by Stephane Thiell [ 07/Jun/19 ]

Hi Patrick,

The vmcore-dmesg.txt for that OSS is attached as fir-io4-s2-2019-05-29-12-51-55-vmcore-dmesg.txt

I also attached the output of foreach bt as fir-io4-s2-2019-05-29-12-51-55-foreach-bt.txt

The crash dump is available on the FTP as LU-12360-fir-io4-s2-2019-05-29-12-51-55-vmcore
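
For reference, the foreach bt output was collected with the crash utility against this dump, roughly as follows (the vmlinux/debuginfo path is an assumption for our kernel):

crash /usr/lib/debug/lib/modules/3.10.0-957.1.3.el7_lustre.x86_64/vmlinux vmcore   # open the dump
crash> foreach bt                                                                  # backtraces of all tasks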

Let me know what you think...
Thanks!

Comment by Stephane Thiell [ 16/Jun/19 ]

Following an MDS LBUG on Fir this weekend (LU-11967, ASSERTION( o->opo_reserved == 0 )), it took us a few hours to put the filesystem back online because the recovery for fir-MDT0001 wouldn't finish, nor would it finish when using -o abort_recov or lctl ... abort_recovery. This is very concerning. Is it a current limitation of DNE? Here are the logs showing the first restart after the crash on fir-md1-s2, where MDT0001 and MDT0003 are hosted. You can see that MDT0003 does finish its recovery but MDT0001 does not:

[Sat Jun 15 09:15:33 2019][28357.705433] Lustre: fir-MDT0003: Denying connection for new client fc9cbc0d-41e6-18a0-ddfe-91c390cc7652(at 10.9.108.7@o2ib4), waiting for 1397 known clients (1308 recovered, 68 in progress, and 0 evicted) already passed deadline 4:27^M
[Sat Jun 15 09:15:33 2019][28357.726111] Lustre: Skipped 143 previous similar messages^M
[Sat Jun 15 09:16:06 2019][28390.961578] Lustre: fir-MDT0003: recovery is timed out, evict stale exports^M
[Sat Jun 15 09:16:06 2019][28390.968822] Lustre: fir-MDT0003: disconnecting 21 stale clients^M
[Sat Jun 15 09:16:07 2019][28392.314659] Lustre: fir-MDT0003-osp-MDT0001: Connection restored to 10.0.10.52@o2ib7 (at 0@lo)^M
[Sat Jun 15 09:16:07 2019][28392.317401] Lustre: fir-MDT0003: Recovery over after 5:01, of 1397 clients 1376 recovered and 21 were evicted.^M
[Sat Jun 15 09:16:07 2019][28392.333296] Lustre: Skipped 4 previous similar messages^M
[Sat Jun 15 09:16:08 2019][28392.559315] LustreError: 11-0: fir-MDT0000-lwp-MDT0003: operation quota_acquire to node 10.0.10.51@o2ib7 failed: rc = -11^M
[Sat Jun 15 09:16:08 2019][28392.570287] LustreError: Skipped 184 previous similar messages^M
[Sat Jun 15 09:16:15 2019][28400.177816] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
[Sat Jun 15 09:16:15 2019][28400.185066] Lustre: fir-MDT0001: disconnecting 21 stale clients^M
[Sat Jun 15 09:16:22 2019][28407.240015] Lustre: 98362:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply^M
[Sat Jun 15 09:16:22 2019][28407.240015]   req@ffff9ee5238fef00 x1631565401531536/t0(0) o101->f226e60d-c642-72ed-451c-28866e7cb4cc@10.9.103.26@o2ib4:27/0 lens 584/3264 e 1 to 0 dl 1560615387 ref 2 fl Interpret:/0/0 rc 0/0^M
[Sat Jun 15 09:17:08 2019][28453.018851] LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.8.8.37@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.^M
[Sat Jun 15 09:19:38 2019][28602.454583] LustreError: 99405:0:(mdt_handler.c:6447:mdt_iocontrol()) fir-MDT0001: Aborting recovery for device^M
[Sat Jun 15 09:19:38 2019][28602.464690] LustreError: 99405:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0001: Aborting recovery^M
[Sat Jun 15 09:19:51 2019][28615.499651] Lustre: fir-MDT0001: Denying connection for new client 304180e1-aa68-a4a4-ed4c-9536f53351a5(at 10.8.1.21@o2ib6), waiting for 1397 known clients (1324 recovered, 52 in progress, and 21 evicted) already passed deadline 7:35^M
[Sat Jun 15 09:19:51 2019][28615.520312] Lustre: Skipped 168 previous similar messages^M
[Sat Jun 15 09:19:54 2019][28618.365990] Lustre: fir-MDT0003: haven't heard from client bea0167a-f4ba-4bef-afde-9099abf4f13e (at 10.8.8.37@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9ee52f691c00, cur 1560615594 expire 1560615444 last 1560615367^M

shine (a tool that parses /proc or /sys) indicated that the recovery was running, but at the end "0s" of recovery remained and it stayed like that.
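
The same state can also be read directly on the MDS, e.g.:

lctl get_param mdt.fir-MDT0001.recovery_status   # same data as /proc/fs/lustre/mdt/fir-MDT0001/recovery_status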

Second attempt:

[Sat Jun 15 10:04:00 2019][ 1090.090394] LustreError: 19769:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1560618150, 90s ago), entering recovery for MGS@10.0.10.51@o2ib7 ns: MGC10.0.10.51@o2ib7 lock: ffffa0c3998306c0/0x90311c0f977a450c lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x16a9799edea112f1 expref: -99 pid: 19769 timeout: 0 lvb_type: 0^M
[Sat Jun 15 10:04:00 2019][ 1090.127635] LustreError: 20798:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffffa0c3ace14b40) refcount nonzero (1) after lock cleanup; forcing cleanup.^M
[Sat Jun 15 10:04:08 2019][ 1097.501491] Lustre: 19974:0:(nodemap_member.c:122:nm_member_add()) export ffffa0c3a8a46c00 bc38572c-4dfd-e060-15f5-2bafa5ab8152 already hashed, failed to add to nodemap default already member of default^M
[Sat Jun 15 10:04:18 2019][ 1107.244494] Lustre: fir-MDT0001: Recovery already passed deadline 4:55, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.^M
[Sat Jun 15 10:04:18 2019][ 1107.260822] Lustre: Skipped 1691 previous similar messages^M
[Sat Jun 15 10:04:22 2019][ 1111.536233] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
[Sat Jun 15 10:04:22 2019][ 1111.543495] Lustre: fir-MDT0001: disconnecting 4 stale clients^M
[Sat Jun 15 10:04:48 2019][ 1137.522495] LNetError: 7364:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)^M
[Sat Jun 15 10:04:48 2019][ 1137.547954] Lustre: 20053:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (100:14s); client may timeout.  req@ffffa0c398b97500 x1634475767709184/t0(0) o38-><?>@<?>:0/0 lens 520/192 e 0 to 0 dl 1560618274 ref 1 fl Complete:/0/0 rc -19/-19^M
[Sat Jun 15 10:05:16 2019][ 1165.524022] LNetError: 19577:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)^M
[Sat Jun 15 10:05:22 2019][ 1171.697797] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
[Sat Jun 15 10:05:22 2019][ 1171.705051] Lustre: fir-MDT0001: disconnecting 1 stale clients^M
[Sat Jun 15 10:05:27 2019][ 1176.524359] LNetError: 19571:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)^M
[Sat Jun 15 10:05:36 2019][ 1185.892184] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail^M
[Sat Jun 15 10:05:36 2019][ 1185.906011] LustreError: 19769:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1560618246, 90s ago), entering recovery for MGS@10.0.10.51@o2ib7 ns: MGC10.0.10.51@o2ib7 lock: ffffa0c399831d40/0x90311c0f977a46be lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x16a9799edea13a12 expref: -99 pid: 19769 timeout: 0 lvb_type: 0^M
[Sat Jun 15 10:05:36 2019][ 1185.943292] LustreError: 20818:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffffa0c3ace15200) refcount nonzero (1) after lock cleanup; forcing cleanup.^M
[Sat Jun 15 10:05:53 2019][ 1202.674655] Lustre: fir-MDT0003: recovery is timed out, evict stale exports^M
[Sat Jun 15 10:05:53 2019][ 1202.681944] Lustre: fir-MDT0003: disconnecting 6 stale clients^M
[Sat Jun 15 10:06:01 2019][ 1210.995013] LustreError: 11-0: fir-OST000c-osc-MDT0001: operation ost_connect to node 10.0.10.103@o2ib7 failed: rc = -16^M
[Sat Jun 15 10:06:01 2019][ 1211.005897] LustreError: Skipped 239 previous similar messages^M
[Sat Jun 15 10:06:16 2019][ 1225.525191] LNet: 19565:0:(o2iblnd_cb.c:1484:kiblnd_reconnect_peer()) Abort reconnection of 10.0.10.211@o2ib7: accepting^M
[Sat Jun 15 10:06:22 2019][ 1231.859449] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
[Sat Jun 15 10:06:22 2019][ 1231.866717] Lustre: fir-MDT0001: disconnecting 3 stale clients^M
[Sat Jun 15 10:06:37 2019][ 1247.069499] LustreError: 20372:0:(ldlm_lib.c:2840:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffffa0b37b72ad00 x1636333431914752/t0(323735050166) o101->871fc187-81d5-1f51-125e-0c60fb0a4b64@10.9.0.61@o2ib4:7/0 lens 1168/0 e 0 to 0 dl 1560618427 ref 2 fl Interpret:/6/ffffffff rc 0/-1^M
[Sat Jun 15 10:06:45 2019][ 1254.237810] LustreError: 20837:0:(ldlm_lib.c:2840:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffffa0a3b37b7c50 x1635228444455280/t0(323735075475) o4->5856c966-b502-541a-bf79-fec68258d993@10.9.101.29@o2ib4:21/0 lens 488/0 e 0 to 0 dl 1560618411 ref 2 fl Interpret:/6/ffffffff rc 0/-1^M
[Sat Jun 15 10:06:45 2019][ 1254.264895] LustreError: 20837:0:(ldlm_lib.c:2840:target_queue_recovery_request()) Skipped 1 previous similar message^M
[Sat Jun 15 10:06:52 2019][ 1262.030167] LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.9.104.62@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.^M
[Sat Jun 15 10:06:52 2019][ 1262.047534] LustreError: Skipped 5389 previous similar messages^M
[Sat Jun 15 10:06:53 2019][ 1262.836300] Lustre: fir-MDT0003: recovery is timed out, evict stale exports^M
[Sat Jun 15 10:06:53 2019][ 1262.843565] Lustre: fir-MDT0003: disconnecting 2 stale clients^M
[Sat Jun 15 10:06:57 2019][ 1266.977269] LustreError: 20837:0:(ldlm_lib.c:2840:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffffa0a3b37b7850 x1634768088300608/t0(323735062221) o4->1ceec6a5-3003-d3c3-15d1-22344d794be9@10.9.101.12@o2ib4:3/0 lens 2472/0 e 0 to 0 dl 1560618423 ref 2 fl Interpret:/6/ffffffff rc 0/-1^M
[Sat Jun 15 10:07:00 2019][ 1269.537317] LustreError: 20234:0:(mdt_io.c:470:mdt_preprw_write()) fir-MDT0001: WRITE IO to missing obj [0x24002b8db:0xd1e5:0x0]: rc = -2^M
[Sat Jun 15 10:07:06 2019][ 1275.966688] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail^M

Third attempt:

[Sat Jun 15 10:58:43 2019][ 1499.322692] Lustre: fir-MDT0003: Recovery already passed deadline 3:27, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.^M
[Sat Jun 15 10:58:43 2019][ 1499.339032] Lustre: Skipped 1608 previous similar messages^M
[Sat Jun 15 10:58:45 2019][ 1500.942918] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
[Sat Jun 15 10:58:45 2019][ 1500.950104] Lustre: fir-MDT0001: disconnecting 5 stale clients^M
[Sat Jun 15 10:58:46 2019][ 1502.433377] LustreError: 20301:0:(ldlm_lib.c:2840:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff9f36063f6900 x1635095002346752/t0(323735097417) o36->40afa9c7-3abc-7aee-4e4a-0babffe2282c@10.8.17.22@o2ib6:16/0 lens 488/0 e 0 to 0 dl 1560621556 ref 2 fl Interpret:/6/ffffffff rc 0/-1^M
[Sat Jun 15 10:59:24 2019][ 1540.003218] LNetError: 19036:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)^M
[Sat Jun 15 10:59:31 2019][ 1547.229133] LNet: 19024:0:(o2iblnd_cb.c:1484:kiblnd_reconnect_peer()) Abort reconnection of 10.0.10.203@o2ib7: accepting^M
[Sat Jun 15 10:59:45 2019][ 1561.104563] Lustre: fir-MDT0001: recovery is timed out, evict stale exports^M
[Sat Jun 15 10:59:45 2019][ 1561.111750] Lustre: fir-MDT0001: disconnecting 2 stale clients^M
[Sat Jun 15 10:59:59 2019][ 1574.528934] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail^M
[Sat Jun 15 10:59:59 2019][ 1574.542768] LustreError: 19913:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1560621509, 90s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff9f46262d5100/0x7228bba165f851da lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x130a68c4135d486d expref: -99 pid: 19913 timeout: 0 lvb_type: 0^M
[Sat Jun 15 11:00:05 2019][ 1580.605107] Lustre: 21053:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply^M
[Sat Jun 15 11:00:05 2019][ 1580.605107]   req@ffff9f4624f54200 x1636429568502512/t0(0) o400->fir-MDT0003-lwp-OST0014_UUID@10.0.10.103@o2ib7:10/0 lens 224/0 e 0 to 0 dl 1560621610 ref 2 fl New:H/0/ffffffff rc 0/-1^M
[Sat Jun 15 11:00:06 2019][ 1581.605138] Lustre: 21069:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply^M
[Sat Jun 15 11:00:06 2019][ 1581.605138]   req@ffff9f3605abf500 x1636429568502160/t0(0) o400->fir-MDT0001-lwp-OST000a_UUID@10.0.10.101@o2ib7:11/0 lens 224/0 e 0 to 0 dl 1560621611 ref 2 fl New:H/0/ffffffff rc 0/-1^M
[Sat Jun 15 11:00:06 2019][ 1581.633344] Lustre: 21069:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 11 previous similar messages^M
[Sat Jun 15 11:00:07 2019][ 1582.607185] Lustre: 21069:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply^M
[Sat Jun 15 11:00:07 2019][ 1582.607185]   req@ffff9f35ef72a700 x1636429568502448/t0(0) o400->fir-MDT0003-lwp-OST001d_UUID@10.0.10.106@o2ib7:12/0 lens 224/0 e 0 to 0 dl 1560621612 ref 2 fl New:H/0/ffffffff rc 0/-1^M
[Sat Jun 15 11:00:07 2019][ 1582.635386] Lustre: 21069:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 47 previous similar messages^M
[Sat Jun 15 11:00:10 2019][ 1586.227260] LNet: 19024:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.209@o2ib7: 0 seconds^M
[Sat Jun 15 11:00:13 2019][ 1589.230231] LNet: 19024:0:(o2iblnd_cb.c:1484:kiblnd_reconnect_peer()) Abort reconnection of 10.0.10.212@o2ib7: accepting^M
[Sat Jun 15 11:00:16 2019][ 1592.081411] Lustre: fir-MDT0003: recovery is timed out, evict stale exports^M
[Sat Jun 15 11:00:16 2019][ 1592.088644] Lustre: fir-MDT0003: disconnecting 2 stale clients^M
[Sat Jun 15 11:00:19 2019][ 1595.329629] Lustre: fir-MDT0003: Export ffff9f4636332800 already connecting from 10.8.19.1@o2ib6^M
[Sat Jun 15 11:00:22 2019][ 1598.243240] Lustre: fir-MDT0003: Export ffff9f463ec5a000 already connecting from 10.8.10.9@o2ib6^M

The only solution we found to put the filesystem back online was to get rid of the LNet routers, mount the robinhood client (on the same fabric, so no routers involved) to start the recovery, and then wait for the recovery to time out.

I also noticed an interesting thing: if you umount the MDT while in recovery, it seems to complete its recovery just before the umount, but unfortunately that's not enough to make it work the next time we try to mount it:

[Sat Jun 15 11:15:24 2019][ 2499.726764] Lustre: Failing over fir-MDT0001^M
[Sat Jun 15 11:15:24 2019][ 2499.769929] LustreError: 21982:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0001: Aborting recovery^M
[Sat Jun 15 11:15:24 2019][ 2499.870073] LNetError: 19039:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 223 previous similar messages^M
[Sat Jun 15 11:15:27 2019][ 2502.826656] LustreError: 23345:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff9f46266b2780) refcount nonzero (1) after lock cleanup; forcing cleanup.^M
[Sat Jun 15 11:15:27 2019][ 2502.846197] LustreError: 23345:0:(ldlm_resource.c:1146:ldlm_resource_complain()) Skipped 3 previous similar messages^M
[Sat Jun 15 11:15:27 2019][ 2502.856890] Lustre: Failing over fir-MDT0003^M
[Sat Jun 15 11:15:33 2019][ 2508.542569] LustreError: 20306:0:(lod_qos.c:213:lod_statfs_and_check()) fir-MDT0001-mdtlov: statfs: rc = -108^M
[Sat Jun 15 11:15:33 2019][ 2508.552687] Lustre: 20306:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (380:920s); client may timeout.  req@ffff9f4625a7dd00 x1636333431914752/t0(323735050166) o101->871fc187-81d5-1f51-125e-0c60fb0a4b64@10.9.0.61@o2ib4:13/0 lens 1168/560 e 14 to 0 dl 1560621613 ref 1 fl Complete:/4/0 rc -19/-19^M
[Sat Jun 15 11:15:33 2019][ 2508.552911] LNet: Service thread pid 20316 completed after 1289.78s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).^M
[Sat Jun 15 11:15:33 2019][ 2508.552913] LNet: Skipped 326 previous similar messages^M
[Sat Jun 15 11:15:33 2019][ 2508.552983] Lustre: 20165:0:(ldlm_lib.c:2048:target_recovery_overseer()) fir-MDT0001 recovery is aborted by hard timeout^M
[Sat Jun 15 11:15:33 2019][ 2508.552987] Lustre: 20165:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery^M
[Sat Jun 15 11:15:33 2019][ 2508.553002] Lustre: 20165:0:(ldlm_lib.c:1617:abort_req_replay_queue()) @@@ aborted:  req@ffff9f263f547500 x1631570405023808/t0(323735068936) o101->e70978d5-4079-f244-8878-d79e9ac6d1e1@10.8.28.5@o2ib6:11/0 lens 1152/0 e 26 to 0 dl 1560621911 ref 1 fl Complete:/4/ffffffff rc 0/-1^M
[Sat Jun 15 11:15:33 2019][ 2508.553321] LustreError: 20165:0:(ldlm_lib.c:1638:abort_lock_replay_queue()) @@@ aborted:  req@ffff9f263071a100 x1631563950806768/t0(0) o101->aebbd095-b50a-7733-a754-e1423fab741f@10.8.18.10@o2ib6:1/0 lens 328/0 e 27 to 0 dl 1560621931 ref 1 fl Complete:/40/ffffffff rc 0/-1^M
[Sat Jun 15 11:15:33 2019][ 2508.583130] Lustre: 20165:0:(ldlm_lib.c:2554:target_recovery_thread()) too long recovery - read logs^M
[Sat Jun 15 11:15:33 2019][ 2508.583179] LustreError: dumping log to /tmp/lustre-log.1560622533.20165^M
[Sat Jun 15 11:15:33 2019][ 2508.583180] Lustre: fir-MDT0001: Recovery over after 21:48, of 1368 clients 0 recovered and 1368 were evicted.^M
[Sat Jun 15 11:15:33 2019][ 2508.611905] Lustre: 20165:0:(mdt_handler.c:6544:mdt_postrecov()) fir-MDT0001: auto trigger paused LFSCK failed: rc = -6^M
[Sat Jun 15 11:15:33 2019][ 2508.710964] Lustre: 20306:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 9072 previous similar messages^M
[Sat Jun 15 11:15:35 2019][ 2511.232379] Lustre: server umount fir-MDT0001 complete^M

Again, you will see in the logs that I tried several times to do an abort_recovery (using lctl), but this only printed a message and seemed to do nothing.

Comment by Stephane Thiell [ 18/Jun/19 ]

We're having this issue of endless MDT recovery again. Peter, could we increase the severity?

Comment by Stephane Thiell [ 18/Jun/19 ]

This time it's fir-MDT0002:

[root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1560883302
recovery_duration: 115
completed_clients: 1400/1400
replayed_requests: 0
last_transno: 403936479430
VBR: DISABLED
IR: DISABLED
[root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
status: RECOVERING
recovery_start: 1560884046
time_remaining: 0
connected_clients: 1400/1400
req_replay_clients: 53
lock_repay_clients: 219
completed_clients: 1181
evicted_clients: 0
replayed_requests: 1
queued_requests: 52
next_transno: 344373951742

Comment by Stephane Thiell [ 18/Jun/19 ]

The only way to put it back is to remove the routers; recovery then does complete, but it evicts all clients:

[root@fir-md1-s1 ~]# lctl route_list
[root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
status: RECOVERING
recovery_start: 1560885207
time_remaining: 56
connected_clients: 1399/1400
req_replay_clients: 1395
lock_repay_clients: 1395
completed_clients: 4
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 344373951730


[root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
status: RECOVERING
recovery_start: 1560885207
time_remaining: 2
connected_clients: 1399/1400
req_replay_clients: 1395
lock_repay_clients: 1395
completed_clients: 4
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 344373951730

[root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
status: RECOVERING
recovery_start: 1560885207
time_remaining: 0
connected_clients: 1399/1400
req_replay_clients: 1395
lock_repay_clients: 1395
completed_clients: 4
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 344373951730

[root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
status: RECOVERING
recovery_start: 1560885207
time_remaining: 59
connected_clients: 1399/1400
req_replay_clients: 1395
lock_repay_clients: 1395
completed_clients: 4
evicted_clients: 1
replayed_requests: 0
queued_requests: 0
next_transno: 344373951730


[root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
status: RECOVERING
recovery_start: 1560885207
time_remaining: 2
connected_clients: 1399/1400
req_replay_clients: 1395
lock_repay_clients: 1395
completed_clients: 4
evicted_clients: 1
replayed_requests: 0
queued_requests: 0
next_transno: 344373951730
[root@fir-md1-s1 ~]# cat /proc/fs/lustre/mdt/fir-MDT0002/recovery_status
status: COMPLETE
recovery_start: 1560885207
recovery_duration: 361
completed_clients: 4/1400
replayed_requests: 0
last_transno: 344373951729
VBR: ENABLED
IR: DISABLED

Comment by Peter Jones [ 19/Jun/19 ]

>  Peter could we increase the severity?

Well, not as such, it's already at the highest severity level, but we can increase the priority at which this is being worked. I'll try and connect with Patrick to discuss next steps.

Comment by Stephane Thiell [ 19/Jun/19 ]

Ah yes, whoops, sorry. Thanks for checking with Patrick. Let me know how I can help next time it happens, for example whether taking a crash dump when the recovery is stuck at 0s would help, or something else.
Perhaps it's DNE related. It has never happened on MDT0000 as far as I can remember.
Thanks again.

Comment by Lai Siyao [ 21/Jun/19 ]

The "lctl abort_recov" hug issue is tracked on LU-11761, and there is a patch in review.
MDS crash on "osp_declare_create()) ASSERTION( o->opo_reserved == 0 )" is tracked on LU-11967.
I'm looking into the OSS crash "osd_object_release()) ASSERTION( !(o->oo_destroyed == 0 && o->oo_inode && o->oo_inode->i_nlink == 0) )" now.

Comment by Stephane Thiell [ 21/Jun/19 ]

Thanks Lai!

Comment by Patrick Farrell (Inactive) [ 25/Jun/19 ]

Stephane,

Since this is a duplicate of LU-11761, I'm going to resolve it as such, and you can track progress there.

Comment by Patrick Farrell (Inactive) [ 26/Jun/19 ]

Stephane, Lai,

I think this is a different abort_recov issue than LU-11761.

I don't see any evidence of the fld_client issue, which manifested itself as threads like lod0002_rec0001 spinning at 100% CPU.

Stephane,

You didn't see anything like that this time, correct?

More in a minute.

Comment by Stephane Thiell [ 26/Jun/19 ]

Patrick, I don't think I noticed the 100% CPU for this. Probably not a duplicate issue then, indeed. Thanks!

Comment by Patrick Farrell (Inactive) [ 26/Jun/19 ]

Thanks, Stephane.

The issue (roughly) appears to be this.  (I ran into this while trying to figure out how to reproduce this or LU-11761.)

These are more rough notes for myself, sorry; they will not be terribly intelligible.

MDT0000 is failing to finish recovery because it cannot abort recovery. It has told the various recovery threads that they need to stop, but there's another thread stuck, and it's waiting:

May 29 16:06:06 fir-md1-s1 kernel: LNet: Service thread pid 20981 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 29 16:06:06 fir-md1-s1 kernel: Pid: 20981, comm: mdt00_026 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
May 29 16:06:06 fir-md1-s1 kernel: Call Trace:
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1573bf8>] osp_precreate_reserve+0x2e8/0x800 [osp]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc15688b9>] osp_declare_create+0x199/0x5b0 [osp]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ee39f>] lod_sub_declare_create+0xdf/0x210 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e656e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e950e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ed62d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18eddd5>] lod_prepare_create+0x215/0x2e0 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18ddcee>] lod_declare_striped_create+0x1ee/0x980 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc18e5058>] lod_declare_xattr_set+0x138/0x1230 [lod]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc193e065>] mdo_declare_xattr_set+0x65/0x150 [mdd]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc1947af0>] mdd_declare_create+0x550/0xcb0 [mdd]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc194b6e7>] mdd_create+0x827/0x1440 [mdd]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc181c018>] mdt_reint_open+0x1aa8/0x28b0 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc180eed3>] mdt_reint_rec+0x83/0x210 [mdt]
May 29 16:06:06 fir-md1-s1 kernel:  [<ffffffffc17ed143>] mdt_reint_internal+0x6e3/0xaf0 [mdt] 

This is the thread in question.  I can't quite walk this back to blocking the recovery process, but I'm almost certain that what's happening is that recovery is unable to complete because this thread is unable to exit.  The issue is that even if the thread is woken up (which is the intended behavior during recovery), it's not going to complete, because of this check:

static int osp_precreate_ready_condition(const struct lu_env *env,
                                         struct osp_device *d)
{
        if (d->opd_pre_recovering)
                return 0;

So I believe even if the precreate thread is getting woken up, it's still not going to exit from here.

I've got an idea for a patch for this, but no clear idea of how to reproduce the problem.  I'll see about putting something together.

Comment by Gerrit Updater [ 28/Jun/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35360
Subject: LU-12360 osd-ldiskfs: don't assert on 0-nlink inode
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 413beb61df0471abab2e0b1a0d1b54943b63c8e4

Comment by Patrick Farrell (Inactive) [ 28/Jun/19 ]

We're going to use LU-12485 for the assertion.

Comment by Stephane Thiell [ 16/Jul/19 ]

Hi Patrick,

Any progress on this? We're still very concerned by this issue, and I liked your idea.

Thanks!

Comment by Stephane Thiell [ 10/Aug/19 ]

We hit the LU-11967 crash on an MDS last night, and then we had the same issue again: the MDT recovery could never complete…

[root@fir-md1-s2 127.0.0.1-2019-08-09-23:48:13]# cat /proc/fs/lustre/mdt/fir-MDT0001/recovery_status
status: RECOVERING
recovery_start: 1565445991
time_remaining: 0
connected_clients: 1388/1399
req_replay_clients: 165
lock_repay_clients: 247
completed_clients: 1141
evicted_clients: 11
replayed_requests: 10
queued_requests: 164
next_transno: 342221616219

[root@fir-md1-s2 127.0.0.1-2019-08-09-23:48:13]# cat /proc/fs/lustre/mdt/fir-MDT0003/recovery_status
status: RECOVERING
recovery_start: 1565445894
time_remaining: 0
connected_clients: 1390/1399
req_replay_clients: 44
lock_repay_clients: 137
completed_clients: 1253
evicted_clients: 9
replayed_requests: 2
queued_requests: 43
next_transno: 477760981504

A dump of the tasks, taken while the recovery was stuck with time_remaining at 0, is attached as fir-md1-s2-20190910-endlessrecov.sysrq-t.log, in the hope it can provide more insight. Thanks!
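
For reference, the sysrq-t dump was captured roughly like this (assuming access to the kernel log on the MDS):

echo t > /proc/sysrq-trigger    # dump the state of all tasks to the kernel log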

Generated at Sat Feb 10 02:51:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.