Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: Lustre 2.12.3
- Fix Version/s: None
- Environment: CentOS 7.6
- Severity: 1
Description
We currently cannot bring MDT0 back online on Fir (2.12.3) due to these backtraces and lock timeouts:
Nov 04 18:30:17 fir-md1-s1 kernel: Pid: 32408, comm: mdt01_024 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
Nov 04 18:30:17 fir-md1-s1 kernel: Call Trace:
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc10ccac0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc10cd5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc15d850b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc15d8b90>] mdt_object_lock_internal+0x70/0x360 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc15d8ea0>] mdt_object_lock+0x20/0x30 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc1617c4b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc15c64bf>] mdt_intent_brw+0x1f/0x30 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc15debb5>] mdt_intent_policy+0x435/0xd80 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc10b3d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc10dc336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc1164a12>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc116936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc111024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffc1113bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffbe8c2e81>] kthread+0xd1/0xe0
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffbef77c24>] ret_from_fork_nospec_begin+0xe/0x21
Nov 04 18:30:17 fir-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Nov 04 18:30:17 fir-md1-s1 kernel: LNet: Service thread pid 32415 was inactive for 201.19s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Nov 04 18:30:17 fir-md1-s1 kernel: LNet: Skipped 1 previous similar message
Nov 04 18:31:56 fir-md1-s1 kernel: LustreError: 32601:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572920815, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffffa10e0b407bc0/0x675682d854098c0 lrc: 3/0,1 mode: --/PW res: [0x200034eb7:0x1:0x0].0x0 bits 0x40/0x0 rrc: 912 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 32601 timeout: 0 lvb_type: 0
Nov 04 18:31:56 fir-md1-s1 kernel: LustreError: 32601:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 224 previous similar messages
Nov 04 18:34:25 fir-md1-s1 kernel: LustreError: 32404:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572920965, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffffa12da9254ec0/0x675682d8540d5dd lrc: 3/0,1 mode: --/PW res: [0x200034eb7:0x1:0x0].0x0 bits 0x40/0x0 rrc: 913 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 32404 timeout: 0 lvb_type: 0
Nov 04 18:34:25 fir-md1-s1 kernel: LustreError: 32404:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 161 previous similar messages
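All of the stuck threads are queued on the same resource, [0x200034eb7:0x1:0x0]. For the record, here is the read-only state we can gather on the MDS before any further restart attempt (a minimal sketch; lock_count/resource_count and lctl dk are standard lctl parameters, but the exact namespace name should be confirmed with lctl list_param ldlm.namespaces.*):

[root@fir-md1-s1 ~]# lctl get_param ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count
[root@fir-md1-s1 ~]# lctl get_param ldlm.namespaces.mdt-fir-MDT0000_UUID.resource_count
[root@fir-md1-s1 ~]# lctl dk /tmp/fir-md1-s1.dk.log    # dump the Lustre kernel debug buffer for the watchdog traces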
This looks similar to LU-11358.
I also just noticed another thing: on our new 2.12.3 clients, I can't find the async_discard import flag:
[root@sh-117-13 ~]# lctl get_param mdc.fir-*.import | grep connect_flags
    connect_flags: [ write_grant, server_lock, version, acl, xattr, create_on_write, truncate_lock, inode_bit_locks, getattr_by_fid, no_oh_for_devices, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, pools, large_ea, full20, layout_lock, 64bithash, jobstats, umask, einprogress, grant_param, lvb_type, short_io, flock_deadlock, disp_stripe, open_by_fid, lfsck, multi_mod_rpcs, dir_stripe, subtree, bulk_mbits, second_flags, file_secctx, dir_migrate, unknown, flr, lock_convert, archive_id_array, selinux_policy, lsom, unknown2_0x4000 ]
Is it unknown2_0x4000? Or is it missing entirely?
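One way to check this on our side is to match the 0x4000 bit against the OBD_CONNECT2_* definitions in a lustre-release checkout (a sketch; the file paths below are from a recent tree and may differ on b2_12, so treat them as assumptions):

[root@sh-117-13 lustre-release]# grep -n "0x4000ULL" lustre/include/uapi/linux/lustre/lustre_idl.h
[root@sh-117-13 lustre-release]# grep -rn "async_discard" lustre/obdclass/

If the first grep shows that bit 0x4000 of the flags2 word is OBD_CONNECT2_ASYNC_DISCARD, while the second finds no name string in the client's source, then unknown2_0x4000 would be the server advertising async_discard to a client build that doesn't even know the flag's name.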
Can you please confirm whether the async_discard patch is missing from 2.12.3? If that's the case, we'll need to perform a full downgrade or emergency patching of the cluster and Lustre servers.
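Similarly, whether the LU-11358 patches landed on the release could be verified directly from git history (a sketch; the v2_12_3 tag name follows the usual lustre-release convention and is assumed here):

[root@sh-117-13 lustre-release]# git log --oneline v2_12_3 --grep="LU-11358"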
Attaching logs from MDS fir-md1-s1, which serves MDT0 and which we cannot make operational again at the moment. We even tried abort_recovery, with no luck.
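For completeness, the recovery abort was attempted the standard ways (a sketch of the commands; the MDT block device path shown is hypothetical):

[root@fir-md1-s1 ~]# lctl --device fir-MDT0000 abort_recovery
[root@fir-md1-s1 ~]# mount -t lustre -o abort_recov /dev/mapper/fir-mdt0 /mnt/fir-MDT0000    # equivalent at mount time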