Lustre / LU-12935

MDT deadlock on 2.12.3 with DoM; is it missing async_discard feature?

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.3
    • Labels: None
    • Environment: CentOS 7.6
    • Severity: 1

    Description

      Currently we cannot make MDT0 work again on Fir (2.12.3) due to these backtraces and lock timeout:

      Nov 04 18:30:17 fir-md1-s1 kernel: Pid: 32408, comm: mdt01_024 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
      Nov 04 18:30:17 fir-md1-s1 kernel: Call Trace:
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10ccac0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10cd5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15d850b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15d8b90>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15d8ea0>] mdt_object_lock+0x20/0x30 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc1617c4b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15c64bf>] mdt_intent_brw+0x1f/0x30 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15debb5>] mdt_intent_policy+0x435/0xd80 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10b3d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10dc336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc1164a12>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc116936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc111024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc1113bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffbe8c2e81>] kthread+0xd1/0xe0
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffbef77c24>] ret_from_fork_nospec_begin+0xe/0x21
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      Nov 04 18:30:17 fir-md1-s1 kernel: LNet: Service thread pid 32415 was inactive for 201.19s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      Nov 04 18:30:17 fir-md1-s1 kernel: LNet: Skipped 1 previous similar message
      Nov 04 18:31:56 fir-md1-s1 kernel: LustreError: 32601:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572920815, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffffa10e0b407bc0/0x675682d854098c0 lrc: 3/0,1 mode: --/PW res: [0x200034eb7:0x1:0x0].0x0 bits 0x40/0x0 rrc: 912 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 32601 timeout: 0 lvb_type: 0
      Nov 04 18:31:56 fir-md1-s1 kernel: LustreError: 32601:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 224 previous similar messages
      Nov 04 18:34:25 fir-md1-s1 kernel: LustreError: 32404:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572920965, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffffa12da9254ec0/0x675682d8540d5dd lrc: 3/0,1 mode: --/PW res: [0x200034eb7:0x1:0x0].0x0 bits 0x40/0x0 rrc: 913 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 32404 timeout: 0 lvb_type: 0
      Nov 04 18:34:25 fir-md1-s1 kernel: LustreError: 32404:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 161 previous similar messages
      

      This looks similar to LU-11358.

      I also just noticed another thing: on our new 2.12.3 clients, I can't find the async_discard import flag:

      [root@sh-117-13 ~]# lctl get_param mdc.fir-*.import | grep connect_flags
          connect_flags: [ write_grant, server_lock, version, acl, xattr, create_on_write, truncate_lock, inode_bit_locks, getattr_by_fid, no_oh_for_devices, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, pools, large_ea, full20, layout_lock, 64bithash, jobstats, umask, einprogress, grant_param, lvb_type, short_io, flock_deadlock, disp_stripe, open_by_fid, lfsck, multi_mod_rpcs, dir_stripe, subtree, bulk_mbits, second_flags, file_secctx, dir_migrate, unknown, flr, lock_convert, archive_id_array, selinux_policy, lsom, unknown2_0x4000 ]
      

      Is it unknown2_0x4000? Or is it missing?

      Could you please confirm whether the async_discard patch is missing from 2.12.3? If that's the case, we'll need to perform a full downgrade or emergency patching of the cluster and Lustre servers.
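      (A rough way to double-check this on a client, added here as a sketch and not part of the original report: grep the import for the flag by name and for any unnamed flags2 bits. An unknown2_0xNNNN entry only means the client code has no name string for that bit, not necessarily that the feature was not negotiated. The mdc pattern matches the command above.)

      # Sketch only: look for async_discard by name, and list any flags2 bits
      # that this client cannot name (those are printed as unknown2_0x...).
      lctl get_param -n mdc.fir-*.import \
          | grep connect_flags \
          | grep -oE 'async_discard|unknown2_0x[0-9a-f]+'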

      Attaching logs from MDS fir-md1-s1, which serves MDT0 and which we cannot make operational again at the moment. We even tried abort_recovery, with no luck.
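      (For reference, abort_recovery is normally issued on the MDS along these lines; this is an illustrative invocation, with the target name taken from the logs above, not the exact command used here.)

      # Illustrative only: tell the target to stop waiting for unrecovered
      # clients and evict them so it can finish recovery.
      lctl --device fir-MDT0000 abort_recovery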

      Attachments

        1. crash-sysrq-fir-md1-s1-foreach-bt.log
          1.35 MB
        2. fir-md1-s1_20191105.log
          758 kB
        3. fir-md1-s1_lfsck-results.log
          3 kB
        4. fir-md1-s1-dk.log.gz
          3.30 MB
        5. fir-md1-s1-dk2.log.gz
          395 kB
        6. fir-md1-s1-MDT0.log
          433 kB
        7. fir-rbh01.dk.log
          651 kB

        Issue Links

          Activity

            [LU-12935] MDT deadlock on 2.12.3 with DoM; is it missing async_discard feature?
            pjones Peter Jones added a comment -

            Landed for 2.12.4 - does not affect master


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36881/
            Subject: LU-12935 obdclass: fix import connect flag printing
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: b66efa7d500f94d755f318d43804f0f3eb883835

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36881
            Subject: LU-12935 obdclass: fix import connect flag printing
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: b2afbb33175af02608d9d4127370dc874542148b

            adilger Andreas Dilger added a comment -

            Stephane, it would make sense to get an strace (or equivalent Lustre debug log via lctl set_param debug="+vfstrace +dlmtrace") from these jobs to see just how many times they write to the same file.
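            (A capture along those lines might look like the following on one client; this is a sketch only, and the buffer size and output path are arbitrary.)

            lctl set_param debug="+vfstrace +dlmtrace"   # add VFS and DLM tracing to the debug mask
            lctl set_param debug_mb=1024                 # enlarge the debug buffer (arbitrary size)
            lctl clear                                   # start from an empty debug log
            # ... reproduce or run the suspected job here ...
            lctl dk /tmp/$(hostname)-dk.log              # dump the debug log for analysis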

            Mike, since it is possible to migrate DoM components to OSTs (either with full-file copy in 2.12 or via FLR mirror in 2.13 patch https://review.whamcloud.com/35359 "LU-11421 dom: manual OST-to-DOM migration via mirroring"), have you thought about automatically migrating files with high write lock contention from DoM to a regular OST object? Since the amount of data to be moved is very small (under 150KB in this case), the migration should be very fast, and it would allow extent locks to be used on the file.

            That said, I have no idea how hard this would be, and it only makes sense if there are multiple writers repeatedly contending on the same DoM file component (which I suspect is rare in most cases). Even here, if the clients only write to the same file a handful of times, the extra migration step may make the performance worse rather than better. If they write to the same file hundreds of times, then it might be worthwhile to implement.

            Even in IO-500 ior-hard-write the chunk size is 47008 bytes, so at most 2-3 ranks would be contending on a 64KB or 128KB DoM component, and we never had problems with this in our testing.
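            (For reference, a mirror-based move of the kind described above would roughly follow the generic FLR commands below. This is a sketch, not the exact LU-11421 workflow; the file name, layout options and mirror ID are illustrative.)

            lfs mirror extend -N -E -1 -c 1 /fir/path/file    # add an OST-only mirror next to the DoM one
            lfs mirror resync /fir/path/file                  # copy the (small) DoM data to the new mirror
            lfs getstripe -v /fir/path/file                   # find the mirror ID holding the DoM component
            lfs mirror split --mirror-id 1 -d /fir/path/file  # drop that mirror so only the OST copy remains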


            sthiell Stephane Thiell added a comment -

            To avoid further issues for now, we have removed the default DoM striping from all directories on this filesystem (only kept a PFL striping). New files won't use DoM anymore. We'll see if that helps.
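            (As an illustration of such a default, not the exact layout used on Fir: a PFL-only directory default with no DoM component can be set as below, whereas a DoM+PFL default would start with something like -E 128K -L mdt. Sizes, stripe counts and the path are made up.)

            lfs setstripe -E 4M -c 1 -E 256M -c 4 -E -1 -c 16 /fir/some/dir   # PFL-only default, no DoM component
            lfs getstripe -d /fir/some/dir                                    # show the directory's default layout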

            sthiell Stephane Thiell added a comment - - edited

            Thanks Mike. We're currently discussing changing our default striping to avoid further issues like these, and perhaps only using DoM in specific cases.

            FYI, yesterday, after I killed the suspected jobs, I still had to do a stop/start of MDT0 to resume filesystem operations (it didn't recover by itself).


            tappro Mikhail Pershin added a comment -

            Stephane, there is one ticket for DoM improvements which can improve such access patterns, LU-10664, but it has no patch at the moment.

            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: