[LU-16404] sanity test_230n: lfs mirror: cannot get UNLOCK lease, Mirroring failed Created: 15/Dec/22  Updated: 20/Mar/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16633 obd_get_mod_rpc_slot() is vulnerable ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/65516983-67f2-4883-b7ab-1cb5abdf58e5

test_230n failed with the following error:

Mirroring failed

Test output is:

== sanity test 230n: Dir migration with mirrored file ==== 22:36:31 (1671057391)
lfs mirror mirror: cannot get UNLOCK lease, ext 4: No such file or directory (2)
error: lfs mirror extend: /mnt/lustre/d230n.sanity/f230n.sanity: cannot merge layout: No such file or directory
 sanity test_230n: @@@@@@ FAIL: Mirroring failed 
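
For reference, the operations exercised by the failing step can be approximated outside the test framework with roughly the following sequence (a minimal sketch only; the paths and the target MDT index are placeholders, and these are not the exact commands used by sanity test_230n):

# Assumes a Lustre client mounted at /mnt/lustre with at least 2 MDTs.
mkdir -p /mnt/lustre/d230n
dd if=/dev/zero of=/mnt/lustre/d230n/f230n bs=1M count=1
# Add a mirror to the existing file; this is the step that reports
# "cannot get UNLOCK lease" once the client has been evicted by the MDS.
lfs mirror extend -N /mnt/lustre/d230n/f230n
# Migrate the parent directory to another MDT (the second half of the test).
lfs migrate -m 1 /mnt/lustre/d230n
lfs getstripe -v /mnt/lustre/d230n/f230n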

In client dmesg, we can see the client was evicted by the MDS:

[13212.828154] LustreError: 866765:0:(file.c:242:ll_close_inode_openhandle()) lustre-clilmv-ffff93f6048ef000:
    inode [0x200002b15:0x5b9b:0x0] mdc close failed: rc = -2
[13212.831301] LustreError: 11-0: lustre-MDT0000-mdc-ffff93f6048ef000: operation ldlm_enqueue to node 10.240.38.63@tcp failed: rc = -107
[13212.831308] Lustre: lustre-MDT0000-mdc-ffff93f6048ef000: Connection to lustre-MDT0000 (at 10.240.38.63@tcp) was lost;
    in progress operations using this service will wait for recovery to complete
[13212.833555] LustreError: Skipped 1 previous similar message
[13212.838262] LustreError: 167-0: lustre-MDT0000-mdc-ffff93f6048ef000: This client was evicted by lustre-MDT0000;
    in progress operations using this service will fail.
[13212.847340] Lustre: lustre-MDT0000-mdc-ffff93f6048ef000: Connection restored to 10.240.38.63@tcp (at 10.240.38.63@tcp)
[13213.270161] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_230n: @@@@@@ FAIL: Mirroring failed 
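
As a quick client-side check of the eviction, generic diagnostics such as the following can be used (standard commands, not part of the test itself):

# Run on the client: confirm the eviction and inspect the MDC import state.
dmesg | grep -E 'evicted|was lost|Connection restored'
lctl get_param mdc.*.import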

The MDS claims the client was not responsive to the lock request:

[13026.421567] Lustre: DEBUG MARKER: == sanity test 230n: Dir migration with mirrored file ==== 03:41:59 (1679283719)
[13070.199369] Lustre: mdt_rdpg00_003: service thread pid 541635 was inactive for 42.970 seconds. The thread might be hung,
     or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[13070.203029] Pid: 541635, comm: mdt_rdpg00_003 4.18.0-348.23.1.el8_lustre.x86_64 #1 SMP Thu Mar 2 00:54:25 UTC 2023
[13070.204965] Call Trace TBD:
[13070.206179] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
[13070.207308] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
[13070.208596] mdt_object_local_lock+0x509/0xb30 [mdt]
[13070.209666] mdt_object_lock_internal+0x18d/0x4a0 [mdt]
[13070.210781] mdt_object_lock+0x1b/0x20 [mdt]
[13070.211735] mdt_close_handle_layouts+0x935/0x10b0 [mdt]
[13070.212865] mdt_mfd_close+0x510/0xbc0 [mdt]
[13070.213811] mdt_close_internal+0xcc/0x250 [mdt]
[13070.214833] mdt_close+0x2c0/0x8b0 [mdt]
[13070.215745] tgt_request_handle+0xc8c/0x1950 [ptlrpc]
[13070.216850] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[13070.218096] ptlrpc_main+0xc4e/0x1510 [ptlrpc]
[13127.540308] LustreError: 482890:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 100s:
     evicting client at 10.240.38.60@tcp  ns: mdt-lustre-MDT0000_UUID lock: 0000000001a8df2b/0x3c9ac870599a2eb7
     lrc: 3/0,0 mode: CR/CR res: [0x200002b15:0x5b9b:0x0].0x0 bits 0xd/0x0 rrc: 4 type: IBT gid 0 flags: 0x60200400000020
     nid: 10.240.38.60@tcp remote: 0x5d158cc36e2e425f expref: 49 pid: 537456 timeout: 13128 lvb_type: 0
[13128.435958] Lustre: DEBUG MARKER: sanity test_230n: @@@@@@ FAIL: Mirroring failed
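
The same condition can be spotted on the MDS with similarly generic diagnostics (not specific to this test):

# Run on the MDS: look for hung service threads and expired lock callbacks.
dmesg | grep -E 'was inactive|lock callback timer expired'
# Optionally dump the Lustre debug log covering the eviction window.
lctl dk /tmp/lustre-debug.log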

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_230n - Mirroring failed



 Comments   
Comment by Andreas Dilger [ 26/Jan/23 ]

12+ failures in the past 4 weeks, sometimes "Mirroring failed", sometimes "Migration failed".

The first failures started on 2022-11-09 (there were a few earlier failures, but those were caused by previous subtest failures). Since then there have been 21 failures on master (one every 3-7 days), and none on any other branch. The first failure was for a patch that fixes a later test in sanity, so it is very unlikely to be the source of the MDS eviction.

There were 45 patches landed to master between 2022-11-01 and 2022-11-10:

# git log --oneline --after 2022-11-01 --before 2022-11-10
3737b28a9d LU-10003 lnet: to properly handle errors reported use ssize_t
93784852c8 LU-15825 ldlm: remove linefeed from LDLM_DEBUG
ed71af843d LU-16280 tests: conf-sanity/117 call setup when FS is not mounted
9a0a89520e LU-16251 obdclass: fill jobid in a safe way
b104c0a277 LU-14377 tests: make parallel-scale/rr_alloc less strict
38a52f2fc3 LU-16243 tests: Specify ping source in test_218
d55fe25a95 LU-16215 kfilnd: Use immediate for routed GETs
78c681d9f4 LU-16207 build: add rpm-build BuildRequires for SLES15 SP3
95f7ef6094 LU-16111 build: Fix include of stddef.h
0171801df5 LU-8151 obd: Show correct shadow mountpoints for server
26e765e0d2 LU-15011 tests: additional checks for pool spilling
41610e6207 LU-8837 mgc: move server-only code out of mgc_request.c
f32bdde057 LU-8837 lustre: make ldlm and target file lists
ef90a02d12 LU-13135  quota: improve checks in OSDs to ignore quota
896cd5b7bc LU-10391 lnet: fix build issue when IPv6 is disabled.
122644ae19 LU-15847 target: report multiple transno to debug log
cf121b1668 LU-16203 llog: skip bad records in llog
53b3d1de99 LU-16152 utils: fix integer overflow in cYAML parser
db0fb8f2b6 LU-10391 lnet: allow ping packet to contain large nids
5243630b09 LU-15947 obdclass: improve precision of wakeups for mod_rpcs
8f8f6e2f36 LU-10003 lnet: use Netlink to support old and new NI APIs.
85941b9fb9 LU-15117 ofd: no lock for dt_bufs_get() in read path
293844d132 LU-16222 kernel: RHEL 8.7 client and server support
d3fc875617 LU-16256 tests: sanity-flr to remove temp files
f97ce54357 LU-15564 osd: add allocation time histogram
9aaaebfa22 LU-16252 tests: test_129() fix for not combined_mgs_mds
d41a508a49 LU-16241 obdclass: NL_SET_ERR_MSG work for older kernel
aaab50dd88 LU-16249 sec: krb5_decrypt_bulk calls decryption primitive
b533700add LU-16232 scripts: changelog/updatelog emergency cleanup
b4e06ff1e4 LU-16172 o2iblnd: add verbose debug prints for rx/tx events
c619b6d6a5 LU-16258 llite: Explicitly support .splice_write
8f718df474 LU-15852 lnet: Don't modify uptodate peer with temp NI
4f2914537c LU-15139 osp: block reads until the object is created
ec1b0dc6e4 LU-15626 tests: Fix "error" reported by shellcheck for replay-dual
8e563553c0 LU-15626 tests: Fix "error" reported by shellcheck
b2274a7160 LU-15619 osc: Remove oap lock
7e0a2b0737 LU-15046 osp: precreate thread vs connect race
ed0f74cd63 LU-6142 obdclass: make ccc_users in cl_client_cache a refcount_t
5b911e0326 LU-16160 llite: clear stale page's uptodate bit
1a79d395dd LU-15935 target: keep track of multirpc slots in last_rcvd
62fd8f9b49 Revert "LU-16046 revert: "LU-9964 llite: prevent mulitple group locks""
dd609c6f31 Revert "LU-16046 ldlm: group lock fix"

Nothing immediately jumps out as an obvious cause. One possibility might be "LU-15947 obdclass: improve precision of wakeups for mod_rpcs" if there was some problem with the wakeups.
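
If the mod_rpcs wakeup change is involved, the client-side RPC slot limits it manages are at least easy to inspect while the test runs (a diagnostic sketch; parameter names as on current clients):

# On the client: check the MDC RPC slot limits managed by the mod_rpcs code.
lctl get_param mdc.*.max_rpcs_in_flight
lctl get_param mdc.*.max_mod_rpcs_in_flight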

Patches committed the previous week seem even less likely:

# git log --oneline --after 2022-10-25 --before 2022-11-01
6aee406c84 LU-14719 lod: distributed transaction check space
d4848d779b LU-16187 tests: Fix is_project_quota_supported()
9309a32c1a LU-16240 build: Use new AS_HELP_STRING
83d3f42118 LU-15305 obdclass: fix race in class_del_profile
1df5199097 LU-15807 ksocklnd: fix irq lock inversion while calling sk_data_ready()
8a2bd9661d LU-16177 kernel: kernel update RHEL9.0 [5.14.0-70.26.1.el9_0]
5547158daa LU-16175 kernel: kernel update SLES12 SP5 [4.12.14-122.133.1]
5d31fdaf68 LU-16174 kernel: kernel update SLES15 SP4 [5.14.21-150400.24.21.2]
c3467db7e7 LU-16173 kernel: kernel update SLES15 SP3 [5.3.18-150300.59.93.1]
3493db6a67 LU-16233 build: Add always target for SUSE15 SP3 LTSS
950e59ced1 LU-15234 lnet: add mechanism for dumping lnd peer debug info
0a639a32cf LU-15795 lbuild: enable KABI
c5a436d849 LU-12130 test: pool inheritance for mdt component
78dddb423f LU-15447 tests: sanity-flr/208 reset rotational status
c24a38b583 LU-16183 test: sanity-hsm/70 should detect python
04e5fa7b24 LU-13175 tests: sanity/803 to sync MDTs for actual statfs
2e08974390 LU-16139 statahead: avoid to block ptlrpcd interpret context
a966b624ac LU-16149 lnet: Discovery queue and deletion race
0a317b171e LU-15847 tgt: move tti_ transaction params to tsi_
4e2e8fd2fc LU-15847 tgt: reply always with the latest assigned transno
4468f6c9d9 LU-16025 llite: adjust read count as file got truncated

The LU-14719 patch has had a few issues, but those are mostly during recovery, which isn't the case here.

It might be necessary to bisect the patches landed in this window to isolate the cause.

Comment by Andreas Dilger [ 20/Mar/23 ]

This may relate to LU-16633, which tracks a bug initially introduced by LU-15947.

Comment by Gerrit Updater [ 20/Mar/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50343
Subject: LU-16404 tests: check if wakeup race fixes issue
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 69aa93cf93a7c2648d4fbb2441b96c5a53ba671f

Comment by Andreas Dilger [ 20/Mar/23 ]

If patch https://review.whamcloud.com/50261 "LU-16633 obdclass: fix rpc slot leakage" fixes this issue, it implies that the client AST reply can be blocked by the max_rpcs_in_flight limit. That makes me wonder whether we are blocking AST RPCs due to max_rpcs_in_flight, or whether there should be a higher RPC limit for AST replies than for regular MDC/OSC requests.

The client shouldn't be evicted just because it cannot send an AST reply while all of the RPC slots are used by long-running RPC requests (e.g. writes, or LDLM enqueues blocked on different locks). OST lock prolonging only helps with timeouts on locks actively undergoing writeout; it doesn't help if there is an AST on a different resource and the client can't even send the writes under those locks because all of the RPC slots are full.
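
One way to test that hypothesis would be to rerun the failing test with a much higher client RPC limit and see whether the eviction still occurs (a diagnostic experiment only, not a proposed fix; the mod RPC limit is additionally capped by the server, so only the base limit is raised here):

# On the client, before running the test:
lctl set_param mdc.*.max_rpcs_in_flight=64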

Comment by Andreas Dilger [ 20/Mar/23 ]

The test patch passed 200 iterations of sanity test_230n with the LU-16633 patch applied.
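
For anyone wanting to repeat that soak locally, a simple loop over the test framework works (a sketch; assumes an already configured lustre/tests environment):

# Run only test_230n from sanity.sh 200 times, stopping at the first failure.
cd lustre/tests
for i in $(seq 1 200); do
    ONLY=230n bash sanity.sh || break
done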
