[LU-16404] sanity test_230n: lfs mirror: cannot get UNLOCK lease, Mirroring failed Created: 15/Dec/22 Updated: 20/Mar/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for S Buisson <sbuisson@ddn.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/65516983-67f2-4883-b7ab-1cb5abdf58e5 test_230n failed with the following error: Mirroring failed Test output is: == sanity test 230n: Dir migration with mirrored file ==== 22:36:31 (1671057391) lfs mirror mirror: cannot get UNLOCK lease, ext 4: No such file or directory (2) error: lfs mirror extend: /mnt/lustre/d230n.sanity/f230n.sanity: cannot merge layout: No such file or directory sanity test_230n: @@@@@@ FAIL: Mirroring failed In client dmesg, we can see the client was evicted by the MDS: [13212.828154] LustreError: 866765:0:(file.c:242:ll_close_inode_openhandle()) lustre-clilmv-ffff93f6048ef000:
inode [0x200002b15:0x5b9b:0x0] mdc close failed: rc = -2
[13212.831301] LustreError: 11-0: lustre-MDT0000-mdc-ffff93f6048ef000: operation ldlm_enqueue to node 10.240.38.63@tcp failed: rc = -107
[13212.831308] Lustre: lustre-MDT0000-mdc-ffff93f6048ef000: Connection to lustre-MDT0000 (at 10.240.38.63@tcp) was lost;
in progress operations using this service will wait for recovery to complete
[13212.833555] LustreError: Skipped 1 previous similar message
[13212.838262] LustreError: 167-0: lustre-MDT0000-mdc-ffff93f6048ef000: This client was evicted by lustre-MDT0000;
in progress operations using this service will fail.
[13212.847340] Lustre: lustre-MDT0000-mdc-ffff93f6048ef000: Connection restored to 10.240.38.63@tcp (at 10.240.38.63@tcp)
[13213.270161] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_230n: @@@@@@ FAIL: Mirroring failed
The MDS claims the client was not responsive to the lock request: [13026.421567] Lustre: DEBUG MARKER: == sanity test 230n: Dir migration with mirrored file ==== 03:41:59 (1679283719)
[13070.199369] Lustre: mdt_rdpg00_003: service thread pid 541635 was inactive for 42.970 seconds. The thread might be hung,
or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[13070.203029] Pid: 541635, comm: mdt_rdpg00_003 4.18.0-348.23.1.el8_lustre.x86_64 #1 SMP Thu Mar 2 00:54:25 UTC 2023
[13070.204965] Call Trace TBD:
[13070.206179] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
[13070.207308] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
[13070.208596] mdt_object_local_lock+0x509/0xb30 [mdt]
[13070.209666] mdt_object_lock_internal+0x18d/0x4a0 [mdt]
[13070.210781] mdt_object_lock+0x1b/0x20 [mdt]
[13070.211735] mdt_close_handle_layouts+0x935/0x10b0 [mdt]
[13070.212865] mdt_mfd_close+0x510/0xbc0 [mdt]
[13070.213811] mdt_close_internal+0xcc/0x250 [mdt]
[13070.214833] mdt_close+0x2c0/0x8b0 [mdt]
[13070.215745] tgt_request_handle+0xc8c/0x1950 [ptlrpc]
[13070.216850] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[13070.218096] ptlrpc_main+0xc4e/0x1510 [ptlrpc]
[13127.540308] LustreError: 482890:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 100s:
evicting client at 10.240.38.60@tcp ns: mdt-lustre-MDT0000_UUID lock: 0000000001a8df2b/0x3c9ac870599a2eb7
lrc: 3/0,0 mode: CR/CR res: [0x200002b15:0x5b9b:0x0].0x0 bits 0xd/0x0 rrc: 4 type: IBT gid 0 flags: 0x60200400000020
nid: 10.240.38.60@tcp remote: 0x5d158cc36e2e425f expref: 49 pid: 537456 timeout: 13128 lvb_type: 0
[13128.435958] Lustre: DEBUG MARKER: sanity test_230n: @@@@@@ FAIL: Mirroring failed
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 26/Jan/23 ] |
|
+12 failures in past 4 weeks, sometimes "Mirroring failed", sometimes "Migration failed". The first failures started on 2022-11-09 (a few earlier failures, but caused by previous subtest failures), and have been 21 on master since then (every 3-7 days), and not on any other branch. The first failure was for a patch that is fixing a later test in sanity, so very unlikely to be the source of the MDS eviction. There were 45 patches landed to master between 2022-11-01 and 2022-11-10: # git log --oneline --after 2022-11-01 --before 2022-11-10 3737b28a9d LU-10003 lnet: to properly handle errors reported use ssize_t 93784852c8 LU-15825 ldlm: remove linefeed from LDLM_DEBUG ed71af843d LU-16280 tests: conf-sanity/117 call setup when FS is not mounted 9a0a89520e LU-16251 obdclass: fill jobid in a safe way b104c0a277 LU-14377 tests: make parallel-scale/rr_alloc less strict 38a52f2fc3 LU-16243 tests: Specify ping source in test_218 d55fe25a95 LU-16215 kfilnd: Use immediate for routed GETs 78c681d9f4 LU-16207 build: add rpm-build BuildRequires for SLES15 SP3 95f7ef6094 LU-16111 build: Fix include of stddef.h 0171801df5 LU-8151 obd: Show correct shadow mountpoints for server 26e765e0d2 LU-15011 tests: additional checks for pool spilling 41610e6207 LU-8837 mgc: move server-only code out of mgc_request.c f32bdde057 LU-8837 lustre: make ldlm and target file lists ef90a02d12 LU-13135 quota: improve checks in OSDs to ignore quota 896cd5b7bc LU-10391 lnet: fix build issue when IPv6 is disabled. 122644ae19 LU-15847 target: report multiple transno to debug log cf121b1668 LU-16203 llog: skip bad records in llog 53b3d1de99 LU-16152 utils: fix integer overflow in cYAML parser db0fb8f2b6 LU-10391 lnet: allow ping packet to contain large nids 5243630b09 LU-15947 obdclass: improve precision of wakeups for mod_rpcs 8f8f6e2f36 LU-10003 lnet: use Netlink to support old and new NI APIs. 85941b9fb9 LU-15117 ofd: no lock for dt_bufs_get() in read path 293844d132 LU-16222 kernel: RHEL 8.7 client and server support d3fc875617 LU-16256 tests: sanity-flr to remove temp files f97ce54357 LU-15564 osd: add allocation time histogram 9aaaebfa22 LU-16252 tests: test_129() fix for not combined_mgs_mds d41a508a49 LU-16241 obdclass: NL_SET_ERR_MSG work for older kernel aaab50dd88 LU-16249 sec: krb5_decrypt_bulk calls decryption primitive b533700add LU-16232 scripts: changelog/updatelog emergency cleanup b4e06ff1e4 LU-16172 o2iblnd: add verbose debug prints for rx/tx events c619b6d6a5 LU-16258 llite: Explicitly support .splice_write 8f718df474 LU-15852 lnet: Don't modify uptodate peer with temp NI 4f2914537c LU-15139 osp: block reads until the object is created ec1b0dc6e4 LU-15626 tests: Fix "error" reported by shellcheck for replay-dual 8e563553c0 LU-15626 tests: Fix "error" reported by shellcheck b2274a7160 LU-15619 osc: Remove oap lock 7e0a2b0737 LU-15046 osp: precreate thread vs connect race ed0f74cd63 LU-6142 obdclass: make ccc_users in cl_client_cache a refcount_t 5b911e0326 LU-16160 llite: clear stale page's uptodate bit 1a79d395dd LU-15935 target: keep track of multirpc slots in last_rcvd 62fd8f9b49 Revert "LU-16046 revert: "LU-9964 llite: prevent mulitple group locks"" dd609c6f31 Revert "LU-16046 ldlm: group lock fix" Nothing immediately jumps out as an obvious cause. One possibility might be " Patches committed the previous week seem even less likely: # git log --oneline --after 2022-10-25 --before 2022-11-01 6aee406c84 LU-14719 lod: distributed transaction check space d4848d779b LU-16187 tests: Fix is_project_quota_supported() 9309a32c1a LU-16240 build: Use new AS_HELP_STRING 83d3f42118 LU-15305 obdclass: fix race in class_del_profile 1df5199097 LU-15807 ksocklnd: fix irq lock inversion while calling sk_data_ready() 8a2bd9661d LU-16177 kernel: kernel update RHEL9.0 [5.14.0-70.26.1.el9_0] 5547158daa LU-16175 kernel: kernel update SLES12 SP5 [4.12.14-122.133.1] 5d31fdaf68 LU-16174 kernel: kernel update SLES15 SP4 [5.14.21-150400.24.21.2] c3467db7e7 LU-16173 kernel: kernel update SLES15 SP3 [5.3.18-150300.59.93.1] 3493db6a67 LU-16233 build: Add always target for SUSE15 SP3 LTSS 950e59ced1 LU-15234 lnet: add mechanism for dumping lnd peer debug info 0a639a32cf LU-15795 lbuild: enable KABI c5a436d849 LU-12130 test: pool inheritance for mdt component 78dddb423f LU-15447 tests: sanity-flr/208 reset rotational status c24a38b583 LU-16183 test: sanity-hsm/70 should detect python 04e5fa7b24 LU-13175 tests: sanity/803 to sync MDTs for actual statfs 2e08974390 LU-16139 statahead: avoid to block ptlrpcd interpret context a966b624ac LU-16149 lnet: Discovery queue and deletion race 0a317b171e LU-15847 tgt: move tti_ transaction params to tsi_ 4e2e8fd2fc LU-15847 tgt: reply always with the latest assigned transno 4468f6c9d9 LU-16025 llite: adjust read count as file got truncated The It might be necessary |
| Comment by Andreas Dilger [ 20/Mar/23 ] |
|
May possibly relate to |
| Comment by Gerrit Updater [ 20/Mar/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50343 |
| Comment by Andreas Dilger [ 20/Mar/23 ] |
|
If patch https://review.whamcloud.com/50261 " The client shouldn't be evicted if it cannot send an AST reply because all of the RPC slots are used for long-running RPC requests (e.g. writes or LDLM enqueues that are blocked on different locks). The OST lock prolonging only helps for timeouts on locks actively undergoing writeout, but doesn't help if there is an AST on a different resource and the client can't even send writes under those locks because all of the RPC slots are full. |
| Comment by Andreas Dilger [ 20/Mar/23 ] |
|
The test patch passed 200x iterations of sanity test_230n with the |