[LU-15572] Interop sanity-flr failing with "cannot get UNLOCK lease" Created: 20/Feb/22  Updated: 21/Dec/22  Resolved: 05/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0, Lustre 2.15.1
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Blocker
Reporter: Andreas Dilger Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15552 Interop: sanity-flr test 0d fails wit... Open
is related to LU-15268 lfs mirror extend error propagation Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running sanity-flr in interop mode fails a number of subtests:
https://testing.whamcloud.com/test_sets/0a29fd64-d2e6-4cbf-b636-b1deae709128

lfs mirror: cannot get UNLOCK lease, ext 8: Device or resource busy (16)


 Comments   
Comment by Andreas Dilger [ 20/Feb/22 ]

Temporarily bumping this up to a blocker to ensure it is reviewed. This appears to be a real interop issue between 2.14.0 and 2.15.0, and not only a test script issue.

Comment by Andreas Dilger [ 20/Feb/22 ]

This may have been introduced by the landing of patch https://review.whamcloud.com/45636 "LU-15268 mdt: reveal the real intent close error code", which is also causing other interop failures such as LU-15552.

Comment by Andreas Dilger [ 20/Feb/22 ]

This affects sanity-flr test_0b, test_50a, test_60b, test_61b, and test_203.

Comment by Andreas Dilger [ 20/Feb/22 ]

Bobijam, could you please take a look at this? It looks like more than just a test script issue; it appears to be an actual interop bug with FLR, since it is failing in a variety of different tests.

Comment by Andreas Dilger [ 20/Feb/22 ]

This is also causing sanity test_272d, test_272e, and test_272f to fail in interop testing.

Comment by Zhenyu Xu [ 24/Feb/22 ]

This could be an interop issue between a 2.15.0 client and a 2.14.0 MDS on the mirror split operation.

test_0b:
lfs mirror mirror: cannot get UNLOCK lease, ext 8: Device or resource busy (16)
error lfs mirror delete: cannot split '/mnt/lustre/d0b.sanity-flr/f0b.sanity-flr': Device or resource busy

test_50A:
lfs mirror mirror: cannot get UNLOCK lease, ext 8: Device or resource busy (16)
error lfs mirror split: cannot split '/mnt/lustre/d50A.sanity-flr/f50A.sanity-flr': Device or resource busy

test_60b:
lfs mirror mirror: cannot get UNLOCK lease, ext 8: Device or resource busy (16)
error lfs mirror split: cannot split '/mnt/lustre/d60b.sanity-flr/f60b.sanity-flr': Device or resource busy

test_61b:
lfs mirror mirror: cannot get UNLOCK lease, ext 8: Device or resource busy (16)
error lfs mirror split: cannot split '/mnt/lustre/d61b.sanity-flr/f61b.sanity-flr': Device or resource busy

test_203:
lfs mirror mirror: cannot get UNLOCK lease, ext 8: Device or resource busy (16)
error lfs mirror delete: cannot split '/mnt/lustre/f203.sanity-flr': Device or resource busy

Take test_60b as an example:
On the MDS side, the file close intent for the mirror split request was received and handled successfully:

00000100:00100000:1.0:1645359656.359528:0:991126:0:(service.c:2142:ptlrpc_server_handle_req_in()) got req x1725282051462784
00000100:00100000:1.0:1645359656.359542:0:991126:0:(service.c:2299:ptlrpc_server_handle_request()) Handling RPC req@0000000030f1e911 pname:cluuid+ref:pid:xid:nid:opc:job mdt_rdpg00_001:67bf88ef-fd73-4c00-9a58-d65897456138+35:1227937:x1725282051462784:12345-10.240.25.52@tcp:35:lfs.0
...
00000100:00100000:1.0:1645359656.362030:0:991126:0:(service.c:2348:ptlrpc_server_handle_request()) Handled RPC req@0000000030f1e911 pname:cluuid+ref:pid:xid:nid:opc:job mdt_rdpg00_001:67bf88ef-fd73-4c00-9a58-d65897456138+32:1227937:x1725282051462784:12345-10.240.25.52@tcp:35:lfs.0 Request processed in 2487us (2537us total) trans 85899346091 rc 0/0

while on the client, I don't see the reason for the failure:

00000080:00200000:1.0:1645359656.359344:0:1227937:0:(file.c:3896:ll_file_ioctl()) VFS Op:inode=[0x2000090a6:0x4e:0x0](00000000a9e821ea), cmd=c00c66f3
00000002:00100000:1.0:1645359656.359354:0:1227937:0:(mdc_request.c:944:mdc_close()) @@@ matched open  req@00000000d7fc9409 x1725282051460736/t85899346086(85899346086) o101->lustre-MDT0000-mdc-ffff913d103c9000@10.240.25.55@tcp:12/10 lens 576/656 e 0 to 0 dl 1645359663 ref 1 fl Complete:RPQU/4/ffffffff rc 0/-1 job:'lfs.0'
00000100:00100000:1.0:1645359656.359372:0:1227937:0:(client.c:742:ptlrpc_reassign_next_xid()) @@@ reassign xid  req@000000008e2795d6 x1725282051462784/t0(0) o35->lustre-MDT0000-mdc-ffff913d103c9000@10.240.25.55@tcp:23/10 lens 504/1072 e 0 to 0 dl 0 ref 1 fl New:QU/0/ffffffff rc 0/-1 job:''
00000100:00100000:1.0:1645359656.359378:0:1227937:0:(client.c:1733:ptlrpc_send_new_req()) Sending RPC req@000000008e2795d6 pname:cluuid:pid:xid:nid:opc:job lfs:67bf88ef-fd73-4c00-9a58-d65897456138:1227937:1725282051462784:10.240.25.55@tcp:35:lfs.0
...
00000100:00100000:1.0:1645359656.362161:0:1227937:0:(client.c:2210:ptlrpc_check_set()) Completed RPC req@000000008e2795d6 pname:cluuid:pid:xid:nid:opc:job lfs:67bf88ef-fd73-4c00-9a58-d65897456138:1227937:1725282051462784:10.240.25.55@tcp:35:lfs.0
00000002:00100000:1.0:1645359656.362168:0:1227937:0:(mdc_request.c:854:mdc_free_open()) @@@ free open request, rq_replay=0  req@00000000d7fc9409 x1725282051460736/t85899346086(85899346086) o101->lustre-MDT0000-mdc-ffff913d103c9000@10.240.25.55@tcp:12/10 lens 576/656 e 0 to 0 dl 1645359663 ref 1 fl Complete:RQU/4/ffffffff rc 0/-1 job:'lfs.0'
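
For context, the client-side error above comes from the lease sequence that "lfs mirror split"/"lfs mirror delete" drives: lfs takes a lease on the file and then releases it with a layout-split intent attached to the close, via llapi_lease_set(); the "ext 8" in the message is the lil_flags value passed on that release. Below is a minimal sketch of that sequence, assuming the Lustre 2.15 lustreapi names (LL_LEASE_RDLCK, LL_LEASE_UNLCK, LL_LEASE_LAYOUT_SPLIT); error handling and the victim-file setup are abbreviated compared to the real lustre/utils/lfs.c code, so treat it as an illustration rather than the exact implementation.

/* Simplified sketch of the lease-protected mirror split issued by
 * "lfs mirror split/delete" (see lustre/utils/lfs.c for the real code).
 */
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

static int split_mirror(const char *path, __u32 mirror_id, int victim_fd)
{
        struct ll_ioc_lease *data;
        int fd, rc;

        fd = open(path, O_RDONLY);
        if (fd < 0)
                return -errno;

        /* 1. Take a lease so no other client races with the layout change. */
        rc = llapi_lease_acquire(fd, LL_LEASE_RDLCK);
        if (rc < 0)
                goto out_close;

        /* 2. Release the lease with the split request piggy-backed on the
         *    close intent; the "ext 8" in the error message is this
         *    lil_flags value (LL_LEASE_LAYOUT_SPLIT).
         */
        data = calloc(1, offsetof(struct ll_ioc_lease, lil_ids[2]));
        if (!data) {
                rc = -ENOMEM;
                goto out_close;
        }
        data->lil_mode = LL_LEASE_UNLCK;
        data->lil_flags = LL_LEASE_LAYOUT_SPLIT;
        data->lil_count = 2;
        data->lil_ids[0] = victim_fd;   /* file receiving the split mirror */
        data->lil_ids[1] = mirror_id;   /* mirror to detach */

        /* "cannot get UNLOCK lease, ext 8: ..." in the logs above is emitted
         * when this call fails.
         */
        rc = llapi_lease_set(fd, data);
        free(data);
out_close:
        close(fd);
        return rc;
}
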
Comment by Zhenyu Xu [ 24/Feb/22 ]

The patch https://review.whamcloud.com/45636 "LU-15268 mdt: reveal the real intent close error code" is a server-only patch, so it must be something else causing the interop failure.

Comment by Zhenyu Xu [ 24/Feb/22 ]

I think it could be that patch https://review.whamcloud.com/42116 "LU-14521 flr: delete mirror without volatile file" does not handle the error code from an old MDS properly (it expects -EINVAL from the MDS, while an old server returns -EBUSY on mirror delete).

Comment by Zhenyu Xu [ 24/Feb/22 ]

The following patch could be used on 2.15 to catch the older MDS's ambiguous error code:

diff --git a/lustre/utils/lfs.c b/lustre/utils/lfs.c
index 3a1511fe05..e5d245af3e 100644
--- a/lustre/utils/lfs.c
+++ b/lustre/utils/lfs.c
@@ -2329,7 +2329,7 @@ again:
        data->lil_ids[1] = mirror_id;
        rc = llapi_lease_set(fd, data);
        if (rc <= 0) {
-               if (rc == -EINVAL && purge) {
+               if (purge) {
                        /* could be old MDS which prohibit fd==fdv */
                        purge = false;
                        goto again; 
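
In effect, the change relaxes the retry condition: in purge mode (mirror delete without a separate victim file, i.e. fd == fdv), any failure from llapi_lease_set() now falls back to the pre-LU-14521 path that stages the removed mirror in a volatile file, which old servers accept, instead of retrying only on -EINVAL. A minimal standalone illustration of the before/after decision (the helper name and the OLD_BEHAVIOUR switch are hypothetical, purely for comparison):

#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper showing the retry decision around llapi_lease_set()
 * in lfs mirror delete.  Pre-fix, only -EINVAL triggered the fallback, so
 * the -EBUSY returned by a 2.12/2.14 MDS for the fd == fdv case made the
 * delete fail outright in interop; post-fix, any failure retries through
 * the volatile-file path.
 */
static bool retry_with_volatile_file(int rc, bool purge)
{
        if (!purge)
                return false;           /* already on the fallback path */
#ifdef OLD_BEHAVIOUR
        return rc == -EINVAL;           /* misses the old MDS's -EBUSY */
#else
        return rc <= 0;                 /* post-fix: any error falls back */
#endif
}
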
Comment by Andreas Dilger [ 24/Feb/22 ]

Bobijam, rather than pasting the patch here as a comment, could you please submit it as a patch to Gerrit and add Test-Parameters lines to run it in interop mode:

Test-Parameters: trivial serverversion=2.14.0 testlist=sanity env=ONLY="0 50 60 61"
Test-Parameters: clientversion=2.14.0 testlist=sanity env=ONLY="0 50 60 61"
Test-Parameters: serverversion=2.12.8 testlist=sanity env=ONLY="0 50 60 61"
Test-Parameters: clientversion=2.12.8 testlist=sanity env=ONLY="0 50 60 61"
Comment by Gerrit Updater [ 25/Feb/22 ]

"Bobi Jam <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/46614
Subject: LU-15572 util: mirror delete with old MDS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cdd3a1299ca77face6078caa254ffa2b493cbf3e

Comment by Gerrit Updater [ 05/Mar/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46614/
Subject: LU-15572 util: mirror delete with old MDS
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 79f8a4a38c71c7f06ac561ca4cd6d6b9cb5aa4fb

Comment by Peter Jones [ 05/Mar/22 ]

Landed for 2.15

Comment by Sarah Liu [ 02/Aug/22 ]

Still seeing this error in sanity-flr test_50a with 2.15.1 in interop with 2.12.9 el7.9 servers. In the current testing, it seems to have happened only in subtest 50a:
https://testing.whamcloud.com/test_sets/1f2bc891-06de-4859-97b1-f51a7ed6cf2f

trevis-96vm2: trevis-96vm2.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
trevis-96vm1: CMD: trevis-96vm1.trevis.whamcloud.com lctl get_param -n at_max
trevis-96vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
trevis-96vm2: CMD: trevis-96vm2.trevis.whamcloud.com lctl get_param -n at_max
trevis-96vm2: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
lfs mirror mirror: cannot get UNLOCK lease, ext 8: Device or resource busy (16)
/mnt/lustre/d50A.sanity-flr/f50A.sanity-flr layout generation from 2 to 1
 sanity-flr test_50A: @@@@@@ FAIL: split does not increase layout gen from 2 to 1 