[LU-15572] Interop sanity-flr failing with "cannot get UNLOCK lease"
| Created: | 20/Feb/22 | Updated: | 21/Dec/22 | Resolved: | 05/Mar/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0, Lustre 2.15.1 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Andreas Dilger | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Running sanity-flr in interop mode fails a number of subtests with:

lfs mirror: cannot get UNLOCK lease, ext 8: Device or resource busy (16) |
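
For context on where this message comes from: lfs mirror split takes a lease on the file, then releases it through the lease ioctl while asking the MDS to perform the layout split in the same operation. The sketch below is a simplified illustration of that path, assuming the llapi lease calls (llapi_lease_acquire() and llapi_lease_set()) and the LL_LEASE_* constants from lustre_user.h; it is not the exact lustre/utils/lfs.c code. The "ext 8" in the error text is read here as the lil_flags value, i.e. the LL_LEASE_LAYOUT_SPLIT request.

#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lustre/lustreapi.h>

/* Sketch only: error handling and cleanup trimmed. */
static int mirror_split_sketch(int fd, int fdv, unsigned int mirror_id)
{
	struct ll_ioc_lease *data;
	int rc;

	/* Take a lease on the file so a concurrent open/write is detected. */
	rc = llapi_lease_acquire(fd, LL_LEASE_RDLCK);
	if (rc < 0)
		return rc;

	/* Release the lease and request the layout split in one ioctl;
	 * "ext 8" in the error message is this lil_flags value. */
	data = calloc(1, offsetof(struct ll_ioc_lease, lil_ids[2]));
	if (data == NULL)
		return -ENOMEM;
	data->lil_mode = LL_LEASE_UNLCK;
	data->lil_flags = LL_LEASE_LAYOUT_SPLIT;
	data->lil_count = 2;
	data->lil_ids[0] = fdv;		/* victim file descriptor */
	data->lil_ids[1] = mirror_id;	/* mirror to split out */

	rc = llapi_lease_set(fd, data);
	if (rc <= 0) {
		if (rc == 0)	/* lease was broken before the close */
			rc = -EBUSY;
		fprintf(stderr, "cannot get UNLOCK lease, ext %u: %s (%d)\n",
			data->lil_flags, strerror(-rc), -rc);
	}
	free(data);
	return rc > 0 ? 0 : rc;
}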
| Comments |
| Comment by Andreas Dilger [ 20/Feb/22 ] |
|
Temporarily bumping this up to a blocker to ensure it is reviewed. This appears to be a real interop issue between 2.14.0 and 2.15.0, and not only a test script issue. |
| Comment by Andreas Dilger [ 20/Feb/22 ] |
|
This may have been introduced with the landing of patch https://review.whamcloud.com/45636. |
| Comment by Andreas Dilger [ 20/Feb/22 ] |
|
This affects sanity-flr test_0b, test_50a, test_60b, test_61b, and test_203. |
| Comment by Andreas Dilger [ 20/Feb/22 ] |
|
Bobijam, could you please take a look at this. It looks like more than just a test script issue, but an actual interop bug with FLR, since it is failing in a variety of different tests. |
| Comment by Andreas Dilger [ 20/Feb/22 ] |
|
This is also causing sanity test_272d, test_272e, and test_272f to fail in interop testing. |
| Comment by Zhenyu Xu [ 24/Feb/22 ] |
|
Could be an interop issue between a 2.15.0 client and a 2.14.0 MDS on the mirror split operation. The failing subtests are:

test_0b:
test_50A:
test_60b:
test_61b:
test_203:

Take test_60b as an example. On the MDS the request is handled with rc 0/0:

00000100:00100000:1.0:1645359656.359528:0:991126:0:(service.c:2142:ptlrpc_server_handle_req_in()) got req x1725282051462784
00000100:00100000:1.0:1645359656.359542:0:991126:0:(service.c:2299:ptlrpc_server_handle_request()) Handling RPC req@0000000030f1e911 pname:cluuid+ref:pid:xid:nid:opc:job mdt_rdpg00_001:67bf88ef-fd73-4c00-9a58-d65897456138+35:1227937:x1725282051462784:12345-10.240.25.52@tcp:35:lfs.0
...
00000100:00100000:1.0:1645359656.362030:0:991126:0:(service.c:2348:ptlrpc_server_handle_request()) Handled RPC req@0000000030f1e911 pname:cluuid+ref:pid:xid:nid:opc:job mdt_rdpg00_001:67bf88ef-fd73-4c00-9a58-d65897456138+32:1227937:x1725282051462784:12345-10.240.25.52@tcp:35:lfs.0 Request processed in 2487us (2537us total) trans 85899346091 rc 0/0

while on the client I don't see the failure reason:

00000080:00200000:1.0:1645359656.359344:0:1227937:0:(file.c:3896:ll_file_ioctl()) VFS Op:inode=[0x2000090a6:0x4e:0x0](00000000a9e821ea), cmd=c00c66f3
00000002:00100000:1.0:1645359656.359354:0:1227937:0:(mdc_request.c:944:mdc_close()) @@@ matched open req@00000000d7fc9409 x1725282051460736/t85899346086(85899346086) o101->lustre-MDT0000-mdc-ffff913d103c9000@10.240.25.55@tcp:12/10 lens 576/656 e 0 to 0 dl 1645359663 ref 1 fl Complete:RPQU/4/ffffffff rc 0/-1 job:'lfs.0'
00000100:00100000:1.0:1645359656.359372:0:1227937:0:(client.c:742:ptlrpc_reassign_next_xid()) @@@ reassign xid req@000000008e2795d6 x1725282051462784/t0(0) o35->lustre-MDT0000-mdc-ffff913d103c9000@10.240.25.55@tcp:23/10 lens 504/1072 e 0 to 0 dl 0 ref 1 fl New:QU/0/ffffffff rc 0/-1 job:''
00000100:00100000:1.0:1645359656.359378:0:1227937:0:(client.c:1733:ptlrpc_send_new_req()) Sending RPC req@000000008e2795d6 pname:cluuid:pid:xid:nid:opc:job lfs:67bf88ef-fd73-4c00-9a58-d65897456138:1227937:1725282051462784:10.240.25.55@tcp:35:lfs.0
...
00000100:00100000:1.0:1645359656.362161:0:1227937:0:(client.c:2210:ptlrpc_check_set()) Completed RPC req@000000008e2795d6 pname:cluuid:pid:xid:nid:opc:job lfs:67bf88ef-fd73-4c00-9a58-d65897456138:1227937:1725282051462784:10.240.25.55@tcp:35:lfs.0
00000002:00100000:1.0:1645359656.362168:0:1227937:0:(mdc_request.c:854:mdc_free_open()) @@@ free open request, rq_replay=0 req@00000000d7fc9409 x1725282051460736/t85899346086(85899346086) o101->lustre-MDT0000-mdc-ffff913d103c9000@10.240.25.55@tcp:12/10 lens 576/656 e 0 to 0 dl 1645359663 ref 1 fl Complete:RQU/4/ffffffff rc 0/-1 job:'lfs.0'
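
Two details that help when reading these traces: opc 35 is the MDS close RPC, which is what carries the lease release to the MDS, and cmd=c00c66f3 in the ll_file_ioctl() line is consistent with the lease-set ioctl if one assumes LL_IOC_SET_LEASE is defined as _IOWR('f', 243, struct ll_ioc_lease) with a 12-byte fixed header. A quick stand-alone check of that decoding:

#include <stdio.h>
#include <sys/ioctl.h>

/* Fixed header of struct ll_ioc_lease: three __u32 fields (lil_mode,
 * lil_flags, lil_count) ahead of the flexible lil_ids[] array. */
struct lease_hdr {
	unsigned int lil_mode;
	unsigned int lil_flags;
	unsigned int lil_count;
};

int main(void)
{
	/* Prints c00c66f3, matching the cmd= value in the VFS Op line above. */
	printf("%lx\n", (unsigned long)_IOWR('f', 243, struct lease_hdr));
	return 0;
}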
| Comment by Zhenyu Xu [ 24/Feb/22 ] |
|
The patch https://review.whamcloud.com/45636 " |
| Comment by Zhenyu Xu [ 24/Feb/22 ] |
|
I think it could be patch https://review.whamcloud.com/42116. |
| Comment by Zhenyu Xu [ 24/Feb/22 ] |
|
We could use the following patch for 2.15 to catch the older MDS's ambiguous error code:

diff --git a/lustre/utils/lfs.c b/lustre/utils/lfs.c
index 3a1511fe05..e5d245af3e 100644
--- a/lustre/utils/lfs.c
+++ b/lustre/utils/lfs.c
@@ -2329,7 +2329,7 @@ again:
 	data->lil_ids[1] = mirror_id;
 	rc = llapi_lease_set(fd, data);
 	if (rc <= 0) {
-		if (rc == -EINVAL && purge) {
+		if (purge) {
 			/* could be old MDS which prohibit fd==fdv */
 			purge = false;
 			goto again;
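
The effect of the hunk is to stop keying the fallback on one specific errno: an older MDS does not necessarily return -EINVAL when it rejects the fd == fdv shortcut, so with the old check the retry was never taken and lfs surfaced the raw error instead. The following is only an illustration of the retry shape under that assumption; the purge flag, the again: label and the open_victim_file() helper are paraphrased or hypothetical, not the actual lfs.c code.

#include <errno.h>
#include <stdbool.h>
#include <lustre/lustreapi.h>

/* Hypothetical helper: open/create the file that receives the split-off
 * mirror when the "destroy" shortcut cannot be used. Stubbed for the sketch. */
static int open_victim_file(int fd)
{
	(void)fd;
	return -1;
}

static int split_with_fallback(int fd, unsigned int mirror_id,
			       struct ll_ioc_lease *data)
{
	bool purge = true;	/* first try the shortcut with no victim file */
	int fdv = fd;		/* fd == fdv: the split-off mirror is destroyed */
	int rc;

again:
	data->lil_mode = LL_LEASE_UNLCK;
	data->lil_flags = LL_LEASE_LAYOUT_SPLIT;
	data->lil_count = 2;
	data->lil_ids[0] = fdv;
	data->lil_ids[1] = mirror_id;

	rc = llapi_lease_set(fd, data);
	if (rc <= 0 && purge) {
		/* Could be an old MDS that prohibits fd == fdv; after the
		 * patch the retry no longer depends on which errno came back. */
		purge = false;
		fdv = open_victim_file(fd);
		if (fdv >= 0)
			goto again;
	}
	return rc > 0 ? 0 : (rc == 0 ? -EBUSY : rc);
}

The trade-off is that retrying on every failure can hide a genuine error behind a second attempt, but it avoids guessing which errno each server generation uses for this case.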
| Comment by Andreas Dilger [ 24/Feb/22 ] |
|
Bobijam, rather than pasting the patch here as a comment, could you please submit it as a patch to Gerrit and add Test-Parameters lines to run it in interop mode:

Test-Parameters: trivial serverversion=2.14.0 testlist=sanity env=ONLY="0 50 60 61"
Test-Parameters: clientversion=2.14.0 testlist=sanity env=ONLY="0 50 60 61"
Test-Parameters: serverversion=2.12.8 testlist=sanity env=ONLY="0 50 60 61"
Test-Parameters: clientversion=2.12.8 testlist=sanity env=ONLY="0 50 60 61"
| Comment by Gerrit Updater [ 25/Feb/22 ] |
|
"Bobi Jam <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/46614 |
| Comment by Gerrit Updater [ 05/Mar/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46614/ |
| Comment by Peter Jones [ 05/Mar/22 ] |
|
Landed for 2.15 |
| Comment by Sarah Liu [ 02/Aug/22 ] |
|
Still seeing this error in sanity-flr test_50a in 2.15.1 interop testing with 2.12.9 el7.9 servers. In the current testing it seems to have happened only in subtest 50a:

trevis-96vm2: trevis-96vm2.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
trevis-96vm1: CMD: trevis-96vm1.trevis.whamcloud.com lctl get_param -n at_max
trevis-96vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
trevis-96vm2: CMD: trevis-96vm2.trevis.whamcloud.com lctl get_param -n at_max
trevis-96vm2: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
lfs mirror mirror: cannot get UNLOCK lease, ext 8: Device or resource busy (16)
/mnt/lustre/d50A.sanity-flr/f50A.sanity-flr layout generation from 2 to 1
sanity-flr test_50A: @@@@@@ FAIL: split does not increase layout gen from 2 to 1