[LU-15339] sanity test 208 fails with 'lease not broken over recovery' Created: 07/Dec/21  Updated: 18/Jan/22  Resolved: 23/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Improvement Priority: Minor
Reporter: Patrick Farrell Assignee: Patrick Farrell
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related

 Description   

During testing on some internal branches, we've observed sanity test 208 failing.  It turns out this is a timing issue.

Looking at the test, we see:

$MULTIOP $DIR/$tfile oO_RDWR:eR_E-eUc &
PID=$!
sleep 1

# open file to break lease and then recovery
$MULTIOP $DIR/$tfile oO_RDWR:c || error "open file error"
fail mds1

This is one multiop requesting the lease in the background, then a 1 second wait, then a second multiop performing an open that breaks the lease.
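
For reference, here is my reading of the first multiop command string (this decode is my own interpretation of multiop's usage text, not something stated in the test, so treat it as an assumption):

# oO_RDWR : open the file with O_RDWR
# eR      : apply (request) a read lease on the open file descriptor
# _       : pause and wait for a signal before continuing
# E-      : verify the lease is no longer held, i.e. it was broken
# eU      : unlock/release the lease
# c       : close the file
$MULTIOP $DIR/$tfile oO_RDWR:eR_E-eUc &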

In the test logs, we see:

==== test 5: lease broken can't be regained by replay
multiop: cannot get READ lease, ext 0: Device or resource busy (16)
multiop: apply/unlock lease error: Device or resource busy
Failing mds1 on onyx-50vm2 

So the first multiop is getting EBUSY on its lease request, which is exactly what it would get if the file were already open by another process.

Looking at the debug logs, we see two processes working on the file, 3066 and 3068; 3066 is obviously the one that was started first.
We see 3066 open the file, obtain an intent lock, etc., and then request the lease.  Partway through that earlier intent lock request we start seeing messages from 3068, meaning the second multiop is already running.  The following ioctl is the lease request:

00000080:00200000:0.0:1638330788.087774:0:3066:0:(file.c:3518:ll_file_ioctl()) VFS Op:inode=[0x200005222:0x27a6:0x0](ffff9fddbb55f098), cmd=c00c66f3
00000002:00010000:0.0:1638330788.087789:0:3066:0:(mdc_locks.c:1228:mdc_intent_lock()) (name: ,[0x200005222:0x27a6:0x0]) in obj [0x200005222:0x27a6:0x0], intent: open flags 01004040000001
[...]
00000100:00100000:0.0:1638330788.087812:0:3066:0:(client.c:1685:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc multiop:e3dc4b61-17a3-a1d0-026f-41b421fc661f:3066:1717912348006336:10.240.24.89@tcp:101
00000100:00100000:0.0:1638330788.087830:0:3066:0:(client.c:2409:ptlrpc_set_wait()) set ffff9fdd603fc880 going to sleep for 13 seconds

But before that lease request can finish, we see the open from 3068 complete:

00000100:00100000:0.0:1638330788.087874:0:3068:0:(client.c:2115:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc multiop:e3dc4b61-17a3-a1d0-026f-41b421fc661f:3068:1717912348006272:10.240.24.89@tcp:101
[...]
00000002:00100000:0.0:1638330788.087906:0:3068:0:(mdc_request.c:801:mdc_set_open_replay_data()) @@@ Set up open replay data  req@ffff9fdd55dc0000 x1717912348006272/t51539607555(51539607555) o101->lustre-MDT0000-mdc-ffff9fdd58dea800@10.240.24.89@tcp:12/10 lens 752/616 e 0 to 0 dl 1638330803 ref 2 fl Complete:RP/4/0 rc 0/0  

The first message logged by 3066 is at 1638330787.082310, and, as seen above, its intent lock request does not start until 1638330788.087774, just over a second later.
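
As an aside, comparing timestamps across the two PIDs is easier after splitting the dumped debug log per PID.  A minimal sketch, assuming the log was dumped on the client with lctl dk and using illustrative output paths (in the colon-separated debug lines, field 4 is the timestamp and field 6 is the PID):

lctl dk > /tmp/debug.log        # dump the kernel debug buffer to a file
for pid in 3066 3068; do
        # keep only the lines logged by this PID; they are already in timestamp order
        awk -F: -v p=$pid '$6 == p' /tmp/debug.log > /tmp/pid-$pid.log
done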

So this is purely a timing issue: the 1 second sleep is not always enough for the first multiop to acquire the lease before the second multiop opens the file, so the lease request arrives after the conflicting open and fails with EBUSY.  The solution is to increase the timeout in this test to 2 seconds.
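
A minimal sketch of the intended change, assuming the fix is simply lengthening the sleep before the conflicting open (the exact edit is whatever the Gerrit patch below actually does):

$MULTIOP $DIR/$tfile oO_RDWR:eR_E-eUc &
PID=$!
sleep 2   # was "sleep 1"; give the background multiop a full 2 seconds to
          # reach its lease request and have it granted before the next open

# open file to break lease and then recovery
$MULTIOP $DIR/$tfile oO_RDWR:c || error "open file error"
fail mds1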



 Comments   
Comment by Gerrit Updater [ 07/Dec/21 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45779
Subject: LU-15339 tests: Increase timeout in sanity 208
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b3620920b75b25620e3570bfd7cc4c491aecacdb

Comment by Gerrit Updater [ 23/Dec/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45779/
Subject: LU-15339 tests: Increase timeout in sanity 208
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dc015fc0b51b95151366b0355cfc90b068d98b01

Comment by Peter Jones [ 23/Dec/21 ]

Landed for 2.15
