Lustre / LU-15339

sanity test 208 fails with 'lease not broken over recovery'


Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0

    Description

      During testing on some internal branches, we've observed sanity test 208 failing.  It turns out this is a timing issue.

      Looking at the test, we see:

      $MULTIOP $DIR/$tfile oO_RDWR:eR_E-eUc &
      PID=$!
      sleep 1
      
      # open file to break lease and then recovery
      $MULTIOP $DIR/$tfile oO_RDWR:c || error "open file error"
      fail mds1

      This is one multiop requesting the lease, then a 1-second wait, then the other attempting an open, which breaks the lease.

      In the test logs, we see:

      ==== test 5: lease broken can't be regained by replay
      multiop: cannot get READ lease, ext 0: Device or resource busy (16)
      multiop: apply/unlock lease error: Device or resource busy
      Failing mds1 on onyx-50vm2 

      So this is the first multiop getting EBUSY, which it would get if the file were already open.
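
      As a hedged sketch (not part of the test), the same EBUSY can be provoked by letting the two multiop invocations race with no delay at all; the conflicting open then reaches the MDS while the lease request is still being processed:

      # Sketch only: same multiop commands as the test, but with the sleep removed
      $MULTIOP $DIR/$tfile oO_RDWR:eR_E-eUc &   # lease requester
      PID=$!
      $MULTIOP $DIR/$tfile oO_RDWR:c            # conflicting open, issued immediately
      kill $PID 2>/dev/null                     # best-effort cleanup for this sketch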

      Looking at the debug logs, we see two processes working on the file, 3066 and 3068.  Obviously 3066 was started earlier.
      We see 3066 opening the file, getting an intent lock, etc., then requesting the lease.  Sometime during that earlier intent lock request, messages from 3068 start appearing, meaning it has started.  This ioctl is the lease request:

      00000080:00200000:0.0:1638330788.087774:0:3066:0:(file.c:3518:ll_file_ioctl()) VFS Op:inode=[0x200005222:0x27a6:0x0](ffff9fddbb55f098), cmd=c00c66f3
      00000002:00010000:0.0:1638330788.087789:0:3066:0:(mdc_locks.c:1228:mdc_intent_lock()) (name: ,[0x200005222:0x27a6:0x0]) in obj [0x200005222:0x27a6:0x0], intent: open flags 01004040000001
      [...]
      00000100:00100000:0.0:1638330788.087812:0:3066:0:(client.c:1685:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc multiop:e3dc4b61-17a3-a1d0-026f-41b421fc661f:3066:1717912348006336:10.240.24.89@tcp:101
      00000100:00100000:0.0:1638330788.087830:0:3066:0:(client.c:2409:ptlrpc_set_wait()) set ffff9fdd603fc880 going to sleep for 13 seconds

      But before that can complete, we see the open from 3068 complete:

      00000100:00100000:0.0:1638330788.087874:0:3068:0:(client.c:2115:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc multiop:e3dc4b61-17a3-a1d0-026f-41b421fc661f:3068:1717912348006272:10.240.24.89@tcp:101
      [...]
      00000002:00100000:0.0:1638330788.087906:0:3068:0:(mdc_request.c:801:mdc_set_open_replay_data()) @@@ Set up open replay data  req@ffff9fdd55dc0000 x1717912348006272/t51539607555(51539607555) o101->lustre-MDT0000-mdc-ffff9fdd58dea800@10.240.24.89@tcp:12/10 lens 752/616 e 0 to 0 dl 1638330803 ref 2 fl Complete:RP/4/0 rc 0/0  

      The first message logged by 3066 is at 1638330787.082310, and as we can see above, the intent lock request does not start until 1638330788.087774 - just over a second later.
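
      For reference, the gap between those two timestamps can be computed directly; anything over the test's 1-second sleep is enough to lose the race:

      $ echo "1638330788.087774 - 1638330787.082310" | bc
      1.005464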

      So this is just a pure timing issue.  The solution should be to increase the sleeps in this test from 1 second to 2 seconds.
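
      A sketch of that change against the snippet quoted above (the actual patch may differ in its details):

      $MULTIOP $DIR/$tfile oO_RDWR:eR_E-eUc &
      PID=$!
      sleep 2                                   # was: sleep 1

      # open file to break lease and then recovery
      $MULTIOP $DIR/$tfile oO_RDWR:c || error "open file error"
      fail mds1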


          People

            Assignee: paf0186 Patrick Farrell (Inactive)
            Reporter: paf0186 Patrick Farrell (Inactive)
