Details


    Description

      When running IOR with the patched MPICH to use lockahead, we pretty consistently get size miscompares on some systems, e.g.:
      WARNING: inconsistent file size by different tasks.
      WARNING: Expected aggregate file size       = 5368709120.
      WARNING: Stat() of aggregate file size      = 5103419392.
      WARNING: Using actual aggregate bytes moved = 5368709120.
      This seems to be pretty timing dependent, as we don't see it at all on another system running the same software, and we didn't see it in our original testing, even though the bug is definitely present.

       

      I've identified the bug and found a fix.  During development, Oleg pointed out that it was not necessary to send glimpse callbacks to all locks on a particular resource; instead, we could send one per client, because the client's size check is not lock-specific: it actually gets the size from the upper layers.

      This is true, but there is a caveat that went unnoticed:
      the size is only fetched from the upper layers if l_ast_data is set (see osc_ldlm_glimpse_ast), which is not the case for speculatively requested locks (glimpse, lockahead; see osc_lock_enqueue and osc_enqueue_base) until they are used for I/O.

      This means that if one client requests, say, two 1 MiB locks, writes into the first of them, and another client then stats the file, the server will only send a glimpse callback for the highest lock (the second one).*

      This higher lock has not been used for I/O and therefore does not have l_ast_data set, so the part of the glimpse callback that gets the size from the clio layers does not run, and the second client will see a file size of zero.

      *Note that if we wait long enough, the write associated with the first lock will be flushed, the server will have an up-to-date size, and it will return the correct value to the client.  This is part of the reason the problem is timing dependent.

      The fix is for the client, in the glimpse AST, to walk the granted lock list looking for a lock with l_ast_data set.  If none is found, then either no writes actually used these locks, or the object is being destroyed; either way, this client doesn't have useful size information.
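      In code, the idea looks roughly like the following.  This is a minimal sketch, not necessarily the landed patch; the function name is illustrative, and it assumes the usual LDLM layout (the resource's granted list lr_granted, linked through l_res_link) and that the caller already holds the resource lock.

      static struct ldlm_lock *find_lock_with_ast_data(struct ldlm_resource *res)
      {
              struct ldlm_lock *lock;

              /* A lock that has been used for I/O has l_ast_data set and can
               * therefore reach the clio layers to get the current size. */
              list_for_each_entry(lock, &res->lr_granted, l_res_link)
                      if (lock->l_ast_data != NULL)
                              return lock;

              /* No granted lock on this resource was used for I/O (or the
               * object is being destroyed), so this client has no useful
               * size information to report. */
              return NULL;
      }

      The glimpse AST would then use whichever lock this returns (if any) to fetch the size, and otherwise reply without size information.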

       

      Patch forthcoming momentarily.

       

      I'll leave this up to WC whether or not this should be a blocker, but it's probably worth considering as one.

          Activity

            [LU-11670] Incorrect size when using lockahead

            gerrit Gerrit Updater added a comment -

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36406
            Subject: LU-11670 osc: glimpse - search for active lock
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 7df7f1d5b2bf01643cdd55d6ed9fccaaf98a2c42

            pjones Peter Jones added a comment -

            ok - the correction to the patch has landed too


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36303/
            Subject: LU-11670 tests: do not fail the first half in sanityn test 103
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a88d0aa76c62e3074a05e869c9f0ba7ac128300f


            adilger Andreas Dilger added a comment -

            NB: about 40% of the failures also appear to be on ldiskfs, not just ZFS.


            gerrit Gerrit Updater added a comment -

            Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36303
            Subject: LU-11670 tests: do not fail the first half in sanityn test 103
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5c382fc06b013044e5bc16c20d648ac467e6aa5e


            paf0186 Patrick Farrell added a comment -

            Hi Jian,

            Ah, that's better than my suggestion.   Yeah, I like that idea.

            yujian Jian Yu added a comment -

            Hi Patrick,
            What about just making the following change, so as to keep the first half of the test and just print an informational message when the problem cannot be reproduced?

            sanityn.sh
            diff --git a/lustre/tests/sanityn.sh b/lustre/tests/sanityn.sh
            index 7d4007534b..ae47c09d55 100755
            --- a/lustre/tests/sanityn.sh
            +++ b/lustre/tests/sanityn.sh
            @@ -4768,7 +4768,7 @@ test_103() {
                    lockahead_test -d $DIR/$tdir -D $DIR2/$tdir -t $testnum -f $tfile
                    rc=$?
                    if [ $rc -eq 0 ]; then
            -               error "Lockahead test $testnum passed with fail_loc set, ${rc}"
            +               echo "Lockahead test $testnum passed with fail_loc set, ${rc}"
                    fi
             
                    # guarantee write commit timeout has expired
            
            yujian Jian Yu added a comment -

            I'm working on the patch to remove the first part of the test from sanityn test 103.


            paf0186 Patrick Farrell added a comment -

            Oh, yuck.  I checked a few of those, and they all appear to be ZFS.

            So presumably it's some sort of very long sync delay, or possibly the ZFS periodic commit interval is triggering in the middle of the test...  Can't quite nail what it is right now.  But there's an easy way out.

             

            This test is a little weird, as it has two halves so it can sort of self-verify...  The first half proves the problem exists, by showing that if you disable the fix, the bug occurs.  The second half tests that the problem does not occur with the fix in place.

            The first half, the one that shows the bug exists, is the only one that's failing.  Basically, we are unexpectedly succeeding sometimes.

            And while I like having a self-verifying test, the truth is very few of our tests are that way.

            So a quick & simple solution that would not overly compromise things would be to remove everything after test_mkdir (line 4753 here https://review.whamcloud.com/#/c/33660/31/lustre/tests/sanityn.sh ) up to (but not including) line 4780, where we start setting the fail_loc.

            (I'm not going to push this patch at the moment, but I'd be happy to review it.)

            That would leave you with only the "positive" test, removing the "negative" (i.e., prove it's broken without the fix) one, which is the only one that's failing.

            If someone really wants to debug this, I would add a printf with the stat information to lockahead_test.c, rather than only printing stat info when the sizes disagree...  See whether the "agreeing" size on both clients is 0 or 1 MiB.  And, basically, dig through the ZFS stuff and see how the different ZFS sync behavior is conflicting with the test.
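            For reference, such a debugging addition might look roughly like the sketch below.  This is illustrative only; the helper and variable names are hypothetical and not taken from lockahead_test.c.

            #include <stdio.h>
            #include <sys/stat.h>

            /* Hypothetical helper: stat the same file through both mount points
             * and print both sizes unconditionally, instead of printing them
             * only when they disagree. */
            static int print_both_sizes(const char *path1, const char *path2)
            {
                    struct stat st1, st2;

                    if (stat(path1, &st1) < 0 || stat(path2, &st2) < 0)
                            return -1;

                    printf("%s size=%lld, %s size=%lld\n",
                           path1, (long long)st1.st_size,
                           path2, (long long)st2.st_size);
                    return 0;
            }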

            adilger Andreas Dilger added a comment -

            Incorrect size expected (no glimpse fix):
            Starting test test23 at 1569352103
            Finishing test test23 at 1569352106
             sanityn test_103: @@@@@@ FAIL: Lockahead test 23 passed with fail_loc set, 0 
            
            pjones Peter Jones added a comment -

            paf any thoughts?

