[LU-11670] Incorrect size when using lockahead Created: 15/Nov/18  Updated: 26/Oct/20  Resolved: 04/Oct/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.4

Type: Bug Priority: Critical
Reporter: Patrick Farrell (Inactive) Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13645 Various data corruptions possible in ... Resolved
is related to LU-13089 ASSERTION( (((( lock))->l_flags & (1U... Resolved
is related to LU-11719 Refactor search_itree Resolved
is related to LU-13149 Interop: sanityn test 103 fails with ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When running IOR with the patched MPICH that uses lockahead, we pretty consistently get size miscompares on some systems, e.g.:
WARNING: inconsistent file size by different tasks.
WARNING: Expected aggregate file size       = 5368709120.
WARNING: Stat() of aggregate file size      = 5103419392.
WARNING: Using actual aggregate bytes moved = 5368709120.
This seems to be quite timing-dependent: we don't see it at all on another system running the same software, and we didn't see it in our original testing, even though the bug is definitely present.

 

I've identified the bug and found a fix.  During development, Oleg pointed out that it was not necessary to send glimpse callbacks to all locks on a particular resource; we could send one per client instead, because the client size check is not lock-specific - it actually gets the size from the upper layers.

This is true, but there is a caveat that went unnoticed:
The size check only happens if l_ast_data is set on the lock (see osc_ldlm_glimpse_ast), and l_ast_data is not set for speculatively requested locks (glimpse, lockahead - see osc_lock_enqueue & osc_enqueue_base) until they are used for I/O.

This means that if one client requests, say, two 1 MiB locks, then writes into the first of them, and another client stats the file, the server will only send a glimpse to the highest lock.*

This higher lock has not been used for I/O and therefore does not have l_ast_data set, so the part of the glimpse callback that gets the size from the clio layers does not run, and the second client will see a file size of zero.

*Note that if we wait long enough, the write associated with the first lock will be flushed, the server will have an up-to-date size, and it will return the correct value to the client.  This is part of the reason the problem is timing-dependent.

The fix is for the client, in the glimpse AST, to walk the granted lock list looking for a lock with l_ast_data set.  If none is found, then either no writes actually used these locks, or the object is being destroyed - either way, this client doesn't have useful size information.
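
To make the shape of that fix concrete, here is a minimal C sketch of the idea.  This is illustrative only, not the landed patch: the helper name is made up, and it assumes the resource's granted locks live on lr_granted with l_res_link as the list linkage and lock_res()/unlock_res() protecting the list, as elsewhere in ldlm.

/* Illustrative sketch only, not the landed change: in the glimpse AST,
 * scan the locks this client holds on the resource and report size only
 * if at least one of them has actually been used for I/O (l_ast_data set).
 */
static bool glimpse_lock_is_active(struct ldlm_resource *res)
{
	struct ldlm_lock *lock;
	bool found = false;

	lock_res(res);		/* protect the granted list */
	list_for_each_entry(lock, &res->lr_granted, l_res_link) {
		if (lock->l_ast_data != NULL) {
			/* some lock on this resource was used for I/O,
			 * so the clio layers have a size worth reporting */
			found = true;
			break;
		}
	}
	unlock_res(res);

	return found;
}

Checking per resource rather than per lock is consistent with the server-side optimization of sending only one glimpse per client.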

 

Patch forthcoming momentarily.

 

I'll leave it up to WC whether this should be a blocker, but it's probably worth considering as one.



 Comments   
Comment by Gerrit Updater [ 15/Nov/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33660
Subject: LU-11670 osc: glimpse - search for active lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ca392f75798a8d06727c98210d1bd5c30866deb5

Comment by Gerrit Updater [ 20/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33660/
Subject: LU-11670 osc: glimpse - search for active lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b3461d11dcb04670cc2e1bfbb99306cfd3f645ef

Comment by Peter Jones [ 20/Sep/19 ]

Landed for 2.13

Comment by Andreas Dilger [ 25/Sep/19 ]

The new sanityn test_103 is causing intermittent test failures on master since this patch has landed:
https://testing.whamcloud.com/sub_tests/6e7ab5ca-dbf2-11e9-b62b-52540065bddc
https://testing.whamcloud.com/sub_tests/5465b1be-dc86-11e9-add9-52540065bddc
https://testing.whamcloud.com/sub_tests/7fb92682-de13-11e9-be86-52540065bddc
https://testing.whamcloud.com/sub_tests/988e7d56-de18-11e9-a197-52540065bddc
https://testing.whamcloud.com/sub_tests/b8faf26a-de1b-11e9-add9-52540065bddc
https://testing.whamcloud.com/sub_tests/ea74ad9e-de49-11e9-be86-52540065bddc
https://testing.whamcloud.com/sub_tests/b50fb5a0-de5c-11e9-be86-52540065bddc
https://testing.whamcloud.com/sub_tests/df3dabbc-def6-11e9-be86-52540065bddc
https://testing.whamcloud.com/sub_tests/70b6fa24-def9-11e9-9874-52540065bddc
https://testing.whamcloud.com/sub_tests/b26d3222-df11-11e9-b62b-52540065bddc

Comment by Peter Jones [ 25/Sep/19 ]

paf any thoughts?

Comment by Andreas Dilger [ 25/Sep/19 ]
Incorrect size expected (no glimpse fix):
Starting test test23 at 1569352103
Finishing test test23 at 1569352106
 sanityn test_103: @@@@@@ FAIL: Lockahead test 23 passed with fail_loc set, 0 
Comment by Patrick Farrell [ 25/Sep/19 ]

Oh, yuck.  I checked a few of those, and they all appear to be ZFS.

So presumably it's some sort of very long sync delay, or possibly the ZFS periodic commit interval is triggering in the middle of the test...  Can't quite nail down what it is right now.  But there's an easy way out.

 

This test is a little weird, as it has two halves so it can sort of self-verify...  The first half proves the problem exists, by showing that if you disable the fix, the bug occurs.  The second half tests that the problem does not occur with the fix in place.

The first half, the one that shows the bug exists, is the only one that's failing.  Basically, we are unexpectedly succeeding sometimes.

And while I like having a self-verifying test, the truth is very few of our tests are that way.

So a quick & simple solution that would not overly compromise things would be to remove everything after test_mkdir (line 4753 here https://review.whamcloud.com/#/c/33660/31/lustre/tests/sanityn.sh ) up to (but not including) line 4780, where we start setting the fail_loc.

(I'm not going to push this patch at the moment, but I'd be happy to review it.)

That would leave you with only the "positive" test, removing the "negative" one (i.e., the one that proves it's broken without the fix), which is the only one that's failing.

If someone really wants to debug this, I would add a printf with the stat information into lockahead_test.c, rather than only printing stat info when it disagrees...  See whether the "agreeing" size on both clients is 0 or 1 MiB.  And, basically, dig through the ZFS stuff and see how the different ZFS sync behavior is conflicting with the test.
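
A debugging aid along those lines might look like the following; the helper name and paths are hypothetical placeholders, not the actual lockahead_test.c code.

/* Hypothetical debug print, as suggested above: always report the size
 * observed through each mount point instead of only printing on a
 * miscompare.  Names and arguments are placeholders. */
#include <stdio.h>
#include <sys/stat.h>

static void report_sizes(const char *path1, const char *path2)
{
	struct stat st1, st2;

	if (stat(path1, &st1) == 0 && stat(path2, &st2) == 0)
		printf("mount1 sees %lld bytes, mount2 sees %lld bytes\n",
		       (long long)st1.st_size, (long long)st2.st_size);
}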

Comment by Jian Yu [ 26/Sep/19 ]

I'm working on the patch to remove the first part of the test from sanityn test 103.

Comment by Jian Yu [ 26/Sep/19 ]

Hi Patrick,
What about just making the following change, so as to keep the first half of the test and just print an informational message when the problem cannot be reproduced?

sanityn.sh
diff --git a/lustre/tests/sanityn.sh b/lustre/tests/sanityn.sh
index 7d4007534b..ae47c09d55 100755
--- a/lustre/tests/sanityn.sh
+++ b/lustre/tests/sanityn.sh
@@ -4768,7 +4768,7 @@ test_103() {
        lockahead_test -d $DIR/$tdir -D $DIR2/$tdir -t $testnum -f $tfile
        rc=$?
        if [ $rc -eq 0 ]; then
-               error "Lockahead test $testnum passed with fail_loc set, ${rc}"
+               echo "Lockahead test $testnum passed with fail_loc set, ${rc}"
        fi
 
        # guarantee write commit timeout has expired
Comment by Patrick Farrell [ 26/Sep/19 ]

Hi Jian,

Ah, that's better than my suggestion.   Yeah, I like that idea.

Comment by Gerrit Updater [ 26/Sep/19 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36303
Subject: LU-11670 tests: do not fail the first half in sanityn test 103
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5c382fc06b013044e5bc16c20d648ac467e6aa5e

Comment by Andreas Dilger [ 01/Oct/19 ]

NB: about 40% of the failures also appear to be on ldiskfs, not just ZFS.

Comment by Gerrit Updater [ 04/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36303/
Subject: LU-11670 tests: do not fail the first half in sanityn test 103
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a88d0aa76c62e3074a05e869c9f0ba7ac128300f

Comment by Peter Jones [ 04/Oct/19 ]

ok - the correction to the patch has landed too

Comment by Gerrit Updater [ 08/Oct/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36406
Subject: LU-11670 osc: glimpse - search for active lock
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7df7f1d5b2bf01643cdd55d6ed9fccaaf98a2c42

Comment by Gerrit Updater [ 08/Oct/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36407
Subject: LU-11670 tests: do not fail the first half in sanityn test 103
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: ef91b632b93e8ea57b4fe41ee2e9b1306d48fb8f

Comment by Gerrit Updater [ 12/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36406/
Subject: LU-11670 osc: glimpse - search for active lock
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 2548cb9e32bfca897de577f88836629f72641369

Comment by Gerrit Updater [ 12/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36407/
Subject: LU-11670 tests: do not fail the first half in sanityn test 103
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 9292c2a0d1159d3ab47a190799f73e1e1a09f77c
