Details


    Description

      When running IOR with the patched MPICH to use lockahead, we pretty consistently get size miscompares on some systems, e.g.:
      WARNING: inconsistent file size by different tasks.
      WARNING: Expected aggregate file size       = 5368709120.
      WARNING: Stat() of aggregate file size      = 5103419392.
      WARNING: Using actual aggregate bytes moved = 5368709120.
      This seems to be pretty timing dependent, as we don't see it at all on another system running the same software, and we didn't see it in our original testing, even though the bug is definitely present.

       

      I've identified the bug and found a fix.  During development, Oleg pointed out that it was not necessary to send glimpse callbacks to all locks on a particular resource; instead, we could send one per client, because the client's size check is not lock-specific: it actually gets the size from the upper layers.

      This is true, but there is a caveat that went unnoticed:
      the size is only fetched from the upper layers if l_ast_data is set (see osc_ldlm_glimpse_ast), which is not the case for speculatively requested locks (glimpse, lockahead; see osc_lock_enqueue and osc_enqueue_base) until they are used for I/O.

      This means that if one client requests, say, two 1 MiB locks, writes into the first of them, and another client then stats the file, the server will only send a glimpse callback for the highest lock (the second one).*

      This higher lock has not been used for I/O and therefore does not have l_ast_data set, so the part of the glimpse callback that gets the size from the clio layers does not run, and the second client will see a file size of zero.

      *Note that if we wait long enough, the write associated with the first lock will be flushed, the server will have an up-to-date size, and it will return the correct value to the client.  This is part of the reason the problem is timing dependent.

      The fix is for the client, in the glimpse AST, to walk the granted lock list looking for a lock with l_ast_data set.  If none is found, then either no writes actually used these locks, or the object is being destroyed; either way, this client doesn't have useful size information.
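      In code, the idea looks roughly like the following.  This is a minimal sketch, not necessarily the landed patch; the function name is illustrative, and it assumes the usual LDLM layout (the resource's granted list lr_granted, linked through l_res_link) and that the caller already holds the resource lock.

      static struct ldlm_lock *find_lock_with_ast_data(struct ldlm_resource *res)
      {
              struct ldlm_lock *lock;

              /* A lock that has been used for I/O has l_ast_data set and can
               * therefore reach the clio layers to get the current size. */
              list_for_each_entry(lock, &res->lr_granted, l_res_link)
                      if (lock->l_ast_data != NULL)
                              return lock;

              /* No granted lock on this resource was used for I/O (or the
               * object is being destroyed), so this client has no useful
               * size information to report. */
              return NULL;
      }

      The glimpse AST would then use whichever lock this returns (if any) to fetch the size, and otherwise reply without size information.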

       

      Patch forthcoming momentarily.

       

      I'll leave this up to WC whether or not this should be a blocker, but it's probably worth considering as one.

          Activity

            [LU-11670] Incorrect size when using lockahead

            gerrit Gerrit Updater added a comment -

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36406
            Subject: LU-11670 osc: glimpse - search for active lock
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 7df7f1d5b2bf01643cdd55d6ed9fccaaf98a2c42

            pjones Peter Jones added a comment -

            ok - the correction to the patch has landed too


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36303/
            Subject: LU-11670 tests: do not fail the first half in sanityn test 103
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a88d0aa76c62e3074a05e869c9f0ba7ac128300f


            adilger Andreas Dilger added a comment -

            NB: about 40% of the failures also appear to be on ldiskfs, not just ZFS.


            gerrit Gerrit Updater added a comment -

            Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36303
            Subject: LU-11670 tests: do not fail the first half in sanityn test 103
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5c382fc06b013044e5bc16c20d648ac467e6aa5e


            paf0186 Patrick Farrell added a comment -

            Hi Jian,

            Ah, that's better than my suggestion.   Yeah, I like that idea.

            yujian Jian Yu added a comment -

            Hi Patrick,
            What about just making the following change, so as to keep the first half of the test and just print an informational message when the problem cannot be reproduced?

            sanityn.sh
            diff --git a/lustre/tests/sanityn.sh b/lustre/tests/sanityn.sh
            index 7d4007534b..ae47c09d55 100755
            --- a/lustre/tests/sanityn.sh
            +++ b/lustre/tests/sanityn.sh
            @@ -4768,7 +4768,7 @@ test_103() {
                    lockahead_test -d $DIR/$tdir -D $DIR2/$tdir -t $testnum -f $tfile
                    rc=$?
                    if [ $rc -eq 0 ]; then
            -               error "Lockahead test $testnum passed with fail_loc set, ${rc}"
            +               echo "Lockahead test $testnum passed with fail_loc set, ${rc}"
                    fi
             
                    # guarantee write commit timeout has expired
            
            yujian Jian Yu added a comment -

            I'm working on the patch to remove the first part of the test from sanityn test 103.


            paf0186 Patrick Farrell added a comment -

            Oh, yuck.  I checked a few of those, and they all appear to be ZFS.

            So presumably it's some sort of very long sync delay, or possibly the ZFS periodic commit interval is triggering in the middle of the test...  Can't quite nail what it is right now.  But there's an easy way out.

             

            This test is a little weird, as it has two halves so it can sort of self-verify...  The first half proves the problem exists, by showing that if you disable the fix, the bug occurs.  The second half tests that the problem does not occur with the fix in place.

            The first half, the one that shows the bug exists, is the only one that's failing.  Basically, we are unexpectedly succeeding sometimes.

            And while I like having a self-verifying test, the truth is very few of our tests are that way.

            So a quick & simple solution that would not overly compromise things would be to remove everything after test_mkdir (line 4753 here https://review.whamcloud.com/#/c/33660/31/lustre/tests/sanityn.sh ) up to (but not including) line 4780, where we start setting the fail_loc.

            (I'm not going to push this patch at the moment, but I'd be happy to review it.)

            That would leave you with only the "positive" test, removing the "negative" (i.e., prove it's broken without the fix) one, which is the only one that's failing.

            If someone really wants to debug this, I would add a printf with the stat information to lockahead_test.c, rather than only printing stat info when the sizes disagree...  See whether the "agreeing" size on both clients is 0 or 1 MiB.  And, basically, dig through the ZFS stuff and see how the different ZFS sync behavior is conflicting with the test.
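            For reference, such a debugging addition might look roughly like the sketch below.  This is illustrative only; the helper and variable names are hypothetical and not taken from lockahead_test.c.

            #include <stdio.h>
            #include <sys/stat.h>

            /* Hypothetical helper: stat the same file through both mount points
             * and print both sizes unconditionally, instead of printing them
             * only when they disagree. */
            static int print_both_sizes(const char *path1, const char *path2)
            {
                    struct stat st1, st2;

                    if (stat(path1, &st1) < 0 || stat(path2, &st2) < 0)
                            return -1;

                    printf("%s size=%lld, %s size=%lld\n",
                           path1, (long long)st1.st_size,
                           path2, (long long)st2.st_size);
                    return 0;
            }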

            adilger Andreas Dilger added a comment -

            Incorrect size expected (no glimpse fix):
            Starting test test23 at 1569352103
            Finishing test test23 at 1569352106
             sanityn test_103: @@@@@@ FAIL: Lockahead test 23 passed with fail_loc set, 0 
            
            pjones Peter Jones added a comment -

            paf any thoughts?

