Lustre / LU-11670

Incorrect size when using lockahead


Details


    Description

      When running IOR with the patched MPICH to use lockahead, we fairly consistently get size miscompares on some systems, e.g.:
      WARNING: inconsistent file size by different tasks.
      WARNING: Expected aggregate file size       = 5368709120.
      WARNING: Stat() of aggregate file size      = 5103419392.
      WARNING: Using actual aggregate bytes moved = 5368709120.
      This seems to be pretty timing dependent, as we don't see it at all on another system running the same software, and we didn't see it in our original testing, even though the bug is definitely present.

       

      I've identified the bug and found a fix.  During development, Oleg pointed out that it was not necessary to send glimpse callbacks to all locks on a particular resource; we could instead send one per client, because the client size check is not lock specific: it actually gets the size from the upper layers.

      This is true, but there is a caveat that went unnoticed:
      The size lookup only happens if l_ast_data is set (see osc_ldlm_glimpse_ast), which is not true for speculatively requested locks (glimpse, lockahead - see osc_lock_enqueue & osc_enqueue_base) until they are used for I/O.

      This means that if one client requests, say, two 1 MiB locks, then writes into the first of them, and another client stats the file, the server will only send a glimpse to the highest lock.*

      This higher lock has not been used for I/O and therefore does not have l_ast_data set, so the part of the glimpse callback that gets size from clio layers does not run.  So the second client will see a file size of zero.
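
The scenario above can be modelled with a small, self-contained C sketch. It is not Lustre code: struct model_lock, glimpse_one_lock, highest_lock and scenario_reported_size are hypothetical names, ast_data only stands in for l_ast_data, and "highest lock" is assumed to mean the lock covering the highest offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a client-side extent lock. */
struct model_lock {
    uint64_t start;            /* first byte covered by the lock */
    uint64_t end;              /* last byte covered by the lock */
    const uint64_t *ast_data;  /* models l_ast_data: NULL until the lock
                                * is used for I/O, otherwise points at the
                                * size known to the upper layers */
};

/* Models the glimpse reply for one lock: the size is only looked up
 * when ast_data is set; otherwise the client has nothing to report. */
static uint64_t glimpse_one_lock(const struct model_lock *lock)
{
    return lock->ast_data != NULL ? *lock->ast_data : 0;
}

/* Models the server picking a single lock per client: the one whose
 * extent ends at the highest offset (an assumption of this sketch). */
static const struct model_lock *highest_lock(const struct model_lock *locks,
                                             size_t n)
{
    const struct model_lock *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (best == NULL || locks[i].end > best->end)
            best = &locks[i];
    return best;
}

/* Two 1 MiB locks, only the first one used for a write. */
static uint64_t scenario_reported_size(void)
{
    const uint64_t mib = 1024 * 1024;
    uint64_t client_size = mib;  /* size after writing the first 1 MiB */
    struct model_lock locks[2] = {
        { 0,   mib - 1,     &client_size },  /* used for I/O */
        { mib, 2 * mib - 1, NULL },          /* speculative, unused */
    };

    return glimpse_one_lock(highest_lock(locks, 2));
}
```

In this model scenario_reported_size() returns 0, because the only lock that is glimpsed is the unused one.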

      *Note that if we wait long enough, the write associated with the first lock will be flushed, the server will have an up-to-date size, and it will return the correct value to the client.  This is part of the reason the bug is timing dependent.

      The fix is for the client, in the glimpse AST, to walk the granted lock list looking for a lock with l_ast_data set.  If none is found, then either no writes actually used these locks, or the object is being destroyed - either way, this client doesn't have useful size information.
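
A minimal sketch of that walk, under stated assumptions: this is a model, not the actual patch. struct model_lock and glimpse_walk_granted are hypothetical names, ast_data stands in for l_ast_data, and a plain array stands in for the resource's granted lock list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a client-side extent lock. */
struct model_lock {
    uint64_t start;            /* first byte covered by the lock */
    uint64_t end;              /* last byte covered by the lock */
    const uint64_t *ast_data;  /* models l_ast_data: NULL until the lock
                                * is used for I/O */
};

/*
 * Walk the granted locks looking for any lock with ast_data set.
 * Returns 1 and stores the size in *size_out if one is found.
 * Returns 0 if none is found, meaning this client has no useful
 * size information (no writes used these locks, or the object is
 * being destroyed).
 */
static int glimpse_walk_granted(const struct model_lock *granted, size_t n,
                                uint64_t *size_out)
{
    for (size_t i = 0; i < n; i++) {
        if (granted[i].ast_data != NULL) {
            *size_out = *granted[i].ast_data;
            return 1;
        }
    }
    return 0;
}
```

With two 1 MiB locks where only the first was used for a write, the walk finds the first lock and reports the size known to the client, regardless of which lock the server chose to glimpse.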

       

      Patch forthcoming momentarily.

       

      I'll leave it up to WC whether this should be a blocker, but it's probably worth considering as one.

            People

              pfarrell Patrick Farrell (Inactive)
              paf Patrick Farrell (Inactive)