Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.4.2
    • Environment: lustre-2.4.2-14chaos, ZFS OSD
    • 3
    • 15831

    Description

      Users have reported several recent cases of file corruption. The corrupt files are larger than expected, and contain all of the original written data plus additional data at the end of the file. The additional data appears to be valid structured user data of unknown origin. We have not found anything unusual in the console logs from clients or servers at the time the files were written.

      In one case, the user made a copy of a Lustre directory tree using HPSS archival storage tools, then compared the copy to the original. He found one corrupt file in the copy. The original file size was 2,000,000 bytes, but the copy size was 2,097,152 (2 MiB). The archive tool reported 2,000,000 bytes written. The extra 97,152 bytes appear to be valid structured user data of unknown origin.

      The corrupt file sizes are not always aligned on MiB boundaries, however. Of the cases reported so far, these are the sizes involved:

      Example   Expected size (bytes)   Actual size (bytes)
      1         2000000                 2097152
      2         1008829                 2097152
      3         36473                   1053224
      4         1008829                 1441432

      In Example 1, the "bad data" begins immediately at the end of the expected data, with no sparse region in between. In the od -A d -a output below, the expected data consists of random bytes, whereas from offset 2000000 onward the unexpected data is structured.

                                                                               
      1999840   ! esc nul del   [ dc3   +   b   h   \ can   ;   f   h dc4   9            
      1999856   D   +   U   1   j   q   g   ;   7   J   r   {   "   j   )   D            
      1999872 enq   *   C   `   =   o   C   &   K   \   a   1   D   v   k  ht            
      1999888   !   A   ;  ff   2   "   G   i   m   9   e dle   $  si   T   )            
      1999904   9 etb  nl   w bel   N  rs   R   * nul eot   o   v   p   y can            
      1999920   1   4   $   c   W   l   M   D   &   3   U   J   B   )   t   {            
      1999936   A del   s   I   M dc1 esc   w dc1  sp   g  bs dle   `  sp   A            
      1999952 nak   D   %   l   1   r   1   W   % ack   !   h   0 syn   c   r            
      1999968 nak   W   ;   b   h   W   Z   z   w   B stx  bs   "   #   J   7            
      1999984   h   o   $  em   b   V   p bel   ] dc2   o  cr   )   S del   >            
      2000000  sp   1   1   1   9  nl   B   o   x  sp   f   r   a   c   A   A            
      2000016  sp   6   4   0  sp   6   7   9  sp   4   0  sp   7   9  sp   1            
      2000032   1   0   0  sp   1   1   1   9  nl   B   o   x  sp   f   r   a            
      2000048   c   A   A  sp   6   8   0  sp   7   1   9  sp   4   0  sp   7            
      2000064   9  sp   1   1   0   0  sp   1   1   1   9  nl   B   o   x  sp            
      2000080   f   r   a   c   A   A  sp   7   2   0  sp   7   5   9  sp   4            
      2000096   0  sp   7   9  sp   1   1   0   0  sp   1   1   1   9  nl   B            
      2000112   o   x  sp   f   r   a   c   A   A  sp   7   6   0  sp   7   9            
      2000128   9  sp   4   0  sp   7   9  sp   1   1   0   0  sp   1   1   1            
      2000144   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   8   0   0            
      
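      For reference, a dump like the one above can be produced with od's -j (skip bytes) option to start near the expected end of the data; the path below is a placeholder:

      # Named ASCII characters with decimal offsets, starting shortly before the
      # expected end of the file data (1,999,840 in Example 1):
      od -A d -a -j 1999840 /path/to/suspect-file | head -n 20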

      In Examples 2-4 there is a sparse region between the expected and unexpected data; the sparse region ends on a 1 MiB boundary, where the unexpected data begins. Here is another od snippet illustrating the expected, sparse, and unexpected regions for Example 2:

                                                                               
      1008768   g  nl   .   /   t   e   s   t   d   i   r   /   c   h   m   o            
      1008784   d   s   t   g  nl   .   /   t   e   s   t   d   i   r   /   l            
      1008800   s   t   o   r   a   g   e  nl   .   /   t   e   s   t   f   i            
      1008816   l   e   3  nl   .   /   z   z   j   u   n   k  nl nul nul nul            
      1008832 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul            
      *                                                                                  
      1048576  sp   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp            
      1048592   2   0   0  sp   2   3   9  sp   2   0   0  sp   2   3   9  sp            
      1048608   5   8   0  sp   5   9   9  nl   B   o   x  sp   f   r   a   c            
      1048624   A   A  sp   2   4   0  sp   2   7   9  sp   2   0   0  sp   2            
      1048640   3   9  sp   5   8   0  sp   5   9   9  nl   B   o   x  sp   f            
      1048656   r   a   c   A   A  sp   2   8   0  sp   3   1   9  sp   2   0            
      1048672   0  sp   2   3   9  sp   5   8   0  sp   5   9   9  nl   B   o            
      1048688   x  sp   f   r   a   c   A   A  sp   3   2   0  sp   3   5   9            
      1048704  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp   5   9   9            
      1048720  nl   B   o   x  sp   f   r   a   c   A   A  sp   3   6   0  sp            
      1048736   3   9   9  sp   2   0   0  sp   2   3   9  sp   5   8   0  sp            
      1048752   5   9   9  nl   B   o   x  sp   f   r   a   c   A   A  sp   4            
      
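      The hole in Examples 2-4 can also be confirmed without dumping the file contents, by comparing the apparent size with the number of allocated blocks; the path below is a placeholder:

      # A sparse file allocates noticeably fewer blocks than its apparent size implies.
      stat -c 'size=%s bytes, %b blocks of %B bytes allocated' /path/to/suspect-file

      # The same comparison with du:
      du --apparent-size -h /path/to/suspect-file   # apparent size
      du -h /path/to/suspect-file                   # space actually allocated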

      In all four examples the corrupt data resides within the second OST object, and in Examples 2-4 the files should have only one object. It looks as though some bug caused the second OST object to be doubly linked, and our users have partially overwritten another user's data.
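      The mapping from a file to its OST objects can be checked from a client with lfs getstripe; the paths below are placeholders:

      # Show the stripe layout: stripe count, stripe size, and the
      # (obdidx, objid) pair for each OST object backing the file.
      lfs getstripe /path/to/suspect-file

      # Check a whole directory tree at once if several files are suspect:
      lfs getstripe -r /path/to/suspect-directory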

    Activity

            [LU-5648] corrupt files contain extra data

            Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12068/
            Subject: LU-5648 ofd: Reject precreate requests below last_id
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 65a17be2d9e1acdec5d9aebaed007d4cb6d0ca11
            Jodi Levi (Inactive) added a comment:

            Emoly, could you please complete the test based on Oleg's comments found here: https://jira.hpdd.intel.com/browse/LU-5648?focusedCommentId=95011&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-95011
            Thank you!
            Peter Jones added a comment:

            Oleg confirms that it does.
            Peter Jones added a comment:

            Does the second patch need to land also? http://review.whamcloud.com/#/c/12068/

            Jodi Levi (Inactive) added a comment:

            Patch landed to Master.

            Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12067/
            Subject: LU-5648 ofd: In destroy orphan case always let MDS know last id
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: da5773e2b498a4edacc26fbf610d0b7628818d93

            Andreas Dilger added a comment:

            I'm wondering if osd-zfs OSTs may be more susceptible to this problem than osd-ldiskfs. With osd-ldiskfs, writes to precreated objects will trigger a journal commit, but with osd-zfs a large number of writes may sit in cache without the TXG being committed. While it is possible to hit a similar scenario on osd-ldiskfs OSTs by precreating a large number of empty objects and not forcing a commit, that wouldn't actually lead to data corruption, since unwritten objects are still anonymous.

            As mentioned in http://review.whamcloud.com/12067, I think it makes sense to track the number of uncommitted precreates with a commit callback, and if this grows too large (OST_MAX_PRECREATE / 2 or so) to start a transaction commit. I expect this wouldn't happen very often, so a precreate operation shouldn't be marked sync unless it really needs to be.
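            As a rough diagnostic sketch, the OST's and the MDS's views of the precreate window can be compared with lctl; the parameter names below are the usual ones but may differ between Lustre versions, and the device wildcards are placeholders:

            # On each OSS: the last object ID the OST believes it has created
            lctl get_param obdfilter.*.last_id

            # On the MDS: the OSP precreate window maintained for each OST
            lctl get_param osp.*.prealloc_next_id osp.*.prealloc_last_id

            # If an OST's committed last_id falls behind object IDs the MDS has
            # already handed out (e.g. after an OSS crash with uncommitted
            # precreates), the same object can end up linked into two files,
            # which is the reuse scenario the patches above guard against.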

            Ned Bass (Inactive) added a comment:

            Quoting Andreas: "with osd_txg_sync_delay_us=-1, which essentially disables sync operations on the MDS."

            To clarify, -1 is the default value, which preserves the safe full-sync behavior. We run our MDS with osd_txg_sync_delay_us=-1 and osd_object_sync_delay_us=0. This mainly prevents client fsyncs from forcing a pool-wide sync. dt_trans_stop() still safely honors the sync behavior on our systems, so operations using that interface will be consistent after a crash.
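            For context, osd_txg_sync_delay_us and osd_object_sync_delay_us come from the debug patch http://review.whamcloud.com/7761 referenced below and are not stock Lustre tunables. Assuming the patch exposes them as osd-zfs module parameters, they would be inspected and set along these lines:

            # Read the current values (the parameter location is an assumption)
            cat /sys/module/osd_zfs/parameters/osd_txg_sync_delay_us
            cat /sys/module/osd_zfs/parameters/osd_object_sync_delay_us

            # Set them at module load time, e.g. in /etc/modprobe.d/lustre.conf:
            # options osd_zfs osd_txg_sync_delay_us=-1 osd_object_sync_delay_us=0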
            Oleg Drokin added a comment:

            Oh! I did not realize that.

            I have been having second thoughts about what would happen if we have sequence advancement (less of a concern now with 48 bits, more of a concern with DNE, where it's only 32 bits).
            I was kind of expecting that all the sync logic would make sure this cannot happen, but...
            If sync is skipped and we also get into the next sequence and allocate some objects there, and then the MDS crashes and comes back with the old sequence, sure, we'll advance our pointer on reconnect to the end of the old sequence, and then the next sequence will be proposed, but numbering there would not start from 0, so we'll end up in the same situation (detected by the second patch, but that only prevents corruption; it does not actually allow any allocations).
            I wonder how we can better handle that, then?

            Andreas Dilger added a comment:

            I think it is worth mentioning that LLNL is running the "debug" patch http://review.whamcloud.com/7761 on their production systems with osd_txg_sync_delay_us=-1, which essentially disables sync operations on the MDS. That means any operations which the MDS or OSS try to sync for consistency reasons are not actually committed to disk before replying to the client. I haven't looked at the code specifically, but I can imagine lots of places where distributed consistency can be lost if Lustre thinks some updates are safe on disk while the OSD is only caching them in RAM and the server crashes.

            Ned Bass (Inactive) added a comment:

            Thanks Oleg, nice work. I verified the patches using your reproducer. It would be great to get a test case based on your reproducer into the Lustre regression test suite.

    People

      Assignee: Oleg Drokin
      Reporter: Ned Bass (Inactive)
      Votes: 0
      Watchers: 16