  1. Lustre
  2. LU-8411

Fix Lustre filesystem corruption when updating journal superblock fails

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.10.0
    • Labels:
    • Severity: 3

      Description

      During validation of issue we encountered another case of data corruption. It looks like the corruption occurred because the external journal went offline, but the filesystem processed the transaction as if it had succeeded.

      Jun  9 08:33:41 cslcodev912 kernel: JBD2: I/O error detected when updating journal superblock for md129.
      Buffer I/O error on device md0, logical block 0
      
      commit 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
      Author: Joseph Qi <joseph.qi@huawei.com>
      Date:   Mon Jun 15 14:36:01 2015 -0400
      
      jbd2: fix ocfs2 corrupt when updating journal superblock fails
          
      If updating journal superblock fails after journal data has been
      flushed, the error is omitted and this will mislead the caller as a
      normal case.
      

      This commit directly addresses the reported issue.

      
      

      commit 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
      Author: Joseph Qi <joseph.qi@huawei.com>
      Date: Mon Jun 15 14:36:01 2015 -0400

      jbd2: fix ocfs2 corrupt when updating journal superblock fails

      If updating journal superblock fails after journal data has been
      flushed, the error is omitted and this will mislead the caller as a
      normal case. In ocfs2, the checkpoint will be treated successfully
      and the other node can get the lock to update. Since the sb_start is
      still pointing to the old log block, it will rewrite the journal data
      during journal recovery by the other node. Thus the new updates will
      be overwritten and ocfs2 corrupts. So in above case we have to return
      the error, and ocfs2_commit_cache will take care of the error and
      prevent the other node to do update first. And only after recovering
      journal it can do the new updates.

      The issue discussion mail can be found at:
      https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
      http://comments.gmane.org/gmane.comp.file-systems.ext4/48841

      [Fixed bug in patch which allowed a non-negative error return from
      jbd2_cleanup_journal_tail() to leak out of jbd2_journal_flush(); this
      was causing xfstests ext4/306 to fail. -- Ted ]

              People

              • Assignee:
                wc-triage WC Triage
              • Reporter:
                artem_blagodarenko Artem Blagodarenko