[LU-8411] Fix Lustre filesystem corruption when updating journal superblock fails Created: 18/Jul/16  Updated: 26/Aug/19  Resolved: 31/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug
Priority: Major
Reporter: Artem Blagodarenko (Inactive)
Assignee: WC Triage
Resolution: Fixed
Votes: 0
Labels: patch

Issue Links:
Duplicate
Related
is related to LU-12700 sanity test_407 added to ALWAYS_EXCEP... Open
is related to LU-9135 sanity test_313: osp_sync.c:571:osp_s... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During validation of issue we encountered another case of data corruption. It appears to have occurred because the external journal device went offline, yet the filesystem processed the transaction as if it had succeeded.

Jun  9 08:33:41 cslcodev912 kernel: JBD2: I/O error detected when updating journal superblock for md129.
Buffer I/O error on device md0, logical block 0
commit 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
Author: Joseph Qi <joseph.qi@huawei.com>
Date:   Mon Jun 15 14:36:01 2015 -0400

jbd2: fix ocfs2 corrupt when updating journal superblock fails
    
If updating journal superblock fails after journal data has been
flushed, the error is omitted and this will mislead the caller as a
normal case.

This commit directly addresses the reported issue.


commit 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
Author: Joseph Qi <joseph.qi@huawei.com>
Date: Mon Jun 15 14:36:01 2015 -0400

jbd2: fix ocfs2 corrupt when updating journal superblock fails

If updating journal superblock fails after journal data has been
flushed, the error is omitted and this will mislead the caller as a
normal case. In ocfs2, the checkpoint will be treated successfully
and the other node can get the lock to update. Since the sb_start is
still pointing to the old log block, it will rewrite the journal data
during journal recovery by the other node. Thus the new updates will
be overwritten and ocfs2 corrupts. So in above case we have to return
the error, and ocfs2_commit_cache will take care of the error and
prevent the other node to do update first. And only after recovering
journal it can do the new updates.

The issue discussion mail can be found at:
https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
http://comments.gmane.org/gmane.comp.file-systems.ext4/48841

[Fixed bug in patch which allowed a non-negative error return from
jbd2_cleanup_journal_tail() to leak out of jbd2_journal_flush(); this
was causing xfstests ext4/306 to fail. – Ted ]
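
The failure mode described in the commit message can be illustrated with a minimal, self-contained sketch in C. This is not the jbd2 code: struct toy_journal and the toy_* functions are hypothetical names invented for illustration, and the real kernel structures and call paths are considerably more involved. The sketch only shows the difference between dropping the superblock write error and returning it to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified journal model (not the jbd2 structures). */
struct toy_journal {
    unsigned long sb_start;  /* log block the on-disk superblock points to */
    unsigned long tail;      /* in-memory tail after the data was flushed */
    bool sb_device_online;   /* false simulates the journal device going offline */
};

/* Write the new tail into the journal superblock; returns 0 or -EIO. */
static int toy_update_superblock(struct toy_journal *j)
{
    assert(j != NULL);
    if (!j->sb_device_online)
        return -EIO;         /* the superblock write failed */
    j->sb_start = j->tail;
    return 0;
}

/* Buggy variant: the superblock write error is dropped, so the caller
 * sees success even though sb_start still points to the old log block. */
static int toy_flush_buggy(struct toy_journal *j, unsigned long new_tail)
{
    j->tail = new_tail;
    (void)toy_update_superblock(j);
    return 0;
}

/* Fixed variant: a negative error is returned to the caller, which can
 * then refuse to treat the flush as complete. Only negative values are
 * passed up; anything else is reported as plain success. */
static int toy_flush_fixed(struct toy_journal *j, unsigned long new_tail)
{
    int err;

    j->tail = new_tail;
    err = toy_update_superblock(j);
    if (err < 0)
        return err;
    return 0;
}
```

With the journal device marked offline, toy_flush_buggy() returns 0 while sb_start keeps its old value, which mirrors the situation where a later journal recovery would replay stale log blocks. toy_flush_fixed() returns -EIO in the same situation, so the caller can stop before new updates are made.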



 Comments   
Comment by Gerrit Updater [ 18/Jul/16 ]

Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: http://review.whamcloud.com/21398
Subject: LU-8411 ofd: handle last_rcvd file can't update properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aec39022f7c0d39c7c633b9ce52a59d6e3012c82

Comment by Gerrit Updater [ 31/Jan/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21398/
Subject: LU-8411 ofd: handle last_rcvd file can't update properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6a81ffa1e9e44231d812e331c73cfa9df67746ed

Comment by Peter Jones [ 31/Jan/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:17:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.