[LU-13416] Data corruption during IOR testing with DoM files and hard failover Created: 06/Apr/20  Updated: 23/Dec/20  Resolved: 20/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0, Lustre 2.12.5

Type: Bug Priority: Blocker
Reporter: Alexey Lyashkov Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: None
Environment:

Any Lustre 2.x is affected.


Issue Links:
Related
is related to LU-14267 osd_ldiskfs_write_record(): do not up... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

IAM tables use a zero-copy update for files, similar to what ldiskfs directories do.
osd-ldiskfs, starting from

# git describe 67076c3c7e2b11023b943db2f5031d9b9a11329c
v2_2_50_0-22-g67076c3

does the same. But this is not safe without setting LDISKFS_INODE_JOURNAL_DATA on the inodes
(thanks bzzz for the tip).
Otherwise metadata blocks can be reused before the journal checkpoint without the corresponding revoke records, so on replay valid file data is overwritten with stale journaled data.
From the blktrace perspective it looks like this:

    mdt_io01_025-32148 [003]  4161.223760: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_025]
    mdt_io01_019-31765 [003]  4163.374449: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_019]
    mdt_io01_000-12006 [014]  4165.256635: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_000]
    mdt_io01_019-31765 [004]  4167.030265: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_019]

but the corresponding transactions are reported as committed:

00000001:00080000:9.0:1585615546.198190:0:11825:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) snx11281-MDT0000: transno 4522600752066 is committed
00000001:00080000:9.0:1585615546.198196:0:11825:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) snx11281-MDT0000: transno 4522600752064 is committed

but after the crash the journal records are:

Commit time 1585612866.905807896
  FS block 1509499725 logged at journal block 1370 (flags 0x2)
Found expected sequence 86453863, type 2 (commit block) at block 1382
Commit time 1585612871.80796396
Found expected sequence 86453864, type 2 (commit block) at block 1395
Commit time 1585612871.147796211
  FS block 1509499725 logged at journal block 1408 (flags 0x2)
Found expected sequence 86453865, type 2 (commit block) at block 1414
Commit time 1585612872.386792798
  FS block 1509499725 logged at journal block 1427 (flags 0x2)
Found expected sequence 86453866, type 2 (commit block) at block 1438
Commit time 1585612876.763804361
Found expected sequence 86453867, type 2 (commit block) at block 1451
Commit time 1585612876.834804666
  FS block 1509499725 logged at journal block 1464 (flags 0x2)
Found expected sequence 86453868, type 2 (commit block) at block 1471

and no revoke records at all.
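For illustration, this is roughly the call that should have left a revoke record before the block was reused. revoke_freed_iam_block() is a hypothetical wrapper named only for this sketch; ldiskfs_forget() is the existing helper that emits jbd2 revoke records for journaled blocks:

    /* Hypothetical sketch only: when a zero-copy (journaled) IAM block is
     * freed and may later be reused for plain file data, a revoke record
     * must be written so that journal replay skips the stale image of it. */
    static int revoke_freed_iam_block(handle_t *handle, struct inode *inode,
                                      struct buffer_head *bh,
                                      unsigned long long blocknr)
    {
            /* ldiskfs_forget() ends up issuing a jbd2 revoke for
             * metadata/journaled blocks; without it the journal keeps a
             * valid-looking copy of blocknr and rewrites it on replay */
            return ldiskfs_forget(handle, 1 /* is_metadata */, inode, bh, blocknr);
    }

The journal dump above shows FS block 1509499725 logged again and again with no revoke record, so replay after the crash rewrites the old IAM image over whatever that block holds by then.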



 Comments   
Comment by Andreas Dilger [ 09/Apr/20 ]

Is the solution here just to set LDISKFS_JOURNAL_DATA_FL on IAM files at create/open time (if unset)? I know we can't normally change the "+j" flag on an inode without flushing the journal, but this is handled properly via ldiskfs_ioctl_setflags() and will only change the flag once per IAM file (normally at mount) so it is not a significant performance concern.
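A minimal sketch of that idea, with the caveat that the journal flush done by ldiskfs_ioctl_setflags() when switching data-journaling modes is omitted, and iam_force_journal_data() is a hypothetical name used only for illustration:

    /* Hypothetical sketch: force data journaling on an IAM file the first
     * time it is set up, so later zero-copy updates are replay-safe.
     * The real setflags path also flushes the journal before changing the
     * data-journaling mode of an inode that already has data. */
    static int iam_force_journal_data(handle_t *handle, struct inode *inode)
    {
            if (ldiskfs_test_inode_flag(inode, LDISKFS_INODE_JOURNAL_DATA))
                    return 0;

            ldiskfs_set_inode_flag(inode, LDISKFS_INODE_JOURNAL_DATA);
            return ldiskfs_mark_inode_dirty(handle, inode);
    }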

Comment by Alexey Lyashkov [ 09/Apr/20 ]

Andreas,
the solution is simple, but the implementation is hard, because the OSD doesn't know whether a given object needs journaled data or not.
In most cases we can check the FID sequence, but sometimes we cannot.

The 'out' protocol is a problem, and so is Wangdi's version of the DNE2 llogs.
In the first case we don't have an 'open' - we just locate an object and start writing to it.
In the second case (probably just a different side of the first one) the llogs use a normal FID (this can be fixed, but it requires changing the object type at create).

The first case is the primary problem, as OUT processing has nothing like an 'open' command.

PS. Lustre large EAs are affected as well - the ext4 version has a special case in the code which forces revoke records to be added when an EA inode is destroyed.

Comment by Alex Zhuravlev [ 09/Apr/20 ]

I'd think that setting this flag in osd_write() and when IAM is being modified should be enough?

Comment by Alexey Lyashkov [ 10/Apr/20 ]

Alex,

I think that is not enough. This flag only affects objects in memory, but what happens under high memory load? The object gets flushed from the cache and the inode is released without an unlink. Next time the object is allocated again and goes down the unlink path. That means the flag is not set during unlink, so the bug will still exist.
With a journal size of 1GB or more this can mean up to 1000s before a journal checkpoint, compared to a commit, which happens much more often.

Comment by Alex Zhuravlev [ 10/Apr/20 ]

shadow, this flag is part of the on-disk inode, like INODE_EXTENTS, for example.

Comment by Alexey Lyashkov [ 10/Apr/20 ]

I understand, but dirtying the inode to write this flag to disk on every write is too expensive.
Otherwise the flag will not be flushed to disk when a file is only updated in place (an update that doesn't change the size doesn't call mark_inode_dirty()) - which is the common case on an already existing system.

PS. I checked the latest logs. The system had been up for 4100s, no checkpoint had been processed, and the journal is ~20% full.

Comment by Alex Zhuravlev [ 10/Apr/20 ]

this is to be done once, then mark_inode_dirty() can be skipped.

Comment by Alexey Lyashkov [ 10/Apr/20 ]

> this is to be done once, then mark_inode_dirty() can be skipped.

Anyway, this change needs to be checked with a performance evaluation, since DNE2 does so many llog operations.

Comment by Alex Zhuravlev [ 10/Apr/20 ]

The buffer is kept in cache until it is checkpointed, so I don't see why storing the flag in the buffer (with mark_inode_dirty()) wouldn't work.

Comment by Alex Zhuravlev [ 10/Apr/20 ]

> Anyway, this change needs to be checked with a performance evaluation, since DNE2 does so many llog operations.

Sure, though I'd expect that to be barely visible - the first write to a llog file would set the flag, and given that the llog file has just been created, the inode has to be modified anyway to get its block(s). So it could be optimized even with something like the following in osd_write():

if (!ldiskfs_test_inode_flag(inode, LDISKFS_INODE_JOURNAL_DATA)) {
    ldiskfs_set_inode_flag(inode, LDISKFS_INODE_JOURNAL_DATA);
    /* a brand-new llog file (i_size == 0) will have its inode
     * dirtied anyway when its first block is allocated */
    if (inode->i_size != 0)
        mark_inode_dirty(inode);
}

the first condition can be wrapped with unlikely(), I guess.

Comment by Gerrit Updater [ 20/Apr/20 ]

Alexey Lyashkov (alexey.lyashkov@hpe.com) uploaded a new patch: https://review.whamcloud.com/38281
Subject: LU-13416 ldiskfs: don't corrupt data on journal replay
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 94b7608c78c85d1d79fec8196ba0fecc9538ab78

Comment by Gerrit Updater [ 20/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38281/
Subject: LU-13416 ldiskfs: don't corrupt data on journal replay
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a23aac2219047cb04ed1fa555f31fa39e5c499dc

Comment by Peter Jones [ 20/May/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 23/May/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38705
Subject: LU-13416 ldiskfs: don't corrupt data on journal replay
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 77c1f307df4a3c068ec45a4948350bc55112e151

Comment by Gerrit Updater [ 27/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38705/
Subject: LU-13416 ldiskfs: don't corrupt data on journal replay
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 76b1050a56385cf8ddea47c9fea12eec21478601
