[LU-13416] Data corruption during IOR testing with DoM files and hard failover Created: 06/Apr/20 Updated: 23/Dec/20 Resolved: 20/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.5 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Alexey Lyashkov | Assignee: | Alexey Lyashkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | Any Lustre 2.x affected. |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
IAM tables use zero-copy updates for files, just as ldiskfs directories do. The commit

    # git describe 67076c3c7e2b11023b943db2f5031d9b9a11329c
    v2_2_50_0-22-g67076c3

does the same. But this is not safe without setting LDISKFS_INODE_JOURNAL_DATA on the inodes.

The same filesystem block is queued for direct write several times:

    mdt_io01_025-32148 [003] 4161.223760: block_bio_queue: 9,65 W 12075997800 + 8 [mdt_io01_025]
    mdt_io01_019-31765 [003] 4163.374449: block_bio_queue: 9,65 W 12075997800 + 8 [mdt_io01_019]
    mdt_io01_000-12006 [014] 4165.256635: block_bio_queue: 9,65 W 12075997800 + 8 [mdt_io01_000]
    mdt_io01_019-31765 [004] 4167.030265: block_bio_queue: 9,65 W 12075997800 + 8 [mdt_io01_019]

and this information is committed:

    00000001:00080000:9.0:1585615546.198190:0:11825:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) snx11281-MDT0000: transno 4522600752066 is committed
    00000001:00080000:9.0:1585615546.198196:0:11825:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) snx11281-MDT0000: transno 4522600752064 is committed

but after the crash the journal records are:

    Commit time 1585612866.905807896
    FS block 1509499725 logged at journal block 1370 (flags 0x2)
    Found expected sequence 86453863, type 2 (commit block) at block 1382
    Commit time 1585612871.80796396
    Found expected sequence 86453864, type 2 (commit block) at block 1395
    Commit time 1585612871.147796211
    FS block 1509499725 logged at journal block 1408 (flags 0x2)
    Found expected sequence 86453865, type 2 (commit block) at block 1414
    Commit time 1585612872.386792798
    FS block 1509499725 logged at journal block 1427 (flags 0x2)
    Found expected sequence 86453866, type 2 (commit block) at block 1438
    Commit time 1585612876.763804361
    Found expected sequence 86453867, type 2 (commit block) at block 1451
    Commit time 1585612876.834804666
    FS block 1509499725 logged at journal block 1464 (flags 0x2)
    Found expected sequence 86453868, type 2 (commit block) at block 1471

with no revoke records, so on replay a stale logged copy of FS block 1509499725 can overwrite data that was already rewritten in place.
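For reference, journal dumps in this format can be produced with debugfs's logdump command, for example "debugfs -c -R 'logdump -a' <device>"; the device path is a placeholder, not taken from this ticket. |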
| Comments |
| Comment by Andreas Dilger [ 09/Apr/20 ] |
|
Is the solution here just to set LDISKFS_JOURNAL_DATA_FL on IAM files at create/open time (if unset)? I know we can't normally change the "+j" flag on an inode without flushing the journal, but this is handled properly via ldiskfs_ioctl_setflags() and will only change the flag once per IAM file (normally at mount) so it is not a significant performance concern. |
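As an illustration of the "+j" flag change discussed above (a minimal sketch, not from this ticket or any patch here), the user-space analogue via the standard inode-flags ioctl - what "chattr +j" does on ext4/ldiskfs - looks like:

    /* Minimal sketch, not from this ticket: set the journal-data ("+j")
     * flag on an existing file, the user-space analogue of setting
     * LDISKFS_JOURNAL_DATA_FL on IAM files at create/open time.
     * Note: changing this flag typically requires root on ext4/ldiskfs. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>  /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_JOURNAL_DATA_FL */

    int main(int argc, char **argv)
    {
            int fd, flags;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <file>\n", argv[0]);
                    return 1;
            }
            fd = open(argv[1], O_RDONLY);
            if (fd < 0 || ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
                    perror(argv[1]);
                    return 1;
            }
            if (!(flags & FS_JOURNAL_DATA_FL)) {
                    flags |= FS_JOURNAL_DATA_FL;
                    /* the filesystem flushes the journal as needed when this
                     * flag changes, as ldiskfs_ioctl_setflags() does */
                    if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
                            perror("FS_IOC_SETFLAGS");
            }
            close(fd);
            return 0;
    }
 |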
| Comment by Alexey Lyashkov [ 09/Apr/20 ] |
|
Andreas, the 'out' protocol is a problem, and so is Wangdi's version for the DNE2 llogs. The first case is the primary problem, since OUT processing has nothing like an 'open' command. PS. Lustre large EAs are affected as well - the ext4 version has an exclusion in the code which forces revoke records to be added when an EA inode is destroyed. |
| Comment by Alex Zhuravlev [ 09/Apr/20 ] |
|
I'd think that setting this flag in osd_write() and when IAM is being modified should be enough? |
| Comment by Alexey Lyashkov [ 10/Apr/20 ] |
|
Alex, I think it's not enough. This flag only affects objects in memory, but what happens under high memory load? The object is flushed from the cache and the inode is released without an unlink. The next time, the object is allocated again and passed down the unlink path. That means the flag is not set during the unlink, so the bug will still exist. |
| Comment by Alex Zhuravlev [ 10/Apr/20 ] |
|
shadow, this flag is part of the on-disk inode, like INODE_EXTENTS, for example. |
| Comment by Alexey Lyashkov [ 10/Apr/20 ] |
|
I understand, but dirtying the inode so that this flag is written to disk on each write is too expensive. PS. I checked the latest logs. The system had been up for 4100s, no checkpoint had been processed, and the journal was ~20% full. |
| Comment by Alex Zhuravlev [ 10/Apr/20 ] |
|
this is to be done once, then mark_inode_dirty() can be skipped. |
| Comment by Alexey Lyashkov [ 10/Apr/20 ] |
|
> this is to be done once, then mark_inode_dirty() can be skipped.

Anyway, this change needs to be checked with a performance evaluation, since DNE2 has so many llog operations. |
| Comment by Alex Zhuravlev [ 10/Apr/20 ] |
|
The buffer is kept in cache until it's checkpointed, so I don't see why storing the flag in the buffer (with mark_inode_dirty()) won't work. |
| Comment by Alex Zhuravlev [ 10/Apr/20 ] |
|
Sure, though I'd expect that to be barely visible - the first write to a llog file would set the flag, and since the llog file has just been created, the inode has to be modified anyway to get its block(s). So it could be optimized even with something like the following in osd_write():

    if (LDISKFS_INODE_JOURNAL_DATA not set) {
            set LDISKFS_INODE_JOURNAL_DATA;
            if (inode->i_size != 0)
                    mark_inode_dirty();
    }

The first condition can be wrapped with unlikely(), I guess. |
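Spelled out in C, that sketch might look like the following (an illustration, not the patch that eventually landed; the ldiskfs_test_inode_flag()/ldiskfs_set_inode_flag() helper names are assumed to mirror their ext4_* counterparts):

    /* Illustrative sketch only, not the landed patch.
     * ldiskfs_test_inode_flag()/ldiskfs_set_inode_flag() are assumed to
     * mirror ext4_test_inode_flag()/ext4_set_inode_flag(). */
    static void osd_set_journal_data(struct inode *inode)
    {
            /* force data journaling so rewrites of IAM/llog blocks go
             * through the journal and get revoke records when freed */
            if (unlikely(!ldiskfs_test_inode_flag(inode,
                                                  LDISKFS_INODE_JOURNAL_DATA))) {
                    ldiskfs_set_inode_flag(inode, LDISKFS_INODE_JOURNAL_DATA);
                    /* persist the flag, but skip the explicit dirty for a
                     * brand-new file: its inode will be dirtied anyway when
                     * its first blocks are allocated */
                    if (inode->i_size != 0)
                            mark_inode_dirty(inode);
            }
    }

Testing the flag before setting it keeps the common path to a single check, which is why the unlikely() annotation fits there. |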
| Comment by Gerrit Updater [ 20/Apr/20 ] |
|
Alexey Lyashkov (alexey.lyashkov@hpe.com) uploaded a new patch: https://review.whamcloud.com/38281 |
| Comment by Gerrit Updater [ 20/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38281/ |
| Comment by Peter Jones [ 20/May/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 23/May/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38705 |
| Comment by Gerrit Updater [ 27/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38705/ |