  Lustre / LU-13416

Data corruption during IOR testing with DoM files and hard failover

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.5
    • Any Lustre 2.x affected.
    • Severity: 3

    Description

      IAM tables use a zero-copy update for files, similar to what ldiskfs directories do.
      osd-ldiskfs, starting from

      # git describe 67076c3c7e2b11023b943db2f5031d9b9a11329c
      v2_2_50_0-22-g67076c3
      

      does the same. But this is not safe without setting LDISKFS_INODE_JOURNAL_DATA on the inodes
      (thanks to bzzz for the tip).
      Otherwise, metadata blocks can be reused before a journal checkpoint without the corresponding revoke records, which causes valid file data to be replaced with stale journaled data.
      From the blktrace perspective it looks like this:

          mdt_io01_025-32148 [003]  4161.223760: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_025]
          mdt_io01_019-31765 [003]  4163.374449: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_019]
          mdt_io01_000-12006 [014]  4165.256635: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_000]
          mdt_io01_019-31765 [004]  4167.030265: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_019]
      

      but these updates are reported as committed:

      00000001:00080000:9.0:1585615546.198190:0:11825:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) snx11281-MDT0000: transno 4522600752066 is committed
      00000001:00080000:9.0:1585615546.198196:0:11825:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) snx11281-MDT0000: transno 4522600752064 is committed
      

      but after the crash, the journal records are:

      Commit time 1585612866.905807896
        FS block 1509499725 logged at journal block 1370 (flags 0x2)
      Found expected sequence 86453863, type 2 (commit block) at block 1382
      Commit time 1585612871.80796396
      Found expected sequence 86453864, type 2 (commit block) at block 1395
      Commit time 1585612871.147796211
        FS block 1509499725 logged at journal block 1408 (flags 0x2)
      Found expected sequence 86453865, type 2 (commit block) at block 1414
      Commit time 1585612872.386792798
        FS block 1509499725 logged at journal block 1427 (flags 0x2)
      Found expected sequence 86453866, type 2 (commit block) at block 1438
      Commit time 1585612876.763804361
      Found expected sequence 86453867, type 2 (commit block) at block 1451
      Commit time 1585612876.834804666
        FS block 1509499725 logged at journal block 1464 (flags 0x2)
      Found expected sequence 86453868, type 2 (commit block) at block 1471
      

      and there are no revoke records.

      Attachments

        Issue Links

          Activity


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38705/
            Subject: LU-13416 ldiskfs: don't corrupt data on journal replay
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 76b1050a56385cf8ddea47c9fea12eec21478601


            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38705
            Subject: LU-13416 ldiskfs: don't corrupt data on journal replay
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 77c1f307df4a3c068ec45a4948350bc55112e151

            pjones Peter Jones added a comment -

            Landed for 2.14


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38281/
            Subject: LU-13416 ldiskfs: don't corrupt data on journal replay
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a23aac2219047cb04ed1fa555f31fa39e5c499dc


            Alexey Lyashkov (alexey.lyashkov@hpe.com) uploaded a new patch: https://review.whamcloud.com/38281
            Subject: LU-13416 ldiskfs: don't corrupt data on journal replay
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 94b7608c78c85d1d79fec8196ba0fecc9538ab78


            bzzz Alex Zhuravlev added a comment -

            > Anyway, this change needs to be checked with a performance evaluation, since DNE2 has so many llog operations.

            Sure, though I'd expect that to be barely visible: any first write to a llog file would set the flag, and given that the llog file has just been created, the inode has to be modified anyway to allocate its block(s). So it could be optimized even with something like the following in osd_write():

            if (!ldiskfs_test_inode_flag(inode, LDISKFS_INODE_JOURNAL_DATA)) {
                ldiskfs_set_inode_flag(inode, LDISKFS_INODE_JOURNAL_DATA);
                if (inode->i_size != 0)
                    mark_inode_dirty(inode);
            }
            

            The first condition can be wrapped with unlikely(), I guess.


            The buffer is kept in cache until it is checkpointed, so I don't see why storing the flag in the buffer (with mark_inode_dirty()) wouldn't work.


            shadow Alexey Lyashkov added a comment -

            > This is to be done once; then mark_inode_dirty() can be skipped.

            Anyway, this change needs to be checked with a performance evaluation, since DNE2 has so many llog operations.


            bzzz Alex Zhuravlev added a comment -

            This is to be done once; then mark_inode_dirty() can be skipped.

            shadow Alexey Lyashkov added a comment - edited

            I understand, but dirtying the inode just to write this flag to disk on every write is too expensive.
            On the other hand, the flag will never be flushed to disk when an existing file is only updated in place (an unchanged size does not trigger mark_inode_dirty()), which is the common case on an already deployed system.

            PS: I checked the latest logs. The system had been up for 4100 s, no checkpoint had been processed, and the journal was ~20% full.


            People

              Assignee: shadow Alexey Lyashkov
              Reporter: shadow Alexey Lyashkov
              Votes: 0
              Watchers: 9