Data corruption during IOR testing with DoM files and hard failover

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.14.0, Lustre 2.12.5
    • Any Lustre 2.x affected.

    Description

      IAM tables use zero-copy updates for files, similar to what ldiskfs directories do.
      osd-ldiskfs, starting from

      # git describe 67076c3c7e2b11023b943db2f5031d9b9a11329c
      v2_2_50_0-22-g67076c3
      

      does the same. But this is not safe without setting LDISKFS_INODE_JOURNAL_DATA on the inodes
      (thanks to bzzz for the tip).
      Otherwise, metadata blocks can be reused before the journal checkpoint without corresponding revoke records. As a result, valid file data is replaced with stale journaled data when the journal is replayed.
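
      The failure mode described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the jbd2 or ldiskfs implementation: the names Transaction, replay and simulate are hypothetical, and the replay rule (skip a logged copy when a revoke record for the block exists in the same or a later transaction) is a simplified version of how journal recovery treats revoked blocks.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Toy stand-in for one committed journal transaction (hypothetical type)."""
    seq: int
    logged: dict = field(default_factory=dict)   # fs block -> journaled contents
    revoked: set = field(default_factory=set)    # fs blocks revoked in this transaction


def replay(disk, journal):
    """Replay committed transactions in order, honoring revoke records.

    Simplified rule: a logged copy of a block is skipped when a revoke
    record for that block exists in the same or a later transaction.
    """
    # Pass 1: remember the newest transaction that revoked each block.
    last_revoke = {}
    for txn in journal:
        for blk in txn.revoked:
            last_revoke[blk] = txn.seq

    # Pass 2: copy journaled blocks back to disk unless they were revoked.
    result = dict(disk)
    for txn in journal:
        for blk, contents in txn.logged.items():
            if last_revoke.get(blk, -1) >= txn.seq:
                continue  # revoked: the journaled copy is stale, do not replay it
            result[blk] = contents
    return result


def simulate(with_revoke):
    """Block is journaled, then freed and reused for file data written in place."""
    block = 1509499725  # block number taken from the journal dump in this ticket
    journal = [Transaction(seq=1, logged={block: "old journaled contents"})]

    # The block is released in a later transaction. A revoke record is only
    # emitted when the journal knows the old copy must not be replayed.
    txn2 = Transaction(seq=2)
    if with_revoke:
        txn2.revoked.add(block)
    journal.append(txn2)

    # The reused block now holds new file data, written directly to disk.
    disk = {block: "new file data"}

    # Crash before checkpoint: recovery replays the journal over the disk.
    return replay(disk, journal)[block]


if __name__ == "__main__":
    print("without revoke:", simulate(with_revoke=False))
    print("with revoke:   ", simulate(with_revoke=True))
```

      In this model the missing revoke record is the only difference between the two runs: without it, replay puts the old journaled copy back over the new file data.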
      From the block trace perspective, this shows repeated writes to the same sector:

          mdt_io01_025-32148 [003]  4161.223760: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_025]
          mdt_io01_019-31765 [003]  4163.374449: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_019]
          mdt_io01_000-12006 [014]  4165.256635: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_000]
          mdt_io01_019-31765 [004]  4167.030265: block_bio_queue:      9,65 W 12075997800 + 8 [mdt_io01_019]
      

      yet these transactions are reported as committed:

      00000001:00080000:9.0:1585615546.198190:0:11825:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) snx11281-MDT0000: transno 4522600752066 is committed
      00000001:00080000:9.0:1585615546.198196:0:11825:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) snx11281-MDT0000: transno 4522600752064 is committed
      

      but after the crash, the journal records are:

      Commit time 1585612866.905807896
        FS block 1509499725 logged at journal block 1370 (flags 0x2)
      Found expected sequence 86453863, type 2 (commit block) at block 1382
      Commit time 1585612871.80796396
      Found expected sequence 86453864, type 2 (commit block) at block 1395
      Commit time 1585612871.147796211
        FS block 1509499725 logged at journal block 1408 (flags 0x2)
      Found expected sequence 86453865, type 2 (commit block) at block 1414
      Commit time 1585612872.386792798
        FS block 1509499725 logged at journal block 1427 (flags 0x2)
      Found expected sequence 86453866, type 2 (commit block) at block 1438
      Commit time 1585612876.763804361
      Found expected sequence 86453867, type 2 (commit block) at block 1451
      Commit time 1585612876.834804666
        FS block 1509499725 logged at journal block 1464 (flags 0x2)
      Found expected sequence 86453868, type 2 (commit block) at block 1471
      

      and there are no revoke records.
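
      A check like the one described above (same FS block logged in several transactions, no revoke records) can be scripted against a saved journal dump. The sketch below is hedged: the function name summarize_journal_dump is hypothetical, the "FS block ... logged at journal block ..." and "commit block" patterns are taken from the dump quoted in this ticket, and detecting revoke entries by looking for the word "revoke" in a line is an assumption about the dump format.

```python
import re
from collections import Counter

# Patterns copied from the journal dump quoted in this ticket.
LOGGED_RE = re.compile(r"FS block (\d+) logged at journal block (\d+)")
COMMIT_RE = re.compile(r"Found expected sequence (\d+), type 2 \(commit block\)")

# Excerpt of the dump from the description, used as sample input.
SAMPLE = """\
Commit time 1585612866.905807896
  FS block 1509499725 logged at journal block 1370 (flags 0x2)
Found expected sequence 86453863, type 2 (commit block) at block 1382
Commit time 1585612871.80796396
Found expected sequence 86453864, type 2 (commit block) at block 1395
Commit time 1585612871.147796211
  FS block 1509499725 logged at journal block 1408 (flags 0x2)
Found expected sequence 86453865, type 2 (commit block) at block 1414
Commit time 1585612872.386792798
  FS block 1509499725 logged at journal block 1427 (flags 0x2)
Found expected sequence 86453866, type 2 (commit block) at block 1438
Commit time 1585612876.763804361
Found expected sequence 86453867, type 2 (commit block) at block 1451
Commit time 1585612876.834804666
  FS block 1509499725 logged at journal block 1464 (flags 0x2)
Found expected sequence 86453868, type 2 (commit block) at block 1471
"""


def summarize_journal_dump(text):
    """Return (times each FS block was logged, commit block count, revoke line count)."""
    logged = Counter()
    commits = 0
    revokes = 0
    for line in text.splitlines():
        match = LOGGED_RE.search(line)
        if match:
            logged[int(match.group(1))] += 1
        elif COMMIT_RE.search(line):
            commits += 1
        elif "revoke" in line.lower():
            # Assumption: revoke entries mention "revoke" somewhere in the line.
            revokes += 1
    return logged, commits, revokes


if __name__ == "__main__":
    logged, commits, revokes = summarize_journal_dump(SAMPLE)
    for block, count in logged.items():
        print(f"FS block {block}: logged {count} times")
    print(f"commit blocks: {commits}, revoke lines: {revokes}")
```

      A block that is logged in several committed transactions while the revoke count stays at zero matches the pattern reported here.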

          Activity

            [LU-13416] Data corruption during IOR testing with DoM files and hard failover
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-14267 [ LU-14267 ]
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-27 [ JFC-27 ]
            pjones Peter Jones made changes -
            Labels Original: LTS12
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.12.5 [ 14696 ]

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38705/
            Subject: LU-13416 ldiskfs: don't corrupt data on journal replay
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 76b1050a56385cf8ddea47c9fea12eec21478601


            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38705
            Subject: LU-13416 ldiskfs: don't corrupt data on journal replay
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 77c1f307df4a3c068ec45a4948350bc55112e151

            pjones Peter Jones made changes -
            Link New: This issue is related to JFC-27 [ JFC-27 ]
            pjones Peter Jones made changes -
            Labels New: LTS12
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-17 [ JFC-17 ]
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            People

              shadow Alexey Lyashkov
              shadow Alexey Lyashkov