  1. Lustre
  2. LU-11013

Data Corruption error on Lustre ZFS dRaid


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.3
    • Labels: None
    • Environment: RHEL-7.4, in-kernel OFED, Mellanox FDR10, Lustre-2.10.3, dRaid-(pull-7078), dm-multipath
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      We are setting up a Lustre testbed at ANL with:

      • 4 OSSs with 16 OSTs in total (8 JBODs, each with 60 HDDs)
      • Hybrid Lustre: MGT/MDT on mdraid RAID10 with ldiskfs, OSTs on ZFS dRaid
      • MGT: 2 SSDs, RAID10, ldiskfs
      • MDT0: 12 SSDs, RAID10, ldiskfs
      • MDT1: 10 SSDs, RAID10, ldiskfs
      • Each OST: 30 HDDs, ZFS dRaid with 3*(8+1) + 1
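
      As a quick sanity check on the drive counts above, the arithmetic can be sketched as below. This uses only the numbers given in the description; reading "3*(8+1) + 1" as three groups of 8 data + 1 parity drives plus one extra drive is an assumption about the notation, and `layout_drives` is a hypothetical helper, not part of any ZFS tool.

```python
# Drive-count check using only the numbers quoted in the description.
JBODS = 8
HDDS_PER_JBOD = 60
OSTS = 16
HDDS_PER_OST = 30

total_hdds = JBODS * HDDS_PER_JBOD      # drives physically present
hdds_assigned = OSTS * HDDS_PER_OST     # drives assigned to OSTs


def layout_drives(groups, data, parity, extra=0):
    """Drives implied by a 'groups*(data+parity) + extra' layout string.

    Hypothetical helper: the meaning of each term is an assumption.
    """
    return groups * (data + parity) + extra


draid_drives = layout_drives(3, 8, 1, extra=1)   # 3*(8+1) + 1
raidz2_drives = layout_drives(3, 8, 2)           # 3*(8+2)

print(total_hdds, hdds_assigned, draid_drives, raidz2_drives)
```

      Under that reading, the JBODs and OSTs account for the same number of drives, but the dRaid layout string covers 28 of the 30 HDDs per OST, while the raidz2 layout mentioned later covers all 30. The description does not say how the remaining two drives were used in the dRaid case.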

       

      • We filled the file system to about 99%. After cleaning up the file system and running a ZFS scrub, we hit a data corruption problem. It was severe and ended up crashing the file system.
      • We rebuilt the Lustre dRaid file system and tested again in order to reproduce the problem.
      • On the first iteration of fill and cleanup, the file system held up. Two dRaid zpools reported only "One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.", so we simply cleared those errors.
      • After the second iteration we were finally able to reproduce the error: after emptying the file system and running a scrub, we got the same data corruption problem ("One or more devices has experienced an error resulting in data corruption. Applications may be affected.").
      • We changed the zpools to raidz2 with 3*(8+2), and the problem did not occur.
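
      To tell the two outcomes above apart automatically after each fill/cleanup/scrub iteration, something like the following could be used. It is a minimal sketch, not the procedure used on the testbed: it assumes the `zpool` CLI is on the PATH, matches only on the two status messages quoted above, and the function names and labels are hypothetical.

```python
import subprocess

# Substrings taken from the two status messages quoted in the description.
CORRECTED_MARKER = "Applications are unaffected"
CORRUPTION_MARKER = "resulting in data corruption"


def classify_status(status_text):
    """Classify `zpool status` text as 'corruption', 'corrected', or 'other'.

    'other' only means neither quoted message was found; it does not
    imply the pool is healthy.
    """
    # zpool wraps the status paragraph across lines, so collapse whitespace.
    flat = " ".join(status_text.split())
    if CORRUPTION_MARKER in flat:
        return "corruption"
    if CORRECTED_MARKER in flat:
        return "corrected"
    return "other"


def pool_status(pool):
    """Return the output of `zpool status -v <pool>` (requires ZFS)."""
    result = subprocess.run(
        ["zpool", "status", "-v", pool],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    sample = (
        "status: One or more devices has experienced an error resulting in\n"
        "        data corruption.  Applications may be affected.\n"
    )
    print(classify_status(sample))
```

      On a real OSS this would be called as `classify_status(pool_status(name))` for each pool once the scrub has finished; the pool names are site-specific and are not given in this report.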

          People

            isaac Isaac Huang (Inactive)
            kalfizah Kurniawan Alfizah (Inactive)