Lustre / LU-1015: ldiskfs corruption with large LUNs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.3.0, Lustre 2.1.1, Lustre 2.1.2
    • Fix Version/s: None
    • Environment:
      lustre-2.1.0-13chaos_2.6.32_220.1chaos.ch5.x86_64.x86_64
      toss/chaos 5
      NetApp 22TB LUNs
    • Severity: 3
    • Rank (Obsolete): 2172

      Description

      We have been running IOR testing on hyperion with TOSS 5 and have seen ldiskfs corruption. Since I know you have access to hyperion, I was hoping you could log on and look around (the console logs are on hyperion577-pub, and SANtricity can be run from there as well).

      I set up a test filesystem called /p/ls1 on a NetApp, created with large LUNs (22TB per LUN, 6 LUNs on each RBOD). The MDS is on hyperion-agb25 and the two OSS nodes are hyperion-agb27 and hyperion-agb28. I had 10 clients writing I/O to the filesystem and would power-cycle an OSS every hour to simulate a node crash. After bringing the OSS back up I would run a full fsck to check for errors, bring Lustre back up, and continue the I/O load from the clients. We eventually hit a state where fsck reports corruption and Lustre will not mount.

      As a side note, I ran the same test in parallel on the same hardware, but with a small 3TB LUN size, and did not hit this issue.
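      The hourly crash-test loop described above can be sketched roughly as follows. This is a reconstruction, not taken from the ticket: the power-control command (powerman's pm), the e2fsck flags, and the service invocation are my assumptions; the device names come from the log below.

```python
# Hypothetical reconstruction of the hourly power-cycle / fsck / remount loop.
import subprocess
import time

OSS = "hyperion-agb28"                         # one of the two OSS nodes
OST_DEVICES = ["/dev/dm-3", "/dev/dm-0", "/dev/dm-4"]

def plan_cycle():
    """Commands for one iteration: power-cycle the OSS, full fsck on
    every OST, then remount Lustre."""
    cmds = [["pm", "-c", OSS]]                 # power-cycle via powerman (assumed)
    for dev in OST_DEVICES:
        cmds.append(["e2fsck", "-f", "-y", dev])   # forced full fsck (assumed flags)
    cmds.append(["service", "lustre", "start", "local"])  # remount the OSTs (assumed)
    return cmds

def run_forever(dry_run=True):
    """Repeat once per hour; dry_run=True only prints the commands."""
    while True:
        time.sleep(3600)                       # clients generate I/O for an hour
        for cmd in plan_cycle():
            print("would run:" if dry_run else "running:", " ".join(cmd))
            if not dry_run:
                subprocess.call(cmd)
```

      On the real cluster `run_forever(dry_run=False)` would execute the commands; the dry run only prints them.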

      zgrep Mounting ../conman.old/console.hyperion-agb27-20120115.gz

      2012-01-14 14:24:02 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
      2012-01-14 14:24:04 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0001
      2012-01-14 14:24:06 Mounting /dev/dm-4 on /mnt/lustre/local/ls1-OST0002
      2012-01-14 15:21:57 Mounting local filesystems: [ OK ]
      2012-01-14 15:22:03 Mounting other filesystems: [ OK ]
      2012-01-14 15:24:13 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0000
      2012-01-14 15:24:15 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0001
      2012-01-14 15:24:17 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0002
      2012-01-14 16:21:49 Mounting local filesystems: [ OK ]
      2012-01-14 16:21:55 Mounting other filesystems: [ OK ]
      2012-01-14 16:23:51 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
      2012-01-14 16:23:53 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0001
      2012-01-14 16:23:55 Mounting /dev/dm-5 on /mnt/lustre/local/ls1-OST0002
      2012-01-14 17:21:53 Mounting local filesystems: [ OK ]
      2012-01-14 17:21:59 Mounting other filesystems: [ OK ]
      2012-01-14 18:22:00 Mounting local filesystems: [ OK ]
      2012-01-14 18:22:06 Mounting other filesystems: [ OK ]
      2012-01-14 19:21:56 Mounting local filesystems: [ OK ]
      2012-01-14 19:22:02 Mounting other filesystems: [ OK ]
      2012-01-14 20:21:52 Mounting local filesystems: [ OK ]
      2012-01-14 20:21:58 Mounting other filesystems: [ OK ]

      It appears that the corruption occurred back on Saturday 1/14, sometime after 16:20.
      I base this on the fact that the OSTs did not come back after the power cycle on 1/14 @ 17:20 (the 17:21 boot messages above are not followed by any OST mounts).
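      That gap can be spotted mechanically. A minimal Python sketch, using the log format from the excerpt above (the 10-minute window is an assumed threshold):

```python
# Flag boots ("Mounting local filesystems") not followed by OST mounts.
from datetime import datetime, timedelta

LOG = """\
2012-01-14 16:21:49 Mounting local filesystems: [ OK ]
2012-01-14 16:23:51 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
2012-01-14 16:23:53 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0001
2012-01-14 16:23:55 Mounting /dev/dm-5 on /mnt/lustre/local/ls1-OST0002
2012-01-14 17:21:53 Mounting local filesystems: [ OK ]
2012-01-14 18:22:00 Mounting local filesystems: [ OK ]
"""

def stamp(line):
    # First 19 characters are the "YYYY-MM-DD HH:MM:SS" timestamp.
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

def boots_without_ost_mounts(log, window=timedelta(minutes=10)):
    lines = log.strip().splitlines()
    boots = [stamp(l) for l in lines if "Mounting local filesystems" in l]
    osts = [stamp(l) for l in lines if "-OST" in l]
    # A healthy boot has at least one OST mount within the window.
    return [b for b in boots
            if not any(b <= t <= b + window for t in osts)]

print(boots_without_ost_mounts(LOG))
```

      On this sample it flags the 17:21:53 and 18:22:00 boots, matching the gap in the full console log.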

      Also, the following fsck errors first surfaced after that power cycle:

      2012-01-14 17:23:48 Group descriptor 0 checksum is invalid. FIXED.
      2012-01-14 17:23:48 Group descriptor 1 checksum is invalid. FIXED.
      2012-01-14 17:23:48 Group descriptor 2 checksum is invalid. FIXED.
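      For reference, the "Group descriptor N checksum is invalid" messages refer to the per-group-descriptor checksum that ldiskfs/ext4 maintains with the uninit_bg (GDT_CSUM) feature: a CRC-16 over the filesystem UUID, the little-endian group number, and the descriptor with its checksum field excluded. A sketch of the computation, assuming 32-byte descriptors:

```python
# CRC-16 as in the kernel's lib/crc16.c: reflected polynomial 0x8005
# (0xA001 bit-reversed), no final XOR; ext4 seeds it with 0xFFFF.
import struct

def crc16(crc, data):
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc & 0xFFFF

def group_desc_checksum(fs_uuid, group, desc):
    """ext4-style group descriptor checksum (GDT_CSUM, 32-byte descriptors):
    CRC-16 over UUID + __le32 group number + the descriptor up to its
    bg_checksum field (bytes 30..31 are excluded)."""
    crc = crc16(0xFFFF, fs_uuid)                # 16-byte s_uuid
    crc = crc16(crc, struct.pack("<I", group))  # group number, little-endian
    crc = crc16(crc, desc[:30])                 # descriptor minus bg_checksum
    return crc
```

      fsck recomputes this value for every group; the messages above mean the on-disk descriptors and their stored checksums disagreed, and fsck rewrote the checksums.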

        Attachments

        1. after.tar
          73 kB
        2. before.tar
          4 kB
        3. full.sdd.1021.log.gz
          2.13 MB
        4. LU1015.log.gz
          5.30 MB
        5. sdb.20120305.5143.stats.gz
          9.34 MB
        6. sdb.20120305.5143.stats.post.gz
          9.34 MB
        7. sdb.20121805.1805.stats.gz
          9.44 MB
        8. sdb.20125805.5815.stats.gz
          9.44 MB
        9. sdb.20125805.5815.stats.post.gz
          9.44 MB
        10. sdb.fail.tar
          60 kB
        11. sdd.20120305.1910.stats.gz
          9.33 MB
        12. sdd.20120305.1910.stats.post.gz
          9.33 MB
        13. sdd.20120306.1204.stats.gz
          9.34 MB
        14. sdd.20120306.1204.stats.post.gz
          9.34 MB
        15. sdd.20120522.2011.e2fsck.gz
          0.2 kB
        16. sdd.20120522.2011.journal.gz
          0.0 kB
        17. sdd.20120522.2011.logdump.gz
          1.68 MB
        18. sdd.20120522.2011.stats.gz
          9.36 MB
        19. sdd.20120522.2011.stats.post.gz
          9.36 MB
        20. sdd.20120522.2323.e2fsck.gz
          0.3 kB
        21. sdd.20120522.2323.journal.gz
          0.0 kB
        22. sdd.20120522.2323.logdump.gz
          26 kB
        23. sdd.20120522.2323.stats.gz
          9.44 MB
        24. sdd.20120522.2323.stats.post.gz
          9.44 MB
        25. sdd.20125905.5946.stats.gz
          9.34 MB
        26. sdd.20125905.5946.stats.post.gz
          9.34 MB
        27. sdd.ext4.full.fsck.txt.gz
          2 kB
        28. sdd.fail.1.tar
          10 kB
        29. sdd.full.fsck.txt.gz
          59 kB


            People

            • Assignee: adilger (Andreas Dilger)
            • Reporter: cindyheer (Cindy Heer)
            • Votes: 0
            • Watchers: 12
