[LU-4102] lots of multiply-claimed blocks in e2fsck

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.8
    • Environment: e2fsprogs 1.41.90.wc2

    Description

      After a power loss, an older e2fsck (e2fsprogs 1.41.90.wc2) was run on the OSTs. It found tons of multiply-claimed blocks, including for the /O directory. Here's an example of one of the inodes:

      File ... (inode #17825793, mod time Wed Aug 15 19:02:25 2012)
        has 1 multiply-claimed block(s), shared with 1 file(s):
              /O (inode #84934657, mod time Wed Aug 15 19:02:25 2012)
      Clone multiply-claimed blocks? yes
      
      Inode 17825793 doesn't have an associated directory entry, so it eventually gets put into lost+found.
      

      So the questions are:

      • how could this have happened? My slightly-informed-probably-wrong theory is that the journal got corrupted and it replayed some old inodes back into existence. I noticed there were a lot of patches dealing with journal checksums committed after 1.41.90.
      • what's the best way to deal with these? Cloning takes forever when you are talking about TB-sized files. I tested the delete extended option, and it looks like it deletes both sides of the file. It would be nice if it just deleted the unlinked side. Right now my plan is to create a debugfs script from the read-only e2fsck output (sketched below), but if there is a better way, that would be good.
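
      A minimal sketch of that debugfs-script idea (the log and script file names are assumptions, and the extracted inode list still needs to be vetted by hand so that only the unlinked twin of each pair gets cleared):

        # collect inode numbers flagged in a saved read-only run (e2fsck -fn ... > e2fsck_ro.log)
        grep -o 'inode #[0-9]*' e2fsck_ro.log | sed 's/inode #//' | sort -un > candidates.txt
        # after vetting candidates.txt, turn each number into a debugfs "clri" command
        sed 's/.*/clri <&>/' candidates.txt > clri.debugfs
        # and apply it read-write to the unmounted OST (device path is a placeholder)
        debugfs -w -f clri.debugfs /dev/mapper/ostNN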

      Thanks.

          Activity

            pjones Peter Jones added a comment -

            Any work still outstanding should be tracked under a new ticket.


            kitwestneat Kit Westneat (Inactive) added a comment -

            Hi Andreas,

            I wanted to tie up the loose ends with the e2fsck patches in this thread. Is the shared=ignore patch something that could be landed? Should I create a gerrit changeset for it?

            As for the skip-invalid-bitmap issue, what's the best path to resolution on that? Should I create a new Jira ticket for it or what do you suggest?

            Thanks,
            Kit


            adilger Andreas Dilger added a comment -

            http://review.whamcloud.com/8188 updates lustre/ChangeLog to recommend using a newer e2fsprogs for b2_1.


            kitwestneat Kit Westneat (Inactive) added a comment -

            Hi Andreas,

            One of the targets just went RO:
            Oct 28 17:09:45 lfs-oss-2-6 kernel: LDISKFS-fs error (device dm-50): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 577 corrupted: 24074 blocks free in bitmap, 24075 - in gd

            It looks like I missed it when disabling targets. I had forgotten that I used the shared=ignore flag when cleaning it up, so the clean bill of health from e2fsck was an illusion.

            I've marked it deactivated on the MDT. Hopefully it can hold until Wednesday.
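
            Roughly what that deactivation looks like on the MDS, as a sketch; the OSC device name below is a placeholder and would come from "lctl dl":

                lctl dl | grep osc                           # find the OSC device for the affected target
                lctl --device lfs2-OSTxxxx-osc deactivate    # stop new object allocations to that OST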


            adilger Andreas Dilger added a comment -

            I think it's reasonable, so long as the new ll_recover_lost_found_objs fixes the shared block problem to a large extent.

            It should be possible to do a full test run against a raw e2image file for each of the OSTs. This would reduce the risk of problems during the actual repair, give some confidence that the remaining problems will be repaired, and also minimize the system downtime because the debugfs scripts can be generated while the system is still running.

            Loopback mount the raw image file, run "ll_recover_lost_found_objs -n" against it and unmount. Generate and run the debugfs script against the raw image file, then run "e2fsck -fy" on the image to see what is left. If all goes well, the debugfs script can be used on the real device.
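
            As a rough sketch of that sequence (device, image, and mount-point paths are placeholders, and the generated debugfs script name is assumed):

                e2image -r /dev/mapper/ost_lfs2_NN /scratch/ost_NN.img       # raw, metadata-only image
                mount -t ldiskfs -o loop,ro /scratch/ost_NN.img /mnt/ost_NN  # read-only is enough for a dry run
                ll_recover_lost_found_objs -n -d /mnt/ost_NN/lost+found      # dry run
                umount /mnt/ost_NN
                debugfs -w -f fix_ost.debugfs /scratch/ost_NN.img            # apply the generated script to the image
                e2fsck -fy /scratch/ost_NN.img                               # see what is left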


            kitwestneat Kit Westneat (Inactive) added a comment -

            Here's the list of corruption and a general plan of attack:
            ost_15 unattached inodes
            ost_19 multiply-claimed blocks (2780 inodes), unattached inodes
            ost_28 multiply-claimed blocks (52 inodes), unattached inodes
            ost_32 unattached inodes - will take multiple passes
            ost_34 multiply-claimed blocks (48 inodes), unattached inodes
            ost_36 multiply-claimed blocks (1792 inodes), unattached inodes
            ost_3 multiply-claimed blocks (376 inodes), unattached inodes
            ost_45 multiply-claimed blocks (362 inodes), unattached inodes

            Plan:
            1) using a bind mount, run the new ll_recover -n on OSTs with multiply-claimed blocks
            2) use the list of duplicate files to create a debugfs script to clri and unlink each "evil twin" file (see the sketch after this comment)
            3) take downtime to execute debugfs script
            4) [1 hour] run e2fsck -n on OSTs to make sure all multiply-claimed blocks are gone
            a) prepare script in advance to move files to lost+found if any are left
            5) [1 hour] run e2fsck -p on all OSTs, as well as ll_recover
            a) it's possible that there could be unknown issues at this stage
            6) [30 min] clri any multiply-claimed block files in l+f, delete all other files
            a) prepare script to nuke l+f
            7) [30 min] rerun e2fsck -p to verify that all OSTs are clean
            a) again, it's possible that there could be unknown issues at this stage

            So I am thinking 3 hours + 4 hours for unknown issues + 1 hour for startup/shutdown. What do you think of this plan/schedule?
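
            A minimal sketch of step 2, assuming a hand-built list with one "evil twin" per line in the form "pathname<TAB>inode" (the list format, file names, and device path are placeholders rather than output of any of the tools above):

                # turn the vetted list into a debugfs command file
                while IFS=$'\t' read -r path ino; do
                    printf 'unlink %s\n' "$path"    # drop the evil twin's directory entry
                    printf 'clri <%s>\n' "$ino"     # then clear its inode so each block has a single owner again
                done < evil_twins.tsv > fix_ost.debugfs

                # step 3, during the downtime, against the unmounted OST:
                debugfs -w -f fix_ost.debugfs /dev/mapper/ost_lfs2_NN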


            kitwestneat Kit Westneat (Inactive) added a comment -

            Sorry, I was unclear about which disks: I meant that the system disks, where I was building the sparse file, are formatted as ext3. I got a non-sparse e2image that I am copying over to webspace. Nathan also set up a Lustre filesystem that I will use to dump a sparse image to.


            adilger Andreas Dilger added a comment -

            NB - you can avoid the 2TB limit if you stripe the file more widely so that individual objects are below 2TB. If you are running an ext4-based ldiskfs (presumably yes) but on a filesystem that was formatted a while ago, you can also use "tune2fs -O huge_file" to enable larger-than-2TB files.
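
            Roughly, the two workarounds look like this (the file path and device name are placeholders; tune2fs should only be run with the target unmounted):

                lfs setstripe -c -1 /lfs2/images/ost_NN.img      # create the destination with a wide stripe so each object stays under 2TB
                tune2fs -O huge_file /dev/mapper/ost_lfs2_NN     # or enable >2TB files on an older ext4-based ldiskfs target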


            kitwestneat Kit Westneat (Inactive) added a comment -

            Actually, the non-sparse e2image works fine. It looks like the sparse image is having issues past 2TB. I guess the FSes on this OSS are all ext3, so that would explain it.

            I'll create a dm snapshot to test the tool on.
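
            A sketch of that kind of snapshot, assuming the OST sits on an LVM logical volume (volume group, names, and COW size are placeholders; if the dm device is not LVM-backed, a dmsetup snapshot target would be needed instead):

                lvcreate -s -n ost_lfs2_NN_snap -L 50G /dev/vg_lfs2/ost_lfs2_NN
                mount -t ldiskfs -o ro /dev/vg_lfs2/ost_lfs2_NN_snap /mnt/snap
                ll_recover_lost_found_objs -n -d /mnt/snap/lost+found    # dry run against the snapshot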


            kitwestneat Kit Westneat (Inactive) added a comment -

            Hi Andreas,

            I'm getting an error when I try to run e2image; I've tried it on a couple of OSTs:
            [root@lfs-oss-2-4 ~]# /usr/bin/time e2image -r /dev/mapper/ost_lfs2_19 /scratch/ost_lfs2_19.img
            e2image 1.42.7.wc1 (12-Apr-2013)
            seek: Invalid argument
            Command exited with non-zero status 1

            I got an strace, would that be useful?
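
            (One possible workaround, as a sketch, if the destination filesystem simply cannot seek past 2TB: e2image can write the raw image to stdout, so it can be piped through a compressor instead of seeking around in a sparse file; paths are placeholders.)

                e2image -r /dev/mapper/ost_lfs2_19 - | bzip2 > /scratch/ost_lfs2_19.img.bz2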


            People

              niu Niu Yawei (Inactive)
              orentas Oz Rentas (Inactive)