Lustre / LU-3542

deleted/unused inodes not actually cleared by e2fsck

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • None
    • None
    • Centos5, e2fsprogs-1.42.7.wc1-0redhat
    • 2
    • 8914

    Description

      e2fsck doesn't actually clear deleted/unused inodes, though it claims to. I've attached a log showing what we are seeing. The customer is CalTech.

      Attachments

        1. e2fsck_safe_repair_ost_3.log-1
          20 kB
        2. e2fsck_safe_repair_ost_3.log-2
          17 kB
        3. e2fsck.log
          9 kB
        4. fsck.hpfs2-eg3-oss11.ost0.2013_11_07.out1
          186 kB
        5. htree.dump
          122 kB

        Activity

          Is the raw device in ftp.whamcloud.com/uploads/LU-3542/ost000b.qcow.bz2 16TB? It's hard for me to find a machine with a drive that large to reproduce the problem. Is there a smaller OST that has the same problem?

          niu Niu Yawei (Inactive) added a comment
          I just uploaded my qcow image to ftp.whamcloud.com/uploads/LU-3542/ost000b.qcow.bz2

          dvicker Darby Vicker added a comment

          We ran into this problem as well. I'll attach the fsck output to this JIRA. Email me if you'd like me to send you the qcow image.

          dvicker Darby Vicker added a comment

          I got a qcow image with a file exhibiting the corruption, it's available here:
          http://ddntsr.com/ftp/2013-10-30-lustre-ost_lfs2_36.qcow2.bz2 [295M]

          # e2fsck -fp /dev/mapper/ost_lfs2_36
          lfs2-OST0024: Entry '62977970' in /O/0/d18 (88080410) has deleted/unused inode 1051496. CLEARED.
          lfs2-OST0024: 1546929/89620480 files (9.5% non-contiguous), 2367418676/5735710720 blocks
          # e2fsck -fp /dev/mapper/ost_lfs2_36
          lfs2-OST0024: Entry '62977970' in /O/0/d18 (88080410) has deleted/unused inode 1051496. CLEARED.
          lfs2-OST0024: 1546929/89620480 files (9.5% non-contiguous), 2367418676/5735710720 blocks
          kitwestneat Kit Westneat (Inactive) added a comment

          Even if there isn't a 100% chance that OST has the problem, it is still worthwhile to make an image of the OST. This will first give us an idea of how long it takes to generate the image, how large it is (uncompressed and compressed), and it can also be used to test the LU-4102 code.

          adilger Andreas Dilger added a comment

          I don't think any of the OSTs described in LU-4102 currently has the deleted/unused inodes issue. All the ones that reported it on the r/o e2fsck had previously been clean, so I think that it's just a matter of them being in use. That being said, I could get an image of the OST (ost_45) that had the error before. Do you think that might be useful? I have the e2fsck output as well.

          kitwestneat Kit Westneat (Inactive) added a comment

          I looked through the relevant code in pass2.c::check_dir_block():

                          /* 
                           * Offer to clear unused inodes; if we are going to be
                           * restarting the scan due to bg_itable_unused being
                           * wrong, then don't clear any inodes to avoid zapping
                           * inodes that were skipped during pass1 due to an
                           * incorrect bg_itable_unused; we'll get any real
                           * problems after we restart.
                           */
                          if (!(ctx->flags & E2F_FLAG_RESTART_LATER) &&
                              !(ext2fs_test_inode_bitmap2(ctx->inode_used_map,
                                                          dirent->inode)))
                                  problem = PR_2_UNUSED_INODE;
          
                          if (problem) {
                                  if (fix_problem(ctx, problem, &cd->pctx)) {
                                          dirent->inode = 0;
                                          dir_modified++;
                                          goto next;
          

          It is easy to trigger the PR_2_UNUSED_INODE problem by setting nlink = 0 in the inode(s) via debugfs. However, when I run e2fsck against such a filesystem (whether with small directories or large htree directories) e2fsck fixes the problem by clearing the dirent (setting inode = 0 above, and later writing out the directory block) and a second check shows it is fixed.
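
The expected behaviour described above (the first pass clears the dirent, the second pass finds nothing) can be illustrated with a toy model. This is a simplified Python sketch of the branch quoted from check_dir_block(), not e2fsprogs code: the Dirent class, the clear_unused_dirents() function, and the use of a plain set for the in-use inode map are all assumptions made for illustration, and the model assumes the fix is always accepted, as with "e2fsck -fy".

```python
from dataclasses import dataclass


@dataclass
class Dirent:
    name: str
    inode: int  # 0 means the directory slot is unused


def clear_unused_dirents(dirents, inode_used_map, restart_later=False):
    """Zero every dirent whose inode is not marked in use.

    Returns the number of entries modified; a real checker would write the
    directory block back to disk when this is non-zero.
    """
    if restart_later:
        # Mirrors the E2F_FLAG_RESTART_LATER guard: do not clear anything,
        # because the in-use map may be incomplete until the scan restarts.
        return 0
    modified = 0
    for d in dirents:
        if d.inode != 0 and d.inode not in inode_used_map:
            d.inode = 0  # same effect as "dirent->inode = 0" in the excerpt
            modified += 1
    return modified


if __name__ == "__main__":
    block = [Dirent("a", 12), Dirent("62977970", 1051496)]
    in_use = {12}  # inode 1051496 is deleted/unused
    print("first pass cleared:", clear_unused_dirents(block, in_use))
    print("second pass cleared:", clear_unused_dirents(block, in_use))
```

In this model the second pass always reports zero, which is what a healthy repair looks like. The reports in this ticket show the same entry being flagged again, which points at the modified directory block not reaching the disk rather than at this decision logic.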

          To capture a filesystem that has a persistent case of this problem (after "e2fsck -fy" didn't fix it) so that it can be debugged and fixed, please use e2image to dump the filesystem metadata. The dense image format can be efficiently compressed and transported, unlike the sparse variant of e2image:

          e2image -Q /dev/OSTnnnn OSTnnnn.qcow
          bzip2 -9 OSTnnnn.qcow
          

          Hopefully the OSTnnnn.qcow.bz2 image size is small enough for transport. It is possible to reconstitute the (uncompressed) qcow file into a raw ext4 image file that can be tested with e2fsck, debugfs, or mounted via loopback.

          e2image -r OSTnnnn.qcow OSTnnnn.raw
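
For sites that need to repeat this capture on several OSTs, the commands above could be wrapped in a small script. The following is only a sketch: the helper names and the dry-run default are assumptions, and the "bzip2 -d" decompression step is added here for completeness and does not appear in the original instructions. The e2image and bzip2 options are the ones quoted above.

```python
import subprocess


def capture_commands(device, basename):
    """Commands to dump filesystem metadata to a qcow image and compress it."""
    qcow = f"{basename}.qcow"
    return [
        ["e2image", "-Q", device, qcow],
        ["bzip2", "-9", qcow],
    ]


def restore_commands(basename):
    """Commands to decompress the image and convert it to a raw ext4 image."""
    qcow = f"{basename}.qcow"
    return [
        ["bzip2", "-d", f"{qcow}.bz2"],  # assumption: not in the original steps
        ["e2image", "-r", qcow, f"{basename}.raw"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run only: prints the commands without touching any device.
    run_all(capture_commands("/dev/OSTnnnn", "OSTnnnn"))
    run_all(restore_commands("OSTnnnn"))
```

Keeping dry_run=True by default makes it possible to review the exact commands before they are run against a production device.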
          
          adilger Andreas Dilger added a comment

          Hi Niu,

          The first customer had a problem with the RAID storage which caused the ldiskfs corruption. The second customer had a power outage that we think corrupted the journal and journal replay (LU-4102). Basically, when there is some kind of ldiskfs corruption, there is the possibility of getting these deleted/unused inode messages, and it seems that if the htrees are also corrupt, e2fsck is unable to clear them.

          Thanks,
          Kit

          kitwestneat Kit Westneat (Inactive) added a comment

          Kit, I didn't know they often run into the "deleted/unused inode" problem. Which Lustre version did they use? Do you know what kind of operation could have caused the problem? If possible, could you collect the log on the OST from before the problem happened? I think it might help us figure out how this happened.

          I'll look into the e2fsck problem at the same time. Thank you.

          niu Niu Yawei (Inactive) added a comment

          Hi Niu,

          This has become a higher priority for us. The problem is that if deleted inodes are not cleared, the filesystem will go read-only when it encounters the inode. This can lead to a state where the filesystem goes read-only at a random time and only manual intervention with debugfs can bring it back to a healthy state. It has happened to us a couple of times now, so I think we need to explore problem #1 a little more closely.

          Thanks.

          kitwestneat Kit Westneat (Inactive) added a comment

          Peter, the two questions Kit asked are probably e2fsck bugs. The remaining work is:

          • Search to find out whether the same problem has been reported in the Linux community before, and whether a patch already exists. (My initial search has found nothing so far.)
          • Try to reproduce the problem and trace through the e2fsck code to see whether it is really a bug that needs to be fixed. (This requires an e2fsprogs expert and could be time-consuming.)

          I agree with Kit that it's not a high-priority job.

          niu Niu Yawei (Inactive) added a comment

          People

            niu Niu Yawei (Inactive)
            kitwestneat Kit Westneat (Inactive)
            Votes: 0
            Watchers: 10