OI scrubber causing performance issues

Details

    • Question/Request
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.5.5
    • None

    Description

      OI scrubber causing performance issues.

      • There are orphaned objects that are causing the OI scrubber to start.
      • Is there an online or offline way of running lfsck at the moment?

          Activity

            [LU-7867] OI scrubber causing performance issues

            yong.fan nasf (Inactive) added a comment -

            The related patches have landed.

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/19493/
            Subject: LU-7867 e2fsprogs: update build version to 1.42.13.wc5
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set:
            Commit: 00c372887fff6d4a9baae075c3fc1523c47d8ad4

            yong.fan nasf (Inactive) added a comment -

            Thanks, Dustin, for the explanation. We will close the ticket once Andreas' latest patch has landed.

            dustb100 Dustin Leverman added a comment -

            We are in a holding pattern on this until the customer allows us to take an outage. It will probably be another 5-6 weeks before we get the green light. I think it is okay to close this ticket.

            Our plan is to run e2fsck on the LUN. If that doesn't fix the issue, we will wipe out the 2 OST objects by hand and run lfsck when we update to lustre-2.8 servers.

            Thanks,
            Dustin

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/19493
            Subject: LU-7867 e2fsprogs: update build version to 1.42.13.wc5
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: fae5f9fa7b3bb6bf134f0560b163276f9f6187a3

            yong.fan nasf (Inactive) added a comment -

            Is there any further information about the issue? Has e2fsck been run as discussed above?

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/18999/
            Subject: LU-7867 debugfs: fix check for out-of-bound xattr value
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set:
            Commit: 595b51179eaafdc1c50ab8348cb83d9429aa2bfa

            adilger Andreas Dilger added a comment -

            I suspect that e2fsck will just consider these files as hard linked, so you will find them by looking for regular files under O/0/ with nlink=2.
            ezell Matt Ezell added a comment -

            I think the strange PFID EA is coming from fid_flatten. It takes the FID and does

            ino = (seq << 24) + ((seq >> 24) & 0xffffff0000ULL) + fid_oid(fid);

            to make the 128-bit FID fit in a 64-bit inode. I didn't specifically see this in the log, but I believe this is what the client is setting as PFID. Since the 1.8 clients are going away, it's not worth looking into it further. We'll make sure to test LFSCK in 2.8 against objects created under 1.8 before upgrading the server.
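
            For reference, the folding expression quoted above can be tried in isolation with a small sketch like the one below. The struct `example_fid` and the function `example_fid_flatten` are hypothetical stand-ins, and the field layout (64-bit sequence, 32-bit object ID, 32-bit version) is an assumption made for the illustration; only the arithmetic is taken from this comment.

```c
#include <stdint.h>

/* Simplified stand-in for a 128-bit FID (assumed layout, not Lustre's header). */
struct example_fid {
    uint64_t seq;   /* sequence number */
    uint32_t oid;   /* object ID within the sequence */
    uint32_t ver;   /* version; not used by the folding below */
};

/*
 * Folds the FID into 64 bits using the expression quoted in the comment.
 * Bits of seq shifted past bit 63 are discarded, so distinct FIDs can
 * map to the same value.
 */
static uint64_t example_fid_flatten(const struct example_fid *fid)
{
    uint64_t seq = fid->seq;

    return (seq << 24) + ((seq >> 24) & 0xffffff0000ULL) + fid->oid;
}
```

            With seq = 1 and oid = 2, the expression gives 0x1000002: the sequence lands above bit 24 and the object ID occupies the low bits.
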

            So, back to the original issue at hand. I guess we need to take a downtime to run e2fsck. There could be two outcomes:

            • e2fsck identifies these objects as damaged and repairs or unlinks them
            • It doesn't notice the issue and runs cleanly

            If the second outcome happens, should we manually unlink them? And how do we then ensure there aren't more damaged objects on this OST?


            yong.fan nasf (Inactive) added a comment -

            The client sends the PFID information to the OST on write/setattr. From your description, the 1.8 client may have sent an "ostid"-formatted PFID to the OST, which produced the strange PFID EA. As long as the OST object is not a real orphan, then after you upgrade the system to Lustre 2.6 or newer, the layout LFSCK will normally handle it as an unmatched MDT-OST object pair and will correct the PFID EA with the right MDT object's FID.

            People

              yong.fan nasf (Inactive)
              dustb100 Dustin Leverman
              Votes:
              0
              Watchers:
              9

              Dates

                Created:
                Updated:
                Resolved: