
[LU-3934] Directories gone missing after 2.4 update

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.5.0, Lustre 2.4.2
    • Affects Version/s: Lustre 2.4.1
    • Environment: lustre 2.4.0-17chaos (github.com/chaos/lustre)
    • Severity: 3

    Description

      After upgrading our servers from 2.1 to 2.4, our MDS crashed with the LBUG from LU-2842, and we applied the patch. That patch avoided the LBUG, but now it is clear that there is a more basic problem: we can no longer look up a number of the top-level subdirectories in this Lustre filesystem.

      We are seeing problems like:

      2013-09-11 13:01:22 LustreError: 5570:0:(mdt_open.c:1687:mdt_reint_open()) lsc-MDT0000: name purgelogs present, but fid [0x2830891e:0xd1781321:0x0] invalid

      It looks to me like the directory entries are still there, but FID lookups do not work on them. We verified that the directory named "purgelogs" appears on the underlying ldiskfs filesystem at ROOT/purgelogs.
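
      (A minimal sketch of those checks, for anyone reproducing them; the MDT device path and client mount point below are hypothetical:)

          # Mount the MDT backing filesystem read-only as ldiskfs
          # (/dev/mdt0 is a placeholder for the real MDT device)
          mount -t ldiskfs -o ro /dev/mdt0 /mnt/mdt-ldiskfs

          # The directory entry is still present under the backing filesystem root
          ls -ld /mnt/mdt-ldiskfs/ROOT/purgelogs

          # From a client, FID resolution can be probed with the FID from the
          # error message; it is expected to fail while the OI mapping is broken
          lfs fid2path /mnt/lustre "[0x2830891e:0xd1781321:0x0]"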

      We also saw error messages during recovery, shortly after the most recent boot, like the following:

      2013-09-11 12:58:27 sumom-mds1 login: LustreError: 4164:0:(mdt_open.c:1497:mdt_reint_open()) @@@ [0x24d18001:0x3db440f0:0x0]/XXXXXX->[0x24d98604:0x2a32454:0x0] cr_flags=0104200200001 mode=0200100000 msg_flag=0x4 not found in open replay.  req@ffff8808263d1000 x1443453865661288/t0(463856618502) o101->f45d6fab-2c9c-6b39-0090-4935fbe03e32@192.168.115.87@o2ib10:0/0 lens 568/1176 e 0 to 0 dl 1378929568 ref 1 fl Interpret:/4/0 rc 0/0

      (I X'ed out the user name there, but everything else is cut-and-paste.)

      Any ideas on the next step to get these directories accessible again?


          Activity

            pjones Peter Jones added a comment -

            Closing, as LLNL have pulled the fix(es) into their release and the fix has landed for 2.5.0.


            yong.fan nasf (Inactive) added a comment -

            6515 is already on b2_4, but not on b2_4_0, so you need to backport 6515 to b2_4_0 first, then apply 7625.
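
            (A minimal sketch of that backport flow against a b2_4_0-based tree; the Gerrit project path and the patch-set suffixes "/8" and "/5" below are assumptions, so check each change page for the patch set that actually landed:)

                # Fetch change 6515 from Gerrit and cherry-pick it, then 7625 on top.
                # Gerrit ref format: refs/changes/<last-two-digits>/<change>/<patch-set>
                cd lustre-release    # a b2_4_0-based checkout
                git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/15/6515/8 \
                    && git cherry-pick FETCH_HEAD
                git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/25/7625/5 \
                    && git cherry-pick FETCH_HEAD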

            morrone Christopher Morrone (Inactive) added a comment -

            http://review.whamcloud.com/#/c/6515/ also landed on b2_4, and you therefore based http://review.whamcloud.com/#/c/7625/ on it. 7625 does not apply cleanly without 6515. I'll just take both.

            yong.fan nasf (Inactive) added a comment -

            First, you need this patch (http://review.whamcloud.com/#/c/7625/) on Lustre 2.4 to resolve LU-3934.

            Then, if possible, please also consider the patch http://review.whamcloud.com/#/c/6515/, which mainly focuses on triggering OI scrub properly under DNE mode. That patch is based on master (Lustre 2.5). I am not sure whether it can be applied directly on your patch stack; please try, and if it cannot, we can back-port it.

            morrone Christopher Morrone (Inactive) added a comment -

            It looks like the 2.4 patch assumes the existence of this patch:

            448a0fb 2013-08-08 LU-3420 scrub: trigger OI scrub properly

            which did not exist in 2.4.0. I assume you are suggesting that I cherry-pick that as well?
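
            (A minimal sketch of that cherry-pick, assuming the lustre-release b2_4 branch is available from a remote named "origin"; 448a0fb is the commit quoted above:)

                # Make the b2_4 history available, then pick the LU-3420 commit
                git fetch origin b2_4
                git cherry-pick 448a0fb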

            jlevi Jodi Levi (Inactive) added a comment -

            Patch landed to master, so closing the ticket. Please let me know if anything additional is needed and I will reopen.

            yong.fan nasf (Inactive) added a comment -

            The patch for master to detect the upgrade:

            http://review.whamcloud.com/#/c/7719/

            yong.fan nasf (Inactive) added a comment -

            The full-system OI scrub runs in the background, so it will not cause the MDT mount to hang. If a client accesses the system before the OI scrub has finished, there are several cases:

            1) Name-based access: the client first sends a lookup by name, then accesses the object by the returned FID. This works.

            2) FID-based access: the client was connected to the MDT before the upgrade, already knew the object and cached its FID, and sends the cached FID directly to the MDT after the MDT is remounted for the upgrade.
            2.1) If that FID mapping has already been processed by the OI scrub, it works.
            2.2) Otherwise, the client may get failures.

            In theory, the MDT could revoke all the locks held by a client when it detects the upgrade, but there are races in which the FIDs a client is currently using may still hit the failures above.
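
            (A minimal sketch of how to watch that background scrub from the MDS; the target name lsc-MDT0000 is taken from the log line in the description:)

                # Dump OI scrub state for the MDT (osd-ldiskfs backend);
                # "status: completed" in the output means all OI mappings
                # have been rebuilt
                lctl get_param osd-ldiskfs.lsc-MDT0000.oi_scrub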

            morrone Christopher Morrone (Inactive) added a comment -

            What will happen when the automatic OI scrub is made to work?

            When we boot our MGS/MDS node after upgrading the software, what should we expect to see? Does the OI scrub make the mount of the MDT hang for several hours, or does it happen in the background?

            If the OI scrub happens in the background and clients are permitted to mount the filesystem, I presume there would be a period of time when users would still see inaccessible files and directories.

            yong.fan nasf (Inactive) added a comment -

            That is really good news. In any case, we need the patch above to make the auto-detect mechanism more robust, to avoid similar issues next time.

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 12
