
[LU-2943] LBUG mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.1.4
    • Severity: 3
    • 7064

    Description

      This issue has already been hit on Lustre 2.2 (see LU-1702). The traces are exactly the same as for LU-1702.

      It has been hit four consecutive times, so it seems quite easy to reproduce.

      2013-03-06 16:05:01 LustreError: 31751:0:(mdt_open.c:1023:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:
      2013-03-06 16:05:01 LustreError: 31751:0:(mdt_open.c:1023:mdt_reconstruct_open()) LBUG
      2013-03-06 16:05:01 Pid: 31751, comm: mdt_145
      2013-03-06 16:05:01
      2013-03-06 16:05:01 Call Trace:
      2013-03-06 16:05:01 [<ffffffffa04a27f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-03-06 16:05:01 [<ffffffffa04a2e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-03-06 16:05:01 [<ffffffffa0d9ed87>] mdt_reconstruct_open+0x7c7/0xa80 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d908c5>] mdt_reconstruct+0x45/0x120 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d7d099>] mdt_reint_internal+0x709/0x8e0 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d7d53d>] mdt_intent_reint+0x1ed/0x500 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d7bc09>] mdt_intent_policy+0x379/0x690 [mdt]
      2013-03-06 16:05:01 [<ffffffffa06ca3c1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffffa06f03dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffffa0d7c586>] mdt_enqueue+0x46/0x130 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d71762>] mdt_handle_common+0x932/0x1750 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d72655>] mdt_regular_handle+0x15/0x20 [mdt]
      2013-03-06 16:05:01 [<ffffffffa071f4f6>] ptlrpc_main+0xd16/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
      2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffff8100412a>] child_rip+0xa/0x20
      2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffff81004120>] ? child_rip+0x0/0x20

      In the crash, the file that triggered the LBUG is a file created by MPI-IO.

      The onsite support team made the following analysis:

      The return status (rc) is -EREMOTE (-66), and it seems the disposition mask was DISP_IT_EXECD | DISP_LOOKUP_EXECD | DISP_LOOKUP_POS | DISP_OPEN_OPEN | DISP_OPEN_LOCK. Based on this information, it is possible that, just prior to the LBUG, the MDS ran mdt_reint_open() and got -EREMOTE in return.

      So mdt_reint_open() would return -EREMOTE, and mdt_reconstruct_open() does not take into account that, on an -EREMOTE return, the transno in the reply message is not zeroed.

      In the attached file you can find the struct mdt_thread_info data that triggered the LBUG, as well as the req data (struct ptlrpc_request) and the lcd data (struct lsd_client_data).

    Attachments

    Issue Links

    Activity


            bfaccini Bruno Faccini (Inactive) added a comment:

            Cool, thanks for your update Seb. So I am marking this ticket as Fixed.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment:

            Hi Bruno,

            The support team confirms that your fix does fix the issue.
            Thank you!

            Sebastien.

            bfaccini Bruno Faccini (Inactive) added a comment:

            Hello Alex and Seb, do you have any update for this ticket?
            Bye,
            Bruno.

            louveta Alexandre Louvet (Inactive) added a comment:

            Hello Bruno,
            The new package with your fix was delivered last Friday, when we got the approval to pick up your patch.
            It remains for us to find a time frame to install it on the system.

            I'll keep you informed.

            Cheers,
            Alex.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment:

            Hi Bruno,

            Patch set #3 of http://review.whamcloud.com/5954 was rolled out at CEA for testing purposes at the end of last week.
            Hopefully we will have news soon.

            Cheers,
            Sebastien.

            bfaccini Bruno Faccini (Inactive) added a comment:

            Hello Alex,
            Reviewers approved my patch, so it should be integrated soon. In the meantime, is there a possibility for you to temporarily integrate it and test it under a production workload? I know it is not easy for you to set up, since it affects the MDS side, but I have no idea how to reproduce this locally.

            louveta Alexandre Louvet (Inactive) added a comment:

            What is the current status of the latest patch?

            bfaccini Bruno Faccini (Inactive) added a comment:

            A new version (patch set #3) of the b2_1 port http://review.whamcloud.com/5954 has been submitted and successfully passed auto-tests.

            bfaccini Bruno Faccini (Inactive) added a comment:

            Humm, sorry about that. Due to your report I am currently looking again at the original/master patch from LU-2927. It seems that I missed some cases in my first back-port, and I am testing a new version. Will keep you updated soon.

            louveta Alexandre Louvet (Inactive) added a comment:

            Bruno, sorry to say that just after installing the patch, we got a lot of crashes on 3 large clusters.

            The LBUG message is:
            (mdt_handler.c:3411:mdt_intent_reint()) ASSERTION( lustre_handle_is_used(&lhc->mlh_reg_lh) ) failed:
            (mdt_handler.c:3411:mdt_intent_reint()) LBUG

            followed by this stack:
            Call Trace:
            [<ffffffffa041a7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            [<ffffffffa041ae07>] lbug_with_loc+0x47/0xb0 [libcfs]
            [<ffffffffa0c0a841>] mdt_intent_reint+0x4f1/0x530 [mdt]
            [<ffffffffa0c08c09>] mdt_intent_policy+0x379/0x690 [mdt]
            [<ffffffffa06653c1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
            [<ffffffffa068b3dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
            [<ffffffffa0c09586>] mdt_enqueue+0x46/0x130 [mdt]
            [<ffffffffa0bfe762>] mdt_handle_common+0x932/0x1750 [mdt]
            [<ffffffffa0bff655>] mdt_regular_handle+0x15/0x20 [mdt]
            [<ffffffffa06ba4f6>] ptlrpc_main+0xd16/0x1a80 [ptlrpc]
            [<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
            [<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
            [<ffffffff8100412a>] child_rip+0xa/0x20
            [<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
            [<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
            [<ffffffff81004120>] ? child_rip+0x0/0x20

            Kernel panic - not syncing: LBUG
            Pid: 60412, comm: mdt_440 Not tainted 2.6.32-220.23.1.bl6.Bull.28.8.x86_64 0000001
            Call Trace:
            [<ffffffff81484650>] ? panic+0x78/0x143
            [<ffffffffa041ae5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
            [<ffffffffa0c0a841>] ? mdt_intent_reint+0x4f1/0x530 [mdt]
            [<ffffffffa0c08c09>] ? mdt_intent_policy+0x379/0x690 [mdt]
            [<ffffffffa06653c1>] ? ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
            [<ffffffffa068b3dd>] ? ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
            [<ffffffffa0c09586>] ? mdt_enqueue+0x46/0x130 [mdt]
            [<ffffffffa0bfe762>] ? mdt_handle_common+0x932/0x1750 [mdt]
            [<ffffffffa0bff655>] ? mdt_regular_handle+0x15/0x20 [mdt]
            [<ffffffffa06ba4f6>] ? ptlrpc_main+0xd16/0x1a80 [ptlrpc]
            [<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
            [<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
            [<ffffffff8100412a>] ? child_rip+0xa/0x20
            [<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
            [<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
            [<ffffffff81004120>] ? child_rip+0x0/0x20

            That was observed on systems running Lustre 2.1.5 plus the following patches:

            • ORNL-22 general ptlrpcd threads pool support
            • LU-1144 implement a NUMA aware ptlrpcd binding policy
            • LU-1110 MDS Oops in osd_xattr_get() during file open by FID
            • LU-2613 too much unreclaimable slab space
            • LU-2624 ptlrpc fix thread stop
            • LU-2683 client deadlock in cl_lock_mutex_get
            • LU-2943 LBUG in mdt_reconstruct_open()

            I agree this is not the same context as before, but it is located exactly where the patch modifies the source code.
            The patch was removed on the 3 affected clusters and, since then, stability is back.

            Alex.

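            For reference, the helper named in this new assertion just checks whether a lock handle has been filled in with a non-zero cookie; a minimal sketch, as it appears in 2.1-era trees (quoted from memory, so treat the exact form as approximate):

            /* lustre/include/lustre/lustre_idl.h (approximate): a handle
             * is "used" once a non-zero cookie has been assigned to it.
             * mdt_intent_reint() asserts that the reint open left a valid
             * lock handle in lhc->mlh_reg_lh before packing the intent
             * reply, so a back-port that lets mdt_reint_open() return
             * without taking that lock would trip exactly this check. */
            static inline int lustre_handle_is_used(struct lustre_handle *lh)
            {
                    return lh->cookie != 0ull;
            }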

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: dmoreno Diego Moreno (Inactive)
              Votes: 2
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: