LBUG mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) )

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.1.4
    • 3
    • 7064

    Description

      This issue has already been hit on Lustre 2.2 (see LU-1702). The traces are exactly the same as for LU-1702.

      It's been hit four consecutive times so it seems quite easy to reproduce.

      2013-03-06 16:05:01 LustreError: 31751:0:(mdt_open.c:1023:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:
      2013-03-06 16:05:01 LustreError: 31751:0:(mdt_open.c:1023:mdt_reconstruct_open()) LBUG
      2013-03-06 16:05:01 Pid: 31751, comm: mdt_145
      2013-03-06 16:05:01
      2013-03-06 16:05:01 Call Trace:
      2013-03-06 16:05:01 [<ffffffffa04a27f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-03-06 16:05:01 [<ffffffffa04a2e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-03-06 16:05:01 [<ffffffffa0d9ed87>] mdt_reconstruct_open+0x7c7/0xa80 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d908c5>] mdt_reconstruct+0x45/0x120 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d7d099>] mdt_reint_internal+0x709/0x8e0 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d7d53d>] mdt_intent_reint+0x1ed/0x500 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d7bc09>] mdt_intent_policy+0x379/0x690 [mdt]
      2013-03-06 16:05:01 [<ffffffffa06ca3c1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffffa06f03dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffffa0d7c586>] mdt_enqueue+0x46/0x130 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d71762>] mdt_handle_common+0x932/0x1750 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d72655>] mdt_regular_handle+0x15/0x20 [mdt]
      2013-03-06 16:05:01 [<ffffffffa071f4f6>] ptlrpc_main+0xd16/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
      2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffff8100412a>] child_rip+0xa/0x20
      2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffff81004120>] ? child_rip+0x0/0x20

      In the crash dump, the file that triggered the LBUG is a file created by mpio.

      The onsite support team made the following analysis:

      The return status (rc) is -EREMOTE (-66), and the disposition mask appears to have been
      DISP_IT_EXECD / DISP_LOOKUP_EXECD / DISP_LOOKUP_POS / DISP_OPEN_OPEN / DISP_OPEN_LOCK.
      Based on this information, it is possible that the MDS ran mdt_reint_open() and got -EREMOTE back just before the LBUG.

      So mdt_reint_open() would return -EREMOTE, and then
      mdt_reconstruct_open() does not take into account that, when -EREMOTE
      is returned, there is no msg transno setting ...
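
      To make the failing condition concrete, here is a minimal, standalone sketch of the predicate quoted in the assertion. It is not Lustre code: the function name reconstruct_open_invariant() and its plain integer arguments are hypothetical stand-ins for rc and for lustre_msg_get_transno(req->rq_repmsg). It only shows that the assertion can fail solely when rc is negative and the reply still carries a non-zero transno.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef EREMOTE
#define EREMOTE 66 /* Linux value, matching the -66 quoted in the analysis */
#endif

/*
 * Hypothetical stand-in for the asserted condition:
 *     !(rc < 0) || transno == 0
 * It holds when the operation succeeded (rc >= 0), or when it failed
 * and the reply message carries no transaction number.
 */
static bool reconstruct_open_invariant(int rc, uint64_t reply_transno)
{
    return !(rc < 0) || reply_transno == 0;
}
```

      With these assumptions, reconstruct_open_invariant(-EREMOTE, 0) holds, while reconstruct_open_invariant(-EREMOTE, t) is false for any non-zero t, which is the combination the LBUG reports.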

      The attached file contains the struct mdt_thread_info data (info) for the
      thread that hit the LBUG, as well as the req data (struct ptlrpc_request)
      and the lcd data (struct lsd_client_data).

      Attachments

        Issue Links

          Activity

            [LU-2943] LBUG mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) )
            bfaccini Bruno Faccini (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            bfaccini Bruno Faccini (Inactive) added a comment:

            Cool, thanks for your update Seb. So I am marking this ticket as Fixed.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment:

            Hi Bruno,

            The support team confirms that your fix resolves the issue.
            Thank you!

            Sebastien.

            bfaccini Bruno Faccini (Inactive) added a comment:

            Hello Alex and Seb, do you have any update for this ticket?
            Bye,
            Bruno.
            pjones Peter Jones made changes -
            Labels Original: ptr New: mn1
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-3987 [ LU-3987 ]

            louveta Alexandre Louvet (Inactive) added a comment:

            Hello Bruno,
            The new package with your fix was delivered last Friday, when we got approval to pick up your patch.
            We still have to find a time slot to install it on the system.

            I'll keep you informed.

            Cheers,
            Alex.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment:

            Hi Bruno,

            Patchset #3 of http://review.whamcloud.com/5954 was rolled out at CEA for testing at the end of last week.
            Hopefully we will have news soon.

            Cheers,
            Sebastien.

            bfaccini Bruno Faccini (Inactive) added a comment:

            Hello Alex,
            Reviewers have approved my patch, so it should be integrated soon. In the meantime, would it be possible for you to integrate it temporarily and test it under production workload? I know this is not easy for you to set up since it affects the MDS side, but I have no idea how to reproduce the problem locally.

            louveta Alexandre Louvet (Inactive) added a comment:

            What is the current status of the latest patch?

            People

              bfaccini Bruno Faccini (Inactive)
              dmoreno Diego Moreno (Inactive)
              Votes: 2
              Watchers: 12
