  1. Lustre
  2. LU-2943

LBUG mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) )


Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.1.4
    • 3
    • 7064

    Description

      This issue has already been hit on Lustre 2.2 (see LU-1702). The traces are exactly the same as in LU-1702.

      It's been hit four consecutive times so it seems quite easy to reproduce.

      2013-03-06 16:05:01 LustreError: 31751:0:(mdt_open.c:1023:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:
      2013-03-06 16:05:01 LustreError: 31751:0:(mdt_open.c:1023:mdt_reconstruct_open()) LBUG
      2013-03-06 16:05:01 Pid: 31751, comm: mdt_145
      2013-03-06 16:05:01
      2013-03-06 16:05:01 Call Trace:
      2013-03-06 16:05:01 [<ffffffffa04a27f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-03-06 16:05:01 [<ffffffffa04a2e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-03-06 16:05:01 [<ffffffffa0d9ed87>] mdt_reconstruct_open+0x7c7/0xa80 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d908c5>] mdt_reconstruct+0x45/0x120 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d7d099>] mdt_reint_internal+0x709/0x8e0 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d7d53d>] mdt_intent_reint+0x1ed/0x500 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d7bc09>] mdt_intent_policy+0x379/0x690 [mdt]
      2013-03-06 16:05:01 [<ffffffffa06ca3c1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffffa06f03dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffffa0d7c586>] mdt_enqueue+0x46/0x130 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d71762>] mdt_handle_common+0x932/0x1750 [mdt]
      2013-03-06 16:05:01 [<ffffffffa0d72655>] mdt_regular_handle+0x15/0x20 [mdt]
      2013-03-06 16:05:01 [<ffffffffa071f4f6>] ptlrpc_main+0xd16/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
      2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffff8100412a>] child_rip+0xa/0x20
      2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      2013-03-06 16:05:01 [<ffffffff81004120>] ? child_rip+0x0/0x20

      In the crash dump, the file that triggered the LBUG is a file created by mpio.

      The onsite support team made the following analysis:

      The return status (rc) is -EREMOTE (-66), and the disposition mask
      appears to have been DISP_IT_EXECD / DISP_LOOKUP_EXECD / DISP_LOOKUP_POS
      / DISP_OPEN_OPEN / DISP_OPEN_LOCK. Based on this information, it is
      possible that, just before the LBUG, the MDS ran mdt_reint_open() and
      got -EREMOTE in return.

      So mdt_reint_open() would return -EREMOTE, and mdt_reconstruct_open()
      then does not take into account that, when -EREMOTE is returned,
      there is no msg transno setting ...
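
      For reference, the failing check can be restated as a small standalone
      predicate. This is only a sketch for reasoning about the assertion, not
      the MDT code: the function name reconstruct_assertion_holds() is
      hypothetical, and the transno parameter merely stands in for the value
      that lustre_msg_get_transno(req->rq_repmsg) would return.

```c
#include <errno.h>    /* EREMOTE */
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the condition in the failing assertion:
 *   (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0))
 * 'transno' models the transaction number stored in the reply message.
 */
static bool reconstruct_assertion_holds(int rc, uint64_t transno)
{
    /* True unless rc is an error AND the reply carries a non-zero transno. */
    return !(rc < 0) || transno == 0;
}
```

      Read this way, the assertion can only fire when rc is negative (here
      -EREMOTE) while the reply message still carries a non-zero transno;
      a successful rc, or an error with transno 0, satisfies it.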

      The attached file contains the struct mdt_thread_info (info) data of
      the thread that hit the LBUG, as well as the req data (struct
      ptlrpc_request) and the lcd data (struct lsd_client_data).

            People

              bfaccini Bruno Faccini (Inactive)
              dmoreno Diego Moreno (Inactive)
              Votes: 2
              Watchers: 12