  1. Lustre
  2. LU-12816

LBUG: (niobuf.c:350:ptlrpc_register_bulk()) ASSERTION( !(desc->bd_registered && req->rq_send_state != LUSTRE_IMP_REPLAY) || mbits != desc->bd_last_mbits )


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.5
    • Affects Version/s: Lustre 2.12.1
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      c0-0c0s14n3 LustreError: 7380:0:(niobuf.c:350:ptlrpc_register_bulk()) LBUG
      c0-0c0s14n3 Pid: 7380, comm: ptlrpcd_01_49
      c0-0c0s14n3 Call Trace:
      c0-0c0s14n3 [<ffffffff81008efc>] try_stack_unwind+0x17c/0x190
      c0-0c0s14n3 [<ffffffff81007e84>] dump_trace+0x64/0x380
      c0-0c0s14n3 [<ffffffffa025476e>] libcfs_call_trace+0x4e/0x60 [libcfs]
      c0-0c0s14n3 [<ffffffffa0254e75>] lbug_with_loc+0x45/0xb0 [libcfs]
      c0-0c0s14n3 [<ffffffffa0a0ed32>] ptlrpc_register_bulk+0x822/0x950 [ptlrpc]
      c0-0c0s14n3 [<ffffffffa0a0f765>] ptl_send_rpc+0x215/0xd40 [ptlrpc]
      c0-0c0s14n3 [<ffffffffa0a0561d>] ptlrpc_send_new_req+0x42d/0x9d0 [ptlrpc]
      c0-0c0s14n3 [<ffffffffa0a077b8>] ptlrpc_check_set+0x8a8/0x2c70 [ptlrpc]
      c0-0c0s14n3 [<ffffffffa0a33f2a>] ptlrpcd_check+0x3aa/0x5b0 [ptlrpc]
      c0-0c0s14n3 [<ffffffffa0a342fc>] ptlrpcd+0x1cc/0x4c0 [ptlrpc]
      c0-0c0s14n3 [<ffffffff810775b6>] kthread+0xd6/0xf0
      c0-0c0s14n3 [<ffffffff8152690f>] ret_from_fork+0x3f/0x70
      

      This is the same fundamental problem as LU-10643. If LNetMEAttach fails with an ENOMEM error, ptl_send_rpc() fails mid-processing and must clean up the work it has done before the client tries to send the RPC again. The ptl_send_rpc() path makes two calls to LNetMEAttach in the case of bulk reads and writes. LU-10643 addresses an ENOMEM after the first call. This bug is the result of an ENOMEM after the second call.

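      The assertion fails because desc->bd_registered is still true while the request is not a replay (its send state is LUSTRE_IMP_FULL) and rq_mbits equals bd_last_mbits: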

      LustreError: 7380:0:(niobuf.c:350:ptlrpc_register_bulk()) ASSERTION
      ( !(desc->bd_registered && req->rq_send_state != LUSTRE_IMP_REPLAY) || mbits != desc->bd_last_mbits ) failed: 
      registered: 1 rq_mbits: 1636629211272768 bd_last_mbits: 1636629211272768
      
      crash_x86_64> ptlrpc_request ffff88298086dc40 | grep send_state
            cr_send_state = LUSTRE_IMP_FULL,
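
      To make the failing condition concrete, the following is a minimal sketch of the asserted predicate as a standalone C function. The function name, the reduced enum, and the flat parameter list are assumptions made for illustration; only the boolean expression itself is taken from the assertion text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduced stand-in for the import states; only the two values that
 * appear in this ticket are modelled (hypothetical enum). */
enum send_state { IMP_FULL, IMP_REPLAY };

/* Returns true when the condition asserted at niobuf.c:350 holds and
 * false when the LBUG would fire. Hypothetical helper, not Lustre code. */
bool register_bulk_assertion_holds(bool bd_registered,
                                   enum send_state rq_send_state,
                                   uint64_t mbits,
                                   uint64_t bd_last_mbits)
{
    return !(bd_registered && rq_send_state != IMP_REPLAY) ||
           mbits != bd_last_mbits;
}
```

      With the logged values (registered: 1, send state LUSTRE_IMP_FULL, rq_mbits and bd_last_mbits both 1636629211272768) the predicate is false. It would hold if any one input differed: bd_registered clear, a replay send state, or changed mbits.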
      

      Error scenario: An attempt is made to send a bulk RPC under low memory conditions. ptl_send_rpc() successfully calls ptlrpc_register_bulk(), which attaches the request buffer and sets bd_registered. ptl_send_rpc() then tries to attach the reply buffer, but this fails with an ENOMEM error. The cleanup path does not reset bd_registered, so when the next attempt is made to send the RPC, the assertion is triggered in ptlrpc_register_bulk().

      ptl_send_rpc:
      ....
              ptlrpc_register_bulk:
                     sets bd_registered
                     LNetMEAttach(request buffer)   <--- CAST-16472 fixes ENOMEM error handling
      
              if reply expected:
                     LNetMEAttach(reply buffer)
                     if ENOMEM
                            goto cleanup_bulk
      ....
      cleanup_bulk:
               ptlrpc_unregister_bulk()      <--- doesn't reset bd_registered
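
      The sequence above can be reproduced with a small state model. This is only a sketch: every model_* name, the reduced struct, and the MODEL_LBUG return code are hypothetical, and the reset_flag switch shows one possible remedy (clearing bd_registered during cleanup) without claiming that this is how the issue was actually fixed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return code marking the point where the real code LBUGs. */
#define MODEL_LBUG (-1000)

/* Heavily reduced model of a bulk descriptor (not the real struct). */
struct bulk_desc {
    bool     bd_registered;
    uint64_t bd_last_mbits;
};

/* Models ptlrpc_register_bulk() for a non-replay request: reports
 * MODEL_LBUG when the descriptor is already registered with the same
 * mbits, otherwise marks it registered. */
int model_register_bulk(struct bulk_desc *desc, uint64_t mbits)
{
    assert(desc != NULL);
    if (desc->bd_registered && mbits == desc->bd_last_mbits)
        return MODEL_LBUG;
    desc->bd_registered = true;
    desc->bd_last_mbits = mbits;
    return 0;
}

/* Models the cleanup_bulk label. reset_flag selects whether cleanup
 * clears bd_registered (assumed remedy) or leaves it set (the bug). */
void model_cleanup_bulk(struct bulk_desc *desc, bool reset_flag)
{
    if (reset_flag)
        desc->bd_registered = false;
}

/* Models one ptl_send_rpc() attempt. reply_attach_rc simulates the
 * result of the second LNetMEAttach (0 or -ENOMEM). */
int model_send_rpc(struct bulk_desc *desc, uint64_t mbits,
                   int reply_attach_rc, bool reset_flag)
{
    int rc = model_register_bulk(desc, mbits);

    if (rc != 0)
        return rc;
    if (reply_attach_rc != 0) {
        model_cleanup_bulk(desc, reset_flag);
        return reply_attach_rc;
    }
    return 0;
}
```

      In this model, a first attempt that fails with -ENOMEM on the reply buffer, followed by a retry with the same mbits, returns MODEL_LBUG when reset_flag is false and 0 when reset_flag is true.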
      

            People

              Assignee: amk Ann Koehler (Inactive)
              Reporter: amk Ann Koehler (Inactive)