Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.12.1
- None
- 3
Description
c0-0c0s14n3 LustreError: 7380:0:(niobuf.c:350:ptlrpc_register_bulk()) LBUG
c0-0c0s14n3 Pid: 7380, comm: ptlrpcd_01_49
c0-0c0s14n3 Call Trace:
c0-0c0s14n3 [<ffffffff81008efc>] try_stack_unwind+0x17c/0x190
c0-0c0s14n3 [<ffffffff81007e84>] dump_trace+0x64/0x380
c0-0c0s14n3 [<ffffffffa025476e>] libcfs_call_trace+0x4e/0x60 [libcfs]
c0-0c0s14n3 [<ffffffffa0254e75>] lbug_with_loc+0x45/0xb0 [libcfs]
c0-0c0s14n3 [<ffffffffa0a0ed32>] ptlrpc_register_bulk+0x822/0x950 [ptlrpc]
c0-0c0s14n3 [<ffffffffa0a0f765>] ptl_send_rpc+0x215/0xd40 [ptlrpc]
c0-0c0s14n3 [<ffffffffa0a0561d>] ptlrpc_send_new_req+0x42d/0x9d0 [ptlrpc]
c0-0c0s14n3 [<ffffffffa0a077b8>] ptlrpc_check_set+0x8a8/0x2c70 [ptlrpc]
c0-0c0s14n3 [<ffffffffa0a33f2a>] ptlrpcd_check+0x3aa/0x5b0 [ptlrpc]
c0-0c0s14n3 [<ffffffffa0a342fc>] ptlrpcd+0x1cc/0x4c0 [ptlrpc]
c0-0c0s14n3 [<ffffffff810775b6>] kthread+0xd6/0xf0
c0-0c0s14n3 [<ffffffff8152690f>] ret_from_fork+0x3f/0x70
This is the same fundamental problem as LU-10643. If LNetMEAttach() fails with an ENOMEM error, ptl_send_rpc() fails mid-processing and must clean up the work it has already done before the client tries to send the RPC again. For bulk reads and writes, the ptl_send_rpc() path makes two calls to LNetMEAttach(). LU-10643 addresses an ENOMEM after the first call; this bug is the result of an ENOMEM after the second call.
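The assertion fails because desc->bd_registered is still true on the retry, while the request is not in replay (its send state is LUSTRE_IMP_FULL) and its matchbits are unchanged (rq_mbits equals bd_last_mbits).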
LustreError: 7380:0:(niobuf.c:350:ptlrpc_register_bulk()) ASSERTION ( !(desc->bd_registered && req->rq_send_state != LUSTRE_IMP_REPLAY) || mbits != desc->bd_last_mbits ) failed: registered: 1 rq_mbits: 1636629211272768 bd_last_mbits: 1636629211272768

crash_x86_64> ptlrpc_request ffff88298086dc40 | grep send_state
  rq_send_state = LUSTRE_IMP_FULL,
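To make the failed condition easier to read, here is a minimal user-space sketch that evaluates the same boolean expression with the values from the log. It is not Lustre code: `model_imp_state`, `model_register_precondition()` and `model_check_logged_values()` are hypothetical names, and the enum only distinguishes "replay" from "not replay".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the import send state; only the difference
 * between "replay" and "anything else" matters for this condition. */
enum model_imp_state { MODEL_IMP_FULL, MODEL_IMP_REPLAY };

/* Same shape as the logged assertion:
 *   !(registered && send_state != REPLAY) || mbits != last_mbits
 * Returns true when the assertion would hold. */
static bool model_register_precondition(bool registered,
                                        enum model_imp_state send_state,
                                        uint64_t mbits, uint64_t last_mbits)
{
    return !(registered && send_state != MODEL_IMP_REPLAY) ||
           mbits != last_mbits;
}

/* Plugs in the values reported in the log and the crash session. */
static void model_check_logged_values(void)
{
    const uint64_t mbits = 1636629211272768ULL;

    /* registered: 1, send state FULL, identical matchbits: condition is false. */
    assert(!model_register_precondition(true, MODEL_IMP_FULL, mbits, mbits));
}
```

With `registered` cleared, or with the request in replay, or with fresh matchbits, the same expression evaluates to true, which is why a stale registered flag alone is enough to trip it on a plain resend.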
Error scenario: An attempt is made to send a bulk RPC under low-memory conditions. ptl_send_rpc() successfully calls ptlrpc_register_bulk(), which attaches the request buffer and sets bd_registered. ptl_send_rpc() then tries to attach the reply buffer, but this fails with an ENOMEM error. The cleanup path does not reset bd_registered, so the next attempt to send the RPC triggers the assertion in ptlrpc_register_bulk().
ptl_send_rpc:
    ....
    ptlrpc_register_bulk: sets bd_registered
        LNetMEAttach(request buffer)   <--- CAST-16472 fixes ENOMEM error handling
    if reply expected:
        LNetMEAttach(reply buffer)
        if ENOMEM goto cleanup_bulk
    ....
cleanup_bulk:
    ptlrpc_unregister_bulk()   <--- doesn't reset bd_registered
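The sequence above can be reproduced with a small user-space model. This is only a sketch under stated assumptions, not the Lustre implementation: every `model_*` name is hypothetical, the registered flag stands in for bd_registered, the matchbits and replay checks are left out (the retry is assumed to reuse the same matchbits outside replay), and `-EALREADY` is returned where the real code would LBUG. The `cleanup_resets_flag` switch shows one plausible remedy, clearing the flag on the cleanup path; it is not confirmed here as the change that was actually landed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down bulk descriptor. */
struct model_bulk_desc {
    bool registered;   /* stands in for bd_registered */
};

/* Models registration: refuses a descriptor that is still marked
 * registered (-EALREADY replaces the LBUG of the real assertion). */
static int model_register_bulk(struct model_bulk_desc *desc)
{
    if (desc->registered)
        return -EALREADY;
    desc->registered = true;
    return 0;
}

/* Models the cleanup path; reset_flag selects whether the stale
 * registered flag is cleared (assumed remedy) or left set (the bug). */
static void model_unregister_bulk(struct model_bulk_desc *desc, bool reset_flag)
{
    if (reset_flag)
        desc->registered = false;
}

/* Models one send attempt: register the bulk, then attach the reply
 * buffer, which fails with -ENOMEM when reply_attach_enomem is true. */
static int model_send_rpc(struct model_bulk_desc *desc,
                          bool reply_attach_enomem,
                          bool cleanup_resets_flag)
{
    int rc;

    assert(desc != NULL);

    rc = model_register_bulk(desc);
    if (rc != 0)
        return rc;

    if (reply_attach_enomem) {
        /* cleanup_bulk path */
        model_unregister_bulk(desc, cleanup_resets_flag);
        return -ENOMEM;
    }
    return 0;
}
```

In this model, a first call with `reply_attach_enomem = true` and `cleanup_resets_flag = false` returns `-ENOMEM` and leaves the flag set, so a second call on the same descriptor returns `-EALREADY`. With `cleanup_resets_flag = true`, the second call succeeds.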
Issue Links
- is related to LU-13509 Improve ptlrpc_register_bulk() behavior (Resolved)