[LU-12816] LBUG: (niobuf.c:350:ptlrpc_register_bulk()) ASSERTION( !(desc->bd_registered && req->rq_send_state != LUSTRE_IMP_REPLAY) || mbits != desc->bd_last_mbits ) Created: 27/Sep/19 Updated: 02/May/20 Resolved: 17/Apr/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.1 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.5 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ann Koehler (Inactive) | Assignee: | Ann Koehler (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
c0-0c0s14n3 LustreError: 7380:0:(niobuf.c:350:ptlrpc_register_bulk()) LBUG
c0-0c0s14n3 Pid: 7380, comm: ptlrpcd_01_49
c0-0c0s14n3 Call Trace:
c0-0c0s14n3 [<ffffffff81008efc>] try_stack_unwind+0x17c/0x190
c0-0c0s14n3 [<ffffffff81007e84>] dump_trace+0x64/0x380
c0-0c0s14n3 [<ffffffffa025476e>] libcfs_call_trace+0x4e/0x60 [libcfs]
c0-0c0s14n3 [<ffffffffa0254e75>] lbug_with_loc+0x45/0xb0 [libcfs]
c0-0c0s14n3 [<ffffffffa0a0ed32>] ptlrpc_register_bulk+0x822/0x950 [ptlrpc]
c0-0c0s14n3 [<ffffffffa0a0f765>] ptl_send_rpc+0x215/0xd40 [ptlrpc]
c0-0c0s14n3 [<ffffffffa0a0561d>] ptlrpc_send_new_req+0x42d/0x9d0 [ptlrpc]
c0-0c0s14n3 [<ffffffffa0a077b8>] ptlrpc_check_set+0x8a8/0x2c70 [ptlrpc]
c0-0c0s14n3 [<ffffffffa0a33f2a>] ptlrpcd_check+0x3aa/0x5b0 [ptlrpc]
c0-0c0s14n3 [<ffffffffa0a342fc>] ptlrpcd+0x1cc/0x4c0 [ptlrpc]
c0-0c0s14n3 [<ffffffff810775b6>] kthread+0xd6/0xf0
c0-0c0s14n3 [<ffffffff8152690f>] ret_from_fork+0x3f/0x70

This is the same fundamental problem as

The assertion fails because desc->bd_registered is true.

LustreError: 7380:0:(niobuf.c:350:ptlrpc_register_bulk()) ASSERTION( !(desc->bd_registered && req->rq_send_state != LUSTRE_IMP_REPLAY) || mbits != desc->bd_last_mbits ) failed: registered: 1 rq_mbits: 1636629211272768 bd_last_mbits: 1636629211272768
crash_x86_64> ptlrpc_request ffff88298086dc40 | grep send_state
rq_send_state = LUSTRE_IMP_FULL,
Error scenario: An attempt is made to send a bulk RPC under low-memory conditions. ptl_send_rpc() successfully calls ptlrpc_register_bulk(), which attaches the request buffer and sets bd_registered. ptl_send_rpc() then tries to attach the reply buffer, but this fails with an ENOMEM error. The cleanup path does not reset bd_registered, so the next attempt to send the RPC triggers the assertion in ptlrpc_register_bulk().

ptl_send_rpc:
    ....
    ptlrpc_register_bulk:
        sets bd_registered
        LNetMEAttach(request buffer)    <--- CAST-16472 fixes ENOMEM error handling
    if reply expected:
        LNetMEAttach(reply buffer)
        if ENOMEM
            goto cleanup_bulk
    ....
cleanup_bulk:
    ptlrpc_unregister_bulk()    <--- doesn't reset bd_registered
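
The scenario above can be reproduced in a small user-space model. This is only an illustrative sketch, not Lustre code and not necessarily what the merged patch does: every mock_* name is hypothetical, the descriptor carries only the two fields named in the assertion, the send is assumed to be a non-replay one, and the LBUG is replaced by a -EALREADY return so the failure can be observed. Clearing the flag in cleanup is shown as one possible remedy, which is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bulk descriptor; only the two fields
 * that appear in the failed assertion are modelled. */
struct mock_bulk_desc {
    bool     bd_registered;
    uint64_t bd_last_mbits;
};

/* Models the assertion for a non-replay send: registering again with
 * unchanged mbits while bd_registered is still set is the failing case.
 * The real code LBUGs; this sketch returns -EALREADY instead. */
static int mock_register_bulk(struct mock_bulk_desc *desc, uint64_t mbits)
{
    assert(desc != NULL);
    if (desc->bd_registered && mbits == desc->bd_last_mbits)
        return -EALREADY;
    desc->bd_registered = true;
    desc->bd_last_mbits = mbits;
    return 0;
}

/* Cleanup as described in the ticket: the flag is left set. */
static void mock_cleanup_stale(struct mock_bulk_desc *desc)
{
    (void)desc;
}

/* One possible remedy (an assumption): clear the flag on cleanup. */
static void mock_cleanup_reset(struct mock_bulk_desc *desc)
{
    desc->bd_registered = false;
}

/* One send attempt: register the bulk, then "attach the reply buffer",
 * which fails with ENOMEM when reply_enomem is true. */
static int mock_send_rpc(struct mock_bulk_desc *desc, uint64_t mbits,
                         bool reply_enomem,
                         void (*cleanup)(struct mock_bulk_desc *))
{
    int rc = mock_register_bulk(desc, mbits);

    if (rc != 0)
        return rc;
    if (reply_enomem) {
        cleanup(desc);
        return -ENOMEM;
    }
    return 0;
}
```

With mock_cleanup_stale, a first call that hits ENOMEM followed by a retry with the same mbits returns -EALREADY, which corresponds to the assertion firing. With mock_cleanup_reset, the same retry succeeds.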
|
| Comments |
| Comment by Gerrit Updater [ 27/Sep/19 ] |
|
Ann Koehler (amk@cray.com) uploaded a new patch: https://review.whamcloud.com/36309 |
| Comment by Gerrit Updater [ 06/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36309/ |
| Comment by Cory Spitz [ 09/Dec/19 ] |
|
Thanks for landing this fix. Sadly, Ann has retired and she won't be able to close this bug herself. Best wishes, Ann! |
| Comment by Gerrit Updater [ 17/Apr/20 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38266 |
| Comment by Gerrit Updater [ 01/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38266/ |