[LU-15190] ptlrpc_server_check_resend_in_progress() can miss duplicate RPC Created: 03/Nov/21 Updated: 05/May/22 Resolved: 05/May/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Alex Zhuravlev | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
ptlrpc_server_check_resend_in_progress() has the following check at the beginning:
    if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT) ||
        (atomic_read(&req->rq_export->exp_rpc_count) == 0))
            return NULL;
I think this can let duplicate RPCs through if none is in progress at the moment (due to high load and a deep incoming queue): with exp_rpc_count == 0 the function returns NULL before it looks for an earlier copy of the request. There is a crash dump in support of this theory. In that dump I was able to find lots of duplicates (up to 14). For example:
crash> p *(struct ptlrpc_request *)(0xffff9887c6c75ee0-0x60)
rq_reqmsg = 0xffff9887c7f34000,
rq_xid = 1709012909603712,
rq_export = 0xffff988805ee5400,
rq_peer = {
nid = 1407418002966021,
crash> p *(struct ptlrpc_request *)(0xffff987eb906a8e0-0x60)
rq_reqmsg = 0xffff987e7145c148,
rq_xid = 1709012909603712,
rq_export = 0xffff988805ee5400,
rq_peer = {
nid = 1407418002966021,
crash> ptlrpc_request_dump (0xffff98745d3a5a60-0x60)
req: 0xffff9875002d6520, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff9875002d6600/104
crash> ptlrpc_request_dump (0xffff98771dd06360-0x60)
req: 0xffff9884e2218520, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff9884e2218600/104
crash> ptlrpc_request_dump (0xffff9878a80c3ae0-0x60)
req: 0xffff9878ae8ae148, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff9878ae8ae228/104
crash> ptlrpc_request_dump (0xffff98789c049b60-0x60)
req: 0xffff98789c7403d8, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff98789c7404b8/104
crash> p ((struct ldlm_request *)0xffff9875002d6600)->lock_handle
cookie = 13969718594132579448
crash> p ((struct ldlm_request *)0xffff9884e2218600)->lock_handle
cookie = 13969718594132579448
crash> p ((struct ldlm_request *)0xffff9878ae8ae228)->lock_handle
cookie = 13969718594132579448
crash> p ((struct ldlm_request *)0xffff98789c7404b8)->lock_handle
cookie = 13969718594132579448
Notice the same XID and the same lock handle in each case. I dumped all RPCs from the export's HP list and checked the XIDs:
$ cat xid-sorted-list.txt | wc -l
877858
$ cat xid-sorted-list.txt | uniq | wc -l
213480
That is, about 3/4 of all RPCs were duplicates (877858 - 213480 = 664378 of 877858). Given that ptlrpc_server_check_resend_in_progress() uses a linear scan to check for duplicates and a single spinlock, the check takes a long time, and many CPUs were spinning for seconds. |
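The effect of the early return can be illustrated with a small standalone model. This is only a sketch, not Lustre code: the toy_* types and functions are hypothetical, the queue is a fixed array rather than the export's RPC lists, and it assumes, as the description suggests, that the counter only covers requests currently being processed. The second variant is one possible alternative for comparison and is not claimed to be what the uploaded patches do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the real structures; every name here is hypothetical. */
struct toy_req {
	uint64_t xid;     /* request XID */
	bool     resent;  /* models the MSG_RESENT flag */
};

struct toy_export {
	struct toy_req queued[16]; /* requests waiting in the incoming queue */
	size_t         nr_queued;
	int            rpc_count;  /* requests being processed right now */
};

/* Linear scan of the queue for an earlier request with the same XID. */
const struct toy_req *toy_scan(const struct toy_export *exp, uint64_t xid)
{
	for (size_t i = 0; i < exp->nr_queued; i++)
		if (exp->queued[i].xid == xid)
			return &exp->queued[i];
	return NULL;
}

/* Models the check quoted in the description: the scan is skipped
 * whenever no request is in progress, even if the original request
 * is still sitting in the queue. */
const struct toy_req *toy_check_with_shortcut(const struct toy_export *exp,
					      const struct toy_req *req)
{
	assert(exp != NULL && req != NULL);
	if (!req->resent || exp->rpc_count == 0)
		return NULL;
	return toy_scan(exp, req->xid);
}

/* Comparison variant (an assumption, not the landed fix): scan for
 * every resent request, regardless of the in-progress counter. */
const struct toy_req *toy_check_always_scan(const struct toy_export *exp,
					    const struct toy_req *req)
{
	assert(exp != NULL && req != NULL);
	if (!req->resent)
		return NULL;
	return toy_scan(exp, req->xid);
}
```

With one request queued, nothing in progress, and a resend carrying the same XID, toy_check_with_shortcut() returns NULL and the duplicate would be queued again, while toy_check_always_scan() finds the queued original.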
| Comments |
| Comment by Gerrit Updater [ 03/Nov/21 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45445 |
| Comment by Gerrit Updater [ 03/Nov/21 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45446 |
| Comment by Gerrit Updater [ 13/Dec/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45445/ |
| Comment by Peter Jones [ 05/May/22 ] |
|
Seems to have landed for 2.15