[LU-15190] ptlrpc_server_check_resend_in_progress() can miss duplicate RPC Created: 03/Nov/21  Updated: 05/May/22  Resolved: 05/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

ptlrpc_server_check_resend_in_progress() has the following check at the beginning:

        if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT) ||
            (atomic_read(&req->rq_export->exp_rpc_count) == 0))
                return NULL;

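I think this can miss duplicate RPCs when none is in progress at that moment: if exp_rpc_count is 0, the function returns NULL without scanning, even though earlier copies of the same request may still be waiting in the incoming queue (high load, deep incoming queue).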
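
To illustrate the suspected gap, here is a minimal userspace sketch. It is not Lustre code: the toy_* types and helpers are hypothetical, rpc_count stands in for exp_rpc_count as described above, and TOY_MSG_RESENT is set to 0x2 only because the dumps in this ticket show "flags: 2". The second function is an illustrative variant that drops the counter test; it is not claimed to be what the landed patch does.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MSG_RESENT 0x2u /* assumption: matches "flags: 2" in the dumps */

struct toy_req {
	uint64_t xid;
	uint32_t flags;
};

struct toy_export {
	int rpc_count;                /* stand-in for exp_rpc_count */
	const struct toy_req *queued; /* requests waiting in the incoming queue */
	size_t nr_queued;
};

/* Linear scan of the queued requests for a matching XID. */
static const struct toy_req *toy_find_queued(const struct toy_export *exp,
					     uint64_t xid)
{
	size_t i;

	for (i = 0; i < exp->nr_queued; i++)
		if (exp->queued[i].xid == xid)
			return &exp->queued[i];
	return NULL;
}

/* Mirrors the early return quoted above: skips the scan when the
 * counter is zero, even if a copy is still sitting in the queue. */
static const struct toy_req *
toy_check_with_early_return(const struct toy_export *exp,
			    const struct toy_req *req)
{
	if (!(req->flags & TOY_MSG_RESENT) || exp->rpc_count == 0)
		return NULL;
	return toy_find_queued(exp, req->xid);
}

/* Hypothetical variant: only the RESENT flag gates the scan. */
static const struct toy_req *
toy_check_without_count(const struct toy_export *exp,
			const struct toy_req *req)
{
	if (!(req->flags & TOY_MSG_RESENT))
		return NULL;
	return toy_find_queued(exp, req->xid);
}
```

With one queued request and rpc_count == 0, a resend carrying the same XID gets NULL from the first function (the duplicate is missed), while the variant without the counter test finds the queued copy.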

There is a crash dump in support of this theory. In that dump I was able to find lots of duplicates (up to 14). For example:

crash> p *(struct ptlrpc_request *)(0xffff9887c6c75ee0-0x60)
  rq_reqmsg = 0xffff9887c7f34000, 
  rq_xid = 1709012909603712, 
  rq_export = 0xffff988805ee5400, 
  rq_peer = {
    nid = 1407418002966021, 

crash> p *(struct ptlrpc_request *)(0xffff987eb906a8e0-0x60)
  rq_reqmsg = 0xffff987e7145c148, 
  rq_xid = 1709012909603712, 
  rq_export = 0xffff988805ee5400, 
  rq_peer = {
    nid = 1407418002966021, 

crash> ptlrpc_request_dump (0xffff98745d3a5a60-0x60)
req: 0xffff9875002d6520, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff9875002d6600/104
crash> ptlrpc_request_dump (0xffff98771dd06360-0x60)
req: 0xffff9884e2218520, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff9884e2218600/104
crash> ptlrpc_request_dump (0xffff9878a80c3ae0-0x60)
req: 0xffff9878ae8ae148, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff9878ae8ae228/104
crash> ptlrpc_request_dump (0xffff98789c049b60-0x60)
req: 0xffff98789c7403d8, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff98789c7404b8/104

crash> p ((struct ldlm_request *)0xffff9875002d6600)->lock_handle
    cookie = 13969718594132579448
crash> p ((struct ldlm_request *)0xffff9884e2218600)->lock_handle
    cookie = 13969718594132579448
crash> p ((struct ldlm_request *)0xffff9878ae8ae228)->lock_handle
    cookie = 13969718594132579448
crash> p ((struct ldlm_request *)0xffff98789c7404b8)->lock_handle
    cookie = 13969718594132579448

Notice the same XID and the same lock handle.

I dumped all RPCs from the export's HP list and checked the XIDs:

$ cat xid-sorted-list.txt | wc -l
877858
$ cat xid-sorted-list.txt | uniq |wc -l
213480

I.e. about 3/4 of all RPCs (664378 of 877858) were duplicates.

Given that ptlrpc_server_check_resend_in_progress() uses a linear scan and a single spinlock to check for duplicates, the check takes a long time, and many CPUs were spinning for seconds.
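
For contrast with the linear scan, below is a rough userspace sketch of a hashed XID lookup. It is only an illustration of the idea: it does not use the kernel rhashtable API, and the xid_table type, its size, and the hash function are all assumptions made for this example. Removal and locking are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define XID_TABLE_BITS  10u                    /* hypothetical size */
#define XID_TABLE_SLOTS (1u << XID_TABLE_BITS)

struct xid_table {
	uint64_t slot[XID_TABLE_SLOTS];
	bool used[XID_TABLE_SLOTS];
	unsigned int count;
};

static void xid_table_init(struct xid_table *t)
{
	memset(t, 0, sizeof(*t));
}

/* Multiplicative hash: keep the top XID_TABLE_BITS bits of the product. */
static unsigned int xid_hash(uint64_t xid)
{
	return (unsigned int)((xid * 0x9E3779B97F4A7C15ULL) >>
			      (64 - XID_TABLE_BITS));
}

/*
 * Returns 1 if xid is already in the table (a duplicate),
 * 0 if it was not present and has now been added,
 * -1 if the table is full.
 * Linear probing is safe here because entries are never removed.
 */
static int xid_table_check_and_add(struct xid_table *t, uint64_t xid)
{
	unsigned int i = xid_hash(xid);
	unsigned int probes;

	for (probes = 0; probes < XID_TABLE_SLOTS; probes++) {
		if (!t->used[i]) {
			t->used[i] = true;
			t->slot[i] = xid;
			t->count++;
			return 0;
		}
		if (t->slot[i] == xid)
			return 1;
		i = (i + 1) & (XID_TABLE_SLOTS - 1);
	}
	return -1;
}
```

Each lookup touches a handful of slots on average instead of walking the whole list, so the time spent under a lock would no longer grow with the queue depth.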



 Comments   
Comment by Gerrit Updater [ 03/Nov/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45445
Subject: LU-15190 ptlrpc: fix duplication check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 177c1951d83e3efdbfc6cd63ca99c4d967898c0f

Comment by Gerrit Updater [ 03/Nov/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45446
Subject: LU-15190 ptlrpc: rhashtable for xid duplication check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4426b529a6dd44c3b6e00b400a8507d1728d8e39

Comment by Gerrit Updater [ 13/Dec/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45445/
Subject: LU-15190 ptlrpc: fix duplication check
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bb83a8af59d30b3f9e6de171eca962316ab7f6f4

Comment by Peter Jones [ 05/May/22 ]

Seems to be landed for 2.15

Generated at Sat Feb 10 03:16:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.