[LU-15068] Race between commit callback and reply_out_callback::LNET_EVENT_SEND Created: 06/Oct/21  Updated: 12/Mar/22  Resolved: 30/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When LNet is under load, messages can be queued while waiting for a peer TX credit or a network TX credit. When running benchmarks on a large-scale system, we observed clients hitting "slow reply" timeouts for MDS_REINT RPCs. Tracing revealed that the server received the MDS_REINT RPC and sent a reply to the client, but the reply was queued in LNet because no peer credits were available.
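The credit queuing described above can be pictured with a small toy model. This is only a sketch for illustration: `toy_peer`, `toy_send()` and `toy_return_credit()` are hypothetical names, not LNet data structures or APIs, and the model ignores network TX credits and everything else LNet actually tracks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of per-peer TX credits; not LNet code. */
#define TOY_QUEUE_MAX 8

enum toy_send_result { TOY_SENT, TOY_QUEUED, TOY_QUEUE_FULL };

struct toy_peer {
	int    tx_credits;             /* sends allowed right now */
	int    queued[TOY_QUEUE_MAX];  /* message ids waiting for a credit */
	size_t nqueued;
};

/* Send immediately if a credit is free, otherwise queue the message. */
enum toy_send_result toy_send(struct toy_peer *p, int msg_id)
{
	assert(p->tx_credits >= 0);

	if (p->tx_credits > 0) {
		p->tx_credits--;
		return TOY_SENT;
	}
	if (p->nqueued == TOY_QUEUE_MAX)
		return TOY_QUEUE_FULL;

	p->queued[p->nqueued++] = msg_id;
	return TOY_QUEUED;
}

/*
 * A finished send hands its credit back. If a message is waiting, the
 * oldest one consumes the credit and its id is returned; otherwise the
 * credit goes back to the pool and -1 is returned.
 */
int toy_return_credit(struct toy_peer *p)
{
	int id;
	size_t i;

	if (p->nqueued == 0) {
		p->tx_credits++;
		return -1;
	}

	id = p->queued[0];
	for (i = 1; i < p->nqueued; i++)
		p->queued[i - 1] = p->queued[i];
	p->nqueued--;
	return id;
}
```

In this model a reply sent while `tx_credits` is zero simply sits in `queued[]` until some earlier send completes, which is the window in which the race below can occur.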

Shortly after, the commit callback fired and queued the reply state for handling via ptlrpc_commit_replies() -> rs_batch_add():

void ptlrpc_commit_replies(struct obd_export *exp)
{
...
                if (rs->rs_transno <= exp->exp_last_committed) {
                        list_del_init(&rs->rs_obd_list);
                        rs_batch_add(&batch, rs);
                } 

The reply state's MD handle was then unlinked by ptlrpc_handle_rs():

static int
ptlrpc_handle_rs(struct ptlrpc_reply_state *rs)
{
...
        if ((!been_handled && rs->rs_on_net) || nlocks > 0) {
                spin_unlock(&rs->rs_lock);

                if (!been_handled && rs->rs_on_net) {
                        LNetMDUnlink(rs->rs_md_h);

But the reply never left the server; it was still queued in LNet. Because the MD had been unlinked, LNet aborted the send once a credit became available. The client eventually hit "timeout for slow reply", which caused it to reconnect.
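The ordering problem can be sketched with a toy state machine. This is not the Lustre code and not the patch itself (the patch contents are not shown in this ticket): `toy_reply` and the functions below are hypothetical, and the `wait_for_send` flag only illustrates the idea in the patch subject, "Do not unlink difficult reply until sent".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a reply state; field names are illustrative. */
struct toy_reply {
	bool queued;           /* waiting in the network layer for a credit */
	bool md_linked;        /* memory descriptor still attached */
	bool sent;             /* reply actually left the server */
	bool aborted;          /* send abandoned because the MD was gone */
	bool unlink_deferred;  /* commit arrived before the send completed */
};

/* The reply is handed to the network layer but no credit is available. */
void toy_reply_queue(struct toy_reply *r)
{
	*r = (struct toy_reply){ .queued = true, .md_linked = true };
}

/*
 * Commit callback path. wait_for_send == false models the behaviour in
 * this report (unlink right away); true models deferring the unlink
 * until the send has completed.
 */
void toy_commit(struct toy_reply *r, bool wait_for_send)
{
	if (wait_for_send && !r->sent) {
		r->unlink_deferred = true;
		return;
	}
	r->md_linked = false;
}

/* A credit became available, so the queued reply is processed. */
void toy_credit_available(struct toy_reply *r)
{
	assert(r->queued);
	r->queued = false;

	if (!r->md_linked) {
		r->aborted = true;     /* nothing left to send from */
		return;
	}

	r->sent = true;
	if (r->unlink_deferred) {      /* now it is safe to unlink */
		r->md_linked = false;
		r->unlink_deferred = false;
	}
}
```

Calling `toy_reply_queue()`, `toy_commit(r, false)`, then `toy_credit_available()` ends with `aborted` set and `sent` clear, matching the failure above. With `toy_commit(r, true)` the same sequence ends with the reply sent and the MD unlinked afterwards.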

I can readily reproduce the issue on a four-node cluster with 1 MDS, 1 OSS and 2 clients:
1. Run mdtest create
2. Start LST in the background: a simultaneous read and write session with concurrency 64, where the MDS is in the "to" group and the OSS and both clients are in the "from" group
3. Run mdtest delete

LST causes credit starvation during the mdtest delete phase, so replies are much more likely to be queued in LNet as described above.



 Comments   
Comment by Gerrit Updater [ 06/Oct/21 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45138
Subject: LU-15068 ptlrpc: Do not unlink difficult reply until sent
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7060fe3cbbc35c44cf071da993cf1fdea60e7f1f

Comment by Gerrit Updater [ 30/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45138/
Subject: LU-15068 ptlrpc: Do not unlink difficult reply until sent
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5c156b48425aae245537aaf10229734166463347

Comment by Peter Jones [ 30/Nov/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 14/Dec/21 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45849
Subject: LU-15068 ptlrpc: Do not unlink difficult reply until sent
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 6234ec04be8e5523ffda71ce4a25dbbb63002a57
