[LU-14741] Close RPC might get stuck behind normal RPCs waiting for slot Created: 07/Jun/21  Updated: 19/Nov/22  Resolved: 19/Nov/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6, Lustre 2.12.7, Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-15915 /bin/rm: fts_read failed: Cannot send... Resolved
Related
is related to LU-15915 /bin/rm: fts_read failed: Cannot send... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It looks like obd_get_mod_rpc_slot places all RPCs waiting for a slot into a single exclusive waitq:

               wait_event_idle_exclusive(cli->cl_mod_rpcs_waitq,
                                          obd_mod_rpc_slot_avail(cli,
                                                                 close_req));
        } while (true);
}
EXPORT_SYMBOL(obd_get_mod_rpc_slot);

The problem is that CLOSE RPCs have a higher chance of being sent than normal RPCs. So if a CLOSE RPC completes and frees a slot, only the one waiter at the head of the waitq is woken up. If that happens to be a non-close RPC, it goes back to sleep, and nothing wakes up a close RPC waiting somewhere further down the list.

Normally this is not much of a visible problem, because the hope is that eventually a normal RPC or a few will complete and the close RPC will get its turn. But sometimes the entire available queue is plugged with requests waiting on, say, an open lock that needs the close to finish first, and if the close is stuck down the list - we have a deadlock. This seems to be especially common with NFS servers, but could also manifest in master now that opencache is on by default.

We should either have separate waitqs for close/non-close RPCs or do wake_up_all() for completed CLOSE RPCs.



 Comments   
Comment by Gerrit Updater [ 07/Jun/21 ]

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43941
Subject: LU-14741 obdclass: Wake up entire queue of requests on close completion
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5f3a9a7f292c3c46ac6b249db7066d5826559c55

Comment by Gerrit Updater [ 30/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43941/
Subject: LU-14741 obdclass: Wake up entire queue of requests on close completion
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a4e1567d67559b797a5c24ee0bfbca4a52649c47

Comment by Gerrit Updater [ 14/Dec/21 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45850
Subject: LU-14741 obdclass: Wake up entire queue of requests on close completion
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 963c6e18113fc7e044b83ae0feedb68894dbe073

Comment by Peter Jones [ 19/Nov/22 ]

IIUC this fix was landed to master for 2.15.0

Generated at Sat Feb 10 03:12:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.