[LU-14741] Close RPC might get stuck behind normal RPCs waiting for slot Created: 07/Jun/21 Updated: 19/Nov/22 Resolved: 19/Nov/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.6, Lustre 2.12.7, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
It looks like obd_get_mod_rpc_slot places all RPCs waiting for a slot into a single exclusive waitq:

	/* tail of obd_get_mod_rpc_slot() */
	do {
		...
		wait_event_idle_exclusive(cli->cl_mod_rpcs_waitq,
					  obd_mod_rpc_slot_avail(cli,
								 close_req));
	} while (true);
}
EXPORT_SYMBOL(obd_get_mod_rpc_slot);
The problem is that CLOSE RPCs have a higher chance of being allowed to be sent (a close may proceed even when all generic slots are busy, as long as no other close is in flight). So if a CLOSE RPC completes and frees a slot, only the one waiter at the head of the waitq is woken, and if it happens to be a non-close RPC that still cannot get a slot, it goes back to sleep and nothing ever wakes the close RPC further down the list. Normally this is not too visible a problem, because the hope is that eventually a normal RPC or two will complete and the close RPC will get its turn. But sometimes the entire available queue is plugged with requests waiting on, say, an open lock that needs the close to finish first, and if the close is stuck down the list, we have a deadlock. This seems to be especially common with NFS servers, but could also manifest on master now that opencache is enabled by default. We should either have separate waitqs for close/non-close RPCs or do wake_up_all() for completed CLOSE RPCs. |
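|
By way of illustration, here is a minimal userspace sketch of the second option (a wake_up_all()-style wakeup on CLOSE completion), using a pthread condvar in place of the kernel waitqueue. All names here (slot_state, slot_avail, slot_get, slot_put) are hypothetical, not Lustre API; pthread_cond_signal() wakes one waiter like an exclusive waitq, pthread_cond_broadcast() wakes them all like wake_up_all():

	#include <pthread.h>
	#include <stdbool.h>

	/* Hypothetical analogue of cl_mod_rpcs_waitq bookkeeping */
	struct slot_state {
		pthread_mutex_t lock;
		pthread_cond_t waitq;
		int in_flight;        /* modifying RPCs in flight */
		int close_in_flight;  /* CLOSE RPCs among them */
		int max;              /* cl_max_mod_rpcs_in_flight analogue */
	};

	/* Mirrors the slot rule described above: a close may proceed even
	 * when all generic slots are taken, provided no other close is
	 * in flight */
	static bool slot_avail(struct slot_state *s, bool close_req)
	{
		return s->in_flight < s->max ||
		       (close_req && s->close_in_flight == 0);
	}

	static void slot_get(struct slot_state *s, bool close_req)
	{
		pthread_mutex_lock(&s->lock);
		while (!slot_avail(s, close_req))
			pthread_cond_wait(&s->waitq, &s->lock);
		s->in_flight++;
		if (close_req)
			s->close_in_flight++;
		pthread_mutex_unlock(&s->lock);
	}

	static void slot_put(struct slot_state *s, bool close_req)
	{
		pthread_mutex_lock(&s->lock);
		s->in_flight--;
		if (close_req) {
			s->close_in_flight--;
			/* A completed CLOSE wakes everyone, so a close
			 * waiter stuck behind normal RPCs re-checks its
			 * predicate; waking just one waiter that still
			 * cannot get a slot would strand the close (the
			 * deadlock described above) */
			pthread_cond_broadcast(&s->waitq);
		} else {
			pthread_cond_signal(&s->waitq);
		}
		pthread_mutex_unlock(&s->lock);
	}

With a single wakeup, the one woken normal RPC finds slot_avail() false and sleeps again with nobody left to wake the queued close. The broadcast costs a thundering herd only on close completion, which is arguably the cheaper trade-off compared to maintaining separate waitqs. |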
| Comments |
| Comment by Gerrit Updater [ 07/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43941 |
| Comment by Gerrit Updater [ 30/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43941/ |
| Comment by Gerrit Updater [ 14/Dec/21 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45850 |
| Comment by Peter Jones [ 19/Nov/22 ] |
|
IIUC this fix was landed to master for 2.15.0 |