Lustre / LU-14741

Close RPC might get stuck behind normal RPCs waiting for slot


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.15.0
    • Affects Versions: Lustre 2.12.6, Lustre 2.12.7, Lustre 2.15.0

    Description

      It looks like obd_get_mod_rpc_slot places all RPCs waiting for a slot into a single exclusive waitq:

                      wait_event_idle_exclusive(cli->cl_mod_rpcs_waitq,
                                                obd_mod_rpc_slot_avail(cli,
                                                                       close_req));
              } while (true);
      }
      EXPORT_SYMBOL(obd_get_mod_rpc_slot);

      The problem is that CLOSE RPCs have a higher chance of being sent: a slot can be available to a close RPC but not to a normal one. So if a CLOSE RPC completes and frees a slot, only the one waiter at the head of the waitq is woken up, and if it happens to be a non-close RPC, it will find no slot it can use, go back to sleep, and nothing will wake up the close RPC further down the list.

      Normally this is not too visible a problem, because the hope is that eventually a normal RPC or a few will complete and the close RPC will get its turn. But sometimes the entire available queue is plugged with requests waiting on, say, an open lock that needs the close to finish first, and if the close is stuck down the list we have a deadlock. This seems to be especially common with NFS servers, but it could also manifest on master now that we have opencache enabled by default.

      We should either have separate waitqs for close and non-close RPCs, or do wake_up_all() when a CLOSE RPC completes.

      Attachments

        Issue Links

          Activity

            People

              Assignee: Oleg Drokin (green)
              Reporter: Oleg Drokin (green)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: