Was it considered that the alternative is the server is stuck waiting for a lock from unavailable client for a long time?
What's the plan there?
Before the patch: Send AST, once there is timeout, retry a few more times, then evict client.
Now: Try to send AST, it gets blocked... wait...? (for how long) and then what?
What if the client never reappears, are we waiting for it to be evicted by the ping evictor or something?
What about all the other clients that could get blocked on this lock meanwhile?
I believe that all the above will remain as it is now.
The only situation the patch is to address arises when a server handles reconnect from a client and when there is blocking ast for that client to be resend.
The server enters target_handle_connect() and eventually rev_import_reconnect(), where the reverse import is set to LUSTRE_IMP_RECOVER, and then ptlrpc_import_recovery_state_machine() is called to do resends.
ptlrpc_resend() sets rq_resend flag for each request on the sending list and wakeup corresponding ptlrpc_check_set-s. Then the import state is set to LUSTRE_IMP_FULL.
if (imp->imp_state == LUSTRE_IMP_RECOVER) {
struct ptlrpc_connection *conn = imp->imp_connection;
rc = ptlrpc_resend(imp);
if (rc)
GOTO(out, rc);
IMPORT_SET_STATE(imp, LUSTRE_IMP_FULL);
There is a import’s lock race window between ptlrpc_resend() and IMPORT_SET_STATE(imp, LUSTRE_IMP_FULL).
If awaken ptlrpc_set_check()->ptlrpc_import_delay_req() locks the import before import’s state gets set to LUSTRE_IMP_FULL, it sees req->rq_send_state != imp->imp_state and sets *status to -EWOULDBLOCK(-EAGAIN).
} else if (req->rq_send_state != imp->imp_state) {
if (atomic_read(&imp->imp_inval_count) != 0) {
DEBUG_REQ(D_ERROR, req, "invalidate in flight");
*status = -EIO;
} else if (imp->imp_dlm_fake || req->rq_no_delay) {
*status = -EWOULDBLOCK;
After that the request goes to interpret state and ldlm_handle_ast_error()->ldlm_failed_ast() evicts the client as req->rq_status is -EAGAIN.
Also, the import on the server is not a real import, so how could it be non-full?
The server is handling reconnect. The state of the reverse import is changed in rev_import_reconnect().
this probably needs an interop check too? I.e. the change is on the server, but we always run with client scriptss, so we need to ensure the server has this fix before running this test?
I do not know. The test is attached just to illustrate the problem. If you would like older servers testing to not fail you might want to just add the test to exception list.
Landed for 2.10