Hmm, I think this is what happened
During replay
1. The client sends a replay request to MDS01, and MDS01 does not handle it in time.
2. The client gets no reply, so the request expires and the client reconnects the import. Before sending the CONNECT request, it first looks for the min XID, but it only searches the sending and replay lists. That is wrong, because the replay request that just timed out is not in either list (see ptlrpc_check_set()). So it usually ends up with the XID of the CONNECT request itself, which is obviously larger than the XID of the replay request.
Delete req from sending list after req failure.
3. Meanwhile the MDT handles the replay request from step 1, which adds the lrd of this request to the reply_list.
4. Then the CONNECT request arrives carrying the bigger XID, which deletes the lrd created in step 3.
5. After the connect, the client resends the replay request, but the server cannot identify it as an already received request, because the lrd was deleted in step 4. So the replay request is executed twice, which causes this Version mismatch issue.
So the fix might be in step 2: either we consider the expired request when finding the min XID, or we simply do not pack the min XID for the CONNECT request.
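
To make the sequence easier to check, here is a toy Python model of steps 1 to 5 and of the two possible fixes. It is only a sketch, not Lustre code: the `Server` class, `min_known_xid()`, `run()`, the policy names and the XID values 100 and 200 are all made up for illustration, and I am assuming the server drops every saved lrd whose XID is below the min XID packed in CONNECT.

```python
class Server:
    """Toy stand-in for the MDT side; not the real Lustre structures."""

    def __init__(self):
        self.reply_list = {}  # xid -> saved reply data (stand-in for lrd)
        self.executions = {}  # xid -> how many times the request was executed

    def handle_replay(self, xid):
        # A request whose lrd is still saved is recognised as a resend.
        if xid in self.reply_list:
            return "reconstructed"
        self.executions[xid] = self.executions.get(xid, 0) + 1
        self.reply_list[xid] = "lrd"
        return "executed"

    def handle_connect(self, min_xid):
        # Assumption: lrds with an XID below the packed min XID are dropped.
        # min_xid=None models a CONNECT that does not pack a min XID at all.
        if min_xid is not None:
            self.reply_list = {
                x: v for x, v in self.reply_list.items() if x >= min_xid
            }


def min_known_xid(sending, replay, expired, connect_xid, include_expired):
    """Pick the min XID the client would pack into CONNECT."""
    candidates = list(sending) + list(replay)
    if include_expired:
        candidates += list(expired)
    # With nothing in the lists, the client falls back to the CONNECT XID.
    return min(candidates, default=connect_xid)


def run(policy):
    """Return how many times the replay request is executed on the server."""
    server = Server()
    replay_xid, connect_xid = 100, 200  # arbitrary illustrative values
    expired = [replay_xid]  # step 2: timed out, so in neither list

    if policy == "no_min_xid":
        min_xid = None
    else:
        min_xid = min_known_xid(
            [], [], expired, connect_xid,
            include_expired=(policy == "include_expired"),
        )

    server.handle_replay(replay_xid)  # step 3: late handling of the replay
    server.handle_connect(min_xid)    # step 4: CONNECT arrives
    server.handle_replay(replay_xid)  # step 5: client resends the replay
    return server.executions[replay_xid]


if __name__ == "__main__":
    for policy in ("current", "include_expired", "no_min_xid"):
        print(policy, run(policy))
```

In this model the "current" policy executes the replay twice, while both proposed fixes leave the lrd in place so the resend is recognised.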
Duplicate with
LU-5951