Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.4.0, Lustre 2.5.0, Lustre 2.6.0
3
13504
Description
Cray recently had an issue where an unusual network problem caused a large number of RPCs from clients to the MDS to be delivered twice. As a result, a very large number of RPCs were restored, which, with one particular job, eventually led to a bug that appears similar to LU-2827.
While investigating this, we did not notice the severe network problem at first, because the only way to see it was to turn on RPC tracing (which creates a huge message volume) and walk through the logs.
Restoring an RPC indicates that something has gone wrong, even if it is being handled correctly. I therefore suggest changing the following message in mdt_req_from_lcd from D_RPCTRACE to D_WARN, to make the sort of issue we saw more obvious.
DEBUG_REQ(D_RPCTRACE, req, "restoring transno "LPD64"/status %d", req->rq_transno, req->rq_status);
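To illustrate why the level matters, here is a minimal standalone sketch of mask-based filtering. It is not Lustre code: debug_msg(), log_restore(), debug_mask, and the bit values are hypothetical stand-ins, and it assumes warnings are in the active debug mask while RPC tracing is not.

```c
#include <stdarg.h>
#include <stdio.h>

/* Illustrative bit values only; the real masks are defined in the Lustre headers. */
#define D_WARN     0x1u
#define D_RPCTRACE 0x2u

/* Hypothetical stand-in for a node's debug mask: warnings on, RPC tracing off. */
static unsigned int debug_mask = D_WARN;

/* Print the message only if its level is enabled in debug_mask.
 * Returns 1 if the message was emitted, 0 if it was filtered out. */
static int debug_msg(unsigned int level, const char *fmt, ...)
{
    va_list ap;

    if (!(debug_mask & level))
        return 0;

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    return 1;
}

/* Hypothetical helper mirroring the text of the message quoted above. */
static int log_restore(unsigned int level, long long transno, int status)
{
    return debug_msg(level, "restoring transno %lld/status %d\n",
                     transno, status);
}
```

Under these assumptions, log_restore(D_RPCTRACE, ...) is silently dropped until RPC tracing is added to the mask, while log_restore(D_WARN, ...) is emitted without any extra tracing enabled, which is the visibility the proposed change is after.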
Patch will be available in Gerrit shortly.