[LU-4882] Convert MDS restoring RPC message from D_RPCTRACE to D_WARNING Created: 11/Apr/14  Updated: 31/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0, Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Patrick Farrell (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 13504

 Description   

Cray recently had an issue where an unusual network problem was causing a large number of RPCs from clients to the MDS to be delivered twice. This was causing a very large number of RPCs to be restored, which, with a particular job, eventually led to a bug that appears similar to LU-2827.

While investigating this, we did not notice the severe network problem at first, because seeing it required turning on RPC tracing (which creates a huge message volume) and walking through the logs.

As restoring an RPC indicates something has gone wrong, even if it's being handled correctly, I'm suggesting changing this message in mdt_req_from_lcd from D_RPCTRACE to D_WARNING to make the sort of issue we saw more obvious.

        DEBUG_REQ(D_RPCTRACE, req, "restoring transno "LPD64"/status %d",
                  req->rq_transno, req->rq_status);
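
For illustration, the effect of moving the message to D_WARNING can be sketched outside the Lustre tree as a small standalone model. This is not the patch itself: log_restore() and debug_mask are hypothetical stand-ins for DEBUG_REQ and the active debug mask, and the two bit values are assumptions used only to make the example compile. The sketch shows that a message gated on D_RPCTRACE is dropped when only warnings are enabled, while the same message gated on D_WARNING is emitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative mask bits; the values are assumptions for this sketch. */
#define D_WARNING  0x00000400u
#define D_RPCTRACE 0x00100000u

/* Hypothetical active debug mask: warnings enabled, RPC tracing disabled. */
static uint32_t debug_mask = D_WARNING;

/*
 * Hypothetical stand-in for DEBUG_REQ: formats the message into buf only
 * when the given level is enabled in debug_mask.
 * Returns 1 if the message was emitted, 0 if it was filtered out.
 */
static int log_restore(uint32_t level, char *buf, size_t len,
                       int64_t transno, int status)
{
        assert(buf != NULL && len > 0);
        buf[0] = '\0';

        if ((debug_mask & level) == 0)
                return 0;

        snprintf(buf, len, "restoring transno %lld/status %d",
                 (long long)transno, status);
        return 1;
}
```

With this model, calling log_restore(D_RPCTRACE, ...) leaves the buffer empty, while log_restore(D_WARNING, ...) fills it, which matches the situation described above where the restored-RPC messages were only visible once RPC tracing was turned on.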

Patch will be available in Gerrit shortly.



 Comments   
Comment by Patrick Farrell (Inactive) [ 11/Apr/14 ]

http://review.whamcloud.com/9932

Comment by Andreas Dilger [ 01/Dec/14 ]

Patrick, I just came across the patch from this bug while looking at other patches. Did you ever get to run your patch in production over some period of time to verify (e.g. grep syslogs at some sites over the past few months) that it doesn't spam the console?

Comment by Patrick Farrell (Inactive) [ 01/Dec/14 ]

Andreas - Thanks for reminding me. This has been running on a system where we do regular failover testing since, roughly, early May of this year. Grepping the relevant logs shows zero instances of this message.

It's also running in production at some large sites, but I can't easily confirm they aren't getting this message. It is at least not so common that anyone has brought it up.

I'll rebase the patch.
