Details
Improvement
Resolution: Unresolved
Minor
Description
When analyzing a problem specific to one client or application on a large cluster that is in active use by many applications, we need to capture debug logs on multiple servers, each of which may be actively processing thousands of RPCs per second that are (likely) unrelated to the problem at hand.
It might be possible to set debug_jobid and debug_jobid_mask parameters on the client(s) and server(s), and then run the job with that specific JobID on the clients. The servers would then apply the elevated debug_jobid_mask only while processing RPCs from that job.
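As a rough illustration of the server-side selection, here is a minimal sketch. The variables debug_jobid, debug_jobid_mask and global_debug_mask are plain C stand-ins for the proposed tunables and the node-wide mask, and rpc_debug_mask() is a hypothetical helper, not an existing Lustre function. It also assumes the elevated mask is OR-ed on top of the node-wide mask rather than replacing it, which the proposal leaves open.

```c
#include <string.h>

/* Hypothetical stand-ins for the proposed tunables (not existing Lustre parameters). */
static char debug_jobid[64] = "job.1234";            /* JobID to trace */
static unsigned int debug_jobid_mask = 0xffffffffu;  /* elevated mask for that job */
static unsigned int global_debug_mask = 0x00000003u; /* node-wide default mask */

/*
 * Return the debug mask to use while handling an RPC tagged with rpc_jobid.
 * Only RPCs whose JobID matches debug_jobid get the elevated bits; all other
 * RPCs keep the node-wide mask.
 */
static unsigned int rpc_debug_mask(const char *rpc_jobid)
{
	if (debug_jobid[0] != '\0' && rpc_jobid != NULL &&
	    strcmp(rpc_jobid, debug_jobid) == 0)
		return global_debug_mask | debug_jobid_mask;

	return global_debug_mask;
}
```

A service thread would call something like rpc_debug_mask() once when it picks up a request, and drop back to the node-wide mask when the request completes.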
There are some potential implementation issues with this, namely that the existing libcfs_debug mask is global to the node, so some changes would be needed, e.g. to CDEBUG(), to allow the mask to be checked on a per-thread basis (hopefully without changing the arguments to this widely-used macro).
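One possible shape for a per-thread check that leaves the macro's call signature alone is sketched below in userspace C. Everything here is illustrative: MY_CDEBUG(), node_debug_mask, thread_debug_mask, the D_* values and the my_debug_emitted counter are hypothetical names, and C11 _Thread_local stands in for whatever per-thread state the kernel code would actually use (for example a field in a per-thread structure).

```c
#include <stdio.h>

/* Hypothetical debug bits, for illustration only. */
#define D_INFO 0x1u
#define D_RPC  0x2u

/* Stand-in for the node-wide libcfs_debug mask. */
static unsigned int node_debug_mask = D_INFO;

/* Extra bits enabled only for the current thread, e.g. while it
 * handles an RPC from the traced job. Zero means "no override". */
static _Thread_local unsigned int thread_debug_mask;

/* Counts emitted messages so the behaviour is easy to observe. */
static unsigned int my_debug_emitted;

/* A message is logged if either the node-wide or the per-thread mask enables it. */
static inline int my_debug_enabled(unsigned int mask)
{
	return ((node_debug_mask | thread_debug_mask) & mask) != 0;
}

/* Same call shape as a (mask, format, args...) debug macro;
 * only the enable check changes. */
#define MY_CDEBUG(mask, ...)                          \
	do {                                          \
		if (my_debug_enabled(mask)) {         \
			my_debug_emitted++;           \
			fprintf(stderr, __VA_ARGS__); \
		}                                     \
	} while (0)
```

Because the per-thread mask is consulted inside the enable check, existing call sites of the form MY_CDEBUG(D_RPC, "fmt", args) would not need to change. The cost is one extra per-thread load on every debug statement, which would need to be measured on a hot path.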
Also, this approach risks missing important information from other threads that may be running at the same time (e.g. taking conflicting locks), so it is open for discussion whether it will actually be useful in practice.