[LU-16766] Combine some kernel process names for jobid Created: 24/Apr/23 Updated: 07/Feb/24 Resolved: 31/Aug/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Thomas Bertschinger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | lug23dd |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Reduce the long kernel thread names like "kworker/CPU:ID" to just "kworker", and "ll_sa_PID" to "ll_sa", since the full kernel thread ID is less useful than aggregating these threads under a single process name in the stats. There may be other similar kernel thread names that should be abbreviated. Also, statahead and similar Lustre threads that generate RPCs on behalf of user processes should be properly accounted to the user/client. |
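The abbreviation requested in the description can be illustrated with a small userspace sketch. This is not the code from the landed patch: the `comm_abbrev` table and the `abbreviate_comm()` helper are hypothetical names invented for illustration, and the only prefixes listed are the two named in the description.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mapping from a thread-name prefix to its aggregated name. */
struct comm_abbrev {
	const char *prefix;	/* prefix identifying the thread family */
	const char *shortname;	/* name to report in the stats instead */
};

/* Only the two families named in the ticket; others are an open question. */
static const struct comm_abbrev abbrevs[] = {
	{ "kworker/", "kworker" },	/* "kworker/CPU:ID" -> "kworker" */
	{ "ll_sa_",   "ll_sa"   },	/* "ll_sa_PID"      -> "ll_sa"   */
};

/*
 * Copy the process name into out, replacing it with the aggregated name
 * when it starts with a known prefix. Names that match no prefix are
 * copied unchanged. The result is always NUL-terminated when outlen > 0.
 */
static void abbreviate_comm(const char *comm, char *out, size_t outlen)
{
	const char *name = comm;
	size_t i;

	for (i = 0; i < sizeof(abbrevs) / sizeof(abbrevs[0]); i++) {
		size_t plen = strlen(abbrevs[i].prefix);

		if (strncmp(comm, abbrevs[i].prefix, plen) == 0) {
			name = abbrevs[i].shortname;
			break;
		}
	}
	snprintf(out, outlen, "%s", name);
}
```

With this sketch, all kworker threads on a client would collapse into one "kworker" entry in the stats rather than one entry per CPU and worker ID, which is the aggregation the description asks for.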
| Comments |
| Comment by Gerrit Updater [ 10/Aug/23 ] |
|
"Thomas Bertschinger <bertschinger@lanl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51919 |
| Comment by Thomas Bertschinger [ 10/Aug/23 ] |
|
I've uploaded a patch in progress for this but wanted to ask some design questions that may be broader than the patch I've submitted for this issue. First, the description here has:
It looks like LU-16781 is for this issue so I think this can be handled with a patch on that ticket. LU-16781 also says:
I have some questions about the reasoning behind jobid_name_is_valid(), which could be relevant to this patch. It looks like that check is only called if obd_jobid_var refers to an actual environment var and not one of the special values (nodelocal, session, procname_uid). Why is this the only case where the exclusion matters? To me it seems that the decision of whether a kernel thread should be included in jobstats is independent of whether the jobid setting is an env var, per-session var, or anything else. But I may not have the full picture here. Can you clarify what the intent of the exclusion is? If the purpose of the exclusion is just to avoid checking the process's environment in the event the process is a kernel thread, |
| Comment by Andreas Dilger [ 10/Aug/23 ] |
|
The purpose of jobid_name_is_valid() is to avoid using the procname/environment from those threads when generating the jobid, and instead get this information from the inode that is being processed. Also, there are some "housekeeping" RPCs like pings that are excluded since they might otherwise flood the server logs. I suspect there are still a couple of bugs in how the jobid name is generated, and we should be using the application process jobid that was stored in the file inode for the kernel threads to use, but somehow this is not happening correctly in all cases. I think PF_KTHREAD is overly broad to deny generating any jobid for an RPC, since ptlrpcd is a kernel thread and it is generating many client RPCs. However, it may be that PF_KTHREAD is a good indicator that we shouldn't be generating the jobid from the current thread, but rather from the file of interest in the RPC... This is unfortunately a bit vague, since it has been some time since I was debugging this code. It might be possible to run with "+rpctrace" debugging enabled on the client and see what RPCs are being generated by kworker and what they can use to generate a better jobid for the RPC. |
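The idea in the comment above, preferring the jobid stored with the file over one derived from a kernel thread, can be sketched as a simple selection rule. This is only an illustration under stated assumptions: `pick_jobid()` and its parameters are hypothetical, the `is_kthread` flag stands in for a PF_KTHREAD check on the current task, and the fallback to the thread's own name when no inode jobid exists is a guess rather than confirmed Lustre behaviour.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical selection rule for the jobid attached to an RPC.
 *
 * is_kthread:   stand-in for "the current task has PF_KTHREAD set"
 * thread_jobid: jobid derived from the current thread (procname, env, ...)
 * inode_jobid:  jobid previously stored with the file by the application,
 *               or NULL/empty if none was recorded
 */
static const char *pick_jobid(bool is_kthread, const char *thread_jobid,
			      const char *inode_jobid)
{
	/* Kernel threads act on behalf of a user process: prefer its jobid. */
	if (is_kthread && inode_jobid != NULL && inode_jobid[0] != '\0')
		return inode_jobid;

	/* User processes, or kernel threads with nothing better available. */
	return thread_jobid;
}
```

Under this rule a kernel thread such as ptlrpcd still produces a jobid when no inode jobid is available, which matches the concern that PF_KTHREAD is too broad a reason to deny a jobid altogether.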
| Comment by Gerrit Updater [ 31/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51919/ |
| Comment by Peter Jones [ 31/Aug/23 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 05/Feb/24 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53904 |