I can understand that if hundreds of nodes are generating unlabelled RPCs, then using procname_uid could produce a lot of "rsync.1234", "rsync.2345", "ls.5678", "cp.9876", etc. results if there are many active users, but otherwise this still provides useful information about which commands are generating a lot of IO traffic. The reason "procname.uid" was chosen as the fallback if JOBENV can't be found is that the same user running on different nodes without an actual JobID is likely to still generate the same jobid string, unlike embedding the PID or another unique identifier (which would be useless after the process exits anyway).
One option would be to allow userspace to specify a fallback jobid if obd_jobid_var is not found. This could use a more expressive syntax for the primary/fallback than just the "disabled", "procname_uid", and "nodelocal" values that can be specified today. For example, "%proc.%uid" would be interpreted as "process name" '.' "user id", while also allowing just "%proc", just "%uid", and maybe "%gid", "%nid", "%pid", and other fields as desired (filtering out any unknown '%' and other escape characters). This could instead use a subset of the escapes used for core filenames in format_corename(), to minimize the effort for sysadmins (e.g. %e=executable, %p=PID (and friends?), %u=UID, %g=GID, %h=hostname, %n=NID). It isn't clear to me yet if PID is useful for JobID, but it isn't hard to implement and maybe there is a case for it.
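To make the escape handling concrete, here is a rough userspace sketch in Python of how such a format string might be expanded. This is only an illustration of the idea, not kernel code: the function name expand_jobid(), the ESCAPES table, and the field names are all made up for the example, and the only behaviour assumed is what is described above (known escapes are substituted, unknown '%' escapes are filtered out, everything else is copied literally).

```python
import re

# Hypothetical escape table, modelled on the core-filename subset suggested
# above. The right-hand names are just keys into the caller's field dict.
ESCAPES = {
    "e": "executable",
    "p": "pid",
    "u": "uid",
    "g": "gid",
    "h": "hostname",
    "n": "nid",
}


def expand_jobid(fmt, fields):
    """Expand '%x' escapes in fmt using values from the fields dict.

    Unknown escapes (and a trailing lone '%') are dropped; any other
    text is copied through unchanged.
    """
    def substitute(match):
        key = ESCAPES.get(match.group(1))
        if key is None:
            return ""  # filter out unknown '%' escapes
        return str(fields[key])

    # '%' followed by at most one character; a trailing '%' matches alone.
    return re.sub(r"%(.?)", substitute, fmt, flags=re.S)


if __name__ == "__main__":
    fields = {"executable": "rsync", "pid": 4321, "uid": 1234,
              "gid": 100, "hostname": "node17", "nid": "10.0.0.17@tcp"}
    print(expand_jobid("%e.%u", fields))
    print(expand_jobid("batch-%h", fields))
```

With these example fields, "%e.%u" gives the same string as today's procname_uid setting.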
Unknown strings would just be copied literally, so you could set:
or to get Jinshan's desired behaviour just set:
This implies that if "JOBENV" is not found then "jobid_name" would be used as a fallback (which doesn't happen today), and would be interpreted as needed.
Using "jobid_var=nodelocal" would keep "jobid_name" as a literal string as it is today, while allowing the kernel to generate useful jobids directly, similar to core dump filenames. My preference would be to keep "jobid_name=%e.%u" as the default if jobstats is enabled, since this is what we currently have, and is at least providing some reasonable information to users that didn't set anything in advance.
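The primary/fallback behaviour I have in mind could be sketched roughly as below. Again this is a Python illustration under assumptions, not the actual implementation: resolve_jobid() and expand() are hypothetical names, expand() only knows %e and %u to keep the example short, and SLURM_JOB_ID is used purely as an example value for jobid_var.

```python
import re


def expand(fmt, fields):
    """Expand %e and %u only; drop any other '%' escape (illustrative subset)."""
    table = {"e": "executable", "u": "uid"}
    return re.sub(
        r"%(.?)",
        lambda m: str(fields[table[m.group(1)]]) if m.group(1) in table else "",
        fmt,
    )


def resolve_jobid(jobid_var, jobid_name, environ, fields):
    """Pick a jobid string following the proposed primary/fallback rules."""
    if jobid_var == "disabled":
        return ""
    if jobid_var == "procname_uid":
        # Equivalent to a format of "%e.%u".
        return expand("%e.%u", fields)
    if jobid_var == "nodelocal":
        # A plain jobid_name passes through unchanged; escapes are expanded.
        return expand(jobid_name, fields)
    # Otherwise jobid_var names an environment variable (the JOBENV case).
    job = environ.get(jobid_var)
    if job:
        return job
    # Proposed new behaviour: fall back to jobid_name when JOBENV is missing.
    return expand(jobid_name, fields)


if __name__ == "__main__":
    fields = {"executable": "rsync", "uid": 1234}
    print(resolve_jobid("SLURM_JOB_ID", "%e.%u", {"SLURM_JOB_ID": "4242"}, fields))
    print(resolve_jobid("SLURM_JOB_ID", "%e.%u", {}, fields))
```

The last branch is the only real change from today: a process without JOBENV in its environment gets the expanded jobid_name instead of nothing.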
Jinshan, all of what you propose can be done in userspace. You can translate all procname.uid formatted JobIDs to "unknown", or you can leave them out of the database you use for mining. What you can't do is take stats from Lustre for "Unknown" and translate them into "rsync.12345" on 6 different nodes.
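As a rough illustration of the userspace side, something like the following Python sketch would collapse procname.uid style JobIDs into a single "unknown" bucket while aggregating. The names normalize() and aggregate() and the (jobid, ops) sample format are invented for the example, and the regex is only a heuristic: it treats anything ending in '.' plus digits as procname.uid, so a scheduler whose real job IDs look like that would need a different rule.

```python
import re
from collections import Counter

# Heuristic: "<something>.<digits>" is assumed to be a procname.uid JobID.
PROCNAME_UID = re.compile(r"^.+\.\d+$")


def normalize(jobid):
    """Map procname.uid style JobIDs to "unknown"; leave others alone."""
    return "unknown" if PROCNAME_UID.match(jobid) else jobid


def aggregate(samples):
    """Sum operation counts per normalized JobID.

    samples is an iterable of (jobid, ops) pairs, a made-up format
    standing in for whatever the stats collector actually emits.
    """
    totals = Counter()
    for jobid, ops in samples:
        totals[normalize(jobid)] += ops
    return dict(totals)


if __name__ == "__main__":
    print(aggregate([("rsync.1234", 10), ("ls.5678", 5), ("4242", 7)]))
```

Going the other direction is not possible, which is the point: once the stats are recorded under "unknown", the per-command detail is gone.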
My understanding from what I've seen from the management side of our Lustre products is that they are accumulating each job, and scoring it in a number of ways, along with keeping it in a database for deeper investigation. I'm not sure what the limits may be concerning what is kept in the DB and for how long, and at what timescales.
I do know that this is an area of active development, as the performance penalties incurred by JobID are not as harsh as they used to be due to the cache. So we've moved from a case where JobID is off by default to one where it can be on by default.