[LUDOC-381] Improve documentation for jobstats Created: 14/Jun/17  Updated: 21/Mar/23

Status: Open
Project: Lustre Documentation
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Ben Evans (Inactive) Assignee: Dzmitry Kosach
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Related
is related to LU-10698 Specify complex JobIDs for Lustre Resolved
is related to LU-9221 Create pid-based hash to enhance Jobs... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

Some new user interface options for jobstats have been added via LU-9221; some documentation should be created around them.



 Comments   
Comment by Ben Evans (Inactive) [ 14/Jun/17 ]

Purging the Cache
The cache can be purged of a specific job by writing the JobID to the jobid_name proc file. Any items in the cache that are more than 300 seconds old will also be purged at this time.

Lifecycle of a mapping
A new mapping is created when a lookup is performed and there is no mapping in the cache. At this time, the JobID is determined.
Each time the mapping is accessed, it is checked to see whether it needs to be refreshed (every 30 seconds). The timer is then reset to the current time. Each mapping has its own timer.
During a purge, a mapping is removed if its JobID matches the item to be purged, or if its timer is more than 300 seconds old.
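
The lifecycle above can be modelled with a small sketch. This is a toy model for documentation purposes, not the kernel implementation: the class name, the lookup callable and the injectable clock are all hypothetical, and resetting the timer on every access is one literal reading of the description.

```python
import time

REFRESH_INTERVAL = 30   # seconds, refresh period from the description
PURGE_AGE = 300         # seconds, age at which a mapping is purged


class JobIDCache:
    """Toy model of the pid-to-JobID cache (hypothetical names throughout)."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup   # hypothetical callable: pid -> JobID string
        self._clock = clock     # injectable so the timers can be tested
        self._maps = {}         # pid -> [jobid, timer]

    def get(self, pid):
        now = self._clock()
        entry = self._maps.get(pid)
        if entry is None:
            # No mapping in the cache: determine the JobID and create one.
            entry = [self._lookup(pid), now]
            self._maps[pid] = entry
        else:
            if now - entry[1] >= REFRESH_INTERVAL:
                # Mapping is due for a refresh: determine the JobID again.
                entry[0] = self._lookup(pid)
            # Assumption: the timer is reset on every access.
            entry[1] = now
        return entry[0]

    def purge(self, jobid):
        """Drop mappings that match jobid or are older than PURGE_AGE."""
        now = self._clock()
        doomed = [pid for pid, (jid, timer) in self._maps.items()
                  if jid == jobid or now - timer > PURGE_AGE]
        for pid in doomed:
            del self._maps[pid]
        return len(doomed)
```

Each mapping carries its own timer, so purging one JobID also sweeps out any unrelated mapping that has not been touched for more than 300 seconds.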

Determining JobID
The JobID will be determined as follows:
1) The jobid_var proc variable is consulted. It can be “procname_uid”, or the name of a variable in the application’s environment, typically the environment variable containing the job name assigned by the scheduler.
2) If 1 is not available, the “procname_uid” scheme is used by default.
3) All Lustre threads are filtered out.
4) If none of the above is available, the JobID stored in the inode is used.
5) If there is no JobID stored in the inode, the JobID remains blank.
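
One possible reading of the order above, as a minimal sketch. The function name and parameters are hypothetical, the "procname.uid" output format is assumed, and treating Lustre threads as the case that falls through to the inode JobID is an interpretation of steps 3 and 4 rather than confirmed behaviour.

```python
def determine_jobid(jobid_var, env, procname, uid,
                    is_lustre_thread=False, inode_jobid=""):
    """Return a JobID following the five steps listed above (sketch only)."""
    if not is_lustre_thread:
        # Step 1: use the named environment variable when it is set.
        if jobid_var != "procname_uid" and env.get(jobid_var):
            return env[jobid_var]
        # Step 2: otherwise fall back to the procname_uid scheme.
        return "%s.%d" % (procname, uid)
    # Steps 3-5: Lustre threads are filtered out (assumed reading), so use
    # the JobID stored in the inode, or leave the JobID blank.
    return inode_jobid or ""
```

For example, `determine_jobid("SLURM_JOB_ID", {}, "dd", 500)` falls back to the process name and UID because the variable is missing from the environment.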

This is a change from the current method, which simply returns an empty JobID if nothing is available from the environment. The reason for doing this is to identify processes (and users) that are running on a node without being scheduled, or that are taking up significant resources, and to provide read-ahead accounting properly.

Comment by Andreas Dilger [ 10/Apr/18 ]

Also, there should be some documentation added for patch https://review.whamcloud.com/31691 "LU-10698 obdclass: allow specifying complex jobids". The commit comment is pretty reasonable:

    Allow specifying a format string for the jobid_name variable to create
    a jobid for processes on the client.  The jobid_name is used when
    jobid_var=nodelocal, if jobid_name contains "%j", or as a fallback if
    getting the specified jobid_var from the environment fails.
    
    The jobid_name string allows the following escape sequences:
    
        %e = executable name
        %g = group ID
        %h = hostname (system utsname)
        %j = jobid from jobid_var environment variable
        %p = process ID
        %u = user ID
    
    Any unknown escape sequences are dropped. Other arbitrary characters
    pass through unmodified, up to the maximum jobid string size of 32,
    though whitespace within the jobid is not copied.
    
    This allows, for example, specifying an arbitrary prefix, such as the
    cluster name, in addition to the traditional "procname.uid" format,
    to distinguish between jobs running on clients in different clusters:
    
        lctl set_param jobid_var=nodelocal jobid_name=cluster2.%e.%u
    or
        lctl set_param jobid_var=SLURM_JOB_ID jobid_name=cluster2.%j.%e
    
    To use an environment-specified JobID, if available, but fall back to
    a static string for all processes that do not have a valid JobID:
    
        lctl set_param jobid_var=SLURM_JOB_ID jobid_name=unknown
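
The expansion rules in the commit comment can be illustrated with a short sketch. This is a userspace model under stated assumptions, not the obdclass code: the function name and keyword arguments are hypothetical, a trailing lone "%" is assumed to pass through, and whether the 32-character limit includes a string terminator is not stated here.

```python
JOBID_MAX = 32  # maximum jobid string size given in the commit comment


def expand_jobid_name(fmt, *, exe, gid, host, jobid, pid, uid):
    """Expand the jobid_name escape sequences listed above (sketch only)."""
    values = {
        "e": exe,        # executable name
        "g": str(gid),   # group ID
        "h": host,       # hostname
        "j": jobid,      # jobid from the jobid_var environment variable
        "p": str(pid),   # process ID
        "u": str(uid),   # user ID
    }
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            # Unknown escape sequences expand to nothing, i.e. are dropped.
            out.append(values.get(fmt[i + 1], ""))
            i += 2
            continue
        out.append(ch)  # other characters pass through unmodified
        i += 1
    # Whitespace is not copied into the jobid; then apply the size limit.
    text = "".join(c for c in "".join(out) if not c.isspace())
    return text[:JOBID_MAX]
```

With `exe="dd"` and `uid=500`, the format `cluster2.%e.%u` from the first example above expands to `cluster2.dd.500`.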