[LUDOC-381] Improve documentation for jobstats - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels:
None

Description

Some new user interface options for jobstats have been added via LU-9221. Create some documentation around them.

Attachments

Issue Links

is related to

LU-10698 Specify complex JobIDs for Lustre

Resolved

LU-9221 Create pid-based hash to enhance Jobstats performance

Resolved

Activity

Andreas Dilger added a comment - 10/Apr/18 5:23 PM

Also, there should be some documentation added for patch https://review.whamcloud.com/31691 "LU-10698 obdclass: allow specifying complex jobids". The commit comment is pretty reasonable:

Allow specifying a format string for the jobid_name variable to create
    a jobid for processes on the client.  The jobid_name is used when
    jobid_var=nodelocal, if jobid_name contains "%j", or as a fallback if
    getting the specified jobid_var from the environment fails.
    
    The jobid_name string allows the following escape sequences:
    
        %e = executable name
        %g = group ID
        %h = hostname (system utsname)
        %j = jobid from jobid_var environment variable
        %p = process ID
        %u = user ID
    
    Any unknown escape sequences are dropped. Other arbitrary characters
    pass through unmodified, up to the maximum jobid string size of 32,
    though whitespace within the jobid is not copied.
    
    This allows, for example, specifying an arbitrary prefix, such as the
    cluster name, in addition to the traditional "procname.uid" format,
    to distinguish between jobs running on clients in different clusters:
    
        lctl set_param jobid_var=nodelocal jobid_name=cluster2.%e.%u
    or
        lctl set_param jobid_var=SLURM_JOB_ID jobid_name=cluster2.%j.%e
    
    To use an environment-specified JobID, if available, but fall back to
    a static string for all processes that do not have a valid JobID:
    
        lctl set_param jobid_var=SLURM_JOB_ID jobid_name=unknown
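
The rules in the commit comment can be illustrated with a small Python sketch. This is not the Lustre implementation, only a model of the behaviour described above. The `Proc` class and both function names are hypothetical. Two points are assumptions rather than facts from the commit comment: that %j expands to an empty string when the variable is unset, and that the limit of 32 is applied as a plain character count. Other jobid_var settings such as procname_uid are not modelled.

```python
from dataclasses import dataclass

JOBID_MAX = 32  # maximum jobid string size quoted in the commit comment


@dataclass
class Proc:
    """Hypothetical stand-in for what the client knows about a process."""
    exe: str
    uid: int
    gid: int
    pid: int
    host: str
    env: dict


def expand_jobid_name(fmt, proc, jobid_var):
    """Expand the escape sequences of a jobid_name format string."""
    values = {
        "e": proc.exe,
        "g": str(proc.gid),
        "h": proc.host,
        # Assumption: %j expands to nothing when the variable is unset.
        "j": proc.env.get(jobid_var, ""),
        "p": str(proc.pid),
        "u": str(proc.uid),
    }
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            # Unknown escape sequences are dropped.
            out.append(values.get(fmt[i + 1], ""))
            i += 2
            continue
        if ch != "%":
            out.append(ch)
        i += 1
    # Whitespace within the jobid is not copied.
    jobid = "".join("".join(out).split())
    return jobid[:JOBID_MAX]


def resolve_jobid(jobid_var, jobid_name, proc):
    """Pick the JobID for a process under the rules sketched above."""
    # jobid_name is used for nodelocal, or whenever it contains %j.
    if jobid_var == "nodelocal" or "%j" in jobid_name:
        return expand_jobid_name(jobid_name, proc, jobid_var)
    job = proc.env.get(jobid_var)
    if job:
        return job[:JOBID_MAX]
    # Fallback when the environment lookup fails.
    return expand_jobid_name(jobid_name, proc, jobid_var)
```

With a process `dd` run by UID 500 under SLURM job 42, this model gives cluster2.dd.500 for the first lctl example, cluster2.42.dd for the second, and 42 (or unknown outside a job) for the third.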


Ben Evans (Inactive) added a comment - 14/Jun/17 7:31 PM

Purging the Cache
The cache can be purged of a specific job by writing the JobID to the jobid_name proc file. Any items in the cache that are more than 300 seconds old will also be purged at this time.

Lifecycle of a mapping
A new mapping is created when a lookup is performed and there is no map in the cache. At this time, the JobID is determined.
Each time the map is accessed, it is checked to see if it needs to be refreshed (every 30 seconds). The timer is then reset to the current time. Each map has its own timer.
A mapping is removed during a purge if its JobID matches the one being purged, or if its timer is more than 300 seconds old.
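
The lifecycle and purge rules above can be modelled in a few lines of Python. This is a sketch of one reading of the description, not the kernel code. The class name, the `determine` callback and the injectable `clock` are hypothetical, and keying the cache by PID is an assumption based on the title of LU-9221. The sketch resets the timer on every access, as the text states.

```python
import time

REFRESH_SECS = 30   # refresh interval from the lifecycle description
EXPIRE_SECS = 300   # age at which a purge drops a mapping


class JobidCache:
    """Toy model of the mapping cache, keyed by PID (an assumption)."""

    def __init__(self, determine, clock=time.monotonic):
        self._determine = determine  # hypothetical callback: pid -> JobID
        self._clock = clock
        self._maps = {}              # pid -> [jobid, timer]

    def lookup(self, pid):
        now = self._clock()
        entry = self._maps.get(pid)
        if entry is None:
            # No map in the cache: create one and determine the JobID.
            entry = [self._determine(pid), now]
            self._maps[pid] = entry
        else:
            # Refresh the JobID if the map's own timer is old enough,
            # then reset the timer to the current time.
            if now - entry[1] >= REFRESH_SECS:
                entry[0] = self._determine(pid)
            entry[1] = now
        return entry[0]

    def purge(self, jobid):
        """Drop maps for jobid, plus any map older than EXPIRE_SECS."""
        now = self._clock()
        stale = [pid for pid, (jid, timer) in self._maps.items()
                 if jid == jobid or now - timer > EXPIRE_SECS]
        for pid in stale:
            del self._maps[pid]
        return len(stale)
```

Passing a fake clock makes the 30 and 300 second thresholds easy to exercise without waiting.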

Determining JobID
The JobID will be determined as follows:
1) The jobid_var proc variable is consulted. It can be “procname_uid”, or the name of a variable in the application’s environment, typically the environment variable containing the job name assigned by the scheduler.
2) If 1 is not available, the JobID defaults to the “procname_uid” scheme.
3) All Lustre threads are filtered out.
4) If none are available, the JobID stored in the inode is used.
5) If there is no JobID stored in the inode, it will remain blank.

This is a change from the current method, which simply returns an empty JobID if nothing is available from the environment. The reason for doing this is to identify processes (and users) that are running on a node without being scheduled, or that are taking up significant resources, and to account for read-ahead properly.
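
The five steps for determining the JobID can be read as a fallback chain, sketched here in Python. This is one interpretation, not the actual implementation. The function and its parameters are hypothetical, and the sketch assumes that "filtered out" means Lustre threads never receive a procname_uid style JobID and go straight to the inode's JobID.

```python
def determine_jobid(jobid_var, env, procname, uid,
                    is_lustre_thread, inode_jobid):
    """Return a JobID following steps 1 to 5 (hypothetical helper)."""
    jobid = ""
    if not is_lustre_thread:          # step 3: skip Lustre threads
        if jobid_var != "procname_uid":
            # Step 1: look the named variable up in the environment.
            jobid = env.get(jobid_var, "")
        if not jobid:
            # Steps 1 and 2: the procname_uid scheme, chosen or as fallback.
            jobid = f"{procname}.{uid}"
    if not jobid:
        # Step 4: use the JobID stored in the inode, if any.
        # Step 5: otherwise the JobID remains blank.
        jobid = inode_jobid or ""
    return jobid
```

For example, `determine_jobid("SLURM_JOB_ID", {}, "dd", 500, False, None)` falls back to the procname_uid form for an unscheduled process instead of returning an empty JobID.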

