<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:47:07 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4935] Collect job stats by both procname_uid and scheduler job ID</title>
                <link>https://jira.whamcloud.com/browse/LU-4935</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I have no insight into the code, so there may be some reasons this is unworkable, but here&apos;s what I&apos;m thinking.&lt;/p&gt;

&lt;p&gt;1) If you set jobid_var to procname_uid, you capture every process that is using the file system. This is nice for debugging, but not that useful for job statistics, as processes can certainly have the same names/UIDs across jobs.&lt;/p&gt;

&lt;p&gt;2) If you set jobid_var to the scheduler of your choice, like SLURM_JOB_ID, you of course get those statistics. But if someone is for example sitting on a submit node and issuing commands, those aren&apos;t seen.&lt;/p&gt;

&lt;p&gt;Would it be possible to enable collection on both? If every request has the job_id, process name, and UID packed in, why not get it all?&lt;/p&gt;

&lt;p&gt;So if you had a job with a scheduler jobid of &quot;123&quot;, run from uid 555, and let&apos;s say it runs 2 processes, process1 and process2.&lt;/p&gt;

&lt;p&gt;Could you then have jobstats report job_ids that look like:&lt;/p&gt;

&lt;p&gt;123.process1.555&lt;br/&gt;
123.process2.555&lt;/p&gt;

&lt;p&gt;A process &apos;myscript&apos; not run via the scheduler by uid 561 could be&lt;/p&gt;

&lt;p&gt;0.myscript.561 &lt;/p&gt;

&lt;p&gt;Besides not losing the statistics for &quot;non-scheduler&quot; Lustre requests, you possibly gain a little more insight into your job if it&apos;s a multi-step type job.&lt;/p&gt;

&lt;p&gt;Finally, to take it to the extreme - consider that we run filesystems which may be accessed by different schedulers, say Slurm and SGE on different systems (yes, this happens!). Why not include every possible scheduler scheme? So you end up with something like:&lt;/p&gt;

&lt;p&gt;SLURM_JOB_ID.JOB_ID.LSB_JOBID.LOADL_STEP_ID.PBS_JOBID.ALPS_APP_ID.procname.uid&lt;/p&gt;

&lt;p&gt;So the example above would be:&lt;/p&gt;

&lt;p&gt;123.0.0.0.0.0.process1.555&lt;/p&gt;

&lt;p&gt;I would not be surprised if this is a bad idea. For one thing, it&apos;s overloading a variable into an array of data. It&apos;s also using a character that is valid in filenames, &quot;.&quot;, as a field separator.&lt;/p&gt;

&lt;p&gt;Scott&lt;/p&gt;
</description>
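The composite identifier format proposed in the description ("&lt;scheduler_id&gt;.&lt;procname&gt;.&lt;uid&gt;", with a scheduler_id of "0" for processes not started via the scheduler) could be split back into its components with a small helper. This is a sketch of the ticket's proposed convention, not an existing Lustre jobid format, and `split_jobid` is a hypothetical name:

```python
def split_jobid(jobid):
    """Split a composite jobid of the form '<scheduler_id>.<procname>.<uid>'.

    The scheduler id (first field) and uid (last field) contain no dots,
    but the process name in the middle may, so split from both ends.
    """
    scheduler_id, rest = jobid.split(".", 1)
    procname, uid = rest.rsplit(".", 1)
    return scheduler_id, procname, int(uid)
```

For the examples above, `split_jobid("123.process1.555")` yields `("123", "process1", 555)` and `split_jobid("0.myscript.561")` yields `("0", "myscript", 561)`; the ticket itself notes the ambiguity risk of using "." as the separator.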
                <environment></environment>
        <key id="24306">LU-4935</key>
            <summary>Collect job stats by both procname_uid and scheduler job ID</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="sknolin">Scott Nolin</reporter>
                        <labels>
                    </labels>
                <created>Mon, 21 Apr 2014 15:41:12 +0000</created>
                <updated>Sat, 22 Nov 2014 19:55:53 +0000</updated>
                            <resolved>Sat, 22 Nov 2014 00:29:32 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="82064" author="green" created="Mon, 21 Apr 2014 17:18:59 +0000"  >&lt;p&gt;I think it&apos;s mostly possible in a less automated way.&lt;br/&gt;
Just set job id on every client in a way that makes sense for it&lt;br/&gt;
So set login nodes to the procname_uid and batch nodes to your job scheduler type?&lt;/p&gt;</comment>
                            <comment id="82080" author="adilger" created="Mon, 21 Apr 2014 18:17:58 +0000"  >&lt;p&gt;Note that it is possible to specify different jobid values to be collected on the login nodes and the compute nodes, and for that matter to specify different jobid sources on different compute nodes. It is not currently possible to collect both of these statistics at one time, nor as your final proposal suggests to collect a myriad of different identifiers at one time. The jobid is sent with every RPC from the client, and has a limited space in which to do so. &lt;/p&gt;

&lt;p&gt;At design time we surveyed the job schedulers and picked a maximum jobid size that would satisfy their requirements. It doesn&apos;t make sense that a single node should be subject to different job schedulers at one time. Since it is possible to specify different jobid sources on different clients, and the resulting identifiers should still be unique and identifiable by their naming structure, this should meet most of the requirements here.&lt;/p&gt;

&lt;p&gt;Since the jobid source is itself just the name of an environment variable in which to find the identifier, it is possible for your runtime environment to generate a new environment variable that could contain any identifier of your choice, subject to the size limits in struct ptlrpc_body (32 bytes, I believe). &lt;/p&gt;</comment>
                            <comment id="82083" author="sknolin" created="Mon, 21 Apr 2014 18:52:52 +0000"  >&lt;p&gt;To be sure I understand this right, I could do something like this:&lt;/p&gt;

&lt;p&gt;lctl conf_param testfs.sys.jobid_var=SLURM_JOB_ID&lt;/p&gt;

&lt;p&gt;- to set the filesystem to collect stats keyed on SLURM_JOB_ID.&lt;/p&gt;

&lt;p&gt;I can see how, on an SGE node for example, you could set SLURM_JOB_ID to the SGE job ID for a job.&lt;/p&gt;

&lt;p&gt;How do you set SLURM_JOB_ID to procname.uid for every random process on a system, say on your login node? This is really the case that&apos;s more useful for us and mysterious to me.&lt;/p&gt;

&lt;p&gt;I apologize if I&apos;m being particularly dense here.&lt;/p&gt;

&lt;p&gt;Scott&lt;/p&gt;</comment>
                            <comment id="83268" author="adilger" created="Tue, 6 May 2014 07:09:20 +0000"  >&lt;p&gt;You can set &lt;tt&gt;lctl conf_param testfs.sys.jobid_var=SLURM_JOB_ID&lt;/tt&gt; to set the global default jobid source (this is only needed once), and then &lt;tt&gt;lctl set_param jobid_var=procname_uid&lt;/tt&gt; on the login nodes in a startup script.  If you have other nodes that are running SGE you can run &lt;tt&gt;lctl set_param jobid_var=JOB_ID&lt;/tt&gt;.  On the nodes with SLURM_JOB_ID the jobids tracked by Lustre will be of the form NNNNNNNN (32-bit integer), on the SGE nodes they would be of the form NNNNN (5 decimal digits), and on the login nodes it would be process.NNNN.  There would be a chance of conflict between SLURM and SGE, but unlikely.&lt;/p&gt;</comment>
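The settings described in the comment above, collected into a shell sketch (the filesystem name "testfs" follows the example in this thread; run the first command on the MGS, the others in the startup script of each node class):

```shell
# Persistent global default jobid source, set once on the MGS
# (on Lustre 2.5+ "lctl set_param -P jobid_var=SLURM_JOB_ID" also works):
lctl conf_param testfs.sys.jobid_var=SLURM_JOB_ID

# Local, non-persistent override on login nodes (tracks procname.uid):
lctl set_param jobid_var=procname_uid

# Local override on nodes running SGE (SGE exports the job id as JOB_ID):
lctl set_param jobid_var=JOB_ID
```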
                            <comment id="83288" author="sknolin" created="Tue, 6 May 2014 13:08:51 +0000"  >&lt;p&gt;Thank you Andreas, this is perfect.&lt;/p&gt;

&lt;p&gt;I assumed setting the jobid source parameter always set the global default; I didn&apos;t realize that subsequent settings to different values apply only to that client. So I assume that to reset the global you would set it to disable, then start over.&lt;/p&gt;

&lt;p&gt;I read the man page for &quot;lctl&quot;, and the &quot;conf_param&quot; section says &quot;Set  a  permanent configuration parameter for any device via the MGS.  This command must be run on the MGS node.&quot;; the manual seemed to suggest the same - so I took that to mean that all settings are done on the MGS and are global.&lt;/p&gt;

&lt;p&gt;Since no improvement is needed this can simply be closed unless there&apos;s some documentation needed.&lt;/p&gt;

&lt;p&gt;Thanks again,&lt;br/&gt;
Scott&lt;/p&gt;</comment>
                            <comment id="83332" author="adilger" created="Tue, 6 May 2014 17:59:45 +0000"  >&lt;p&gt;To clarify - the &quot;lctl conf_param&quot; (or in Lustre 2.5 and later &quot;lctl set_param -P&quot;) settings on the MGS are global and persistent, while &quot;lctl set_param&quot; settings are local to the node on which they are run and only last until the filesystem unmounts. As yet there is no way to specify persistent settings for only a subset of nodes, but there are many other mechanisms for achieving this. &lt;/p&gt;</comment>
                            <comment id="84233" author="adilger" created="Fri, 16 May 2014 08:22:26 +0000"  >&lt;p&gt;It would probably be good to get something into the manual related to this, in case it comes up again in the future.&lt;/p&gt;</comment>
                            <comment id="92603" author="sknolin" created="Wed, 27 Aug 2014 15:14:49 +0000"  >&lt;p&gt;Andreas,&lt;/p&gt;

&lt;p&gt;I just had a client restart (rebooted) and the client-only /proc/fs/lustre/jobid_var setting persisted on the client. I expected it to go away and use the MGS value.&lt;/p&gt;

&lt;p&gt;Scott&lt;/p&gt;</comment>
                            <comment id="99834" author="adilger" created="Sat, 22 Nov 2014 00:29:32 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LUDOC-260&quot; title=&quot;improve jobstats example&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LUDOC-260&quot;&gt;&lt;del&gt;LUDOC-260&lt;/del&gt;&lt;/a&gt; was opened to track the documentation issue.  Closing this bug.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="27545">LUDOC-260</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwkn3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13639</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>