<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:37:25 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10698] Specify complex JobIDs for Lustre</title>
                <link>https://jira.whamcloud.com/browse/LU-10698</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;One thing that Lustre has been missing for a long time is I/O profiling. Lustre does support I/O profiling per process and per client, but it doesn&apos;t support profiling I/O per job, which is the most common case in practice.&lt;/p&gt;

&lt;p&gt;It would be desirable to add I/O profiling to the client, llite in particular, by &lt;tt&gt;JOBID&lt;/tt&gt;. It would also be better to provide tools that accumulate those stats from multiple clients and plot them accordingly.&lt;/p&gt;</description>
                <environment></environment>
        <key id="50903">LU-10698</key>
            <summary>Specify complex JobIDs for Lustre</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="Jinshan">Jinshan Xiong</assignee>
                                    <reporter username="Jinshan">Jinshan Xiong</reporter>
                        <labels>
                    </labels>
                <created>Thu, 22 Feb 2018 05:08:52 +0000</created>
                <updated>Wed, 7 Feb 2024 19:27:34 +0000</updated>
                            <resolved>Fri, 31 Aug 2018 07:34:13 +0000</resolved>
                                                    <fixVersion>Lustre 2.12.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="221426" author="jinshan" created="Thu, 22 Feb 2018 05:47:10 +0000"  >&lt;p&gt;I know Darshan is able to profile I/O by intercepting glibc calls, but it would still be better if Lustre could support it natively.&lt;/p&gt;</comment>
                            <comment id="222164" author="gerrit" created="Fri, 2 Mar 2018 21:39:04 +0000"  >&lt;p&gt;Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/31500&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/31500&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10698&quot; title=&quot;Specify complex JobIDs for Lustre&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10698&quot;&gt;&lt;del&gt;LU-10698&lt;/del&gt;&lt;/a&gt; obdclass: cleanup jobid implementation&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 3ef0ceb1c27fb02ec8c93b53333365d3fb36cd27&lt;/p&gt;</comment>
                            <comment id="222641" author="adilger" created="Tue, 6 Mar 2018 21:45:11 +0000"  >&lt;p&gt;I can understand that if hundreds of nodes are generating unlabelled RPCs then using &lt;tt&gt;procname_uid&lt;/tt&gt; could result in a lot of &quot;rsync.1234&quot;, &quot;rsync.2345&quot;, &quot;ls.5678&quot;, &quot;cp.9876&quot;, etc. kind of results if there are many active users, but otherwise this still provides useful information about what commands are generating a lot of IO traffic. The reason &quot;&lt;tt&gt;procname.uid&lt;/tt&gt;&quot; was chosen as the fallback if &lt;tt&gt;JOBENV&lt;/tt&gt; can&apos;t be found is that the same user running on different nodes without an actual JobID is still likely to generate the same jobid string, unlike embedding the PID or another unique identifier (which would be useless after the process exits anyway).&lt;/p&gt;

&lt;p&gt;One option would be to allow userspace to specify a fallback jobid if obd_jobid_var is not found. This could be a more expressive syntax for the primary/fallback than just &quot;&lt;tt&gt;disabled&lt;/tt&gt;&quot;, &quot;&lt;tt&gt;procname_uid&lt;/tt&gt;&quot;, and &quot;&lt;tt&gt;nodelocal&lt;/tt&gt;&quot; that can be specified today. For example, interpreting &quot;&lt;tt&gt;%proc.%uid&lt;/tt&gt;&quot; as &quot;&lt;tt&gt;process name&lt;/tt&gt;&quot; &apos;&lt;tt&gt;.&lt;/tt&gt;&apos; &quot;&lt;tt&gt;user id&lt;/tt&gt;&quot;, but allowing just &quot;&lt;tt&gt;%proc&lt;/tt&gt;&quot;, just &quot;&lt;tt&gt;%uid&lt;/tt&gt;&quot;, but also maybe &quot;&lt;tt&gt;%gid&lt;/tt&gt;&quot;, &quot;&lt;tt&gt;%nid&lt;/tt&gt;&quot;, &quot;&lt;tt&gt;%pid&lt;/tt&gt;&quot;, and other fields as desired (filtering out any unknown &apos;%&apos; and other escape characters). This could instead use a subset of escapes from core filenames in &lt;tt&gt;format_corename()&lt;/tt&gt;, to minimize the effort for sysadmins (e.g. &lt;tt&gt;%e&lt;/tt&gt;=executable, &lt;tt&gt;%p&lt;/tt&gt;=PID (and friends?), &lt;tt&gt;%u&lt;/tt&gt;=UID, &lt;tt&gt;%g&lt;/tt&gt;=GID, &lt;tt&gt;%h&lt;/tt&gt;=hostname, &lt;tt&gt;%n&lt;/tt&gt;=NID). It isn&apos;t clear to me yet if PID is useful for JobID, but it isn&apos;t hard to implement and maybe there is a case for it.&lt;/p&gt;

&lt;p&gt;Unknown strings would just be copied literally, so you could set:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;    lctl set_param jobid_var=PBS_JOBID
    lctl set_param jobid_name=&apos;%e.%u:%g_%n&apos;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;or to get Jinshan&apos;s desired behaviour just set:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;    lctl set_param jobid_name=&apos;unknown&apos;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This implies that if &quot;&lt;tt&gt;JOBENV&lt;/tt&gt;&quot; is not found then &quot;&lt;tt&gt;jobid_name&lt;/tt&gt;&quot; would be used as a fallback (which doesn&apos;t happen today), and would be interpreted as needed.&lt;/p&gt;
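
&lt;p&gt;The lookup order implied here could be sketched roughly as follows. This is a Python illustration only, not Lustre code: &lt;tt&gt;resolve_jobid&lt;/tt&gt; and its &lt;tt&gt;expand&lt;/tt&gt; callback are hypothetical names, and the special &lt;tt&gt;jobid_var&lt;/tt&gt; values are left out for brevity.&lt;/p&gt;

```python
def resolve_jobid(environ, jobid_var, jobid_name, expand):
    """Return the job ID from the process environment, else the expanded fallback.

    All names here are hypothetical; `expand` stands in for whatever
    interprets the jobid_name string.
    """
    value = environ.get(jobid_var)
    if value:
        return value              # the scheduler exported e.g. PBS_JOBID
    return expand(jobid_name)     # proposed fallback, not current behaviour


# Stand-in expander that copies the string as-is.
literal = lambda name: name

print(resolve_jobid({"PBS_JOBID": "4567.pbs"}, "PBS_JOBID", "unknown", literal))
print(resolve_jobid({}, "PBS_JOBID", "unknown", literal))
```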

&lt;p&gt;Using &quot;&lt;tt&gt;jobid_var=nodelocal&lt;/tt&gt;&quot; would keep &quot;&lt;tt&gt;jobid_name&lt;/tt&gt;&quot; as a literal string as it is today, while allowing the kernel to generate useful jobids directly, similar to core dump filenames. My preference would be to keep &quot;&lt;tt&gt;jobid_name=%e.%u&lt;/tt&gt;&quot; as the default if jobstats is enabled, since this is what we currently have, and is at least providing some reasonable information to users that didn&apos;t set anything in advance.&lt;/p&gt;</comment>
                            <comment id="222642" author="bevans" created="Tue, 6 Mar 2018 21:58:45 +0000"  >&lt;p&gt;I&apos;m not a big fan of reinventing printf for just jobids.  We had a similar proposal within Cray for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9789&quot; title=&quot;Create JobID prefix&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9789&quot;&gt;&lt;del&gt;LU-9789&lt;/del&gt;&lt;/a&gt; and it never got implemented because of the complexity and because a simple &quot;good enough&quot; solution would work for pretty much everyone.&lt;/p&gt;</comment>
                            <comment id="222660" author="jinshan" created="Wed, 7 Mar 2018 01:55:17 +0000"  >&lt;p&gt;Hi Andreas,&lt;/p&gt;

&lt;p&gt;It seems to be a good suggestion but probably won&apos;t be accomplished in a short time.&lt;/p&gt;

&lt;p&gt;One thing I want to clarify is that providing too much information is not necessarily good. For example, collecting and saving the status of every disk drive in a cluster is not helpful, because the useful information gets completely drowned out. We should only care about situations where some components are not working properly, such as OSTs in degraded mode, and figuring out which drives are not working should be a separate procedure.&lt;/p&gt;

&lt;p&gt;So in this case, if some workloads are running without a proper jobid setting, I tend to think it&apos;s not good practice to fall back to &apos;procname.uid&apos; because:&lt;br/&gt;
1. it may be difficult to extract useful information from a very long list of stats;&lt;br/&gt;
2. if anonymous workloads are accumulated into a single entry, it&apos;s easier to know how many resources they have consumed. If it&apos;s little, it will probably just be ignored; otherwise a separate procedure will be performed to figure out which anonymous jobs consumed that many resources.&lt;/p&gt;

&lt;p&gt;I hope this will make some sense.&lt;/p&gt;</comment>
                            <comment id="222665" author="adilger" created="Wed, 7 Mar 2018 02:43:46 +0000"  >&lt;p&gt;Ben, at the same time, the proposed &quot;cluster ID&quot; functionality could be implemented in a similar manner rather than adding a special-case handler for the cluster. Something like &lt;tt&gt;jobid_name=&quot;clustername.%j&quot;&lt;/tt&gt; since the cluster name will be constant for the lifetime of the node and can just be set as a static string from the POV of the kernel. &lt;/p&gt;

&lt;p&gt;I don&apos;t think the implementation would be too complex, basically a scan for &apos;%&apos; in the string, then a switch statement that replaces the string with a known value (length limited to output buffer string). &lt;/p&gt;
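
&lt;p&gt;A minimal sketch of that scan-and-replace loop, written in Python rather than kernel C for brevity. The function name, the &lt;tt&gt;fields&lt;/tt&gt; mapping, and the &lt;tt&gt;max_len&lt;/tt&gt; default are assumptions for illustration, not the actual implementation.&lt;/p&gt;

```python
def expand_jobid(fmt, fields, max_len=32):
    """Expand %-escapes in fmt using single-letter codes from fields.

    Hypothetical helper: `fields` maps an escape letter (e.g. "e", "u")
    to its value, and `max_len` is an assumed output buffer limit.
    """
    out = []
    chars = iter(fmt)
    for ch in chars:
        if ch != "%":
            out.append(ch)            # ordinary characters are copied literally
            continue
        code = next(chars, None)      # character following '%', if any
        if code in fields:
            out.append(str(fields[code]))
        # unknown or dangling escapes are dropped
    return "".join(out)[:max_len]     # truncate to the output buffer size


fields = {"e": "rsync", "u": 1234, "g": 100, "h": "node01"}
print(expand_jobid("%e.%u", fields))      # rsync.1234
print(expand_jobid("unknown", fields))    # unknown
```

&lt;p&gt;Dropping unknown escapes instead of copying them through is one possible policy; a real implementation might choose differently.&lt;/p&gt;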

&lt;p&gt;Jinshan, as for dumping all unknown RPCs into a single bucket, that is OK if they don&apos;t take up much of the resources, but as you write, more work is needed if it &lt;em&gt;does&lt;/em&gt; take up a lot of the resources, so it would be useful to have a way to debug that. You&apos;re replacing the case that works well for Cray but not well for you with one that works for you but not Cray (and IMHO will work badly for you as soon as you want to debug what is causing a lot of &quot;unknown&quot; traffic). I think we can have a solution that works for both of you that doesn&apos;t add too much complexity. &lt;/p&gt;</comment>
                            <comment id="222670" author="jinshan" created="Wed, 7 Mar 2018 04:52:16 +0000"  >&lt;blockquote&gt;
&lt;p&gt;... so it would be useful to have a way to debug that. &lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;True, in that case the admin should clear &lt;tt&gt;job_stats&lt;/tt&gt;, set &lt;tt&gt;jobid_var&lt;/tt&gt; to &lt;tt&gt;procname_uid&lt;/tt&gt;, and then monitor the output of &lt;tt&gt;job_stats&lt;/tt&gt; in real time to figure out who the &apos;bad&apos; guy is.&lt;/p&gt;

&lt;p&gt;It boils down to whether &lt;tt&gt;job_stats&lt;/tt&gt; is mainly for monitoring or auditing. Cray&apos;s customers would like to use it for monitoring, but I think we should use it for auditing. Obviously we don&apos;t work for the same customer lol.&lt;/p&gt;</comment>
                            <comment id="222697" author="bevans" created="Wed, 7 Mar 2018 14:34:10 +0000"  >&lt;p&gt;Jinshan, if this is something you care about in a database, simply pre-process it on insertion to ignore procname.uid style entries.&lt;/p&gt;

&lt;p&gt;If, on the other hand, you want this information, it can&apos;t be generated from &quot;unknown&quot;.&lt;/p&gt;</comment>
                            <comment id="222709" author="jinshan" created="Wed, 7 Mar 2018 16:45:13 +0000"  >&lt;p&gt;Hi Evans,&lt;/p&gt;

&lt;p&gt;We can&apos;t just discard procname.uid records; we also need to accumulate them, because we want to know how much I/O comes from anonymous jobs so that we can decide whether to start an investigation.&lt;/p&gt;

&lt;p&gt;Can you please summarize how you and your customers use job_stats information? It sounds like that data will be kept in memory and never collected?&lt;/p&gt;</comment>
                            <comment id="222721" author="bevans" created="Wed, 7 Mar 2018 18:22:13 +0000"  >&lt;p&gt;Jinshan, all of what you propose can be done in userspace.  You can translate all procname.uid formatted JobIDs to &quot;unknown&quot;, or you can leave them out of the database you use for mining.  What you can&apos;t do is take stats from Lustre of &quot;Unknown&quot; and translate them into &quot;rsync.12345&quot; on 6 different nodes.&lt;/p&gt;
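
&lt;p&gt;As a rough illustration of the userspace side, the Python sketch below folds procname.uid style entries into a single &quot;unknown&quot; bucket while leaving real job IDs alone. The input shape (pairs of jobid and byte count) and the regular expression are assumptions; the pattern is only a heuristic and would need tuning for a given site&apos;s scheduler IDs.&lt;/p&gt;

```python
import re
from collections import defaultdict

# Heuristic (an assumption): a fallback jobid starts with a non-digit
# and ends with a dot followed by a numeric UID, e.g. "rsync.1234".
PROCNAME_UID = re.compile(r"^\D.*\.\d+$")


def fold_anonymous(stats):
    """Sum byte counts per job, folding procname.uid entries into 'unknown'."""
    totals = defaultdict(int)
    for jobid, nbytes in stats:
        key = "unknown" if PROCNAME_UID.match(jobid) else jobid
        totals[key] += nbytes
    return dict(totals)


sample = [
    ("rsync.1234", 100),
    ("rsync.2345", 50),
    ("4567.pbs-server", 900),
    ("ls.5678", 1),
]
print(fold_anonymous(sample))   # {'unknown': 151, '4567.pbs-server': 900}
```

&lt;p&gt;The reverse mapping is not possible: once entries are stored as &quot;unknown&quot;, the per-process names cannot be recovered.&lt;/p&gt;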

&lt;p&gt;My understanding from what I&apos;ve seen from the management side of our Lustre products is that they are accumulating each job, and scoring it in a number of ways, along with keeping it in a database for deeper investigation.  I&apos;m not sure what the limits may be concerning what is kept in the DB and for how long, and at what timescales.&lt;/p&gt;

&lt;p&gt;I do know that this is an area of active development, as the performance penalties incurred by JobID are not as harsh as they used to be due to the cache.  So we&apos;ve moved from a case where JobID is off by default to one where it can be on by default.&lt;/p&gt;</comment>
                            <comment id="224015" author="gerrit" created="Tue, 20 Mar 2018 10:53:24 +0000"  >&lt;p&gt;Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/31691&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/31691&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10698&quot; title=&quot;Specify complex JobIDs for Lustre&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10698&quot;&gt;&lt;del&gt;LU-10698&lt;/del&gt;&lt;/a&gt; obdclass: allow specifying complex jobids&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 5a7b3fd923cec7d4acc199b9f205b3ea8483c495&lt;/p&gt;</comment>
                            <comment id="225473" author="gerrit" created="Mon, 9 Apr 2018 19:46:12 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/31691/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/31691/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10698&quot; title=&quot;Specify complex JobIDs for Lustre&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10698&quot;&gt;&lt;del&gt;LU-10698&lt;/del&gt;&lt;/a&gt; obdclass: allow specifying complex jobids&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 6488c0ec57de2d188bd15e502917b762e3a9dd1d&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="47384">LU-9789</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="80702">LU-17512</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="46708">LUDOC-381</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="55724">LU-12330</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="32836">LUDOC-310</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzt5j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>