<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:28:37 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16625] improved Lustre thread debugging</title>
                <link>https://jira.whamcloud.com/browse/LU-16625</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I was thinking about how we might improve the debugging of Lustre threads that are busy (e.g. threads stuck in &lt;tt&gt;ldlm_cli_enqueue_local()&lt;/tt&gt; or &lt;tt&gt;ldlm_completion_ast()&lt;/tt&gt; or possibly on a mutex/spinlock).&lt;/p&gt;

&lt;p&gt;One thing that would help, especially for post-facto debugging where we only have watchdog stack traces dumped to dmesg/messages, would be to print the FID/resource of locks that the thread is holding and/or blocked on as part of the watchdog stack trace.  Since doing this in a generic way would be difficult, one option would be to create a thread-local data structure (similar to, or part of, &quot;&lt;tt&gt;env&lt;/tt&gt;&quot;) that contains &quot;well-known&quot; slots for e.g. parent/child FIDs, locked LDLM resources, and the next LDLM resource to lock.  Possibly these would be kept in ASCII format so that the whole chunk could just be printed as-is without much interpretation (maybe walking through NUL-terminated 32-char slots and only printing those that are used).&lt;/p&gt;

&lt;p&gt;Potentially this structure could be looked up by PID during the watchdog stack dump, but it would typically only be accessed by the local thread.&lt;/p&gt;

&lt;p&gt;The main benefit here would be that instead of just seeing the stack traces being dumped, we could also see which resources the thread is (or was) holding, which would make it much easier to analyze stuck-thread issues after the fact.&lt;/p&gt;</description>
                <environment></environment>
        <key id="74971">LU-16625</key>
            <summary>improved Lustre thread debugging</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                    </labels>
                <created>Wed, 8 Mar 2023 19:48:30 +0000</created>
                <updated>Tue, 23 Jan 2024 00:08:06 +0000</updated>
                                            <version>Lustre 2.16.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="365296" author="adilger" created="Wed, 8 Mar 2023 19:54:08 +0000"  >&lt;p&gt;I&apos;d welcome input on this idea, whether you think it is practical to implement, or if there is something better we could do?&lt;/p&gt;

&lt;p&gt;Having a full crash dump available can be very helpful if it is captured in a timely manner, or having a mechanism in sysfs (like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14858&quot; title=&quot;kernfs tree to dump/traverse ldlm locks&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14858&quot;&gt;LU-14858&lt;/a&gt;) to dump the lock state at the time of the problem is useful, but often the issue is only caught afterward, or the customer doesn&apos;t want to crashdump the machine and/or the size of the crashdump makes it impractical.&lt;/p&gt;</comment>
                            <comment id="365297" author="paf0186" created="Wed, 8 Mar 2023 20:14:39 +0000"  >&lt;p&gt;I&apos;m not sure about how it would be implemented - not quite obvious to me what you&apos;re getting at - but if we do this, we should definitely dump &lt;b&gt;all&lt;/b&gt; the locks on the resource we&apos;re trying to lock.&lt;/p&gt;</comment>
                            <comment id="400604" author="adilger" created="Mon, 22 Jan 2024 16:12:09 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=timday&quot; class=&quot;user-hover&quot; rel=&quot;timday&quot;&gt;timday&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=stancheff&quot; class=&quot;user-hover&quot; rel=&quot;stancheff&quot;&gt;stancheff&lt;/a&gt; I recall a discussion that mentioned it is possible to extend the &lt;tt&gt;dump_stack()&lt;/tt&gt; functionality to include more information, and that this was already being done in some device driver.  Unfortunately, I can&apos;t find that here or in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16375&quot; title=&quot;dump more information for threads blocked on local DLM locks&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16375&quot;&gt;LU-16375&lt;/a&gt;, which also discusses a similar issue.&lt;/p&gt;</comment>
                            <comment id="400609" author="JIRAUSER18433" created="Mon, 22 Jan 2024 16:15:49 +0000"  >&lt;p&gt;The discussion was on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-17242&quot; title=&quot;Clean up and Improve Lustre Debugging&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-17242&quot;&gt;LU-17242&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Seems useful. I think we could register a custom panic handler. I see upstream drivers (like &lt;tt&gt;drivers/net/ipa/ipa_smp2p.c&lt;/tt&gt;) doing something like that. We could avoid extending custom Lustre debugging and it should work on every panic. Adding &lt;tt&gt;current-&amp;gt;journal_info&lt;/tt&gt; to the handler would be easy. Getting the Lustre specific info might be tougher, but I saw some ideas upstream we could probably copy. The &lt;tt&gt;ipa&lt;/tt&gt; just embedded the &lt;tt&gt;notifier_block&lt;/tt&gt; in a larger struct and used &lt;tt&gt;container_of&lt;/tt&gt; to get everything else.&lt;/p&gt;&lt;/blockquote&gt;</comment>
                            <comment id="400681" author="adilger" created="Tue, 23 Jan 2024 00:06:30 +0000"  >&lt;p&gt;It looks like there is some infrastructure to handle this already:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
&lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; ipa_smp2p_panic_notifier_register(struct ipa_smp2p *smp2p)
{
        &lt;span class=&quot;code-comment&quot;&gt;/* IPA panic handler needs to run before modem shuts down */&lt;/span&gt;
        smp2p-&amp;gt;panic_notifier.notifier_call = ipa_smp2p_panic_notifier;
        smp2p-&amp;gt;panic_notifier.priority = INT_MAX;       &lt;span class=&quot;code-comment&quot;&gt;/* Do it early */&lt;/span&gt;

        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; atomic_notifier_chain_register(&amp;amp;panic_notifier_list,
                                              &amp;amp;smp2p-&amp;gt;panic_notifier);
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;but this looks like it is only for a panic, not necessarily a stack trace...&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="73538">LU-16375</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="78657">LU-17242</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i03fun:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>