<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:09:59 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14463] stack traces not being printed to console on RHEL8</title>
                <link>https://jira.whamcloud.com/browse/LU-14463</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I noticed in a recent test that, instead of a stack trace being printed to the console, all that is shown in the MDS console log is:&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_sets/92361cb8-6cb4-4045-a0df-c9efc31520ea&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/92361cb8-6cb4-4045-a0df-c9efc31520ea&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[  508.376802] Call Trace TBD:
[  508.378011] Pid: 31468, comm: mdt00_033 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Fri Feb 19 20:34:57 UTC 2021
[  508.379472] Call Trace TBD:
[  508.379899] Pid: 31454, comm: mdt00_019 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Fri Feb 19 20:34:57 UTC 2021
[  508.381521] Call Trace TBD:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;which is not very useful.  That message comes from patch &lt;a href=&quot;https://review.whamcloud.com/35239&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/35239&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12400&quot; title=&quot;Support for linux kernel version 5.2&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12400&quot;&gt;&lt;del&gt;LU-12400&lt;/del&gt;&lt;/a&gt; libcfs: save_stack_trace_tsk if ARCH_STACKWALK&lt;/tt&gt;&quot;.  That patch was ostensibly to fix an issue with 5.4 kernels, but the MDS was running RHEL8.3 (4.18.0-240.1.1.el8_lustre.x86_64).&lt;/p&gt;

&lt;p&gt;At a minimum, in &lt;tt&gt;libcfs_call_trace()&lt;/tt&gt; if &lt;tt&gt;tsk == current&lt;/tt&gt; this should fall back to doing &lt;em&gt;something&lt;/em&gt; useful:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
        spin_lock(&amp;amp;st_lock);
        pr_info(&lt;span class=&quot;code-quote&quot;&gt;&quot;Pid: %d, comm: %.20s %s %s\n&quot;&lt;/span&gt;, tsk-&amp;gt;pid, tsk-&amp;gt;comm,
                init_utsname()-&amp;gt;release, init_utsname()-&amp;gt;version);
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (task_dump_stack) {
                pr_info(&lt;span class=&quot;code-quote&quot;&gt;&quot;Call Trace:\n&quot;&lt;/span&gt;);
                nr_entries = task_dump_stack(tsk, entries, MAX_ST_ENTRIES, 0);
                &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; (i = 0; i &amp;lt; nr_entries; i++)
                        pr_info(&lt;span class=&quot;code-quote&quot;&gt;&quot;[&amp;lt;0&amp;gt;] %pB\n&quot;&lt;/span&gt;, (void *)entries[i]);
        } &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (tsk == current) {
                dump_stack();
        } &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; {
                pr_info(&lt;span class=&quot;code-quote&quot;&gt;&quot;can&apos;t show stack: kernel doesn&apos;t export save_stack_trace_tsk\n&quot;&lt;/span&gt;);
        }
        spin_unlock(&amp;amp;st_lock);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;so that the stack is printed in the common case.&lt;/p&gt;</description>
                <environment></environment>
        <key id="62989">LU-14463</key>
            <summary>stack traces not being printed to console on RHEL8</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                    </labels>
                <created>Mon, 22 Feb 2021 19:55:06 +0000</created>
                <updated>Thu, 5 May 2022 13:50:48 +0000</updated>
                            <resolved>Fri, 26 Nov 2021 22:43:55 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="292663" author="adilger" created="Mon, 22 Feb 2021 20:16:50 +0000"  >&lt;p&gt;One option, for the servers at least, would be to add an EXPORT_SYMBOL() for &lt;tt&gt;save_stack_trace_tsk&lt;/tt&gt;.  That is not ideal, but having no stack traces at all is also bad.  It &lt;em&gt;may&lt;/em&gt; be that Neil&apos;s patch to change the watchdog timers to be an interrupt timer on the process itself (rather than an external process) means these stacks will &lt;b&gt;always&lt;/b&gt; be the &quot;&lt;tt&gt;tsk == current&lt;/tt&gt;&quot; case, so that the rest of this complex stack dumping machinery could just go away.&lt;/p&gt;

&lt;p&gt;It also looks like sanity test_422 needs to be further improved to check for an actual stack trace being dumped.  It is currently checking for the &quot;&lt;tt&gt;Dumping the stack trace for debugging purposes&lt;/tt&gt;&quot; message, but no actual stack is printed:&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_sets/d99609ff-22a4-4f85-8ed8-768053b29ce5&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/d99609ff-22a4-4f85-8ed8-768053b29ce5&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[10387.837479] Lustre: mdt00_000: service thread pid 716096 was inactive for 42.178 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[10387.841359] Pid: 716096, comm: mdt00_000 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Fri Feb 19 20:34:57 UTC 2021
[10387.843222] Call Trace TBD:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;For reference, with a RHEL7.9 kernel this looks like:&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_sets/cad78b1e-832d-480f-8162-799d4358a414&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/cad78b1e-832d-480f-8162-799d4358a414&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[13446.051452] Lustre: mdt00_004: service thread pid 370 was inactive for 40.123 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[13446.054750] Pid: 370, comm: mdt00_004 3.10.0-1160.6.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 21:18:48 UTC 2020
[13446.056456] Call Trace:
[13446.056976]  [&amp;lt;ffffffffc07b04c1&amp;gt;] __cfs_fail_timeout_set+0xe1/0x200 [libcfs]
[13446.058341]  [&amp;lt;ffffffffc0ce5146&amp;gt;] ptlrpc_server_handle_request+0x1f6/0xb10 [ptlrpc]
[13446.060104]  [&amp;lt;ffffffffc0ce9cfc&amp;gt;] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
[13446.061329]  [&amp;lt;ffffffff8dac5c21&amp;gt;] kthread+0xd1/0xe0
[13448.355379] Lustre: mdt00_000: service thread pid 22324 was inactive for 40.036 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[13448.359101] Pid: 22324, comm: mdt00_000 3.10.0-1160.6.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 21:18:48 UTC 2020
[13448.360837] Call Trace:
[13448.361337]  [&amp;lt;ffffffffc07b04c1&amp;gt;] __cfs_fail_timeout_set+0xe1/0x200 [libcfs]
[13448.362682]  [&amp;lt;ffffffffc12ec542&amp;gt;] mdd_rename+0x152/0x16b0 [mdd]
[13448.363898]  [&amp;lt;ffffffffc1152d9b&amp;gt;] mdo_rename+0x2b/0x60 [mdt]
[13448.365090]  [&amp;lt;ffffffffc1158464&amp;gt;] mdt_reint_rename+0x1b24/0x2c10 [mdt]
[13448.366432]  [&amp;lt;ffffffffc1161cc3&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
[13448.367591]  [&amp;lt;ffffffffc1139a30&amp;gt;] mdt_reint_internal+0x720/0xaf0 [mdt]
[13448.368813]  [&amp;lt;ffffffffc11455c7&amp;gt;] mdt_reint+0x67/0x140 [mdt]
[13448.369924]  [&amp;lt;ffffffffc0d456fa&amp;gt;] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
[13448.371400]  [&amp;lt;ffffffffc0ce51a6&amp;gt;] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[13448.372849]  [&amp;lt;ffffffffc0ce9cfc&amp;gt;] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
[13448.374053]  [&amp;lt;ffffffff8dac5c21&amp;gt;] kthread+0xd1/0xe0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It would be pretty safe to check for &lt;tt&gt;ptlrpc_main()&lt;/tt&gt; or &lt;tt&gt;ptlrpc_server_handle_request()&lt;/tt&gt; to verify that the stack is being dumped, since neither of these appears on the console otherwise.&lt;/p&gt;</comment>
                            <comment id="292668" author="simmonsja" created="Mon, 22 Feb 2021 20:25:27 +0000"  >&lt;p&gt;Which patch of Neil&apos;s is it that changes to an interrupt timer?&lt;/p&gt;</comment>
                            <comment id="292669" author="adilger" created="Mon, 22 Feb 2021 20:29:54 +0000"  >&lt;p&gt;It looks like the only place that calls &lt;tt&gt;libcfs_debug_dumpstack()&lt;/tt&gt; with a non-NULL argument is &lt;tt&gt;ptlrpc_watchdog_fire()&lt;/tt&gt;. Unfortunately, this delayed work runs in a different context from the thread whose stack is being dumped, so the &lt;tt&gt;tsk == current&lt;/tt&gt; fallback does not help there (though it would at least fix all of the other callers):&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_sets/7a00c660-73d0-46f9-8d4b-8ff6f8a2dbac&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/7a00c660-73d0-46f9-8d4b-8ff6f8a2dbac&lt;/a&gt; (looking at the test_425 MDS logs):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:02000400:0.0:1612373649.290722:0:692869:0:(service.c:2652:ptlrpc_watchdog_fire()) mdt00_001: service thread pid 710575 was inactive for 40.008 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The &lt;tt&gt;Dumping the stack trace&lt;/tt&gt; message is printed by PID &lt;tt&gt;692869&lt;/tt&gt; but the thread being dumped is PID &lt;tt&gt;710575&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="292670" author="simmonsja" created="Mon, 22 Feb 2021 20:34:55 +0000"  >&lt;p&gt;BTW Shaun pushed a patch - &lt;a href=&quot;https://review.whamcloud.com/#/c/40503&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/40503&lt;/a&gt;.&#160;Does this resolve it?&lt;/p&gt;</comment>
                            <comment id="292673" author="adilger" created="Mon, 22 Feb 2021 20:51:08 +0000"  >&lt;p&gt;James, that is patch &lt;a href=&quot;https://review.whamcloud.com/33018&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33018&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9859&quot; title=&quot;libcfs simplification&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9859&quot;&gt;LU-9859&lt;/a&gt; libcfs: add watchdog for ptlrpc service threads&lt;/tt&gt;&quot;, but as I posted in my previous comment, the delayed work timer is in the context of another thread, so &lt;tt&gt;dump_stack()&lt;/tt&gt; will not work in that case (it does not take a &lt;tt&gt;task_struct&lt;/tt&gt; argument).&lt;/p&gt;

&lt;p&gt;What &lt;b&gt;does&lt;/b&gt; look promising is that &lt;tt&gt;stack_trace_save_tsk()&lt;/tt&gt; appears to be only a thin wrapper around &lt;tt&gt;save_stack_trace_tsk()&lt;/tt&gt;, and that function &lt;b&gt;is&lt;/b&gt; exported, so instead of calling &lt;tt&gt;symbol_get(&quot;stack_trace_save_tsk&quot;)&lt;/tt&gt; (which doesn&apos;t work anymore) we could just make a copy of the wrapper:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
unsigned &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; stack_trace_save_tsk(struct task_struct *task,
                                  unsigned &lt;span class=&quot;code-object&quot;&gt;long&lt;/span&gt; *store, unsigned &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; size,
                                  unsigned &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; skipnr)
{
        struct stack_trace trace = {
                .entries        = store,
                .max_entries    = size,
                &lt;span class=&quot;code-comment&quot;&gt;/* skip &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; function &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; they are tracing us */&lt;/span&gt;
                .skip   = skipnr + (current == task),
        };

        save_stack_trace_tsk(task, &amp;amp;trace);
        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; trace.nr_entries;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="292679" author="adilger" created="Mon, 22 Feb 2021 21:01:35 +0000"  >&lt;blockquote&gt;
&lt;p&gt;BTW Shaun pushed a patch - &lt;a href=&quot;https://review.whamcloud.com/#/c/40503&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/40503&lt;/a&gt;. Does this resolve it?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I don&apos;t &lt;em&gt;think&lt;/em&gt; so, for x86 at least.  The changes under &lt;tt&gt;CONFIG_ARCH_STACKWALK&lt;/tt&gt;, where the &lt;tt&gt;Call Trace TBD:&lt;/tt&gt; message is printed, are a no-op. It is hard to say for sure, since the review-ldiskfs sanity test_422 printed the stacks correctly, but it ran with EL8.2 on the MDS and I think the &lt;tt&gt;symbol_get()&lt;/tt&gt; &quot;fix&quot; only appeared with EL8.3.  I think the &lt;tt&gt;stack_trace_save_tsk()&lt;/tt&gt; copy may allow this to work on x86 as well.&lt;/p&gt;</comment>
                            <comment id="319266" author="adilger" created="Fri, 26 Nov 2021 22:43:55 +0000"  >&lt;p&gt;It looks like Shaun&apos;s patch &lt;a href=&quot;https://review.whamcloud.com/40503&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40503&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14099&quot; title=&quot;SUSE 15 SP2 aarch64 does not set arch_stackwalk missing print_stack_trace &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14099&quot;&gt;&lt;del&gt;LU-14099&lt;/del&gt;&lt;/a&gt; build: Fix for unconfigured arch_stackwalk&lt;/tt&gt;&quot; has fixed this problem.  I&apos;m able to see an abbreviated, though still useful, stack trace on my RHEL8 server:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[4439794.910865] Lustre: 1611:0:(osd_handler.c:1934:osd_trans_start()) myth-OST0000: credits 1351 &amp;gt; trans_max 1024
[4439794.939134] Call Trace TBD:
[4439794.941671] [&amp;lt;0&amp;gt;] libcfs_call_trace+0x6f/0x90 [libcfs]
[4439794.944169] [&amp;lt;0&amp;gt;] osd_trans_start+0x4fd/0x520 [osd_ldiskfs]
[4439794.946662] [&amp;lt;0&amp;gt;] ofd_precreate_objects+0x10f2/0x1f60 [ofd]
[4439794.949034] [&amp;lt;0&amp;gt;] ofd_create_hdl+0x6a1/0x1740 [ofd]
[4439794.951652] [&amp;lt;0&amp;gt;] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[4439794.954031] [&amp;lt;0&amp;gt;] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[4439794.956444] [&amp;lt;0&amp;gt;] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
[4439794.958745] [&amp;lt;0&amp;gt;] kthread+0x112/0x130
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="61465">LU-14099</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="47758">LU-9859</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="55879">LU-12400</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01nan:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>