<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:52:35 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5566] Lustre waiting in TASK_INTERRUPTIBLE with all signals blocked results in unkillable processes</title>
                <link>https://jira.whamcloud.com/browse/LU-5566</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Lustre does much of its waiting on Linux in a TASK_INTERRUPTIBLE state with all signals blocked.&lt;/p&gt;

&lt;p&gt;This was changed briefly to TASK_UNINTERRUPTIBLE as part of bugzilla 16842 (&lt;a href=&quot;https://projectlava.xyratex.com/show_bug.cgi?id=16842&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://projectlava.xyratex.com/show_bug.cgi?id=16842&lt;/a&gt;), then changed back because it led to extremely high reported load averages on Lustre servers:&lt;br/&gt;
&lt;a href=&quot;https://projectlava.xyratex.com/show_bug.cgi?id=16842#c28&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://projectlava.xyratex.com/show_bug.cgi?id=16842#c28&lt;/a&gt;&lt;br/&gt;
Load averages inflated to that degree would be expected to cause operational issues on their own.&lt;/p&gt;

&lt;p&gt;This is because the kernel calculates load average by counting tasks in TASK_UNINTERRUPTIBLE (see the definition of &quot;task_contributes_to_load&quot;).&lt;/p&gt;
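&lt;p&gt;A minimal sketch of that accounting rule (an illustrative Python model, not the kernel code; the real predicate lives in the kernel&apos;s scheduler headers and also excludes frozen tasks):&lt;/p&gt;

```python
# Illustrative model: only uninterruptible sleepers count toward load.
# State names mirror the kernel's; the values here are arbitrary.
TASK_RUNNING = "running"
TASK_INTERRUPTIBLE = "interruptible"
TASK_UNINTERRUPTIBLE = "uninterruptible"

def contributes_to_load(state, frozen=False):
    # Simplified task_contributes_to_load(): uninterruptible and not frozen.
    return state == TASK_UNINTERRUPTIBLE and not frozen

# A thousand clients parked in TASK_UNINTERRUPTIBLE would each add one to
# the load average; the same waiters in TASK_INTERRUPTIBLE add nothing.
waiters = [TASK_UNINTERRUPTIBLE] * 1000
load_contribution = sum(1 for s in waiters if contributes_to_load(s))
print(load_contribution)  # prints 1000
```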

&lt;p&gt;This method of waiting in TASK_INTERRUPTIBLE with all signals blocked (including SIGKILL) causes a problem with delivery of signals in shared_pending which can result in unkillable processes.&lt;/p&gt;
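&lt;p&gt;(Incidentally, masking SIGKILL at all is a kernel-side privilege; from userspace, POSIX silently ignores any attempt to block SIGKILL or SIGSTOP. A quick Python sketch, assuming a Linux host:)&lt;/p&gt;

```python
import signal

# Ask to block SIGTERM and SIGKILL; the kernel silently drops SIGKILL.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM, signal.SIGKILL})

# Read back the current mask (SIG_BLOCK with an empty set is a no-op query).
blocked = signal.pthread_sigmask(signal.SIG_BLOCK, set())
print(signal.SIGTERM in blocked)   # prints True
print(signal.SIGKILL in blocked)   # prints False
```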

&lt;p&gt;The situation is this:&lt;br/&gt;
Lustre is waiting as described, in TASK_INTERRUPTIBLE with all signals blocked.  A SIGKILL is sent to the process group of a user process in a syscall into Lustre.  The signal goes into the shared_pending set for the process.  This would normally wake a process sleeping in TASK_INTERRUPTIBLE, but it is not handled because Lustre is waiting with all signals blocked.&lt;/p&gt;

&lt;p&gt;Separately, a SIGSTOP is sent directly to the process.  This is commonly used as part of our debugging/tracing software, and normally SIGSTOP and SIGKILL arriving at the same process is not a problem:&lt;/p&gt;

&lt;p&gt;For a process waiting in TASK_INTERRUPTIBLE (without SIGKILL blocked), SIGKILL will cause that task to exit (whether it arrives before SIGSTOP or after - the effect is the same).&lt;/p&gt;

&lt;p&gt;For a task waiting in TASK_UNINTERRUPTIBLE, the task finishes waiting, then on return to userspace, the signals (SIGKILL in the shared mask &amp;amp; SIGSTOP in the per-process mask) are handled correctly &amp;amp; the process exits.&lt;/p&gt;

&lt;p&gt;But somehow, waiting in TASK_INTERRUPTIBLE with all signals blocked confuses things, and the result is stopped processes that do not exit.  Sending another SIGKILL works, but a process waiting in any other state would already have exited by this point.&lt;/p&gt;

&lt;p&gt;It&apos;s very clear why Lustre does its waiting in TASK_INTERRUPTIBLE (Oleg reported load averages on a healthy server of &amp;gt;3000 in the bugzilla bug linked above), and since a complex system like Lustre is not good at being interrupted arbitrarily, it&apos;s understandable why it waits with all signals blocked.&lt;br/&gt;
At the same time, it&apos;s clearly wrong to be waiting in TASK_INTERRUPTIBLE while blocking all signals.  Whatever scheduler behavior is causing these zombie processes is not the central problem.&lt;/p&gt;

&lt;p&gt;The cleanest solution I can see is to add a new wait state to the Linux kernel, one that allows a process to wait uninterruptibly but not contribute to load.  I&apos;m thinking of proposing a patch to this effect (to start the conversation) on fs-devel or the LKML, and I wanted to get input from anyone at Intel who&apos;d like to give it before starting that conversation.&lt;/p&gt;


&lt;p&gt;I can provide further details of the non-exiting processes, including dumps, if needed, but I think the description above should be sufficient.&lt;/p&gt;</description>
                <environment></environment>
        <key id="26244">LU-5566</key>
            <summary>Lustre waiting in TASK_INTERRUPTIBLE with all signals blocked results in unkillable processes</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="paf">Patrick Farrell</reporter>
                        <labels>
                    </labels>
                <created>Fri, 29 Aug 2014 22:13:31 +0000</created>
                <updated>Fri, 5 Sep 2014 18:04:41 +0000</updated>
                                            <version>Lustre 2.7.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="92969" author="paf" created="Tue, 2 Sep 2014 16:07:55 +0000"  >&lt;p&gt;I&apos;ve got a more specific suggestion, from Cray&apos;s Paul Cassella.&lt;/p&gt;

&lt;p&gt;He found this old lkml message about a proposed change for almost exactly this problem, suggested by Linus:&lt;br/&gt;
&lt;a href=&quot;http://marc.info/?l=linux-kernel&amp;amp;m=102822913830599&amp;amp;w=1&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://marc.info/?l=linux-kernel&amp;amp;m=102822913830599&amp;amp;w=1&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On Wed, 31 Jul 2002, David Howells wrote:&lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; Can you comment on whether a driver is allowed to block signals like this, and&lt;br/&gt;
&amp;gt; whether they should be waiting in TASK_UNINTERRUPTIBLE?&lt;/p&gt;

&lt;p&gt;They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a flag &lt;br/&gt;
to distinguish between &quot;increases load average&quot; and &quot;doesn&apos;t&quot;. So you &lt;br/&gt;
could have&lt;/p&gt;

&lt;p&gt;	TASK_WAKESIGNAL - wake on all signals&lt;br/&gt;
	TASK_WAKEKILL	- wake on signals that are deadly&lt;br/&gt;
	TASK_NOSIGNAL	- don&apos;t wake on signals&lt;br/&gt;
	TASK_LOADAVG	- counts toward loadaverage&lt;/p&gt;

&lt;p&gt;	#define TASK_UNINTERRUPTIBLE	(TASK_NOSIGNAL | TASK_LOADAVG)&lt;br/&gt;
	#define TASK_INTERRUPTIBLE	TASK_WAKESIGNAL&lt;/p&gt;&lt;/blockquote&gt;
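&lt;p&gt;The composition in that proposal can be sketched as plain bit flags (illustrative values only; none of these constants carry the kernel&apos;s real values):&lt;/p&gt;

```python
# Illustrative flag values for the proposed split (not the kernel's).
TASK_WAKESIGNAL = 0x1   # wake on all signals
TASK_WAKEKILL = 0x2     # wake on signals that are deadly
TASK_NOSIGNAL = 0x4     # don't wake on signals
TASK_LOADAVG = 0x8      # counts toward load average

# The existing states become compositions of the new flags:
TASK_INTERRUPTIBLE = TASK_WAKESIGNAL
TASK_UNINTERRUPTIBLE = TASK_NOSIGNAL | TASK_LOADAVG

# The state Lustre wants: sleep through signals without inflating load.
LUSTRE_WAIT = TASK_NOSIGNAL
```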

&lt;p&gt;So it seems like this would fill Lustre&apos;s need nicely, and if done correctly, would presumably meet with approval upstream.&lt;/p&gt;</comment>
                            <comment id="93132" author="paf" created="Wed, 3 Sep 2014 18:01:05 +0000"  >&lt;p&gt;Here&apos;s a proposed note to LKML about this change.  In Lustre, this would mean re-writing l_wait_event so instead of blocking all signals and waiting at TASK_INTERRUPTIBLE, it would wait in a new scheduler state, one with the signal ignoring properties of TASK_UNINTERRUPTIBLE but without contributing to load average.&lt;br/&gt;
-------&lt;br/&gt;
Back in 2002, Linus proposed splitting TASK_UNINTERRUPTIBLE into two flags -&lt;br/&gt;
TASK_NOSIGNAL, and TASK_LOADAVG, so ignoring signals and contributing to load&lt;br/&gt;
could be handled separately:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://lkml.org/lkml/2002/8/1/186&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://lkml.org/lkml/2002/8/1/186&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I&apos;ve been looking at this, because the Lustre file system client makes heavy&lt;br/&gt;
use of waiting in TASK_INTERRUPTIBLE with all signals blocked &amp;#8211; &lt;br/&gt;
including SIGKILL and SIGSTOP.&lt;/p&gt;

&lt;p&gt;This is to manage load average, as Lustre regularly has very long waits for&lt;br/&gt;
networked IO, during which it can&apos;t be interrupted, and if Lustre is changed&lt;br/&gt;
to wait in TASK_UNINTERRUPTIBLE, load averages get out of control.&lt;/p&gt;

&lt;p&gt;But the choice to block all signals while waiting in TASK_INTERRUPTIBLE causes&lt;br/&gt;
an issue with tasks not exiting that I am hesitant to call a bug, because I&lt;br/&gt;
think the root problem is that we&apos;re (effectively) lying to the scheduler by&lt;br/&gt;
using TASK_INTERRUPTIBLE with all signals blocked.  If I&apos;m wrong and that&apos;s&lt;br/&gt;
acceptable behavior, I&apos;d be happy to share the gory details of the exact&lt;br/&gt;
problem we&apos;re having.&lt;/p&gt;

&lt;p&gt;Linus&apos;s suggestion from 2002 fits this use case perfectly, but it was never&lt;br/&gt;
implemented.&lt;/p&gt;

&lt;p&gt;It nearly made it in 2007, with this from Matthew Wilcox:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://lkml.org/lkml/2007/8/29/219&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://lkml.org/lkml/2007/8/29/219&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But Matthew dropped the split of TASK_LOADAVG off from TASK_UNINTERRUPTIBLE in&lt;br/&gt;
the second version, stating:&lt;br/&gt;
&quot;- Don&apos;t split up TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE.&lt;br/&gt;
   TASK_WAKESIGNAL and TASK_LOADAVG were pretty much equivalent, and since&lt;br/&gt;
   we had to keep &amp;#95;&amp;#95;TASK&amp;#95;{UN,}INTERRUPTIBLE anyway, splitting them made&lt;br/&gt;
   little sense.&quot;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://lkml.org/lkml/2007/9/1/232&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://lkml.org/lkml/2007/9/1/232&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So:&lt;br/&gt;
Would a resurrection of the TASK_LOADAVG implementation from Matthew&apos;s first&lt;br/&gt;
patch likely meet with approval?&lt;br/&gt;
It fits the Lustre use case perfectly, and would let us stop doing something&lt;br/&gt;
decidedly nasty without paying any price.&lt;/p&gt;</comment>
                            <comment id="93350" author="adilger" created="Fri, 5 Sep 2014 17:55:09 +0000"  >&lt;p&gt;Patrick, before we stir the hornet&apos;s nest upstream, it probably makes sense to see if we can change the Lustre code to better match the upstream kernel before we ask the kernel to change to match Lustre.  There have been several improvements in the upstream kernel since l_wait_event() was first written that might be useful to us today.  Also, it may be useful to have a different implementation for waiting on the client and on the server, since clients waiting on RPCs &lt;em&gt;should&lt;/em&gt; contribute to the load average just like they would if they were waiting on the disk.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwv07:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>15523</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>