<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:47:54 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5026] Create an lbug_on_eviction option</title>
                <link>https://jira.whamcloud.com/browse/LU-5026</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;There should be an option to force an LBUG on eviction.  This option should be controlled through a procfs interface, at /proc/sys/lustre/lbug_on_eviction.&lt;/p&gt;

&lt;p&gt;If set on the client only, the client should LBUG when it discovers that it has been evicted.  If set on a server, it should LBUG when it evicts any client.&lt;/p&gt;</description>
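A minimal sketch of how the option described above would be toggled, assuming the procfs path given in the description; the interface that actually landed in 2.12 may be exposed differently, so treat both forms as assumptions to verify against the merged patch:

```shell
# Hedged sketch: enable panic-on-eviction behaviour on a client.
# The path below is taken from this ticket's description (an assumption,
# not the verified landed interface).
echo 1 > /proc/sys/lustre/lbug_on_eviction

# Equivalent lctl form, assuming the tunable is exported as a top-level
# parameter (also an assumption):
lctl set_param lbug_on_eviction=1

# Disable again once the eviction under investigation has been captured:
lctl set_param lbug_on_eviction=0
```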
                <environment></environment>
        <key id="24615">LU-5026</key>
            <summary>Create an lbug_on_eviction option</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="hornc">Chris Horn</assignee>
                                    <reporter username="haasken">Ryan Haasken</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Wed, 7 May 2014 21:17:36 +0000</created>
                <updated>Wed, 29 May 2019 17:14:06 +0000</updated>
                            <resolved>Mon, 1 Oct 2018 17:17:30 +0000</resolved>
                                                    <fixVersion>Lustre 2.12.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>17</watches>
                    <comments>
                            <comment id="83450" author="keith" created="Wed, 7 May 2014 21:24:08 +0000"  >&lt;p&gt;What use case are you trying to handle? &lt;/p&gt;

&lt;p&gt;If you lbug on a server for any reason you will bring down the whole FS for all clients.&lt;/p&gt;</comment>
                            <comment id="83459" author="haasken" created="Wed, 7 May 2014 21:58:45 +0000"  >&lt;p&gt;Here is a patch:  &lt;a href=&quot;http://review.whamcloud.com/#/c/10257/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10257/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="83461" author="rread" created="Wed, 7 May 2014 22:06:57 +0000"  >&lt;p&gt;Wouldn&apos;t this allow someone to cause a server to LBUG simply by restarting a client that was holding a write lock?&lt;/p&gt;</comment>
                            <comment id="83515" author="amk" created="Thu, 8 May 2014 15:06:02 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Wouldn&apos;t this allow someone to cause a server to LBUG simply by restarting a client that was holding a write lock?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Only if lbug_on_eviction was explicitly set on the server. We don&apos;t expect the flag will normally be set on servers. But we have customers who cannot tolerate any evictions. Any information we can gather that helps us eliminate the evictions is potentially useful. I imagine lbug_on_eviction would typically be enabled during system bring up as we work out the problems and disabled when the system goes into production use.&lt;/p&gt;</comment>
                            <comment id="83538" author="green" created="Thu, 8 May 2014 17:13:45 +0000"  >&lt;p&gt;I suspect lock dump on eviction/timeout already covers like 95% of all usecases?&lt;/p&gt;</comment>
                            <comment id="83554" author="amk" created="Thu, 8 May 2014 18:19:16 +0000"  >&lt;blockquote&gt;
&lt;p&gt;I suspect lock dump on eviction/timeout already covers like 95% of all usecases?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Well, if it were really LOCK dump on eviction rather than LOG dump on eviction, it would probably cover most of the use cases. Adding a dump of the lock tables is next on our list of enhancements.&lt;/p&gt;

&lt;p&gt;Since the lbug_on_eviction feature for servers is meeting with so much resistance, how about this idea?&lt;/p&gt;

&lt;p&gt;1. Make lbug_on_eviction available only on clients.&lt;br/&gt;
2. Change dump_on_eviction/timeout to dump a copy of the lock table along with the dk log by temporarily enabling dlmtrace and calling ldlm_dump_all_namespaces. Actually, I&apos;m thinking the lock table should be dumped after the dk log to a separate file to avoid overrunning the trace buffer. &lt;/p&gt;</comment>
                            <comment id="83922" author="spitzcor" created="Mon, 12 May 2014 20:15:59 +0000"  >&lt;p&gt;LBUG upon eviction is really just a big hammer.  We&apos;re finding that there are lots of applications that should fail upon an eviction, but don&apos;t.  It&apos;s not necessarily the case that the application was badly written and doesn&apos;t check return codes either.  We&apos;re trying to catch cases where successful writes complete to cache, but an app doesn&apos;t later do a sync (until sync upon close), but in the meantime the dirty pages that were dropped from the buffer cache for the eviction are silently ignored.  In this case, I think, the app is totally unaware that &apos;something bad&apos; happened.  If there is something that we can do along those lines, then things like lbug_on_eviction on clients might not be needed.&lt;/p&gt;</comment>
                            <comment id="85922" author="haasken" created="Thu, 5 Jun 2014 21:40:33 +0000"  >&lt;p&gt;I&apos;ve created &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5149&quot; title=&quot;Create debug_upcall script which dumps ldlm namespaces to log file&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5149&quot;&gt;&lt;del&gt;LU-5149&lt;/del&gt;&lt;/a&gt; for adding a lock dump when the Lustre trace buffers are dumped.&lt;/p&gt;

&lt;p&gt;I will change this patch so that it only makes lbug_on_eviction available on clients.  Oleg, Robert, Keith, do you think this is a better solution?&lt;/p&gt;</comment>
                            <comment id="86268" author="haasken" created="Tue, 10 Jun 2014 19:48:20 +0000"  >&lt;p&gt;I&apos;ve updated the patch and removed the code which makes the server LBUG when it evicts a client.  Please take a look at the patch:  &lt;a href=&quot;http://review.whamcloud.com/#/c/10257&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10257&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="87974" author="haasken" created="Wed, 2 Jul 2014 15:47:18 +0000"  >&lt;p&gt;As an alternative to my patch, there is a similar patch here:  &lt;a href=&quot;http://review.whamcloud.com/#/c/8875/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8875/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My patch creates a new interface under /proc called lbug_on_eviction, while the other patch overloads the dump_on_eviction interface by causing an LBUG when obd_dump_on_eviction == -1.  I prefer my patch, but as long as it&apos;s well documented, the other patch would be fine as well.&lt;/p&gt;</comment>
                            <comment id="88762" author="cliffw" created="Thu, 10 Jul 2014 20:05:29 +0000"  >&lt;p&gt;I&apos;ve asked Liang to review the patch, hopefully that will help resolve it. &lt;/p&gt;</comment>
                            <comment id="92961" author="bfaccini" created="Tue, 2 Sep 2014 14:58:06 +0000"  >&lt;p&gt;I developed &lt;a href=&quot;http://review.whamcloud.com/#/c/8875/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8875/&lt;/a&gt; as a debug patch only (hence I decided to just divert a bit of dump_on_eviction&apos;s purpose), to help on some ticket(s) at some point in time. But more experience seems to indicate that the need for the feature is real, so it needs to become more formal, and thus &lt;a href=&quot;http://review.whamcloud.com/#/c/10257&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10257&lt;/a&gt; appears better now.&lt;/p&gt;</comment>
                            <comment id="93988" author="green" created="Mon, 15 Sep 2014 16:09:20 +0000"  >&lt;p&gt;I think you really mean that you want panic to happen on eviction?&lt;br/&gt;
Because LBUG does not always panic and is panic_on_lbug setting dependent, if it&apos;s not set, you get essentially a lock dump and a hung thread.&lt;br/&gt;
So I think explicitly going with panic makes more sense? (both in name of the control and in the behavior).&lt;/p&gt;

&lt;p&gt;Additionally I&apos;d like to draw your attention to the fact that the MDS is a CLIENT to the OSTs. That means that it could be evicted from OSTs and can then panic if this is set, which is sort of undesirable, I suspect.&lt;br/&gt;
Not only that, every other kind of server is a client of the MGS, though I am not quite sure evictions in the mgc are handled in the same way as in other client modules, so this might need some additional looking into.&lt;/p&gt;</comment>
                            <comment id="94781" author="haasken" created="Tue, 23 Sep 2014 21:20:11 +0000"  >&lt;p&gt;Thanks for the comments, Oleg.  Yes, that&apos;s a good point; I think we&apos;d rather directly call panic() and change the interface name to panic_on_eviction.&lt;/p&gt;

&lt;p&gt;Thanks for pointing out the possibility of server panic on eviction, contrary to my commit message.  I tested the two cases you mentioned with the patch as it currently is implemented.&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;Set lbug_on_eviction=1 on MDS.  Evicted the MDS from an OSS via &lt;tt&gt;lctl set_param obdfilter.*.evict_client=nid:$MGS_NID&lt;/tt&gt;.  The MDS LBUG&apos;d.&lt;/li&gt;
	&lt;li&gt;Set lbug_on_eviction=1 on OSS.  Evicted the OSS from the MGS via &lt;tt&gt;lctl set_param mgs.MGS.evict_client=$OSS_UUID&lt;/tt&gt;.  The OSS LBUG&apos;d.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;We probably don&apos;t want this, even though it can be avoided by leaving &lt;tt&gt;lbug_on_eviction=0&lt;/tt&gt; on the servers.&lt;/p&gt;

&lt;p&gt;How would we determine whether we are running on a client or a server so we can panic only on clients?  It doesn&apos;t seem like something we could do at compile time, and I&apos;m not sure how to determine if there are any Lustre targets mounted.  It doesn&apos;t look like there&apos;s anything to indicate this in the obd_import structure passed to ptlrpc_invalidate_import_thread.&lt;/p&gt;</comment>
                            <comment id="99697" author="cliffw" created="Thu, 20 Nov 2014 17:24:34 +0000"  >&lt;p&gt;Ryan, if you are continuing work on this patch, it needs to be rebased, can you address this?&lt;/p&gt;</comment>
                            <comment id="218997" author="simmonsja" created="Wed, 24 Jan 2018 14:31:10 +0000"  >&lt;p&gt;Now that we have sysfs support, a better approach would be to use uevents with udev so user-land utilities can detect when a client is evicted. What I have noticed is that we have imp_state covered by LUSTRE_IMP_* and we have obd_import_event, which handles IMP_EVENT_*. Which change should be reported to user land?&lt;/p&gt;</comment>
                            <comment id="220600" author="cliffw" created="Fri, 9 Feb 2018 16:57:38 +0000"  >&lt;p&gt;Old issue&lt;/p&gt;</comment>
                            <comment id="220606" author="paf" created="Fri, 9 Feb 2018 17:05:23 +0000"  >&lt;p&gt;Cliff,&lt;/p&gt;

&lt;p&gt;It&apos;s an old issue but it has an approved-but-never-landed patch associated with it.  I believe someone from Cray is going to update the patch shortly (like, in the next few days) and work on getting it landed.  I&apos;ll assign it to myself as a placeholder.&lt;/p&gt;</comment>
                            <comment id="221179" author="simmonsja" created="Fri, 16 Feb 2018 17:35:55 +0000"  >&lt;p&gt;I have an idea to make this work with udev rules. Need &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9431&quot; title=&quot;class_process_proc_param can&amp;#39;t handle sysfs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9431&quot;&gt;&lt;del&gt;LU-9431&lt;/del&gt;&lt;/a&gt; to land first. The idea is send a udev rule when the import changes state. Then a udev rule can do what ever the admin sets it up to do.&lt;/p&gt;</comment>
                            <comment id="221609" author="gerrit" created="Sat, 24 Feb 2018 00:40:45 +0000"  >&lt;p&gt;James Simmons (uja.ornl@yahoo.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/31407&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/31407&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5026&quot; title=&quot;Create an lbug_on_eviction option&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5026&quot;&gt;&lt;del&gt;LU-5026&lt;/del&gt;&lt;/a&gt; ptlrpc: send uevents when import state changes&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 6f2d12d62ddd0d49e6615cdd72b81cc061f145c7&lt;/p&gt;</comment>
                            <comment id="221611" author="simmonsja" created="Sat, 24 Feb 2018 00:59:43 +0000"  >&lt;p&gt;So with the udev patch I believe the best way to get the results you want is with the following udev rule.&lt;/p&gt;

&lt;p&gt;SUBSYSTEM==&quot;lustre&quot;, ACTION==&quot;change&quot;, ENV{STATE}==&quot;RECOVER&quot;, RUN+=&quot;/usr/sbin/lctl dk /tmp/dump.log&quot;&lt;/p&gt;

&lt;p&gt;Of course you could create a script to do more and use RUN+=&quot;my_script&quot; instead. Basically you can do whatever you want.&lt;/p&gt;

&lt;p&gt;The udev event also has two unique fields. One is STATE, which is what is returned by ptlrpc_import_state_name() in lustre_import.h. The second field is IMPORT, which tells you what actually changed state. For example it can be IMPORT=&quot;lustre-MDT0000_UUID&quot;. So in theory you could also filter based on device, i.e. ENV{IMPORT}==&quot;lustre-MDT*&quot;, or even by file system. This allows you to do things like report to some utility like nagios that a client was evicted and extract the file system and which device it was. Lots of possibilities here. If you want to use these fields as an argument, do something along the lines of RUN+=&quot;my_app $env{IMPORT} $env{STATE}&quot;.&lt;/p&gt;

&lt;p&gt;Hope that helps. As you can see this gives much power to the admins, since many states are handled and the import is reported back to the user.&lt;/p&gt;</comment>
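The fragments in the comment above can be combined into one rules file. This is a hedged sketch: the file path is hypothetical, and the STATE value "EVICTED" is an assumption that should be verified against ptlrpc_import_state_name() in lustre_import.h on the target tree before deploying:

```shell
# /etc/udev/rules.d/99-lustre-eviction.rules (hypothetical path)

# Dump the Lustre debug log the moment an import reports eviction.
# "EVICTED" is an assumed STATE string; check it against
# ptlrpc_import_state_name() on your tree.
SUBSYSTEM=="lustre", ACTION=="change", ENV{STATE}=="EVICTED", RUN+="/usr/sbin/lctl dk /tmp/evict-dump.log"

# Debug-only variant: crash the node immediately for post-mortem
# analysis (requires kernel.sysrq enabled). udev RUN does not invoke a
# shell, so the redirection must go through sh -c.
SUBSYSTEM=="lustre", ACTION=="change", ENV{STATE}=="EVICTED", RUN+="/bin/sh -c 'echo c > /proc/sysrq-trigger'"
```

Note that RUN+= commands execute asynchronously after the event is delivered, so this cannot guarantee the synchronous, pre-teardown panic requested later in the thread.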
                            <comment id="222102" author="simmonsja" created="Thu, 1 Mar 2018 23:56:06 +0000"  >&lt;p&gt;Have you played with my patch?&lt;/p&gt;</comment>
                            <comment id="222159" author="hornc" created="Fri, 2 Mar 2018 21:06:13 +0000"  >&lt;p&gt;James, I haven&apos;t had a chance to check it out yet, but it should probably be submitted under a different ticket, no?&lt;/p&gt;</comment>
                            <comment id="222160" author="hornc" created="Fri, 2 Mar 2018 21:15:27 +0000"  >&lt;p&gt;James, I read your instructions just now and we&apos;re not looking to dump dk logs. We want the node to panic the moment it discovers it was evicted. I don&apos;t think this uevent gives us the sort of granularity we&apos;re seeking for the panic, i.e. we can&apos;t guarantee a panic before the client tears down the import structures and we lose valuable state information.&lt;/p&gt;</comment>
                            <comment id="222171" author="simmonsja" created="Fri, 2 Mar 2018 23:38:59 +0000"  >&lt;p&gt;Doing an LBUG to force a panic is too heavy-handed. Also, servers are clients to each other, which Oleg pointed out before in this ticket.  If you look at LBUG(), which is really just lbug_with_loc(), it will not really panic unless you set libcfs_panic_on_lbug. When libcfs_panic_on_lbug is not set you end up calling libcfs_debug_dumplog(). You can do the exact same thing with obd_dump_on_eviction. Okay, I see libcfs_debug_dumpstack() is missing. If we add libcfs_debug_dumpstack() to ptlrpc_invalidate_import_thread() under the obd_dump_on_eviction condition, would that work for you?&lt;/p&gt;

&lt;p&gt;As long as you dump the stack right away, the panic can be done later with the udev rule if you really want to panic the node. This places the reboot into the hands of the admin to choose when to push the button and under what conditions. I don&apos;t think many admins will want the node to spontaneously reboot without any say.&lt;/p&gt;</comment>
                            <comment id="222173" author="paf" created="Sat, 3 Mar 2018 00:49:08 +0000"  >&lt;p&gt;James,&lt;/p&gt;

&lt;p&gt;There is a trend in this bug to dispute whether or not Cray needs or wants this debug option we&apos;re trying to get merged (we&apos;ve been using it happily for years, to great benefit).  I wish it would stop.  I&apos;m going to try to be really clear about what we need and why in hopes of achieving that goal.&lt;/p&gt;

&lt;p&gt;We sometimes &lt;b&gt;need to be able to panic nodes at the time of an eviction in order to debug evictions&lt;/b&gt;.  Both dk logs and stack traces are insufficient, and we can&apos;t wait for later to panic the node.  This option is purely for debugging and is not enabled by an admin except at the explicit request of a developer.&lt;/p&gt;

&lt;p&gt;All sorts of interesting state exists in memory and it is not realistic to try to dump all of it in debug logs, especially since new state will be created when the code changes.  In certain situations, we need to be able to trigger a kernel panic at the time of an eviction.  We cannot wait to panic at some future time/wait for admin intervention - It must be &lt;b&gt;as immediate as practical&lt;/b&gt; or all sorts of state will get cleared away.  If the eviction is allowed to proceed to destroying the import, almost all potentially useful information is gone.&lt;/p&gt;

&lt;p&gt;If that immediate, synchronous (i.e. no further execution of the thread processing the eviction except to do the panic) panic is possible to do with a udev rule, then that&apos;s probably fine!  Nothing else and nothing short of that - giving us the ability to panic a client when an eviction occurs - will do what we need.&lt;/p&gt;

&lt;p&gt;The conversation about admins not wanting to reboot nodes is silly - this isn&apos;t for admins and it&apos;s not for normal run time.  If it&apos;s in use, it&apos;s being used in a limited situation to get information for a developer, after the admin was specifically informed of the implications.  No one will run with this enabled normally.  This is no more dangerous than the existence of force_lbug or /proc/sysrq-trigger.&lt;/p&gt;

&lt;p&gt;And we know servers are clients to each other sometimes.  We also know they get evicted sometimes...  So if we turned this on there, it would be to help debug such an eviction.  This isn&apos;t dangerous - It&apos;s intentional.  Lots of things in /proc/ can ruin your day in equally bad or worse ways if you use them incorrectly.&lt;/p&gt;</comment>
                            <comment id="222423" author="simmonsja" created="Mon, 5 Mar 2018 19:13:13 +0000"  >&lt;p&gt;Looking back on the comments people at Intel were uneasy about this patch. I was just trying to make it acceptable to them. I&apos;m not personally against your patch. In any case the udev approach is of interest to ORNL as well as Intel for monitoring purposes. I removed my negative review and will leave it up to you to convince Oleg and Andreas to accept your work.&lt;/p&gt;</comment>
                            <comment id="234156" author="gerrit" created="Mon, 1 Oct 2018 14:00:12 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/10257/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/10257/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5026&quot; title=&quot;Create an lbug_on_eviction option&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5026&quot;&gt;&lt;del&gt;LU-5026&lt;/del&gt;&lt;/a&gt; obdclass: Add lbug_on_eviction option&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 97381ffc9231a9e8b997fd3b4d66c9c5ade1b4d0&lt;/p&gt;</comment>
                            <comment id="234172" author="pjones" created="Mon, 1 Oct 2018 17:17:30 +0000"  >&lt;p&gt;Landed for 2.12&lt;/p&gt;</comment>
                            <comment id="247977" author="cfaber" created="Wed, 29 May 2019 17:07:43 +0000"  >&lt;p&gt;should this be closed?&lt;/p&gt;</comment>
                            <comment id="247981" author="pjones" created="Wed, 29 May 2019 17:14:06 +0000"  >&lt;p&gt;Same response as on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1365&quot; title=&quot;Implement ldiskfs LARGEDIR support for e2fsprogs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1365&quot;&gt;&lt;del&gt;LU-1365&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwm2v:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13907</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>