<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:45:56 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
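<!--
For example (an illustrative URL only; the issue-view path is assumed from JIRA's
usual XML export endpoint and is not taken from this file):
https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-4797/LU-4797.xml?field=key&field=summary
-->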
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4797] ASSERTION( cl_lock_is_mutexed(slice-&gt;cls_lock) ) failed</title>
                <link>https://jira.whamcloud.com/browse/LU-4797</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;After 3 days in production with Lustre 2.4.2, CEA is suffering from the following &quot;assertion failed&quot; issue about 5 times a day:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed:
LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
Pid: 4089, comm: %%AQC.P.I.O

Call Trace:
 [&amp;lt;ffffffffa0af4895&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [&amp;lt;ffffffffa0af4e97&amp;gt;] lbug_with_loc+0x47/0xb0 [libcfs]
 [&amp;lt;ffffffffa1065d51&amp;gt;] lovsub_lock_state+0x1a1/0x1b0 [lov]
 [&amp;lt;ffffffffa0bd7a88&amp;gt;] cl_lock_state_signal+0x68/0x160 [obdclass]
 [&amp;lt;ffffffffa0bd7bd5&amp;gt;] cl_lock_state_set+0x55/0x190 [obdclass]
 [&amp;lt;ffffffffa0bdb8d9&amp;gt;] cl_enqueue_try+0x149/0x300 [obdclass]
 [&amp;lt;ffffffffa105e0da&amp;gt;] lov_lock_enqueue+0x22a/0x850 [lov]
 [&amp;lt;ffffffffa0bdb88c&amp;gt;] cl_enqueue_try+0xfc/0x300 [obdclass]
 [&amp;lt;ffffffffa0bdcc7f&amp;gt;] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
 [&amp;lt;ffffffffa0bdd8ee&amp;gt;] cl_lock_request+0x7e/0x270 [obdclass]
 [&amp;lt;ffffffffa0be2b8c&amp;gt;] cl_io_lock+0x3cc/0x560 [obdclass]
 [&amp;lt;ffffffffa0be2dc2&amp;gt;] cl_io_loop+0xa2/0x1b0 [obdclass]
 [&amp;lt;ffffffffa10dba90&amp;gt;] ll_file_io_generic+0x450/0x600 [lustre]
 [&amp;lt;ffffffffa10dc9d2&amp;gt;] ll_file_aio_write+0x142/0x2c0 [lustre]
 [&amp;lt;ffffffffa10dccbc&amp;gt;] ll_file_write+0x16c/0x2a0 [lustre]
 [&amp;lt;ffffffff811895d8&amp;gt;] vfs_write+0xb8/0x1a0
 [&amp;lt;ffffffff81189ed1&amp;gt;] sys_write+0x51/0x90
 [&amp;lt;ffffffff81091039&amp;gt;] ? sys_times+0x29/0x70
 [&amp;lt;ffffffff8100b072&amp;gt;] system_call_fastpath+0x16/0x1b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This issue is very similar to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4693&quot; title=&quot;(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice-&amp;gt;cls_lock) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4693&quot;&gt;&lt;del&gt;LU-4693&lt;/del&gt;&lt;/a&gt;, which is itself a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4692&quot; title=&quot;(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4692&quot;&gt;&lt;del&gt;LU-4692&lt;/del&gt;&lt;/a&gt;, for which there is unfortunately no fix yet.&lt;/p&gt;

&lt;p&gt;Please ask if you need additional information that could help with the diagnosis and resolution of the problem.&lt;/p&gt;

&lt;p&gt;Sebastien.&lt;/p&gt;</description>
                <environment></environment>
        <key id="23822">LU-4797</key>
            <summary>ASSERTION( cl_lock_is_mutexed(slice-&gt;cls_lock) ) failed</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="sebastien.buisson">Sebastien Buisson</reporter>
                        <labels>
                    </labels>
                <created>Fri, 21 Mar 2014 14:00:33 +0000</created>
                <updated>Mon, 28 Apr 2014 14:21:49 +0000</updated>
                            <resolved>Fri, 4 Apr 2014 16:24:42 +0000</resolved>
                                    <version>Lustre 2.4.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="79996" author="bfaccini" created="Fri, 21 Mar 2014 14:42:13 +0000"  >&lt;p&gt;Hello Sebastien, are there any crash-dump available ?? If yes, could it be possible to extract the debug-log content with the crash-tool expansion described in CFS BZ #13155 (source to be re-compiled are available, and I know you may need them to install+use on-site) ?? BTW, waht is the default debug mask you run with ??&lt;/p&gt;</comment>
                            <comment id="79998" author="sebastien.buisson" created="Fri, 21 Mar 2014 14:53:33 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;I have forwarded your request to on-site Support team. Do you want us to attach the requested debug-log content to this ticket? Or could we have a look by ourselves and search for something specific?&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="80001" author="pjones" created="Fri, 21 Mar 2014 15:45:13 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;Does this appear to be a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4692&quot; title=&quot;(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols-&amp;gt;ols_state == OLS_NEW ) failed: Impossible state: 6&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4692&quot;&gt;&lt;del&gt;LU-4692&lt;/del&gt;&lt;/a&gt;? Is there anything additional that would assist with debugging this issue?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="80203" author="bobijam" created="Tue, 25 Mar 2014 07:29:44 +0000"  >&lt;p&gt;Beside crash-dump, is it possible to find a rehit procedure?&lt;/p&gt;</comment>
                            <comment id="80204" author="sebastien.buisson" created="Tue, 25 Mar 2014 07:56:35 +0000"  >&lt;p&gt;Hi Bobijam,&lt;/p&gt;

&lt;p&gt;All I know is that the impacted file is a log file to which several processes write.&lt;/p&gt;

&lt;p&gt;I have forwarded your request to our on-site Support team.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="80357" author="sebastien.buisson" created="Thu, 27 Mar 2014 09:33:26 +0000"  >&lt;p&gt;The workload should be reproduced by launching the script run_reproducer_2.sh with 4 processes on 2 nodes.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;::::::::::::::
run_reproducer_2.sh
::::::::::::::
#!/bin/bash
sleeptime=$(( ( ${SLURM_PROCID} * 10000 ) + 1000000 ))
reproducer2.sh 10 /&amp;lt;path&amp;gt;/mylog ${sleeptime} ${SLURM_JOBID}_${SLURM_PROCID}
::::::::::::::
reproducer2.sh
::::::::::::::
#!/bin/bash
#
for i in $(seq 1 $1)
do
  usleep $3
  echo $(date) $(date &apos;+%N&apos;) $4 $3 testing write in append mode &amp;gt;&amp;gt; $2
done
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
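<!--
The reproducer above reads SLURM_PROCID and SLURM_JOBID, so it is presumably launched
through SLURM. A minimal launch sketch, assuming srun is available and that the 4
processes on 2 nodes mentioned in the comment map to 4 tasks over 2 nodes; the exact
options are an assumption, not taken from this ticket:

#!/bin/bash
# Launch 4 tasks across 2 nodes, each task appending to the same shared log file.
srun -N 2 -n 4 ./run_reproducer_2.sh
-->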
                            <comment id="80425" author="morrone" created="Fri, 28 Mar 2014 01:40:00 +0000"  >&lt;p&gt;LLNL also hit this in testing Lustre version 2.4.2-6chaos on a Lustre client.&lt;/p&gt;</comment>
                            <comment id="80429" author="bobijam" created="Fri, 28 Mar 2014 02:28:17 +0000"  >&lt;p&gt;I&apos;ve been being trying the reproduce script since yesterday, haven&apos;t gotten a hit yet. I did a little change for my VM test environment.&lt;/p&gt;

&lt;p&gt;$ cat ~/tmp/reproducer.sh&lt;br/&gt;
#!/bin/bash&lt;/p&gt;

&lt;p&gt;ls &amp;gt; /dev/null &amp;amp;&lt;br/&gt;
PID=$!&lt;/p&gt;

&lt;p&gt;for i in $(seq 1 10000)&lt;br/&gt;
do&lt;br/&gt;
	echo $PID - $i&lt;br/&gt;
	usleep $((($PID * 100) + 1500000))&lt;br/&gt;
	echo &quot;$(date) $(date &apos;+%N&apos;) $PID-$i   *****   testing write in append mode&quot; &amp;gt;&amp;gt; /mnt/lustre/file&lt;br/&gt;
done&lt;/p&gt;

&lt;p&gt;and on 2 nodes, run &quot;$~/tmp/reproducer.sh &amp;amp;&quot; five times, I think the basic idea is the same. &lt;/p&gt;

&lt;p&gt;Sebastien, How long would it rehit the issue in your case?&lt;/p&gt;</comment>
                            <comment id="80430" author="jay" created="Fri, 28 Mar 2014 04:03:12 +0000"  >&lt;p&gt;Is it possible to collect a crash dump for this issue?&lt;/p&gt;

&lt;p&gt;The only difference between 2.4.1 and 2.4.2 is &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt;; two patches landed for it. Could anyone please revert them and try again? That way we can probably get some clues about this issue.&lt;/p&gt;</comment>
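<!--
A minimal sketch of the revert test suggested above, assuming a git checkout of the
Lustre b2_4 branch; the branch name and placeholder hashes are assumptions, since the
two LU-3027 commits are not identified in this ticket:

# list recent commits on the branch and note the two whose subjects mention LU-3027
git log b2_4
# revert them for a test build, newest first to minimize conflicts
# (substitute the two commit hashes found above)
git revert <newer-hash>
git revert <older-hash>
-->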
                            <comment id="80442" author="adegremont" created="Fri, 28 Mar 2014 08:25:36 +0000"  >&lt;p&gt;Please note that Sebastien&apos;s script is not a reproducer of this crash, but something similar to the workload that leads to this crash. This code only easily triggers a lot of evictions.&lt;/p&gt;</comment>
                            <comment id="80446" author="bfaccini" created="Fri, 28 Mar 2014 09:46:37 +0000"  >&lt;p&gt;Seb, sorry to only answer to your own comment/reply on &quot;21/Mar/14 3:53 PM&quot;, so yes it could be useful as a 1st debugging info to get the debug-log content extracted from the crash-dump. BTW, I hope that you run with a default debug-levels mask that will be enough to gather accurate traces for this problem ??...&lt;/p&gt;</comment>
                            <comment id="80447" author="bfaccini" created="Fri, 28 Mar 2014 09:52:15 +0000"  >&lt;p&gt;Aurelien, concerning the evictions likely to be reproduced on site with this script, is it also possible to get a Lustre debug-log, at least from evicted Client side and with the full debug mask/traces enabled ?&lt;/p&gt;</comment>
                            <comment id="80456" author="sebastien.buisson" created="Fri, 28 Mar 2014 12:38:37 +0000"  >&lt;p&gt;Bobijam,&lt;/p&gt;

&lt;p&gt;To make it clear, 2.4.2 is the first b2_4 release used at CEA, so we are not comparing with 2.4.1, but with 2.1.6.&lt;/p&gt;

&lt;p&gt;Sebastien.&lt;/p&gt;</comment>
                            <comment id="80468" author="apercher" created="Fri, 28 Mar 2014 14:54:31 +0000"  >&lt;p&gt;About the testcase :&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;the waiting time is too large than the initial reproducer by x15&lt;/li&gt;
	&lt;li&gt;fix the run a processor/core with taskset for example when&lt;br/&gt;
           you run /tmp/reproducer.sh on the 2 nodes &lt;br/&gt;
               for cpu in $(seq 0 5) # nb cpus nodes&lt;br/&gt;
                taskset -c $cpu /tmp/reproducer.sh &amp;amp; ; done&lt;/li&gt;
	&lt;li&gt;My first testcase was without usleep and 32 task/process by nodes&lt;br/&gt;
    and 1000 write/echo by task&lt;/li&gt;
	&lt;li&gt;no lustre trace enable&lt;/li&gt;
&lt;/ul&gt;


</comment>
                            <comment id="80593" author="bobijam" created="Mon, 31 Mar 2014 09:10:30 +0000"  >&lt;p&gt;my wait time is 1.5x of the initial reproducer, I&apos;ll try the taskset.&lt;/p&gt;</comment>
                            <comment id="80629" author="bobijam" created="Mon, 31 Mar 2014 16:51:04 +0000"  >&lt;p&gt;update: tried taskset, usleep 1000000, each cpu took 250 echo taskes, 2 VM nodes, each node has 2 cpus, haven&apos;t got the hit yet.&lt;/p&gt;</comment>
                            <comment id="80655" author="bfaccini" created="Mon, 31 Mar 2014 19:58:45 +0000"  >&lt;p&gt;Bobi, I attach the Client-side debug logs that have been taken during an on-site run of the reproducer on 2x nodes (lascaux&lt;span class=&quot;error&quot;&gt;&amp;#91;2890-2891&amp;#93;&lt;/span&gt;) with evictions :&lt;/p&gt;

&lt;p&gt;       _ file &quot;eviction&quot; is a log showing the reproducer run+errors and the actions taken.&lt;br/&gt;
       _ files lascaux289&lt;span class=&quot;error&quot;&gt;&amp;#91;0,1&amp;#93;&lt;/span&gt;&lt;em&gt;dump_lustre_pb_eviction&lt;/em&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;1,2&amp;#93;&lt;/span&gt;.gz are Lustre debug logs taken on both nodes and as shown in &quot;eviction&quot; file/trace.&lt;/p&gt;

&lt;p&gt;Hope this will allow you to get more infos to qualify.&lt;/p&gt;</comment>
                            <comment id="80824" author="bobijam" created="Wed, 2 Apr 2014 10:57:07 +0000"  >&lt;p&gt;still checking the eviction logs, there are lock enqueue and blocking ast call trace intertwined, but that&apos;s normal, haven&apos;t found where the race happens causing this mutex assertion violation.&lt;/p&gt;</comment>
                            <comment id="81000" author="jay" created="Thu, 3 Apr 2014 22:22:12 +0000"  >&lt;p&gt;Please try patch: &lt;a href=&quot;http://review.whamcloud.com/9881&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9881&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I believe this is the same issue as in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4591&quot; title=&quot;Related cl_lock failures on master/2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4591&quot;&gt;&lt;del&gt;LU-4591&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="81054" author="jay" created="Fri, 4 Apr 2014 16:24:42 +0000"  >&lt;p&gt;duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="81385" author="apercher" created="Thu, 10 Apr 2014 16:53:02 +0000"  >&lt;p&gt;Hi Zhenyu Xu,&lt;br/&gt;
 Could you explain me why you think &quot;there are lock enqueue and blocking ast call trace intertwined&quot; &lt;br/&gt;
 and eviction. You said &quot;that&apos;s normal&quot;. I don&apos;t understand why there a lot contention on this because&lt;br/&gt;
 we just add some bytes at the end of one file with just 4 process on 2 nodes . for me and on a 2.1.x &lt;br/&gt;
 lustre distribution we haven&apos;t this contention.&lt;br/&gt;
 and that could explain why sometime we meet the race fix by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4558&quot; title=&quot;Crash in cl_lock_put on racer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4558&quot;&gt;&lt;del&gt;LU-4558&lt;/del&gt;&lt;/a&gt;&lt;br/&gt;
 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4495&quot; title=&quot;client evicted on parallel append write to the shared file.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4495&quot;&gt;&lt;del&gt;LU-4495&lt;/del&gt;&lt;/a&gt; could explain this contention ?&lt;br/&gt;
thanks&lt;/p&gt;</comment>
                            <comment id="81407" author="bobijam" created="Fri, 11 Apr 2014 01:07:01 +0000"  >&lt;p&gt;every write needs a exclusive lock, write from other node will cause the lock holder to relinquish the lock, and multiple write upon the same file from different node will cause lock enqueue and lock blocking ast intertwined, by that I meant normal.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="23409">LU-4693</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="22914">LU-4558</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="23022">LU-4591</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="14632" name="eviction" size="5962" author="bfaccini" created="Mon, 31 Mar 2014 19:59:27 +0000"/>
                            <attachment id="14633" name="lascaux2890_dump_lustre_pb_eviction_1.gz" size="275" author="bfaccini" created="Mon, 31 Mar 2014 20:01:31 +0000"/>
                            <attachment id="14634" name="lascaux2890_dump_lustre_pb_eviction_2.gz" size="275" author="bfaccini" created="Mon, 31 Mar 2014 20:05:48 +0000"/>
                            <attachment id="14635" name="lascaux2891_dump_lustre_pb_eviction_1.gz" size="275" author="bfaccini" created="Mon, 31 Mar 2014 20:08:37 +0000"/>
                            <attachment id="14636" name="lascaux2891_dump_lustre_pb_eviction_2.gz" size="275" author="bfaccini" created="Mon, 31 Mar 2014 20:10:47 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwi3b:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13202</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>