<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:09:59 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7564] (out_handler.c:854:out_tx_end())  ... rc = -524</title>
                <link>https://jira.whamcloud.com/browse/LU-7564</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The error happens during soak testing of build &apos;20151214&apos; (see &lt;a href=&quot;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Approximately 10% of all batch jobs using &lt;tt&gt;simul&lt;/tt&gt; crash with the following typical error message:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;...
...
03:01:31: Running test #10(iter 42): mkdir, shared mode.
03:01:31: Running test #10(iter 43): mkdir, shared mode.
03:01:31: Process 0(lola-27.lola.whamcloud.com): FAILED in remove_dirs, rmdir failed: Input/output error
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
In: PMI_Abort(1, N/A)
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
[lola-29][[616,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
srun: error: lola-32: task 3: Killed
srun: Terminating job step 393832.0
srun: error: lola-29: task 2: Killed
srun: error: lola-27: task 0: Exited with exit code 1
srun: error: lola-27: task 1: Killed
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Each job crash correlates exactly in time with an event on an MDS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) Skipped 3 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>lola&lt;br/&gt;
build: tip of master (commit ae3a2891f10a19acf855a90337316dda704da5d)</environment>
        <key id="33718">LU-7564</key>
            <summary>(out_handler.c:854:out_tx_end())  ... rc = -524</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="di.wang">Di Wang</assignee>
                                    <reporter username="heckes">Frank Heckes</reporter>
                        <labels>
                            <label>dne2</label>
                            <label>soak</label>
                    </labels>
                <created>Wed, 16 Dec 2015 12:23:57 +0000</created>
                <updated>Fri, 9 Sep 2016 19:28:20 +0000</updated>
                            <resolved>Fri, 5 Feb 2016 15:26:48 +0000</resolved>
                                    <version>Lustre 2.8.0</version>
                                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="136523" author="heckes" created="Wed, 16 Dec 2015 12:26:52 +0000"  >&lt;p&gt;The errors happen during normal operations, i.e. at times when no fault was injected.&lt;/p&gt;</comment>
                            <comment id="136683" author="heckes" created="Thu, 17 Dec 2015 09:36:13 +0000"  >&lt;p&gt;Attached are the debug logs taken on two MDSes after the event occurred. To show the timing, here are the&lt;br/&gt;
error messages for each event and node:&lt;br/&gt;
&lt;b&gt;lola-10&lt;/b&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lola-10.log:Dec 17 00:29:51 lola-10 kernel: LustreError: 5454:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;b&gt;lola-11&lt;/b&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lola-11.log:Dec 17 01:22:34 lola-11 kernel: LustreError: 6293:0:(out_handler.c:854:out_tx_end()) soaked-MDT0007-osd: undo for /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6.6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="136684" author="heckes" created="Thu, 17 Dec 2015 09:47:21 +0000"  >&lt;p&gt;The debug mask is set to the default. Please let me know if you would like it set to another, more appropriate value.&lt;/p&gt;</comment>
                            <comment id="139831" author="di.wang" created="Sat, 23 Jan 2016 18:09:32 +0000"  >&lt;p&gt;Pushed a patch &lt;a href=&quot;http://review.whamcloud.com/16838&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/16838&lt;/a&gt;, which includes all of the DNE fixes and some fixes for a remote llog problem (in llog_cat_new_log()) that might cause recovery failure. &lt;/p&gt;

&lt;p&gt;Please try this one.&lt;/p&gt;</comment>
                            <comment id="139834" author="di.wang" created="Sat, 23 Jan 2016 23:04:34 +0000"  >&lt;p&gt;Another possible reason is that soak-test does double MDT failover without COS, which might cause corruption during failover. Consider the following scenario:&lt;/p&gt;

&lt;p&gt;1. Client1 sends operation Op1 to MDT1, and MDT1 distributes the updates of Op1 to MDT2. After Op1 finishes, MDT1 sends the reply to client1.&lt;br/&gt;
2. Client2 sends operation Op2 to MDT3, and MDT3 distributes the updates of Op2 to MDT2. (Note: Op2 depends on Op1.)&lt;br/&gt;
3. Before Op1 is committed on MDT1 and MDT2, both MDT1 and MDT2 reboot.&lt;br/&gt;
4. After MDT1 restarts, client1 sends Op1 to MDT1 again, and MDT1 distributes the updates of Op1 to MDT2, but with a different xid and a transno of 0.&lt;br/&gt;
5. After MDT2 restarts and recovers, it ignores the updates of Op1 because of their 0 transno; instead it receives the replayed updates of Op2 from MDT3. Those updates then fail, because Op2 depends on Op1.&lt;/p&gt;

&lt;p&gt;With COS, Op1 would be committed before Op2 starts, which would help a lot here. I have included the COS patch in 16838; let&apos;s see how it goes.&lt;/p&gt;</comment>
                            <comment id="140120" author="gerrit" created="Tue, 26 Jan 2016 22:32:45 +0000"  >&lt;p&gt;wangdi (di.wang@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/18165&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/18165&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7564&quot; title=&quot;(out_handler.c:854:out_tx_end())  ... rc = -524&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7564&quot;&gt;&lt;del&gt;LU-7564&lt;/del&gt;&lt;/a&gt; llog: separate llog creation with initialization&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 734c40e674eb14055648d26af99aca9de4d0fd4f&lt;/p&gt;</comment>
                            <comment id="140228" author="heckes" created="Wed, 27 Jan 2016 17:24:57 +0000"  >&lt;p&gt;Error is present for build &apos;20160126&apos; (see &lt;a href=&quot;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160126&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160126&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The test was executed without any fault injection (i.e. no MDS restart/failover and no OSS failover).&lt;/p&gt;

&lt;p&gt;Even after mounting the FS, the previous dangling files don&apos;t disappear (see list in attached file &apos;dangling-files-before-restart-build-20160126&apos;).&lt;br/&gt;
New dangling files are also left over from job crashes (see list in attached file &apos;dangling-files-after-restart-build-20160126&apos;).&lt;br/&gt;
Here each job crash can be correlated with an &lt;tt&gt;rc == -524&lt;/tt&gt; event and, in the end, with the dangling file(s). E.g.&lt;br/&gt;
JOB 420993:&lt;br/&gt;
From the job output file:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 
01/27/2016 04:51:32: Process 5(lola-33.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
01/27/2016 04:51:32: Process 3(lola-31.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
01/27/2016 04:51:32: Process 2(lola-30.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
slurmd[lola-34]: *** STEP 420993.0 KILLED AT 2016-01-27T04:51:32 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[lola-32]: *** STEP 420993.0 KILLED AT 2016-01-27T04:51:32 WITH SIGNAL 9 ***
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;server error:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:846:out_tx_end()) error during execution of #8 from /lbuilds/soak-builds/workspace/lustre-soaked-20160126/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.65/lustre/ptlrpc/../../lustre/target/out_handler.c:503: rc = -17
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:846:out_tx_end()) Skipped 4 previous similar messages
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:856:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20160126/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.65/lustre/ptlrpc/../../lustre/target/out_handler.c:387: rc = -524
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:856:out_tx_end()) Skipped 17 previous similar messages
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1453899092.15929
lola-10.log:Jan 27 04:51:33 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1453899093.15184
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;dangling files in FS&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@lola-16 ~]# ll /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.0.0/
ls: cannot access /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.0.0/dir.mdtest.0.61: No such file or directory
total 0
d????????? ? ? ? ?            ? dir.mdtest.0.61
[root@lola-16 ~]# ll /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.5.0
ls: cannot access /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.5.0/dir.mdtest.5.61: No such file or directory
total 0
d????????? ? ? ? ?            ? dir.mdtest.5.61
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Attached are files containing the lists of dangling files, as well as the debug log files mentioned in the MDT error messages above.&lt;/p&gt;</comment>
                            <comment id="140470" author="gerrit" created="Fri, 29 Jan 2016 00:36:55 +0000"  >&lt;p&gt;wangdi (di.wang@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/18206&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/18206&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7564&quot; title=&quot;(out_handler.c:854:out_tx_end())  ... rc = -524&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7564&quot;&gt;&lt;del&gt;LU-7564&lt;/del&gt;&lt;/a&gt; osp: lock remote object exclusively&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 468a6d9ee4854740353cc41c04aa70ab6155e069&lt;/p&gt;</comment>
                            <comment id="141335" author="gerrit" created="Fri, 5 Feb 2016 14:55:04 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/18206/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/18206/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7564&quot; title=&quot;(out_handler.c:854:out_tx_end())  ... rc = -524&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7564&quot;&gt;&lt;del&gt;LU-7564&lt;/del&gt;&lt;/a&gt; osp: Do not match the lock for OSP&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: beab72b475c6006f53d5cab628cfdbe6dca09b32&lt;/p&gt;</comment>
                            <comment id="141349" author="jgmitter" created="Fri, 5 Feb 2016 15:26:48 +0000"  >&lt;p&gt;Patch has landed for 2.8&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="20205" name="dangling-files-after-restart-build-20160126" size="1974" author="heckes" created="Wed, 27 Jan 2016 17:38:21 +0000"/>
                            <attachment id="20206" name="dangling-files-before-restart-build-20160126" size="16397" author="heckes" created="Wed, 27 Jan 2016 17:38:21 +0000"/>
                            <attachment id="19953" name="lola-10-lustre-log.for-LU-7564.20151217T0035.bz2" size="5271" author="heckes" created="Thu, 17 Dec 2015 10:17:20 +0000"/>
                            <attachment id="19954" name="lola-11-lustre-log-LU-7565-20151217-0125.bz2" size="284" author="heckes" created="Thu, 17 Dec 2015 10:17:20 +0000"/>
                            <attachment id="20213" name="lustre-log.1453899092.15929.bz2" size="3524599" author="heckes" created="Thu, 28 Jan 2016 07:51:46 +0000"/>
                            <attachment id="20207" name="lustre-log.1453899093.15184.bz2" size="81708" author="heckes" created="Wed, 27 Jan 2016 17:38:46 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxvzr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>