<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:09:35 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-701] parallel-scale test_write_disjoint fails due to invalid file size</title>
                <link>https://jira.whamcloud.com/browse/LU-701</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;During v2_1_0_0_RC2 testing, write_disjoint hit MPI_ABORT for an unknown reason. The report contains no console or syslog output at all (possibly a Maloo bug?).&lt;/p&gt;

&lt;p&gt;Report: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/44dc4934-e440-11e0-9909-52540025f9af&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/44dc4934-e440-11e0-9909-52540025f9af&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;== parallel-scale test write_disjoint: write_disjoint == 14:43:05 (1316554985)&lt;br/&gt;
OPTIONS:&lt;br/&gt;
WRITE_DISJOINT=/usr/lib64/lustre/tests/write_disjoint&lt;br/&gt;
clients=fat-intel-1vm1,fat-intel-1vm2&lt;br/&gt;
wdisjoint_THREADS=4&lt;br/&gt;
wdisjoint_REP=10000&lt;br/&gt;
MACHINEFILE=/tmp/parallel-scale.machines&lt;br/&gt;
fat-intel-1vm1&lt;br/&gt;
fat-intel-1vm2&lt;br/&gt;
+ /usr/lib64/lustre/tests/write_disjoint -f /mnt/lustre/d0.write_disjoint/file -n 10000&lt;br/&gt;
UUID                      Inodes       IUsed       IFree IUse% Mounted on&lt;br/&gt;
lustre-MDT0000_UUID      5000040          86     4999954   0% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;MDT:0&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0000_UUID       167552       10974      156578   7% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:0&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0001_UUID       167552       11326      156226   7% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:1&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0002_UUID       167552        3807      163745   2% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:2&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0003_UUID       167552        4830      162722   3% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:3&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0004_UUID       167552        3806      163746   2% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:4&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0005_UUID       167552        3646      163906   2% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:5&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0006_UUID       167552        3806      163746   2% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:6&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;filesystem summary:      5000040          86     4999954   0% /mnt/lustre&lt;/p&gt;

&lt;p&gt;+ chmod 0777 /mnt/lustre&lt;br/&gt;
drwxrwxrwx 7 root root 4096 Sep 20 14:43 /mnt/lustre&lt;br/&gt;
+ su mpiuser sh -c &quot;/usr/lib64/openmpi/bin/mpirun -mca boot ssh  -mca btl tcp,self -np 8 -machinefile /tmp/parallel-scale.machines /usr/lib64/lustre/tests/write_disjoint -f /mnt/lustre/d0.write_disjoint/file -n 10000 &quot;&lt;br/&gt;
loop 0: chunk_size 103399&lt;br/&gt;
loop 1000: chunk_size 69125&lt;br/&gt;
loop 2000: chunk_size 104360&lt;br/&gt;
loop 3000: chunk_size 11295&lt;br/&gt;
loop 4000: chunk_size 77918&lt;br/&gt;
loop 5000: chunk_size 27295&lt;br/&gt;
loop 6000: chunk_size 42065&lt;br/&gt;
loop 7000: chunk_size 82749&lt;br/&gt;
loop 8000: chunk_size 94370&lt;br/&gt;
loop 9000: chunk_size 107226&lt;br/&gt;
loop 9371: chunk_size 25301, file size was 202408&lt;br/&gt;
rank 5, loop 9372: invalid file size 801136 instead of 915584 = 114448 * 8&lt;br/&gt;
--------------------------------------------------------------------------&lt;br/&gt;
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD &lt;br/&gt;
with errorcode -1.&lt;/p&gt;

&lt;p&gt;NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.&lt;br/&gt;
You may or may not see output from other processes, depending on&lt;br/&gt;
exactly when Open MPI kills them.&lt;br/&gt;
--------------------------------------------------------------------------&lt;br/&gt;
--------------------------------------------------------------------------&lt;br/&gt;
mpirun has exited due to process rank 5 with PID 30944 on&lt;br/&gt;
node fat-intel-1vm2 exiting without calling &quot;finalize&quot;. This may&lt;br/&gt;
have caused other processes in the application to be&lt;br/&gt;
terminated by signals sent by mpirun (as reported here).&lt;br/&gt;
--------------------------------------------------------------------------&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;fat-intel-1vm2.lab.whamcloud.com&amp;#93;&lt;/span&gt;[&lt;span class=&quot;error&quot;&gt;&amp;#91;61908,1&amp;#93;&lt;/span&gt;,7]&lt;span class=&quot;error&quot;&gt;&amp;#91;btl_tcp_frag.c:216:mca_btl_tcp_frag_recv&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;fat-intel-1vm1.lab.whamcloud.com&amp;#93;&lt;/span&gt;[&lt;span class=&quot;error&quot;&gt;&amp;#91;61908,1&amp;#93;&lt;/span&gt;,4]&lt;span class=&quot;error&quot;&gt;&amp;#91;btl_tcp_frag.c:216:mca_btl_tcp_frag_recv&amp;#93;&lt;/span&gt; mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)&lt;br/&gt;
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)&lt;br/&gt;
UUID                      Inodes       IUsed       IFree IUse% Mounted on&lt;br/&gt;
lustre-MDT0000_UUID      5000040          87     4999953   0% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;MDT:0&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0000_UUID       167552       10974      156578   7% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:0&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0001_UUID       167552       11326      156226   7% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:1&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0002_UUID       167552        3806      163746   2% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:2&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0003_UUID       167552        4830      162722   3% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:3&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0004_UUID       167552        3806      163746   2% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:4&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0005_UUID       167552        3646      163906   2% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:5&amp;#93;&lt;/span&gt;&lt;br/&gt;
lustre-OST0006_UUID       167552        3806      163746   2% /mnt/lustre&lt;span class=&quot;error&quot;&gt;&amp;#91;OST:6&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;filesystem summary:      5000040          87     4999953   0% /mnt/lustre&lt;/p&gt;

&lt;p&gt; parallel-scale test_write_disjoint: @@@@@@ FAIL: write_disjoint failed! 1 &lt;br/&gt;
Dumping lctl log to /logdir/test_logs/2011-09-19/lustre-mixed-el6-x86_64_&lt;em&gt;283&lt;/em&gt;_-7f6a2ad2c9e0/parallel-scale.test_write_disjoint.*.1316557553.log&lt;br/&gt;
Resetting fail_loc on all nodes...done.&lt;/p&gt;</description>
                <environment>Lustre Clients: &lt;br/&gt;
Tag: 1.8.6-wc1 &lt;br/&gt;
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32_131.2.1.el6) &lt;br/&gt;
Build: &lt;a href=&quot;http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/&quot;&gt;http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/&lt;/a&gt; &lt;br/&gt;
Network: TCP&lt;br/&gt;
ENABLE_QUOTA=yes &lt;br/&gt;
&lt;br/&gt;
Lustre Servers: &lt;br/&gt;
Tag: v2_1_0_0_RC2 &lt;br/&gt;
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.6.1.el6_lustre.g65156ed.x86_64)&lt;br/&gt;
Build: &lt;a href=&quot;http://newbuild.whamcloud.com/job/lustre-master/228/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/&quot;&gt;http://newbuild.whamcloud.com/job/lustre-master/228/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/&lt;/a&gt; &lt;br/&gt;
Network: TCP</environment>
        <key id="11875">LU-701</key>
            <summary>parallel-scale test_write_disjoint fails due to invalid file size</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="mdiep">Minh Diep</reporter>
                        <labels>
                    </labels>
                <created>Wed, 21 Sep 2011 15:11:04 +0000</created>
                <updated>Fri, 6 Sep 2013 16:50:18 +0000</updated>
                            <resolved>Fri, 6 Sep 2013 16:50:18 +0000</resolved>
                                    <version>Lustre 2.1.0</version>
                    <version>Lustre 2.4.0</version>
                    <version>Lustre 1.8.7</version>
                    <version>Lustre 2.5.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="20444" author="yujian" created="Fri, 23 Sep 2011 04:45:12 +0000"  >&lt;p&gt;Lustre Clients:&lt;br/&gt;
Tag: 1.8.6-wc1&lt;br/&gt;
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32_131.2.1.el6)&lt;br/&gt;
Build: &lt;a href=&quot;http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/&lt;/a&gt;&lt;br/&gt;
Network: TCP (1GigE)&lt;br/&gt;
ENABLE_QUOTA=yes&lt;/p&gt;

&lt;p&gt;Lustre Servers:&lt;br/&gt;
Tag: v2_1_0_0_RC2&lt;br/&gt;
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.6.1.el6_lustre)&lt;br/&gt;
Build: &lt;a href=&quot;http://newbuild.whamcloud.com/job/lustre-master/283/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://newbuild.whamcloud.com/job/lustre-master/283/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;write_disjoint test passed in manual run: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/af1b916c-e5bf-11e0-9909-52540025f9af&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/af1b916c-e5bf-11e0-9909-52540025f9af&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="59707" author="adilger" created="Thu, 30 May 2013 22:47:21 +0000"  >&lt;p&gt;I just noticed in the &quot;full&quot; runs that test_write_disjoint is one of the few tests that is consistently failing, and this bug is listed as the cause.&lt;/p&gt;

&lt;p&gt;The MPI_ABORT is not the &lt;em&gt;cause&lt;/em&gt; of this problem, just a symptom.  When write_disjoint detects a data consistency error it prints an error message and then calls MPI_Abort() to exit.&lt;/p&gt;
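
&lt;p&gt;As a rough illustration of that check, here is a minimal Python sketch of the comparison implied by the error message. It is not the actual write_disjoint source: the function name &lt;tt&gt;check_file_size&lt;/tt&gt; and its parameters are hypothetical, and the sample values are copied from the failing log line in the description (rank 5, loop 9372, 8 ranks).&lt;/p&gt;

```python
def check_file_size(rank, loop, actual_size, chunk_size, nprocs):
    """Return an error message if the file size is inconsistent, else None.

    Assumption: every rank writes one chunk per loop, so the expected
    file size is chunk_size * nprocs (as the log message suggests).
    """
    expected = chunk_size * nprocs
    if actual_size != expected:
        return (f"rank {rank}, loop {loop}: invalid file size {actual_size} "
                f"instead of {expected} = {chunk_size} * {nprocs}")
    return None


# Values taken from the failing run in the description.
print(check_file_size(rank=5, loop=9372, actual_size=801136,
                      chunk_size=114448, nprocs=8))

# How many whole chunks does the reported size correspond to?
print(divmod(801136, 114448))  # (7, 0)
```

&lt;p&gt;With these values the reported size 801136 is exactly 7 * 114448, that is, one chunk short of the expected 8 chunks.&lt;/p&gt;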

&lt;p&gt;The real problem is that either the output file is not being written correctly or the DLM locks are caching the file size incorrectly, resulting in an inconsistent file size being reported to the application:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;loop 90: chunk_size 62460, file size was 499680
rank 4, loop 91: invalid file size 203532 instead of 232608 = 29076 * 8
rank 2, loop 91: invalid file size 203532 instead of 232608 = 29076 * 8
rank 6, loop 91: invalid file size 203532 instead of 232608 = 29076 * 8
rank 0, loop 91: invalid file size 203532 instead of 232608 = 29076 * 8
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this case MPI_ABORT is expected, and we need to find out why the test is failing.  For good or bad, it seems like it fails virtually every test run, so it will hopefully not be too complex to debug.  Almost certainly we would need to gather more debug logs from the client nodes (&lt;tt&gt;lctl set_param debug=&quot;+vfstrace +rpctrace +dlmtrace&quot;&lt;/tt&gt; at a minimum).&lt;/p&gt;</comment>
                            <comment id="65965" author="adilger" created="Fri, 6 Sep 2013 16:50:18 +0000"  >&lt;p&gt;Duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3027&quot; title=&quot;Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3027&quot;&gt;&lt;del&gt;LU-3027&lt;/del&gt;&lt;/a&gt;, which has a landed patch.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="18086">LU-3027</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvc0n:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>5514</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>