<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:30:38 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
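A full request illustrating this might look as follows (the '/si/jira.issueviews:issue-xml/KEY/KEY.xml' path is the common JIRA XML-view URL pattern and is shown here as an assumption, not taken from this file):

```
https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-3062/LU-3062.xml?field=key&field=summary
```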
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3062] Multiple clients writing to the same file caused mpi application to fail</title>
                <link>https://jira.whamcloud.com/browse/LU-3062</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;After we upgraded our clients from 2.1.3 to 2.3.0, some users (a growing number) started seeing their applications fail, hang, or even crash. The servers run 2.1.4. In all cases, the same applications ran OK with 2.1.3.&lt;/p&gt;

&lt;p&gt;Since we do not have a reproducer for the hang and crash cases, we attach here a reproducer that can cause the application to fail. The tests were executed with stripe counts of 1, 2, 4, 8, and 16; the higher the stripe count, the more likely the application is to fail.&lt;/p&gt;

&lt;p&gt;&apos;reproducer1.scr&apos; is a PBS script that starts 1024 MPI tests.&lt;br/&gt;
&apos;reproducer1.scr.o1000145&apos; is the PBS output of the execution.&lt;br/&gt;
&apos;1000145.pbspl1.0.log.txt&apos; is the output of one of our tools that collects /var/log/messages from the servers and clients related to the specified job.&lt;/p&gt;

&lt;p&gt;The PBS-specific argument lines start with the &quot;#PBS &quot; string and are ignored if the script is executed without PBS. The script uses SGI MPT, but can be converted to OpenMPI or Intel MPI.&lt;/p&gt;</description>
                <environment>Lustre server 2.1.4 centos 6.3&lt;br/&gt;
Lustre clients 2.3.0 sles11sp1</environment>
        <key id="18157">LU-3062</key>
            <summary>Multiple clients writing to the same file caused mpi application to fail</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="jaylan">Jay Lan</reporter>
                        <labels>
                            <label>ptr</label>
                    </labels>
                <created>Fri, 29 Mar 2013 00:54:55 +0000</created>
                <updated>Thu, 8 Sep 2016 21:33:28 +0000</updated>
                            <resolved>Thu, 8 Sep 2016 21:33:28 +0000</resolved>
                                    <version>Lustre 2.3.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="55073" author="green" created="Fri, 29 Mar 2013 05:52:43 +0000"  >&lt;p&gt;Ok, so from the logs we can see the client was evicted by the server for some reason.&lt;br/&gt;
Now, why it was evicted is not clear, because there seem to be no server logs included, but I imagine it&apos;s due to AST timeouts. We included multiple patches in 2.4 to help with this.&lt;/p&gt;

&lt;p&gt;In addition to that I cannot stop wondering about this message:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Thu Mar 28 12:43:49 2013 M r325i4n13 kernel: [569506.266405] pcieport 0000:00:02.0: AER: Multiple Corrected error received: id=0010
Thu Mar 28 12:43:49 2013 M r325i4n13 kernel: [569506.266572] pcieport 0000:00:02.0: PCIE Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Receiver ID)
Thu Mar 28 12:43:49 2013 M r325i4n13 kernel: [569506.276986] pcieport 0000:00:02.0:   device [8086:3c04] error status/mask=00000040/00002000
Thu Mar 28 12:43:49 2013 M r325i4n13 kernel: [569506.285434] pcieport 0000:00:02.0:    [ 6] Bad TLP            
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;hopefully it did not result in any dropped messages.&lt;/p&gt;</comment>
                            <comment id="55111" author="jaylan" created="Fri, 29 Mar 2013 18:20:50 +0000"  >&lt;p&gt;About the &quot;PCIE Bus Error: severity=Corrected&quot; errors, I checked with our admins; they said it was normal and not indicative of an IB problem.&lt;/p&gt;

&lt;p&gt;I will collect and provide logs from the servers.&lt;/p&gt;

&lt;p&gt;BTW, the test was run using 1024 CPUs distributed across 64 nodes. However, I was able to reproduce the problem with only 4 Sandy Bridge nodes, 4*16=64 processes.&lt;/p&gt;</comment>
                            <comment id="55113" author="jaylan" created="Fri, 29 Mar 2013 18:34:05 +0000"  >&lt;p&gt;The PCIe corrected errors seem to be related to Sandy Bridge PCIe 3.0. We have seen tens of thousands of those errors a day. However, the same applications did not fail when run with 2.1.3 clients.&lt;/p&gt;

&lt;p&gt;Can you identify those AST timeout patches? Are they client side? Note that we run 2.1.4 on the servers. Is there any issue with the 2.1.4 server + 2.3.0 client combination?&lt;/p&gt;</comment>
                            <comment id="55119" author="jaylan" created="Fri, 29 Mar 2013 19:11:53 +0000"  >&lt;p&gt;This tarball contains syslog between &lt;span class=&quot;error&quot;&gt;&amp;#91;Thu Mar 28 12:00:00 2013&amp;#93;&lt;/span&gt; and &lt;span class=&quot;error&quot;&gt;&amp;#91;Thu Mar 28 13:00:00 2013&amp;#93;&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;service160 is mds/mgs. The rest are oss&apos;es.&lt;/p&gt;</comment>
                            <comment id="55201" author="jay" created="Mon, 1 Apr 2013 17:21:19 +0000"  >&lt;p&gt;Hi Jay Lan,&lt;/p&gt;

&lt;p&gt;What does &quot;write(9999) 66&quot; mean in the reproducer? I mean, how much data does it write to the file with this command?&lt;/p&gt;

&lt;p&gt;Can you please collect lustre logs on the client and server side while running the reproducer?&lt;/p&gt;</comment>
                            <comment id="55206" author="jaylan" created="Mon, 1 Apr 2013 17:47:24 +0000"  >&lt;p&gt;(pbspl1,241) od -x fort.9999&lt;br/&gt;
0000000 0004 0000 0042 0000 0004 0000&lt;br/&gt;
0000014&lt;br/&gt;
(pbspl1,242) ls -l fort.9999&lt;br/&gt;
-rw-r--r-- 1 jlan g1099 12 Mar 28 18:39 fort.9999&lt;br/&gt;
(pbspl1,243) &lt;/p&gt;

&lt;p&gt;It is 12 bytes.&lt;/p&gt;</comment>
                            <comment id="55210" author="jaylan" created="Mon, 1 Apr 2013 18:10:39 +0000"  >&lt;p&gt;Hi Jinshan,&lt;/p&gt;

&lt;p&gt;The client side logs are in 1000145.pbspl1.0.log.txt. You may want to filter out the PBS information. nbp2-server-logs.LU-3062 is the tarball of all the server side logs:&lt;/p&gt;

&lt;p&gt;linux39.jlan 109&amp;gt; tar -tzf nbp2-server-logs.LU-3062&lt;br/&gt;
service160&lt;br/&gt;
service161&lt;br/&gt;
service162&lt;br/&gt;
service163&lt;br/&gt;
service164&lt;br/&gt;
service165&lt;br/&gt;
service166&lt;br/&gt;
service167&lt;br/&gt;
service168&lt;/p&gt;

&lt;p&gt;Service160 is the mds/mgs, the rest are oss.&lt;/p&gt;</comment>
                            <comment id="55214" author="jaylan" created="Mon, 1 Apr 2013 18:40:50 +0000"  >&lt;p&gt;This is a shortened version of 1000145.pbspl1.0.log.txt with the pbs_mom and epilogue messages removed.&lt;/p&gt;</comment>
                            <comment id="55215" author="jay" created="Mon, 1 Apr 2013 18:40:59 +0000"  >&lt;p&gt;Hi Jay Lan,&lt;/p&gt;

&lt;p&gt;I already took a look at those files, but I need more detailed information. Can you please turn on more debug options, especially LNET, on both the client and server side and collect the logs again? The most interesting thing is that the clients even lost their connection to the MGS, which is not involved in the IO path at all. If I guess correctly, this is likely an LNET problem, but I&apos;d like to make that clear before pointing my finger at others.&lt;/p&gt;

&lt;p&gt;Do you know if f90 opens the file with O_APPEND, and whether &quot;write(9999) 66&quot; just writes 66 bytes to the file?&lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;</comment>
                            <comment id="55217" author="jaylan" created="Mon, 1 Apr 2013 19:00:35 +0000"  >&lt;p&gt;I will try to reproduce the problem with increased debugging.&lt;/p&gt;

&lt;p&gt;The f90 program does not open the file with O_APPEND. All instances write 12 bytes to the file. The content of the file:&lt;br/&gt;
0000000 0004 0000 0042 0000 0004 0000&lt;br/&gt;
contains three 4-byte words. The first and last words are probably the record envelope. The second word is the hex value of the number &quot;66&quot;, not 66 bytes. The program just writes the number &quot;66&quot; to the output file.&lt;/p&gt;</comment>
                            <comment id="55226" author="qm137" created="Mon, 1 Apr 2013 20:02:07 +0000"  >&lt;p&gt;The original reproducer had a bug in it.&lt;/p&gt;</comment>
                            <comment id="55232" author="qm137" created="Mon, 1 Apr 2013 21:08:06 +0000"  >&lt;p&gt;Client debug logs attached. Server logs will be tougher to get; we may have to switch to our test filesystem to get that to work. Please look at the client side logs and determine whether you still want me to get the server logs.&lt;/p&gt;</comment>
                            <comment id="55233" author="jaylan" created="Mon, 1 Apr 2013 21:30:27 +0000"  >&lt;p&gt;Is there an interop issue between the 2.3.0 client and the 2.1.4 server? Does any change in the 2.3.0 client require the same change on the server?&lt;/p&gt;</comment>
                            <comment id="55237" author="qm137" created="Mon, 1 Apr 2013 22:23:07 +0000"  >&lt;p&gt;Full debug was turned on this time. The debug logs were over 500MB, and JIRA has only a 10MB limit, so I took the portion of the log that made sense, showing the OST disconnect. Let me know if you want the whole client logs and we can figure out how to get them to you.&lt;/p&gt;</comment>
                            <comment id="55322" author="qm137" created="Tue, 2 Apr 2013 18:30:22 +0000"  >&lt;p&gt;Uploading the full debug log file. I split the file first, then compressed each segment with bzip2 to get below the 10MB maximum file size.&lt;br/&gt;
The order of the files is xaa, xab, xac, xad, xae.&lt;/p&gt;</comment>
                            <comment id="59712" author="pjones" created="Thu, 30 May 2013 23:41:57 +0000"  >&lt;p&gt;Bobbie&lt;/p&gt;

&lt;p&gt;Could you please set up the reproducer supplied on April 1st above?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="60099" author="bobbielind" created="Thu, 6 Jun 2013 16:18:31 +0000"  >&lt;p&gt;I have received my account on Rosso and expect to complete the testing over the coming week.&lt;/p&gt;</comment>
                            <comment id="60721" author="adilger" created="Fri, 14 Jun 2013 22:38:27 +0000"  >&lt;p&gt;So, just to clarify, the problem here is that the reproducer program is starting 1024 tasks to write 12 bytes to the same offset=0 of the same file (striped over 16 OSTs?), and there is a lot of contention?  Or am I misunderstanding and each thread will write to non-overlapping ranges of the file (i.e. like O_APPEND)?&lt;/p&gt;

&lt;p&gt;This isn&apos;t terribly surprising, because either case is a pathologically bad IO pattern.  If they are all writing to the same offset it is completely serialized by the locking, while using O_APPEND actually gets worse with increasing numbers of stripes, since it needs to lock all stripes to get the current file size.&lt;/p&gt;

&lt;p&gt;Do you have any idea what the application is actually trying to accomplish with these overlapping writes?  Is there any chance to modify the application to do (whatever it is trying to do) in a more sensible manner?  Depending on what the application is actually trying to accomplish, there may be many more filesystem-friendly ways of doing this.&lt;/p&gt;

&lt;p&gt;The servers should definitely not fail in this case, though I can imagine that the clients might time out waiting for their chance to overwrite the same bytes again.  The clients should reconnect and complete the writes, however.&lt;/p&gt;

&lt;p&gt;It might be possible to optimize pathological cases like this by using OST-side locking for the RPCs, though there is still a difficulty with sending sub-page writes that also need to be handled.&lt;/p&gt;</comment>
                            <comment id="60898" author="qm137" created="Thu, 20 Jun 2013 00:47:18 +0000"  >&lt;p&gt;Sent Andreas email on 18th, but didn&apos;t add to Jira.&lt;/p&gt;

&lt;p&gt;Hey Andreas,&lt;/p&gt;

&lt;p&gt;Our position is that users doing this type of work should&lt;br/&gt;
not cause an eviction.  We agree that it is sub-optimal&lt;br/&gt;
at best (we have a different term for it: stupid), but&lt;br/&gt;
our users continue to do it.  There are various reasons&lt;br/&gt;
why users can&apos;t/won&apos;t change their code here at NASA.&lt;br/&gt;
Word from management is that we need to get this fixed.&lt;br/&gt;
I&apos;ve copied our local Lustre team in case anyone has&lt;br/&gt;
anything else to add.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;jdk&lt;/p&gt;</comment>
                            <comment id="61034" author="bobbielind" created="Fri, 21 Jun 2013 20:03:46 +0000"  >&lt;p&gt;I&apos;m attaching the logs from running the reproducer on the OpenSFS cluster. It was a 16-node test: 1 MDS, 2 OSSes.&lt;/p&gt;</comment>
                            <comment id="61459" author="jay" created="Thu, 27 Jun 2013 18:21:01 +0000"  >&lt;p&gt;It seems that this is due to a deadlock on the client side.&lt;/p&gt;

&lt;p&gt;Is it possible to do an experiment as follows:&lt;br/&gt;
1. write 66 to the shared file as usual;&lt;br/&gt;
2. if the write can&apos;t finish within 5 minutes (this applies on the NASA cluster because the lock timeout was 1900 seconds; at our local site we should make it shorter, say 60 seconds), the script will dump the stack traces of the processes running on the node with &apos;echo t &amp;gt; /proc/sysrq-trigger&apos;.&lt;/p&gt;</comment>
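Jinshan's step 2 amounts to a write-with-watchdog. A rough sketch (editor's illustration; the reproducer command and timeout are placeholders, and writing 't' to /proc/sysrq-trigger requires root on a Linux node):

```python
import subprocess

def run_with_watchdog(cmd, timeout=60):
    """Run cmd; if it exceeds timeout seconds, dump all task stacks via sysrq-t."""
    try:
        subprocess.run(cmd, timeout=timeout, check=True)
    except subprocess.TimeoutExpired:
        # Equivalent of: echo t > /proc/sysrq-trigger  (root only)
        with open("/proc/sysrq-trigger", "w") as f:
            f.write("t")
        raise
```

The stack dumps land in the kernel log, which is what the comment asks to collect alongside the Lustre debug logs.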
                            <comment id="61460" author="jay" created="Thu, 27 Jun 2013 18:53:59 +0000"  >&lt;p&gt;After taking a further look, this should be a problem with truncate, because the test program opens the file with O_TRUNC. That is why more stripes make the problem more likely.&lt;/p&gt;

&lt;p&gt;Can you please try this patch: &lt;a href=&quot;http://review.whamcloud.com/5208&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/5208&lt;/a&gt; to see if it can help?&lt;/p&gt;</comment>
                            <comment id="62571" author="jaylan" created="Thu, 18 Jul 2013 21:51:34 +0000"  >&lt;p&gt;Jim Karellas previously tested a 2.4 client (against a 2.1.5 server) and was&lt;br/&gt;
still able to reproduce the problem with ease. Since 2.4 contains the &lt;a href=&quot;http://review.whamcloud.com/5208&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/5208&lt;/a&gt; patch that Jinshan mentioned and wanted us to test, I think we need to feed the test result back.&lt;/p&gt;</comment>
                            <comment id="165409" author="jaylan" created="Thu, 8 Sep 2016 21:32:40 +0000"  >&lt;p&gt;We have not seen this for a long while and we are running 2.7, so please close it.&lt;/p&gt;</comment>
                            <comment id="165411" author="pjones" created="Thu, 8 Sep 2016 21:33:28 +0000"  >&lt;p&gt;ok Jay&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="12444" name="1000145.pbspl1.0.log.txt" size="232439" author="jaylan" created="Fri, 29 Mar 2013 00:54:55 +0000"/>
                            <attachment id="12454" name="1000145.pbspl1.0.log.txt.-pbs" size="26509" author="jaylan" created="Mon, 1 Apr 2013 18:40:50 +0000"/>
                            <attachment id="13078" name="lu-3062-reproducer-logs.tgz" size="249" author="bobbielind" created="Fri, 21 Jun 2013 20:03:46 +0000"/>
                            <attachment id="12452" name="nbp2-server-logs.LU-3062" size="5539" author="jaylan" created="Fri, 29 Mar 2013 19:11:53 +0000"/>
                            <attachment id="12445" name="reproducer1.scr" size="785" author="jaylan" created="Fri, 29 Mar 2013 00:54:55 +0000"/>
                            <attachment id="12446" name="reproducer1.scr.o1000145" size="3841" author="jaylan" created="Fri, 29 Mar 2013 00:54:55 +0000"/>
                            <attachment id="12455" name="reproducer2.scr" size="775" author="qm137" created="Mon, 1 Apr 2013 20:02:07 +0000"/>
                            <attachment id="12457" name="reproducer_debug_r311i1n10_log" size="56849" author="qm137" created="Mon, 1 Apr 2013 21:08:06 +0000"/>
                            <attachment id="12456" name="reproducer_debug_r311i1n9_log" size="57545" author="qm137" created="Mon, 1 Apr 2013 21:08:06 +0000"/>
                            <attachment id="12458" name="reproducer_full_debug_log" size="2106452" author="qm137" created="Mon, 1 Apr 2013 22:23:07 +0000"/>
                            <attachment id="12468" name="reproducer_full_debug_xaa.bz2" size="254" author="qm137" created="Tue, 2 Apr 2013 18:30:22 +0000"/>
                            <attachment id="12466" name="reproducer_full_debug_xab.bz2" size="5239697" author="qm137" created="Tue, 2 Apr 2013 18:30:22 +0000"/>
                            <attachment id="12467" name="reproducer_full_debug_xac.bz2" size="254" author="qm137" created="Tue, 2 Apr 2013 18:30:22 +0000"/>
                            <attachment id="12469" name="reproducer_full_debug_xad.bz2" size="254" author="qm137" created="Tue, 2 Apr 2013 18:30:22 +0000"/>
                            <attachment id="12465" name="reproducer_full_debug_xae.bz2" size="3165002" author="qm137" created="Tue, 2 Apr 2013 18:30:22 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Thu, 18 Jul 2013 00:54:55 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvmmv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>7461</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 29 Mar 2013 00:54:55 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>