<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:12:02 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-953] OST connection lost</title>
                <link>https://jira.whamcloud.com/browse/LU-953</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;After upgrading to Lustre 1.8.6 and OFED 1.5.3.1 we have started to see OST&amp;lt;-&amp;gt;MDT connection issues. &lt;br/&gt;
We have checked the IB fabric for errors and have found none. &lt;br/&gt;
Are there any known issues with Lustre 1.8.6 and OFED 1.5.3?&lt;/p&gt;

&lt;p&gt;=== ERROR ON MDS === &lt;br/&gt;
Dec 28 07:04:56 service100 kernel: Lustre: 6149:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1389011653751232 sent from nbp6-OST0002-osc to NID 10.151.25.157@o2ib 7s ago has timed out (7s prior to deadline). &lt;br/&gt;
Dec 28 07:04:56 service100 kernel: req@ffff81071b30ac00 x1389011653751232/t0 o13-&amp;gt;nbp6-OST0002_UUID@10.151.25.157@o2ib:7/4 lens 192/528 e 0 to 1 dl 1325084696 ref 1 fl Rpc:N/0/0 rc 0/0 &lt;br/&gt;
Dec 28 07:04:56 service100 kernel: Lustre: 6149:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 258 previous similar messages &lt;br/&gt;
Dec 28 07:04:56 service100 kernel: Lustre: nbp6-OST0002-osc: Connection to service nbp6-OST0002 via nid 10.151.25.157@o2ib was lost; in progress operations using this service will wait for recovery to complete.&lt;br/&gt;
Dec 28 07:04:56 service100 kernel: Lustre: Skipped 2 previous similar messages &lt;br/&gt;
Dec 28 07:05:04 service100 kernel: Lustre: 6151:0:(import.c:517:import_select_connection()) nbp6-OST000a-osc: tried all connections, increasing latency to 11s &lt;br/&gt;
Dec 28 07:05:04 service100 kernel: Lustre: 6151:0:(import.c:517:import_select_connection()) Skipped 220 previous similar messages &lt;br/&gt;
Dec 28 07:05:06 service100 kernel: Lustre: nbp6-OST0042-osc: Connection restored to service nbp6-OST0042 using nid 10.151.25.157@o2ib. &lt;br/&gt;
Dec 28 07:05:06 service100 kernel: Lustre: Skipped 14 previous similar messages &lt;br/&gt;
Dec 28 07:05:06 service100 kernel: LustreError: 30626:0:(quota_ctl.c:473:lov_quota_ctl()) ost 75 is inactive &lt;br/&gt;
Dec 28 07:05:06 service100 kernel: LustreError: 30626:0:(quota_ctl.c:473:lov_quota_ctl()) Skipped 5 previous similar messages &lt;br/&gt;
Dec 28 07:05:06 service100 kernel: Lustre: MDS nbp6-MDT0000: nbp6-OST0042_UUID now active, resetting orphans &lt;br/&gt;
Dec 28 07:05:06 service100 kernel: Lustre: Skipped 29 previous similar messages &lt;br/&gt;
Dec 28 07:05:07 service100 kernel: LustreError: 30630:0:(quota_master.c:1698:qmaster_recovery_main()) nbp6-MDT0000: qmaster recovery failed for uid 11631 rc:-11) &lt;br/&gt;
Dec 28 07:05:07 service100 kernel: LustreError: 30630:0:(quota_master.c:1698:qmaster_recovery_main()) Skipped 52 previous similar messages &lt;/p&gt;</description>
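                <!--
                A short sketch of how the OSC connection state can be inspected on the
                MDS while the messages above appear, assuming 1.8-era proc paths; the
                fsname nbp6 comes from the logs, the rest is standard lctl/proc usage.

                # Show which NID each OSC on the MDS is currently connected to:
                cat /proc/fs/lustre/osc/nbp6-OST*/ost_conn_uuid
                # List Lustre devices with their connection status:
                lctl dl
                -->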
                <environment>lustre-1.8.6.81&lt;br/&gt;
OFED 1.5.3.1&lt;br/&gt;
NASA AMES</environment>
        <key id="12766">LU-953</key>
            <summary>OST connection lost</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="liang">Liang Zhen</assignee>
                                    <reporter username="mhanafi">Mahmoud Hanafi</reporter>
                        <labels>
                    </labels>
                <created>Thu, 29 Dec 2011 13:18:59 +0000</created>
                <updated>Thu, 12 Sep 2013 07:57:18 +0000</updated>
                            <resolved>Thu, 12 Sep 2013 07:57:18 +0000</resolved>
                                    <version>Lustre 2.2.0</version>
                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                    <comments>
                            <comment id="25281" author="pjones" created="Thu, 29 Dec 2011 20:35:26 +0000"  >&lt;p&gt;Lsi&lt;/p&gt;

&lt;p&gt;Can you please comment on this issue?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="25362" author="cliffw" created="Tue, 3 Jan 2012 14:31:54 +0000"  >&lt;p&gt;We need to have more information. Can you please attach the MDS system log for the 5 hours prior to this event, and the system logs from the node providing nid 10.151.25.157@o2ib for the same time period. &lt;/p&gt;</comment>
                            <comment id="31980" author="simmonsja" created="Fri, 23 Mar 2012 10:51:12 +0000"  >&lt;p&gt;I also saw the problem Lustre pre-2.2 with OFED 1.5.3 on RHEL5. Upgrading to OFED 1.5.4 made the problem go away for me. Also Lustre pre-2.2 without OFED on rhle6 shows this error but it appears to be a minor probelm. Have you seen sever problems with this?&lt;/p&gt;</comment>
                            <comment id="32198" author="simmonsja" created="Tue, 27 Mar 2012 12:35:10 +0000"  >&lt;p&gt;Okay I just moved to OFED 1.5.4.1 on rhel6 and I still see this issue.&lt;/p&gt;</comment>
                            <comment id="32200" author="simmonsja" created="Tue, 27 Mar 2012 12:41:29 +0000"  >&lt;p&gt;Also I want to comment that this is affecting the stripe placement on our OSSs. For example on our test file system I set the stripe count to 28 which is the total number of OSTs I have, each OSS has 7 OSTs. Doing a lfs getstripe on a file in this case yields&lt;/p&gt;

&lt;p&gt;testfile.out.00000009&lt;br/&gt;
lmm_magic:          0x0BD10BD0&lt;br/&gt;
lmm_seq:            0x2000013aa&lt;br/&gt;
lmm_object_id:      0x16&lt;br/&gt;
lmm_stripe_count:   7&lt;br/&gt;
lmm_stripe_size:    1048576&lt;br/&gt;
lmm_stripe_pattern: 1&lt;br/&gt;
lmm_layout_gen:     0&lt;br/&gt;
lmm_stripe_offset:  12&lt;br/&gt;
        obdidx           objid          objid            group&lt;br/&gt;
            12            6928         0x1b10                0&lt;br/&gt;
            16            6928         0x1b10                0&lt;br/&gt;
            20            7185         0x1c11                0&lt;br/&gt;
            24            7504         0x1d50                0&lt;br/&gt;
             0            7184         0x1c10                0&lt;br/&gt;
             4            6992         0x1b50                0&lt;br/&gt;
             8            6928         0x1b10                0&lt;/p&gt;

&lt;p&gt;All of those object creates happen on one OSS, and this is the case for all files. Mahmoud, can you verify that you are seeing this behavior as well? Because of this, the best performance I get when writing out a file is 250 MB/s versus the 2.5 GB/s I got before. This is a blocker for development in a production environment.&lt;/p&gt;</comment>
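                            <!--
                            A minimal sketch of the placement check described in the comment
                            above, assuming a hypothetical mount point /mnt/testfs and file
                            name; lfs setstripe/getstripe are standard Lustre client tools.

                            # Stripe a new file over every available OST (-c -1 selects all),
                            lfs setstripe -c -1 /mnt/testfs/testfile.out
                            # then list which OST indices (the obdidx column) received objects.
                            lfs getstripe /mnt/testfs/testfile.out

                            With healthy placement, the obdidx values should spread across the
                            OSTs of every OSS, not only the 7 OSTs of one OSS as shown above.
                            -->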
                            <comment id="33552" author="ian" created="Thu, 5 Apr 2012 14:49:16 +0000"  >&lt;p&gt;Also observed in Lustre 2.2 at ORNL.&lt;/p&gt;</comment>
                            <comment id="33622" author="pjones" created="Thu, 5 Apr 2012 21:58:00 +0000"  >&lt;p&gt;Liang&lt;/p&gt;

&lt;p&gt;Could you please help with this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="33777" author="liang" created="Fri, 6 Apr 2012 05:58:59 +0000"  >&lt;p&gt;Did OFED complained anything while you saw this issue? Also, could you please turn on neterror print so we can check whether there is any LNet/o2iblnd problem (echo +1 &amp;gt; /proc/sys/lnet/printk).&lt;br/&gt;
I feel it&apos;s more like an issue in ptlrpc layer, so it would be helpful to get debug log, dmesg and console output from both the MDS and OSS.&lt;/p&gt;

&lt;p&gt;Liang&lt;/p&gt;</comment>
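                            <!--
                            A minimal sketch of gathering the debug data requested above,
                            assuming 1.8-era tooling; the symbolic "neterror" mask name and the
                            dump path are assumptions, not quotes from this ticket.

                            # Enable network-error console messages (the comment above writes
                            # the mask numerically; /proc/sys/lnet/printk also accepts names):
                            echo +neterror > /proc/sys/lnet/printk
                            # Dump the in-kernel Lustre debug buffer to a file for attachment:
                            lctl dk /tmp/lustre-debug.log
                            -->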
                            <comment id="34414" author="simmonsja" created="Tue, 10 Apr 2012 11:32:54 +0000"  >&lt;p&gt;Mahmoud can you try the following patch against your 1.8 source.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#change,1797&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,1797&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For me it seems to have helped. Let me know if it helps with your problem as well. I will be doing more testing on my side. I still have the strange striping pattern, though.&lt;/p&gt;</comment>
                            <comment id="34526" author="simmonsja" created="Wed, 11 Apr 2012 10:38:03 +0000"  >&lt;p&gt;Managed to collect logs for this problem and hand them off to Oleg.&lt;/p&gt;</comment>
                            <comment id="34578" author="green" created="Wed, 11 Apr 2012 22:57:55 +0000"  >&lt;p&gt;Looking at the ORNL logs from yesterday I see that MDS is constantly trying to connect to OST0001 at an address of oss1 (I assume, becaue that&apos;s where the connections end up at reported as &quot;no such OST here&quot;).&lt;br/&gt;
Now the OST0001 is started on oss2 where no connection attempts are made, as the oss1 address is the only one listed for OST0001 apparently.&lt;br/&gt;
As such the problem seems to be some sort of a configuration error&lt;/p&gt;</comment>
                            <comment id="34627" author="simmonsja" created="Thu, 12 Apr 2012 12:44:22 +0000"  >&lt;p&gt;I attached my build scripts to see if it is indeed a config error. I will test with the llmount.sh script as well.&lt;/p&gt;</comment>
                            <comment id="34637" author="simmonsja" created="Thu, 12 Apr 2012 15:30:30 +0000"  >&lt;p&gt;Mahmoud do you format your OSTs with --index=&quot;some number&quot;. We do that at the lab to allow parallel mounting of the OSTs.It appears to be causing problems. I&apos;m going to do a test format without using the index to see if we still have the problems.&lt;/p&gt;</comment>
                            <comment id="35060" author="mhanafi" created="Wed, 18 Apr 2012 19:29:56 +0000"  >&lt;p&gt;Sorry for the late reply.&lt;br/&gt;
Yes, we do use --index; we set it to the OST number.&lt;/p&gt;</comment>
                            <comment id="35089" author="simmonsja" created="Thu, 19 Apr 2012 11:30:54 +0000"  >&lt;p&gt;No problem about the delay. I have done some tracking down of the problem and discovered how to replicate this issue. The problem only shows up when formatting the OST with index=&quot;number&quot;. Whats causes the problem is a mounting order. If you mount MGS &amp;gt; MDS &amp;gt; OSS(s) no problems will show up. If you mount MGS &amp;gt; OSS(s) &amp;gt; MDS then you will experience this problem.&lt;/p&gt;

&lt;p&gt;Now here is an extra bit of info. If you format with an OST index, mount in the MGS &amp;gt; MDS &amp;gt; OSS order, then unmount the file system and remount in the order MGS &amp;gt; OSS(s) &amp;gt; MDS, you will not run into the connection problem. This tells you the problem is wrong data being written to the llog on the MDS. For some reason the data the OSS sends to the OSC layer on the MDS differs depending on whether the MDS signals the OSS to send its configuration data or the OSS sends its configuration data to an MDS that is already available.&lt;/p&gt;</comment>
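                            <!--
                            A reproduction sketch of the two mount orders contrasted above,
                            assuming hypothetical devices and mount points; mount -t lustre is
                            the standard way to start each target.

                            # Order that works: MGS, then MDS, then the OSSs.
                            mount -t lustre /dev/mgs /mnt/mgs
                            mount -t lustre /dev/mdt /mnt/mdt
                            mount -t lustre /dev/ost0 /mnt/ost0   # repeat on each OSS

                            # Order that triggers the bug on index-formatted OSTs: OSSs first.
                            mount -t lustre /dev/mgs /mnt/mgs
                            mount -t lustre /dev/ost0 /mnt/ost0   # repeat on each OSS
                            mount -t lustre /dev/mdt /mnt/mdt
                            -->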
                            <comment id="35103" author="simmonsja" created="Thu, 19 Apr 2012 12:47:40 +0000"  >&lt;p&gt;Another interesting clue is on the MDS if you do a  for i in $(ls /proc/fs/lustre/osc/&lt;b&gt;-OST&lt;/b&gt;/ost_conn_uuid); do cat $i; done  you will see all the NIDS are exactly the same.&lt;/p&gt;</comment>
                            <comment id="47846" author="simmonsja" created="Thu, 15 Nov 2012 10:51:28 +0000"  >&lt;p&gt;Okay I just tested this again on Lustre 2.3.54 and it still exist.&lt;/p&gt;</comment>
                            <comment id="53181" author="simmonsja" created="Thu, 28 Feb 2013 11:41:16 +0000"  >&lt;p&gt;Tested this bug on Lustre 2.3.61 and the problem seems to have been fixed.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11146" name="new-build.sh" size="3853" author="simmonsja" created="Thu, 12 Apr 2012 12:44:39 +0000"/>
                            <attachment id="11147" name="new-lustre_start.sh" size="5591" author="simmonsja" created="Thu, 12 Apr 2012 12:44:39 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvkf3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>7037</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>