<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:14:27 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8077] LLNL BGQ IO-node Lustre Client Not Reconnecting</title>
                <link>https://jira.whamcloud.com/browse/LU-8077</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Lustre Clients:  lustre-client-ion-2.5.4-16chaos_2.6.32_504.8.2.bgq.4blueos.V1R2M3.bl2.2_11.ppc64.ppc64&lt;br/&gt;
Lustre Servers:  lustre-2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64&lt;/p&gt;

&lt;p&gt;On our IBM BGQ system Vulcan at LLNL, the ions have been experiencing what are believed to be repeated OST connection issues affecting user jobs.  Recently, two ions reporting issues have been identified.  The rack has been drained and the ions left as is.  The command &quot;lfs check servers&quot; reports the following errors:&lt;/p&gt;

&lt;p&gt;vulcanio121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable&lt;br/&gt;
vulcanio127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable&lt;/p&gt;

&lt;p&gt;Output from the proc &quot;import&quot; file for the affected OST:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;vulcanio121-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003e09f49c0/import
import:
    name: fsv-OST0017-osc-c0000003e09f49c0
    target: fsv-OST0017_UUID
    state: REPLAY
    connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
    connect_data:
       flags: 0x4af0e3440478
       instance: 45
       target_version: 2.5.5.0
       initial_grant: 2097152
       max_brw_size: 4194304
       grant_block_size: 0
       grant_inode_size: 0
       grant_extent_overhead: 0
       cksum_types: 0x2
       max_easize: 32768
       max_object_bytes: 9223372036854775807
    import_flags: [ replayable, pingable, connect_tried ]
    connection:
       failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
       current_connection: 172.20.20.23@o2ib500
       connection_attempts: 39
       generation: 1
       in-progress_invalidations: 0
    rpcs:
       inflight: 168
       unregistering: 1
       timeouts: 20977
       avg_waittime: 209959 usec
    service_estimates:
       services: 48 sec
       network: 45 sec
    transactions:
       last_replay: 0
       peer_committed: 150323856033
       last_checked: 150323856033
    read_data_averages:
       bytes_per_rpc: 69000
       usec_per_rpc: 4389
       MB_per_sec: 15.72
    write_data_averages:
       bytes_per_rpc: 893643
       usec_per_rpc: 2458
       MB_per_sec: 363.56
vulcanio121-ib0@root: 

vulcanio127-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003c4483300/import 
import:
    name: fsv-OST0017-osc-c0000003c4483300
    target: fsv-OST0017_UUID
    state: REPLAY
    connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
    connect_data:
       flags: 0x4af0e3440478
       instance: 45
       target_version: 2.5.5.0
       initial_grant: 2097152
       max_brw_size: 4194304
       grant_block_size: 0
       grant_inode_size: 0
       grant_extent_overhead: 0
       cksum_types: 0x2
       max_easize: 32768
       max_object_bytes: 9223372036854775807
    import_flags: [ replayable, pingable, connect_tried ]
    connection:
       failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
       current_connection: 172.20.20.23@o2ib500
       connection_attempts: 36
       generation: 1
       in-progress_invalidations: 0
    rpcs:
       inflight: 131
       unregistering: 1
       timeouts: 19341
       avg_waittime: 144395 usec
    service_estimates:
       services: 45 sec
       network: 50 sec
    transactions:
       last_replay: 0
       peer_committed: 150323856116
       last_checked: 150323856116
    read_data_averages:
       bytes_per_rpc: 67548
       usec_per_rpc: 3326
       MB_per_sec: 20.30
    write_data_averages:
       bytes_per_rpc: 913996
       usec_per_rpc: 5909
       MB_per_sec: 154.67
vulcanio127-ib0@root: 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The disconnects appear to have happened when we updated our Lustre cluster.  All other ions reconnected with no problem once the Lustre cluster was back up and running.&lt;/p&gt;</description>
                <environment></environment>
        <key id="36446">LU-8077</key>
            <summary>LLNL BGQ IO-node Lustre Client Not Reconnecting</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                <statusCategory id="3" key="done" colorName="success"/>
                <resolution id="6">Not a Bug</resolution>
                <assignee username="jgmitter">Joseph Gmitter</assignee>
                <reporter username="weems2">Lance Weems</reporter>
                <labels>
                    <label>llnl</label>
                </labels>
                <created>Wed, 27 Apr 2016 23:59:52 +0000</created>
                <updated>Fri, 6 Jul 2018 14:31:34 +0000</updated>
                <resolved>Fri, 6 Jul 2018 14:31:34 +0000</resolved>
                <version>Lustre 2.5.4</version>
                <due></due>
                <votes>0</votes>
                <watches>5</watches>
                <comments>
                            <comment id="150479" author="green" created="Thu, 28 Apr 2016 17:25:30 +0000"  >&lt;p&gt;So it looks like the io nodes reconnect just fine, but then disconnect again?&lt;br/&gt;
What&apos;s the picture on the server concerning those nodes?&lt;/p&gt;</comment>
                            <comment id="150507" author="nedbass" created="Thu, 28 Apr 2016 20:49:02 +0000"  >&lt;p&gt;Oleg, not sure I understand your question. The io nodes disconnected from the OST when it was rebooted and remain in a disconnected state.  The server is up and can lctl ping the io nodes. The server has no export under /proc/fs/lustre for those clients. What else specifically would you like to know about the server?&lt;/p&gt;</comment>
                            <comment id="150603" author="green" created="Fri, 29 Apr 2016 17:23:53 +0000"  >&lt;p&gt;Oh, it was not apparent from the logs that there were multiple clients referenced.&lt;/p&gt;

&lt;p&gt;So to reconstruct the picture better: the io nodes in question disconnected from OSTs; do they then try to reconnect and fail, or not try to reconnect at all?&lt;br/&gt;
Can we get a better log snippet from one of the io nodes only from before the moment of reconnection through several reconnect attempts that fail?&lt;br/&gt;
When reconnect happens - does the server see it, any reporting about this client from the server?&lt;/p&gt;</comment>
                            <comment id="150610" author="nedbass" created="Fri, 29 Apr 2016 19:15:52 +0000"  >&lt;p&gt;I only see the connection lost message on the clients:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@vulcansn1:~]# grep OST0017 /bgsys/logs/BGQ.sn/R17-ID-J00.log
2016-04-14 11:01:36.246847 {RMP14Ap004250453} [mmcs]{128}.15.1: Lustre: fsv-OST0017-osc-c0000003e09f49c0: Connection to fsv-OST0017 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
[root@vulcansn1:~]# grep OST0017 /bgsys/logs/BGQ.sn/R17-ID-J06.log
2016-04-14 11:01:44.047027 {RMP14Ap004250453} [mmcs]{134}.1.1: Lustre: fsv-OST0017-osc-c0000003c4483300: Connection to fsv-OST0017 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The server didn&apos;t log anything about connection attempts from these clients.  I&apos;ll also note that both disconnected clients had non-zero read RPCs in flight in the rpc_stats file.  Both were stuck in the REPLAY state.  The clients have since been shut down.  If I recall correctly, the state history looked something like this:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;current_state: REPLAY
state_history:
 - [ 1460666257, CONNECTING ]
 - [ 1460666307, DISCONN ]
 - [ 1460666307, CONNECTING ]
 - [ 1460666332, DISCONN ]
 - [ 1460666332, CONNECTING ]
 - [ 1460666382, DISCONN ]
 - [ 1460666382, CONNECTING ]
 - [ 1460666417, DISCONN ]
 - [ 1460666432, CONNECTING ]
 - [ 1460666487, DISCONN ]
 - [ 1460666507, CONNECTING ]
 - [ 1460666507, REPLAY ]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="150616" author="morrone" created="Fri, 29 Apr 2016 20:34:49 +0000"  >&lt;p&gt;I glanced at the clients, and I believe that they both had started replay but didn&apos;t finish, and then got stuck on the subsequent replay.&lt;/p&gt;</comment>
                            <comment id="150652" author="green" created="Sat, 30 Apr 2016 14:52:18 +0000"  >&lt;p&gt;Hm, I wonder if it was one of the recently uncovered bulk or RPC state hangups like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7434&quot; title=&quot;lost bulk leads to a hang&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7434&quot;&gt;&lt;del&gt;LU-7434&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="150717" author="green" created="Mon, 2 May 2016 17:05:05 +0000"  >&lt;p&gt;Next time something like this happens, it would be great if you can get backtraces from Lustre threads to see what they were stuck on.&lt;/p&gt;</comment>
                            <comment id="150751" author="morrone" created="Mon, 2 May 2016 18:49:50 +0000"  >&lt;p&gt;I looked at the backtraces on one of the nodes and didn&apos;t see any lustre threads stuck on anything.  It looked like all were just in their normal idle waiting states.&lt;/p&gt;</comment>
                            <comment id="150759" author="jfc" created="Mon, 2 May 2016 20:20:37 +0000"  >&lt;p&gt;Hello Lance and team,&lt;/p&gt;

&lt;p&gt;Is there any more work required on this ticket? Or can we go ahead and mark it as resolved?&lt;/p&gt;

&lt;p&gt;Many thanks,&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="150761" author="nedbass" created="Mon, 2 May 2016 20:26:45 +0000"  >&lt;p&gt;Please keep the ticket open until the underlying bug has been fixed.&lt;/p&gt;</comment>
                            <comment id="150776" author="green" created="Mon, 2 May 2016 23:56:02 +0000"  >&lt;p&gt;Chris, it must have been doing something somewhere, though;&lt;br/&gt;
was there an ll_imp_inval, for example?&lt;/p&gt;

&lt;p&gt;Would be great if you can get a sysrq-t (or an equivalent from a crashdump) should something like this happen again.&lt;/p&gt;</comment>
                            <comment id="150787" author="morrone" created="Tue, 3 May 2016 00:00:09 +0000"  >&lt;p&gt;I did a &quot;foreach bt&quot; from crash.  That is easier on these nodes than doing sysrq-t.  Console access is possible, but a pain.  And I don&apos;t believe that these nodes have been set up to get crash dumps.  But at least crash works on a live system.  That is much better than what we had on previous BG systems.&lt;/p&gt;

&lt;p&gt;I didn&apos;t know the nodes were being rebooted without getting more info, or I would have saved it myself.  I agree, more needs to be gathered next time this happens.&lt;/p&gt;</comment>
                            <comment id="150799" author="green" created="Tue, 3 May 2016 02:05:04 +0000"  >&lt;p&gt;On the off chance can you see if any of that info is still there in the &quot;xterm&quot; scroll buffer somewhere from the time you did the bt?&lt;/p&gt;</comment>
                            <comment id="229286" author="jgmitter" created="Thu, 7 Jun 2018 14:05:03 +0000"  >&lt;p&gt;Can we close this issue?&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Mon, 9 May 2016 23:59:52 +0000</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10040" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic</customfieldname>
                        <customfieldvalues>
                            <label>client</label>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzy9nb:</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 27 Apr 2016 23:59:52 +0000</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>