<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:20:14 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8750] Wrong obd_timeout on the client when we have 2 or more lustre fs</title>
                <link>https://jira.whamcloud.com/browse/LU-8750</link>
                <project id="10000" key="LU">Lustre</project>
<description>&lt;p&gt;When we mount 2 or more Lustre filesystems on a client, the client&apos;s obd_timeout becomes the maximum of all the servers&apos; obd_timeout values. In some cases this can lead to client evictions, because one of the servers does not wait long enough for the obd_ping request to arrive.&lt;/p&gt;

&lt;p&gt;In my case I have 2 Lustre filesystems, with servers running 2.5.X and some clients running 2.7. The first server has obd_timeout=100 and the second server has obd_timeout=300, so the obd_timeout inherited on the client is 300. The client then sends one obd_ping request every 75 seconds (obd_timeout / 4), so if just one obd_ping request is lost, the client can be evicted by the first filesystem&apos;s servers. It would be better to have an obd_timeout per filesystem, or to use the minimum of the servers&apos; values.&lt;/p&gt;</description>
                <environment></environment>
        <key id="40977">LU-8750</key>
            <summary>Wrong obd_timeout on the client when we have 2 or more lustre fs</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="hongchao.zhang">Hongchao Zhang</assignee>
                                    <reporter username="apercher">Antoine Percher</reporter>
                        <labels>
                    </labels>
                <created>Mon, 24 Oct 2016 09:39:30 +0000</created>
                <updated>Tue, 18 Apr 2023 17:24:35 +0000</updated>
                            <resolved>Wed, 29 Mar 2023 22:32:08 +0000</resolved>
                                    <version>Lustre 2.7.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="171003" author="adilger" created="Tue, 25 Oct 2016 17:52:15 +0000"  >&lt;p&gt;I agree that this is a potential issue, and having a single global obd_timeout value is something that doesn&apos;t align with configurations where e.g. one filesystem is local and another is remote, and they should really have different timeout values.&lt;/p&gt;

&lt;p&gt;There are a few options that can be tried to resolve this problem without needing to wait for a patch and new release:&lt;br/&gt;
1) Try mounting the filesystems on a test client in the opposite order: the filesystem with the longer timeout (FS300) mounted first and the shorter timeout (FS100) mounted second, and then check &lt;tt&gt;lctl get_param timeout&lt;/tt&gt; to see if this client uses the 100s timeout.  If yes, then this could be put into production immediately without any further changes, except in the rare case where one filesystem is being mounted inside the other.  If the client still has a timeout of 300s, then it appears that FS100 is using the default obd_timeout of 100s and not explicitly setting a timeout at all, and something more needs to be done.&lt;br/&gt;
2) As with #1 above, change the mount order to mount FS300 first and FS100 second, and also explicitly set the timeout parameter for FS100 via &lt;tt&gt;lctl conf_param &amp;lt;fsname&amp;gt;.sys.timeout=100&lt;/tt&gt; and see if this allows the client to store the shorter timeout.&lt;br/&gt;
3) Set the timeout for FS100 to 300s to match FS300, so that the servers will wait up to 300s for the pings to arrive.  However, this will also increase the recovery time for FS100 and that may not be desirable for some configurations.&lt;/p&gt;

&lt;p&gt;There are also potential code fixes for this problem. In particular, we discussed adding a per-target &lt;tt&gt;ping_interval&lt;/tt&gt; tunable in /proc, similar to &lt;tt&gt;max_rpcs_in_flight&lt;/tt&gt; and &lt;tt&gt;max_pages_per_rpc&lt;/tt&gt;, that allows setting the ping interval for a single filesystem explicitly.&lt;/p&gt;</comment>
                            <comment id="171008" author="jgmitter" created="Tue, 25 Oct 2016 17:58:52 +0000"  >&lt;p&gt;Hi Hongchao,&lt;/p&gt;

&lt;p&gt;Can you please look into the suggested code fixes that Andreas has highlighted in the last comment?&lt;/p&gt;

&lt;p&gt;Thanks.&lt;br/&gt;
Joe&lt;/p&gt;</comment>
                            <comment id="171530" author="hongchao.zhang" created="Fri, 28 Oct 2016 08:40:32 +0000"  >&lt;p&gt;Test output:&lt;br/&gt;
1) mount the filesystem with timeout 300 first, then the filesystem with timeout 100:&lt;br/&gt;
the timeout is 100&lt;/p&gt;

&lt;p&gt;After setting the timeout of FS100 to 300 explicitly with &lt;tt&gt;lctl conf_param FS100.sys.timeout=300&lt;/tt&gt;, the timeout changes to 300.&lt;/p&gt;

&lt;p&gt;2) mount the filesystem with timeout 100 first, then the filesystem with timeout 300:&lt;br/&gt;
the timeout is 300&lt;/p&gt;

&lt;p&gt;After setting the timeout of FS300 to 100 explicitly with &lt;tt&gt;lctl conf_param FS300.sys.timeout=100&lt;/tt&gt;, the timeout changes to 100.&lt;/p&gt;</comment>
                            <comment id="225948" author="adilger" created="Thu, 12 Apr 2018 23:58:15 +0000"  >&lt;p&gt;To properly fix this problem, it would be good to store the ping_interval and obd_timeout on a per-import basis.  That would allow a single client to mount two or more different filesystems with different server timeouts (which the client can&apos;t control).&lt;/p&gt;</comment>
                            <comment id="232088" author="adilger" created="Thu, 16 Aug 2018 20:02:23 +0000"  >&lt;p&gt;With the newer userspace-driven parameter parsing (an upcall via udev to &lt;tt&gt;lctl&lt;/tt&gt;) it &lt;em&gt;may&lt;/em&gt; be possible to implement per-OBD timeouts relatively easily.  By default, new OBD devices would inherit the global timeout value when they are created (stored in each obd_device or obd_export separately, and always used from the local device instead of the global value).  If there is a &lt;tt&gt;timeout&lt;/tt&gt; parameter in the configuration logs (which would normally generate an &quot;&lt;tt&gt;lctl set_param timeout=&amp;lt;value&amp;gt;&lt;/tt&gt;&quot; upcall), this would be replaced by &quot;&lt;tt&gt;&amp;#42;.&amp;lt;fsname&amp;gt;-&amp;#42;.timeout&lt;/tt&gt;&quot; so that the upcall for that filesystem&apos;s configuration log will only change the devices for the named filesystem.&lt;/p&gt;</comment>
                            <comment id="367812" author="adilger" created="Wed, 29 Mar 2023 22:32:08 +0000"  >&lt;p&gt;Closing this as a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9912&quot; title=&quot;fix multiple client mounts with different server timeouts&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9912&quot;&gt;LU-9912&lt;/a&gt;, I&apos;ve copied CC&apos;s over already.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="47950">LU-9912</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="36381">LU-8066</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="67226">LU-15246</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="75623">LU-16749</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>lnet</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyt5j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>