<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:29:43 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9838] target registration mount fails with -108 but then succeeds if retried immediately</title>
                <link>https://jira.whamcloud.com/browse/LU-9838</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Occasionally, when we try to register (i.e. mount for a first time) a target it will fail with:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 15f-b: testfs-OST0001: cannot register this server with the MGS: rc = -108. Is the MGS running?
Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start targets: -108
Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1576:server_put_super()) no obd testfs-OST0001
Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:135:server_deregister_mount()) testfs-OST0001 not registered
Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: Evicted from MGS (at 10.14.83.68@tcp) after server handle changed from 0xddf3a8acc8d5136b to 0xddf3a8acc8d51555
Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: MGC10.14.83.68@tcp: Connection restored to 10.14.83.68@tcp (at 10.14.83.68@tcp)
Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: server umount testfs-OST0001 complete
Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-108)


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;But then if we try the exact same &lt;tt&gt;mount&lt;/tt&gt;&#160;command a second time, immediately following the failure, it succeeds:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: new disk, initializing
Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: srv-testfs-OST0001: No data found on store. Initialize space
Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: Imperative Recovery not enabled, recovery window 300-900
Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 30649:0:(osd_oi.c:503:osd_oid()) testfs-OST0001-osd: unsupported quota oid: 0x16

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The MGS&apos;s &lt;em&gt;syslog&lt;/em&gt;&#160;around the time of the failure contained:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Aug  5 03:35:50 lotus-58vm15.lotus.hpdd.lab.intel.com kernel: Lustre: MGS: Connection restored to 7d1fb120-bb6f-8171-c97c-aa44f5e9c3db (at 10.14.83.71@tcp)

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;where &lt;em&gt;10.14.83.71&lt;/em&gt; is &lt;em&gt;lotus-58vm18&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I&apos;ve set this issue to Critical because, while it may be merely disconcerting for a human user persistent and curious enough to try a second time and discover that it works, it wreaks havoc on automated systems like IML and Pacemaker, which expect commands that should succeed to succeed.&lt;/p&gt;</description>
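<!--
A minimal shell sketch of the workaround the description implies: since the identical mount command succeeds when reissued immediately, an automated caller could wrap the registration mount in a short retry loop. The retry_mount helper and the two-attempt limit are illustrative assumptions, not part of the ticket.

```shell
# Sketch: retry a first-time registration mount once on failure,
# mirroring the manual "try the exact same mount a second time" workaround.
retry_mount() {
    # $@ is the full mount command, e.g. mount -t lustre <device> <dir>
    local attempt=1 max_attempts=2
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "mount failed after $attempt attempt(s)" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        # Retry immediately; the ticket reports the second attempt succeeds.
    done
    return 0
}
```

Whether one retry is sufficient in general is not established by this ticket; tools such as IML and Pacemaker would still need the underlying race fixed.
-->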
                <environment>Lustre: Build Version: 2.10.0_15_gbaa1ce2 from b2_10</environment>
        <key id="47674">LU-9838</key>
            <summary>target registration mount fails with -108 but then succeeds if retried immediately</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="brian">Brian Murrell</reporter>
                        <labels>
                    </labels>
                <created>Sat, 5 Aug 2017 13:41:31 +0000</created>
                <updated>Thu, 4 Jan 2024 01:16:29 +0000</updated>
                            <resolved>Thu, 4 Jan 2024 01:16:29 +0000</resolved>
                                    <version>Lustre 2.10.0</version>
                                                        <due></due>
                            <votes>1</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="204684" author="green" created="Mon, 7 Aug 2017 17:13:35 +0000"  >&lt;p&gt;I know it&apos;s only hitting occasionally, but we probably still need you to reproduce this with all debug enabled and dump debug logs on both ends once it hits.&lt;br/&gt;
Do you think this is possible?&lt;/p&gt;</comment>
                            <comment id="204747" author="brian" created="Tue, 8 Aug 2017 10:27:04 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=green&quot; class=&quot;user-hover&quot; rel=&quot;green&quot;&gt;green&lt;/a&gt;:&#160;Because it can take a while to hit this bug, and so that we don&apos;t iterate over this multiple times before I get it right and collect correct and sufficient debug information, could I bother you to list here the commands I need to start the debugging you are looking for, and then collect it when the bug hits? &#160;Thanks much.&lt;/p&gt;</comment>
                            <comment id="205259" author="green" created="Sun, 13 Aug 2017 16:42:43 +0000"  >&lt;p&gt;Basically I want you to do this before the registration, but after modules are loaded, on both the mgs and the target:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl set_param debug=-1
lctl set_param debug_mb=200

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;then do your registration attempt and immediately after that on both nodes:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl dk &amp;gt;/tmp/debug.log

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;If there was a failure - collect those two files and make them available.&lt;/p&gt;</comment>
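<!--
The collection steps in the comment above can be bundled into a small helper, sketched here. The lctl invocations (set_param debug=-1, set_param debug_mb=200, dk) are taken verbatim from the comment; the LCTL variable, function names, and default output path argument are illustrative assumptions.

```shell
# Sketch of the debug-collection procedure from the comment above.
# LCTL is parameterized for illustration; the real binary is lctl.
LCTL=${LCTL:-lctl}

enable_full_debug() {
    # Run on both the MGS and the target, after modules load and
    # before the registration attempt.
    "$LCTL" set_param debug=-1 &&
    "$LCTL" set_param debug_mb=200
}

dump_debug_log() {
    # Run on both nodes immediately after the registration attempt.
    local out=${1:-/tmp/debug.log}
    "$LCTL" dk > "$out"
}
```

On a failure, the two resulting log files (one per node) are what the comment asks to be collected and made available.
-->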
                            <comment id="205304" author="joe.grund" created="Mon, 14 Aug 2017 13:37:22 +0000"  >&lt;p&gt;Hitting intermittent failures in IML, being tracked under:  &lt;a href=&quot;https://github.com/intel-hpdd/intel-manager-for-lustre/issues/107&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/intel-hpdd/intel-manager-for-lustre/issues/107&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="205831" author="brian" created="Sat, 19 Aug 2017 14:19:08 +0000"  >&lt;p&gt;Debug logs attached: &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/28025/28025_mds-debug.log&quot; title=&quot;mds-debug.log attached to LU-9838&quot;&gt;mds-debug.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; and &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/28024/28024_oss-debug.log&quot; title=&quot;oss-debug.log attached to LU-9838&quot;&gt;oss-debug.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;.&lt;/p&gt;</comment>
                            <comment id="205837" author="green" created="Sat, 19 Aug 2017 17:45:31 +0000"  >&lt;p&gt;Well, this time it failed with -114 - &quot;operation already in progress&quot;, not -108 - &quot;Cannot send after transport endpoint shutdown&quot;, so I am not sure whether this is the same problem or not.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000020:00000001:0.0:1503141275.035516:0:16570:0:(obd_mount_server.c:1753:osd_start()) Process entered
00000020:01000004:0.0:1503141275.035517:0:16570:0:(obd_mount_server.c:1757:osd_start()) Attempting to start testfs-OST0000, type=osd-zfs, lsifl=200002, mountfl=0
00000020:01000004:0.0:1503141275.035522:0:16570:0:(obd_mount_server.c:1776:osd_start()) testfs-OST0000-osd already started
00000020:00000001:0.0:1503141275.035523:0:16570:0:(obd_mount_server.c:1783:osd_start()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)
00000020:00020000:0.0:1503141275.035525:0:16570:0:(obd_mount_server.c:1832:server_fill_super()) Unable to start osd on zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk3/testfs-OST0000: -114
00000020:00000001:0.0:1503141275.044922:0:16570:0:(obd_mount.c:657:lustre_put_lsi()) Process entered
00000020:01000004:0.0:1503141275.044924:0:16570:0:(obd_mount.c:661:lustre_put_lsi()) put ffff88003d7e1000 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This was even before any talking to the MGS, so I guess it is something else.&lt;br/&gt;
It sounds to me like the fs was already mounted when you tried to mount it again?&lt;/p&gt;</comment>
                            <comment id="205871" author="brian" created="Mon, 21 Aug 2017 13:28:07 +0000"  >&lt;p&gt;That&apos;s strange.&lt;/p&gt;

&lt;p&gt;In any case, I tried to attach two more logs but ran into DCO-7458&#160;so they are on the ftp site in &lt;tt&gt;/uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9838&quot; title=&quot;target registration mount fails with -108 but then succeeds if retried immediately&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9838&quot;&gt;&lt;del&gt;LU-9838&lt;/del&gt;&lt;/a&gt;/&lt;/tt&gt;&#160;in files &lt;tt&gt;mds-debug.log.bz2&lt;/tt&gt; and &lt;tt&gt;/tmp/oss-debug.log.bz2&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="205948" author="green" created="Tue, 22 Aug 2017 02:24:59 +0000"  >&lt;p&gt;So the way it unfolds is this:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00000040:1.0:1503245974.332685:0:1172:0:(lustre_net.h:2457:ptlrpc_rqphase_move()) @@@ move req &quot;New&quot; -&amp;gt; &quot;Rpc&quot;  req@ffff88001d2c2400 x1576258375780640/t0(0) o250-&amp;gt;MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00000040:1.0:1503245974.332731:0:1172:0:(niobuf.c:896:ptl_send_rpc()) @@@ send flg=0  req@ffff88001d2c2400 x1576258375780640/t0(0) o250-&amp;gt;MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 0 dl 1503246010 ref 2 fl Rpc:N/0/ffffffff rc 0/-1
00000100:00000200:1.0:1503245974.332898:0:1168:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff88001d2c2400 x1576258375780640/t0(0) o250-&amp;gt;MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 0 dl 1503246010 ref 2 fl Rpc:N/0/ffffffff rc 0/-1
10000000:01000000:0.0:1503245974.355790:0:19104:0:(mgc_request.c:2139:mgc_process_log()) MGC10.14.80.248@tcp: configuration from log &apos;testfs-client&apos; succeeded (0).
00000100:00000040:0.0:1503245974.355897:0:19104:0:(lustre_net.h:2457:ptlrpc_rqphase_move()) @@@ move req &quot;New&quot; -&amp;gt; &quot;Rpc&quot;  req@ffff8800449dea00 x1576258375780672/t0(0) o253-&amp;gt;MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
00000100:00000200:0.0:1503245974.355926:0:19104:0:(client.c:1181:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8800449dea00 x1576258375780672/t0(0) o253-&amp;gt;MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
00000100:00000400:0.0:1503246010.355433:0:1172:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1503245974/real 1503245974]  req@ffff88001d2c2400 x1576258375780640/t0(0) o250-&amp;gt;MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 1 dl 1503246010 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So at least in this instance it looks like we tried to connect (o250) to the MGS, but it did not respond to us; yet we already had a cached config log from somewhere, so we decided to send the registration (o253), and because the mgc imports are obd_no_recov until actually connected, we bail right out instead of waiting for the import to become active first.&lt;/p&gt;

&lt;p&gt;This whole strange thing with the mgc, where it&apos;s not recoverable and has strange no-resend policies, has bitten us again; it&apos;s probably time to take a good hard look at it.&lt;/p&gt;</comment>
                            <comment id="205951" author="brian" created="Tue, 22 Aug 2017 02:44:30 +0000"  >&lt;p&gt;Is this something that is reasonable to fix on b2_10?&lt;/p&gt;</comment>
                            <comment id="210204" author="bhoagland" created="Tue, 3 Oct 2017 15:42:09 +0000"  >&lt;p&gt;working around this for IML 4.0 GA&lt;/p&gt;</comment>
                            <comment id="283110" author="javed" created="Fri, 23 Oct 2020 07:21:40 +0000"  >&lt;p&gt;Lustre versions on the server and clients have been upgraded. The above issue is no longer relevant to us.&lt;/p&gt;</comment>
                            <comment id="398524" author="adilger" created="Thu, 4 Jan 2024 01:16:29 +0000"  >&lt;p&gt;Closing this old issue related to 2.10 servers. It should be reproduced on a modern system to confirm whether the issue still exists.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10120">
                    <name>Blocker</name>
                                                                <inwardlinks description="is blocked by">
                                                        </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="24409">LU-4966</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="28025" name="mds-debug.log" size="65673" author="brian" created="Sat, 19 Aug 2017 14:15:51 +0000"/>
                            <attachment id="28035" name="mds-debug.log.bz2" size="8958924" author="brian" created="Tue, 22 Aug 2017 02:43:27 +0000"/>
                            <attachment id="28024" name="oss-debug.log" size="103089" author="brian" created="Sat, 19 Aug 2017 14:15:51 +0000"/>
                            <attachment id="28036" name="oss-debug.log.bz2" size="7443781" author="brian" created="Tue, 22 Aug 2017 02:43:25 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzhvr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>