<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:46:20 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4842] MDS crash and didn&apos;t failover (24022:0:(mdt_open.c:1497:mdt_reint_open()) @@@ OPEN &amp; CREAT not in open replay/by_fid)</title>
                <link>https://jira.whamcloud.com/browse/LU-4842</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have seen multiple times that the Lustre MDT did not start on the&lt;br/&gt;
failover server. This has happened for different file systems. Either the&lt;br/&gt;
configuration is wrong or something causes the failover server to hang&lt;br/&gt;
during recovery.&lt;br/&gt;
This happened on the HC3WORK file system on Dec 20 2013, but also on the&lt;br/&gt;
pfs2dat1 file system on Mar 10 and on multiple file systems&lt;br/&gt;
(pfs2wor1, pfs2dat2, pfs2dat1, HC3WORK) while I tried to deactivate ACLs&lt;br/&gt;
on Mar 17.&lt;/p&gt;</description>
                <environment>Lustre 2.4.1</environment>
        <key id="23986">LU-4842</key>
            <summary>MDS crash and didn&apos;t failover (24022:0:(mdt_open.c:1497:mdt_reint_open()) @@@ OPEN &amp; CREAT not in open replay/by_fid)</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="4">Incomplete</resolution>
                                        <assignee username="hongchao.zhang">Hongchao Zhang</assignee>
                                    <reporter username="rganesan@ddn.com">Rajeshwaran Ganesan</reporter>
                        <labels>
                    </labels>
                <created>Mon, 31 Mar 2014 15:46:51 +0000</created>
                <updated>Sat, 30 Jan 2016 02:44:09 +0000</updated>
                            <resolved>Sat, 30 Jan 2016 02:44:09 +0000</resolved>
                                    <version>Lustre 2.4.1</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="80736" author="pjones" created="Tue, 1 Apr 2014 18:36:53 +0000"  >&lt;p&gt;Hongchao&lt;/p&gt;

&lt;p&gt;Could you please advise on this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="80953" author="rganesan@ddn.com" created="Thu, 3 Apr 2014 16:09:55 +0000"  >&lt;p&gt;Hello&lt;/p&gt;

&lt;p&gt;Hong Chao - Could you please give us an update? If you need any other logs or information, please let me know.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Rajesh&lt;/p&gt;</comment>
                            <comment id="81024" author="hongchao.zhang" created="Fri, 4 Apr 2014 03:00:42 +0000"  >&lt;p&gt;Hi Rajesh,&lt;/p&gt;

&lt;p&gt;The log message &quot;(24022:0:(mdt_open.c:1497:mdt_reint_open()) @@@ OPEN &amp;amp; CREAT not in open replay/by_fid)&quot; means that no object was found at the MDT&lt;br/&gt;
for an open request without the &quot;MDS_OPEN_CREAT&quot; flag; the object could have just been created by previous clients that could not recover with the MDT correctly.&lt;/p&gt;

&lt;p&gt;Could you please attach the syslog of the MDT, the OSS, and the clients?&lt;br/&gt;
It would also be very helpful if you collected the debug logs of these nodes (using &quot;lctl dk &amp;gt; XXX&quot; to get them).&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Hongchao&lt;/p&gt;</comment>
                            <comment id="81435" author="rganesan@ddn.com" created="Fri, 11 Apr 2014 16:18:42 +0000"  >&lt;p&gt;Hi&lt;/p&gt;

&lt;p&gt;I have uploaded the requested logs to ftp.whamcloud.com under Lu-4842.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Rajesh&lt;/p&gt;</comment>
                            <comment id="81720" author="rganesan@ddn.com" created="Wed, 16 Apr 2014 10:08:22 +0000"  >&lt;p&gt;Hello Hongchao,&lt;/p&gt;

&lt;p&gt;Please give us an update, and please let me know if you need any other logs, etc.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Rajesh&lt;/p&gt;</comment>
                            <comment id="81815" author="hongchao.zhang" created="Thu, 17 Apr 2014 08:58:44 +0000"  >&lt;p&gt;Hi Rajesh,&lt;/p&gt;

&lt;p&gt;The logs (Lu-4842/LU_logs.txt) don&apos;t contain the error message. When did you collect the debug log? It covers only about 50 seconds (1397232533 ~ 1397232605), and&lt;br/&gt;
new logs overwrite the old ones. The value in &quot;/proc/sys/lnet/debug_mb&quot; is the size of the log buffer in megabytes; could you please enlarge it by running&lt;br/&gt;
&quot;echo &apos;new_size&apos; &amp;gt; /proc/sys/lnet/debug_mb&quot; and collect the logs just after the recovery fails?&lt;/p&gt;

&lt;p&gt;Could you please also upload the syslog of the nodes (MDT, OSS, clients) for this recovery failure?&lt;br/&gt;
Thanks&lt;/p&gt;</comment>
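The buffer-enlarging and log-dumping steps described in the comment above can be sketched as a small shell snippet. This is a sketch under assumptions: the /proc path matches the Lustre 2.4-era interface quoted in the comment, and the 512 MB size and dump-file path are illustrative placeholders, not values from the ticket.

```shell
#!/bin/sh
# Sketch: enlarge the Lustre kernel debug ring buffer so the window of
# interest is not overwritten, then dump it with `lctl dk`.
# DEBUG_MB_PATH follows the 2.4-era /proc layout; NEW_SIZE_MB and
# DUMP_FILE are hypothetical defaults that can be overridden.
DEBUG_MB_PATH="${DEBUG_MB_PATH:-/proc/sys/lnet/debug_mb}"
NEW_SIZE_MB="${NEW_SIZE_MB:-512}"
DUMP_FILE="${DUMP_FILE:-/tmp/lustre-debug.log}"

if [ -w "$DEBUG_MB_PATH" ]; then
    # Enlarge the ring buffer so new messages evict old ones more slowly.
    echo "$NEW_SIZE_MB" > "$DEBUG_MB_PATH"
    # Dump the accumulated debug log right after the recovery failure.
    lctl dk > "$DUMP_FILE"
    echo "debug buffer set to ${NEW_SIZE_MB} MB; log dumped to $DUMP_FILE"
else
    echo "skip: $DEBUG_MB_PATH not writable (not a Lustre server node?)"
fi
```

On a node without Lustre loaded, the guard makes the script a no-op rather than failing on the missing /proc entry.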
                            <comment id="82738" author="rganesan@ddn.com" created="Tue, 29 Apr 2014 14:04:18 +0000"  >&lt;p&gt;Attaching additional logs. The same issue occurred on 10/03; attaching those logs too.&lt;/p&gt;

&lt;p&gt;2014-03-10-es_lustre_showall_2014-03-10_155943.tar.bz2&lt;br/&gt;
2014-03-10-es_lustre_showall_2014-03-10_152317.tar.bz2&lt;/p&gt;

&lt;p&gt;Hope it helps. Please let us know if you need any other logs.&lt;/p&gt;</comment>
                            <comment id="82844" author="hongchao.zhang" created="Wed, 30 Apr 2014 13:14:53 +0000"  >&lt;p&gt;Did you fail the MDS at pfs2n2 (IB: 172.26.8.2, FS: pfs2dat1) over to the MDS at pfs2n3 (IB: 172.26.8.3, FS: pfs2dat1) on 2014-03-10?&lt;/p&gt;

&lt;p&gt;The MGS at pfs2n6 (IB: 172.26.8.6, FS: pfs2wor1) detected the failover and switched to pfs2n3:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 10 14:16:09 pfs2n6 kernel: : Lustre: 11986:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457362/real 0]  req@ffff880745bc5c00 x1454141536457600/t0(0) o400-&amp;gt;MGC172.26.8.2@o2ib@172.26.8.2@o2ib:26/25 lens 224/224 e 0 to 1 dl 1394457369 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:16:09 pfs2n6 kernel: : LustreError: 166-1: MGC172.26.8.2@o2ib: Connection to MGS (at 172.26.8.2@o2ib) was lost; in progress operations using this service will fail
Mar 10 14:16:40 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394457394/real 1394457394]  req@ffff8807bd6c4c00 x1454141536458640/t0(0) o250-&amp;gt;MGC172.26.8.2@o2ib@172.26.8.3@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457400 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:16:40 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Mar 10 14:17:10 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457419/real 0]  req@ffff88081d026000 x1454141536459448/t0(0) o250-&amp;gt;MGC172.26.8.2@o2ib@172.26.8.2@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457430 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:17:24 pfs2n6 kernel: : Lustre: Evicted from MGS (at MGC172.26.8.2@o2ib_0) after server handle changed from 0x3a6efc7d7b9853c2 to 0x80cf882fb86516e3
Mar 10 14:17:24 pfs2n6 kernel: : Lustre: MGC172.26.8.2@o2ib: Connection restored to MGS (at 172.26.8.3@o2ib)
Mar 10 14:17:24 pfs2n6 kernel: : Lustre: Skipped 9 previous similar messages
Mar 10 14:18:26 pfs2n6 kernel: : Lustre: 11992:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394457494/real 1394457494]  req@ffff880740871800 x1454141536461908/t0(0) o400-&amp;gt;MGC172.26.8.2@o2ib@172.26.8.3@o2ib:26/25 lens 224/224 e 0 to 1 dl 1394457506 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:18:26 pfs2n6 kernel: : LustreError: 166-1: MGC172.26.8.2@o2ib: Connection to MGS (at 172.26.8.3@o2ib) was lost; in progress operations using this service will fail
Mar 10 14:19:57 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457581/real 0]  req@ffff88060f659400 x1454141536464732/t0(0) o250-&amp;gt;MGC172.26.8.2@o2ib@172.26.8.2@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457597 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:19:57 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Mar 10 14:22:42 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457731/real 0]  req@ffff880802be6400 x1454141536469596/t0(0) o250-&amp;gt;MGC172.26.8.2@o2ib@172.26.8.3@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457762 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:22:42 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Mar 10 14:26:21 pfs2n6 kernel: : Lustre: Evicted from MGS (at MGC172.26.8.2@o2ib_0) after server handle changed from 0x80cf882fb86516e3 to 0xe62e03dfa42a5fd9
Mar 10 14:26:21 pfs2n6 kernel: : Lustre: MGC172.26.8.2@o2ib: Connection restored to MGS (at 172.26.8.3@o2ib)
Mar 10 14:54:15 pfs2n6 kernel: : Lustre: pfs2wor1-MDT0000: haven&apos;t heard from client f7256a7b-c8ff-e78f-0a3d-a50b8fb47015 (at 172.26.5.106@o2ib) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff8810047b5c00, cur 1394459655 expire 1394459505 last 1394459428
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;but there seem to be no clients connecting to pfs2n3 except pfs2n4 (OSS) and pfs2n5 (OSS):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 10 14:26:09 pfs2n3 kernel: : LustreError: 137-5: pfs2dat1-MDT0000_UUID: not available for connect from 172.26.8.5@o2ib (no target)
Mar 10 14:26:09 pfs2n3 kernel: : LustreError: 137-5: pfs2dat1-MDT0000_UUID: not available for connect from 172.26.8.4@o2ib (no target)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;INFO: task tgt_recov:5224 blocked for more than 120 seconds.
Mar 10 14:47:59 pfs2n3 kernel: : &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
Mar 10 14:47:59 pfs2n3 kernel: : tgt_recov     D 000000000000000f     0  5224      2 0x00000080
Mar 10 14:47:59 pfs2n3 kernel: : ffff880800ea3e00 0000000000000046 0000000000000000 0000000000000003
Mar 10 14:47:59 pfs2n3 kernel: : ffff880800ea3d90 ffffffff81055f96 ffff880800ea3da0 ffff880830298080
Mar 10 14:47:59 pfs2n3 kernel: : ffff880832d73ab8 ffff880800ea3fd8 000000000000fb88 ffff880832d73ab8
Mar 10 14:47:59 pfs2n3 kernel: : Call Trace:
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffff81055f96&amp;gt;] ? enqueue_task+0x66/0x80
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffffa07d8070&amp;gt;] ? check_for_clients+0x0/0x70 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffffa07d972d&amp;gt;] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffffa07d7d60&amp;gt;] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffff81096da0&amp;gt;] ? autoremove_wake_function+0x0/0x40
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffffa07e0ede&amp;gt;] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffffa07e0950&amp;gt;] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffff8100c0ca&amp;gt;] child_rip+0xa/0x20
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffffa07e0950&amp;gt;] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffffa07e0950&amp;gt;] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [&amp;lt;ffffffff8100c0c0&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The problem then looks a bit like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4722&quot; title=&quot;IO Errors during the failover - SLES 11 SP2 - Lustre 2.4.2&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4722&quot;&gt;&lt;del&gt;LU-4722&lt;/del&gt;&lt;/a&gt;, where the clients don&apos;t fail over to the right server node.&lt;br/&gt;
Could you please also dump the config logs &quot;pfs2dat1-client&quot; and &quot;pfs2wor1-client&quot; at pfs2n3?&lt;br/&gt;
Thanks&lt;/p&gt;</comment>
                            <comment id="139709" author="hongchao.zhang" created="Fri, 22 Jan 2016 05:35:27 +0000"  >&lt;p&gt;Hi Rajesh,&lt;br/&gt;
Do you need any more work on this ticket, or can we close it?&lt;br/&gt;
Thanks&lt;/p&gt;</comment>
                            <comment id="140341" author="rganesan@ddn.com" created="Thu, 28 Jan 2016 13:58:14 +0000"  >&lt;p&gt;Hello Hongchao,&lt;/p&gt;

&lt;p&gt;Please close this case. The customer has upgraded to a newer version of Lustre and we are good now.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Rajesh&lt;/p&gt;</comment>
                            <comment id="140610" author="jfc" created="Sat, 30 Jan 2016 02:44:09 +0000"  >&lt;p&gt;Thanks Rajesh.&lt;/p&gt;

&lt;p&gt;~ jfc.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwivj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13339</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>