<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:18:52 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1694] ENOENT error encountered during HA tests</title>
                <link>https://jira.whamcloud.com/browse/LU-1694</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;During failover testing, we occasionally (about 30% of the time) see ENOENT errors on the client. The test untars the kernel source on the clients and then powers off an OSS, so that the OSTs fail over to the partner OSS. After the state of the OST being written to goes from RECOVERING to COMPLETE, the client returns ENOENT.&lt;/p&gt;

&lt;p&gt;There appear to be some similarities with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-463&quot; title=&quot;orphan recovery happens too late, causing writes to fail with ENOENT after recovery&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-463&quot;&gt;&lt;del&gt;LU-463&lt;/del&gt;&lt;/a&gt;, although this also features a client soft lockup. Here are the server and client logs:&lt;/p&gt;

&lt;p&gt;EPAM-ES-Server03&lt;br/&gt;
----------------&lt;br/&gt;
es01a-OST0000/recovery_status:status: COMPLETE&lt;br/&gt;
es01a-OST0001/recovery_status:status: COMPLETE&lt;br/&gt;
es01a-OST0002/recovery_status:status: RECOVERING&lt;br/&gt;
es01a-OST0003/recovery_status:status: RECOVERING&lt;/p&gt;

&lt;p&gt;es01a-OST0000/recovery_status:status: COMPLETE&lt;br/&gt;
es01a-OST0001/recovery_status:status: COMPLETE&lt;br/&gt;
es01a-OST0002/recovery_status:status: RECOVERING&lt;br/&gt;
es01a-OST0003/recovery_status:status: COMPLETE&lt;br/&gt;
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: 13393:0:(ldlm_lib.c:1973:target_recovery_init()) RECOVERY: service es01a-OST0003, 2 recoverable clients, last_transno 11314055&lt;br/&gt;
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: 13393:0:(ldlm_lib.c:1973:target_recovery_init()) Skipped 1 previous similar message&lt;br/&gt;
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: Now serving es01a-OST0003 on /dev/mapper/ost_es01a_3 with recovery enabled&lt;br/&gt;
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: Skipped 1 previous similar message&lt;br/&gt;
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: Will be in recovery for at least 5:00, or until 2 clients reconnect&lt;br/&gt;
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: Skipped 1 previous similar message&lt;br/&gt;
Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): recovery complete&lt;br/&gt;
Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. Opts:&lt;br/&gt;
Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): warning: maximal mount count reached, running e2fsck is recommended&lt;br/&gt;
Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. Opts:&lt;br/&gt;
Jun 28 09:02:56 EPAM-ES-Server03 kernel: Lustre: 10205:0:(ldlm_lib.c:933:target_handle_connect()) es01a-OST0003: connection from es01a-MDT0000-mdtlov_UUID@192.168.3.10@o2ib recovering/t0 exp ffff88018fcff800 cur 1340899376 last 1340899368&lt;br/&gt;
Jun 28 09:02:56 EPAM-ES-Server03 kernel: Lustre: 10118:0:(ldlm_lib.c:933:target_handle_connect()) es01a-OST0002: connection from es01a-MDT0000-mdtlov_UUID@192.168.3.10@o2ib recovering/t0 exp ffff88019fca8400 cur 1340899376 last 1340899368 &lt;br/&gt;
Jun 28 09:03:09 EPAM-ES-Server03 kernel: LustreError: 13394:0:(ldlm_resource.c:1088:ldlm_resource_get()) lvbo_init failed for resource 522895: rc -2&lt;br/&gt;
Jun 28 09:03:46 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: sending delayed replies to recovered clients&lt;br/&gt;
Jun 28 09:03:46 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: received MDS connection from 192.168.3.10@o2ib&lt;br/&gt;
Jun 28 09:03:46 EPAM-ES-Server03 kernel: LustreError: 13448:0:(ldlm_resource.c:1088:ldlm_resource_get()) lvbo_init failed for resource 1047468: rc -2&lt;/p&gt;

&lt;p&gt;On client &lt;br/&gt;
Jun 28 09:03:46 EPAM-ES-Client01 kernel: Lustre: Skipped 1 previous similar message&lt;br/&gt;
Jun 28 09:06:19 EPAM-ES-Client01 kernel: Lustre: es01a-OST0002-osc-ffff88018a031800: Connection to es01a-OST0002 (at 192.168.3.12@o2ib) was lost; in progress operations using this service will wait for recovery to complete&lt;br/&gt;
Jun 28 09:06:19 EPAM-ES-Client01 kernel: Lustre: Skipped 1 previous similar message&lt;br/&gt;
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5796:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway&lt;br/&gt;
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5796:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11&lt;br/&gt;
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5921:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway&lt;br/&gt;
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5796:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) Skipped 1 previous similar message&lt;br/&gt;
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11&lt;br/&gt;
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway&lt;br/&gt;
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Skipped 34 previous similar messages&lt;br/&gt;
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11&lt;br/&gt;
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) Skipped 33 previous similar messages&lt;br/&gt;
Jun 28 09:08:20 EPAM-ES-Client01 kernel: BUG: soft lockup - CPU#0 stuck for 67s! &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpcd-rcv:4203&amp;#93;&lt;/span&gt;&lt;br/&gt;
Jun 28 09:08:20 EPAM-ES-Client01 kernel: Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) autofs4 ib_uverbs(U) ib_srp(U) scsi_transport_srp scsi_tgt sunrpc ko2iblnd(U) rdma_cm(U) iw_cm(U) ib_addr(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ib_umad(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 mlx4_en(U) mlx4_ib(U) mlx4_core(U) ib_mad(U) ib_core(U) ipmi_si ipmi_devintf ipmi_msghandler power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support sg i7core_edac edac_core bnx2 ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_tgt]&lt;/p&gt;


&lt;p&gt;Are there any debug settings we could enable to aid debugging?&lt;/p&gt;</description>
                <environment>RHEL 6.2</environment>
        <key id="15350">LU-1694</key>
            <summary>ENOENT error encountered during HA tests</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="4">Incomplete</resolution>
                                        <assignee username="hongchao.zhang">Hongchao Zhang</assignee>
                                    <reporter username="ihara">Shuichi Ihara</reporter>
                        <labels>
                    </labels>
                <created>Tue, 31 Jul 2012 23:35:38 +0000</created>
                <updated>Sat, 5 Mar 2016 00:23:41 +0000</updated>
                            <resolved>Sat, 5 Mar 2016 00:23:41 +0000</resolved>
                                    <version>Lustre 2.1.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="42546" author="pjones" created="Wed, 1 Aug 2012 09:50:21 +0000"  >&lt;p&gt;Hongchao&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="42585" author="adilger" created="Thu, 2 Aug 2012 03:16:17 +0000"  >&lt;p&gt;Could this relate to the orphan object cleanup race that was recently fixed on master?&lt;/p&gt;</comment>
                            <comment id="51770" author="hongchao.zhang" created="Mon, 4 Feb 2013 22:20:34 +0000"  >&lt;p&gt;Hi Kit,&lt;br/&gt;
Has the issue occurred again recently? Thanks&lt;/p&gt;</comment>
                            <comment id="139554" author="hongchao.zhang" created="Thu, 21 Jan 2016 13:04:34 +0000"  >&lt;p&gt;Hi Kit,&lt;br/&gt;
Do you need any more work done on this ticket, or are we okay to close it?&lt;br/&gt;
Thanks&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw1kf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10335</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>