<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:28:31 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
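A concrete request (assuming the standard JIRA issue XML view URL) would be: https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-2824/LU-2824.xml?field=key&field=summary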
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2824] recovery-mds-scale test_failover_ost: tar: etc/localtime: Cannot open: Input/output error</title>
                <link>https://jira.whamcloud.com/browse/LU-2824</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;While running recovery-mds-scale test_failover_ost, the test failed as follows after running for 15 hours:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;==== Checking the clients loads AFTER failover -- failure NOT OK
Client load failed on node client-31vm6, rc=1
Client load failed during failover. Exiting...
Found the END_RUN_FILE file: /home/autotest/.autotest/shared_dir/2013-02-12/172229-70152412386500/end_run_file
client-31vm6.lab.whamcloud.com
Client load  failed on node client-31vm6.lab.whamcloud.com:
/logdir/test_logs/2013-02-12/lustre-b1_8-el5-x86_64-vs-lustre-b1_8-el6-x86_64--review--1_1_1__13121__-70152412386500-172228/recovery-mds-scale.test_failover_ost.run__stdout.client-31vm6.lab.whamcloud.com.log
/logdir/test_logs/2013-02-12/lustre-b1_8-el5-x86_64-vs-lustre-b1_8-el6-x86_64--review--1_1_1__13121__-70152412386500-172228/recovery-mds-scale.test_failover_ost.run__debug.client-31vm6.lab.whamcloud.com.log
2013-02-13 10:50:55 Terminating clients loads ...
Duration:               86400
Server failover period: 900 seconds
Exited after:           56345 seconds
Number of failovers before exit:
mds: 0 times
ost1: 8 times
ost2: 3 times
ost3: 13 times
ost4: 7 times
ost5: 10 times
ost6: 10 times
ost7: 12 times
Status: FAIL: rc=1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The output of the &quot;tar&quot; operation on client-31vm6 showed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;tar: etc/chef/solo.rb: Cannot open: Input/output error
tar: etc/chef/client.rb: Cannot open: Input/output error
tar: etc/prelink.cache: Cannot open: Input/output error
tar: etc/readahead.conf: Cannot open: Input/output error
tar: etc/localtime: Cannot open: Input/output error
tar: Exiting with failure status due to previous errors
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Dmesg on the MDS node (client-31vm3) showed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: DEBUG MARKER: Starting failover on ost2
Lustre: 7396:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1426822752375573 sent from lustre-OST0000-osc to NID 10.10.4.195@tcp 7s ago has timed out (7s prior to deadline).
  req@ffff810048893000 x1426822752375573/t0 o13-&amp;gt;lustre-OST0000_UUID@10.10.4.195@tcp:7/4 lens 192/528 e 0 to 1 dl 1360781077 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 7396:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 117 previous similar messages
Lustre: lustre-OST0000-osc: Connection to service lustre-OST0000 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: lustre-OST0005-osc: Connection to service lustre-OST0005 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Skipped 4 previous similar messages
Lustre: lustre-OST0006-osc: Connection to service lustre-OST0006 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 7398:0:(import.c:517:import_select_connection()) lustre-OST0000-osc: tried all connections, increasing latency to 2s
Lustre: 7398:0:(import.c:517:import_select_connection()) Skipped 59 previous similar messages
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -11
LustreError: 7725:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -5
LustreError: 7725:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7725:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 1/1: rc = -11
LustreError: 7750:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7750:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
Lustre: MGS: haven&apos;t heard from client 71727980-7899-77af-8af0-a42b0349985a (at 10.10.4.195@tcp) in 49 seconds. I think it&apos;s dead, and I am evicting it.
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 2/1: rc = -11
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
LustreError: 7745:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7745:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 3/1: rc = -11
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
LustreError: 7739:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7739:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 4/1: rc = -11
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
LustreError: 7731:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7731:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
Lustre: 7397:0:(quota_master.c:1724:mds_quota_recovery()) Only 6/7 OSTs are active, abort quota recovery
Lustre: 7397:0:(quota_master.c:1724:mds_quota_recovery()) Skipped 6 previous similar messages
Lustre: lustre-OST0000-osc: Connection restored to service lustre-OST0000 using nid 10.10.4.191@tcp.
Lustre: Skipped 6 previous similar messages
Lustre: MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
Lustre: Skipped 6 previous similar messages
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 5/1: rc = -11
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Client load failed on node client-31vm6, rc=1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Maloo report: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/40f45b4c-760f-11e2-b5e2-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/40f45b4c-760f-11e2-b5e2-52540035b04c&lt;/a&gt;&lt;/p&gt;</description>
                <environment>&lt;br/&gt;
Lustre Tag: v1_8_9_WC1_RC1&lt;br/&gt;
Lustre Build: &lt;a href=&quot;http://build.whamcloud.com/job/lustre-b1_8/256&quot;&gt;http://build.whamcloud.com/job/lustre-b1_8/256&lt;/a&gt;&lt;br/&gt;
Distro/Arch: RHEL5.9/x86_64(server), RHEL6.3/x86_64(client)&lt;br/&gt;
Network: TCP (1GigE)&lt;br/&gt;
ENABLE_QUOTA=yes&lt;br/&gt;
FAILURE_MODE=HARD&lt;br/&gt;
&lt;br/&gt;
MGS/MDS Nodes: client-31vm3(active), client-31vm7(passive), failover pair for 1 combined MGS/MDT&lt;br/&gt;
&lt;br/&gt;
OSS Nodes: client-31vm4(active), client-31vm8(passive), failover pair for 7 OSTs&lt;br/&gt;
&lt;br/&gt;
Client Nodes: client-31vm[1,5,6]&lt;br/&gt;
&lt;br/&gt;
IP Addresses:&lt;br/&gt;
client-31vm1: 10.10.4.196&lt;br/&gt;
client-31vm3: 10.10.4.190&lt;br/&gt;
client-31vm4: 10.10.4.191&lt;br/&gt;
client-31vm5: 10.10.4.192&lt;br/&gt;
client-31vm6: 10.10.4.193&lt;br/&gt;
client-31vm7: 10.10.4.194&lt;br/&gt;
client-31vm8: 10.10.4.195&lt;br/&gt;
</environment>
        <key id="17599">LU-2824</key>
            <summary>recovery-mds-scale test_failover_ost: tar: etc/localtime: Cannot open: Input/output error</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="yujian">Jian Yu</reporter>
                        <labels>
                    </labels>
                <created>Sat, 16 Feb 2013 04:49:58 +0000</created>
                <updated>Thu, 7 Mar 2013 01:55:13 +0000</updated>
                            <resolved>Thu, 21 Feb 2013 02:20:14 +0000</resolved>
                                    <version>Lustre 1.8.9</version>
                                    <fixVersion>Lustre 2.1.5</fixVersion>
                    <fixVersion>Lustre 1.8.9</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                    <comments>
                            <comment id="52530" author="yujian" created="Sat, 16 Feb 2013 04:53:59 +0000"  >&lt;p&gt;We used to hit &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-463&quot; title=&quot;orphan recovery happens too late, causing writes to fail with ENOENT after recovery&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-463&quot;&gt;&lt;del&gt;LU-463&lt;/del&gt;&lt;/a&gt; while running the recovery-mds-scale test_failover_ost on Lustre b1_8 build.&lt;/p&gt;</comment>
                            <comment id="52532" author="yujian" created="Sat, 16 Feb 2013 07:40:28 +0000"  >&lt;p&gt;Lustre Tag: v1_8_9_WC1_RC1&lt;br/&gt;
Lustre Build: &lt;a href=&quot;http://build.whamcloud.com/job/lustre-b1_8/256&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/lustre-b1_8/256&lt;/a&gt;&lt;br/&gt;
Distro/Arch: RHEL5.9/x86_64&lt;br/&gt;
Network: TCP (1GigE)&lt;br/&gt;
ENABLE_QUOTA=yes&lt;br/&gt;
FAILURE_MODE=HARD&lt;/p&gt;

&lt;p&gt;While running recovery-mds-scale test_failover_mds, the test also failed with a similar issue after running for 6 hours (the MDS failed over 26 times).&lt;/p&gt;

&lt;p&gt;The output of the &quot;tar&quot; operation on the client node (client-32vm6) showed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;lt;~snip~&amp;gt;
tar: etc/default/nss: Cannot open: Input/output error
tar: etc/default/useradd: Cannot open: Input/output error
tar: Error exit delayed from previous errors
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Dmesg on the MDS node (client-32vm3) showed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: lustre-OST0000-osc: Connection restored to service lustre-OST0000 using nid 10.10.4.199@tcp.
Lustre: lustre-OST0002-osc: Connection restored to service lustre-OST0002 using nid 10.10.4.199@tcp.
Lustre: Skipped 1 previous similar message
Lustre: MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
LustreError: 3181:0:(lov_obd.c:1153:lov_clear_orphans()) error in orphan recovery on OST idx 0/7: rc = -16
LustreError: 3181:0:(mds_lov.c:1057:__mds_lov_synchronize()) lustre-OST0000_UUID failed at mds_lov_clear_orphans: -16
LustreError: 3181:0:(mds_lov.c:1066:__mds_lov_synchronize()) lustre-OST0000_UUID sync failed -16, deactivating
Lustre: MDS lustre-MDT0000: lustre-OST0001_UUID now active, resetting orphans
Lustre: Skipped 2 previous similar messages
LustreError: 3182:0:(lov_obd.c:1153:lov_clear_orphans()) error in orphan recovery on OST idx 1/7: rc = -16
LustreError: 2768:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 1 of 2 llog-records failed: -2
LustreError: 3182:0:(mds_lov.c:1057:__mds_lov_synchronize()) lustre-OST0001_UUID failed at mds_lov_clear_orphans: -16
LustreError: 3182:0:(mds_lov.c:1066:__mds_lov_synchronize()) lustre-OST0001_UUID sync failed -16, deactivating
Lustre: Service thread pid 2786 completed after 102.31s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LustreError: 2793:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 2622037: rc = -5
LustreError: 2793:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
LustreError: 2794:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 2622037: rc = -5
LustreError: 2794:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
LustreError: 2792:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 2622052: rc = -5
LustreError: 2792:0:(mds_open.c:442:mds_create_objects()) Skipped 581 previous similar messages
LustreError: 2792:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
LustreError: 2792:0:(mds_open.c:827:mds_finish_open()) Skipped 581 previous similar messages
LustreError: 2769:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 2 of 2 llog-records failed: -2
LustreError: 2769:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 59 of 122 llog-records failed: -2
LustreError: 2769:0:(llog_server.c:466:llog_origin_handle_cancel()) Skipped 6 previous similar messages
LustreError: 3189:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 51 of 122 llog-records failed: -2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Maloo report: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/61cf03e0-76c7-11e2-bc2f-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/61cf03e0-76c7-11e2-bc2f-52540035b04c&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="52533" author="yujian" created="Sat, 16 Feb 2013 07:42:46 +0000"  >&lt;p&gt;Hi Oleg,&lt;/p&gt;

&lt;p&gt;Is the original issue in this ticket a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1475&quot; title=&quot;lov_update_create_set() error creating fid 0xf99b sub-object on OST idx 0/1: rc = -5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1475&quot;&gt;&lt;del&gt;LU-1475&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;</comment>
                            <comment id="52553" author="green" created="Sat, 16 Feb 2013 23:49:56 +0000"  >&lt;p&gt;This seems to be a very similar problem, though unlike lu-1475, here we see there is an error MDS gets from OST (in clear orphans) that prompts it to deactivate the export and this leads to object creation failures, where as in lu-1475 the failure is kind of sporadic without any real error visible other than ost restarting.&lt;/p&gt;

&lt;p&gt;So for this bug, I imagine somebody needs to figure out why the OST failed to clear orphans as requested.&lt;/p&gt;</comment>
                            <comment id="52567" author="yujian" created="Sun, 17 Feb 2013 06:42:47 +0000"  >&lt;p&gt;Hi Oleg,&lt;/p&gt;

&lt;p&gt;Sorry for the confusion. The original issue in this ticket occurred after failing over the OST, and the MDS dmesg log showed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -11
LustreError: 7725:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -5
LustreError: 7725:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7725:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Maloo report: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/40f45b4c-760f-11e2-b5e2-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/40f45b4c-760f-11e2-b5e2-52540035b04c&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, it seems the original issue is a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1475&quot; title=&quot;lov_update_create_set() error creating fid 0xf99b sub-object on OST idx 0/1: rc = -5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1475&quot;&gt;&lt;del&gt;LU-1475&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The second issue in this ticket occurred after failing over the MDS, and the MDS dmesg log showed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 3182:0:(lov_obd.c:1153:lov_clear_orphans()) error in orphan recovery on OST idx 1/7: rc = -16
LustreError: 2768:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 1 of 2 llog-records failed: -2
LustreError: 3182:0:(mds_lov.c:1057:__mds_lov_synchronize()) lustre-OST0001_UUID failed at mds_lov_clear_orphans: -16
LustreError: 3182:0:(mds_lov.c:1066:__mds_lov_synchronize()) lustre-OST0001_UUID sync failed -16, deactivating
Lustre: Service thread pid 2786 completed after 102.31s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LustreError: 2793:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 2622037: rc = -5
LustreError: 2793:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Maloo report: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/61cf03e0-76c7-11e2-bc2f-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/61cf03e0-76c7-11e2-bc2f-52540035b04c&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the second issue, as you suggested, we need to figure out why the OST failed to clear orphans. After looking into the console and debug logs on the OSS node, I did not see any failure messages. However, on the MDS, I found:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: Service thread pid 2786 was inactive for 60.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 2786, comm: ll_mdt_01

Call Trace:
 [&amp;lt;ffffffff8b03f220&amp;gt;] lustre_pack_request+0x630/0x6f0 [ptlrpc]
 [&amp;lt;ffffffff8006389f&amp;gt;] schedule_timeout+0x8a/0xad
 [&amp;lt;ffffffff8009a950&amp;gt;] process_timeout+0x0/0x5
 [&amp;lt;ffffffff8b105bd5&amp;gt;] osc_create+0xc75/0x13d0 [osc]
 [&amp;lt;ffffffff8005c3be&amp;gt;] cache_alloc_refill+0x108/0x188
 [&amp;lt;ffffffff8008f372&amp;gt;] default_wake_function+0x0/0xe
 [&amp;lt;ffffffff8b1b4edb&amp;gt;] qos_remedy_create+0x45b/0x570 [lov]
 [&amp;lt;ffffffff8002e4db&amp;gt;] __wake_up+0x38/0x4f
 [&amp;lt;ffffffff8008eb6b&amp;gt;] dequeue_task+0x18/0x37
 [&amp;lt;ffffffff8b1aedf3&amp;gt;] lov_fini_create_set+0x243/0x11e0 [lov]
 [&amp;lt;ffffffff8b1a2b72&amp;gt;] lov_create+0x1552/0x1860 [lov]
 [&amp;lt;ffffffff8b2952e5&amp;gt;] ldiskfs_mark_iloc_dirty+0x4a5/0x540 [ldiskfs]
 [&amp;lt;ffffffff8008f372&amp;gt;] default_wake_function+0x0/0xe
 [&amp;lt;ffffffff8b38eb8a&amp;gt;] mds_finish_open+0x1fea/0x43e0 [mds]
 [&amp;lt;ffffffff80019d5a&amp;gt;] __getblk+0x25/0x22c
 [&amp;lt;ffffffff8b28c16b&amp;gt;] __ldiskfs_handle_dirty_metadata+0xdb/0x110 [ldiskfs]
 [&amp;lt;ffffffff8b2952e5&amp;gt;] ldiskfs_mark_iloc_dirty+0x4a5/0x540 [ldiskfs]
 [&amp;lt;ffffffff8b295b07&amp;gt;] ldiskfs_mark_inode_dirty+0x187/0x1e0 [ldiskfs]
 [&amp;lt;ffffffff80063af9&amp;gt;] mutex_lock+0xd/0x1d
 [&amp;lt;ffffffff8b395e11&amp;gt;] mds_open+0x2f01/0x386b [mds]
 [&amp;lt;ffffffff8b03d5f1&amp;gt;] lustre_swab_buf+0x81/0x170 [ptlrpc]
 [&amp;lt;ffffffff8000d585&amp;gt;] dput+0x2c/0x114
 [&amp;lt;ffffffff8b36c0c5&amp;gt;] mds_reint_rec+0x365/0x550 [mds]
 [&amp;lt;ffffffff8b396d3e&amp;gt;] mds_update_unpack+0x1fe/0x280 [mds]
 [&amp;lt;ffffffff8b35eeda&amp;gt;] mds_reint+0x35a/0x420 [mds]
 [&amp;lt;ffffffff8b35ddea&amp;gt;] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
 [&amp;lt;ffffffff8b368c0e&amp;gt;] mds_intent_policy+0x49e/0xc10 [mds]
 [&amp;lt;ffffffff8affe270&amp;gt;] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
 [&amp;lt;ffffffff8affbeb6&amp;gt;] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
 [&amp;lt;ffffffff8aff87fd&amp;gt;] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
 [&amp;lt;ffffffff8b020870&amp;gt;] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
 [&amp;lt;ffffffff8b01db29&amp;gt;] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
 [&amp;lt;ffffffff8b367b40&amp;gt;] mds_handle+0x40e0/0x4d10 [mds]
 [&amp;lt;ffffffff8aed5868&amp;gt;] libcfs_ip_addr2str+0x38/0x40 [libcfs]
 [&amp;lt;ffffffff8aed5c7e&amp;gt;] libcfs_nid2str+0xbe/0x110 [libcfs]
 [&amp;lt;ffffffff8b048af5&amp;gt;] ptlrpc_server_log_handling_request+0x105/0x130 [ptlrpc]
 [&amp;lt;ffffffff8b04b874&amp;gt;] ptlrpc_server_handle_request+0x984/0xe00 [ptlrpc]
 [&amp;lt;ffffffff8b04bfd5&amp;gt;] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
 [&amp;lt;ffffffff8008d7a6&amp;gt;] __wake_up_common+0x3e/0x68
 [&amp;lt;ffffffff8b04cf16&amp;gt;] ptlrpc_main+0xf16/0x10e0 [ptlrpc]
 [&amp;lt;ffffffff8005dfc1&amp;gt;] child_rip+0xa/0x11
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And the debug log on the MDS showed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000004:02000000:0:1360747003.206214:0:3136:0:(mds_lov.c:1049:__mds_lov_synchronize()) MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
00020000:00080000:0:1360747003.206224:0:3136:0:(lov_obd.c:1116:lov_clear_orphans()) clearing orphans only for lustre-OST0000_UUID
00020000:01000000:0:1360747003.206225:0:3136:0:(lov_obd.c:1141:lov_clear_orphans()) Clear orphans for 0:lustre-OST0000_UUID
00000008:00080000:0:1360747003.206229:0:3136:0:(osc_create.c:556:osc_create()) lustre-OST0000-osc: oscc recovery started - delete to 23015
00000008:00080000:0:1360747003.206238:0:3136:0:(osc_request.c:406:osc_real_create()) @@@ delorphan from OST integration  req@ffff810068000c00 x1426846474323651/t0 o5-&amp;gt;lustre-OST0000_UUID@10.10.4.199@tcp:28/4 lens 400/592 e 0 to 1 dl 0 ref 1 fl New:/0/0 rc 0/0
......
00000004:02000000:0:1360747003.206849:0:3181:0:(mds_lov.c:1049:__mds_lov_synchronize()) MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
00020000:00080000:0:1360747003.206852:0:3181:0:(lov_obd.c:1116:lov_clear_orphans()) clearing orphans only for lustre-OST0000_UUID
00020000:01000000:0:1360747003.206853:0:3181:0:(lov_obd.c:1141:lov_clear_orphans()) Clear orphans for 0:lustre-OST0000_UUID
00020000:00020000:0:1360747003.206855:0:3181:0:(lov_obd.c:1153:lov_clear_orphans()) error in orphan recovery on OST idx 0/7: rc = -16
00000004:00020000:0:1360747003.207975:0:3181:0:(mds_lov.c:1057:__mds_lov_synchronize()) lustre-OST0000_UUID failed at mds_lov_clear_orphans: -16
00000004:00020000:0:1360747003.209032:0:3181:0:(mds_lov.c:1066:__mds_lov_synchronize()) lustre-OST0000_UUID sync failed -16, deactivating
00020000:01000000:0:1360747003.209906:0:3181:0:(lov_obd.c:589:lov_set_osc_active()) Marking OSC lustre-OST0000_UUID inactive
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The clear-orphans operation was performed twice for the same OST, and in the second run osc_create() returned -EBUSY because the OSCC_FLAG_SYNC_IN_PROGRESS flag was still set, owing to the slow ll_mdt service thread. __mds_lov_synchronize() then deactivated the OSC. The other OSCs also hit this issue and were deactivated.&lt;/p&gt;
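&lt;p&gt;To make the race concrete, here is a minimal user-space model of it (the names mirror the b1_8 ones, but this is an illustration only, not the actual Lustre source):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;&quot;model of the osc_create() -EBUSY race&quot;&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre&gt;#include &amp;lt;errno.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

/* illustration only: a user-space model of the race, not Lustre code */
#define OSCC_FLAG_SYNC_IN_PROGRESS 0x1

struct oscc_model { unsigned int oscc_flags; };

static int osc_create_model(struct oscc_model *oscc)
{
        /* the first synchronize thread set the flag and is still busy */
        if (oscc-&amp;gt;oscc_flags &amp;amp; OSCC_FLAG_SYNC_IN_PROGRESS)
                return -EBUSY;          /* the overlapping caller fails here */
        oscc-&amp;gt;oscc_flags |= OSCC_FLAG_SYNC_IN_PROGRESS;
        /* ... the delorphan request to the OST would run here ... */
        oscc-&amp;gt;oscc_flags &amp;amp;= ~OSCC_FLAG_SYNC_IN_PROGRESS;
        return 0;
}

int main(void)
{
        /* first clear-orphans run still in progress on this OSC */
        struct oscc_model o = { OSCC_FLAG_SYNC_IN_PROGRESS };
        /* the second, overlapping run gets -EBUSY (-16), after which
         * __mds_lov_synchronize() deactivates the OSC */
        printf(&quot;rc = %d\n&quot;, osc_create_model(&amp;amp;o));
        return 0;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;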

&lt;p&gt;Niu proposed a patch to fix this issue in &lt;a href=&quot;http://review.whamcloud.com/5450&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/5450&lt;/a&gt;. Could you please inspect it? Thanks.&lt;/p&gt;</comment>
                            <comment id="52627" author="pjones" created="Mon, 18 Feb 2013 10:20:51 +0000"  >&lt;p&gt;Niu&lt;/p&gt;

&lt;p&gt;Do we need this same change on master and/or b2_1?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="52665" author="niu" created="Mon, 18 Feb 2013 21:08:05 +0000"  >&lt;p&gt;Peter, we need to port it to b2_1. Master has different orphan cleanup mechanism, it needn&apos;t such fix.&lt;/p&gt;

&lt;p&gt;patch for b2_1: &lt;a href=&quot;http://review.whamcloud.com/5462&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/5462&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="53508" author="nozaki" created="Thu, 7 Mar 2013 01:54:27 +0000"  >&lt;p&gt;Hi All. I&apos;m sorry to interrupt you suddenly but I&apos;d like to say some words.&lt;/p&gt;

&lt;p&gt;1) In b2_1, IMHO, we should handle the -ENODEV case in the same way as -EBUSY, because when obd_stopping has already been set, mdtlov might have already been released (a rough sketch of this handling follows after point 3 below).&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;&quot;__mds_lov_synchronize()&quot;&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; __mds_lov_synchronize(void *data)
{
        struct mds_lov_sync_info *mlsi = data;
        struct obd_device *obd = mlsi-&amp;gt;mlsi_obd;
        struct obd_device *watched = mlsi-&amp;gt;mlsi_watched;
        struct mds_obd *mds = &amp;amp;obd-&amp;gt;u.mds;
        struct obd_uuid *uuid;
        __u32  idx = mlsi-&amp;gt;mlsi_index;
        struct mds_group_info mgi;
        struct llog_ctxt *ctxt;
        &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; rc = 0;
        ENTRY;

        OBD_FREE_PTR(mlsi);

        LASSERT(obd);
        LASSERT(watched);
        uuid = &amp;amp;watched-&amp;gt;u.cli.cl_target_uuid;
        LASSERT(uuid);

        cfs_down_read(&amp;amp;mds-&amp;gt;mds_notify_lock);
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (obd-&amp;gt;obd_stopping || obd-&amp;gt;obd_fail)   &lt;span class=&quot;code-comment&quot;&gt;/* &amp;lt;----- the case in question */&lt;/span&gt;
                GOTO(out, rc = -ENODEV);
        ....

        EXIT;
out:
        cfs_up_read(&amp;amp;mds-&amp;gt;mds_notify_lock);
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc) {
                &lt;span class=&quot;code-comment&quot;&gt;/* Deactivate it &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; safety */&lt;/span&gt;
                CERROR(&lt;span class=&quot;code-quote&quot;&gt;&quot;%s sync failed %d, deactivating\n&quot;&lt;/span&gt;, obd_uuid2str(uuid),
                       rc);
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!obd-&amp;gt;obd_stopping &amp;amp;&amp;amp; mds-&amp;gt;mds_lov_obd &amp;amp;&amp;amp;   &lt;span class=&quot;code-comment&quot;&gt;/* &amp;lt;----- there must be an address here, but we don&apos;t know whether it has been freed or not */&lt;/span&gt;
                    !mds-&amp;gt;mds_lov_obd-&amp;gt;obd_stopping &amp;amp;&amp;amp; !watched-&amp;gt;obd_stopping)
                        obd_notify(mds-&amp;gt;mds_lov_obd, watched,
                                   OBD_NOTIFY_INACTIVE, NULL);
        }

        class_decref(obd, &lt;span class=&quot;code-quote&quot;&gt;&quot;mds_lov_synchronize&quot;&lt;/span&gt;, obd);
        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; rc;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;&quot;mds_precleanup&quot;&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; mds_precleanup(struct obd_device *obd, &lt;span class=&quot;code-keyword&quot;&gt;enum&lt;/span&gt; obd_cleanup_stage stage)
{
        struct mds_obd *mds = &amp;amp;obd-&amp;gt;u.mds;
        struct llog_ctxt *ctxt;
        &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; rc = 0;
        ENTRY;

        &lt;span class=&quot;code-keyword&quot;&gt;switch&lt;/span&gt; (stage) {
        &lt;span class=&quot;code-keyword&quot;&gt;case&lt;/span&gt; OBD_CLEANUP_EARLY:
                &lt;span class=&quot;code-keyword&quot;&gt;break&lt;/span&gt;;
        &lt;span class=&quot;code-keyword&quot;&gt;case&lt;/span&gt; OBD_CLEANUP_EXPORTS:
                mds_lov_early_clean(obd);
                cfs_down_write(&amp;amp;mds-&amp;gt;mds_notify_lock);
                mds_lov_disconnect(obd);
                mds_lov_clean(obd);
                ctxt = llog_get_context(obd, LLOG_CONFIG_ORIG_CTXT);
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (ctxt)
                        llog_cleanup(ctxt);
                ctxt = llog_get_context(obd, LLOG_LOVEA_ORIG_CTXT);
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (ctxt)
                        llog_cleanup(ctxt);
                rc = obd_llog_finish(obd, 0);
                mds-&amp;gt;mds_lov_exp = NULL;
                cfs_up_write(&amp;amp;mds-&amp;gt;mds_notify_lock);
                &lt;span class=&quot;code-keyword&quot;&gt;break&lt;/span&gt;;
        }
        RETURN(rc);
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;2) I&apos;m not so sure, but the backported patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1291&quot; title=&quot;Test failure on test suite replay-single 44c&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1291&quot;&gt;&lt;del&gt;LU-1291&lt;/del&gt;&lt;/a&gt; (&lt;a href=&quot;http://review.whamcloud.com/#change,2708&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,2708&lt;/a&gt;) also looks necessary to me.&lt;br/&gt;
3) Don&apos;t we need the mds_notify_lock logic for b1_8 too?&lt;/p&gt;
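&lt;p&gt;For 1), the shape of the change I have in mind is roughly the following (a sketch only, against the b2_1 code quoted above, not the actual patch under review):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;&quot;sketch: tolerate -EBUSY and -ENODEV in __mds_lov_synchronize()&quot;&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre&gt;        if (rc) {
                /* sketch only: treat -EBUSY (a sync is already in progress)
                 * and -ENODEV (obd_stopping already set, so mdtlov may have
                 * been released) as non-fatal and skip the deactivation */
                if (rc != -EBUSY &amp;amp;&amp;amp; rc != -ENODEV) {
                        /* Deactivate it for safety */
                        CERROR(&quot;%s sync failed %d, deactivating\n&quot;,
                               obd_uuid2str(uuid), rc);
                        if (!obd-&amp;gt;obd_stopping &amp;amp;&amp;amp; mds-&amp;gt;mds_lov_obd &amp;amp;&amp;amp;
                            !mds-&amp;gt;mds_lov_obd-&amp;gt;obd_stopping &amp;amp;&amp;amp;
                            !watched-&amp;gt;obd_stopping)
                                obd_notify(mds-&amp;gt;mds_lov_obd, watched,
                                           OBD_NOTIFY_INACTIVE, NULL);
                }
        }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;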

&lt;p&gt;Anyway, I&apos;ll be glad if you find my comments helpful.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvj9z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6837</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>