<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:56:16 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
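For instance (assuming the standard JIRA issue XML view path for this issue), the full request would be https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-12858/LU-12858.xml?field=key&field=summary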
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12858] recovery-mds-scale test failover_ost fails due to dd failure &#8220;dd: closing output file &#8216;/mnt/lustre/*/dd-file&#8217;: Input/output error&#8221;</title>
                <link>https://jira.whamcloud.com/browse/LU-12858</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;recovery-mds-scale test_failover_ost fails with &apos;test_failover_ost returned 1&apos; due to a client dd failure&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;+ echoerr &apos;Total free disk space is 9057280, 4k blocks to dd is 1018944&apos;
+ echo &apos;Total free disk space is 9057280, 4k blocks to dd is 1018944&apos;
Total free disk space is 9057280, 4k blocks to dd is 1018944
+ df /mnt/lustre/d0.dd-trevis-41vm3.trevis.whamcloud.com
+ dd bs=4k count=1018944 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-trevis-41vm3.trevis.whamcloud.com/dd-file
dd: closing output file &#8216;/mnt/lustre/d0.dd-trevis-41vm3.trevis.whamcloud.com/dd-file&#8217;: Input/output error
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
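&lt;p&gt;For reference, the block count above follows directly from the reported free space: 45% of 9057280 1k blocks is 4075776 KB, i.e. 1018944 4k blocks. A minimal sketch of that arithmetic is below; whether the client load script always uses exactly this fraction is an assumption, it simply matches the numbers in this log.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# hypothetical reconstruction of the dd sizing seen in the log above
FREE_SPACE=9057280                       # free space in 1k blocks, as reported by df
BLKS=$((FREE_SPACE * 45 / 100 / 4))      # assumed 45% of free space, in 4k blocks (= 1018944)
dd bs=4k count=$BLKS status=noxfer if=/dev/zero \
   of=/mnt/lustre/d0.dd-trevis-41vm3.trevis.whamcloud.com/dd-file
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;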

&lt;p&gt;Looking at the suite_log for the failure at &lt;a href=&quot;https://testing.whamcloud.com/test_sets/cec5665e-ebfc-11e9-add9-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/cec5665e-ebfc-11e9-add9-52540065bddc&lt;/a&gt;, the last things we see are&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Client load  failed on node trevis-41vm3.trevis.whamcloud.com:
/autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.run__stdout.trevis-41vm3.trevis.whamcloud.com.log
/autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.run__debug.trevis-41vm3.trevis.whamcloud.com.log
2019-10-09 22:38:23 Terminating clients loads ...
Duration:               86400
Server failover period: 1200 seconds
Exited after:           20536 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 3 times
ost2: 1 times
ost3: 2 times
ost4: 2 times
ost5: 4 times
ost6: 4 times
ost7: 2 times
Status: FAIL: rc=1
CMD: trevis-41vm3,trevis-41vm4 test -f /tmp/client-load.pid &amp;amp;&amp;amp;
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
trevis-41vm3: sh: line 1: kill: (21671) - No such process
trevis-41vm4: sh: line 1: kill: (1981) - No such process
Dumping lctl log to /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.*.1570660705.log
CMD: trevis-41vm3.trevis.whamcloud.com,trevis-41vm5,trevis-41vm6,trevis-41vm7,trevis-41vm8 /usr/sbin/lctl dk &amp;gt; /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.debug_log.\$(hostname -s).1570660705.log;
         dmesg &amp;gt; /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.dmesg.\$(hostname -s).1570660705.log
trevis-41vm6: invalid parameter &apos;dump_kernel&apos;
trevis-41vm6: open(dump_kernel) failed: No such file or directory
trevis-41vm8: invalid parameter &apos;dump_kernel&apos;
trevis-41vm8: open(dump_kernel) failed: No such file or directory
test_failover_ost returned 1
FAIL failover_ost (21626s)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In dmesg for one of the OSSs (vm5), we see errors around this time:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 1239.800591] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
[ 1239.844533] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1241.423507] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost6 has failed over 4 times, and counting...
[ 1241.642352] Lustre: DEBUG MARKER: ost6 has failed over 4 times, and counting...
[ 1244.272791] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
[ 1244.790911] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
[ 1244.795648] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 7 previous similar messages
[ 1245.884378] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
[ 1245.887879] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 24 previous similar messages
[ 1247.888052] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
[ 1247.891749] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 62 previous similar messages
[ 1249.852546] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1251.609811] Lustre: lustre-OST0006: Recovery over after 0:14, of 3 clients 3 recovered and 0 were evicted.
[ 1251.638584] Lustre: lustre-OST0006: deleting orphan objects from 0x0:30580 to 0x0:30657
[ 1251.933830] Lustre: 9281:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0002: extended recovery timer reaching hard limit: 180, extend: 1
[ 1251.936545] Lustre: 9281:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 84 previous similar messages
[ 1252.799744] Lustre: lustre-OST0004: Recovery over after 0:32, of 3 clients 3 recovered and 0 were evicted.
[ 1252.811613] Lustre: lustre-OST0004: deleting orphan objects from 0x0:30578 to 0x0:30657
[ 1255.963626] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[ 1255.966033] Lustre: lustre-OST0000: disconnecting 1 stale clients
[ 1255.968200] LustreError: 6223:0:(tgt_grant.c:248:tgt_grant_sanity_check()) ofd_obd_disconnect: tot_granted 15728640 != fo_tot_granted 69206016
[ 1255.982021] Lustre: lustre-OST0000: Denying connection for new client 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp), waiting for 3 known clients (0 recovered, 2 in progress, and 1 evicted) already passed deadline 0:00
[ 1258.899246] Lustre: lustre-OST0000: Denying connection for new client 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp), waiting for 3 known clients (0 recovered, 2 in progress, and 1 evicted) already passed deadline 0:03
[ 1263.907198] Lustre: lustre-OST0000: Denying connection for new client 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp), waiting for 3 known clients (2 recovered, 0 in progress, and 1 evicted) already passed deadline 0:08
[ 1265.865287] Lustre: lustre-OST0002: Recovery over after 0:55, of 3 clients 3 recovered and 0 were evicted.
[ 1265.874909] Lustre: lustre-OST0002: deleting orphan objects from 0x0:30610 to 0x0:30689
[ 1266.197943] Lustre: lustre-OST0001: deleting orphan objects from 0x0:30126 to 0x0:30209
[ 1268.576630] Lustre: lustre-OST0000: Recovery over after 1:13, of 3 clients 2 recovered and 1 was evicted.
[ 1268.579517] Lustre: Skipped 1 previous similar message
[ 1268.580940] Lustre: lustre-OST0000: deleting orphan objects from 0x0:30740 to 0x0:30817
[ 1268.750115] Lustre: lustre-OST0003: deleting orphan objects from 0x0:30511 to 0x0:30593
[ 1268.752368] Lustre: lustre-OST0005: deleting orphan objects from 0x0:30479 to 0x0:30561
[ 1268.916992] Lustre: lustre-OST0000: Connection restored to 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp)
[ 1268.920054] Lustre: Skipped 10 previous similar messages
[ 1269.886545] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1271.213350] LustreError: 16821:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1272.178278] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1272.182794] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 1 previous similar message
[ 1273.571547] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1273.575165] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 2 previous similar messages
[ 1275.741885] LustreError: 16684:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1275.744315] LustreError: 16684:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 4 previous similar messages
[ 1279.847093] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1279.850332] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 13 previous similar messages
[ 1288.005172] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1288.009998] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 20 previous similar messages
[ 1304.272729] LustreError: 16821:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1304.275722] LustreError: 16821:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 29 previous similar messages
[ 1309.950454] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1309.953964] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages
[ 1336.378100] LustreError: 16822:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1336.382030] LustreError: 16822:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 62 previous similar messages
[ 1395.028643] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1395.030801] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 5 previous similar messages
[ 1400.527325] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1400.530467] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 102 previous similar messages
[ 1525.162524] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1525.165481] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 10 previous similar messages
[ 1790.173516] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1790.175909] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 14 previous similar messages
[ 2315.566821] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Duration:               86400
Server failover period: 1200 seconds
Exited after:           20536 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 3 times
ost2: 1 times
ost3: 2 times
ost4: 2 times
ost5: 4 times
ost6: 4 times
[ 2316.767711] Lustre: DEBUG MARKER: Duration: 86400
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment></environment>
        <key id="57149">LU-12858</key>
            <summary>recovery-mds-scale test failover_ost fails due to dd failure &#8220;dd: closing output file &#8216;/mnt/lustre/*/dd-file&#8217;: Input/output error&#8221;</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                    </labels>
                <created>Mon, 14 Oct 2019 18:52:11 +0000</created>
                <updated>Thu, 22 Dec 2022 17:11:39 +0000</updated>
                                            <version>Lustre 2.12.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="256445" author="pjones" created="Tue, 15 Oct 2019 23:27:19 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="256734" author="tappro" created="Mon, 21 Oct 2019 10:25:31 +0000"  >&lt;p&gt;At first sign this doesn&apos;t looks as something new, Ive checked commit history and don&apos;t see recent grant-related changes.&lt;/p&gt;</comment>
                            <comment id="289529" author="sarah" created="Thu, 14 Jan 2021 20:01:42 +0000"  >&lt;p&gt;Hit similar error but failed with tar on master failover testing&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_sets/671027cc-e2d2-4a27-afc9-26ced0d16cfc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/671027cc-e2d2-4a27-afc9-26ced0d16cfc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tar stdout log shows&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;tar: etc/dbus-1/session.conf: Cannot write: Resource temporarily unavailable
tar: Exiting with failure status due to previous errors
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="357254" author="sarah" created="Thu, 22 Dec 2022 17:11:17 +0000"  >&lt;p&gt;similar error in 2.15.2 testing&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://testing.whamcloud.com/test_sets/eb052d8a-be69-41a1-b73d-0d45ae9c4e5d&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/eb052d8a-be69-41a1-b73d-0d45ae9c4e5d&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
tar: etc/subuid: Cannot change ownership to uid 0, gid 0: No such file or directory
tar: Exiting with failure status due to previous errors

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="54315">LU-11791</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00o07:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>