<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:17:51 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8472] sanity-scrub test_5 times out </title>
                <link>https://jira.whamcloud.com/browse/LU-8472</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;sanity-scrub test 5 hangs and times out. Looking at the console logs, it&#8217;s not clear why the test is hanging. &lt;/p&gt;

&lt;p&gt;From the test_log, the last thing we see is:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Starting client: onyx-40vm1.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
CMD: onyx-40vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
CMD: onyx-40vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
CMD: onyx-40vm3,onyx-40vm7 /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
CMD: onyx-40vm3,onyx-40vm7 /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
fail_val=3
fail_loc=0x190
fail_val=3
fail_loc=0x190
  File: &apos;/mnt/lustre/d5.sanity-scrub/mds1/f5.sanity-scrub800&apos;
  Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d	Inode: 144115205322834801  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-08-01 19:00:37.000000000 -0700
Modify: 2016-08-01 18:59:51.000000000 -0700
Change: 2016-08-01 19:00:37.000000000 -0700
 Birth: -
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I don&#8217;t see any LBUGs or ASSERTIONs in any of the console logs. In the client logs, when this test completes, we see an mds_connect to the MDS fail and then a later successful connect. From the client dmesg here, we see the problematic connection attempts to the MDS, but the client never connects to the MDS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[13304.508640] Lustre: Unmounted lustre-client
[13304.511334] Lustre: Skipped 2 previous similar messages
[13367.608943] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
[13367.616737] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
[13368.970139] Lustre: DEBUG MARKER: grep -c /mnt/lustre&apos; &apos; /proc/mounts
[13368.978423] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
[13369.048679] Lustre: DEBUG MARKER: umount /mnt/lustre 2&amp;gt;&amp;amp;1
[13449.961128] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
[13449.968913] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
[13450.020802] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[13450.024064] LustreError: Skipped 2 previous similar messages
[13485.017940] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[13485.023276] LustreError: Skipped 6 previous similar messages
[13550.024059] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[13550.031793] LustreError: Skipped 12 previous similar messages
[13680.018274] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[13680.023602] LustreError: Skipped 25 previous similar messages
[13940.017773] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[13940.023074] LustreError: Skipped 51 previous similar messages
[14455.018070] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[14455.023708] LustreError: Skipped 102 previous similar messages
[15060.018374] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[15060.024085] LustreError: Skipped 120 previous similar messages
[15665.017755] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[15665.024000] LustreError: Skipped 120 previous similar messages
[16270.017567] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[16270.023955] LustreError: Skipped 120 previous similar messages
[16816.123140] SysRq : Show State
[16816.124024]   task                        PC stack   pid father
[16816.124024] systemd         S 0000000000000000     0     1      0 0x00000000
[16816.124024]  ffff88007c7b3db8 0000000000000086 ffff88007c7a8000 ffff88007c7b3fd8
[16816.124024]  ffff88007c7b3fd8 ffff88007c7b3fd8 ffff88007c7a8000 0000000000000000
[16816.124024]  0000000000000000 ffff88003693d5a0 ffff88007c7a8000 0000000000000000
[16816.124024] Call Trace:
[16816.124024]  [&amp;lt;ffffffff8163b809&amp;gt;] schedule+0x29/0x70
[16816.124024]  [&amp;lt;ffffffff8163a99d&amp;gt;] schedule_hrtimeout_range_clock+0x12d/0x150
[16816.124024]  [&amp;lt;ffffffff81228a59&amp;gt;] ? ep_scan_ready_list.isra.9+0x1b9/0x1f0
[16816.124024]  [&amp;lt;ffffffff8163a9d3&amp;gt;] schedule_hrtimeout_range+0x13/0x20
[16816.124024]  [&amp;lt;ffffffff81228cee&amp;gt;] ep_poll+0x23e/0x360
[16816.124024]  [&amp;lt;ffffffff8122b23d&amp;gt;] ? do_timerfd_settime+0x2ed/0x3a0
[16816.124024]  [&amp;lt;ffffffff810b88d0&amp;gt;] ? wake_up_state+0x20/0x20
[16816.124024]  [&amp;lt;ffffffff81229ded&amp;gt;] SyS_epoll_wait+0xed/0x120
[16816.124024]  [&amp;lt;ffffffff81646889&amp;gt;] system_call_fastpath+0x16/0x1b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These failures do NOT look like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8070&quot; title=&quot;sanity-scrub test_5 oom-killer and times out&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8070&quot;&gt;&lt;del&gt;LU-8070&lt;/del&gt;&lt;/a&gt; or &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8399&quot; title=&quot;MDT hung at lu_object_find_at during umount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8399&quot;&gt;&lt;del&gt;LU-8399&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Recent failure logs are at&lt;br/&gt;
 2016-07-22 - &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/b5c71fbc-5087-11e6-8968-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/b5c71fbc-5087-11e6-8968-5254006e85c2&lt;/a&gt;&lt;br/&gt;
2016-07-25 - &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/cdc385e8-52d9-11e6-bf87-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/cdc385e8-52d9-11e6-bf87-5254006e85c2&lt;/a&gt;&lt;br/&gt;
2016-07-25 - &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/eaaeec4a-52d6-11e6-bf87-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/eaaeec4a-52d6-11e6-bf87-5254006e85c2&lt;/a&gt;&lt;br/&gt;
2016-07-29 - &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/97c36552-5604-11e6-b5b1-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/97c36552-5604-11e6-b5b1-5254006e85c2&lt;/a&gt;&lt;br/&gt;
2016-07-29 - &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/5d523e04-55e9-11e6-aa74-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/5d523e04-55e9-11e6-aa74-5254006e85c2&lt;/a&gt;&lt;br/&gt;
2016-08-01 - &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/0dca7564-583b-11e6-aa74-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/0dca7564-583b-11e6-aa74-5254006e85c2&lt;/a&gt;&lt;br/&gt;
2016-08-01 - &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/64a45f36-584e-11e6-aa74-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/64a45f36-584e-11e6-aa74-5254006e85c2&lt;/a&gt;&lt;br/&gt;
2016-08-02 - &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/058493fc-588c-11e6-b5b1-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/058493fc-588c-11e6-b5b1-5254006e85c2&lt;/a&gt;&lt;/p&gt;</description>
                <environment>autotest review-dne-part-2</environment>
        <key id="38581">LU-8472</key>
            <summary>sanity-scrub test_5 times out </summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="yong.fan">nasf</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                    </labels>
                <created>Tue, 2 Aug 2016 20:11:52 +0000</created>
                <updated>Tue, 6 Feb 2018 16:31:48 +0000</updated>
                            <resolved>Mon, 26 Sep 2016 15:43:06 +0000</resolved>
                                    <version>Lustre 2.9.0</version>
                                    <fixVersion>Lustre 2.9.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="160749" author="adilger" created="Thu, 4 Aug 2016 02:26:08 +0000"  >&lt;p&gt;Fan Yong, can you please take a look at this.&lt;/p&gt;</comment>
                            <comment id="161813" author="yong.fan" created="Sat, 13 Aug 2016 05:22:32 +0000"  >&lt;p&gt;There is a known issue with DNE recovery and OI scrub: if the MDT is restored from a file-level backup, then when the MDT mounts it is quite possible that only part of the OI mappings are rebuilt; the rest are rebuilt in the background after the mount unless the &quot;-o noscrub&quot; option is specified. If &quot;-o noscrub&quot; is specified when mounting the MDT, the initial OI scrub only fixes the OI mappings of some important objects, and the other OI mappings remain invalid until the admin triggers OI scrub manually after the mount. Usually that is fine. But with DNE things become more complex, because DNE recovery depends on update logs, which are FID-based llogs. When the MDT mounts, it collects update logs from the other MDTs and parses them to replay the uncommitted modifications. These FID-based recovery RPCs depend on valid OI mappings, but as described above it is NOT guaranteed that all of the OI mappings have been rebuilt during recovery. So if the &quot;-o noscrub&quot; option is specified, the cross-MDT recovery may hit an &quot;-EREMCHG (-78)&quot; failure; otherwise it may hit &quot;-EINPROGRESS (-115)&quot; and retry forever until the related OI mapping is rebuilt.&lt;/p&gt;

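&lt;p&gt;For illustration, here is a minimal sketch of the two mount modes described above (device paths and target names are examples, and the exact lctl command and parameter paths may vary between Lustre versions):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# After restoring the MDT from a file-level backup:

# 1) Default mount: the remaining OI mappings are rebuilt in the background.
mount -t lustre /dev/mds1_dev /mnt/mds1

# 2) Mount with &quot;-o noscrub&quot;: only some important objects get their OI
#    mappings fixed at mount time; the admin must trigger OI scrub manually.
mount -t lustre -o noscrub /dev/mds1_dev /mnt/mds1
lctl lfsck_start -M lustre-MDT0000 -t scrub

# Check the OI scrub status on an ldiskfs-backed MDT.
lctl get_param -n osd-ldiskfs.lustre-MDT0000.oi_scrub
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
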
&lt;p&gt;In sanity-scrub test_5 we mount the MDT without &quot;-o noscrub&quot;, so the cross-MDT recovery triggers a full-speed OI scrub. However, the test sets the fail_loc OBD_FAIL_OSD_SCRUB_DELAY to slow the OI scrub down so that its status can be checked, which makes the OI scrub run very slowly (1 OI mapping per 3 seconds), so the recovery thread has to wait there and retry again and again. That is why the test timed out.&lt;/p&gt;</comment>
                            <comment id="161814" author="yong.fan" created="Sat, 13 Aug 2016 05:43:28 +0000"  >&lt;p&gt;It is not possible to totally avoid FID-based RPCs during OI scrub, but we need to minimize them, especially on the MDT side. There are at least three points that can be improved:&lt;/p&gt;

&lt;p&gt;1) The DNE logic should guarantee that once the sub-modifications on all related MDTs have been committed, the related update logs are cancelled from those MDTs. Otherwise, the next time the MDTs are remounted, unnecessary update log scanning is triggered. But we have observed related bugs on current master where update log cancellation fails:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[112275.488838] LustreError: 82035:0:(llog_cat.c:744:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;2) In update_recovery_exec(), before replaying the cross-MDT modifications, we need to check whether the related modifications/update logs have already been committed; if so, they should be skipped. But for implementation convenience, the current logic locates the object by the FID in the update log before checking whether it has been committed. That behaviour may trigger OI scrub and block, even when the update log should simply be skipped. So the related logic needs to be adjusted to avoid the unnecessary object location.&lt;/p&gt;

&lt;p&gt;3) The OI scrub tests should try to flush all modifications before file-level backup to avoid cross-MDT recovery after restore.&lt;/p&gt;
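
&lt;p&gt;For point 3), a hedged sketch of what the test could do before taking the file-level backup (the osd &quot;force_sync&quot; parameter name is an assumption borrowed from the test framework; the device and mount point names are examples):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Flush dirty data from the clients first.
sync

# Force the MDTs to commit pending transactions so that no update logs are
# left to replay after the restore (parameter name assumed).
lctl set_param -n osd*.*MDT*.force_sync=1

# Only then unmount the MDT and take the file-level (tar-based) backup.
umount /mnt/mds1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;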

&lt;p&gt;Currently, point 1) is in progress via &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8493&quot; title=&quot;Do not set stale flag for new created OSP object&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8493&quot;&gt;&lt;del&gt;LU-8493&lt;/del&gt;&lt;/a&gt;. I will work on point 3).&lt;br/&gt;
Di, would you please adjust the DNE recovery logic for point 2)? Thanks!&lt;/p&gt;</comment>
                            <comment id="161821" author="gerrit" created="Sat, 13 Aug 2016 14:44:46 +0000"  >&lt;p&gt;Fan Yong (fan.yong@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/21918&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/21918&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8472&quot; title=&quot;sanity-scrub test_5 times out &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8472&quot;&gt;&lt;del&gt;LU-8472&lt;/del&gt;&lt;/a&gt; scrub: try to avoid recovery during OI scrub&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 5b6e28c38bef45878e3ef2d595d52f67d927fc15&lt;/p&gt;</comment>
                            <comment id="165979" author="yong.fan" created="Wed, 14 Sep 2016 06:26:55 +0000"  >&lt;p&gt;We have hit many sanity-scrub test failures recently that are related to OI scrub being triggered while locating the update logs. We need the patch &lt;a href=&quot;http://review.whamcloud.com/21918&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/21918&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="167243" author="gerrit" created="Mon, 26 Sep 2016 15:19:11 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/21918/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/21918/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8472&quot; title=&quot;sanity-scrub test_5 times out &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8472&quot;&gt;&lt;del&gt;LU-8472&lt;/del&gt;&lt;/a&gt; scrub: try to avoid recovery during OI scrub&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: a41c6fad4672a60166088b9ad8aeb4f1b51c38e7&lt;/p&gt;</comment>
                            <comment id="167264" author="pjones" created="Mon, 26 Sep 2016 15:43:06 +0000"  >&lt;p&gt;Landed for 2.9&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="38765">LU-8493</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="41111">LU-8768</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="40149">LU-8646</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyjcv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>