<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:26:43 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
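
As a sketch of how such a request could be assembled (the '/si/jira.issueviews:issue-xml/' path below is an assumption
based on JIRA's standard XML issue view endpoint, not something stated in this file):

```shell
# Build a field-restricted URL for this issue's XML view (endpoint path assumed).
base='https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-16404/LU-16404.xml'
url="${base}?field=key&field=summary"
echo "$url"
```

Fetching that URL would return a document like this one, limited to the key and summary elements.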
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16404] sanity test_230n: lfs mirror: cannot get UNLOCK lease, Mirroring failed</title>
                <link>https://jira.whamcloud.com/browse/LU-16404</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This issue was created by maloo for S Buisson &amp;lt;sbuisson@ddn.com&amp;gt;&lt;/p&gt;

&lt;p&gt;This issue relates to the following test suite run: &lt;a href=&quot;https://testing.whamcloud.com/test_sets/65516983-67f2-4883-b7ab-1cb5abdf58e5&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/65516983-67f2-4883-b7ab-1cb5abdf58e5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;test_230n failed with the following error:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mirroring failed
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Test output is:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;== sanity test 230n: Dir migration with mirrored file ==== 22:36:31 (1671057391)
lfs mirror mirror: cannot get UNLOCK lease, ext 4: No such file or directory (2)
error: lfs mirror extend: /mnt/lustre/d230n.sanity/f230n.sanity: cannot merge layout: No such file or directory
 sanity test_230n: @@@@@@ FAIL: Mirroring failed 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the client dmesg, we can see that the client was evicted by the MDS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[13212.828154] LustreError: 866765:0:(file.c:242:ll_close_inode_openhandle()) lustre-clilmv-ffff93f6048ef000:
    inode [0x200002b15:0x5b9b:0x0] mdc close failed: rc = -2
[13212.831301] LustreError: 11-0: lustre-MDT0000-mdc-ffff93f6048ef000: operation ldlm_enqueue to node 10.240.38.63@tcp failed: rc = -107
[13212.831308] Lustre: lustre-MDT0000-mdc-ffff93f6048ef000: Connection to lustre-MDT0000 (at 10.240.38.63@tcp) was lost;
    in progress operations using this service will wait for recovery to complete
[13212.833555] LustreError: Skipped 1 previous similar message
[13212.838262] LustreError: 167-0: lustre-MDT0000-mdc-ffff93f6048ef000: This client was evicted by lustre-MDT0000;
    in progress operations using this service will fail.
[13212.847340] Lustre: lustre-MDT0000-mdc-ffff93f6048ef000: Connection restored to 10.240.38.63@tcp (at 10.240.38.63@tcp)
[13213.270161] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_230n: @@@@@@ FAIL: Mirroring failed 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The MDS logs show the client was not responsive to the lock callback, leading to the eviction:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[13026.421567] Lustre: DEBUG MARKER: == sanity test 230n: Dir migration with mirrored file ==== 03:41:59 (1679283719)
[13070.199369] Lustre: mdt_rdpg00_003: service thread pid 541635 was inactive for 42.970 seconds. The thread might be hung,
     or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[13070.203029] Pid: 541635, comm: mdt_rdpg00_003 4.18.0-348.23.1.el8_lustre.x86_64 #1 SMP Thu Mar 2 00:54:25 UTC 2023
[13070.204965] Call Trace TBD:
[13070.206179] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
[13070.207308] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
[13070.208596] mdt_object_local_lock+0x509/0xb30 [mdt]
[13070.209666] mdt_object_lock_internal+0x18d/0x4a0 [mdt]
[13070.210781] mdt_object_lock+0x1b/0x20 [mdt]
[13070.211735] mdt_close_handle_layouts+0x935/0x10b0 [mdt]
[13070.212865] mdt_mfd_close+0x510/0xbc0 [mdt]
[13070.213811] mdt_close_internal+0xcc/0x250 [mdt]
[13070.214833] mdt_close+0x2c0/0x8b0 [mdt]
[13070.215745] tgt_request_handle+0xc8c/0x1950 [ptlrpc]
[13070.216850] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[13070.218096] ptlrpc_main+0xc4e/0x1510 [ptlrpc]
[13127.540308] LustreError: 482890:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 100s:
     evicting client at 10.240.38.60@tcp  ns: mdt-lustre-MDT0000_UUID lock: 0000000001a8df2b/0x3c9ac870599a2eb7
     lrc: 3/0,0 mode: CR/CR res: [0x200002b15:0x5b9b:0x0].0x0 bits 0xd/0x0 rrc: 4 type: IBT gid 0 flags: 0x60200400000020
     nid: 10.240.38.60@tcp remote: 0x5d158cc36e2e425f expref: 49 pid: 537456 timeout: 13128 lvb_type: 0
[13128.435958] Lustre: DEBUG MARKER: sanity test_230n: @@@@@@ FAIL: Mirroring failed
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV&lt;br/&gt;
sanity test_230n - Mirroring failed&lt;/p&gt;</description>
                <environment></environment>
        <key id="73644">LU-16404</key>
            <summary>sanity test_230n: lfs mirror: cannot get UNLOCK lease, Mirroring failed</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                <statusCategory id="2" key="new" colorName="default"/>
                <resolution id="-1">Unresolved</resolution>
                <assignee username="wc-triage">WC Triage</assignee>
                <reporter username="maloo">Maloo</reporter>
                <labels>
                </labels>
                <created>Thu, 15 Dec 2022 07:16:24 +0000</created>
                <updated>Mon, 20 Mar 2023 22:56:24 +0000</updated>
                <due></due>
                <votes>0</votes>
                <watches>3</watches>
                <comments>
                            <comment id="360514" author="adilger" created="Thu, 26 Jan 2023 17:19:05 +0000"  >&lt;p&gt;+12 failures in the past 4 weeks, sometimes &quot;Mirroring failed&quot;, sometimes &quot;Migration failed&quot;.&lt;/p&gt;

&lt;p&gt;The failures started on 2022-11-09 (there were a few earlier failures, but those were caused by previous subtest failures), and there have been 21 failures on master since then (every 3-7 days), with none on any other branch. The first failure was for a patch that fixes a later test in sanity, so it is very unlikely to be the source of the MDS eviction. &lt;/p&gt;

&lt;p&gt;There were 45 patches landed to master between 2022-11-01 and 2022-11-10:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# git log --oneline --after 2022-11-01 --before 2022-11-10
3737b28a9d LU-10003 lnet: to properly handle errors reported use ssize_t
93784852c8 LU-15825 ldlm: remove linefeed from LDLM_DEBUG
ed71af843d LU-16280 tests: conf-sanity/117 call setup when FS is not mounted
9a0a89520e LU-16251 obdclass: fill jobid in a safe way
b104c0a277 LU-14377 tests: make parallel-scale/rr_alloc less strict
38a52f2fc3 LU-16243 tests: Specify ping source in test_218
d55fe25a95 LU-16215 kfilnd: Use immediate for routed GETs
78c681d9f4 LU-16207 build: add rpm-build BuildRequires for SLES15 SP3
95f7ef6094 LU-16111 build: Fix include of stddef.h
0171801df5 LU-8151 obd: Show correct shadow mountpoints for server
26e765e0d2 LU-15011 tests: additional checks for pool spilling
41610e6207 LU-8837 mgc: move server-only code out of mgc_request.c
f32bdde057 LU-8837 lustre: make ldlm and target file lists
ef90a02d12 LU-13135  quota: improve checks in OSDs to ignore quota
896cd5b7bc LU-10391 lnet: fix build issue when IPv6 is disabled.
122644ae19 LU-15847 target: report multiple transno to debug log
cf121b1668 LU-16203 llog: skip bad records in llog
53b3d1de99 LU-16152 utils: fix integer overflow in cYAML parser
db0fb8f2b6 LU-10391 lnet: allow ping packet to contain large nids
5243630b09 LU-15947 obdclass: improve precision of wakeups for mod_rpcs
8f8f6e2f36 LU-10003 lnet: use Netlink to support old and new NI APIs.
85941b9fb9 LU-15117 ofd: no lock for dt_bufs_get() in read path
293844d132 LU-16222 kernel: RHEL 8.7 client and server support
d3fc875617 LU-16256 tests: sanity-flr to remove temp files
f97ce54357 LU-15564 osd: add allocation time histogram
9aaaebfa22 LU-16252 tests: test_129() fix for not combined_mgs_mds
d41a508a49 LU-16241 obdclass: NL_SET_ERR_MSG work for older kernel
aaab50dd88 LU-16249 sec: krb5_decrypt_bulk calls decryption primitive
b533700add LU-16232 scripts: changelog/updatelog emergency cleanup
b4e06ff1e4 LU-16172 o2iblnd: add verbose debug prints for rx/tx events
c619b6d6a5 LU-16258 llite: Explicitly support .splice_write
8f718df474 LU-15852 lnet: Don&apos;t modify uptodate peer with temp NI
4f2914537c LU-15139 osp: block reads until the object is created
ec1b0dc6e4 LU-15626 tests: Fix &quot;error&quot; reported by shellcheck for replay-dual
8e563553c0 LU-15626 tests: Fix &quot;error&quot; reported by shellcheck
b2274a7160 LU-15619 osc: Remove oap lock
7e0a2b0737 LU-15046 osp: precreate thread vs connect race
ed0f74cd63 LU-6142 obdclass: make ccc_users in cl_client_cache a refcount_t
5b911e0326 LU-16160 llite: clear stale page&apos;s uptodate bit
1a79d395dd LU-15935 target: keep track of multirpc slots in last_rcvd
62fd8f9b49 Revert &quot;LU-16046 revert: &quot;LU-9964 llite: prevent mulitple group locks&quot;&quot;
dd609c6f31 Revert &quot;LU-16046 ldlm: group lock fix&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Nothing immediately jumps out as an obvious cause. One possibility might be &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15947&quot; title=&quot;Spinlock contention during wake_up_all() in obd_put_mod_rpc_slot()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15947&quot;&gt;&lt;del&gt;LU-15947&lt;/del&gt;&lt;/a&gt; obdclass: improve precision of wakeups for mod_rpcs&lt;/tt&gt;&quot; if there was some problem with the wakeups. &lt;/p&gt;

&lt;p&gt;Patches committed the previous week seem even less likely:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# git log --oneline --after 2022-10-25 --before 2022-11-01
6aee406c84 LU-14719 lod: distributed transaction check space
d4848d779b LU-16187 tests: Fix is_project_quota_supported()
9309a32c1a LU-16240 build: Use new AS_HELP_STRING
83d3f42118 LU-15305 obdclass: fix race in class_del_profile
1df5199097 LU-15807 ksocklnd: fix irq lock inversion while calling sk_data_ready()
8a2bd9661d LU-16177 kernel: kernel update RHEL9.0 [5.14.0-70.26.1.el9_0]
5547158daa LU-16175 kernel: kernel update SLES12 SP5 [4.12.14-122.133.1]
5d31fdaf68 LU-16174 kernel: kernel update SLES15 SP4 [5.14.21-150400.24.21.2]
c3467db7e7 LU-16173 kernel: kernel update SLES15 SP3 [5.3.18-150300.59.93.1]
3493db6a67 LU-16233 build: Add always target for SUSE15 SP3 LTSS
950e59ced1 LU-15234 lnet: add mechanism for dumping lnd peer debug info
0a639a32cf LU-15795 lbuild: enable KABI
c5a436d849 LU-12130 test: pool inheritance for mdt component
78dddb423f LU-15447 tests: sanity-flr/208 reset rotational status
c24a38b583 LU-16183 test: sanity-hsm/70 should detect python
04e5fa7b24 LU-13175 tests: sanity/803 to sync MDTs for actual statfs
2e08974390 LU-16139 statahead: avoid to block ptlrpcd interpret context
a966b624ac LU-16149 lnet: Discovery queue and deletion race
0a317b171e LU-15847 tgt: move tti_ transaction params to tsi_
4e2e8fd2fc LU-15847 tgt: reply always with the latest assigned transno
4468f6c9d9 LU-16025 llite: adjust read count as file got truncated
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14719&quot; title=&quot;&amp;quot;lfs migrate -m&amp;quot; creates broken agent inodes when target MDT full&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14719&quot;&gt;&lt;del&gt;LU-14719&lt;/del&gt;&lt;/a&gt; patch has had a few issues, but those are mostly during recovery, which isn&apos;t the case here. &lt;/p&gt;

&lt;p&gt;It might be necessary &lt;/p&gt;</comment>
                            <comment id="366511" author="adilger" created="Mon, 20 Mar 2023 18:43:03 +0000"  >&lt;p&gt;May possibly relate to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16633&quot; title=&quot;obd_get_mod_rpc_slot() is vulnerable to races&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16633&quot;&gt;&lt;del&gt;LU-16633&lt;/del&gt;&lt;/a&gt;, which relates to a bug initially introduced by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15947&quot; title=&quot;Spinlock contention during wake_up_all() in obd_put_mod_rpc_slot()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15947&quot;&gt;&lt;del&gt;LU-15947&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="366514" author="gerrit" created="Mon, 20 Mar 2023 19:11:52 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/50343&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/50343&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16404&quot; title=&quot;sanity test_230n: lfs mirror: cannot get UNLOCK lease, Mirroring failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16404&quot;&gt;LU-16404&lt;/a&gt; tests: check if wakeup race fixes issue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 69aa93cf93a7c2648d4fbb2441b96c5a53ba671f&lt;/p&gt;</comment>
                            <comment id="366518" author="adilger" created="Mon, 20 Mar 2023 19:21:42 +0000"  >&lt;p&gt;If patch &lt;a href=&quot;https://review.whamcloud.com/50261&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/50261&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16633&quot; title=&quot;obd_get_mod_rpc_slot() is vulnerable to races&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16633&quot;&gt;&lt;del&gt;LU-16633&lt;/del&gt;&lt;/a&gt; obdclass: fix rpc slot leakage&lt;/tt&gt;&quot; fixes this issue, it implies that the client AST reply can be blocked due to &lt;tt&gt;max_rpcs_in_flight&lt;/tt&gt; limits.  That makes me wonder whether we are blocking AST RPCs due to &lt;tt&gt;max_rpcs_in_flight&lt;/tt&gt;, or whether there should be a higher RPC limit for AST replies than for regular MDC/OSC requests.&lt;/p&gt;

&lt;p&gt;The client shouldn&apos;t be evicted if it cannot send an AST reply because all of the RPC slots are used for long-running RPC requests (e.g. writes or LDLM enqueues that are blocked on different locks).  The OST lock prolonging only helps for timeouts on locks actively undergoing writeout, but doesn&apos;t help if there is an AST on a different resource and the client can&apos;t even &lt;b&gt;send&lt;/b&gt; writes under those locks because all of the RPC slots are full.&lt;/p&gt;</comment>
                            <comment id="366564" author="adilger" created="Mon, 20 Mar 2023 22:56:24 +0000"  >&lt;p&gt;The test patch passed 200x iterations of sanity test_230n with the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16633&quot; title=&quot;obd_get_mod_rpc_slot() is vulnerable to races&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16633&quot;&gt;&lt;del&gt;LU-16633&lt;/del&gt;&lt;/a&gt; patch.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="75015">LU-16633</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i03827:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>