<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:02:37 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13599] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs-&gt;rs_nlocks &lt; 8 ) failed</title>
                <link>https://jira.whamcloud.com/browse/LU-13599</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We hit the following crash on one of the MDSes of Fir last night, running Lustre 2.12.4. The same problem occurred after a re-mount of the MDT and recovery. I had to kill the robinhood client that was running a purge but also an MDT-to-MDT migration (a single lfs migrate -m 0). Then I was able to remount this MDT. It looks a bit like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5185&quot; title=&quot;NFS export of DNE: (service.c:193:ptlrpc_save_lock()) ASSERTION( rs-&amp;gt;rs_nlocks &amp;lt; 8 ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5185&quot;&gt;&lt;del&gt;LU-5185&lt;/del&gt;&lt;/a&gt;. Unfortunately, we lost the vmcore this time. But if it happens again, I&apos;ll let you know and will attach it.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[1579975.369592] Lustre: fir-MDT0003: haven&apos;t heard from client 6fb18a53-0376-4 (at 10.50.6.54@o2ib2) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff9150778e2400, cur 1590281140 expire 1590280990 last 1590280913
[1580039.870825] Lustre: fir-MDT0003: Connection restored to b7b3778b-82b4-4 (at 10.50.6.54@o2ib2)
[1600137.924189] Lustre: fir-MDT0003: haven&apos;t heard from client ed7f1c7c-f5de-4 (at 10.50.4.54@o2ib2) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff913f6360b800, cur 1590301302 expire 1590301152 last 1590301075
[1619876.303011] Lustre: fir-MDT0003: Connection restored to 796a800c-02e4-4 (at 10.49.20.10@o2ib1)
[1639768.911234] Lustre: fir-MDT0003: Connection restored to 83415c02-51ff-4 (at 10.49.20.5@o2ib1)
[1639821.000702] Lustre: fir-MDT0003: haven&apos;t heard from client 83415c02-51ff-4 (at 10.49.20.5@o2ib1) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff913f32f56800, cur 1590340984 expire 1590340834 last 1590340757
[1647672.215034] Lustre: fir-MDT0003: haven&apos;t heard from client 19e3d49f-43e4-4 (at 10.50.9.37@o2ib2) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff913f6240bc00, cur 1590348835 expire 1590348685 last 1590348608
[1647717.069200] Lustre: fir-MDT0003: Connection restored to a4c7b337-bfab-4 (at 10.50.9.37@o2ib2)
[1667613.717650] Lustre: fir-MDT0003: haven&apos;t heard from client 20e68a82-bbdb-4 (at 10.50.6.54@o2ib2) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff914d1725c400, cur 1590368776 expire 1590368626 last 1590368549
[1667717.713398] Lustre: fir-MDT0003: Connection restored to b7b3778b-82b4-4 (at 10.50.6.54@o2ib2)
[1692403.249073] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs-&amp;gt;rs_nlocks &amp;lt; 8 ) failed: 
[1692403.258985] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) LBUG
[1692403.265867] Pid: 30166, comm: mdt00_002 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
[1692403.276224] Call Trace:
[1692403.278866]  [&amp;lt;ffffffffc0aff7cc&amp;gt;] libcfs_call_trace+0x8c/0xc0 [libcfs]
[1692403.285611]  [&amp;lt;ffffffffc0aff87c&amp;gt;] lbug_with_loc+0x4c/0xa0 [libcfs]
[1692403.292002]  [&amp;lt;ffffffffc0fb8851&amp;gt;] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
[1692403.298695]  [&amp;lt;ffffffffc14b4bab&amp;gt;] mdt_save_lock+0x20b/0x360 [mdt]
[1692403.305003]  [&amp;lt;ffffffffc14b4d5c&amp;gt;] mdt_object_unlock+0x5c/0x3c0 [mdt]
[1692403.311572]  [&amp;lt;ffffffffc14b82e7&amp;gt;] mdt_object_unlock_put+0x17/0x120 [mdt]
[1692403.318479]  [&amp;lt;ffffffffc150c4fc&amp;gt;] mdt_unlock_list+0x54/0x174 [mdt]
[1692403.324876]  [&amp;lt;ffffffffc14d3fd3&amp;gt;] mdt_reint_migrate+0xa03/0x1310 [mdt]
[1692403.331619]  [&amp;lt;ffffffffc14d4963&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
[1692403.337841]  [&amp;lt;ffffffffc14b1273&amp;gt;] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[1692403.344586]  [&amp;lt;ffffffffc14bc6e7&amp;gt;] mdt_reint+0x67/0x140 [mdt]
[1692403.350463]  [&amp;lt;ffffffffc101c64a&amp;gt;] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1692403.357586]  [&amp;lt;ffffffffc0fbf43b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1692403.365483]  [&amp;lt;ffffffffc0fc2da4&amp;gt;] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1692403.371991]  [&amp;lt;ffffffffb98c2e81&amp;gt;] kthread+0xd1/0xe0
[1692403.377079]  [&amp;lt;ffffffffb9f77c24&amp;gt;] ret_from_fork_nospec_begin+0xe/0x21
[1692403.383725]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[1692403.388918] Kernel panic - not syncing: LBUG
[1692403.393363] CPU: 44 PID: 30166 Comm: mdt00_002 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
[1692403.406214] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.10.6 08/15/2019
[1692403.414040] Call Trace:
[1692403.416670]  [&amp;lt;ffffffffb9f65147&amp;gt;] dump_stack+0x19/0x1b
[1692403.421981]  [&amp;lt;ffffffffb9f5e850&amp;gt;] panic+0xe8/0x21f
[1692403.426954]  [&amp;lt;ffffffffc0aff8cb&amp;gt;] lbug_with_loc+0x9b/0xa0 [libcfs]
[1692403.433351]  [&amp;lt;ffffffffc0fb8851&amp;gt;] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
[1692403.439978]  [&amp;lt;ffffffffc14b4bab&amp;gt;] mdt_save_lock+0x20b/0x360 [mdt]
[1692403.446259]  [&amp;lt;ffffffffc14b4d5c&amp;gt;] mdt_object_unlock+0x5c/0x3c0 [mdt]
[1692403.452796]  [&amp;lt;ffffffffc14b82e7&amp;gt;] mdt_object_unlock_put+0x17/0x120 [mdt]
[1692403.459687]  [&amp;lt;ffffffffc150c4fc&amp;gt;] mdt_unlock_list+0x54/0x174 [mdt]
[1692403.466057]  [&amp;lt;ffffffffc14d3fd3&amp;gt;] mdt_reint_migrate+0xa03/0x1310 [mdt]
[1692403.472794]  [&amp;lt;ffffffffc0d3cfa9&amp;gt;] ? check_unlink_entry+0x19/0xd0 [obdclass]
[1692403.479942]  [&amp;lt;ffffffffc14d4963&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
[1692403.486134]  [&amp;lt;ffffffffc14b1273&amp;gt;] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[1692403.492843]  [&amp;lt;ffffffffc14bc6e7&amp;gt;] mdt_reint+0x67/0x140 [mdt]
[1692403.498721]  [&amp;lt;ffffffffc101c64a&amp;gt;] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1692403.505806]  [&amp;lt;ffffffffc0ff40b1&amp;gt;] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[1692403.513553]  [&amp;lt;ffffffffc0affbde&amp;gt;] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1692403.520811]  [&amp;lt;ffffffffc0fbf43b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1692403.528673]  [&amp;lt;ffffffffc0fbb565&amp;gt;] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[1692403.535632]  [&amp;lt;ffffffffb98cfeb4&amp;gt;] ? __wake_up+0x44/0x50
[1692403.541065]  [&amp;lt;ffffffffc0fc2da4&amp;gt;] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1692403.547537]  [&amp;lt;ffffffffc0fc2270&amp;gt;] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[1692403.555106]  [&amp;lt;ffffffffb98c2e81&amp;gt;] kthread+0xd1/0xe0
[1692403.560158]  [&amp;lt;ffffffffb98c2db0&amp;gt;] ? insert_kthread_work+0x40/0x40
[1692403.566424]  [&amp;lt;ffffffffb9f77c24&amp;gt;] ret_from_fork_nospec_begin+0xe/0x21
[1692403.573037]  [&amp;lt;ffffffffb98c2db0&amp;gt;] ? insert_kthread_work+0x40/0x40
[root@fir-md1-s4 127.0.0.1-2020-05-25-00:59:34]# 
Message from syslogd@fir-md1-s4 at May 25 09:46:42 ...
 kernel:LustreError: 29280:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs-&amp;gt;rs_nlocks &amp;lt; 8 ) failed: 

Message from syslogd@fir-md1-s4 at May 25 09:46:42 ...
 kernel:LustreError: 29280:0:(service.c:189:ptlrpc_save_lock()) LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;May 25 09:55:43 fir-md1-s1 kernel: Lustre: fir-MDT0003: Recovery over after 2:45, of 1302 clients 1301 recovered and 1 was evicted.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment></environment>
        <key id="59318">LU-13599</key>
            <summary>LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs-&gt;rs_nlocks &lt; 8 ) failed</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Mon, 25 May 2020 18:25:48 +0000</created>
                <updated>Tue, 4 Oct 2022 16:10:42 +0000</updated>
                            <resolved>Wed, 9 Dec 2020 19:21:13 +0000</resolved>
                                    <version>Lustre 2.12.4</version>
                                    <fixVersion>Lustre 2.14.0</fixVersion>
                    <fixVersion>Lustre 2.12.6</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="271127" author="pjones" created="Tue, 26 May 2020 14:16:34 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;Any ideas here?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="271171" author="tappro" created="Tue, 26 May 2020 16:54:00 +0000"  >&lt;p&gt;The &lt;tt&gt;mdt_reint_migrate()&lt;/tt&gt; can take quite a lot of LDLM locks while &lt;tt&gt;ptlrpc_reply_state&lt;/tt&gt; has only 8 slots for saved locks. There are checks to keep that limit and don&apos;t try to save more lock that allowed but it seems there is a flaw in these checks. This part of code should be reviewed &lt;/p&gt;</comment>
                            <comment id="272841" author="sthiell" created="Sun, 14 Jun 2020 05:58:01 +0000"  >&lt;p&gt;FYI, we had another MDS crash today with same LBUG on our MDT0000 with Lustre 2.12.5 (Fir) while two &quot;lfs migrate -m 3&quot; were running.&lt;/p&gt;</comment>
                            <comment id="272842" author="tappro" created="Sun, 14 Jun 2020 06:35:27 +0000"  >&lt;p&gt;Stephane, could you report layout of migrated directory and its default LOV layout as well?&lt;/p&gt;</comment>
                            <comment id="272853" author="sthiell" created="Mon, 15 Jun 2020 03:39:09 +0000"  >&lt;p&gt;Hi Mike,&lt;/p&gt;

&lt;p&gt;One directory being migrated was &lt;tt&gt;/scratch/groups/jidoyaga&lt;/tt&gt;, and the crash happened just after this output from &lt;tt&gt;lfs migrate&lt;/tt&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;migrate /scratch/groups/./jidoyaga/rleylek/atac-seq-pipeline/cromwell-executions/atac/49a5f27e-b6bd-46a8-beae-5ec5c01c1e7e/call-overlap_pr/shard-0/inputs/275296689 to MDT3 stripe count 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Looks like this directory is still on MDT0, and the default LOV layout seems to be our default one:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-01n60 ~]# lfs getdirstripe /scratch/groups/jidoyaga/rleylek/atac-seq-pipeline/cromwell-executions/atac/49a5f27e-b6bd-46a8-beae-5ec5c01c1e7e/call-overlap_pr/shard-0/inputs/275296689
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
[root@sh02-01n60 ~]# lfs getstripe -d /scratch/groups/jidoyaga/rleylek/atac-seq-pipeline/cromwell-executions/atac/49a5f27e-b6bd-46a8-beae-5ec5c01c1e7e/call-overlap_pr/shard-0/inputs/275296689
  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   134217728
      stripe_count:  1       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 134217728
    lcme_extent.e_end:   137438953472
      stripe_count:  2       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 137438953472
    lcme_extent.e_end:   EOF
      stripe_count:  4       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;When I try to access &lt;tt&gt;/scratch/groups/jidoyaga&lt;/tt&gt; now, it hangs. I can see this backtrace on MDT0003:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[970717.500738] Lustre: DEBUG MARKER: Sun Jun 14 20:17:42 2020
 [970890.723334] LNet: Service thread pid 48047 was inactive for 200.23s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
 [970890.740358] Pid: 48047, comm: mdt_rdpg03_008 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
 [970890.751051] Call Trace:
 [970890.753601] [&amp;lt;ffffffffbb588c18&amp;gt;] call_rwsem_down_read_failed+0x18/0x30
 [970890.760341] [&amp;lt;ffffffffc0f5f0bf&amp;gt;] llog_cat_declare_add_rec+0x4f/0x260 [obdclass]
 [970890.767880] [&amp;lt;ffffffffc0f56318&amp;gt;] llog_declare_add+0x78/0x1a0 [obdclass]
 [970890.774709] [&amp;lt;ffffffffc12ab8be&amp;gt;] top_trans_start+0x17e/0x940 [ptlrpc]
 [970890.781416] [&amp;lt;ffffffffc188a494&amp;gt;] lod_trans_start+0x34/0x40 [lod]
 [970890.787632] [&amp;lt;ffffffffc193f6ba&amp;gt;] mdd_trans_start+0x1a/0x20 [mdd]
 [970890.793870] [&amp;lt;ffffffffc1932c29&amp;gt;] mdd_attr_set+0x649/0xda0 [mdd]
 [970890.800016] [&amp;lt;ffffffffc17bcf5b&amp;gt;] mdt_mfd_close+0x6cb/0x870 [mdt]
 [970890.806257] [&amp;lt;ffffffffc17c25b1&amp;gt;] mdt_close_internal+0x121/0x220 [mdt]
 [970890.812915] [&amp;lt;ffffffffc17c28d0&amp;gt;] mdt_close+0x220/0x780 [mdt]
 [970890.818797] [&amp;lt;ffffffffc129a66a&amp;gt;] tgt_request_handle+0xada/0x1570 [ptlrpc]
 [970890.825829] [&amp;lt;ffffffffc123d44b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
 [970890.833642] [&amp;lt;ffffffffc1240db4&amp;gt;] ptlrpc_main+0xb34/0x1470 [ptlrpc]
 [970890.840058] [&amp;lt;ffffffffbb2c2e81&amp;gt;] kthread+0xd1/0xe0
 [970890.845058] [&amp;lt;ffffffffbb977c24&amp;gt;] ret_from_fork_nospec_begin+0xe/0x21
 [970890.851628] [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
 [970890.856735] LustreError: dumping log to /tmp/lustre-log.1592191235.48047
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;tt&gt;lfs getdirstripe /scratch/groups/jidoyaga&lt;/tt&gt; is also hanging.&lt;/p&gt;

&lt;p&gt;I&apos;m attaching a dump of the tasks on &lt;tt&gt;fir-md1-s4&lt;/tt&gt; (MDT0003 @ 10.0.10.54@o2ib7), taken while the &lt;tt&gt;lfs getdirstripe&lt;/tt&gt; command is hanging on a client (10.50.0.1@o2ib2): &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/35195/35195_fir-md1-s4_hang_LU-13599.log&quot; title=&quot;fir-md1-s4_hang_LU-13599.log attached to LU-13599&quot;&gt;fir-md1-s4_hang_LU-13599.log&lt;/a&gt;&lt;/span&gt;. Thanks!&lt;/p&gt;</comment>
                            <comment id="272854" author="sthiell" created="Mon, 15 Jun 2020 03:48:38 +0000"  >&lt;p&gt;Also, if I start from the directory of the last migrate message, and test each parent dir, I get:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-01n60 ~]# lfs getdirstripe /scratch/groups/jidoyaga/rleylek/atac-seq-pipeline/cromwell-executions/atac/49a5f27e-b6bd-46a8-beae-5ec5c01c1e7e/call-overlap_pr/shard-0/inputs/275296689
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
[root@sh02-01n60 ~]# lfs getdirstripe /scratch/groups/jidoyaga/rleylek/atac-seq-pipeline/cromwell-executions/atac/49a5f27e-b6bd-46a8-beae-5ec5c01c1e7e/call-overlap_pr/shard-0/inputs
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
[root@sh02-01n60 ~]# lfs getdirstripe /scratch/groups/jidoyaga/rleylek/atac-seq-pipeline/cromwell-executions/atac/49a5f27e-b6bd-46a8-beae-5ec5c01c1e7e/call-overlap_pr/shard-0
&amp;lt;hang&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

</comment>
                            <comment id="272883" author="sthiell" created="Mon, 15 Jun 2020 15:54:14 +0000"  >&lt;p&gt;Mike,&lt;br/&gt;
A restart of MDT0003 fixed the hang issue (phew!). We had to force the restart, as umount was blocked with messages like these:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[1015186.802058] LustreError: 0-0: Forced cleanup waiting for fir-MDT0000-osp-MDT0003 namespace with 1 resources in use, (rc=-110)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But then the recovery was fast and the MDT is working again.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-01n60 ~]# lfs getdirstripe /scratch/groups/jidoyaga
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64,migrating
mdtidx		 FID[seq:oid:ver]
     3		 [0x280040561:0x1220:0x0]		
     0		 [0x200000406:0xf1:0x0]	
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="273818" author="tappro" created="Fri, 26 Jun 2020 14:28:51 +0000"  >&lt;p&gt;Stephane, I&apos;ve found possible source of problem, making fix right now.&lt;/p&gt;</comment>
                            <comment id="273820" author="sthiell" created="Fri, 26 Jun 2020 14:57:52 +0000"  >&lt;p&gt;Great, thanks for the update Mike!&lt;/p&gt;</comment>
                            <comment id="273822" author="gerrit" created="Fri, 26 Jun 2020 15:28:54 +0000"  >&lt;p&gt;Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/39191&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39191&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13599&quot; title=&quot;LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs-&amp;gt;rs_nlocks &amp;lt; 8 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13599&quot;&gt;&lt;del&gt;LU-13599&lt;/del&gt;&lt;/a&gt; mdt: fix logic of skipping local locks in reply_state&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: adda59c2b113480822b2a7c8fedb38e0fe2745c7&lt;/p&gt;</comment>
                            <comment id="273834" author="gerrit" created="Fri, 26 Jun 2020 17:20:51 +0000"  >&lt;p&gt;Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/39194&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39194&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13599&quot; title=&quot;LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs-&amp;gt;rs_nlocks &amp;lt; 8 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13599&quot;&gt;&lt;del&gt;LU-13599&lt;/del&gt;&lt;/a&gt; mdt: fix logic of skipping local locks in reply_state&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: c168bae054d35fa5dde1cb5274c8cb394df7dda8&lt;/p&gt;</comment>
                            <comment id="274660" author="sthiell" created="Tue, 7 Jul 2020 18:45:21 +0000"  >&lt;p&gt;Mike, tomorrow on Wednesday, we have a maintenance on our systems (including Fir), so I think I&apos;m going to try your patch on top of 2.12.5 on the servers, as it looks ready (unless you told me otherwise!). Thanks!&lt;/p&gt;</comment>
                            <comment id="274668" author="tappro" created="Tue, 7 Jul 2020 20:44:55 +0000"  >&lt;p&gt;Stephane, yes, I think patch is ready to try&lt;/p&gt;</comment>
                            <comment id="274700" author="sthiell" created="Wed, 8 Jul 2020 02:08:11 +0000"  >&lt;p&gt;Hi Mike, we tried the patch, but something is wrong with it I think. With 2.12.5 + patch, we hit the following crash two times. One time just after the recovery (see &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/35341/35341_fir-md1-s1-vmcore-dmesg_070720.txt&quot; title=&quot;fir-md1-s1-vmcore-dmesg_070720.txt attached to LU-13599&quot;&gt;fir-md1-s1-vmcore-dmesg_070720.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;, crash dump available upon request), and one time a bit later, after the recovery was done, but when we restarted a single &lt;tt&gt;lfs migrate -m 3&lt;/tt&gt; (from MDT0 - the one that crashed - to MDT3 - no crash):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jul 07 18:54:12 fir-md1-s1 kernel: Pid: 23205, comm: mdt00_105 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
Message from syslogd@fir-md1-s1 at Jul  7 18:54:12 ...
 kernel:LustreError: 23205:0:(mdt_handler.c:892:mdt_big_xattr_get()) ASSERTION( info-&amp;gt;mti_big_lmm_used == 0 ) failed: 
Message from syslogd@fir-md1-s1 at Jul  7 18:54:12 ...
 kernel:LustreError: 23205:0:(mdt_handler.c:892:mdt_big_xattr_get()) LBUG
Jul 07 18:54:12 fir-md1-s1 kernel: Call Trace:
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc0d6d7cc&amp;gt;] libcfs_call_trace+0x8c/0xc0 [libcfs]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc0d6d87c&amp;gt;] lbug_with_loc+0x4c/0xa0 [libcfs]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc180dcc0&amp;gt;] mdt_big_xattr_get+0x640/0x810 [mdt]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc180e0c7&amp;gt;] mdt_stripe_get+0x237/0x400 [mdt]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc18306b6&amp;gt;] mdt_reint_migrate+0x1056/0x1350 [mdt]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc1830a33&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc180d273&amp;gt;] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc18186e7&amp;gt;] mdt_reint+0x67/0x140 [mdt]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc12e566a&amp;gt;] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc128844b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffc128bdb4&amp;gt;] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffff872c2e81&amp;gt;] kthread+0xd1/0xe0
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffff87977c24&amp;gt;] ret_from_fork_nospec_begin+0xe/0x21
Jul 07 18:54:12 fir-md1-s1 kernel:  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="274773" author="sthiell" created="Wed, 8 Jul 2020 16:58:56 +0000"  >&lt;p&gt;We have not applied the patch at our system maintenance today, due to the new assertion above &lt;tt&gt;ASSERTION( info-&amp;gt;mti_big_lmm_used == 0 )&lt;/tt&gt; which seems more frequent than {{ASSERTION( rs-&amp;gt;rs_nlocks &amp;lt; 8 ) }} . We&apos;ll stay with 2.12.5 until we understand more. Thanks!&lt;/p&gt;</comment>
                            <comment id="274793" author="tappro" created="Wed, 8 Jul 2020 20:16:17 +0000"  >&lt;p&gt;Stephane, this assertion was seen on master and was fixed, I will prepare patch for 2.12&lt;/p&gt;</comment>
                            <comment id="274805" author="sthiell" created="Wed, 8 Jul 2020 21:38:37 +0000"  >&lt;p&gt;Great, thanks Mike!&lt;/p&gt;</comment>
                            <comment id="275978" author="sthiell" created="Wed, 22 Jul 2020 15:58:05 +0000"  >&lt;p&gt;Hi Mike,&lt;br/&gt;
 Do you have a separate ticket for the problem of &lt;tt&gt;ASSERTION( info-&amp;gt;mti_big_lmm_used == 0 )&lt;/tt&gt;? Just wanted to check if you had the time to backport the patch. We would like to try both patches to avoid MDT crashes during MDT migration, which we&apos;re doing quite a lot (we got another crash last night). Thanks!&lt;br/&gt;
Stephane&lt;/p&gt;</comment>
                            <comment id="276203" author="gerrit" created="Tue, 28 Jul 2020 11:45:25 +0000"  >&lt;p&gt;Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/39521&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39521&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13599&quot; title=&quot;LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs-&amp;gt;rs_nlocks &amp;lt; 8 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13599&quot;&gt;&lt;del&gt;LU-13599&lt;/del&gt;&lt;/a&gt; mdt: fix mti_big_lmm buffer usage&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: f2cac96782dc729cfe93272db51d670284b6d7aa&lt;/p&gt;</comment>
                            <comment id="276204" author="tappro" created="Tue, 28 Jul 2020 11:48:05 +0000"  >&lt;p&gt;Hello Stephane, this fix had no separate ticket in master branch but was made as side work in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11025&quot; title=&quot;DNE3: directory restripe&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11025&quot;&gt;&lt;del&gt;LU-11025&lt;/del&gt;&lt;/a&gt;&lt;br/&gt;
I&apos;ve extracted related things from it for 2.12. Please check if it helps&lt;/p&gt;</comment>
                            <comment id="276235" author="sthiell" created="Tue, 28 Jul 2020 18:05:55 +0000"  >&lt;p&gt;Thanks Mike!&lt;br/&gt;
 I&apos;ve applied the two patches on one of our MDSes (on top of 2.12.5), which is running now while some MDT-MDT migrations are going on. Will keep you posted!&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ git log --oneline | head -3
8ac362a LU-13599 mdt: fix mti_big_lmm buffer usage
1324114 LU-13599 mdt: fix logic of skipping local locks in reply_state
78d712a New release 2.12.5
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="276319" author="sthiell" created="Wed, 29 Jul 2020 18:46:59 +0000"  >&lt;p&gt;Hi Mike,&lt;/p&gt;

&lt;p&gt;I&apos;m glad to report that all 4 MDSes on Fir now have the two patches and so far have been running without any issue, even with multiple parallel &lt;tt&gt;lfs migrate -m&lt;/tt&gt; running on a client. I&apos;ll let you know if I see any issue, but it&apos;s very promising!&lt;/p&gt;

&lt;p&gt;Thanks so much!&lt;/p&gt;</comment>
                            <comment id="276415" author="ofaaland" created="Thu, 30 Jul 2020 17:41:27 +0000"  >&lt;p&gt;Hi Peter or Mike,&lt;br/&gt;
Can you talk the appropriate folks into reviewing the patches? Thanks.&lt;/p&gt;</comment>
                            <comment id="276497" author="pjones" created="Fri, 31 Jul 2020 15:47:06 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ofaaland&quot; class=&quot;user-hover&quot; rel=&quot;ofaaland&quot;&gt;ofaaland&lt;/a&gt; I think that we&apos;re all set now - just need things to land&lt;/p&gt;</comment>
                            <comment id="276987" author="gerrit" created="Fri, 7 Aug 2020 21:12:41 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/39191/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39191/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13599&quot; title=&quot;LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs-&amp;gt;rs_nlocks &amp;lt; 8 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13599&quot;&gt;&lt;del&gt;LU-13599&lt;/del&gt;&lt;/a&gt; mdt: fix logic of skipping local locks in reply_state&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: dec36101852a8d300a6fdcc28c8d723989544aaa&lt;/p&gt;</comment>
                            <comment id="277820" author="sthiell" created="Thu, 20 Aug 2020 18:50:55 +0000"  >&lt;p&gt;Just checking regarding&#160;&lt;a href=&quot;https://review.whamcloud.com/#/c/39521/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/39521/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This patch is critical to avoid MDS crashes and has not landed in b2_12 yet.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="278457" author="gerrit" created="Tue, 1 Sep 2020 03:47:26 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/39521/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39521/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13599&quot; title=&quot;LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs-&amp;gt;rs_nlocks &amp;lt; 8 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13599&quot;&gt;&lt;del&gt;LU-13599&lt;/del&gt;&lt;/a&gt; mdt: fix mti_big_lmm buffer usage&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: b09b533b6f443c359e671e7b65208355d5c201dd&lt;/p&gt;</comment>
                            <comment id="278466" author="sthiell" created="Tue, 1 Sep 2020 04:49:43 +0000"  >&lt;p&gt;Thanks y&apos;all! &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="279057" author="gerrit" created="Tue, 8 Sep 2020 18:09:24 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/39194/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39194/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13599&quot; title=&quot;LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs-&amp;gt;rs_nlocks &amp;lt; 8 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13599&quot;&gt;&lt;del&gt;LU-13599&lt;/del&gt;&lt;/a&gt; mdt: add test for rs_lock limit exceeding&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 3448cdc16e361d2504f2f5b0982c92d7a0de933d&lt;/p&gt;</comment>
                            <comment id="287143" author="pjones" created="Wed, 9 Dec 2020 19:21:13 +0000"  >&lt;p&gt;Landed for 2.12.6. Fixed on master as part of a larger change (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11025&quot; title=&quot;DNE3: directory restripe&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11025&quot;&gt;&lt;del&gt;LU-11025&lt;/del&gt;&lt;/a&gt;)&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="60108">LU-13816</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="72647">LU-16206</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="59397">LU-13615</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="35341" name="fir-md1-s1-vmcore-dmesg_070720.txt" size="177027" author="sthiell" created="Wed, 8 Jul 2020 02:07:15 +0000"/>
                            <attachment id="35195" name="fir-md1-s4_hang_LU-13599.log" size="1725770" author="sthiell" created="Mon, 15 Jun 2020 03:38:20 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i0116n:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>