<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:46:36 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11751] racer deadlocks due to DOM glimpse request</title>
                <link>https://jira.whamcloud.com/browse/LU-11751</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;racer test_1 hangs in locking. This issue looks a lot like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10852&quot; title=&quot;racer test 1 hangs in locking&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10852&quot;&gt;LU-10852&lt;/a&gt;, but the stack trace is a bit different and this failure is not in a DNE environment. So, opening a new ticket. &lt;/p&gt;

&lt;p&gt;Looking at the logs at &lt;a href=&quot;https://testing.whamcloud.com/test_sets/1c19f7d0-f9d7-11e8-b216-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/1c19f7d0-f9d7-11e8-b216-52540065bddc&lt;/a&gt;, we can see that many processes on the clients are blocked. On client 2 (vm9), the client console shows&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[46895.114548] LustreError: 11-0: lustre-MDT0000-mdc-ffff9d96fa9a7800: operation ldlm_enqueue to node 10.2.8.143@tcp failed: rc = -107
[46895.115945] Lustre: lustre-MDT0000-mdc-ffff9d96fa9a7800: Connection to lustre-MDT0000 (at 10.2.8.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
[46895.119762] LustreError: 167-0: lustre-MDT0000-mdc-ffff9d96fa9a7800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[46895.133418] LustreError: 2635:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000403:0x942:0x0] error: rc = -5
[46895.140482] LustreError: 2167:0:(vvp_io.c:1495:vvp_io_init()) lustre: refresh file layout [0x200000405:0x70a:0x0] error -108.
[46895.146261] LustreError: 3286:0:(lmv_obd.c:1250:lmv_fid_alloc()) Can&apos;t alloc new fid, rc -19
[46895.147487] LustreError: 18765:0:(mdc_locks.c:1257:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -108
[46895.168736] LustreError: 18884:0:(mdc_request.c:1427:mdc_read_page()) lustre-MDT0000-mdc-ffff9d96fa9a7800: [0x200000402:0x23c:0x0] lock enqueue fails: rc = -108
[46895.208950] LustreError: 19579:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff9d96fa9a7800: inode [0x200000402:0x23c:0x0] mdc close failed: rc = -108
[46895.222023] LustreError: 12501:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff9d96fa9a7800: namespace resource [0x200000402:0x23d:0x0].0x0 (ffff9d96f98e9e40) refcount nonzero (1) after lock cleanup; forcing cleanup.
[46895.633613] LustreError: 13031:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -108
[46895.634939] LustreError: 13031:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 609 previous similar messages
[46895.635482] Lustre: lustre-MDT0000-mdc-ffff9d96fa9a7800: Connection restored to 10.2.8.143@tcp (at 10.2.8.143@tcp)
[46920.333393] INFO: task setfattr:32731 blocked for more than 120 seconds.
[46920.334207] &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
[46920.335014] setfattr        D ffff9d9695c36eb0     0 32731    638 0x00000080
[46920.335773] Call Trace:
[46920.336068]  [&amp;lt;ffffffffae22ae2e&amp;gt;] ? generic_permission+0x15e/0x1d0
[46920.336704]  [&amp;lt;ffffffffae719e59&amp;gt;] schedule_preempt_disabled+0x29/0x70
[46920.337337]  [&amp;lt;ffffffffae717c17&amp;gt;] __mutex_lock_slowpath+0xc7/0x1d0
[46920.337958]  [&amp;lt;ffffffffae716fff&amp;gt;] mutex_lock+0x1f/0x2f
[46920.338471]  [&amp;lt;ffffffffae24700f&amp;gt;] vfs_removexattr+0x5f/0x130
[46920.339043]  [&amp;lt;ffffffffae247135&amp;gt;] removexattr+0x55/0x80
[46920.339578]  [&amp;lt;ffffffffae230a7d&amp;gt;] ? putname+0x3d/0x60
[46920.340101]  [&amp;lt;ffffffffae231c92&amp;gt;] ? user_path_at_empty+0x72/0xc0
[46920.340711]  [&amp;lt;ffffffffae2221c8&amp;gt;] ? __sb_start_write+0x58/0x110
[46920.341297]  [&amp;lt;ffffffffae72056c&amp;gt;] ? __do_page_fault+0x1bc/0x4f0
[46920.341900]  [&amp;lt;ffffffffae241d9c&amp;gt;] ? mnt_want_write+0x2c/0x50
[46920.342455]  [&amp;lt;ffffffffae247f34&amp;gt;] SyS_removexattr+0x94/0xd0
[46920.343017]  [&amp;lt;ffffffffae72579b&amp;gt;] system_call_fastpath+0x22/0x27
[46920.343616]  [&amp;lt;ffffffffae7256e1&amp;gt;] ? system_call_after_swapgs+0xae/0x146
[46920.344266] INFO: task cp:3161 blocked for more than 120 seconds.
[46920.344878] &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
[46920.345642] cp              D ffff9d96e3c45ee0     0  3161    645 0x00000080
[46920.346377] Call Trace:
[46920.346659]  [&amp;lt;ffffffffae2d5152&amp;gt;] ? security_inode_permission+0x22/0x30
[46920.347301]  [&amp;lt;ffffffffae719e59&amp;gt;] schedule_preempt_disabled+0x29/0x70
[46920.347946]  [&amp;lt;ffffffffae717c17&amp;gt;] __mutex_lock_slowpath+0xc7/0x1d0
[46920.348543]  [&amp;lt;ffffffffae716fff&amp;gt;] mutex_lock+0x1f/0x2f
[46920.349083]  [&amp;lt;ffffffffae710bee&amp;gt;] lookup_slow+0x33/0xa7
[46920.349608]  [&amp;lt;ffffffffae22e7a8&amp;gt;] path_lookupat+0x838/0x8b0
[46920.350178]  [&amp;lt;ffffffffae1970cb&amp;gt;] ? unlock_page+0x2b/0x30
[46920.350732]  [&amp;lt;ffffffffae1faf39&amp;gt;] ? kmem_cache_alloc+0x179/0x1f0
[46920.351320]  [&amp;lt;ffffffffae230aef&amp;gt;] ? getname_flags+0x4f/0x1a0
[46920.351894]  [&amp;lt;ffffffffae22e84b&amp;gt;] filename_lookup+0x2b/0xc0
[46920.352442]  [&amp;lt;ffffffffae231c87&amp;gt;] user_path_at_empty+0x67/0xc0
[46920.353036]  [&amp;lt;ffffffffae127b72&amp;gt;] ? from_kgid_munged+0x12/0x20
[46920.353630]  [&amp;lt;ffffffffae2251df&amp;gt;] ? cp_new_stat+0x14f/0x180
[46920.354180]  [&amp;lt;ffffffffae231cf1&amp;gt;] user_path_at+0x11/0x20
[46920.354718]  [&amp;lt;ffffffffae224cd3&amp;gt;] vfs_fstatat+0x63/0xc0
[46920.355234]  [&amp;lt;ffffffffae22523e&amp;gt;] SYSC_newstat+0x2e/0x60
[46920.355776]  [&amp;lt;ffffffffae7256e1&amp;gt;] ? system_call_after_swapgs+0xae/0x146
[46920.356427]  [&amp;lt;ffffffffae7256d5&amp;gt;] ? system_call_after_swapgs+0xa2/0x146
[46920.357093]  [&amp;lt;ffffffffae7256e1&amp;gt;] ? system_call_after_swapgs+0xae/0x146
[46920.357748]  [&amp;lt;ffffffffae7256d5&amp;gt;] ? system_call_after_swapgs+0xa2/0x146
[46920.358394]  [&amp;lt;ffffffffae7256e1&amp;gt;] ? system_call_after_swapgs+0xae/0x146
[46920.359050]  [&amp;lt;ffffffffae7256d5&amp;gt;] ? system_call_after_swapgs+0xa2/0x146
[46920.359715]  [&amp;lt;ffffffffae7256e1&amp;gt;] ? system_call_after_swapgs+0xae/0x146
[46920.360355]  [&amp;lt;ffffffffae7256d5&amp;gt;] ? system_call_after_swapgs+0xa2/0x146
[46920.361018]  [&amp;lt;ffffffffae7256e1&amp;gt;] ? system_call_after_swapgs+0xae/0x146
[46920.361676]  [&amp;lt;ffffffffae7256d5&amp;gt;] ? system_call_after_swapgs+0xa2/0x146
[46920.362314]  [&amp;lt;ffffffffae7256e1&amp;gt;] ? system_call_after_swapgs+0xae/0x146
[46920.362967]  [&amp;lt;ffffffffae22551e&amp;gt;] SyS_newstat+0xe/0x10
[46920.363469]  [&amp;lt;ffffffffae72579b&amp;gt;] system_call_fastpath+0x22/0x27
[46920.364082]  [&amp;lt;ffffffffae7256e1&amp;gt;] ? system_call_after_swapgs+0xae/0x146
[46920.364739] INFO: task ln:3169 blocked for more than 120 seconds.
[46920.365340] &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
[46920.366106] ln              D ffff9d96d8966eb0     0  3169    616 0x00000080
[46920.366863] Call Trace:
[46920.367122]  [&amp;lt;ffffffffae22e092&amp;gt;] ? path_lookupat+0x122/0x8b0
[46920.367698]  [&amp;lt;ffffffffae1f8e19&amp;gt;] ? ___slab_alloc+0x209/0x4f0
[46920.368264]  [&amp;lt;ffffffffae719e59&amp;gt;] schedule_preempt_disabled+0x29/0x70
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On client 1 (vm10), we see dir_create.sh, ls, mkdir, and other processes blocked&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[46775.380960] LustreError: 167-0: lustre-MDT0000-mdc-ffff9b35bbd7d000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[46775.392616] LustreError: 14241:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000401:0x122:0x0] error: rc = -5
[46775.394009] LustreError: 14241:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 2 previous similar messages
[46775.395224] LustreError: 30588:0:(mdc_locks.c:1257:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -108
[46775.402186] LustreError: 18530:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff9b35bbd7d000: inode [0x200000404:0x148:0x0] mdc close failed: rc = -108
[46775.403682] LustreError: 18530:0:(file.c:216:ll_close_inode_openhandle()) Skipped 1 previous similar message
[46775.415258] LustreError: 17871:0:(llite_lib.c:1547:ll_md_setattr()) md_setattr fails: rc = -108
[46775.421674] LustreError: 18956:0:(lmv_obd.c:1250:lmv_fid_alloc()) Can&apos;t alloc new fid, rc -19
[46775.431989] Lustre: lustre-MDT0000-mdc-ffff9b35bbd7d000: Connection restored to 10.2.8.143@tcp (at 10.2.8.143@tcp)
[46920.246223] INFO: task dir_create.sh:4489 blocked for more than 120 seconds.
[46920.247112] &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
[46920.247952] dir_create.sh   D ffff9b35baf1af70     0  4489   4447 0x00000080
[46920.248755] Call Trace:
[46920.249111]  [&amp;lt;ffffffffc0ea57e2&amp;gt;] ? ll_dcompare+0x72/0x2e0 [lustre]
[46920.249780]  [&amp;lt;ffffffff94319e59&amp;gt;] schedule_preempt_disabled+0x29/0x70
[46920.250439]  [&amp;lt;ffffffff94317c17&amp;gt;] __mutex_lock_slowpath+0xc7/0x1d0
[46920.251074]  [&amp;lt;ffffffff94316fff&amp;gt;] mutex_lock+0x1f/0x2f
[46920.251612]  [&amp;lt;ffffffff94310bee&amp;gt;] lookup_slow+0x33/0xa7
[46920.252169]  [&amp;lt;ffffffff93e2d01f&amp;gt;] link_path_walk+0x80f/0x8b0
[46920.252746]  [&amp;lt;ffffffff93e30205&amp;gt;] path_openat+0xb5/0x640
[46920.253285]  [&amp;lt;ffffffff93e31dbd&amp;gt;] do_filp_open+0x4d/0xb0
[46920.253848]  [&amp;lt;ffffffff93e3f1e4&amp;gt;] ? __alloc_fd+0xc4/0x170
[46920.254401]  [&amp;lt;ffffffff93e1e0d7&amp;gt;] do_sys_open+0x137/0x240
[46920.254959]  [&amp;lt;ffffffff943256d5&amp;gt;] ? system_call_after_swapgs+0xa2/0x146
[46920.255631]  [&amp;lt;ffffffff93e1e1fe&amp;gt;] SyS_open+0x1e/0x20
[46920.256150]  [&amp;lt;ffffffff9432579b&amp;gt;] system_call_fastpath+0x22/0x27
[46920.256760]  [&amp;lt;ffffffff943256e1&amp;gt;] ? system_call_after_swapgs+0xae/0x146
[46920.257485] INFO: task ls:6624 blocked for more than 120 seconds.
[46920.258108] &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
[46920.258901] ls              D ffff9b359eb6eeb0     0  6624   4495 0x00000080
[46920.259661] Call Trace:
[46920.259932]  [&amp;lt;ffffffff93cc6081&amp;gt;] ? in_group_p+0x31/0x40
[46920.260487]  [&amp;lt;ffffffff94319e59&amp;gt;] schedule_preempt_disabled+0x29/0x70
[46920.261139]  [&amp;lt;ffffffff94317c17&amp;gt;] __mutex_lock_slowpath+0xc7/0x1d0
[46920.261777]  [&amp;lt;ffffffff94316fff&amp;gt;] mutex_lock+0x1f/0x2f
[46920.262300]  [&amp;lt;ffffffff93e2f11f&amp;gt;] do_last+0x28f/0x12c0
[46920.262826]  [&amp;lt;ffffffff93e30227&amp;gt;] path_openat+0xd7/0x640
[46920.263362]  [&amp;lt;ffffffff93e30aef&amp;gt;] ? getname_flags+0x4f/0x1a0
[46920.263957]  [&amp;lt;ffffffffc0eb2159&amp;gt;] ? ll_file_data_put+0x89/0x180 [lustre]
[46920.264625]  [&amp;lt;ffffffff93e31dbd&amp;gt;] do_filp_open+0x4d/0xb0
[46920.265173]  [&amp;lt;ffffffff93e3f1e4&amp;gt;] ? __alloc_fd+0xc4/0x170
[46920.265739]  [&amp;lt;ffffffff93e1e0d7&amp;gt;] do_sys_open+0x137/0x240
[46920.266281]  [&amp;lt;ffffffff943256d5&amp;gt;] ? system_call_after_swapgs+0xa2/0x146
[46920.266939]  [&amp;lt;ffffffff93e1e214&amp;gt;] SyS_openat+0x14/0x20
[46920.267473]  [&amp;lt;ffffffff9432579b&amp;gt;] system_call_fastpath+0x22/0x27
[46920.268081]  [&amp;lt;ffffffff943256e1&amp;gt;] ? system_call_after_swapgs+0xae/0x146
[46920.268778] INFO: task ln:8693 blocked for more than 120 seconds.
[46920.269388] &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
[46920.270180] ln              D ffff9b3595468fd0     0  8693   4475 0x00000080
[46920.270940] Call Trace:
[46920.271201]  [&amp;lt;ffffffff93e2e092&amp;gt;] ? path_lookupat+0x122/0x8b0
[46920.271790]  [&amp;lt;ffffffff94319e59&amp;gt;] schedule_preempt_disabled+0x29/0x70
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the MDS (vm12), we see many stack traces like the following&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[46559.329558] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-41vm10,onyx-41vm9.onyx.whamcloud.com DURATION=900 ============ 09:34:14 (1544088854)
[46564.017497] LustreError: 5798:0:(mdt_lvb.c:430:mdt_lvbo_fill()) lustre-MDT0000: small buffer size 544 for EA 568 (max_mdsize 568): rc = -34
[46643.846755] LNet: Service thread pid 5790 was inactive for 62.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[46643.848574] Pid: 5790, comm: mdt00_011 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Wed Dec 5 03:41:24 UTC 2018
[46643.849563] Call Trace:
[46643.849882]  [&amp;lt;ffffffffc0ef8031&amp;gt;] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[46643.850848]  [&amp;lt;ffffffffc0ef8dcc&amp;gt;] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[46643.851624]  [&amp;lt;ffffffffc11e951b&amp;gt;] mdt_object_local_lock+0x50b/0xb20 [mdt]
[46643.852657]  [&amp;lt;ffffffffc11e9ba0&amp;gt;] mdt_object_lock_internal+0x70/0x360 [mdt]
[46643.853470]  [&amp;lt;ffffffffc11e9f47&amp;gt;] mdt_object_lock_try+0x27/0xb0 [mdt]
[46643.854167]  [&amp;lt;ffffffffc11eb687&amp;gt;] mdt_getattr_name_lock+0x1287/0x1c30 [mdt]
[46643.854970]  [&amp;lt;ffffffffc11f2ba5&amp;gt;] mdt_intent_getattr+0x2b5/0x480 [mdt]
[46643.855694]  [&amp;lt;ffffffffc11efa08&amp;gt;] mdt_intent_policy+0x2e8/0xd00 [mdt]
[46643.856459]  [&amp;lt;ffffffffc0edeec6&amp;gt;] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[46643.857209]  [&amp;lt;ffffffffc0f078a7&amp;gt;] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[46643.858036]  [&amp;lt;ffffffffc0f8e242&amp;gt;] tgt_enqueue+0x62/0x210 [ptlrpc]
[46643.858866]  [&amp;lt;ffffffffc0f9529a&amp;gt;] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[46643.859648]  [&amp;lt;ffffffffc0f3991b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[46643.860560]  [&amp;lt;ffffffffc0f3d24c&amp;gt;] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[46643.861315]  [&amp;lt;ffffffff95cbdf21&amp;gt;] kthread+0xd1/0xe0
[46643.861876]  [&amp;lt;ffffffff963255f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[46643.862590]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[46643.863234] LustreError: dumping log to /tmp/lustre-log.1544088939.5790
&#8230;
[47031.433294] Pid: 5782, comm: mdt00_004 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Wed Dec 5 03:41:24 UTC 2018
[47031.434325] Call Trace:
[47031.434609]  [&amp;lt;ffffffffc0ef8031&amp;gt;] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[47031.435489]  [&amp;lt;ffffffffc0ef8dcc&amp;gt;] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[47031.436322]  [&amp;lt;ffffffffc11e951b&amp;gt;] mdt_object_local_lock+0x50b/0xb20 [mdt]
[47031.437360]  [&amp;lt;ffffffffc11e9ba0&amp;gt;] mdt_object_lock_internal+0x70/0x360 [mdt]
[47031.438194]  [&amp;lt;ffffffffc11ec09a&amp;gt;] mdt_object_find_lock+0x6a/0x1a0 [mdt]
[47031.438983]  [&amp;lt;ffffffffc120aa5e&amp;gt;] mdt_reint_setxattr+0x1ce/0xfd0 [mdt]
[47031.439782]  [&amp;lt;ffffffffc12085d3&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
[47031.440507]  [&amp;lt;ffffffffc11e51a3&amp;gt;] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[47031.441242]  [&amp;lt;ffffffffc11f0487&amp;gt;] mdt_reint+0x67/0x140 [mdt]
[47031.441942]  [&amp;lt;ffffffffc0f9529a&amp;gt;] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[47031.442833]  [&amp;lt;ffffffffc0f3991b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[47031.443725]  [&amp;lt;ffffffffc0f3d24c&amp;gt;] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[47031.444586]  [&amp;lt;ffffffff95cbdf21&amp;gt;] kthread+0xd1/0xe0
[47031.445234]  [&amp;lt;ffffffff963255f7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
[47031.445925]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[47031.446664] LustreError: dumping log to /tmp/lustre-log.1544089327.5782
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Additional logs for this issue are at&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/sub_tests/04a3b75e-f729-11e8-bfe1-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/sub_tests/04a3b75e-f729-11e8-bfe1-52540065bddc&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/sub_tests/24b8ffb6-f9ce-11e8-bb6b-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/sub_tests/24b8ffb6-f9ce-11e8-bb6b-52540065bddc&lt;/a&gt;&lt;/p&gt;</description>
                <environment></environment>
        <key id="54242">LU-11751</key>
            <summary>racer deadlocks due to DOM glimpse request</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                    </labels>
                <created>Mon, 10 Dec 2018 17:48:40 +0000</created>
                <updated>Sun, 16 Jan 2022 08:56:19 +0000</updated>
                            <resolved>Sun, 16 Jan 2022 08:56:19 +0000</resolved>
                                    <version>Lustre 2.12.0</version>
                    <version>Lustre 2.12.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="238444" author="bzzz" created="Wed, 12 Dec 2018 11:25:48 +0000"  >&lt;p&gt;I was unable to reproduce this with&#160;RACER_ENABLE_DOM=false&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="238459" author="pjones" created="Wed, 12 Dec 2018 18:19:45 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;Is this something that you already know about?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="238489" author="tappro" created="Wed, 12 Dec 2018 22:06:07 +0000"  >&lt;p&gt;it looks like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11359&quot; title=&quot;racer test 1 times out with client hung in dir_create.sh, ls, &#8230; and MDS in ldlm_completion_ast()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11359&quot;&gt;&lt;del&gt;LU-11359&lt;/del&gt;&lt;/a&gt; duplicate&lt;/p&gt;</comment>
                            <comment id="246428" author="tappro" created="Sat, 27 Apr 2019 18:58:51 +0000"  >&lt;p&gt;This looks like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11359&quot; title=&quot;racer test 1 times out with client hung in dir_create.sh, ls, &#8230; and MDS in ldlm_completion_ast()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11359&quot;&gt;&lt;del&gt;LU-11359&lt;/del&gt;&lt;/a&gt; but has different cause. The problem exists with DOM files and is related to glimpse. When getattr RPC is returned to MDC without size then llite is asking for size attribute immediately by glimpse lock request. But at the moment original GETATTR lock is still in busy state, usually with 0x13 inodebits. Glimpse is PR lock with DOM bits - 0x40. The problem is that any other lock between these two may stuck on the first one and block the glimpse in waiting queue. But without finished glimpse the first lock will stay forever. This situation appeared along with DOM because glimpses on OSTs are about different namespace and resource.&lt;/p&gt;

&lt;p&gt;I think the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11285&quot; title=&quot;don&amp;#39;t stop on the first blocked lock in ldlm_reprocess_queue()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11285&quot;&gt;&lt;del&gt;LU-11285&lt;/del&gt;&lt;/a&gt; will help with the problem, because locks with non-intersecting ibits will no longer be blocked in the waiting queue, but ideally the glimpse should be done atomically. I also have a &apos;glimpse-ahead&apos; patch for that, which was considered an optimization, but it will take time to apply it to current master and pass tests. This will be done under this ticket.&lt;/p&gt;</comment>
                            <comment id="246431" author="pfarrell" created="Sat, 27 Apr 2019 21:12:48 +0000"  >&lt;p&gt;Mike,&lt;/p&gt;

&lt;p&gt;Is it not possible to make the PR DOM lock part of the getattr request on MDC?&#160; Or would that cause problems (or at least be wasteful) if there was an existing DOM lock?&lt;/p&gt;</comment>
                            <comment id="246436" author="tappro" created="Sun, 28 Apr 2019 06:18:42 +0000"  >&lt;p&gt;Patrick, it is already so, if we CAN return DOM lock, but if not then separate glimpse is issued to get size. What I am thinking about is to do glimpse from server in advance, if we cannot return DOM bit and  there will be glimpse request for sure.&lt;/p&gt;</comment>
                            <comment id="246437" author="tappro" created="Sun, 28 Apr 2019 06:22:38 +0000"  >&lt;p&gt;But first I have to check how glimpse works right now, my understanding it should not be blocked in waiting queue.&lt;/p&gt;</comment>
                            <comment id="246439" author="sthiell" created="Sun, 28 Apr 2019 17:30:28 +0000"  >&lt;p&gt;I believe we are getting this issue right now, example on FID 0x2c001ad81:0xe26:0x0 but there are others (0x13 + 0x40):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Apr 28 10:16:47 fir-md1-s1 kernel: LustreError: 35049:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556471717, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0002_UUID lock: ffff8b5012ae7980/0x378007fdc6c24f96 lrc: 3/1,0 mode: --/PR res: [0x2c001ad81:0xe26:0x0].0x0 bits 0x13/0x8 rrc: 143 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 35049 timeout: 0 lvb_type: 0

Apr 28 10:17:47 fir-md1-s1 kernel: LustreError: 20444:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.8.7.17@o2ib6 ns: mdt-fir-MDT0002_UUID lock: ffff8b51f73e72c0/0x378007fdc5f51cf6 lrc: 3/0,0 mode: PW/PW res: [0x2c001ad81:0xe26:0x0].0x0 bits 0x40/0x0 rrc: 143 type: IBT flags: 0x60200400000020 nid: 10.8.7.17@o2ib6 remote: 0x1b8755abd081ffc4 expref: 10 pid: 34487 timeout: 471701 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
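&lt;p&gt;(For reference, assuming the usual MDS inodebit values, 0x13 decodes to LOOKUP|UPDATE|PERM and 0x40 is the DOM bit, i.e. the 0x13 + 0x40 combination mentioned above.)&lt;/p&gt;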

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-rbh01 ~]# lfs fid2path /fir 0x2c001ad81:0xe26:0x0
/fir/users/jmz/geo_activity/activitySplit/logs/testRunOutput
[root@fir-rbh01 ~]# stat /fir/users/jmz/geo_activity/activitySplit/logs/testRunOutput
  File: &#8216;/fir/users/jmz/geo_activity/activitySplit/logs/testRunOutput&#8217;
  Size: 956       	Blocks: 16         IO Block: 4194304 regular file
Device: e64e03a8h/3863872424d	Inode: 198160228309536294  Links: 1
Access: (0644/-rw-r--r--)  Uid: (331789/     jmz)   Gid: (11886/    euan)
Access: 2019-04-28 01:46:50.000000000 -0700
Modify: 2019-04-28 10:28:15.000000000 -0700
Change: 2019-04-28 10:28:15.000000000 -0700
 Birth: -
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;followed after some time by the typical &lt;tt&gt;ldlm_completion_ast&lt;/tt&gt; traces.&lt;/p&gt;

&lt;p&gt;Even restarting the MDTs doesn&apos;t seem to fix the issue... what can we do apart from remounting with &lt;tt&gt;abort_recov&lt;/tt&gt;?&lt;/p&gt;</comment>
                            <comment id="246440" author="sthiell" created="Sun, 28 Apr 2019 17:39:32 +0000"  >&lt;p&gt;Attached &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32495/32495_fir-md1-s1-kernel-20190428.log&quot; title=&quot;fir-md1-s1-kernel-20190428.log attached to LU-11751&quot;&gt;fir-md1-s1-kernel-20190428.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; and  &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32496/32496_fir-md1-s2-kernel-20190428.log&quot; title=&quot;fir-md1-s2-kernel-20190428.log attached to LU-11751&quot;&gt;fir-md1-s2-kernel-20190428.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; which are kernel logs from both MDS.&lt;/p&gt;</comment>
                            <comment id="246448" author="sthiell" created="Sun, 28 Apr 2019 20:44:08 +0000"  >&lt;p&gt;I commented in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11285&quot; title=&quot;don&amp;#39;t stop on the first blocked lock in ldlm_reprocess_queue()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11285&quot;&gt;&lt;del&gt;LU-11285&lt;/del&gt;&lt;/a&gt;, I think we have a great dk trace with +dlmtrace. Hope that helps to fix this definitively!! For us, the issue has been resolved (for now) after holding this user&apos;s jobs and restarting MDT3 in &lt;tt&gt;abort_recov&lt;/tt&gt; (didn&apos;t work otherwise...).&lt;/p&gt;</comment>
                            <comment id="246462" author="tappro" created="Mon, 29 Apr 2019 17:28:29 +0000"  >&lt;p&gt;Stephane, thanks for logs, I didn&apos;t find the reason yet, working on that&lt;/p&gt;</comment>
                            <comment id="246572" author="sthiell" created="Wed, 1 May 2019 05:27:46 +0000"  >&lt;p&gt;Mike, I just checked and since this last event, so about two days now, I haven&apos;t seen any new &lt;tt&gt;ldlm_completion_ast&lt;/tt&gt; call trace at all on both MDS, this is surprising (in a good way!). Before that, we had many of them per day. But after the problem reported above, I added the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11285&quot; title=&quot;don&amp;#39;t stop on the first blocked lock in ldlm_reprocess_queue()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11285&quot;&gt;&lt;del&gt;LU-11285&lt;/del&gt;&lt;/a&gt; so... perhaps it did help. We&apos;re currently running 2.12.0 + the following patches:&lt;/p&gt;

&lt;p&gt;5081aa7 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10777&quot; title=&quot;DoM performance is bad with FIO write&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10777&quot;&gt;&lt;del&gt;LU-10777&lt;/del&gt;&lt;/a&gt; dom: disable read-on-open with resend&lt;br/&gt;
455d39b &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11285&quot; title=&quot;don&amp;#39;t stop on the first blocked lock in ldlm_reprocess_queue()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11285&quot;&gt;&lt;del&gt;LU-11285&lt;/del&gt;&lt;/a&gt; ldlm: reprocess whole waiting queue for IBITS&lt;br/&gt;
2eec4f8 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12018&quot; title=&quot;deadlock on OSS: quota reintegration vs memory release&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12018&quot;&gt;&lt;del&gt;LU-12018&lt;/del&gt;&lt;/a&gt; quota: do not start a thread under memory pressure&lt;br/&gt;
1819063 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11359&quot; title=&quot;racer test 1 times out with client hung in dir_create.sh, ls, &#8230; and MDS in ldlm_completion_ast()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11359&quot;&gt;&lt;del&gt;LU-11359&lt;/del&gt;&lt;/a&gt; mdt: fix mdt_dom_discard_data() timeouts&lt;br/&gt;
3ed10f4 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11964&quot; title=&quot;Heavy load and soft lockups on MDS with DOM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11964&quot;&gt;&lt;del&gt;LU-11964&lt;/del&gt;&lt;/a&gt; mdc: prevent glimpse lock count grow&lt;br/&gt;
565011c &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12037&quot; title=&quot;Possible DNE issue leading to hung filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12037&quot;&gt;&lt;del&gt;LU-12037&lt;/del&gt;&lt;/a&gt; mdt: add option for cross-MDT rename&lt;br/&gt;
b6be1d9 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12065&quot; title=&quot;Client got evicted when  lock callback timer expired  on OSS &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12065&quot;&gt;&lt;del&gt;LU-12065&lt;/del&gt;&lt;/a&gt; lnd: increase CQ entries&lt;br/&gt;
6b2c97b &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12037&quot; title=&quot;Possible DNE issue leading to hung filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12037&quot;&gt;&lt;del&gt;LU-12037&lt;/del&gt;&lt;/a&gt; mdt: call mdt_dom_discard_data() after rename unlock&lt;/p&gt;

&lt;p&gt;Also we&apos;re now 100% sure that all clients are running with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11359&quot; title=&quot;racer test 1 times out with client hung in dir_create.sh, ls, &#8230; and MDS in ldlm_completion_ast()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11359&quot;&gt;&lt;del&gt;LU-11359&lt;/del&gt;&lt;/a&gt;, so async_discard is enabled everywhere, this is important too I guess.&lt;/p&gt;

&lt;p&gt;I&apos;ll update if things change (which usually occurs shortly after posting a comment here... &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/tongue.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;)&lt;/p&gt;</comment>
                            <comment id="246750" author="sthiell" created="Mon, 6 May 2019 17:33:58 +0000"  >&lt;p&gt;Quick update... our Fir system (running the patches above) has been stable last week and last weekend. Note that we&apos;re &quot;only&quot; running patchset #11 of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11359&quot; title=&quot;racer test 1 times out with client hung in dir_create.sh, ls, &#8230; and MDS in ldlm_completion_ast()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11359&quot;&gt;&lt;del&gt;LU-11359&lt;/del&gt;&lt;/a&gt;, and I see that since then it has evolved quite a bit... &lt;/p&gt;</comment>
                            <comment id="247030" author="pjones" created="Sat, 11 May 2019 14:38:47 +0000"  >&lt;p&gt;This seems encouraging enough to consider this ticket a duplicate (though it is hard to pin onto exactly which patch it is a duplicate of &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&#160;)&lt;/p&gt;</comment>
                            <comment id="263017" author="jamesanunez" created="Mon, 10 Feb 2020 17:22:52 +0000"  >&lt;p&gt;Mike - I think we are still seeing this issue. Would you please take a look at the racer hang at &lt;a href=&quot;https://testing.whamcloud.com/test_sets/611e867e-4b76-11ea-a1c8-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/611e867e-4b76-11ea-a1c8-52540065bddc&lt;/a&gt; ? &lt;/p&gt;

&lt;p&gt;If this is not the same hang, I&apos;ll open a new ticket. &lt;/p&gt;

&lt;p&gt;I know this ticket was closed as a duplicate, but I don&apos;t know what ticket it is duplicating and, thus, can&apos;t see if that one is still open. &lt;/p&gt;</comment>
                            <comment id="263022" author="tappro" created="Mon, 10 Feb 2020 18:42:30 +0000"  >&lt;p&gt;James. I don&apos;t see evidences for that so far, neither server nor client has traces of DOM locks, but server stack traces contains &lt;tt&gt;osp_md_object_lock&lt;/tt&gt; call and &lt;tt&gt;mdt_rename_lock&lt;/tt&gt; on servers, so I tend to think this is DNE issue, like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12037&quot; title=&quot;Possible DNE issue leading to hung filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12037&quot;&gt;&lt;del&gt;LU-12037&lt;/del&gt;&lt;/a&gt;. Can you check what is &lt;tt&gt;mdt.*.enable_remote_rename&lt;/tt&gt; parameter on servers? Probably that is remote rename DNE issue. &lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="53269">LU-11359</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="53122">LU-11285</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="32495" name="fir-md1-s1-kernel-20190428.log" size="993863" author="sthiell" created="Sun, 28 Apr 2019 17:38:45 +0000"/>
                            <attachment id="32496" name="fir-md1-s2-kernel-20190428.log" size="641636" author="sthiell" created="Sun, 28 Apr 2019 17:38:50 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i007pb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>