<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:39:43 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4106] racer test hang</title>
                <link>https://jira.whamcloud.com/browse/LU-4106</link>
                <project id="10000" key="LU">Lustre</project>
            <description>&lt;p&gt;Started running acceptance-small at 12:36:21 on the cluster; the test hung during racer at approximately 14:35 and was still hung over 14 hours later.&lt;/p&gt;

&lt;p&gt;The nodes were set up to use HSM, but the copytool was not started on the agent (c08). The MGS/MDS (c03) did have HSM enabled since this was part of the testing for the HSM test plan.&lt;/p&gt;

&lt;p&gt;I am uploading logs now and will post a link to the test results when they are available. Since I had to kill the job, I&apos;m not sure what information will be included in the logs. I looked at dmesg and /var/log/messages before killing the job and the following is what I see on each of the nodes. &lt;/p&gt;

&lt;p&gt;On the node running the tests (c15), the console has the normal racer output but does not post &quot;PASS&quot; for racer:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;racer cleanup
racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
there should be NO racer processes:
root     26464  0.0  0.0 103244   888 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root     26466  0.0  0.0 103244   892 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
there should be NO racer processes:
Waited 5, rc=2 sleeping 10 sec ...
there should be NO racer processes:
there should be NO racer processes:
Filesystem           1K-blocks      Used Available Use% Mounted on
mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root      8360  0.0  0.0 103244   892 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root      8362  0.0  0.0 103244   892 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
there should be NO racer processes:
root      4662  0.0  0.0 103244   872 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root      4664  0.0  0.0 103244   872 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root     12206  0.0  0.0 103248   904 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root     12208  0.0  0.0 103244   888 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem           1K-blocks      Used Ac15: file_create.sh: no process killed
c15: file_create.sh: no process killed
c15: dir_create.sh: no process killed
c15: dir_create.sh: no process killed
c15: file_rm.sh: no process killed
c15: file_rm.sh: no process killed
c15: file_rename.sh: no process killed
c15: file_rename.sh: no process killed
c15: file_link.sh: no process killed
c15: file_link.sh: no process killed
c15: file_symlink.sh: no process killed
c15: file_symlink.sh: no process killed
c15: file_list.sh: no process killed
c15: file_list.sh: no process killed
c15: file_concat.sh: no process killed
c15: file_concat.sh: no process killed
c15: file_exec.sh: no process killed
c15: file_exec.sh: no process killed
c14: file_create.sh: no process killed
c14: file_create.sh: no process killed
c14: dir_create.sh: no process killed
c14: dir_create.sh: no process killed
c14: file_rm.sh: no process killed
c14: file_rm.sh: no process killed
c14: file_rename.sh: no process killed
c14: file_rename.sh: no process killed
c14: file_link.sh: no process killed
c14: file_link.sh: no process killed
c14: file_symlink.sh: no process killed
c14: file_symlink.sh: no process killed
c14: file_list.sh: no process killed
c14: file_list.sh: no process killed
c14: file_concat.sh: no process killed
c14: file_concat.sh: no process killed
c14: file_exec.sh: no process killed
c14: file_exec.sh: no process killed
vailable Use% Mounted on
Filesystem           1K-blocks      Used Available Use% Mounted on
mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
root     17918  0.0  0.0 103244   884 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root     17920  0.0  0.0 103244   880 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root     15851  0.0  0.0 103244   888 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem           1K-blocks      Used Available Use% Mounted on
mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
Filesystem           1K-blocks      Used Available Use% Mounted on
mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
Filesystem           1K-blocks      Used Available Use% Mounted on
mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root     12012  0.0  0.0 103244   880 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem           1K-blocks      Used Available Use% Mounted on
mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root      2565  0.0  0.0 103244   876 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem           1K-blocks      Used Available Use% Mounted on
mds@o2ib:/scratch     25088052   6904116  16843572  30% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
pid=30960 rc=0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From /var/log/message on c15:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct 14 14:19:37 c15-ib kernel: Lustre: DEBUG MARKER: -----============= acceptance-small: racer ============----- Mon Oct 14 14:19:36 PDT 2013
Oct 14 14:19:37 c15-ib kernel: Lustre: DEBUG MARKER: excepting tests:
Oct 14 14:19:37 c15-ib kernel: Lustre: Layout lock feature supported.
Oct 14 14:19:38 c15-ib kernel: Lustre: Mounted scratch-client
Oct 14 14:19:41 c15-ib kernel: Lustre: DEBUG MARKER: Using TIMEOUT=100
Oct 14 14:19:43 c15-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
Oct 14 14:35:13 c15-ib kernel: Lustre: 4077:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1381785590/real 1381785590]  req@ffff8806fa903400 x1448904795054872/t0(0) o101-&amp;gt;scratch-MDT0000-mdc-ffff8802dd377400@192.168.2.103@o2ib:12/10 lens 576/1080 e 5 to 1 dl 1381786513 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct 14 14:35:13 c15-ib kernel: Lustre: scratch-MDT0000-mdc-ffff8802dd377400: Connection to scratch-MDT0000 (at 192.168.2.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 14 14:35:13 c15-ib kernel: Lustre: Skipped 2 previous similar messages
Oct 14 14:35:14 c15-ib kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff8802dd377400: Communicating with 192.168.2.103@o2ib, operation mds_connect failed with -16.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;dmesg on the MDS (c03) shows:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428f:0x102e:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004289:0x19a7:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428e:0x1fbd:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004284:0x264c:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004284:0x2b6f:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428e:0x35b5:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004282:0x45a8:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428b:0x5305:0x0], rc = 0 [1]
INFO: task mdt01_017:15697 blocked for more than 120 seconds.
&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
mdt01_017     D 0000000000000007     0 15697      2 0x00000080
 ffff880704ebb9b0 0000000000000046 ffff880704ebb950 ffffffffa04ebe75
 0000000100000000 ffffc90016eb7030 0000000000000246 0000000000000246
 ffff8806b85a9098 ffff880704ebbfd8 000000000000fb88 ffff8806b85a9098
Call Trace:
 [&amp;lt;ffffffffa04ebe75&amp;gt;] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
 [&amp;lt;ffffffffa0628274&amp;gt;] ? htable_lookup+0x1c4/0x1e0 [obdclass]
 [&amp;lt;ffffffffa062888b&amp;gt;] lu_object_find_at+0xab/0x360 [obdclass]
 [&amp;lt;ffffffffa07ac026&amp;gt;] ? lustre_msg_string+0x96/0x290 [ptlrpc]
 [&amp;lt;ffffffff81063410&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa07abf85&amp;gt;] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
 [&amp;lt;ffffffffa0628b56&amp;gt;] lu_object_find+0x16/0x20 [obdclass]
 [&amp;lt;ffffffffa0db5af6&amp;gt;] mdt_object_find+0x56/0x170 [mdt]
 [&amp;lt;ffffffffa0dc8d34&amp;gt;] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
 [&amp;lt;ffffffffa07abf85&amp;gt;] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
 [&amp;lt;ffffffffa07d2f06&amp;gt;] ? __req_capsule_get+0x166/0x710 [ptlrpc]
 [&amp;lt;ffffffffa07ae214&amp;gt;] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
 [&amp;lt;ffffffffa0dca169&amp;gt;] mdt_intent_getattr+0x299/0x480 [mdt]
 [&amp;lt;ffffffffa0db88ce&amp;gt;] mdt_intent_policy+0x3ae/0x770 [mdt]
 [&amp;lt;ffffffffa0764461&amp;gt;] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
 [&amp;lt;ffffffffa078d17f&amp;gt;] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
 [&amp;lt;ffffffffa0db8d96&amp;gt;] mdt_enqueue+0x46/0xe0 [mdt]
 [&amp;lt;ffffffffa0dbfa8a&amp;gt;] mdt_handle_common+0x52a/0x1470 [mdt]
 [&amp;lt;ffffffffa0df9c55&amp;gt;] mds_regular_handle+0x15/0x20 [mdt]
 [&amp;lt;ffffffffa07bce25&amp;gt;] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [&amp;lt;ffffffffa04e827f&amp;gt;] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [&amp;lt;ffffffffa07b44c9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [&amp;lt;ffffffffa07be18d&amp;gt;] ptlrpc_main+0xaed/0x1740 [ptlrpc]
 [&amp;lt;ffffffffa07bd6a0&amp;gt;] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
 [&amp;lt;ffffffff81096a36&amp;gt;] kthread+0x96/0xa0
 [&amp;lt;ffffffff8100c0ca&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff810969a0&amp;gt;] ? kthread+0x0/0xa0
 [&amp;lt;ffffffff8100c0c0&amp;gt;] ? child_rip+0x0/0x20
INFO: task mdt01_017:15697 blocked for more than 120 seconds.
&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
mdt01_017     D 0000000000000007     0 15697      2 0x00000080
 ffff880704ebb9b0 0000000000000046 ffff880704ebb950 ffffffffa04ebe75
 0000000100000000 ffffc90016eb7030 0000000000000246 0000000000000246
 ffff8806b85a9098 ffff880704ebbfd8 000000000000fb88 ffff8806b85a9098
Call Trace:
 [&amp;lt;ffffffffa04ebe75&amp;gt;] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
 [&amp;lt;ffffffffa0628274&amp;gt;] ? htable_lookup+0x1c4/0x1e0 [obdclass]
 [&amp;lt;ffffffffa062888b&amp;gt;] lu_object_find_at+0xab/0x360 [obdclass]
 [&amp;lt;ffffffffa07ac026&amp;gt;] ? lustre_msg_string+0x96/0x290 [ptlrpc]
 [&amp;lt;ffffffff81063410&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa07abf85&amp;gt;] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
 [&amp;lt;ffffffffa0628b56&amp;gt;] lu_object_find+0x16/0x20 [obdclass]
 [&amp;lt;ffffffffa0db5af6&amp;gt;] mdt_object_find+0x56/0x170 [mdt]
 [&amp;lt;ffffffffa0dc8d34&amp;gt;] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
 [&amp;lt;ffffffffa07abf85&amp;gt;] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
 [&amp;lt;ffffffffa07d2f06&amp;gt;] ? __req_capsule_get+0x166/0x710 [ptlrpc]
 [&amp;lt;ffffffffa07ae214&amp;gt;] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
 [&amp;lt;ffffffffa0dca169&amp;gt;] mdt_intent_getattr+0x299/0x480 [mdt]
 [&amp;lt;ffffffffa0db88ce&amp;gt;] mdt_intent_policy+0x3ae/0x770 [mdt]
 [&amp;lt;ffffffffa0764461&amp;gt;] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
 [&amp;lt;ffffffffa078d17f&amp;gt;] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
 [&amp;lt;ffffffffa0db8d96&amp;gt;] mdt_enqueue+0x46/0xe0 [mdt]
 [&amp;lt;ffffffffa0dbfa8a&amp;gt;] mdt_handle_common+0x52a/0x1470 [mdt]
 [&amp;lt;ffffffffa0df9c55&amp;gt;] mds_regular_handle+0x15/0x20 [mdt]
 [&amp;lt;ffffffffa07bce25&amp;gt;] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [&amp;lt;ffffffffa04e827f&amp;gt;] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [&amp;lt;ffffffffa07b44c9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [&amp;lt;ffffffffa07be18d&amp;gt;] ptlrpc_main+0xaed/0x1740 [ptlrpc]
 [&amp;lt;ffffffffa07bd6a0&amp;gt;] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
 [&amp;lt;ffffffff81096a36&amp;gt;] kthread+0x96/0xa0
 [&amp;lt;ffffffff8100c0ca&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff810969a0&amp;gt;] ? kthread+0x0/0xa0
 [&amp;lt;ffffffff8100c0c0&amp;gt;] ? child_rip+0x0/0x20
Lustre: 32077:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (5/-207), not sending early reply  req@ffff88069ee86000 x1448904795054872/t0(0) o101-&amp;gt;5dccd683-21ab-f196-a776-5b7d390fc289@192.168.2.115@o2ib:0/0 lens 576/1080 e 5 to 0 dl 1381786402 ref 2 fl Interpret:/0/0 rc 0/0
INFO: task mdt01_017:15697 blocked for more than 120 seconds.
&#8230;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There&apos;s nothing interesting in the OST logs.&lt;/p&gt;

&lt;p&gt;dmesg from c08 (agent) and c10:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
LustreError: 11-0: scratch-MDT0000-mdc-ffff880633685000: Communicating with 192.168.2.103@o2ib, operation ldlm_enqueue failed with -71.
LustreError: 2355:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -71
LustreError: 2355:0:(file.c:3069:ll_inode_revalidate_fini()) scratch: revalidate FID [0x20000428c:0x70f:0x0] error: rc = -71
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the node with the Robinhood DB (c09), /var/log/messages has:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct 14 14:19:43 c09-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
Oct 14 14:20:08 c09-ib kernel: LustreError: 7042:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
Oct 14 17:44:31 c09-ib kernel: LustreError: 7361:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1213448 of inode ffff8807b8398b78 failed -28
Oct 14 17:44:31 c09-ib kernel: LustreError: 7361:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1213448 of inode ffff8807b8398b78 failed -28
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;/var/log/messages on c14 contains:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct 14 14:19:43 c14-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
Oct 14 14:24:03 c14-ib kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff880815016400: Communicating with 192.168.2.103@o2ib, operation ldlm_enqueue failed with -71.
Oct 14 14:24:03 c14-ib kernel: LustreError: Skipped 1 previous similar message
Oct 14 14:24:03 c14-ib kernel: LustreError: 21327:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -71
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>Lustre 2.5.0-RC1, el6&lt;br/&gt;
&lt;br/&gt;
OpenSFS cluster with combined MGS/MDS (c03), single OSS (c04) with two OSTs, archive MGS/MDS (c05), archive OSS (c06) with two OSTs, archive OSS2 (c07) with two OSTs, and eight clients: one agent + client (c08), one Robinhood/DB + client (c09), and the rest running only as Lustre clients (c10, c11, c12, c13, c14, c15)</environment>
        <key id="21417">LU-4106</key>
            <summary>racer test hang</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                            <label>HB</label>
                    </labels>
                <created>Tue, 15 Oct 2013 15:51:39 +0000</created>
                <updated>Tue, 28 Jun 2016 05:14:31 +0000</updated>
                            <resolved>Thu, 23 Jan 2014 16:03:26 +0000</resolved>
                                    <version>Lustre 2.5.0</version>
                    <version>Lustre 2.6.0</version>
                    <version>Lustre 2.5.1</version>
                                    <fixVersion>Lustre 2.6.0</fixVersion>
                    <fixVersion>Lustre 2.5.3</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="68992" author="jhammond" created="Tue, 15 Oct 2013 16:10:18 +0000"  >&lt;p&gt;I saw a similar hang in lu_object_find_at() last night. It reproduces quickly using:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# MOUNT_2=y llmount.sh
###
# cd /mnt/lustre;
# while true; do
        sys_open f0 w ## open(&quot;f0&quot;, O_WRONLY)
        sys_unlink f0 ## unlink(&quot;f0&quot;)
done &amp;amp;
# cd /mnt/lustre2
# while true; do
        sys_open f0 cw ## open(&quot;f0&quot;, O_CREAT|O_WRONLY)
        sys_stat f0 ## stat(&quot;f0&quot;, ...)
done &amp;amp;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
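
&lt;p&gt;For reference, here are the same loops as a standalone C program (an illustrative sketch only; it assumes the sys_* helpers above are thin wrappers around the corresponding syscalls, and it uses the two client mounts created by MOUNT_2=y llmount.sh):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/* Race an open(O_WRONLY)+unlink loop on one client mount against an
 * open(O_CREAT|O_WRONLY)+stat loop on the second mount of the same
 * filesystem. */
#include &amp;lt;fcntl.h&amp;gt;
#include &amp;lt;sys/stat.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

int main(void)
{
        struct stat st;
        int fd;

        if (fork() == 0) {
                for (;;) {      /* first mount: open for write, then unlink */
                        fd = open(&quot;/mnt/lustre/f0&quot;, O_WRONLY);
                        if (fd &amp;gt;= 0)
                                close(fd);
                        unlink(&quot;/mnt/lustre/f0&quot;);
                }
        }

        for (;;) {              /* second mount: create, then stat */
                fd = open(&quot;/mnt/lustre2/f0&quot;, O_CREAT | O_WRONLY, 0644);
                if (fd &amp;gt;= 0)
                        close(fd);
                stat(&quot;/mnt/lustre2/f0&quot;, &amp;amp;st);
        }
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;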

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LNet: Service thread pid 555 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 555, comm: mdt00_004

# ~/stack1 mdt00_004
555 mdt00_004
[&amp;lt;ffffffffa044d88b&amp;gt;] lu_object_find_at+0xab/0x360 [obdclass]
[&amp;lt;ffffffffa044db56&amp;gt;] lu_object_find+0x16/0x20 [obdclass]
[&amp;lt;ffffffffa0accaf6&amp;gt;] mdt_object_find+0x56/0x170 [mdt]
[&amp;lt;ffffffffa0adfd34&amp;gt;] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
[&amp;lt;ffffffffa0ae1169&amp;gt;] mdt_intent_getattr+0x299/0x480 [mdt]
[&amp;lt;ffffffffa0acf8ce&amp;gt;] mdt_intent_policy+0x3ae/0x770 [mdt]
[&amp;lt;ffffffffa0587461&amp;gt;] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
[&amp;lt;ffffffffa05b017f&amp;gt;] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
[&amp;lt;ffffffffa0acfd96&amp;gt;] mdt_enqueue+0x46/0xe0 [mdt]
[&amp;lt;ffffffffa0ad6a8a&amp;gt;] mdt_handle_common+0x52a/0x1470 [mdt]
[&amp;lt;ffffffffa0b10c55&amp;gt;] mds_regular_handle+0x15/0x20 [mdt]
[&amp;lt;ffffffffa05dfe25&amp;gt;] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[&amp;lt;ffffffffa05e118d&amp;gt;] ptlrpc_main+0xaed/0x1740 [ptlrpc]
[&amp;lt;ffffffff81096a36&amp;gt;] kthread+0x96/0xa0
[&amp;lt;ffffffff8100c0ca&amp;gt;] child_rip+0xa/0x20
[&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="69042" author="jamesanunez" created="Tue, 15 Oct 2013 22:18:14 +0000"  >&lt;p&gt;Partial logs from this run are at: &lt;a href=&quot;https://maloo.whamcloud.com/test_sessions/4574cfe8-35e1-11e3-b051-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sessions/4574cfe8-35e1-11e3-b051-52540035b04c&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="69159" author="jhammond" created="Wed, 16 Oct 2013 20:11:22 +0000"  >&lt;p&gt;In the GETATTR handler the two FIDs passed in the request may be identical (see __ll_inode_revalidate_it()). In this case calling mdt_object_find() twice (once in mdt_object_find() as the fid1 and once in mdt_getattr_name_lock() as fid2) is dangerous as a concurrent unlink may kill the object causing the second find to hang. To see that this is the issue I added a assertion to mdt_getattr_name_lock() just before we find the child object.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;diff --git a/lustre/mdt/mdt_handler.c b/lustre/mdt/mdt_handler.c
index 757bae3..4fef04e 100644
--- a/lustre/mdt/mdt_handler.c
+++ b/lustre/mdt/mdt_handler.c
@@ -1344,6 +1344,9 @@ static int mdt_getattr_name_lock(struct mdt_thread_info *info,
                 mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_POS);
         }
 
+       if (lu_fid_eq(mdt_object_fid(parent), child_fid))
+               LASSERT(!lu_object_is_dying(&amp;amp;parent-&amp;gt;mot_header));
+
         /*
          *step 3: find the child object by fid &amp;amp; lock it.
          *        regardless if it is local or remote.
          */
	child = mdt_object_find(info-&amp;gt;mti_env, info-&amp;gt;mti_mdt, child_fid);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then running the above reproducer triggers the assertion rather than hanging in lu_object_find_at():&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 8891:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 9101:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 9371:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 9371:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 1 previous similar message
LustreError: 9913:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 9913:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 9 previous similar messages
LustreError: 11018:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 11018:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 25 previous similar messages
LustreError: 13088:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 13088:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 43 previous similar messages
LustreError: 17048:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 17048:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 57 previous similar messages
LustreError: 25203:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 25203:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 146 previous similar messages
LustreError: 10648:0:(mdt_handler.c:1348:mdt_getattr_name_lock()) ASSERTION( !lu_object_is_dying(&amp;amp;parent-&amp;gt;mot_header) ) failed: 
LustreError: 10648:0:(mdt_handler.c:1348:mdt_getattr_name_lock()) LBUG
Pid: 10648, comm: mdt00_004

Call Trace:
 [&amp;lt;ffffffffa0d95895&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [&amp;lt;ffffffffa0d95e97&amp;gt;] lbug_with_loc+0x47/0xb0 [libcfs]
 [&amp;lt;ffffffffa0705ec2&amp;gt;] mdt_getattr_name_lock+0x1982/0x19e0 [mdt]
 [&amp;lt;ffffffffa1086f06&amp;gt;] ? __req_capsule_get+0x166/0x710 [ptlrpc]
 [&amp;lt;ffffffffa07061b9&amp;gt;] mdt_intent_getattr+0x299/0x480 [mdt]
 [&amp;lt;ffffffffa06f48ce&amp;gt;] mdt_intent_policy+0x3ae/0x770 [mdt]
 [&amp;lt;ffffffffa1018461&amp;gt;] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
 [&amp;lt;ffffffffa104117f&amp;gt;] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
 [&amp;lt;ffffffffa06f4d96&amp;gt;] mdt_enqueue+0x46/0xe0 [mdt]
 [&amp;lt;ffffffffa06fba8a&amp;gt;] mdt_handle_common+0x52a/0x1470 [mdt]
 [&amp;lt;ffffffffa0735ca5&amp;gt;] mds_regular_handle+0x15/0x20 [mdt]
 [&amp;lt;ffffffffa1070e25&amp;gt;] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [&amp;lt;ffffffffa0da727f&amp;gt;] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [&amp;lt;ffffffffa10684c9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [&amp;lt;ffffffffa107218d&amp;gt;] ptlrpc_main+0xaed/0x1740 [ptlrpc]
 [&amp;lt;ffffffffa10716a0&amp;gt;] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
 [&amp;lt;ffffffff81096a36&amp;gt;] kthread+0x96/0xa0
 [&amp;lt;ffffffff8100c0ca&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff810969a0&amp;gt;] ? kthread+0x0/0xa0
 [&amp;lt;ffffffff8100c0c0&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
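
&lt;p&gt;To make the failure mode concrete, the following is a minimal userspace analogue of the self-deadlock (an illustrative sketch with hypothetical names, not the actual lu_object_find_at() implementation): the second lookup of the same FID waits for the object&apos;s last reference to drop, but the caller itself holds that reference.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/* find() mimics the relevant behavior of lu_object_find_at(): if the
 * cached object is dying, wait for its last reference to drop before
 * allocating a replacement.  Compile with -pthread; the program hangs
 * in the second find(), demonstrating the deadlock. */
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

struct obj {
        int refcount;
        int dying;              /* set when a concurrent unlink kills it */
};

static struct obj cached;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t freed = PTHREAD_COND_INITIALIZER;

static struct obj *find(void)
{
        pthread_mutex_lock(&amp;amp;lock);
        while (cached.dying &amp;amp;&amp;amp; cached.refcount &amp;gt; 0)
                pthread_cond_wait(&amp;amp;freed, &amp;amp;lock);
        cached.refcount++;
        pthread_mutex_unlock(&amp;amp;lock);
        return &amp;amp;cached;
}

int main(void)
{
        struct obj *parent = find();    /* lookup of fid1: refcount = 1 */
        struct obj *child;

        pthread_mutex_lock(&amp;amp;lock);
        cached.dying = 1;               /* a concurrent unlink marks it dying */
        pthread_mutex_unlock(&amp;amp;lock);

        /* Second lookup of the SAME fid: waits for refcount to reach 0,
         * but this thread itself holds the reference, so it never wakes. */
        child = find();

        (void)parent;
        (void)child;
        printf(&quot;never reached\n&quot;);
        return 0;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;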

&lt;p&gt;Perhaps mdt_getattr_name_lock() should detect the same FID case and handle this separately.&lt;/p&gt;</comment>
                            <comment id="69174" author="green" created="Wed, 16 Oct 2013 22:56:14 +0000"  >&lt;p&gt;parent == child should only happen for getattr by fid I think, and we just landed the patch about it for 3240 (&lt;a href=&quot;http://review.whamcloud.com/7910&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7910&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;I discussed with wangdi previously that in reality getattr by fid should probably be a separate function from getattr by name.&lt;/p&gt;</comment>
                            <comment id="69175" author="jhammond" created="Wed, 16 Oct 2013 23:03:42 +0000"  >&lt;p&gt;Yes. The change to __ll_inode_revalidate_it() in 7910 needs a corresponding server change. Perhaps we should revert that part for interoperability sake.&lt;/p&gt;</comment>
                            <comment id="69176" author="pjones" created="Wed, 16 Oct 2013 23:10:14 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;Could you please look into this one as your top priority?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="69191" author="yong.fan" created="Thu, 17 Oct 2013 03:57:04 +0000"  >&lt;p&gt;The message of &quot;LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for &lt;span class=&quot;error&quot;&gt;&amp;#91;0x20000428f:0x102e:0x0&amp;#93;&lt;/span&gt;, rc = 0 &lt;span class=&quot;error&quot;&gt;&amp;#91;1&amp;#93;&lt;/span&gt;&quot; looks abnormal, I will make a patch to detect and fix it.&lt;/p&gt;</comment>
                            <comment id="69196" author="bobijam" created="Thu, 17 Oct 2013 07:54:09 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/7990&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7990&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;commit message&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LU-4106 mdt: avoid recursive lu_object_find on an object

LU-3240 (commit 762f2114d282a98ebfa4dbbeea9298a8088ad24e) sets the
parent dir fid to the same value as the child fid in the getattr-by-fid
case. We should not lu_object_find() the object recursively, since a
concurrent unlink may destroy the object and leave the second find
hanging.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="69271" author="yong.fan" created="Fri, 18 Oct 2013 03:56:17 +0000"  >&lt;p&gt;The patch for OI scrub fixing:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/8002/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8002/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="71908" author="jhammond" created="Tue, 19 Nov 2013 17:46:08 +0000"  >&lt;p&gt;Bobijam&apos;s lu_object_find() patch has landed to master (for 2.6.0). Fan Yong&apos;s OI scrub patch is still in flight.&lt;/p&gt;</comment>
                            <comment id="75504" author="jlevi" created="Thu, 23 Jan 2014 16:03:26 +0000"  >&lt;p&gt;Patches have landed to Master. Please reopen this ticket if more work is needed.&lt;/p&gt;</comment>
                            <comment id="87460" author="yujian" created="Wed, 25 Jun 2014 04:59:04 +0000"  >&lt;p&gt;Lustre Build: &lt;a href=&quot;https://build.whamcloud.com/job/lustre-b2_5/43&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.whamcloud.com/job/lustre-b2_5/43&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same failure occurred: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/d9f82e82-bea1-11e3-a50c-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/d9f82e82-bea1-11e3-a50c-52540035b04c&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="21878">LU-4216</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="17555">LU-2807</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="21570">LU-4132</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="21652">LU-4149</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw5nb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>11038</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>