[LU-4106] racer test hang Created: 15/Oct/13 Updated: 28/Jun/16 Resolved: 23/Jan/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.3 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James Nunez (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Environment: |
Lustre 2.5.0-RC1, el6 OpenSFS cluster with combined MGS/MDS (c03), single OSS (c04) with two OSTs, archive MGS/MDS (c05), archive OSS (c06) with two OSTs, archive OSS2 (c07) with two OSTs, and eight clients: one agent + client (c08), one robinhood/db + client (c09), and the rest running as plain Lustre clients (c10, c11, c12, c13, c14, c15) |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 11038 |
| Description |
|
Started running acceptance-small at 12:36:21 on the cluster; the test hung during racer at approximately 14:35 and was still hung over 14 hours later. The nodes were set up to use HSM, but the copytool was not started on the agent (c08). The MGS/MDS (c03) did have HSM enabled, since this was part of the testing for the HSM test plan. I am uploading logs now and will post a link to the test results when they are available. Since I had to kill the job, I'm not sure what information will be included in the logs. I looked at dmesg and /var/log/messages before killing the job, and the following is what I see on each of the nodes.

On the node running the tests (c15), the console has the normal racer output but does not post "PASS" for racer:

racer cleanup
racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
there should be NO racer processes:
root 26464 0.0 0.0 103244 888 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 26466 0.0 0.0 103244 892 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
there should be NO racer processes:
Waited 5, rc=2
sleeping 10 sec ...
there should be NO racer processes:
there should be NO racer processes:
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root 8360 0.0 0.0 103244 892 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 8362 0.0 0.0 103244 892 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
there should be NO racer processes:
root 4662 0.0 0.0 103244 872 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 4664 0.0 0.0 103244 872 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 12206 0.0 0.0 103248 904 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 12208 0.0 0.0 103244 888 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used A
c15: file_create.sh: no process killed
c15: file_create.sh: no process killed
c15: dir_create.sh: no process killed
c15: dir_create.sh: no process killed
c15: file_rm.sh: no process killed
c15: file_rm.sh: no process killed
c15: file_rename.sh: no process killed
c15: file_rename.sh: no process killed
c15: file_link.sh: no process killed
c15: file_link.sh: no process killed
c15: file_symlink.sh: no process killed
c15: file_symlink.sh: no process killed
c15: file_list.sh: no process killed
c15: file_list.sh: no process killed
c15: file_concat.sh: no process killed
c15: file_concat.sh: no process killed
c15: file_exec.sh: no process killed
c15: file_exec.sh: no process killed
c14: file_create.sh: no process killed
c14: file_create.sh: no process killed
c14: dir_create.sh: no process killed
c14: dir_create.sh: no process killed
c14: file_rm.sh: no process killed
c14: file_rm.sh: no process killed
c14: file_rename.sh: no process killed
c14: file_rename.sh: no process killed
c14: file_link.sh: no process killed
c14: file_link.sh: no process killed
c14: file_symlink.sh: no process killed
c14: file_symlink.sh: no process killed
c14: file_list.sh: no process killed
c14: file_list.sh: no process killed
c14: file_concat.sh: no process killed
c14: file_concat.sh: no process killed
c14: file_exec.sh: no process killed
c14: file_exec.sh: no process killed
vailable Use% Mounted on
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
root 17918 0.0 0.0 103244 884 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 17920 0.0 0.0 103244 880 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 15851 0.0 0.0 103244 888 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root 12012 0.0 0.0 103244 880 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root 2565 0.0 0.0 103244 876 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6904116 16843572 30% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
pid=30960 rc=0

From /var/log/messages on c15:

Oct 14 14:19:37 c15-ib kernel: Lustre: DEBUG MARKER: -----============= acceptance-small: racer ============----- Mon Oct 14 14:19:36 PDT 2013
Oct 14 14:19:37 c15-ib kernel: Lustre: DEBUG MARKER: excepting tests:
Oct 14 14:19:37 c15-ib kernel: Lustre: Layout lock feature supported.
Oct 14 14:19:38 c15-ib kernel: Lustre: Mounted scratch-client
Oct 14 14:19:41 c15-ib kernel: Lustre: DEBUG MARKER: Using TIMEOUT=100
Oct 14 14:19:43 c15-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
Oct 14 14:35:13 c15-ib kernel: Lustre: 4077:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1381785590/real 1381785590] req@ffff8806fa903400 x1448904795054872/t0(0) o101->scratch-MDT0000-mdc-ffff8802dd377400@192.168.2.103@o2ib:12/10 lens 576/1080 e 5 to 1 dl 1381786513 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct 14 14:35:13 c15-ib kernel: Lustre: scratch-MDT0000-mdc-ffff8802dd377400: Connection to scratch-MDT0000 (at 192.168.2.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 14 14:35:13 c15-ib kernel: Lustre: Skipped 2 previous similar messages
Oct 14 14:35:14 c15-ib kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff8802dd377400: Communicating with 192.168.2.103@o2ib, operation mds_connect failed with -16.

dmesg on the MDS (c03) shows:

Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428f:0x102e:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004289:0x19a7:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428e:0x1fbd:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004284:0x264c:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004284:0x2b6f:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428e:0x35b5:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004282:0x45a8:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428b:0x5305:0x0], rc = 0 [1]
INFO: task mdt01_017:15697 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdt01_017 D 0000000000000007 0 15697 2 0x00000080
ffff880704ebb9b0 0000000000000046 ffff880704ebb950 ffffffffa04ebe75
0000000100000000 ffffc90016eb7030 0000000000000246 0000000000000246
ffff8806b85a9098 ffff880704ebbfd8 000000000000fb88 ffff8806b85a9098
Call Trace:
[<ffffffffa04ebe75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
[<ffffffffa0628274>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
[<ffffffffa062888b>] lu_object_find_at+0xab/0x360 [obdclass]
[<ffffffffa07ac026>] ? lustre_msg_string+0x96/0x290 [ptlrpc]
[<ffffffff81063410>] ? default_wake_function+0x0/0x20
[<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[<ffffffffa0628b56>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffa0db5af6>] mdt_object_find+0x56/0x170 [mdt]
[<ffffffffa0dc8d34>] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
[<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[<ffffffffa07d2f06>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
[<ffffffffa07ae214>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
[<ffffffffa0dca169>] mdt_intent_getattr+0x299/0x480 [mdt]
[<ffffffffa0db88ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
[<ffffffffa0764461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
[<ffffffffa078d17f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
[<ffffffffa0db8d96>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa0dbfa8a>] mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa0df9c55>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa07bce25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa04e827f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa07b44c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
[<ffffffffa07be18d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
[<ffffffffa07bd6a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
[<ffffffff81096a36>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810969a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task mdt01_017:15697 blocked for more than 120 seconds.
(the same hung-task report and stack trace for mdt01_017 repeats)
Lustre: 32077:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply req@ffff88069ee86000 x1448904795054872/t0(0) o101->5dccd683-21ab-f196-a776-5b7d390fc289@192.168.2.115@o2ib:0/0 lens 576/1080 e 5 to 0 dl 1381786402 ref 2 fl Interpret:/0/0 rc 0/0
INFO: task mdt01_017:15697 blocked for more than 120 seconds. …

There's nothing interesting in the OST logs.

dmesg from c08 (agent) and c10:

Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
LustreError: 11-0: scratch-MDT0000-mdc-ffff880633685000: Communicating with 192.168.2.103@o2ib, operation ldlm_enqueue failed with -71.
LustreError: 2355:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -71
LustreError: 2355:0:(file.c:3069:ll_inode_revalidate_fini()) scratch: revalidate FID [0x20000428c:0x70f:0x0] error: rc = -71

On the node with the Robinhood DB (c09), /var/log/messages has:

Oct 14 14:19:43 c09-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
Oct 14 14:20:08 c09-ib kernel: LustreError: 7042:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
Oct 14 17:44:31 c09-ib kernel: LustreError: 7361:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1213448 of inode ffff8807b8398b78 failed -28
Oct 14 17:44:31 c09-ib kernel: LustreError: 7361:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1213448 of inode ffff8807b8398b78 failed -28

/var/log/messages on c14 contains:

Oct 14 14:19:43 c14-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
Oct 14 14:24:03 c14-ib kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff880815016400: Communicating with 192.168.2.103@o2ib, operation ldlm_enqueue failed with -71.
Oct 14 14:24:03 c14-ib kernel: LustreError: Skipped 1 previous similar message
Oct 14 14:24:03 c14-ib kernel: LustreError: 21327:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -71 |
| Comments |
| Comment by John Hammond [ 15/Oct/13 ] |
|
I saw a similar hang in lu_object_find_at() last night. It reproduces quickly using:

# MOUNT_2=y llmount.sh
###
# cd /mnt/lustre;
# while true; do
sys_open f0 w ## open("f0", O_WRONLY)
sys_unlink f0 ## unlink("f0")
done &
# cd /mnt/lustre2
# while true; do
sys_open f0 cw ## open("f0", O_CREAT|O_WRONLY)
sys_stat f0 ## stat("f0", ...)
done &
LNet: Service thread pid 555 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 555, comm: mdt00_004

# ~/stack1 mdt00_004
555 mdt00_004
[<ffffffffa044d88b>] lu_object_find_at+0xab/0x360 [obdclass]
[<ffffffffa044db56>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffa0accaf6>] mdt_object_find+0x56/0x170 [mdt]
[<ffffffffa0adfd34>] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
[<ffffffffa0ae1169>] mdt_intent_getattr+0x299/0x480 [mdt]
[<ffffffffa0acf8ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
[<ffffffffa0587461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
[<ffffffffa05b017f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
[<ffffffffa0acfd96>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa0ad6a8a>] mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa0b10c55>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa05dfe25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa05e118d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
[<ffffffff81096a36>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff |
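Editorial note: the sys_open/sys_stat/sys_unlink commands above appear to be thin test wrappers around the syscalls shown in the ## comments. A rough userspace analogue of the two racing loops in plain C is sketched below; it assumes /mnt/lustre and /mnt/lustre2 are two client mounts of the same filesystem, as set up by MOUNT_2=y llmount.sh, and is not part of the test suite.

/* Sketch only: one process open/unlinks f0 on the first mount while
 * another re-creates and stats it on the second mount. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* First mount: open an existing file write-only, then unlink it. */
static void open_unlink_loop(const char *dir)
{
	if (chdir(dir) != 0) {
		perror(dir);
		exit(1);
	}
	for (;;) {
		int fd = open("f0", O_WRONLY);		/* open("f0", O_WRONLY) */
		if (fd >= 0)
			close(fd);
		unlink("f0");				/* unlink("f0") */
	}
}

/* Second mount: re-create the file, then stat it. */
static void create_stat_loop(const char *dir)
{
	struct stat st;

	if (chdir(dir) != 0) {
		perror(dir);
		exit(1);
	}
	for (;;) {
		int fd = open("f0", O_CREAT | O_WRONLY, 0644);
		if (fd >= 0)
			close(fd);
		stat("f0", &st);			/* stat("f0", ...) */
	}
}

int main(void)
{
	if (fork() == 0)
		open_unlink_loop("/mnt/lustre");	/* first client mount */
	if (fork() == 0)
		create_stat_loop("/mnt/lustre2");	/* second client mount */
	wait(NULL);					/* children loop until killed */
	return 0;
}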
| Comment by James Nunez (Inactive) [ 15/Oct/13 ] |
|
Partial logs from this run are at: https://maloo.whamcloud.com/test_sessions/4574cfe8-35e1-11e3-b051-52540035b04c |
| Comment by John Hammond [ 16/Oct/13 ] |
|
In the GETATTR handler the two FIDs passed in the request may be identical (see __ll_inode_revalidate_it()). In this case calling mdt_object_find() twice (once to find the parent as fid1 and once in mdt_getattr_name_lock() to find the child as fid2) is dangerous, as a concurrent unlink may kill the object, causing the second find to hang. To confirm that this is the issue, I added an assertion to mdt_getattr_name_lock() just before we find the child object.

diff --git a/lustre/mdt/mdt_handler.c b/lustre/mdt/mdt_handler.c
index 757bae3..4fef04e 100644
--- a/lustre/mdt/mdt_handler.c
+++ b/lustre/mdt/mdt_handler.c
@@ -1344,6 +1344,9 @@ static int mdt_getattr_name_lock(struct mdt_thread_info *info,
mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_POS);
}
+ if (lu_fid_eq(mdt_object_fid(parent), child_fid))
+ LASSERT(!lu_object_is_dying(&parent->mot_header));
+
/*
*step 3: find the child object by fid & lock it.
* regardless if it is local or remote.
*/
child = mdt_object_find(info->mti_env, info->mti_mdt, child_fid);
Then running the above reproducer triggers the assertion rather than hanging in lu_object_find_at():

LustreError: 8891:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 9101:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 9371:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 9371:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 1 previous similar message
LustreError: 9913:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 9913:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 9 previous similar messages
LustreError: 11018:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 11018:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 25 previous similar messages
LustreError: 13088:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 13088:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 43 previous similar messages
LustreError: 17048:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 17048:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 57 previous similar messages
LustreError: 25203:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 25203:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 146 previous similar messages
LustreError: 10648:0:(mdt_handler.c:1348:mdt_getattr_name_lock()) ASSERTION( !lu_object_is_dying(&parent->mot_header) ) failed:
LustreError: 10648:0:(mdt_handler.c:1348:mdt_getattr_name_lock()) LBUG
Pid: 10648, comm: mdt00_004
Call Trace:
[<ffffffffa0d95895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0d95e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0705ec2>] mdt_getattr_name_lock+0x1982/0x19e0 [mdt]
[<ffffffffa1086f06>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
[<ffffffffa07061b9>] mdt_intent_getattr+0x299/0x480 [mdt]
[<ffffffffa06f48ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
[<ffffffffa1018461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
[<ffffffffa104117f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
[<ffffffffa06f4d96>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa06fba8a>] mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa0735ca5>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa1070e25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa0da727f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa10684c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
[<ffffffffa107218d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
[<ffffffffa10716a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
[<ffffffff81096a36>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810969a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Perhaps mdt_getattr_name_lock() should detect the same-FID case and handle it separately. |
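Editorial note: to make the failure mode concrete, the following is a minimal userspace model of the pattern described above; it is illustrative only and is not Lustre code. The idea it models: looking up a dying object waits until the object's last reference is dropped, but in the getattr-by-FID case the waiter itself still holds that reference through the parent, so the second lookup never returns.

/* Minimal model (not Lustre code) of the hang: the second "find" of the
 * same object waits for a reference the caller itself holds.
 * Build with: cc -pthread model.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct obj {
	int refcount;
	bool dying;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t freed = PTHREAD_COND_INITIALIZER;
static struct obj cached = { .refcount = 0, .dying = false };

/* Stand-in for the object lookup: if the cached object is dying, wait
 * until its last reference is dropped before handing out a fresh one. */
static struct obj *obj_find(void)
{
	pthread_mutex_lock(&lock);
	while (cached.dying && cached.refcount > 0)
		pthread_cond_wait(&freed, &lock);	/* the hang happens here */
	cached.dying = false;
	cached.refcount++;
	pthread_mutex_unlock(&lock);
	return &cached;
}

static void obj_put(struct obj *o)
{
	pthread_mutex_lock(&lock);
	if (--o->refcount == 0)
		pthread_cond_broadcast(&freed);
	pthread_mutex_unlock(&lock);
}

/* Concurrent "unlink" marking the object dying. */
static void *unlinker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	cached.dying = true;
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	/* Step 1: the handler finds the "parent" (fid1 == fid2). */
	struct obj *parent = obj_find();

	/* Step 2: a concurrent unlink kills the object while we hold it. */
	pthread_create(&t, NULL, unlinker, NULL);
	pthread_join(t, NULL);

	/* Step 3: finding the "child" by the same FID now waits forever,
	 * because the reference blocking the wait is our own parent ref. */
	printf("second find (will hang)...\n");
	struct obj *child = obj_find();

	obj_put(child);
	obj_put(parent);
	return 0;
}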
| Comment by Oleg Drokin [ 16/Oct/13 ] |
|
parent == child should only happen for getattr by FID, I think, and we just landed the patch for that in LU-3240 (http://review.whamcloud.com/7910). I discussed with wangdi previously that getattr by FID should probably really be a separate function from getattr by name. |
| Comment by John Hammond [ 16/Oct/13 ] |
|
Yes. The change to __ll_inode_revalidate_it() in 7910 needs a corresponding server change. Perhaps we should revert that part for interoperability's sake. |
| Comment by Peter Jones [ 16/Oct/13 ] |
|
Bobijam, could you please look into this one as your top priority? Thanks, Peter |
| Comment by nasf (Inactive) [ 17/Oct/13 ] |
|
The message "LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428f:0x102e:0x0], rc = 0 [1]" looks abnormal; I will make a patch to detect and fix it. |
| Comment by Zhenyu Xu [ 17/Oct/13 ] |
|
http://review.whamcloud.com/7990

Commit message:

LU-4106 mdt: avoid recursive lu_object_find on an object

LU-3240 (commit 762f2114d282a98ebfa4dbbeea9298a8088ad24e) sets the parent dir FID the same as the child FID in the getattr-by-FID case. We should not lu_object_find() the same object recursively, since that can lead to a hang if a concurrent unlink destroys the object. |
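Editorial note: the shape of the fix suggested by that commit message is sketched below with stand-in types; this is a guess from the description above, not the actual Gerrit change, and the helper names (find_child, the mdt_object_get stub) are illustrative only.

/* Sketch only: when the child FID equals the parent FID, take another
 * reference on the already-found parent instead of a second lookup that
 * could wait forever on a concurrently unlinked (dying) object. */
#include <stdbool.h>
#include <stdio.h>

struct lu_fid { unsigned long long seq, oid; };	/* stub, not the real lu_fid */
struct mdt_object { struct lu_fid fid; int refs; };	/* stub */

static bool lu_fid_eq(const struct lu_fid *a, const struct lu_fid *b)
{
	return a->seq == b->seq && a->oid == b->oid;
}

static struct mdt_object *mdt_object_get(struct mdt_object *o)
{
	o->refs++;				/* take an extra reference */
	return o;
}

static struct mdt_object *find_child(struct mdt_object *parent,
				     const struct lu_fid *child_fid)
{
	if (lu_fid_eq(&parent->fid, child_fid))
		return mdt_object_get(parent);	/* getattr-by-FID: reuse parent */
	/* ... otherwise a real lookup (mdt_object_find()) would go here ... */
	return NULL;
}

int main(void)
{
	/* FID values are just for illustration. */
	struct mdt_object parent = { .fid = { 0x200004282ULL, 0x45a8ULL }, .refs = 1 };
	struct mdt_object *child = find_child(&parent, &parent.fid);

	printf("child == parent: %s, refs = %d\n",
	       child == &parent ? "yes" : "no", parent.refs);
	return 0;
}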
| Comment by nasf (Inactive) [ 18/Oct/13 ] |
|
The patch for the OI scrub fix: |
| Comment by John Hammond [ 19/Nov/13 ] |
|
Bobijam's lu_object_find() patch has landed to master (for 2.6.0). Fan Yong's OI scrub patch is still in flight. |
| Comment by Jodi Levi (Inactive) [ 23/Jan/14 ] |
|
Patches have landed to Master. Please reopen this ticket if more work is needed. |
| Comment by Jian Yu [ 25/Jun/14 ] |
|
Lustre Build: https://build.whamcloud.com/job/lustre-b2_5/43
The same failure occurred: https://maloo.whamcloud.com/test_sets/d9f82e82-bea1-11e3-a50c-52540035b04c |