Details
Type: Bug
Resolution: Fixed
Priority: Blocker
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.1
Environment: Lustre 2.5.0-RC1, el6. OpenSFS cluster with combined MGS/MDS (c03), single OSS (c04) with two OSTs, archive MGS/MDS (c05), archive OST (c06) with two OSTs, archive OST2 (c07) with two OSTs, and eight clients: one agent + client (c08), one robinhood/db + client (c09), and the others running as plain Lustre clients (c10, c11, c12, c13, c14, c15)
Severity: 3
Rank: 11038
Description
Started running acceptance-small at 12:36:21 on the cluster; the test hung during racer at approximately 14:35 and was still hung more than 14 hours later.
The nodes were set up to use HSM, but the copytool was not started on the agent (c08). The MGS/MDS (c03) did have HSM enabled, since this was part of the testing for the HSM test plan.
I am uploading logs now and will post a link to the test results when they are available. Since I had to kill the job, I'm not sure what information will be included in the logs. I looked at dmesg and /var/log/messages before killing the job; the following is what I see on each of the nodes.
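For context, a minimal sketch of how the missing HSM pieces would normally be brought up on this kind of setup. The fsname "scratch" is from this ticket; the archive mount point /archive and archive index 1 are assumptions for illustration, not taken from the test configuration:

```shell
# On the MDS (c03): enable the HSM coordinator on the MDT.
lctl set_param mdt.scratch-MDT0000.hsm_control=enabled

# On the agent (c08): start the POSIX copytool. Without this step, HSM
# requests queue at the coordinator but are never serviced.
# (--hsm-root and --archive values here are illustrative assumptions.)
lhsmtool_posix --daemon --hsm-root /archive --archive=1 /lustre/scratch
```

Since the copytool was not running on c08 during this test, any archive/restore activity would have stalled, though the hang reported here is on the MDT intent path rather than in HSM request processing.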
On the node running the tests (c15), the console has the normal racer output but does not post "PASS" for racer:
racer cleanup
racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
there should be NO racer processes:
root 26464 0.0 0.0 103244 888 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 26466 0.0 0.0 103244 892 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
there should be NO racer processes:
Waited 5, rc=2
sleeping 10 sec ...
there should be NO racer processes:
there should be NO racer processes:
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root 8360 0.0 0.0 103244 892 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 8362 0.0 0.0 103244 892 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
there should be NO racer processes:
root 4662 0.0 0.0 103244 872 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 4664 0.0 0.0 103244 872 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 12206 0.0 0.0 103248 904 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 12208 0.0 0.0 103244 888 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
c15: file_create.sh: no process killed
c15: file_create.sh: no process killed
c15: dir_create.sh: no process killed
c15: dir_create.sh: no process killed
c15: file_rm.sh: no process killed
c15: file_rm.sh: no process killed
c15: file_rename.sh: no process killed
c15: file_rename.sh: no process killed
c15: file_link.sh: no process killed
c15: file_link.sh: no process killed
c15: file_symlink.sh: no process killed
c15: file_symlink.sh: no process killed
c15: file_list.sh: no process killed
c15: file_list.sh: no process killed
c15: file_concat.sh: no process killed
c15: file_concat.sh: no process killed
c15: file_exec.sh: no process killed
c15: file_exec.sh: no process killed
c14: file_create.sh: no process killed
c14: file_create.sh: no process killed
c14: dir_create.sh: no process killed
c14: dir_create.sh: no process killed
c14: file_rm.sh: no process killed
c14: file_rm.sh: no process killed
c14: file_rename.sh: no process killed
c14: file_rename.sh: no process killed
c14: file_link.sh: no process killed
c14: file_link.sh: no process killed
c14: file_symlink.sh: no process killed
c14: file_symlink.sh: no process killed
c14: file_list.sh: no process killed
c14: file_list.sh: no process killed
c14: file_concat.sh: no process killed
c14: file_concat.sh: no process killed
c14: file_exec.sh: no process killed
c14: file_exec.sh: no process killed
Filesystem 1K-blocks Used Available Use% Mounted on
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
root 17918 0.0 0.0 103244 884 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 17920 0.0 0.0 103244 880 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 15851 0.0 0.0 103244 888 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root 12012 0.0 0.0 103244 880 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6357220 17375724 27% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root 2565 0.0 0.0 103244 876 ? S 14:24 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 25088052 6904116 16843572 30% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
pid=30960 rc=0
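The stray "grep -E file_create|…" entries under "there should be NO racer processes:" are the cleanup check's own grep showing up in the ps listing, not leftover racer workers. A minimal sketch of that style of check (an approximation, not racer.sh itself; the pattern lists the worker scripts seen above) with the usual self-match guard:

```shell
# Pattern covering the racer worker scripts from the console output above.
pattern='file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec'

# List matching processes, excluding the grep process itself; without the
# "grep -v grep" guard the check can match its own command line, which is
# what produces the noise in the ticket's console log.
leftovers=$(ps auxww | grep -E "$pattern" | grep -v grep)

if [ -z "$leftovers" ]; then
    echo "no racer processes"
else
    echo "$leftovers"
fi
```

On an idle system this prints "no racer processes"; during the hang it would have shown any worker scripts still stuck in the filesystem.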
From /var/log/messages on c15:
Oct 14 14:19:37 c15-ib kernel: Lustre: DEBUG MARKER: -----============= acceptance-small: racer ============----- Mon Oct 14 14:19:36 PDT 2013
Oct 14 14:19:37 c15-ib kernel: Lustre: DEBUG MARKER: excepting tests:
Oct 14 14:19:37 c15-ib kernel: Lustre: Layout lock feature supported.
Oct 14 14:19:38 c15-ib kernel: Lustre: Mounted scratch-client
Oct 14 14:19:41 c15-ib kernel: Lustre: DEBUG MARKER: Using TIMEOUT=100
Oct 14 14:19:43 c15-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
Oct 14 14:35:13 c15-ib kernel: Lustre: 4077:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1381785590/real 1381785590] req@ffff8806fa903400 x1448904795054872/t0(0) o101->scratch-MDT0000-mdc-ffff8802dd377400@192.168.2.103@o2ib:12/10 lens 576/1080 e 5 to 1 dl 1381786513 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct 14 14:35:13 c15-ib kernel: Lustre: scratch-MDT0000-mdc-ffff8802dd377400: Connection to scratch-MDT0000 (at 192.168.2.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 14 14:35:13 c15-ib kernel: Lustre: Skipped 2 previous similar messages
Oct 14 14:35:14 c15-ib kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff8802dd377400: Communicating with 192.168.2.103@o2ib, operation mds_connect failed with -16.
dmesg on the MDS (c03) shows:
Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428f:0x102e:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004289:0x19a7:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428e:0x1fbd:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004284:0x264c:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004284:0x2b6f:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428e:0x35b5:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004282:0x45a8:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428b:0x5305:0x0], rc = 0 [1]
INFO: task mdt01_017:15697 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdt01_017 D 0000000000000007 0 15697 2 0x00000080
ffff880704ebb9b0 0000000000000046 ffff880704ebb950 ffffffffa04ebe75
0000000100000000 ffffc90016eb7030 0000000000000246 0000000000000246
ffff8806b85a9098 ffff880704ebbfd8 000000000000fb88 ffff8806b85a9098
Call Trace:
[<ffffffffa04ebe75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
[<ffffffffa0628274>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
[<ffffffffa062888b>] lu_object_find_at+0xab/0x360 [obdclass]
[<ffffffffa07ac026>] ? lustre_msg_string+0x96/0x290 [ptlrpc]
[<ffffffff81063410>] ? default_wake_function+0x0/0x20
[<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[<ffffffffa0628b56>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffa0db5af6>] mdt_object_find+0x56/0x170 [mdt]
[<ffffffffa0dc8d34>] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
[<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[<ffffffffa07d2f06>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
[<ffffffffa07ae214>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
[<ffffffffa0dca169>] mdt_intent_getattr+0x299/0x480 [mdt]
[<ffffffffa0db88ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
[<ffffffffa0764461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
[<ffffffffa078d17f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
[<ffffffffa0db8d96>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa0dbfa8a>] mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa0df9c55>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa07bce25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa04e827f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa07b44c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
[<ffffffffa07be18d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
[<ffffffffa07bd6a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
[<ffffffff81096a36>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810969a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task mdt01_017:15697 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdt01_017 D 0000000000000007 0 15697 2 0x00000080
ffff880704ebb9b0 0000000000000046 ffff880704ebb950 ffffffffa04ebe75
0000000100000000 ffffc90016eb7030 0000000000000246 0000000000000246
ffff8806b85a9098 ffff880704ebbfd8 000000000000fb88 ffff8806b85a9098
Call Trace:
[<ffffffffa04ebe75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
[<ffffffffa0628274>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
[<ffffffffa062888b>] lu_object_find_at+0xab/0x360 [obdclass]
[<ffffffffa07ac026>] ? lustre_msg_string+0x96/0x290 [ptlrpc]
[<ffffffff81063410>] ? default_wake_function+0x0/0x20
[<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[<ffffffffa0628b56>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffa0db5af6>] mdt_object_find+0x56/0x170 [mdt]
[<ffffffffa0dc8d34>] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
[<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[<ffffffffa07d2f06>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
[<ffffffffa07ae214>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
[<ffffffffa0dca169>] mdt_intent_getattr+0x299/0x480 [mdt]
[<ffffffffa0db88ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
[<ffffffffa0764461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
[<ffffffffa078d17f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
[<ffffffffa0db8d96>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa0dbfa8a>] mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa0df9c55>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa07bce25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa04e827f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa07b44c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
[<ffffffffa07be18d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
[<ffffffffa07bd6a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
[<ffffffff81096a36>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810969a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Lustre: 32077:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply req@ffff88069ee86000 x1448904795054872/t0(0) o101->5dccd683-21ab-f196-a776-5b7d390fc289@192.168.2.115@o2ib:0/0 lens 576/1080 e 5 to 0 dl 1381786402 ref 2 fl Interpret:/0/0 rc 0/0
INFO: task mdt01_017:15697 blocked for more than 120 seconds. …
There's nothing interesting in the OST logs.
dmesg from c08 (agent) and c10:
Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
LustreError: 11-0: scratch-MDT0000-mdc-ffff880633685000: Communicating with 192.168.2.103@o2ib, operation ldlm_enqueue failed with -71.
LustreError: 2355:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -71
LustreError: 2355:0:(file.c:3069:ll_inode_revalidate_fini()) scratch: revalidate FID [0x20000428c:0x70f:0x0] error: rc = -71
On the node with the Robinhood DB (c09), /var/log/messages has:
Oct 14 14:19:43 c09-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
Oct 14 14:20:08 c09-ib kernel: LustreError: 7042:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
Oct 14 17:44:31 c09-ib kernel: LustreError: 7361:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1213448 of inode ffff8807b8398b78 failed -28
Oct 14 17:44:31 c09-ib kernel: LustreError: 7361:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1213448 of inode ffff8807b8398b78 failed -28
/var/log/messages on c14 contains:
Oct 14 14:19:43 c14-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
Oct 14 14:24:03 c14-ib kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff880815016400: Communicating with 192.168.2.103@o2ib, operation ldlm_enqueue failed with -71.
Oct 14 14:24:03 c14-ib kernel: LustreError: Skipped 1 previous similar message
Oct 14 14:24:03 c14-ib kernel: LustreError: 21327:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -71
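The negative codes in these LustreError lines are standard Linux errno values returned to the clients. A quick sketch to decode the ones appearing in this ticket (-2, -16, -28, -71), using python3 purely as a lookup helper:

```shell
# Decode the return codes seen in the logs into errno names/messages.
# On Linux: -2 = ENOENT, -16 = EBUSY, -28 = ENOSPC, -71 = EPROTO.
decoded=$(for rc in -2 -16 -28 -71; do
    python3 -c "import errno, os, sys
e = -int(sys.argv[1])
print(sys.argv[1], errno.errorcode[e], os.strerror(e))" "$rc"
done)
echo "$decoded"
```

So the clients' ldlm_enqueue calls were failing with EPROTO while the MDT threads were stuck, the mds_connect -16 on c15 is EBUSY from the reconnect attempt, and the later -28 writes on c09 are ordinary ENOSPC, likely unrelated to the hang.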
Attachments
Issue Links
is duplicated by:
LU-4216 lockup in mdt_intent_layout -> lu_object_find_at (Resolved)