Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.1
    • Severity: 3
    • 11038

    Description

      Started running acceptance-small at 12:36:21 on the cluster; the test hung during racer at approximately 14:35 and was still hung over 14 hours later.

      The nodes were set up to use HSM, but the copytool was not started on the agent (c08). The MGS/MDS (c03) did have HSM enabled, since this was part of the testing for the HSM test plan.

      I am uploading logs now and will post a link to the test results when they are available. Since I had to kill the job, I'm not sure what information will be included in the logs. I looked at dmesg and /var/log/messages before killing the job, and the following is what I see on each of the nodes.

      On the node running the tests (c15), the console has the normal racer output but does not post "PASS" for racer:

      racer cleanup
      racer cleanup
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      there should be NO racer processes:
      root     26464  0.0  0.0 103244   888 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     26466  0.0  0.0 103244   892 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      there should be NO racer processes:
      Waited 5, rc=2 sleeping 10 sec ...
      there should be NO racer processes:
      there should be NO racer processes:
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      there should be NO racer processes:
      root      8360  0.0  0.0 103244   892 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root      8362  0.0  0.0 103244   892 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      there should be NO racer processes:
      root      4662  0.0  0.0 103244   872 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root      4664  0.0  0.0 103244   872 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     12206  0.0  0.0 103248   904 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     12208  0.0  0.0 103244   888 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Ac15: file_create.sh: no process killed
      c15: file_create.sh: no process killed
      c15: dir_create.sh: no process killed
      c15: dir_create.sh: no process killed
      c15: file_rm.sh: no process killed
      c15: file_rm.sh: no process killed
      c15: file_rename.sh: no process killed
      c15: file_rename.sh: no process killed
      c15: file_link.sh: no process killed
      c15: file_link.sh: no process killed
      c15: file_symlink.sh: no process killed
      c15: file_symlink.sh: no process killed
      c15: file_list.sh: no process killed
      c15: file_list.sh: no process killed
      c15: file_concat.sh: no process killed
      c15: file_concat.sh: no process killed
      c15: file_exec.sh: no process killed
      c15: file_exec.sh: no process killed
      c14: file_create.sh: no process killed
      c14: file_create.sh: no process killed
      c14: dir_create.sh: no process killed
      c14: dir_create.sh: no process killed
      c14: file_rm.sh: no process killed
      c14: file_rm.sh: no process killed
      c14: file_rename.sh: no process killed
      c14: file_rename.sh: no process killed
      c14: file_link.sh: no process killed
      c14: file_link.sh: no process killed
      c14: file_symlink.sh: no process killed
      c14: file_symlink.sh: no process killed
      c14: file_list.sh: no process killed
      c14: file_list.sh: no process killed
      c14: file_concat.sh: no process killed
      c14: file_concat.sh: no process killed
      c14: file_exec.sh: no process killed
      c14: file_exec.sh: no process killed
      vailable Use% Mounted on
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      root     17918  0.0  0.0 103244   884 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     17920  0.0  0.0 103244   880 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     15851  0.0  0.0 103244   888 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      there should be NO racer processes:
      root     12012  0.0  0.0 103244   880 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      there should be NO racer processes:
      root      2565  0.0  0.0 103244   876 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6904116  16843572  30% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      pid=30960 rc=0
      

      From /var/log/messages on c15:

      Oct 14 14:19:37 c15-ib kernel: Lustre: DEBUG MARKER: -----============= acceptance-small: racer ============----- Mon Oct 14 14:19:36 PDT 2013
      Oct 14 14:19:37 c15-ib kernel: Lustre: DEBUG MARKER: excepting tests:
      Oct 14 14:19:37 c15-ib kernel: Lustre: Layout lock feature supported.
      Oct 14 14:19:38 c15-ib kernel: Lustre: Mounted scratch-client
      Oct 14 14:19:41 c15-ib kernel: Lustre: DEBUG MARKER: Using TIMEOUT=100
      Oct 14 14:19:43 c15-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
      Oct 14 14:35:13 c15-ib kernel: Lustre: 4077:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1381785590/real 1381785590]  req@ffff8806fa903400 x1448904795054872/t0(0) o101->scratch-MDT0000-mdc-ffff8802dd377400@192.168.2.103@o2ib:12/10 lens 576/1080 e 5 to 1 dl 1381786513 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      Oct 14 14:35:13 c15-ib kernel: Lustre: scratch-MDT0000-mdc-ffff8802dd377400: Connection to scratch-MDT0000 (at 192.168.2.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Oct 14 14:35:13 c15-ib kernel: Lustre: Skipped 2 previous similar messages
      Oct 14 14:35:14 c15-ib kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff8802dd377400: Communicating with 192.168.2.103@o2ib, operation mds_connect failed with -16.
      

      dmesg on the MDS (c03) shows:

      Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428f:0x102e:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004289:0x19a7:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428e:0x1fbd:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004284:0x264c:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004284:0x2b6f:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428e:0x35b5:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004282:0x45a8:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428b:0x5305:0x0], rc = 0 [1]
      INFO: task mdt01_017:15697 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdt01_017     D 0000000000000007     0 15697      2 0x00000080
       ffff880704ebb9b0 0000000000000046 ffff880704ebb950 ffffffffa04ebe75
       0000000100000000 ffffc90016eb7030 0000000000000246 0000000000000246
       ffff8806b85a9098 ffff880704ebbfd8 000000000000fb88 ffff8806b85a9098
      Call Trace:
       [<ffffffffa04ebe75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
       [<ffffffffa0628274>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
       [<ffffffffa062888b>] lu_object_find_at+0xab/0x360 [obdclass]
       [<ffffffffa07ac026>] ? lustre_msg_string+0x96/0x290 [ptlrpc]
       [<ffffffff81063410>] ? default_wake_function+0x0/0x20
       [<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       [<ffffffffa0628b56>] lu_object_find+0x16/0x20 [obdclass]
       [<ffffffffa0db5af6>] mdt_object_find+0x56/0x170 [mdt]
       [<ffffffffa0dc8d34>] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
       [<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       [<ffffffffa07d2f06>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
       [<ffffffffa07ae214>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
       [<ffffffffa0dca169>] mdt_intent_getattr+0x299/0x480 [mdt]
       [<ffffffffa0db88ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
       [<ffffffffa0764461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
       [<ffffffffa078d17f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
       [<ffffffffa0db8d96>] mdt_enqueue+0x46/0xe0 [mdt]
       [<ffffffffa0dbfa8a>] mdt_handle_common+0x52a/0x1470 [mdt]
       [<ffffffffa0df9c55>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa07bce25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
       [<ffffffffa04e827f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa07b44c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
       [<ffffffffa07be18d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
       [<ffffffffa07bd6a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      INFO: task mdt01_017:15697 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdt01_017     D 0000000000000007     0 15697      2 0x00000080
       ffff880704ebb9b0 0000000000000046 ffff880704ebb950 ffffffffa04ebe75
       0000000100000000 ffffc90016eb7030 0000000000000246 0000000000000246
       ffff8806b85a9098 ffff880704ebbfd8 000000000000fb88 ffff8806b85a9098
      Call Trace:
       [<ffffffffa04ebe75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
       [<ffffffffa0628274>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
       [<ffffffffa062888b>] lu_object_find_at+0xab/0x360 [obdclass]
       [<ffffffffa07ac026>] ? lustre_msg_string+0x96/0x290 [ptlrpc]
       [<ffffffff81063410>] ? default_wake_function+0x0/0x20
       [<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       [<ffffffffa0628b56>] lu_object_find+0x16/0x20 [obdclass]
       [<ffffffffa0db5af6>] mdt_object_find+0x56/0x170 [mdt]
       [<ffffffffa0dc8d34>] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
       [<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       [<ffffffffa07d2f06>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
       [<ffffffffa07ae214>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
       [<ffffffffa0dca169>] mdt_intent_getattr+0x299/0x480 [mdt]
       [<ffffffffa0db88ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
       [<ffffffffa0764461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
       [<ffffffffa078d17f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
       [<ffffffffa0db8d96>] mdt_enqueue+0x46/0xe0 [mdt]
       [<ffffffffa0dbfa8a>] mdt_handle_common+0x52a/0x1470 [mdt]
       [<ffffffffa0df9c55>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa07bce25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
       [<ffffffffa04e827f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa07b44c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
       [<ffffffffa07be18d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
       [<ffffffffa07bd6a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Lustre: 32077:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply  req@ffff88069ee86000 x1448904795054872/t0(0) o101->5dccd683-21ab-f196-a776-5b7d390fc289@192.168.2.115@o2ib:0/0 lens 576/1080 e 5 to 0 dl 1381786402 ref 2 fl Interpret:/0/0 rc 0/0
      INFO: task mdt01_017:15697 blocked for more than 120 seconds.
      …
      

      There's nothing interesting in the OST logs.

      dmesg from c08 (agent) and c10:

      Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
      LustreError: 11-0: scratch-MDT0000-mdc-ffff880633685000: Communicating with 192.168.2.103@o2ib, operation ldlm_enqueue failed with -71.
      LustreError: 2355:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -71
      LustreError: 2355:0:(file.c:3069:ll_inode_revalidate_fini()) scratch: revalidate FID [0x20000428c:0x70f:0x0] error: rc = -71
      

      On the node with the Robinhood DB (c09), /var/log/messages has:

      Oct 14 14:19:43 c09-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
      Oct 14 14:20:08 c09-ib kernel: LustreError: 7042:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
      Oct 14 17:44:31 c09-ib kernel: LustreError: 7361:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1213448 of inode ffff8807b8398b78 failed -28
      Oct 14 17:44:31 c09-ib kernel: LustreError: 7361:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1213448 of inode ffff8807b8398b78 failed -28
      

      /var/log/messages on c14 contains:

      Oct 14 14:19:43 c14-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
      Oct 14 14:24:03 c14-ib kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff880815016400: Communicating with 192.168.2.103@o2ib, operation ldlm_enqueue failed with -71.
      Oct 14 14:24:03 c14-ib kernel: LustreError: Skipped 1 previous similar message
      Oct 14 14:24:03 c14-ib kernel: LustreError: 21327:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -71
      

    Attachments

    Issue Links

    Activity

            [LU-4106] racer test hang
            yujian Jian Yu added a comment -

            Lustre Build: https://build.whamcloud.com/job/lustre-b2_5/43
            The same failure occurred: https://maloo.whamcloud.com/test_sets/d9f82e82-bea1-11e3-a50c-52540035b04c

            jlevi Jodi Levi (Inactive) added a comment -

            Patches have landed to Master. Please reopen this ticket if more work is needed.
            jhammond John Hammond added a comment -

            Bobijam's lu_object_find() patch has landed to master (for 2.6.0). Fan Yong's OI scrub patch is still in flight.


            yong.fan nasf (Inactive) added a comment -

            The patch for the OI scrub fix:

            http://review.whamcloud.com/#/c/8002/

            bobijam Zhenyu Xu added a comment -

            http://review.whamcloud.com/7990

            Commit message:
            LU-4106 mdt: avoid recursive lu_object_find an object
            
            LU-3240 (commit 762f2114d282a98ebfa4dbbeea9298a8088ad24e) sets the
            parent dir fid to the same value as the child fid in the
            getattr-by-fid case; we should not lu_object_find() the object
            recursively, as that can lead to a hang if a concurrent unlink
            destroys the object.
            

            yong.fan nasf (Inactive) added a comment -

            The message "LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428f:0x102e:0x0], rc = 0 [1]" looks abnormal; I will make a patch to detect and fix it.

            pjones Peter Jones added a comment -

            Bobijam

            Could you please look into this one as your top priority?

            Thanks

            Peter


            jhammond John Hammond added a comment -

            Yes. The change to __ll_inode_revalidate_it() in 7910 needs a corresponding server change. Perhaps we should revert that part for interoperability's sake.

            green Oleg Drokin added a comment -

            parent == child should only happen for getattr by fid I think, and we just landed the patch about it for 3240 (http://review.whamcloud.com/7910)

            I discussed with wangdi previously that in reality getattr by fid should probably be a separate function from getattr by name.

            jhammond John Hammond added a comment -

            In the GETATTR handler, the two FIDs passed in the request may be identical (see __ll_inode_revalidate_it()). In this case, calling mdt_object_find() twice on the same FID (once for fid1 and again in mdt_getattr_name_lock() for fid2) is dangerous, as a concurrent unlink may kill the object, causing the second find to hang. To confirm that this is the issue, I added an assertion to mdt_getattr_name_lock() just before we find the child object.

            diff --git a/lustre/mdt/mdt_handler.c b/lustre/mdt/mdt_handler.c
            index 757bae3..4fef04e 100644
            --- a/lustre/mdt/mdt_handler.c
            +++ b/lustre/mdt/mdt_handler.c
            @@ -1344,6 +1344,9 @@ static int mdt_getattr_name_lock(struct mdt_thread_info *info,
                             mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_POS);
                     }
             
            +       if (lu_fid_eq(mdt_object_fid(parent), child_fid))
            +               LASSERT(!lu_object_is_dying(&parent->mot_header));
            +
                     /*
                      *step 3: find the child object by fid & lock it.
                      *        regardless if it is local or remote.
                      */
            	child = mdt_object_find(info->mti_env, info->mti_mdt, child_fid);
            

            Then running the above reproducer triggers the assertion rather than hanging in lu_object_find_at():

            LustreError: 8891:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
            LustreError: 9101:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
            LustreError: 9371:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
            LustreError: 9371:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 1 previous similar message
            LustreError: 9913:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
            LustreError: 9913:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 9 previous similar messages
            LustreError: 11018:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
            LustreError: 11018:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 25 previous similar messages
            LustreError: 13088:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
            LustreError: 13088:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 43 previous similar messages
            LustreError: 17048:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
            LustreError: 17048:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 57 previous similar messages
            LustreError: 25203:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
            LustreError: 25203:0:(mdc_locks.c:915:mdc_enqueue()) Skipped 146 previous similar messages
            LustreError: 10648:0:(mdt_handler.c:1348:mdt_getattr_name_lock()) ASSERTION( !lu_object_is_dying(&parent->mot_header) ) failed: 
            LustreError: 10648:0:(mdt_handler.c:1348:mdt_getattr_name_lock()) LBUG
            Pid: 10648, comm: mdt00_004
            
            Call Trace:
             [<ffffffffa0d95895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
             [<ffffffffa0d95e97>] lbug_with_loc+0x47/0xb0 [libcfs]
             [<ffffffffa0705ec2>] mdt_getattr_name_lock+0x1982/0x19e0 [mdt]
             [<ffffffffa1086f06>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
             [<ffffffffa07061b9>] mdt_intent_getattr+0x299/0x480 [mdt]
             [<ffffffffa06f48ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
             [<ffffffffa1018461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
             [<ffffffffa104117f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
             [<ffffffffa06f4d96>] mdt_enqueue+0x46/0xe0 [mdt]
             [<ffffffffa06fba8a>] mdt_handle_common+0x52a/0x1470 [mdt]
             [<ffffffffa0735ca5>] mds_regular_handle+0x15/0x20 [mdt]
             [<ffffffffa1070e25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
             [<ffffffffa0da727f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
             [<ffffffffa10684c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
             [<ffffffffa107218d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
             [<ffffffffa10716a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
             [<ffffffff81096a36>] kthread+0x96/0xa0
             [<ffffffff8100c0ca>] child_rip+0xa/0x20
             [<ffffffff810969a0>] ? kthread+0x0/0xa0
             [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            

            Perhaps mdt_getattr_name_lock() should detect the same FID case and handle this separately.
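
            For illustration only, a minimal sketch of what such same-FID handling might look like at step 3 of mdt_getattr_name_lock(). This is not the landed fix (see http://review.whamcloud.com/7990); helper names such as mdt_object_get() are assumed from the surrounding mdt code.

             /* Sketch, not the actual patch: special-case the getattr-by-FID
              * request where the "parent" and "child" FIDs are identical,
              * instead of doing a second lu_object_find() on the same FID,
              * which can block once a concurrent unlink has started
              * destroying the object. */
             if (lu_fid_eq(mdt_object_fid(parent), child_fid)) {
                     /* Reuse the parent object the handler already holds;
                      * take an extra reference so the common cleanup path
                      * can drop both as usual. (mdt_object_get() is assumed
                      * here.) */
                     mdt_object_get(info->mti_env, parent);
                     child = parent;
             } else {
                     /* Normal getattr-by-name path, as in the current code;
                      * error handling elided. */
                     child = mdt_object_find(info->mti_env, info->mti_mdt,
                                             child_fid);
             }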


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 10
