  Lustre / LU-4106

racer test hang


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.1
    • Severity: 3
    • 11038

    Description

      Started running acceptance-small at 12:36:21 on the cluster. The run hung during racer at approximately 14:35 and was still hung over 14 hours later.

      The nodes were set up to use HSM, but the copytool was not started on the agent (c08). The MGS/MDS (c03) did have HSM enabled, since this was part of the testing for the HSM test plan.
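
      For reference, a minimal sketch of the two HSM pieces involved in this configuration; the fsname (scratch), MDT name, and mount point are taken from this report, while the archive index and archive root are illustrative placeholders:

```shell
# On the MGS/MDS (c03): HSM was enabled, i.e. the coordinator was started
# with something like (MDT name taken from this report):
lctl set_param mdt.scratch-MDT0000.hsm_control=enabled

# On the agent (c08): the step that was NOT done in this run, starting a
# copytool. The POSIX copytool shipped with Lustre would be started roughly
# as follows (archive index and archive root are illustrative):
lhsmtool_posix --daemon --archive=1 --hsm-root=/tmp/arc /lustre/scratch
```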

      I am uploading logs now and will post a link to the test results when they are available. Since I had to kill the job, I'm not sure what information the logs will include. I looked at dmesg and /var/log/messages before killing the job, and the following is what I saw on each of the nodes.

      On the node running the tests (c15), the console has the normal racer output but does not post "PASS" for racer:

      racer cleanup
      racer cleanup
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      there should be NO racer processes:
      root     26464  0.0  0.0 103244   888 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     26466  0.0  0.0 103244   892 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      there should be NO racer processes:
      Waited 5, rc=2 sleeping 10 sec ...
      there should be NO racer processes:
      there should be NO racer processes:
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      there should be NO racer processes:
      root      8360  0.0  0.0 103244   892 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root      8362  0.0  0.0 103244   892 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      there should be NO racer processes:
      root      4662  0.0  0.0 103244   872 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root      4664  0.0  0.0 103244   872 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     12206  0.0  0.0 103248   904 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     12208  0.0  0.0 103244   888 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Ac15: file_create.sh: no process killed
      c15: file_create.sh: no process killed
      c15: dir_create.sh: no process killed
      c15: dir_create.sh: no process killed
      c15: file_rm.sh: no process killed
      c15: file_rm.sh: no process killed
      c15: file_rename.sh: no process killed
      c15: file_rename.sh: no process killed
      c15: file_link.sh: no process killed
      c15: file_link.sh: no process killed
      c15: file_symlink.sh: no process killed
      c15: file_symlink.sh: no process killed
      c15: file_list.sh: no process killed
      c15: file_list.sh: no process killed
      c15: file_concat.sh: no process killed
      c15: file_concat.sh: no process killed
      c15: file_exec.sh: no process killed
      c15: file_exec.sh: no process killed
      c14: file_create.sh: no process killed
      c14: file_create.sh: no process killed
      c14: dir_create.sh: no process killed
      c14: dir_create.sh: no process killed
      c14: file_rm.sh: no process killed
      c14: file_rm.sh: no process killed
      c14: file_rename.sh: no process killed
      c14: file_rename.sh: no process killed
      c14: file_link.sh: no process killed
      c14: file_link.sh: no process killed
      c14: file_symlink.sh: no process killed
      c14: file_symlink.sh: no process killed
      c14: file_list.sh: no process killed
      c14: file_list.sh: no process killed
      c14: file_concat.sh: no process killed
      c14: file_concat.sh: no process killed
      c14: file_exec.sh: no process killed
      c14: file_exec.sh: no process killed
      vailable Use% Mounted on
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      root     17918  0.0  0.0 103244   884 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     17920  0.0  0.0 103244   880 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     15851  0.0  0.0 103244   888 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      there should be NO racer processes:
      root     12012  0.0  0.0 103244   880 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6357220  17375724  27% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      there should be NO racer processes:
      root      2565  0.0  0.0 103244   876 ?        S    14:24   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch     25088052   6904116  16843572  30% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      pid=30960 rc=0
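
      A note on the output above: the processes listed under "there should be NO racer processes:" are the checking greps themselves. The check scans `ps` output for the racer script names, and the checking grep's own command line contains exactly those names, so it matches itself. A minimal reproduction (the `ps` lines are simulated; this is not the actual racer.sh source):

```shell
# Pattern copied from the log output above.
pattern='file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec'

# Simulated `ps aux` line for the checking grep itself:
ps_line="root 26464 0.0 0.0 103244 888 ? S 14:24 0:00 grep -E $pattern"

# Naive check: the grep matches its own command line (a false positive).
echo "$ps_line" | grep -E "$pattern"

# Bracket trick: the regex [f]ile_create still matches the text file_create,
# but the checking grep's command line now contains the literal string
# "[f]ile_create", which that regex does not match, so only real racer
# processes would be reported.
bracketed='[f]ile_create|[d]ir_create|[f]ile_rm'
ps_line2="root 26464 0.0 0.0 103244 888 ? S 14:24 0:00 grep -E $bracketed"
echo "$ps_line2" | grep -E "$bracketed" || echo "no leftover racer processes"
```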
      

      From /var/log/messages on c15:

      Oct 14 14:19:37 c15-ib kernel: Lustre: DEBUG MARKER: -----============= acceptance-small: racer ============----- Mon Oct 14 14:19:36 PDT 2013
      Oct 14 14:19:37 c15-ib kernel: Lustre: DEBUG MARKER: excepting tests:
      Oct 14 14:19:37 c15-ib kernel: Lustre: Layout lock feature supported.
      Oct 14 14:19:38 c15-ib kernel: Lustre: Mounted scratch-client
      Oct 14 14:19:41 c15-ib kernel: Lustre: DEBUG MARKER: Using TIMEOUT=100
      Oct 14 14:19:43 c15-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
      Oct 14 14:35:13 c15-ib kernel: Lustre: 4077:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1381785590/real 1381785590]  req@ffff8806fa903400 x1448904795054872/t0(0) o101->scratch-MDT0000-mdc-ffff8802dd377400@192.168.2.103@o2ib:12/10 lens 576/1080 e 5 to 1 dl 1381786513 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      Oct 14 14:35:13 c15-ib kernel: Lustre: scratch-MDT0000-mdc-ffff8802dd377400: Connection to scratch-MDT0000 (at 192.168.2.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Oct 14 14:35:13 c15-ib kernel: Lustre: Skipped 2 previous similar messages
      Oct 14 14:35:14 c15-ib kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff8802dd377400: Communicating with 192.168.2.103@o2ib, operation mds_connect failed with -16.
      

      dmesg on the MDS (c03) shows:

      Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428f:0x102e:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004289:0x19a7:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428e:0x1fbd:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004284:0x264c:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004284:0x2b6f:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428e:0x35b5:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200004282:0x45a8:0x0], rc = 0 [1]
      LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x20000428b:0x5305:0x0], rc = 0 [1]
      INFO: task mdt01_017:15697 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdt01_017     D 0000000000000007     0 15697      2 0x00000080
       ffff880704ebb9b0 0000000000000046 ffff880704ebb950 ffffffffa04ebe75
       0000000100000000 ffffc90016eb7030 0000000000000246 0000000000000246
       ffff8806b85a9098 ffff880704ebbfd8 000000000000fb88 ffff8806b85a9098
      Call Trace:
       [<ffffffffa04ebe75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
       [<ffffffffa0628274>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
       [<ffffffffa062888b>] lu_object_find_at+0xab/0x360 [obdclass]
       [<ffffffffa07ac026>] ? lustre_msg_string+0x96/0x290 [ptlrpc]
       [<ffffffff81063410>] ? default_wake_function+0x0/0x20
       [<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       [<ffffffffa0628b56>] lu_object_find+0x16/0x20 [obdclass]
       [<ffffffffa0db5af6>] mdt_object_find+0x56/0x170 [mdt]
       [<ffffffffa0dc8d34>] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
       [<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       [<ffffffffa07d2f06>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
       [<ffffffffa07ae214>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
       [<ffffffffa0dca169>] mdt_intent_getattr+0x299/0x480 [mdt]
       [<ffffffffa0db88ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
       [<ffffffffa0764461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
       [<ffffffffa078d17f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
       [<ffffffffa0db8d96>] mdt_enqueue+0x46/0xe0 [mdt]
       [<ffffffffa0dbfa8a>] mdt_handle_common+0x52a/0x1470 [mdt]
       [<ffffffffa0df9c55>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa07bce25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
       [<ffffffffa04e827f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa07b44c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
       [<ffffffffa07be18d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
       [<ffffffffa07bd6a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      INFO: task mdt01_017:15697 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdt01_017     D 0000000000000007     0 15697      2 0x00000080
       ffff880704ebb9b0 0000000000000046 ffff880704ebb950 ffffffffa04ebe75
       0000000100000000 ffffc90016eb7030 0000000000000246 0000000000000246
       ffff8806b85a9098 ffff880704ebbfd8 000000000000fb88 ffff8806b85a9098
      Call Trace:
       [<ffffffffa04ebe75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
       [<ffffffffa0628274>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
       [<ffffffffa062888b>] lu_object_find_at+0xab/0x360 [obdclass]
       [<ffffffffa07ac026>] ? lustre_msg_string+0x96/0x290 [ptlrpc]
       [<ffffffff81063410>] ? default_wake_function+0x0/0x20
       [<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       [<ffffffffa0628b56>] lu_object_find+0x16/0x20 [obdclass]
       [<ffffffffa0db5af6>] mdt_object_find+0x56/0x170 [mdt]
       [<ffffffffa0dc8d34>] mdt_getattr_name_lock+0x7f4/0x1990 [mdt]
       [<ffffffffa07abf85>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       [<ffffffffa07d2f06>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
       [<ffffffffa07ae214>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
       [<ffffffffa0dca169>] mdt_intent_getattr+0x299/0x480 [mdt]
       [<ffffffffa0db88ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
       [<ffffffffa0764461>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
       [<ffffffffa078d17f>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
       [<ffffffffa0db8d96>] mdt_enqueue+0x46/0xe0 [mdt]
       [<ffffffffa0dbfa8a>] mdt_handle_common+0x52a/0x1470 [mdt]
       [<ffffffffa0df9c55>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa07bce25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
       [<ffffffffa04e827f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa07b44c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
       [<ffffffffa07be18d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
       [<ffffffffa07bd6a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Lustre: 32077:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply  req@ffff88069ee86000 x1448904795054872/t0(0) o101->5dccd683-21ab-f196-a776-5b7d390fc289@192.168.2.115@o2ib:0/0 lens 576/1080 e 5 to 0 dl 1381786402 ref 2 fl Interpret:/0/0 rc 0/0
      INFO: task mdt01_017:15697 blocked for more than 120 seconds.
      …
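
      The repeating "INFO: task mdt01_017:15697 blocked for more than 120 seconds" reports come from the kernel hung-task watchdog, which periodically logs any task stuck in uninterruptible sleep (D state) longer than the configured threshold. When debugging a hang like this, dumping all blocked tasks at once can show whether other mdt threads are stuck in the same place. A sketch (run as root on the MDS; assumes magic SysRq is enabled):

```shell
# Current hung-task report threshold (120 s in this run):
cat /proc/sys/kernel/hung_task_timeout_secs

# Dump stack traces of every D-state task to the kernel log via magic SysRq:
echo w > /proc/sysrq-trigger
dmesg | tail -n 100

# As the message itself notes, the repeating reports can be silenced
# while debugging with:
#   echo 0 > /proc/sys/kernel/hung_task_timeout_secs
```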
      

      There's nothing interesting in the OST logs.

      dmesg from c08 (agent) and c10:

      Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
      LustreError: 11-0: scratch-MDT0000-mdc-ffff880633685000: Communicating with 192.168.2.103@o2ib, operation ldlm_enqueue failed with -71.
      LustreError: 2355:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -71
      LustreError: 2355:0:(file.c:3069:ll_inode_revalidate_fini()) scratch: revalidate FID [0x20000428c:0x70f:0x0] error: rc = -71
      

      On the node with the Robinhood DB (c09), /var/log/messages has:

      Oct 14 14:19:43 c09-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
      Oct 14 14:20:08 c09-ib kernel: LustreError: 7042:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
      Oct 14 17:44:31 c09-ib kernel: LustreError: 7361:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1213448 of inode ffff8807b8398b78 failed -28
      Oct 14 17:44:31 c09-ib kernel: LustreError: 7361:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1213448 of inode ffff8807b8398b78 failed -28
      

      /var/log/messages on c14 contains:

      Oct 14 14:19:43 c14-ib kernel: Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10,c11,c12,c13,c14,c15 DURATION=300 == 14:19:42 (1381785582)
      Oct 14 14:24:03 c14-ib kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff880815016400: Communicating with 192.168.2.103@o2ib, operation ldlm_enqueue failed with -71.
      Oct 14 14:24:03 c14-ib kernel: LustreError: Skipped 1 previous similar message
      Oct 14 14:24:03 c14-ib kernel: LustreError: 21327:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -71
      

            People

              Assignee: bobijam (Zhenyu Xu)
              Reporter: jamesanunez (James Nunez) (Inactive)
              Votes: 0
              Watchers: 10
