Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1772

Test failure on test suite racer, subtest test_1

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.3.0, Lustre 2.4.0
    • Lustre 2.3.0
    • None
    • server and client: lustre-master-tag2.2.93 RHEL6
    • 3
    • 4303

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/9d777c2e-e9a0-11e1-881a-52540035b04c.

      The sub-test test_1 failed with the following error:

      test failed to respond and timed out

      There are some hanging threads on client side

      14:14:58:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-26vm5,client-26vm6.lab.whamcloud.com DURATION=900 == 14:14:56 (1345238096)
      14:14:59:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      14:14:59:Lustre: DEBUG MARKER: DURATION=900 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 
      14:15:08:Lustre: DEBUG MARKER: DURATION=900 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 
      14:18:31:LustreError: 11-0: an error occurred while communicating with 10.10.4.151@tcp. The ost_write operation failed with -2
      14:18:31:LustreError: 11-0: an error occurred while communicating with 10.10.4.151@tcp. The ost_write operation failed with -2
      14:18:32:LustreError: 9716:0:(vvp_io.c:1039:vvp_io_commit_write()) Write page 34217 of inode ffff880078a36178 failed -2
      14:18:32:Lustre: lustre-MDT0000-mdc-ffff880074b62000: Connection to lustre-MDT0000 (at 10.10.4.150@tcp) was lost; in progress operations using this service will wait for recovery to complete
      14:18:32:LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      14:18:32:LustreError: 9716:0:(file.c:155:ll_close_inode_openhandle()) inode 144115205272502283 mdc close failed: rc = -5
      14:18:32:LustreError: 23544:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace lustre-MDT0000-mdc-ffff880074b62000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      14:18:34:LustreError: 23544:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff88007a1c2480 (8589935617/1747/0/0) (rc: 1)
      14:18:34:Lustre: lustre-MDT0000-mdc-ffff880074b62000: Connection restored to lustre-MDT0000 (at 10.10.4.150@tcp)
      14:18:36:LustreError: 11-0: an error occurred while communicating with 10.10.4.151@tcp. The obd_ping operation failed with -107
      14:18:36:LustreError: Skipped 1 previous similar message
      14:18:36:Lustre: lustre-OST0001-osc-ffff880074b62000: Connection to lustre-OST0001 (at 10.10.4.151@tcp) was lost; in progress operations using this service will wait for recovery to complete
      14:18:36:LustreError: 167-0: This client was evicted by lustre-OST0001; in progress operations using this service will fail.
      14:19:14:INFO: task ldlm_bl_02:8789 blocked for more than 120 seconds.
      14:19:14:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      14:19:14:ldlm_bl_02    D 0000000000000000     0  8789      2 0x00000080
      14:19:15: ffff8800763fdd50 0000000000000046 0000000100000000 0000000000000000
      14:19:15: ffff8800763fdce0 ffffffffa04fb262 ffff8800763fdd40 0000000000000001
      14:19:15: ffff8800377e4638 ffff8800763fdfd8 000000000000fb88 ffff8800377e4638
      14:19:15:Call Trace:
      14:19:15: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      14:19:15: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      14:19:15: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
      14:19:15: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
      14:19:15: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
      14:19:15: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc]
      14:19:15: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      14:19:15: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
      14:19:15: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      14:19:17: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      14:19:17: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:17: [<ffffffff8100c14a>] child_rip+0xa/0x20
      14:19:17: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:17: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:17: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      14:19:17:INFO: task ldlm_bl_03:8790 blocked for more than 120 seconds.
      14:19:17:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      14:19:17:ldlm_bl_03    D 0000000000000000     0  8790      2 0x00000080
      14:19:17: ffff880052c8dd50 0000000000000046 0000000100000000 0000000000000000
      14:19:18: ffff880052c8dce0 ffffffffa04fb262 ffff880052c8dd40 0000000000000001
      14:19:18: ffff88007a4545f8 ffff880052c8dfd8 000000000000fb88 ffff88007a4545f8
      14:19:18:Call Trace:
      14:19:18: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      14:19:18: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      14:19:18: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
      14:19:19: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
      14:19:19: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
      14:19:19: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc]
      14:19:19: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      14:19:19: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
      14:19:19: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      14:19:19: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      14:19:19: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:19: [<ffffffff8100c14a>] child_rip+0xa/0x20
      14:19:19: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:19: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:19: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      14:19:19:INFO: task ldlm_bl_05:9660 blocked for more than 120 seconds.
      14:19:19:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      14:19:19:ldlm_bl_05    D 0000000000000000     0  9660      2 0x00000080
      14:19:19: ffff88007adbbd50 0000000000000046 0000000100000000 0000000000000000
      14:19:20: ffff88007adbbce0 ffffffffa04fb262 ffff88007adbbd40 0000000000000001
      14:19:20: ffff880075433098 ffff88007adbbfd8 000000000000fb88 ffff880075433098
      14:19:20:Call Trace:
      14:19:21: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      14:19:21: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      14:19:21: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
      14:19:21: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
      14:19:21: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
      14:19:21: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc]
      14:19:21: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      14:19:22: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
      14:19:22: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      14:19:22: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      14:19:22: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:22: [<ffffffff8100c14a>] child_rip+0xa/0x20
      14:19:22: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:22: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:22: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      14:19:22:INFO: task ldlm_bl_08:9663 blocked for more than 120 seconds.
      14:19:22:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      14:19:23:ldlm_bl_08    D 0000000000000000     0  9663      2 0x00000080
      14:19:23: ffff88007ade1d50 0000000000000046 0000000100000000 0000000000000000
      14:19:24: ffff88007ade1ce0 ffffffffa04fb262 ffff88007ade1d40 0000000000000001
      14:19:24: ffff88007a1ce5f8 ffff88007ade1fd8 000000000000fb88 ffff88007a1ce5f8
      14:19:24:Call Trace:
      14:19:24: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      14:19:24: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      14:19:24: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
      14:19:24: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
      14:19:24: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
      14:19:24: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc]
      14:19:24: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      14:19:24: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
      14:19:24: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      14:19:25: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      14:19:25: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:25: [<ffffffff8100c14a>] child_rip+0xa/0x20
      14:19:27: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:27: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:27: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      14:19:27:INFO: task ldlm_bl_12:9667 blocked for more than 120 seconds.
      14:19:27:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      14:19:27:ldlm_bl_12    D 0000000000000000     0  9667      2 0x00000080
      14:19:27: ffff880053011ad8 0000000000000046 000000000027a41e ffff880078a56030
      14:19:27: ffff880078a57000 ffff880053011a90 ffffffff81039678 ffff8800ffffffff
      14:19:27: ffff88005300fab8 ffff880053011fd8 000000000000fb88 ffff88005300fab8
      14:19:28:Call Trace:
      14:19:28: [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
      14:19:28: [<ffffffff814ffec5>] rwsem_down_failed_common+0x95/0x1d0
      14:19:28: [<ffffffff81500056>] rwsem_down_read_failed+0x26/0x30
      14:19:28: [<ffffffff8127e664>] call_rwsem_down_read_failed+0x14/0x30
      14:19:28: [<ffffffff814ff554>] ? down_read+0x24/0x30
      14:19:28: [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov]
      14:19:28: [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov]
      14:19:28: [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass]
      14:19:29: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      14:19:29: [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass]
      14:19:29: [<ffffffffa0653fa4>] cl_lock_discard_pages+0x64/0x1f0 [obdclass]
      14:19:30: [<ffffffffa09004e0>] osc_lock_flush+0x110/0x200 [osc]
      14:19:30: [<ffffffffa0900629>] osc_lock_cancel+0x59/0x1a0 [osc]
      14:19:30: [<ffffffffa0651dc5>] cl_lock_cancel0+0x75/0x160 [obdclass]
      14:19:30: [<ffffffffa0652a2b>] cl_lock_cancel+0x13b/0x140 [obdclass]
      14:19:30: [<ffffffffa090230a>] osc_ldlm_blocking_ast+0x13a/0x380 [osc]
      14:19:30: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
      14:19:31: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      14:19:31: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      14:19:31: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:31: [<ffffffff8100c14a>] child_rip+0xa/0x20
      14:19:31: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:31: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:31: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      14:19:31:INFO: task ldlm_bl_13:9668 blocked for more than 120 seconds.
      14:19:32:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      14:19:32:ldlm_bl_13    D 0000000000000000     0  9668      2 0x00000080
      14:19:32: ffff880053015d50 0000000000000046 0000000100000000 0000000000000000
      14:19:32: ffff880053015ce0 ffffffffa04fb262 ffff880053015d40 0000000000000001
      14:19:32: ffff88005300f058 ffff880053015fd8 000000000000fb88 ffff88005300f058
      14:19:32:Call Trace:
      14:19:33: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      14:19:33: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      14:19:33: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
      14:19:33: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
      14:19:33: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
      14:19:34: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc]
      14:19:34: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      14:19:34: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
      14:19:34: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      14:19:34: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      14:19:34: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:35: [<ffffffff8100c14a>] child_rip+0xa/0x20
      14:19:35: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:35: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:35: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      14:19:35:INFO: task ldlm_bl_15:9676 blocked for more than 120 seconds.
      14:19:35:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      14:19:35:ldlm_bl_15    D 0000000000000000     0  9676      2 0x00000080
      14:19:36: ffff880053155d50 0000000000000046 0000000100000000 0000000000000000
      14:19:36: ffff880053155ce0 ffffffffa04fb262 ffff880053155d40 0000000000000001
      14:19:36: ffff880053057af8 ffff880053155fd8 000000000000fb88 ffff880053057af8
      14:19:36:Call Trace:
      14:19:36: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      14:19:36: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      14:19:36: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
      14:19:36: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
      14:19:36: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
      14:19:36: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc]
      14:19:37: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      14:19:37: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
      14:19:37: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      14:19:37: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      14:19:37: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:37: [<ffffffff8100c14a>] child_rip+0xa/0x20
      14:19:38: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:38: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:38: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      14:19:39:INFO: task ldlm_bl_18:10487 blocked for more than 120 seconds.
      14:19:39:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      14:19:39:ldlm_bl_18    D 0000000000000000     0 10487      2 0x00000080
      14:19:39: ffff880052cffd50 0000000000000046 0000000100000000 0000000000000000
      14:19:39: ffff880052cffce0 ffffffffa04fb262 ffff880052cffd40 0000000000000001
      14:19:39: ffff8800532c1098 ffff880052cfffd8 000000000000fb88 ffff8800532c1098
      14:19:39:Call Trace:
      14:19:39: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      14:19:40: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      14:19:40: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
      14:19:40: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
      14:19:41: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
      14:19:41: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc]
      14:19:41: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      14:19:41: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
      14:19:41: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      14:19:41: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      14:19:41: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:41: [<ffffffff8100c14a>] child_rip+0xa/0x20
      14:19:42: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:42: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:42: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      14:19:42:INFO: task ldlm_bl_23:10492 blocked for more than 120 seconds.
      14:19:42:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      14:19:42:ldlm_bl_23    D 0000000000000000     0 10492      2 0x00000080
      14:19:43: ffff880053d7bd50 0000000000000046 0000000100000000 0000000000000000
      14:19:43: ffff880053d7bce0 ffffffffa04fb262 ffff880053d7bd40 0000000000000001
      14:19:43: ffff880053d75098 ffff880053d7bfd8 000000000000fb88 ffff880053d75098
      14:19:43:Call Trace:
      14:19:43: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      14:19:43: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      14:19:43: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
      14:19:43: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
      14:19:44: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
      14:19:44: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc]
      14:19:44: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      14:19:44: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
      14:19:45: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      14:19:45: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      14:19:45: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:45: [<ffffffff8100c14a>] child_rip+0xa/0x20
      14:19:45: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:45: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      14:19:47: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      14:19:47:INFO: task ls:21953 blocked for more than 120 seconds.
      14:19:47:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      14:19:47:ls            D 0000000000000000     0 21953   9441 0x00000080
      14:19:47: ffff88004288bae8 0000000000000082 000000000000009e 0020000000000080
      14:19:47: 502eb4be00000000 00000000000ee1f8 000055c100000000 00000a2d00000000
      14:19:47: ffff88007920faf8 ffff88004288bfd8 000000000000fb88 ffff88007920faf8
      14:19:48:Call Trace:
      14:19:48: [<ffffffffa04f5e63>] ? libcfs_debug_vmsg2+0x4e3/0xb60 [libcfs]
      14:19:48: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
      14:19:48: [<ffffffff81193788>] ? __d_lookup+0xd8/0x150
      14:19:48: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
      14:19:48: [<ffffffff8118969b>] do_lookup+0x11b/0x230
      14:19:48: [<ffffffff81189abd>] __link_path_walk+0x20d/0x1030
      14:19:48: [<ffffffff81097e2f>] ? up+0x2f/0x50
      14:19:48: [<ffffffffa0a55dbc>] ? ll_follow_link+0x1ec/0x260 [lustre]
      14:19:48: [<ffffffff8118a647>] __link_path_walk+0xd97/0x1030
      14:19:48: [<ffffffff8118ab6a>] path_walk+0x6a/0xe0
      14:19:48: [<ffffffff8118ad3b>] do_path_lookup+0x5b/0xa0
      14:19:50: [<ffffffff8117c780>] ? get_empty_filp+0xa0/0x180
      14:19:50: [<ffffffff8118bc6b>] do_filp_open+0xfb/0xd60
      14:19:50: [<ffffffff8119a460>] ? mntput_no_expire+0x30/0x110
      14:19:50: [<ffffffff811982b2>] ? alloc_fd+0x92/0x160
      14:19:50: [<ffffffff81178769>] do_sys_open+0x69/0x140
      14:19:51: [<ffffffff81178880>] sys_open+0x20/0x30
      14:19:51: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      

      Attachments

        Activity

          [LU-1772] Test failure on test suite racer, subtest test_1
          laisiyao Lai Siyao added a comment -

          The deadlock is as below:
          1. Process mv takes lov->lo_type_guard read semaphore.
          2. Process ldlm_bl_12 takes cl_lock_mutex_lock.
          3. Process mkdir enqueues lov->lo_type_guard write semaphore (step 1 has taken read lock).
          4. Process ldlm_bl_12 enqueues lov->lo_type_guard read semaphore (because rw_semaphore fairness, read semaphore is blocked if a write is already enqueued).
          5. Process mv tries to take cl_lock_mutex_lock.

          Then the three processes deadlocked. We'll try to give a fix tomorrow.

          laisiyao Lai Siyao added a comment - The deadlock is as below: 1. Process mv takes lov->lo_type_guard read semaphore. 2. Process ldlm_bl_12 takes cl_lock_mutex_lock. 3. Process mkdir enqueues lov->lo_type_guard write semaphore (step 1 has taken read lock). 4. Process ldlm_bl_12 enqueues lov->lo_type_guard read semaphore (because rw_semaphore fairness, read semaphore is blocked if a write is already enqueued). 5. Process mv tries to take cl_lock_mutex_lock. Then the three processes deadlocked. We'll try to give a fix tomorrow.

          I think the problem is due to lov_layout_wait() in lov_conf_set(). It sleeps while holding lo_type_guard. The proper fix would be to not hold type guard while sleeping.

          jay Jinshan Xiong (Inactive) added a comment - I think the problem is due to lov_layout_wait() in lov_conf_set(). It sleeps while holding lo_type_guard. The proper fix would be to not hold type guard while sleeping.

          Lai has not been able to reproduce this yet, but believes it is a CLIO bug. He will talk to Jinshan to try to determine fix.

          jlevi Jodi Levi (Inactive) added a comment - Lai has not been able to reproduce this yet, but believes it is a CLIO bug. He will talk to Jinshan to try to determine fix.
          laisiyao Lai Siyao added a comment -

          There looks to be deadlock between below processes:

          ldlm_bl_12    D 0000000000000000     0  9667      2 0x00000080
           ffff880053011ad8 0000000000000046 000000000027a41e ffff880078a56030
           ffff880078a57000 ffff880053011a90 ffffffff81039678 ffff8800ffffffff
           ffff88005300fab8 ffff880053011fd8 000000000000fb88 ffff88005300fab8
          Call Trace:
           [<ffffffff814ffec5>] rwsem_down_failed_common+0x95/0x1d0
           [<ffffffff81500056>] rwsem_down_read_failed+0x26/0x30
           [<ffffffff8127e664>] call_rwsem_down_read_failed+0x14/0x30
           [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov]
           [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov]
           [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass]
           [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass]
           [<ffffffffa0653fa4>] cl_lock_discard_pages+0x64/0x1f0 [obdclass]
           [<ffffffffa09004e0>] osc_lock_flush+0x110/0x200 [osc]
           [<ffffffffa0900629>] osc_lock_cancel+0x59/0x1a0 [osc]
           [<ffffffffa0651dc5>] cl_lock_cancel0+0x75/0x160 [obdclass]
           [<ffffffffa0652a2b>] cl_lock_cancel+0x13b/0x140 [obdclass]
           [<ffffffffa090230a>] osc_ldlm_blocking_ast+0x13a/0x380 [osc]
           [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
           [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
           [<ffffffff8100c14a>] child_rip+0xa/0x20
          mv            D 0000000000000000     0 23117      1 0x00000080
           ffff88004078d928 0000000000000086 00000000ffffffff 000000000078d908
           ffff880000000030 ffff88004078d9a8 ffff88004078d938 ffffffff81039678
           ffff880078895058 ffff88004078dfd8 000000000000fb88 ffff880078895058
          Call Trace:
           [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
           [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
           [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
           [<ffffffffa0657b76>] cl_lock_hold_mutex+0x96/0x6c0 [obdclass]
           [<ffffffffa06581d3>] cl_lock_hold+0x33/0x100 [obdclass]
           [<ffffffffa09abc9d>] lov_sublock_alloc+0x11d/0x470 [lov]
           [<ffffffffa09acc04>] lov_lock_init_raid0+0x3e4/0xed0 [lov]
           [<ffffffffa09a5338>] lov_lock_init+0x68/0xe0 [lov]
           [<ffffffffa0657e7c>] cl_lock_hold_mutex+0x39c/0x6c0 [obdclass]
           [<ffffffffa0658302>] cl_lock_request+0x62/0x280 [obdclass]
           [<ffffffffa0a6a6bb>] cl_glimpse_lock+0x17b/0x4a0 [lustre]
           [<ffffffffa0a6af47>] cl_glimpse_size0+0x187/0x190 [lustre]
           [<ffffffffa0a27f62>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre]
           [<ffffffffa0a28079>] ll_getattr_it+0x49/0x170 [lustre]
           [<ffffffffa0a281d7>] ll_getattr+0x37/0x40 [lustre]
           [<ffffffff81180571>] vfs_getattr+0x51/0x80
           [<ffffffff81180600>] vfs_fstatat+0x60/0x80
           [<ffffffff8118068e>] vfs_lstat+0x1e/0x20
           [<ffffffff811806b4>] sys_newlstat+0x24/0x50
           [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
          

          Process ldlm_bl_12:
          cl_lock_mutex_lock in osc_ldlm_block_ast() > lov>lo_type_guard in lov_lsm_addref().

          Process mv:
          lov->lo_type_guard in lov_lock_init() -> cl_lock_mutex_lock in cl_lock_hold_mutex().

          Due to the complication of CLIO lock, Jinshan, could you give some advice on the fix?

          laisiyao Lai Siyao added a comment - There looks to be deadlock between below processes: ldlm_bl_12 D 0000000000000000 0 9667 2 0x00000080 ffff880053011ad8 0000000000000046 000000000027a41e ffff880078a56030 ffff880078a57000 ffff880053011a90 ffffffff81039678 ffff8800ffffffff ffff88005300fab8 ffff880053011fd8 000000000000fb88 ffff88005300fab8 Call Trace: [<ffffffff814ffec5>] rwsem_down_failed_common+0x95/0x1d0 [<ffffffff81500056>] rwsem_down_read_failed+0x26/0x30 [<ffffffff8127e664>] call_rwsem_down_read_failed+0x14/0x30 [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov] [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov] [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass] [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass] [<ffffffffa0653fa4>] cl_lock_discard_pages+0x64/0x1f0 [obdclass] [<ffffffffa09004e0>] osc_lock_flush+0x110/0x200 [osc] [<ffffffffa0900629>] osc_lock_cancel+0x59/0x1a0 [osc] [<ffffffffa0651dc5>] cl_lock_cancel0+0x75/0x160 [obdclass] [<ffffffffa0652a2b>] cl_lock_cancel+0x13b/0x140 [obdclass] [<ffffffffa090230a>] osc_ldlm_blocking_ast+0x13a/0x380 [osc] [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] [<ffffffff8100c14a>] child_rip+0xa/0x20 mv D 0000000000000000 0 23117 1 0x00000080 ffff88004078d928 0000000000000086 00000000ffffffff 000000000078d908 ffff880000000030 ffff88004078d9a8 ffff88004078d938 ffffffff81039678 ffff880078895058 ffff88004078dfd8 000000000000fb88 ffff880078895058 Call Trace: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] [<ffffffffa0657b76>] cl_lock_hold_mutex+0x96/0x6c0 [obdclass] [<ffffffffa06581d3>] cl_lock_hold+0x33/0x100 [obdclass] [<ffffffffa09abc9d>] lov_sublock_alloc+0x11d/0x470 [lov] [<ffffffffa09acc04>] lov_lock_init_raid0+0x3e4/0xed0 [lov] [<ffffffffa09a5338>] lov_lock_init+0x68/0xe0 [lov] [<ffffffffa0657e7c>] cl_lock_hold_mutex+0x39c/0x6c0 [obdclass] [<ffffffffa0658302>] cl_lock_request+0x62/0x280 [obdclass] [<ffffffffa0a6a6bb>] cl_glimpse_lock+0x17b/0x4a0 [lustre] [<ffffffffa0a6af47>] cl_glimpse_size0+0x187/0x190 [lustre] [<ffffffffa0a27f62>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre] [<ffffffffa0a28079>] ll_getattr_it+0x49/0x170 [lustre] [<ffffffffa0a281d7>] ll_getattr+0x37/0x40 [lustre] [<ffffffff81180571>] vfs_getattr+0x51/0x80 [<ffffffff81180600>] vfs_fstatat+0x60/0x80 [<ffffffff8118068e>] vfs_lstat+0x1e/0x20 [<ffffffff811806b4>] sys_newlstat+0x24/0x50 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b Process ldlm_bl_12: cl_lock_mutex_lock in osc_ldlm_block_ast() > lov >lo_type_guard in lov_lsm_addref(). Process mv: lov->lo_type_guard in lov_lock_init() -> cl_lock_mutex_lock in cl_lock_hold_mutex(). Due to the complication of CLIO lock, Jinshan, could you give some advice on the fix?
          laisiyao Lai Siyao added a comment -

          Below backtraces show that there may be a deadlock, but I haven't figured out yet, I'll read more CLIO code.

          ldlm_bl_12    D 0000000000000000     0  9667      2 0x00000080
           ffff880053011ad8 0000000000000046 000000000027a41e ffff880078a56030
           ffff880078a57000 ffff880053011a90 ffffffff81039678 ffff8800ffffffff
           ffff88005300fab8 ffff880053011fd8 000000000000fb88 ffff88005300fab8
          Call Trace:
           [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov]
           [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov]
           [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass]
           [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass]
           [<ffffffffa0653fa4>] cl_lock_discard_pages+0x64/0x1f0 [obdclass]
           [<ffffffffa09004e0>] osc_lock_flush+0x110/0x200 [osc]
           [<ffffffffa0900629>] osc_lock_cancel+0x59/0x1a0 [osc]
           [<ffffffffa0651dc5>] cl_lock_cancel0+0x75/0x160 [obdclass]
           [<ffffffffa0652a2b>] cl_lock_cancel+0x13b/0x140 [obdclass]
           [<ffffffffa090230a>] osc_ldlm_blocking_ast+0x13a/0x380 [osc]
           [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
           [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
           [<ffffffff8100c14a>] child_rip+0xa/0x20
          mv            D 0000000000000000     0 23117      1 0x00000080
           ffff88004078d928 0000000000000086 00000000ffffffff 000000000078d908
           ffff880000000030 ffff88004078d9a8 ffff88004078d938 ffffffff81039678
           ffff880078895058 ffff88004078dfd8 000000000000fb88 ffff880078895058
          Call Trace:
           [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
           [<ffffffffa0657b76>] cl_lock_hold_mutex+0x96/0x6c0 [obdclass]
           [<ffffffffa06581d3>] cl_lock_hold+0x33/0x100 [obdclass]
           [<ffffffffa09abc9d>] lov_sublock_alloc+0x11d/0x470 [lov]
           [<ffffffffa09acc04>] lov_lock_init_raid0+0x3e4/0xed0 [lov]
           [<ffffffffa09a5338>] lov_lock_init+0x68/0xe0 [lov]
           [<ffffffffa0657e7c>] cl_lock_hold_mutex+0x39c/0x6c0 [obdclass]
           [<ffffffffa0658302>] cl_lock_request+0x62/0x280 [obdclass]
           [<ffffffffa0a6a6bb>] cl_glimpse_lock+0x17b/0x4a0 [lustre]
           [<ffffffffa0a6af47>] cl_glimpse_size0+0x187/0x190 [lustre]
           [<ffffffffa0a27f62>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre]
           [<ffffffffa0a28079>] ll_getattr_it+0x49/0x170 [lustre]
           [<ffffffffa0a281d7>] ll_getattr+0x37/0x40 [lustre]
           [<ffffffff81180571>] vfs_getattr+0x51/0x80
           [<ffffffff81180600>] vfs_fstatat+0x60/0x80
           [<ffffffff8118068e>] vfs_lstat+0x1e/0x20
           [<ffffffff811806b4>] sys_newlstat+0x24/0x50
           [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
          mkdir         D 0000000000000000     0 23111      1 0x00000080
           ffff8800407c7788 0000000000000086 0000000000000099 0001000000010000
           502eb4bf00000000 000000000008aac5 00005a4700000000 000004ef00000000
           ffff8800786e0638 ffff8800407c7fd8 000000000000fb88 ffff8800786e0638
          Call Trace:
           [<ffffffffa0649ff8>] cl_conf_set+0x58/0x100 [obdclass]
           [<ffffffffa0a6b6bc>] cl_file_inode_init+0xfc/0x360 [lustre]
           [<ffffffffa0a38e92>] ll_update_inode+0x112/0xe60 [lustre]
           [<ffffffffa0a50e73>] ll_iget+0x1e3/0x2a0 [lustre]
           [<ffffffffa0a3a0c1>] ll_prep_inode+0x4e1/0xbf0 [lustre]
           [<ffffffffa0a4ef77>] ll_lookup_it_finish+0x107/0x9d0 [lustre]
           [<ffffffffa0a4fc43>] ll_lookup_it+0x403/0xbc0 [lustre]
           [<ffffffffa0a5146c>] ll_lookup_nd+0x8c/0x400 [lustre]
           [<ffffffff81189725>] do_lookup+0x1a5/0x230
           [<ffffffff81189fe4>] __link_path_walk+0x734/0x1030
           [<ffffffff8118ab6a>] path_walk+0x6a/0xe0
           [<ffffffff8118ad3b>] do_path_lookup+0x5b/0xa0
           [<ffffffff8118b9a7>] user_path_at+0x57/0xa0
           [<ffffffff811794e7>] sys_chdir+0x27/0x90
           [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
          cp            D 0000000000000000     0 23166      1 0x00000080
           ffff8800407f7b80 0000000000000082 ffffffffa0810740 ffff88007a1c2480
           0000000000000000 ffff8800407f7c18 ffffffffa0767876 1d7145a988afca52
           ffff8800774f45f8 ffff8800407f7fd8 000000000000fb88 ffff8800774f45f8
          Call Trace:
           [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov]
           [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov]
           [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass]
           [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass]
           [<ffffffffa0a6ae3d>] cl_glimpse_size0+0x7d/0x190 [lustre]
           [<ffffffffa0a27f62>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre]
           [<ffffffffa0a28079>] ll_getattr_it+0x49/0x170 [lustre]
           [<ffffffffa0a281d7>] ll_getattr+0x37/0x40 [lustre]
           [<ffffffff81180571>] vfs_getattr+0x51/0x80
           [<ffffffff8118082f>] vfs_fstat+0x3f/0x60
           [<ffffffff81180874>] sys_newfstat+0x24/0x40
           [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
          
          
          ll_imp_inval  D 0000000000000000     0 23859      2 0x00000080
           ffff8800534ebad0 0000000000000046 0000000100000000 0000000000000000
           ffff8800534eba60 ffffffffa04fb262 ffff8800534ebac0 0000000000000001
           ffff88005337d058 ffff8800534ebfd8 000000000000fb88 ffff88005337d058
          Call Trace:
           [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
           [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc]
           [<ffffffffa076a480>] ldlm_cancel_callback+0x60/0x100 [ptlrpc]
           [<ffffffffa0778c1b>] ldlm_cli_cancel_local+0x7b/0x380 [ptlrpc]
           [<ffffffffa077cae8>] ldlm_cli_cancel+0x58/0x3a0 [ptlrpc]
           [<ffffffffa0772958>] cleanup_resource+0x168/0x300 [ptlrpc]
           [<ffffffffa0772b1a>] ldlm_resource_clean+0x2a/0x50 [ptlrpc]
           [<ffffffffa04fc97f>] cfs_hash_for_each_relax+0x17f/0x380 [libcfs]
           [<ffffffffa04fe3bf>] cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
           [<ffffffffa076f629>] ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc]
           [<ffffffffa08f5f11>] osc_import_event+0x5a1/0x19b0 [osc]
           [<ffffffffa07c1b69>] ptlrpc_invalidate_import+0x2a9/0x8d0 [ptlrpc]
           [<ffffffffa07c23ff>] ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc]
           [<ffffffff8100c14a>] child_rip+0xa/0x20
          
          laisiyao Lai Siyao added a comment - Below backtraces show that there may be a deadlock, but I haven't figured out yet, I'll read more CLIO code. ldlm_bl_12 D 0000000000000000 0 9667 2 0x00000080 ffff880053011ad8 0000000000000046 000000000027a41e ffff880078a56030 ffff880078a57000 ffff880053011a90 ffffffff81039678 ffff8800ffffffff ffff88005300fab8 ffff880053011fd8 000000000000fb88 ffff88005300fab8 Call Trace: [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov] [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov] [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass] [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass] [<ffffffffa0653fa4>] cl_lock_discard_pages+0x64/0x1f0 [obdclass] [<ffffffffa09004e0>] osc_lock_flush+0x110/0x200 [osc] [<ffffffffa0900629>] osc_lock_cancel+0x59/0x1a0 [osc] [<ffffffffa0651dc5>] cl_lock_cancel0+0x75/0x160 [obdclass] [<ffffffffa0652a2b>] cl_lock_cancel+0x13b/0x140 [obdclass] [<ffffffffa090230a>] osc_ldlm_blocking_ast+0x13a/0x380 [osc] [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] [<ffffffff8100c14a>] child_rip+0xa/0x20 mv D 0000000000000000 0 23117 1 0x00000080 ffff88004078d928 0000000000000086 00000000ffffffff 000000000078d908 ffff880000000030 ffff88004078d9a8 ffff88004078d938 ffffffff81039678 ffff880078895058 ffff88004078dfd8 000000000000fb88 ffff880078895058 Call Trace: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] [<ffffffffa0657b76>] cl_lock_hold_mutex+0x96/0x6c0 [obdclass] [<ffffffffa06581d3>] cl_lock_hold+0x33/0x100 [obdclass] [<ffffffffa09abc9d>] lov_sublock_alloc+0x11d/0x470 [lov] [<ffffffffa09acc04>] lov_lock_init_raid0+0x3e4/0xed0 [lov] [<ffffffffa09a5338>] lov_lock_init+0x68/0xe0 [lov] [<ffffffffa0657e7c>] cl_lock_hold_mutex+0x39c/0x6c0 [obdclass] [<ffffffffa0658302>] cl_lock_request+0x62/0x280 [obdclass] [<ffffffffa0a6a6bb>] cl_glimpse_lock+0x17b/0x4a0 [lustre] [<ffffffffa0a6af47>] cl_glimpse_size0+0x187/0x190 [lustre] [<ffffffffa0a27f62>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre] [<ffffffffa0a28079>] ll_getattr_it+0x49/0x170 [lustre] [<ffffffffa0a281d7>] ll_getattr+0x37/0x40 [lustre] [<ffffffff81180571>] vfs_getattr+0x51/0x80 [<ffffffff81180600>] vfs_fstatat+0x60/0x80 [<ffffffff8118068e>] vfs_lstat+0x1e/0x20 [<ffffffff811806b4>] sys_newlstat+0x24/0x50 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b mkdir D 0000000000000000 0 23111 1 0x00000080 ffff8800407c7788 0000000000000086 0000000000000099 0001000000010000 502eb4bf00000000 000000000008aac5 00005a4700000000 000004ef00000000 ffff8800786e0638 ffff8800407c7fd8 000000000000fb88 ffff8800786e0638 Call Trace: [<ffffffffa0649ff8>] cl_conf_set+0x58/0x100 [obdclass] [<ffffffffa0a6b6bc>] cl_file_inode_init+0xfc/0x360 [lustre] [<ffffffffa0a38e92>] ll_update_inode+0x112/0xe60 [lustre] [<ffffffffa0a50e73>] ll_iget+0x1e3/0x2a0 [lustre] [<ffffffffa0a3a0c1>] ll_prep_inode+0x4e1/0xbf0 [lustre] [<ffffffffa0a4ef77>] ll_lookup_it_finish+0x107/0x9d0 [lustre] [<ffffffffa0a4fc43>] ll_lookup_it+0x403/0xbc0 [lustre] [<ffffffffa0a5146c>] ll_lookup_nd+0x8c/0x400 [lustre] [<ffffffff81189725>] do_lookup+0x1a5/0x230 [<ffffffff81189fe4>] __link_path_walk+0x734/0x1030 [<ffffffff8118ab6a>] path_walk+0x6a/0xe0 [<ffffffff8118ad3b>] do_path_lookup+0x5b/0xa0 [<ffffffff8118b9a7>] user_path_at+0x57/0xa0 [<ffffffff811794e7>] sys_chdir+0x27/0x90 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b cp D 0000000000000000 0 23166 1 0x00000080 ffff8800407f7b80 0000000000000082 ffffffffa0810740 ffff88007a1c2480 0000000000000000 ffff8800407f7c18 ffffffffa0767876 1d7145a988afca52 ffff8800774f45f8 ffff8800407f7fd8 000000000000fb88 ffff8800774f45f8 Call Trace: [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov] [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov] [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass] [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass] [<ffffffffa0a6ae3d>] cl_glimpse_size0+0x7d/0x190 [lustre] [<ffffffffa0a27f62>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre] [<ffffffffa0a28079>] ll_getattr_it+0x49/0x170 [lustre] [<ffffffffa0a281d7>] ll_getattr+0x37/0x40 [lustre] [<ffffffff81180571>] vfs_getattr+0x51/0x80 [<ffffffff8118082f>] vfs_fstat+0x3f/0x60 [<ffffffff81180874>] sys_newfstat+0x24/0x40 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b ll_imp_inval D 0000000000000000 0 23859 2 0x00000080 ffff8800534ebad0 0000000000000046 0000000100000000 0000000000000000 ffff8800534eba60 ffffffffa04fb262 ffff8800534ebac0 0000000000000001 ffff88005337d058 ffff8800534ebfd8 000000000000fb88 ffff88005337d058 Call Trace: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc] [<ffffffffa076a480>] ldlm_cancel_callback+0x60/0x100 [ptlrpc] [<ffffffffa0778c1b>] ldlm_cli_cancel_local+0x7b/0x380 [ptlrpc] [<ffffffffa077cae8>] ldlm_cli_cancel+0x58/0x3a0 [ptlrpc] [<ffffffffa0772958>] cleanup_resource+0x168/0x300 [ptlrpc] [<ffffffffa0772b1a>] ldlm_resource_clean+0x2a/0x50 [ptlrpc] [<ffffffffa04fc97f>] cfs_hash_for_each_relax+0x17f/0x380 [libcfs] [<ffffffffa04fe3bf>] cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs] [<ffffffffa076f629>] ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc] [<ffffffffa08f5f11>] osc_import_event+0x5a1/0x19b0 [osc] [<ffffffffa07c1b69>] ptlrpc_invalidate_import+0x2a9/0x8d0 [ptlrpc] [<ffffffffa07c23ff>] ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc] [<ffffffff8100c14a>] child_rip+0xa/0x20
          laisiyao Lai Siyao added a comment -

          Log shows that both MDT and OST timeout on lock cancel, and finally it cause client eviction. But the client log is not complete, I can't see why the lock cancel timeouts there. I'll read more of the logs and try to reproduce it.

          laisiyao Lai Siyao added a comment - Log shows that both MDT and OST timeout on lock cancel, and finally it cause client eviction. But the client log is not complete, I can't see why the lock cancel timeouts there. I'll read more of the logs and try to reproduce it.
          pjones Peter Jones added a comment -

          Lai

          Could you please look into this one?

          Thanks

          Peter

          pjones Peter Jones added a comment - Lai Could you please look into this one? Thanks Peter

          People

            laisiyao Lai Siyao
            maloo Maloo
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: