[LU-1772] Test failure on test suite racer, subtest test_1 Created: 20/Aug/12 Updated: 02/Sep/12 Resolved: 02/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server and client: lustre-master-tag2.2.93 RHEL6 |
||
| Severity: | 3 |
| Rank (Obsolete): | 4303 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/9d777c2e-e9a0-11e1-881a-52540035b04c. The sub-test test_1 failed with the following error:
There are some hanging threads on client side 14:14:58:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-26vm5,client-26vm6.lab.whamcloud.com DURATION=900 == 14:14:56 (1345238096) 14:14:59:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u 14:14:59:Lustre: DEBUG MARKER: DURATION=900 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 14:15:08:Lustre: DEBUG MARKER: DURATION=900 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 14:18:31:LustreError: 11-0: an error occurred while communicating with 10.10.4.151@tcp. The ost_write operation failed with -2 14:18:31:LustreError: 11-0: an error occurred while communicating with 10.10.4.151@tcp. The ost_write operation failed with -2 14:18:32:LustreError: 9716:0:(vvp_io.c:1039:vvp_io_commit_write()) Write page 34217 of inode ffff880078a36178 failed -2 14:18:32:Lustre: lustre-MDT0000-mdc-ffff880074b62000: Connection to lustre-MDT0000 (at 10.10.4.150@tcp) was lost; in progress operations using this service will wait for recovery to complete 14:18:32:LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail. 14:18:32:LustreError: 9716:0:(file.c:155:ll_close_inode_openhandle()) inode 144115205272502283 mdc close failed: rc = -5 14:18:32:LustreError: 23544:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace lustre-MDT0000-mdc-ffff880074b62000 resource refcount nonzero (1) after lock cleanup; forcing cleanup. 14:18:34:LustreError: 23544:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff88007a1c2480 (8589935617/1747/0/0) (rc: 1) 14:18:34:Lustre: lustre-MDT0000-mdc-ffff880074b62000: Connection restored to lustre-MDT0000 (at 10.10.4.150@tcp) 14:18:36:LustreError: 11-0: an error occurred while communicating with 10.10.4.151@tcp. The obd_ping operation failed with -107 14:18:36:LustreError: Skipped 1 previous similar message 14:18:36:Lustre: lustre-OST0001-osc-ffff880074b62000: Connection to lustre-OST0001 (at 10.10.4.151@tcp) was lost; in progress operations using this service will wait for recovery to complete 14:18:36:LustreError: 167-0: This client was evicted by lustre-OST0001; in progress operations using this service will fail. 14:19:14:INFO: task ldlm_bl_02:8789 blocked for more than 120 seconds. 14:19:14:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 14:19:14:ldlm_bl_02 D 0000000000000000 0 8789 2 0x00000080 14:19:15: ffff8800763fdd50 0000000000000046 0000000100000000 0000000000000000 14:19:15: ffff8800763fdce0 ffffffffa04fb262 ffff8800763fdd40 0000000000000001 14:19:15: ffff8800377e4638 ffff8800763fdfd8 000000000000fb88 ffff8800377e4638 14:19:15:Call Trace: 14:19:15: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs] 14:19:15: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs] 14:19:15: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 14:19:15: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 14:19:15: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] 14:19:15: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc] 14:19:15: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 14:19:15: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] 14:19:15: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] 14:19:17: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 14:19:17: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:17: [<ffffffff8100c14a>] child_rip+0xa/0x20 14:19:17: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:17: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:17: [<ffffffff8100c140>] ? child_rip+0x0/0x20 14:19:17:INFO: task ldlm_bl_03:8790 blocked for more than 120 seconds. 14:19:17:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 14:19:17:ldlm_bl_03 D 0000000000000000 0 8790 2 0x00000080 14:19:17: ffff880052c8dd50 0000000000000046 0000000100000000 0000000000000000 14:19:18: ffff880052c8dce0 ffffffffa04fb262 ffff880052c8dd40 0000000000000001 14:19:18: ffff88007a4545f8 ffff880052c8dfd8 000000000000fb88 ffff88007a4545f8 14:19:18:Call Trace: 14:19:18: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs] 14:19:18: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs] 14:19:18: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 14:19:19: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 14:19:19: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] 14:19:19: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc] 14:19:19: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 14:19:19: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] 14:19:19: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] 14:19:19: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 14:19:19: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:19: [<ffffffff8100c14a>] child_rip+0xa/0x20 14:19:19: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:19: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:19: [<ffffffff8100c140>] ? child_rip+0x0/0x20 14:19:19:INFO: task ldlm_bl_05:9660 blocked for more than 120 seconds. 14:19:19:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 14:19:19:ldlm_bl_05 D 0000000000000000 0 9660 2 0x00000080 14:19:19: ffff88007adbbd50 0000000000000046 0000000100000000 0000000000000000 14:19:20: ffff88007adbbce0 ffffffffa04fb262 ffff88007adbbd40 0000000000000001 14:19:20: ffff880075433098 ffff88007adbbfd8 000000000000fb88 ffff880075433098 14:19:20:Call Trace: 14:19:21: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs] 14:19:21: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs] 14:19:21: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 14:19:21: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 14:19:21: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] 14:19:21: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc] 14:19:21: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 14:19:22: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] 14:19:22: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] 14:19:22: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 14:19:22: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:22: [<ffffffff8100c14a>] child_rip+0xa/0x20 14:19:22: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:22: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:22: [<ffffffff8100c140>] ? child_rip+0x0/0x20 14:19:22:INFO: task ldlm_bl_08:9663 blocked for more than 120 seconds. 14:19:22:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 14:19:23:ldlm_bl_08 D 0000000000000000 0 9663 2 0x00000080 14:19:23: ffff88007ade1d50 0000000000000046 0000000100000000 0000000000000000 14:19:24: ffff88007ade1ce0 ffffffffa04fb262 ffff88007ade1d40 0000000000000001 14:19:24: ffff88007a1ce5f8 ffff88007ade1fd8 000000000000fb88 ffff88007a1ce5f8 14:19:24:Call Trace: 14:19:24: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs] 14:19:24: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs] 14:19:24: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 14:19:24: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 14:19:24: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] 14:19:24: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc] 14:19:24: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 14:19:24: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] 14:19:24: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] 14:19:25: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 14:19:25: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:25: [<ffffffff8100c14a>] child_rip+0xa/0x20 14:19:27: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:27: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:27: [<ffffffff8100c140>] ? child_rip+0x0/0x20 14:19:27:INFO: task ldlm_bl_12:9667 blocked for more than 120 seconds. 14:19:27:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 14:19:27:ldlm_bl_12 D 0000000000000000 0 9667 2 0x00000080 14:19:27: ffff880053011ad8 0000000000000046 000000000027a41e ffff880078a56030 14:19:27: ffff880078a57000 ffff880053011a90 ffffffff81039678 ffff8800ffffffff 14:19:27: ffff88005300fab8 ffff880053011fd8 000000000000fb88 ffff88005300fab8 14:19:28:Call Trace: 14:19:28: [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0 14:19:28: [<ffffffff814ffec5>] rwsem_down_failed_common+0x95/0x1d0 14:19:28: [<ffffffff81500056>] rwsem_down_read_failed+0x26/0x30 14:19:28: [<ffffffff8127e664>] call_rwsem_down_read_failed+0x14/0x30 14:19:28: [<ffffffff814ff554>] ? down_read+0x24/0x30 14:19:28: [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov] 14:19:28: [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov] 14:19:28: [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass] 14:19:29: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs] 14:19:29: [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass] 14:19:29: [<ffffffffa0653fa4>] cl_lock_discard_pages+0x64/0x1f0 [obdclass] 14:19:30: [<ffffffffa09004e0>] osc_lock_flush+0x110/0x200 [osc] 14:19:30: [<ffffffffa0900629>] osc_lock_cancel+0x59/0x1a0 [osc] 14:19:30: [<ffffffffa0651dc5>] cl_lock_cancel0+0x75/0x160 [obdclass] 14:19:30: [<ffffffffa0652a2b>] cl_lock_cancel+0x13b/0x140 [obdclass] 14:19:30: [<ffffffffa090230a>] osc_ldlm_blocking_ast+0x13a/0x380 [osc] 14:19:30: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] 14:19:31: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] 14:19:31: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 14:19:31: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:31: [<ffffffff8100c14a>] child_rip+0xa/0x20 14:19:31: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:31: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:31: [<ffffffff8100c140>] ? child_rip+0x0/0x20 14:19:31:INFO: task ldlm_bl_13:9668 blocked for more than 120 seconds. 14:19:32:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 14:19:32:ldlm_bl_13 D 0000000000000000 0 9668 2 0x00000080 14:19:32: ffff880053015d50 0000000000000046 0000000100000000 0000000000000000 14:19:32: ffff880053015ce0 ffffffffa04fb262 ffff880053015d40 0000000000000001 14:19:32: ffff88005300f058 ffff880053015fd8 000000000000fb88 ffff88005300f058 14:19:32:Call Trace: 14:19:33: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs] 14:19:33: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs] 14:19:33: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 14:19:33: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 14:19:33: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] 14:19:34: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc] 14:19:34: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 14:19:34: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] 14:19:34: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] 14:19:34: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 14:19:34: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:35: [<ffffffff8100c14a>] child_rip+0xa/0x20 14:19:35: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:35: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:35: [<ffffffff8100c140>] ? child_rip+0x0/0x20 14:19:35:INFO: task ldlm_bl_15:9676 blocked for more than 120 seconds. 14:19:35:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 14:19:35:ldlm_bl_15 D 0000000000000000 0 9676 2 0x00000080 14:19:36: ffff880053155d50 0000000000000046 0000000100000000 0000000000000000 14:19:36: ffff880053155ce0 ffffffffa04fb262 ffff880053155d40 0000000000000001 14:19:36: ffff880053057af8 ffff880053155fd8 000000000000fb88 ffff880053057af8 14:19:36:Call Trace: 14:19:36: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs] 14:19:36: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs] 14:19:36: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 14:19:36: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 14:19:36: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] 14:19:36: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc] 14:19:37: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 14:19:37: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] 14:19:37: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] 14:19:37: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 14:19:37: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:37: [<ffffffff8100c14a>] child_rip+0xa/0x20 14:19:38: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:38: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:38: [<ffffffff8100c140>] ? child_rip+0x0/0x20 14:19:39:INFO: task ldlm_bl_18:10487 blocked for more than 120 seconds. 14:19:39:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 14:19:39:ldlm_bl_18 D 0000000000000000 0 10487 2 0x00000080 14:19:39: ffff880052cffd50 0000000000000046 0000000100000000 0000000000000000 14:19:39: ffff880052cffce0 ffffffffa04fb262 ffff880052cffd40 0000000000000001 14:19:39: ffff8800532c1098 ffff880052cfffd8 000000000000fb88 ffff8800532c1098 14:19:39:Call Trace: 14:19:39: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs] 14:19:40: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs] 14:19:40: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 14:19:40: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 14:19:41: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] 14:19:41: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc] 14:19:41: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 14:19:41: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] 14:19:41: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] 14:19:41: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 14:19:41: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:41: [<ffffffff8100c14a>] child_rip+0xa/0x20 14:19:42: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:42: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:42: [<ffffffff8100c140>] ? child_rip+0x0/0x20 14:19:42:INFO: task ldlm_bl_23:10492 blocked for more than 120 seconds. 14:19:42:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 14:19:42:ldlm_bl_23 D 0000000000000000 0 10492 2 0x00000080 14:19:43: ffff880053d7bd50 0000000000000046 0000000100000000 0000000000000000 14:19:43: ffff880053d7bce0 ffffffffa04fb262 ffff880053d7bd40 0000000000000001 14:19:43: ffff880053d75098 ffff880053d7bfd8 000000000000fb88 ffff880053d75098 14:19:43:Call Trace: 14:19:43: [<ffffffffa04fb262>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs] 14:19:43: [<ffffffffa04fb0c4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs] 14:19:43: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 14:19:43: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 14:19:44: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] 14:19:44: [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc] 14:19:44: [<ffffffffa04f6521>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 14:19:44: [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] 14:19:45: [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] 14:19:45: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 14:19:45: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:45: [<ffffffff8100c14a>] child_rip+0xa/0x20 14:19:45: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:45: [<ffffffffa077fe90>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc] 14:19:47: [<ffffffff8100c140>] ? child_rip+0x0/0x20 14:19:47:INFO: task ls:21953 blocked for more than 120 seconds. 14:19:47:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 14:19:47:ls D 0000000000000000 0 21953 9441 0x00000080 14:19:47: ffff88004288bae8 0000000000000082 000000000000009e 0020000000000080 14:19:47: 502eb4be00000000 00000000000ee1f8 000055c100000000 00000a2d00000000 14:19:47: ffff88007920faf8 ffff88004288bfd8 000000000000fb88 ffff88007920faf8 14:19:48:Call Trace: 14:19:48: [<ffffffffa04f5e63>] ? libcfs_debug_vmsg2+0x4e3/0xb60 [libcfs] 14:19:48: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 14:19:48: [<ffffffff81193788>] ? __d_lookup+0xd8/0x150 14:19:48: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 14:19:48: [<ffffffff8118969b>] do_lookup+0x11b/0x230 14:19:48: [<ffffffff81189abd>] __link_path_walk+0x20d/0x1030 14:19:48: [<ffffffff81097e2f>] ? up+0x2f/0x50 14:19:48: [<ffffffffa0a55dbc>] ? ll_follow_link+0x1ec/0x260 [lustre] 14:19:48: [<ffffffff8118a647>] __link_path_walk+0xd97/0x1030 14:19:48: [<ffffffff8118ab6a>] path_walk+0x6a/0xe0 14:19:48: [<ffffffff8118ad3b>] do_path_lookup+0x5b/0xa0 14:19:50: [<ffffffff8117c780>] ? get_empty_filp+0xa0/0x180 14:19:50: [<ffffffff8118bc6b>] do_filp_open+0xfb/0xd60 14:19:50: [<ffffffff8119a460>] ? mntput_no_expire+0x30/0x110 14:19:50: [<ffffffff811982b2>] ? alloc_fd+0x92/0x160 14:19:50: [<ffffffff81178769>] do_sys_open+0x69/0x140 14:19:51: [<ffffffff81178880>] sys_open+0x20/0x30 14:19:51: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b |
| Comments |
| Comment by Peter Jones [ 20/Aug/12 ] |
|
Lai Could you please look into this one? Thanks Peter |
| Comment by Lai Siyao [ 21/Aug/12 ] |
|
Log shows that both MDT and OST timeout on lock cancel, and finally it cause client eviction. But the client log is not complete, I can't see why the lock cancel timeouts there. I'll read more of the logs and try to reproduce it. |
| Comment by Lai Siyao [ 23/Aug/12 ] |
|
Below backtraces show that there may be a deadlock, but I haven't figured out yet, I'll read more CLIO code. ldlm_bl_12 D 0000000000000000 0 9667 2 0x00000080 ffff880053011ad8 0000000000000046 000000000027a41e ffff880078a56030 ffff880078a57000 ffff880053011a90 ffffffff81039678 ffff8800ffffffff ffff88005300fab8 ffff880053011fd8 000000000000fb88 ffff88005300fab8 Call Trace: [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov] [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov] [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass] [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass] [<ffffffffa0653fa4>] cl_lock_discard_pages+0x64/0x1f0 [obdclass] [<ffffffffa09004e0>] osc_lock_flush+0x110/0x200 [osc] [<ffffffffa0900629>] osc_lock_cancel+0x59/0x1a0 [osc] [<ffffffffa0651dc5>] cl_lock_cancel0+0x75/0x160 [obdclass] [<ffffffffa0652a2b>] cl_lock_cancel+0x13b/0x140 [obdclass] [<ffffffffa090230a>] osc_ldlm_blocking_ast+0x13a/0x380 [osc] [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] [<ffffffff8100c14a>] child_rip+0xa/0x20 mv D 0000000000000000 0 23117 1 0x00000080 ffff88004078d928 0000000000000086 00000000ffffffff 000000000078d908 ffff880000000030 ffff88004078d9a8 ffff88004078d938 ffffffff81039678 ffff880078895058 ffff88004078dfd8 000000000000fb88 ffff880078895058 Call Trace: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] [<ffffffffa0657b76>] cl_lock_hold_mutex+0x96/0x6c0 [obdclass] [<ffffffffa06581d3>] cl_lock_hold+0x33/0x100 [obdclass] [<ffffffffa09abc9d>] lov_sublock_alloc+0x11d/0x470 [lov] [<ffffffffa09acc04>] lov_lock_init_raid0+0x3e4/0xed0 [lov] [<ffffffffa09a5338>] lov_lock_init+0x68/0xe0 [lov] [<ffffffffa0657e7c>] cl_lock_hold_mutex+0x39c/0x6c0 [obdclass] [<ffffffffa0658302>] cl_lock_request+0x62/0x280 [obdclass] [<ffffffffa0a6a6bb>] cl_glimpse_lock+0x17b/0x4a0 [lustre] [<ffffffffa0a6af47>] cl_glimpse_size0+0x187/0x190 [lustre] [<ffffffffa0a27f62>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre] [<ffffffffa0a28079>] ll_getattr_it+0x49/0x170 [lustre] [<ffffffffa0a281d7>] ll_getattr+0x37/0x40 [lustre] [<ffffffff81180571>] vfs_getattr+0x51/0x80 [<ffffffff81180600>] vfs_fstatat+0x60/0x80 [<ffffffff8118068e>] vfs_lstat+0x1e/0x20 [<ffffffff811806b4>] sys_newlstat+0x24/0x50 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b mkdir D 0000000000000000 0 23111 1 0x00000080 ffff8800407c7788 0000000000000086 0000000000000099 0001000000010000 502eb4bf00000000 000000000008aac5 00005a4700000000 000004ef00000000 ffff8800786e0638 ffff8800407c7fd8 000000000000fb88 ffff8800786e0638 Call Trace: [<ffffffffa0649ff8>] cl_conf_set+0x58/0x100 [obdclass] [<ffffffffa0a6b6bc>] cl_file_inode_init+0xfc/0x360 [lustre] [<ffffffffa0a38e92>] ll_update_inode+0x112/0xe60 [lustre] [<ffffffffa0a50e73>] ll_iget+0x1e3/0x2a0 [lustre] [<ffffffffa0a3a0c1>] ll_prep_inode+0x4e1/0xbf0 [lustre] [<ffffffffa0a4ef77>] ll_lookup_it_finish+0x107/0x9d0 [lustre] [<ffffffffa0a4fc43>] ll_lookup_it+0x403/0xbc0 [lustre] [<ffffffffa0a5146c>] ll_lookup_nd+0x8c/0x400 [lustre] [<ffffffff81189725>] do_lookup+0x1a5/0x230 [<ffffffff81189fe4>] __link_path_walk+0x734/0x1030 [<ffffffff8118ab6a>] path_walk+0x6a/0xe0 [<ffffffff8118ad3b>] do_path_lookup+0x5b/0xa0 [<ffffffff8118b9a7>] user_path_at+0x57/0xa0 [<ffffffff811794e7>] sys_chdir+0x27/0x90 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b cp D 0000000000000000 0 23166 1 0x00000080 ffff8800407f7b80 0000000000000082 ffffffffa0810740 ffff88007a1c2480 0000000000000000 ffff8800407f7c18 ffffffffa0767876 1d7145a988afca52 ffff8800774f45f8 ffff8800407f7fd8 000000000000fb88 ffff8800774f45f8 Call Trace: [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov] [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov] [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass] [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass] [<ffffffffa0a6ae3d>] cl_glimpse_size0+0x7d/0x190 [lustre] [<ffffffffa0a27f62>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre] [<ffffffffa0a28079>] ll_getattr_it+0x49/0x170 [lustre] [<ffffffffa0a281d7>] ll_getattr+0x37/0x40 [lustre] [<ffffffff81180571>] vfs_getattr+0x51/0x80 [<ffffffff8118082f>] vfs_fstat+0x3f/0x60 [<ffffffff81180874>] sys_newfstat+0x24/0x40 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b ll_imp_inval D 0000000000000000 0 23859 2 0x00000080 ffff8800534ebad0 0000000000000046 0000000100000000 0000000000000000 ffff8800534eba60 ffffffffa04fb262 ffff8800534ebac0 0000000000000001 ffff88005337d058 ffff8800534ebfd8 000000000000fb88 ffff88005337d058 Call Trace: [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] [<ffffffffa090224a>] osc_ldlm_blocking_ast+0x7a/0x380 [osc] [<ffffffffa076a480>] ldlm_cancel_callback+0x60/0x100 [ptlrpc] [<ffffffffa0778c1b>] ldlm_cli_cancel_local+0x7b/0x380 [ptlrpc] [<ffffffffa077cae8>] ldlm_cli_cancel+0x58/0x3a0 [ptlrpc] [<ffffffffa0772958>] cleanup_resource+0x168/0x300 [ptlrpc] [<ffffffffa0772b1a>] ldlm_resource_clean+0x2a/0x50 [ptlrpc] [<ffffffffa04fc97f>] cfs_hash_for_each_relax+0x17f/0x380 [libcfs] [<ffffffffa04fe3bf>] cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs] [<ffffffffa076f629>] ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc] [<ffffffffa08f5f11>] osc_import_event+0x5a1/0x19b0 [osc] [<ffffffffa07c1b69>] ptlrpc_invalidate_import+0x2a9/0x8d0 [ptlrpc] [<ffffffffa07c23ff>] ptlrpc_invalidate_import_thread+0x4f/0x2f0 [ptlrpc] [<ffffffff8100c14a>] child_rip+0xa/0x20 |
| Comment by Lai Siyao [ 24/Aug/12 ] |
|
There looks to be deadlock between below processes: ldlm_bl_12 D 0000000000000000 0 9667 2 0x00000080 ffff880053011ad8 0000000000000046 000000000027a41e ffff880078a56030 ffff880078a57000 ffff880053011a90 ffffffff81039678 ffff8800ffffffff ffff88005300fab8 ffff880053011fd8 000000000000fb88 ffff88005300fab8 Call Trace: [<ffffffff814ffec5>] rwsem_down_failed_common+0x95/0x1d0 [<ffffffff81500056>] rwsem_down_read_failed+0x26/0x30 [<ffffffff8127e664>] call_rwsem_down_read_failed+0x14/0x30 [<ffffffffa09a4f04>] lov_lsm_addref+0x34/0x150 [lov] [<ffffffffa09a56e5>] lov_io_init+0x75/0x1c0 [lov] [<ffffffffa0659378>] cl_io_init0+0x98/0x160 [obdclass] [<ffffffffa065c264>] cl_io_init+0x64/0x100 [obdclass] [<ffffffffa0653fa4>] cl_lock_discard_pages+0x64/0x1f0 [obdclass] [<ffffffffa09004e0>] osc_lock_flush+0x110/0x200 [osc] [<ffffffffa0900629>] osc_lock_cancel+0x59/0x1a0 [osc] [<ffffffffa0651dc5>] cl_lock_cancel0+0x75/0x160 [obdclass] [<ffffffffa0652a2b>] cl_lock_cancel+0x13b/0x140 [obdclass] [<ffffffffa090230a>] osc_ldlm_blocking_ast+0x13a/0x380 [osc] [<ffffffffa077fcd3>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc] [<ffffffffa0780111>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc] [<ffffffff8100c14a>] child_rip+0xa/0x20 mv D 0000000000000000 0 23117 1 0x00000080 ffff88004078d928 0000000000000086 00000000ffffffff 000000000078d908 ffff880000000030 ffff88004078d9a8 ffff88004078d938 ffffffff81039678 ffff880078895058 ffff88004078dfd8 000000000000fb88 ffff880078895058 Call Trace: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180 [<ffffffff814fee5b>] mutex_lock+0x2b/0x50 [<ffffffffa06545b7>] cl_lock_mutex_get+0x77/0xe0 [obdclass] [<ffffffffa0657b76>] cl_lock_hold_mutex+0x96/0x6c0 [obdclass] [<ffffffffa06581d3>] cl_lock_hold+0x33/0x100 [obdclass] [<ffffffffa09abc9d>] lov_sublock_alloc+0x11d/0x470 [lov] [<ffffffffa09acc04>] lov_lock_init_raid0+0x3e4/0xed0 [lov] [<ffffffffa09a5338>] lov_lock_init+0x68/0xe0 [lov] [<ffffffffa0657e7c>] cl_lock_hold_mutex+0x39c/0x6c0 [obdclass] [<ffffffffa0658302>] cl_lock_request+0x62/0x280 [obdclass] [<ffffffffa0a6a6bb>] cl_glimpse_lock+0x17b/0x4a0 [lustre] [<ffffffffa0a6af47>] cl_glimpse_size0+0x187/0x190 [lustre] [<ffffffffa0a27f62>] ll_inode_revalidate_it+0xf2/0x1c0 [lustre] [<ffffffffa0a28079>] ll_getattr_it+0x49/0x170 [lustre] [<ffffffffa0a281d7>] ll_getattr+0x37/0x40 [lustre] [<ffffffff81180571>] vfs_getattr+0x51/0x80 [<ffffffff81180600>] vfs_fstatat+0x60/0x80 [<ffffffff8118068e>] vfs_lstat+0x1e/0x20 [<ffffffff811806b4>] sys_newlstat+0x24/0x50 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b Process ldlm_bl_12: Process mv: Due to the complication of CLIO lock, Jinshan, could you give some advice on the fix? |
| Comment by Jodi Levi (Inactive) [ 24/Aug/12 ] |
|
Lai has not been able to reproduce this yet, but believes it is a CLIO bug. He will talk to Jinshan to try to determine fix. |
| Comment by Jinshan Xiong (Inactive) [ 27/Aug/12 ] |
|
I think the problem is due to lov_layout_wait() in lov_conf_set(). It sleeps while holding lo_type_guard. The proper fix would be to not hold type guard while sleeping. |
| Comment by Lai Siyao [ 28/Aug/12 ] |
|
The deadlock is as below: Then the three processes deadlocked. We'll try to give a fix tomorrow. |
| Comment by Lai Siyao [ 28/Aug/12 ] |
|
review is on http://review.whamcloud.com/#change,3807 |
| Comment by Peter Jones [ 02/Sep/12 ] |
|
Landed for 2.3 and 2.4 |