[LU-7036] sanityn tests_77e: dd hang Created: 24/Aug/15 Updated: 21/Sep/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Emoly Liu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
autotest review-dne-part-1 |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanityn test 77e hangs when writing to the file system with dd. The logs are at https://testing.hpdd.intel.com/test_sets/8e04941a-489e-11e5-95b4-5254006e85c2 and the test fails with 'test failed to respond and timed out' On the client console, we see client time outs and then 01:11:32:Lustre: 17445:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1440205834/real 1440205834] req@ffff88006bf3a9c0 x1510162543485732/t0(0) o400->lustre-OST0000-osc-ffff88007db9d800@10.1.5.222@tcp:28/4 lens 224/224 e 0 to 1 dl 1440205879 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 01:11:32:Lustre: 17445:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 73 previous similar messages 01:11:32:INFO: task dd:13776 blocked for more than 120 seconds. 01:11:32: Not tainted 2.6.32-504.30.3.el6.x86_64 #1 01:11:32:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 01:11:32:dd D 0000000000000001 0 13776 13764 0x00000080 01:11:32: ffff880037babb30 0000000000000086 0000000000000000 845a01438f7a556c 01:11:32: ffffffffa09948e8 ffff880037babab8 ffff880072872070 ffff88006d8894b0 01:11:32: ffff88007b49ddf0 ffff88006ba8d1a8 ffff880071a75ad8 ffff880037babfd8 01:11:32:Call Trace: 01:11:32: [<ffffffffa09948e8>] ? osc_lock_to_lockless+0x148/0x260 [osc] 01:11:32: [<ffffffff8152cc25>] rwsem_down_failed_common+0x95/0x1d0 01:11:32: [<ffffffff8152cdb6>] rwsem_down_read_failed+0x26/0x30 01:11:32: [<ffffffffa0a3212c>] ? lov_lock_enqueue+0xbc/0x170 [lov] 01:11:32: [<ffffffff81298f14>] call_rwsem_down_read_failed+0x14/0x30 01:11:32: [<ffffffff8152c2b4>] ? down_read+0x24/0x30 01:11:32: [<ffffffffa0b0a3d9>] vvp_io_write_start+0x229/0x5c0 [lustre] 01:11:32: [<ffffffffa0538d1b>] ? cl_lock_request+0x7b/0x200 [obdclass] 01:11:32: [<ffffffffa0539e9a>] cl_io_start+0x6a/0x140 [obdclass] 01:11:32: [<ffffffffa053e3e4>] cl_io_loop+0xb4/0x1b0 [obdclass] 01:11:32: [<ffffffffa0ab37a7>] ll_file_io_generic+0x317/0xab0 [lustre] 01:11:32: [<ffffffffa0ab5c8b>] ll_file_aio_write+0x20b/0x860 [lustre] 01:11:32: [<ffffffffa0ab640b>] ll_file_write+0x12b/0x260 [lustre] 01:11:32: [<ffffffff8118e3a8>] vfs_write+0xb8/0x1a0 01:11:32: [<ffffffff8118ed71>] sys_write+0x51/0x90 01:11:32: [<ffffffff810e5c5e>] ? __audit_syscall_exit+0x25e/0x290 01:11:32: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b 01:11:32:INFO: task dd:13836 blocked for more than 120 seconds. 01:11:32: Not tainted 2.6.32-504.30.3.el6.x86_64 #1 01:11:32:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 01:11:32:dd D 0000000000000000 0 13836 13827 0x00000080 01:11:32: ffff88006ff77a38 0000000000000082 ffff88006ff77a18 ffffffff81296bd0 01:11:32: 000000000000000f 0000000affffffff ffffffffffffffff 0000000000000000 01:11:32: 0000000000000001 6c8adcc8a078dcfc ffff88007c9a5ad8 ffff88006ff77fd8 01:11:32:Call Trace: 01:11:32: [<ffffffff81296bd0>] ? vsnprintf+0x450/0x5e0 01:11:32: [<ffffffff8152aa05>] schedule_timeout+0x215/0x2e0 01:11:32: [<ffffffffa075129c>] ? lustre_msg_set_opc+0x6c/0xc0 [ptlrpc] 01:11:32: [<ffffffffa073c7f6>] ? ptlrpc_set_add_new_req+0x66/0x160 [ptlrpc] 01:11:32: [<ffffffff8152a683>] wait_for_common+0x123/0x180 01:11:32: [<ffffffff81064c00>] ? default_wake_function+0x0/0x20 01:11:32: [<ffffffff8152a79d>] wait_for_completion+0x1d/0x20 01:11:32: [<ffffffffa0996f0c>] osc_io_setattr_end+0xbc/0x190 [osc] 01:11:32: [<ffffffffa09980f0>] ? osc_io_setattr_start+0x1b0/0x400 [osc] 01:11:32: [<ffffffffa0a2d140>] ? lov_io_end_wrapper+0x0/0x100 [lov] 01:11:32: [<ffffffffa0539140>] cl_io_end+0x60/0x150 [obdclass] 01:11:32: [<ffffffffa0539e30>] ? cl_io_start+0x0/0x140 [obdclass] 01:11:32: [<ffffffffa0a2d231>] lov_io_end_wrapper+0xf1/0x100 [lov] 01:11:32: [<ffffffffa0a2cf7e>] lov_io_call+0x8e/0x130 [lov] 01:11:32: [<ffffffffa0a2ed8c>] lov_io_end+0x4c/0xf0 [lov] 01:11:32: [<ffffffffa0539140>] cl_io_end+0x60/0x150 [obdclass] 01:11:32: [<ffffffffa053e3f2>] cl_io_loop+0xc2/0x1b0 [obdclass] 01:11:32: [<ffffffffa0b033d8>] cl_setattr_ost+0x208/0x2c0 [lustre] 01:11:32: [<ffffffffa0ad45b8>] ll_setattr_raw+0xe18/0x1370 [lustre] 01:11:32: [<ffffffff81150019>] ? handle_mm_fault+0x299/0x3d0 01:11:32: [<ffffffffa0ad4b75>] ll_setattr+0x65/0xd0 [lustre] 01:11:32: [<ffffffff811ad4b8>] notify_change+0x168/0x340 01:11:32: [<ffffffff8118c484>] do_truncate+0x64/0xa0 01:11:32: [<ffffffffa0a3ea6a>] ? lov_read_and_clear_async_rc+0x8a/0x1e0 [lov] 01:11:32: [<ffffffff8118c830>] sys_ftruncate+0x120/0x130 01:11:32: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b 01:11:32:INFO: task dd:15276 blocked for more than 120 seconds. 01:11:32: Not tainted 2.6.32-504.30.3.el6.x86_64 #1 01:11:32:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 01:11:32:dd D 0000000000000001 0 15276 15275 0x00000080 01:11:32: ffff88006ce6fdd8 0000000000000082 0000000000000000 0000000000000044 01:11:32: ffff880078df8910 0020000078f1d490 0000095fc7713150 0000000000000000 01:11:32: ffff88006ce6fe58 000000010098b185 ffff8800736885f8 ffff88006ce6ffd8 01:11:32:Call Trace: 01:11:32: [<ffffffff8152b486>] __mutex_lock_slowpath+0x96/0x210 01:11:32: [<ffffffff8152c2a6>] ? down_read+0x16/0x30 01:11:32: [<ffffffff8152afab>] mutex_lock+0x2b/0x50 01:11:32: [<ffffffff8118e5e4>] generic_file_llseek_size+0x44/0xd0 01:11:32: [<ffffffffa0ab7e80>] ll_file_seek+0xf0/0x310 [lustre] 01:11:32: [<ffffffff8118c93a>] vfs_llseek+0x3a/0x40 01:11:32: [<ffffffff8118e6d6>] sys_lseek+0x66/0x80 01:11:32: [<ffffffff810e5c5e>] ? __audit_syscall_exit+0x25e/0x290 01:11:32: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b 01:11:32:INFO: task dd:15438 blocked for more than 120 seconds. 01:11:32: Not tainted 2.6.32-504.30.3.el6.x86_64 #1 01:11:32:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 01:11:32:dd D 0000000000000001 0 15438 15437 0x00000080 01:11:32: ffff88006c83fdd8 0000000000000086 ffffffffa0b3f3e0 0000000000000044 01:11:32: ffff880078df8910 0020000078f1d490 ffffffffa0b17f28 0000000000000000 01:11:32: ffff88006c83fe58 0000000000000000 ffff88007a80f068 ffff88006c83ffd8 01:11:32:Call Trace: 01:11:32: [<ffffffff8152b486>] __mutex_lock_slowpath+0x96/0x210 01:11:32: [<ffffffff8152c2a6>] ? down_read+0x16/0x30 01:11:32: [<ffffffff8152afab>] mutex_lock+0x2b/0x50 01:11:32: [<ffffffff8118e5e4>] generic_file_llseek_size+0x44/0xd0 01:11:32: [<ffffffffa0ab7e80>] ll_file_seek+0xf0/0x310 [lustre] 01:11:32: [<ffffffff8118c93a>] vfs_llseek+0x3a/0x40 01:11:32: [<ffffffff8118e6d6>] sys_lseek+0x66/0x80 01:11:32: [<ffffffff810e5c5e>] ? __audit_syscall_exit+0x25e/0x290 01:11:32: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b 01:11:32:INFO: task dd:15442 blocked for more than 120 seconds. 01:11:32: Not tainted 2.6.32-504.30.3.el6.x86_64 #1 01:11:32:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 01:11:32:dd D 0000000000000001 0 15442 15441 0x00000080 01:11:32: ffff88006b199dd8 0000000000000082 0000000000000000 0000000000000044 01:11:32: ffff880078df8910 0020000078f1d490 000009601181b63f 0000000000000000 01:11:32: ffff88006b199e58 000000010098b684 ffff880072a73068 ffff88006b199fd8 01:11:32:Call Trace: 01:11:32: [<ffffffff8152b486>] __mutex_lock_slowpath+0x96/0x210 01:11:32: [<ffffffff8152c2a6>] ? down_read+0x16/0x30 01:11:32: [<ffffffff8152afab>] mutex_lock+0x2b/0x50 01:11:32: [<ffffffff8118e5e4>] generic_file_llseek_size+0x44/0xd0 01:11:32: [<ffffffffa0ab7e80>] ll_file_seek+0xf0/0x310 [lustre] 01:11:32: [<ffffffff8118c93a>] vfs_llseek+0x3a/0x40 01:11:32: [<ffffffff8118e6d6>] sys_lseek+0x66/0x80 01:11:32: [<ffffffff810e5c5e>] ? __audit_syscall_exit+0x25e/0x290 01:13:23: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b 01:13:23:INFO: task dd:15456 blocked for more than 120 seconds. 01:13:23: Not tainted 2.6.32-504.30.3.el6.x86_64 #1 01:13:23:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 01:13:23:dd D 0000000000000000 0 15456 15455 0x00000080 01:13:23: ffff88007b1e7dd8 0000000000000082 0000000000000000 0000000000000044 01:13:23: ffff880078df8890 0020000078f1d490 00000960146b3f76 0000000000000000 01:13:23: ffff88007b1e7e58 000000010098b6b5 ffff880071a8b068 ffff88007b1e7fd8 01:13:23:Call Trace: 01:13:23: [<ffffffff8152b486>] __mutex_lock_slowpath+0x96/0x210 01:13:23: [<ffffffff8152c2a6>] ? down_read+0x16/0x30 01:13:23: [<ffffffff8152afab>] mutex_lock+0x2b/0x50 01:13:23: [<ffffffff8118e5e4>] generic_file_llseek_size+0x44/0xd0 01:13:23: [<ffffffffa0ab7e80>] ll_file_seek+0xf0/0x310 [lustre] 01:13:23: [<ffffffff8118c93a>] vfs_llseek+0x3a/0x40 01:13:23: [<ffffffff8118e6d6>] sys_lseek+0x66/0x80 01:13:23: [<ffffffff810e5c5e>] ? __audit_syscall_exit+0x25e/0x290 01:13:23: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b In the OST console: 01:11:25:Pid: 23863, comm: ll_ost_io00_049 Not tainted 2.6.32-504.30.3.el6_lustre.g683b4b2.x86_64 #1 Red Hat KVM 01:11:25:RIP: 0010:[<ffffffff8152d84e>] [<ffffffff8152d84e>] _spin_lock+0x1e/0x30 01:11:25:RSP: 0018:ffff880060e15d00 EFLAGS: 00000206 01:11:25:RAX: 0000000000000002 RBX: ffff880060e15d00 RCX: 0000000000000000 01:11:25:RDX: 0000000000000000 RSI: ffff8800468014c0 RDI: ffff8800752230e0 01:11:25:RBP: ffffffff8100bc0e R08: ffff8800752230e0 R09: 0000000000000000 01:11:25:R10: 000000000000000f R11: 000000000000000f R12: ffff880060e15c90 01:11:25:R13: 00000000ffffffeb R14: ffff8800468013c0 R15: ffff8800719dbd40 01:11:25:FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000 01:11:25:CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b 01:11:25:CR2: 00007fa84927c936 CR3: 000000007847b000 CR4: 00000000000006f0 01:11:25:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 01:11:25:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 01:11:25:Process ll_ost_io00_049 (pid: 23863, threadinfo ffff880060e14000, task ffff88005e86e040) 01:11:25:Stack: 01:11:25: ffff880060e15d40 ffffffffa0857341 ffffffffffffffff ffff8800468013c0 01:11:25:<d> ffff8800468013c0 ffff880064c608c0 0000000000000000 ffff88005dd90e60 01:11:25:<d> ffff880060e15d60 ffffffffa085af4b ffff880060e15d60 ffff880075223000 01:11:25:Call Trace: 01:11:25: [<ffffffffa0857341>] ? nrs_resource_get_safe+0x41/0x100 [ptlrpc] 01:11:25: [<ffffffffa085af4b>] ? ptlrpc_nrs_req_initialize+0x3b/0x90 [ptlrpc] 01:11:25: [<ffffffffa081e407>] ? ptlrpc_server_handle_req_in+0x957/0xd30 [ptlrpc] 01:11:25: [<ffffffffa0825d46>] ? ptlrpc_main+0x9d6/0x1920 [ptlrpc] 01:11:25: [<ffffffffa0825370>] ? ptlrpc_main+0x0/0x1920 [ptlrpc] 01:11:25: [<ffffffff8109e78e>] ? kthread+0x9e/0xc0 01:11:25: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 01:11:25: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0 01:11:25: [<ffffffff8100c280>] ? child_rip+0x0/0x20 01:11:25:Code: 00 00 00 01 74 05 e8 32 bf d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 01:11:25:Call Trace: 01:11:25: [<ffffffffa0857341>] ? nrs_resource_get_safe+0x41/0x100 [ptlrpc] 01:11:25: [<ffffffffa085af4b>] ? ptlrpc_nrs_req_initialize+0x3b/0x90 [ptlrpc] 01:11:25: [<ffffffffa081e407>] ? ptlrpc_server_handle_req_in+0x957/0xd30 [ptlrpc] 01:11:25: [<ffffffffa0825d46>] ? ptlrpc_main+0x9d6/0x1920 [ptlrpc] 01:11:25: [<ffffffffa0825370>] ? ptlrpc_main+0x0/0x1920 [ptlrpc] 01:11:25: [<ffffffff8109e78e>] ? kthread+0x9e/0xc0 01:11:25: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 01:11:25: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0 01:11:25: [<ffffffff8100c280>] ? child_rip+0x0/0x20 01:11:25:Kernel panic - not syncing: softlockup: hung tasks 01:11:25:Pid: 23863, comm: ll_ost_io00_049 Not tainted 2.6.32-504.30.3.el6_lustre.g683b4b2.x86_64 #1 01:11:25:Call Trace: 01:11:25: <IRQ> [<ffffffff81529c9c>] ? panic+0xa7/0x16f 01:11:25: [<ffffffff810ea6d0>] ? watchdog_timer_fn+0x0/0x1e0 01:11:25: [<ffffffff810ea89a>] ? watchdog_timer_fn+0x1ca/0x1e0 01:11:25: [<ffffffff810a34ae>] ? __run_hrtimer+0x8e/0x1d0 01:11:25: [<ffffffff810aa8df>] ? ktime_get_update_offsets+0x4f/0xd0 01:11:25: [<ffffffff810a3846>] ? hrtimer_interrupt+0xe6/0x260 01:11:25: [<ffffffff8103427d>] ? local_apic_timer_interrupt+0x3d/0x70 01:11:25: [<ffffffff815345f5>] ? smp_apic_timer_interrupt+0x45/0x60 01:11:25: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20 01:11:25: <EOI> [<ffffffff8152d84e>] ? _spin_lock+0x1e/0x30 01:11:25: [<ffffffffa0857341>] ? nrs_resource_get_safe+0x41/0x100 [ptlrpc] 01:11:25: [<ffffffffa085af4b>] ? ptlrpc_nrs_req_initialize+0x3b/0x90 [ptlrpc] 01:11:25: [<ffffffffa081e407>] ? ptlrpc_server_handle_req_in+0x957/0xd30 [ptlrpc] 01:11:25: [<ffffffffa0825d46>] ? ptlrpc_main+0x9d6/0x1920 [ptlrpc] 01:11:25: [<ffffffffa0825370>] ? ptlrpc_main+0x0/0x1920 [ptlrpc] 01:11:25: [<ffffffff8109e78e>] ? kthread+0x9e/0xc0 01:11:25: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 01:11:25: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0 01:11:25: [<ffffffff8100c280>] ? child_rip+0x0/0x20 |
| Comments |
| Comment by Oleg Drokin [ 24/Aug/15 ] |
|
I see a crashdump was dumped. |
| Comment by Saurabh Tandan (Inactive) [ 16/Dec/15 ] |
|
Instance found on Autotest Patch review system: |
| Comment by James Nunez (Inactive) [ 30/Dec/15 ] |
|
Another failure on master: |
| Comment by Bob Glossman (Inactive) [ 01/Jan/16 ] |
|
another on master: |
| Comment by Wang Shilong (Inactive) [ 21/Sep/17 ] |
|
https://testing.hpdd.intel.com/test_sets/eb3ea05e-9e9a-11e7-b778-5254006e85c2 |