Details
Type: Bug
Resolution: Duplicate
Priority: Blocker
Description
We have found that it is fairly easy to get a client to hang when running a shared-file IOR such as this:
srun -l -D /p/lcrater2/surya1/tmp -N 56 -n 112 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m
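For reference, with 112 tasks each writing a 1024 MiB block in 1 MiB transfers and then read-verifying it (per -w -W -r -R above), the shared IOR.data file works out to roughly 112 GiB:

# Aggregate file size implied by the options above: tasks * per-task block size.
echo "$((112 * 1024)) MiB total (= 112 GiB)"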
A client node will experience trouble communicating and get evicted:
2011-06-17 13:14:16 LustreError: 11-0: an error occurred while communicating with 10.1.1.48@o2ib9. The obd_ping operation failed with -107
2011-06-17 13:14:16 Lustre: lc2-OST0005-osc-ffff880200164400: Connection to service lc2-OST0005 via nid 10.1.1.48@o2ib9 was lost; in progress operations using this service will wait for recovery to complete.
2011-06-17 13:14:17 LustreError: 167-0: This client was evicted by lc2-OST0005; in progress operations using this service will fail.
2011-06-17 13:14:17 LustreError: 8675:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
2011-06-17 13:14:17 LustreError: 8675:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
2011-06-17 13:18:06 INFO: task ll_imp_inval:8757 blocked for more than 120 seconds.
2011-06-17 13:18:06 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
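The numeric errors in these console messages are ordinary negated errnos (-107 is ENOTCONN, -11 is EAGAIN); if needed they can be looked up directly in the kernel headers (path assumes a typical Linux install):

# Map the error numbers from the log back to symbolic errno names.
grep -hw -e 107 -e 11 /usr/include/asm-generic/errno*.h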
During this time, two IOR processes are stuck here:
#0 [ffff880105521398] schedule at ffffffff814dabb9
#1 [ffff880105521460] cfs_waitq_wait at ffffffffa034e75e [libcfs]
#2 [ffff880105521470] osc_enter_cache at ffffffffa06c6eda [osc]
#3 [ffff880105521580] osc_queue_async_io at ffffffffa06c7439 [osc]
#4 [ffff880105521690] osc_page_cache_add at ffffffffa06d549a [osc]
#5 [ffff880105521700] cl_page_invoke at ffffffffa0481008 [obdclass]
#6 [ffff880105521770] cl_page_cache_add at ffffffffa0481d88 [obdclass]
#7 [ffff8801055217f0] vvp_io_commit_write at ffffffffa07d2a43 [lustre]
#8 [ffff8801055218a0] cl_io_commit_write at ffffffffa048fb8f [obdclass]
#9 [ffff880105521920] ll_commit_write at ffffffffa07aa12d [lustre]
#10 [ffff8801055219b0] ll_write_end at ffffffffa07c17a0 [lustre]
#11 [ffff8801055219e0] generic_file_buffered_write at ffffffff8110dc74
#12 [ffff880105521ab0] __generic_file_aio_write at ffffffff8110f560
#13 [ffff880105521b70] generic_file_aio_write at ffffffff8110f7ff
#14 [ffff880105521bc0] vvp_io_write_start at ffffffffa07d32d1 [lustre]
#15 [ffff880105521c40] cl_io_start at ffffffffa048cb38 [obdclass]
#16 [ffff880105521ca0] cl_io_loop at ffffffffa0490f40 [obdclass]
#17 [ffff880105521d10] ll_file_io_generic at ffffffffa077b8e3 [lustre]
#18 [ffff880105521dd0] ll_file_aio_write at ffffffffa0780aaf [lustre]
#19 [ffff880105521e60] ll_file_write at ffffffffa0780df1 [lustre]
#20 [ffff880105521ef0] vfs_write at ffffffff811725b8
#21 [ffff880105521f30] sys_write at ffffffff81172ff1
#22 [ffff880105521f80] system_call_fastpath at ffffffff8100b172
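Backtraces like the one above can be collected on the live client without attaching a debugger; a minimal sketch, assuming root access and that sysrq is enabled (the crash-utility vmlinux path is just an example):

# Dump all task stacks to the kernel log (sysrq 't'); the blocked IOR writers
# show up sleeping in osc_enter_cache().
echo t > /proc/sysrq-trigger
dmesg | grep -A 25 IOR
# Or, with the crash utility against the running kernel:
#   crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
#   crash> foreach IOR bt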
The ldlm_bl_* thread backtraces look like this when they are spinning:
2011-06-17 14:27:59 <NMI> [<ffffffff81009d45>] ? show_regs+0x45/0x50
2011-06-17 14:27:59 [<ffffffff814decd9>] ? arch_trigger_all_cpu_backtrace_handler+0x79/0xa0
2011-06-17 14:27:59 [<ffffffff814e05c5>] ? notifier_call_chain+0x55/0x80
2011-06-17 14:27:59 [<ffffffff814e062a>] ? atomic_notifier_call_chain+0x1a/0x20
2011-06-17 14:27:59 [<ffffffff8109409e>] ? notify_die+0x2e/0x30
2011-06-17 14:27:59 [<ffffffff814de233>] ? do_nmi+0x173/0x2b0
2011-06-17 14:27:59 [<ffffffff814ddb40>] ? nmi+0x20/0x30
2011-06-17 14:27:59 [<ffffffff814dd3a8>] ? _spin_lock+0x18/0x30
2011-06-17 14:27:59 <<EOE>> [<ffffffffa0484b99>] ? cl_page_gang_lookup+0x219/0x380 [obdclass]
2011-06-17 14:28:00 [<ffffffffa0483860>] ? cl_page_own+0x0/0x20 [obdclass]
2011-06-17 14:28:00 [<ffffffffa048aed7>] ? cl_lock_page_out+0xb7/0x320 [obdclass]
2011-06-17 14:28:00 [<ffffffffa047fae6>] ? cl_env_peek+0x36/0x110 [obdclass]
2011-06-17 14:28:00 [<ffffffffa06d6d8f>] ? osc_lock_flush+0x4f/0x90 [osc]
2011-06-17 14:28:00 [<ffffffffa06d6e29>] ? osc_lock_cancel+0x59/0x170 [osc]
2011-06-17 14:28:00 [<ffffffffa047fae6>] ? cl_env_peek+0x36/0x110 [obdclass]
2011-06-17 14:28:00 [<ffffffffa0485e9d>] ? cl_lock_cancel0+0x6d/0x130 [obdclass]
2011-06-17 14:28:00 [<ffffffffa04869fb>] ? cl_lock_cancel+0x11b/0x120 [obdclass]
2011-06-17 14:28:00 [<ffffffffa06d7bf8>] ? osc_ldlm_blocking_ast+0x148/0x300 [osc]
2011-06-17 14:28:00 [<ffffffff810dbe15>] ? call_rcu_sched+0x15/0x20
2011-06-17 14:28:00 [<ffffffff810dbe2e>] ? call_rcu+0xe/0x10
2011-06-17 14:28:00 [<ffffffffa054287d>] ? ldlm_handle_bl_callback+0x10d/0x2a0 [ptlrpc]
2011-06-17 14:28:00 [<ffffffffa0542c51>] ? ldlm_bl_thread_main+0x241/0x3d0 [ptlrpc]
2011-06-17 14:28:00 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
2011-06-17 14:28:00 [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
2011-06-17 14:28:00 [<ffffffffa0542a10>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2011-06-17 14:28:00 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
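The <NMI> frames at the top of that trace indicate it came from an all-CPU backtrace; the same dump can be requested on demand to confirm the thread is spinning in cl_page_gang_lookup() rather than sleeping (again a sketch, assuming sysrq is available):

# Trigger an NMI backtrace of all CPUs (sysrq 'l'); a CPU that repeatedly shows
# up in _spin_lock()/cl_page_gang_lookup() across several dumps is busy-spinning.
echo 1 > /proc/sys/kernel/sysrq
echo l > /proc/sysrq-trigger
dmesg | tail -n 200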
In one of the runs, the node became "unstuck" after a vvp_io_commit_write() call failed:
2011-06-17 14:36:55 LustreError: 8747:0:(vvp_io.c:987:vvp_io_commit_write()) Write page 28712193 of inode ffff8801d20666b8 failed -4
2011-06-17 14:59:59 mstflint:4325 freeing invalid memtype fe900000-fe910000
2011-06-17 14:59:59 mstflint:4325 freeing invalid memtype fe910000-fea00000
2011-06-17 15:00:02 mstflint:4333 freeing invalid memtype fe900000-fe910000
2011-06-17 15:00:02 Jun 17 15:00:02 zwicky161 kernel: sfit43 ren nai etp e000f900
2011-06-17 15:00:02 mstflint:4333 freeing invalid memtype fe910000-fea00000
2011-06-17 15:00:07 LustreError: 8757:0:(ldlm_resource.c:748:ldlm_resource_complain()) Namespace lc2-OST0005-osc-ffff880200164400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2011-06-17 15:00:07 LustreError: 8757:0:(ldlm_resource.c:754:ldlm_resource_complain()) Resource: ffff8801ff756700 (61363/0/0/0) (rc: 1)
2011-06-17 15:00:07 Lustre: lc2-OST0005-osc-ffff880200164400: Connection restored to service lc2-OST0005 using nid 10.1.1.48@o2ib9.
2011-06-17 15:00:07 CPUFREQ: Per core ondemand sysfs interface is deprecated - up_threshold
2011-06-17 15:00:31 LustreError: 11-0: an error occurred while communicating with 10.1.1.48@o2ib9. The obd_ping operation failed with -107
2011-06-17 15:00:31 Lustre: lc2-OST0005-osc-ffff880200164400: Connection to service lc2-OST0005 via nid 10.1.1.48@o2ib9 was lost; in progress operations using this service will wait for recovery to complete.
2011-06-17 15:00:31 LustreError: 167-0: This client was evicted by lc2-OST0005; in progress operations using this service will fail.
2011-06-17 15:00:31 Lustre: lc2-OST0005-osc-ffff880200164400: Connection restored to service lc2-OST0005 using nid 10.1.1.48@o2ib9.
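(The -4 in the vvp_io_commit_write() message is EINTR.) While a node is bouncing between eviction and "Connection restored" like this, the state of the affected OSC import and its dirty/grant accounting, which is what osc_enter_cache() waits on, can be checked from the client; the target name below is taken from these logs, and the parameter names assume a 2.x-era client:

# Import state of the OSC that keeps losing its connection to lc2-OST0005.
lctl get_param osc.lc2-OST0005-osc-*.import
# Dirty-page and grant accounting on that OSC; osc_enter_cache() blocks when
# the cache/grant limits are exhausted.
lctl get_param osc.lc2-OST0005-osc-*.cur_dirty_bytes osc.lc2-OST0005-osc-*.cur_grant_bytes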
However, I have seen another node that seemed stuck in the same way and was still stuck long after the IOR processes had been killed.