Details
Type: Bug
Resolution: Duplicate
Priority: Blocker
Description
We have found that it is fairly easy to get a client to hang when running a shared-file IOR such as this:
srun -l -D /p/lcrater2/surya1/tmp -N 56 -n 112 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m
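For reference, with 112 tasks each writing a 1024 MiB block in 1 MiB transfers and then read-verifying it (per -w -W -r -R above), the shared IOR.data file works out to roughly 112 GiB:

# Aggregate file size implied by the options above: tasks * per-task block size.
echo "$((112 * 1024)) MiB total (= 112 GiB)"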
A client node will experience trouble communicating and get evicted:
2011-06-17 13:14:16 LustreError: 11-0: an error occurred while communicating with 10.1.1.48@o2ib9. The obd_ping operation failed with -107
2011-06-17 13:14:16 Lustre: lc2-OST0005-osc-ffff880200164400: Connection to service lc2-OST0005 via nid 10.1.1.48@o2ib9 was lost; in progress operations using this service will wait for recovery to complete.
2011-06-17 13:14:17 LustreError: 167-0: This client was evicted by lc2-OST0005; in progress operations using this service will fail.
2011-06-17 13:14:17 LustreError: 8675:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
2011-06-17 13:14:17 LustreError: 8675:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
2011-06-17 13:18:06 INFO: task ll_imp_inval:8757 blocked for more than 120 seconds.
2011-06-17 13:18:06 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
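The numeric errors in these console messages are ordinary negated errnos (-107 is ENOTCONN, -11 is EAGAIN); if needed they can be looked up directly in the kernel headers (path assumes a typical Linux install):

# Map the error numbers from the log back to symbolic errno names.
grep -hw -e 107 -e 11 /usr/include/asm-generic/errno*.h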
During this time, two IOR processes are stuck here:
#0 [ffff880105521398] schedule at ffffffff814dabb9
#1 [ffff880105521460] cfs_waitq_wait at ffffffffa034e75e [libcfs]
#2 [ffff880105521470] osc_enter_cache at ffffffffa06c6eda [osc]
#3 [ffff880105521580] osc_queue_async_io at ffffffffa06c7439 [osc]
#4 [ffff880105521690] osc_page_cache_add at ffffffffa06d549a [osc]
#5 [ffff880105521700] cl_page_invoke at ffffffffa0481008 [obdclass]
#6 [ffff880105521770] cl_page_cache_add at ffffffffa0481d88 [obdclass]
#7 [ffff8801055217f0] vvp_io_commit_write at ffffffffa07d2a43 [lustre]
#8 [ffff8801055218a0] cl_io_commit_write at ffffffffa048fb8f [obdclass]
#9 [ffff880105521920] ll_commit_write at ffffffffa07aa12d [lustre]
#10 [ffff8801055219b0] ll_write_end at ffffffffa07c17a0 [lustre]
#11 [ffff8801055219e0] generic_file_buffered_write at ffffffff8110dc74
#12 [ffff880105521ab0] __generic_file_aio_write at ffffffff8110f560
#13 [ffff880105521b70] generic_file_aio_write at ffffffff8110f7ff
#14 [ffff880105521bc0] vvp_io_write_start at ffffffffa07d32d1 [lustre]
#15 [ffff880105521c40] cl_io_start at ffffffffa048cb38 [obdclass]
#16 [ffff880105521ca0] cl_io_loop at ffffffffa0490f40 [obdclass]
#17 [ffff880105521d10] ll_file_io_generic at ffffffffa077b8e3 [lustre]
#18 [ffff880105521dd0] ll_file_aio_write at ffffffffa0780aaf [lustre]
#19 [ffff880105521e60] ll_file_write at ffffffffa0780df1 [lustre]
#20 [ffff880105521ef0] vfs_write at ffffffff811725b8
#21 [ffff880105521f30] sys_write at ffffffff81172ff1
#22 [ffff880105521f80] system_call_fastpath at ffffffff8100b172
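Backtraces like the one above can be collected on the live client without attaching a debugger; a minimal sketch, assuming root access and that sysrq is enabled (the crash-utility vmlinux path is just an example):

# Dump all task stacks to the kernel log (sysrq 't'); the blocked IOR writers
# show up sleeping in osc_enter_cache().
echo t > /proc/sysrq-trigger
dmesg | grep -A 25 IOR
# Or, with the crash utility against the running kernel:
#   crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
#   crash> foreach IOR bt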
The ldlm_bl_* thread backtraces look like this when they are spinning:
2011-06-17 14:27:59 <NMI> [<ffffffff81009d45>] ? show_regs+0x45/0x50
2011-06-17 14:27:59 [<ffffffff814decd9>] ? arch_trigger_all_cpu_backtrace_handler+0x79/0xa0
2011-06-17 14:27:59 [<ffffffff814e05c5>] ? notifier_call_chain+0x55/0x80
2011-06-17 14:27:59 [<ffffffff814e062a>] ? atomic_notifier_call_chain+0x1a/0x20
2011-06-17 14:27:59 [<ffffffff8109409e>] ? notify_die+0x2e/0x30
2011-06-17 14:27:59 [<ffffffff814de233>] ? do_nmi+0x173/0x2b0
2011-06-17 14:27:59 [<ffffffff814ddb40>] ? nmi+0x20/0x30
2011-06-17 14:27:59 [<ffffffff814dd3a8>] ? _spin_lock+0x18/0x30
2011-06-17 14:27:59 <<EOE>> [<ffffffffa0484b99>] ? cl_page_gang_lookup+0x219/0x380 [obdclass]
2011-06-17 14:28:00 [<ffffffffa0483860>] ? cl_page_own+0x0/0x20 [obdclass]
2011-06-17 14:28:00 [<ffffffffa048aed7>] ? cl_lock_page_out+0xb7/0x320 [obdclass]
2011-06-17 14:28:00 [<ffffffffa047fae6>] ? cl_env_peek+0x36/0x110 [obdclass]
2011-06-17 14:28:00 [<ffffffffa06d6d8f>] ? osc_lock_flush+0x4f/0x90 [osc]
2011-06-17 14:28:00 [<ffffffffa06d6e29>] ? osc_lock_cancel+0x59/0x170 [osc]
2011-06-17 14:28:00 [<ffffffffa047fae6>] ? cl_env_peek+0x36/0x110 [obdclass]
2011-06-17 14:28:00 [<ffffffffa0485e9d>] ? cl_lock_cancel0+0x6d/0x130 [obdclass]
2011-06-17 14:28:00 [<ffffffffa04869fb>] ? cl_lock_cancel+0x11b/0x120 [obdclass]
2011-06-17 14:28:00 [<ffffffffa06d7bf8>] ? osc_ldlm_blocking_ast+0x148/0x300 [osc]
2011-06-17 14:28:00 [<ffffffff810dbe15>] ? call_rcu_sched+0x15/0x20
2011-06-17 14:28:00 [<ffffffff810dbe2e>] ? call_rcu+0xe/0x10
2011-06-17 14:28:00 [<ffffffffa054287d>] ? ldlm_handle_bl_callback+0x10d/0x2a0 [ptlrpc]
2011-06-17 14:28:00 [<ffffffffa0542c51>] ? ldlm_bl_thread_main+0x241/0x3d0 [ptlrpc]
2011-06-17 14:28:00 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
2011-06-17 14:28:00 [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
2011-06-17 14:28:00 [<ffffffffa0542a10>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2011-06-17 14:28:00 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
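The <NMI> frames at the top of that trace indicate it came from an all-CPU backtrace; the same dump can be requested on demand to confirm the thread is spinning in cl_page_gang_lookup() rather than sleeping (again a sketch, assuming sysrq is available):

# Trigger an NMI backtrace of all CPUs (sysrq 'l'); a CPU that repeatedly shows
# up in _spin_lock()/cl_page_gang_lookup() across several dumps is busy-spinning.
echo 1 > /proc/sys/kernel/sysrq
echo l > /proc/sysrq-trigger
dmesg | tail -n 200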
In one of the runs, the node became "unstuck" after a vvp_io_commit_write() call failed:
2011-06-17 14:36:55 LustreError: 8747:0:(vvp_io.c:987:vvp_io_commit_write()) Write page 28712193 of inode ffff8801d20666b8 failed -4
2011-06-17 14:59:59 mstflint:4325 freeing invalid memtype fe900000-fe910000
2011-06-17 14:59:59 mstflint:4325 freeing invalid memtype fe910000-fea00000
2011-06-17 15:00:02 mstflint:4333 freeing invalid memtype fe900000-fe910000
2011-06-17 15:00:02 Jun 17 15:00:02 zwicky161 kernel: sfit43 ren nai etp e000f900
2011-06-17 15:00:02 mstflint:4333 freeing invalid memtype fe910000-fea00000
2011-06-17 15:00:07 LustreError: 8757:0:(ldlm_resource.c:748:ldlm_resource_complain()) Namespace lc2-OST0005-osc-ffff880200164400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2011-06-17 15:00:07 LustreError: 8757:0:(ldlm_resource.c:754:ldlm_resource_complain()) Resource: ffff8801ff756700 (61363/0/0/0) (rc: 1)
2011-06-17 15:00:07 Lustre: lc2-OST0005-osc-ffff880200164400: Connection restored to service lc2-OST0005 using nid 10.1.1.48@o2ib9.
2011-06-17 15:00:07 CPUFREQ: Per core ondemand sysfs interface is deprecated - up_threshold
2011-06-17 15:00:31 LustreError: 11-0: an error occurred while communicating with 10.1.1.48@o2ib9. The obd_ping operation failed with -107
2011-06-17 15:00:31 Lustre: lc2-OST0005-osc-ffff880200164400: Connection to service lc2-OST0005 via nid 10.1.1.48@o2ib9 was lost; in progress operations using this service will wait for recovery to complete.
2011-06-17 15:00:31 LustreError: 167-0: This client was evicted by lc2-OST0005; in progress operations using this service will fail.
2011-06-17 15:00:31 Lustre: lc2-OST0005-osc-ffff880200164400: Connection restored to service lc2-OST0005 using nid 10.1.1.48@o2ib9.
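(The -4 in the vvp_io_commit_write() message is EINTR.) While a node is bouncing between eviction and "Connection restored" like this, the state of the affected OSC import and its dirty/grant accounting, which is what osc_enter_cache() waits on, can be checked from the client; the target name below is taken from these logs, and the parameter names assume a 2.x-era client:

# Import state of the OSC that keeps losing its connection to lc2-OST0005.
lctl get_param osc.lc2-OST0005-osc-*.import
# Dirty-page and grant accounting on that OSC; osc_enter_cache() blocks when
# the cache/grant limits are exhausted.
lctl get_param osc.lc2-OST0005-osc-*.cur_dirty_bytes osc.lc2-OST0005-osc-*.cur_grant_bytes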
However, I have seen another node that seemed stuck in the same way and was still stuck long after the IOR processes had been killed.