This appears to be related to
A private customer ticket reported a hang on a client which was suffering repeated evictions.
The client threads all seem to be waiting in two connected places.
First, the eviction:
[<ffffffffc11f8c05>] osc_object_invalidate+0x115/0x290 [osc]
[<ffffffffc11e9f4f>] osc_ldlm_resource_invalidate+0xaf/0x190 [osc]
[<ffffffffc0ce8d10>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[<ffffffffc0cec0a5>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[<ffffffffc11f1427>] osc_import_event+0x497/0x1370 [osc]
[<ffffffffc13b3590>] ptlrpc_invalidate_import+0x220/0x8f0 [ptlrpc]
[<ffffffffc13b50c8>] ptlrpc_invalidate_import_thread+0x48/0x2b0 [ptlrpc]
And then the other side:
[<ffffffffc11faf65>] osc_lru_alloc+0x265/0x390 [osc]
[<ffffffffc11fb1c2>] osc_page_init+0x132/0x1d0 [osc]
[<ffffffffc0ff146f>] lov_page_init_composite+0x26f/0x4c0 [lov]
[<ffffffffc0fe8b11>] lov_page_init+0x21/0x60 [lov]
[<ffffffffc0e849bd>] cl_page_alloc+0x10d/0x280 [obdclass]
[<ffffffffc0e84ba4>] cl_page_find+0x74/0x280 [obdclass]
[<ffffffffc1111653>] ll_readpage+0x83/0x6e0 [lustre]
[<ffffffffc1139037>] vvp_io_read_start+0x4b7/0x600 [lustre]
[<ffffffffc0e87b78>] cl_io_start+0x68/0x130 [obdclass]
[<ffffffffc0e89f5e>] cl_io_loop+0x12e/0xc90 [obdclass]
[<ffffffffc10e43c8>] ll_file_io_generic+0x498/0xc80 [lustre]
[<ffffffffc10e547a>] ll_file_aio_read+0x34a/0x3e0 [lustre]
[<ffffffffc10e55de>] ll_file_read+0xce/0x1e0 [lustre]
The eviction side is waiting for:
l_wait_event(osc->oo_io_waitq, atomic_read(&osc->oo_nr_ios) == 0, &lwi);
This is the first action in osc_object_invalidate.
And the other side, in osc_lru_alloc, sleeps with no timeout on the osc_lru_waitq:
struct l_wait_info lwi = LWI_INTR(LWI_ON_SIGNAL_NOOP, NULL);
rc = l_wait_event(osc_lru_waitq,
atomic_long_read(cli->cl_lru_left) > 0, &lwi);
osc_lru_alloc is called after osc_io_iter_init, which increases oo_nr_ios, so it's sleeping here with oo_nr_ios elevated.
The OSC eviction path does not wake osc_lru_waitq directly; it does so only indirectly, by freeing pages from objects. So if the first object to be invalidated has threads waiting for pages, I think it will get stuck here. (We would also expect the failure of whatever is holding these LRU pages to free them up, so we may have an ordering issue here.)
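To make the suspected circular wait concrete, here is a minimal userspace sketch of the two wait conditions. It is only a model: struct model_state and the three helper functions are hypothetical names invented for illustration, and the fields merely stand in for oo_nr_ios and cl_lru_left as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; not Lustre's real structures. */
struct model_state {
	int  nr_ios;   /* stands in for osc->oo_nr_ios */
	long lru_left; /* stands in for *cli->cl_lru_left */
};

/* Condition the eviction thread waits on in osc_object_invalidate. */
static bool evictor_can_proceed(const struct model_state *s)
{
	assert(s != NULL);
	return s->nr_ios == 0;
}

/* Condition the reader waits on in osc_lru_alloc. */
static bool reader_can_proceed(const struct model_state *s)
{
	assert(s != NULL);
	return s->lru_left > 0;
}

/*
 * In this model the reader keeps nr_ios elevated until it gets a page,
 * and pages are only freed once the evictor gets past its own wait.
 * If both conditions are false, neither side can make progress.
 */
static bool model_is_stuck(const struct model_state *s)
{
	return !evictor_can_proceed(s) && !reader_can_proceed(s);
}
```

With nr_ios = 1 and lru_left = 0, model_is_stuck() returns true, which matches the state the two stack traces suggest. The model deliberately ignores any other path that might free LRU pages, so it shows the suspected dependency rather than proving it.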
Additionally, the osc_lru_alloc code does not appear to have any way to fail if the import is being evicted. It looks like we have to successfully get a page here before we return to the larger I/O path, which will eventually notice the eviction and fail.
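One possible direction, sketched here as a userspace model only, is to let the page wait also end when the import has been invalidated, and to return an error in that case. Everything in this sketch is an assumption made for illustration: struct lru_model, the import_invalid flag, lru_wait_done() and lru_reserve() are hypothetical names, not existing Lustre code, and the choice of -EIO is a guess at a suitable error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state osc_lru_alloc would need to consult. */
struct lru_model {
	long lru_left;       /* stands in for *cli->cl_lru_left */
	bool import_invalid; /* assumed flag: set when the import is evicted */
};

/* Wait predicate: stop sleeping when a page is free OR the import is gone. */
static bool lru_wait_done(const struct lru_model *m)
{
	assert(m != NULL);
	return m->lru_left > 0 || m->import_invalid;
}

/*
 * Non-blocking stand-in for the reservation step:
 *   -EAGAIN  the caller would keep sleeping on the wait queue
 *   -EIO     the import was invalidated, so fail instead of waiting
 *   0        one LRU slot was reserved
 */
static int lru_reserve(struct lru_model *m)
{
	if (!lru_wait_done(m))
		return -EAGAIN;
	if (m->import_invalid)
		return -EIO;
	m->lru_left--;
	return 0;
}
```

In this sketch an invalidated import takes priority over available pages, so a waiter fails promptly rather than starting I/O that will fail later anyway. For anything like this to help in practice, the eviction path would presumably also have to wake osc_lru_waitq itself so that sleeping threads re-evaluate the condition.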