Details
- Improvement
- Resolution: Fixed
- Major
- Lustre 2.10.7, Lustre 2.12.3
- None
Description
This appears to be related to LU-6271.
A private customer ticket reported a hang on a client which was suffering repeated evictions.
The client threads all seem to be waiting in two connected places.
First, the eviction:
[<ffffffffc11f8c05>] osc_object_invalidate+0x115/0x290 [osc]
[<ffffffffc11e9f4f>] osc_ldlm_resource_invalidate+0xaf/0x190 [osc]
[<ffffffffc0ce8d10>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[<ffffffffc0cec0a5>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[<ffffffffc11f1427>] osc_import_event+0x497/0x1370 [osc]
[<ffffffffc13b3590>] ptlrpc_invalidate_import+0x220/0x8f0 [ptlrpc]
[<ffffffffc13b50c8>] ptlrpc_invalidate_import_thread+0x48/0x2b0 [ptlrpc]
[<ffffffffa52c1c71>] kthread+0xd1/0xe0
[<ffffffffa5974c1d>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
And then the other side:
[<ffffffffc11faf65>] osc_lru_alloc+0x265/0x390 [osc]
[<ffffffffc11fb1c2>] osc_page_init+0x132/0x1d0 [osc]
[<ffffffffc0ff146f>] lov_page_init_composite+0x26f/0x4c0 [lov]
[<ffffffffc0fe8b11>] lov_page_init+0x21/0x60 [lov]
[<ffffffffc0e849bd>] cl_page_alloc+0x10d/0x280 [obdclass]
[<ffffffffc0e84ba4>] cl_page_find+0x74/0x280 [obdclass]
[<ffffffffc1111653>] ll_readpage+0x83/0x6e0 [lustre]
[<ffffffffa53b81f0>] generic_file_aio_read+0x3f0/0x790
[<ffffffffc1139037>] vvp_io_read_start+0x4b7/0x600 [lustre]
[<ffffffffc0e87b78>] cl_io_start+0x68/0x130 [obdclass]
[<ffffffffc0e89f5e>] cl_io_loop+0x12e/0xc90 [obdclass]
[<ffffffffc10e43c8>] ll_file_io_generic+0x498/0xc80 [lustre]
[<ffffffffc10e547a>] ll_file_aio_read+0x34a/0x3e0 [lustre]
[<ffffffffc10e55de>] ll_file_read+0xce/0x1e0 [lustre]
[<ffffffffa54414bf>] vfs_read+0x9f/0x170
[<ffffffffa544237f>] SyS_read+0x7f/0xf0
[<ffffffffa5974ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
The eviction side is waiting for:
l_wait_event(osc->oo_io_waitq, atomic_read(&osc->oo_nr_ios) == 0, &lwi);
This is the first action in osc_object_invalidate.
And the other side, in osc_lru_alloc, sleeps with no timeout on the osc_lru_waitq:
struct l_wait_info lwi = LWI_INTR(LWI_ON_SIGNAL_NOOP, NULL);
[.....]
rc = l_wait_event(osc_lru_waitq, atomic_long_read(cli->cl_lru_left) > 0, &lwi);
osc_lru_alloc is called after osc_io_iter_init, which increases oo_nr_ios, so it's sleeping here with oo_nr_ios elevated.
The OSC eviction path does not wake osc_lru_waitq directly; it does so indirectly, by freeing pages from objects. So if the first object to be invalidated has threads waiting for pages, I think it will get stuck here. (We would also expect the failure of whatever is holding these LRU pages to free them up, so we may have an ordering issue here.)
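To make the suspected circular wait concrete, here is a minimal single-threaded model of the two wait conditions. It is only a sketch: struct model and all of its functions are hypothetical stand-ins rather than Lustre code, and it assumes that the only way LRU slots are returned is the evictor freeing pages after its own wait completes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the state named in the report. */
struct model {
    int  nr_ios;           /* models osc->oo_nr_ios */
    long lru_left;         /* models *cli->cl_lru_left */
    bool reader_waiting;   /* a reader is asleep waiting for an LRU slot */
    bool evictor_waiting;  /* the evictor is asleep waiting for nr_ios == 0 */
};

/* Reader: I/O init raises nr_ios first, then page setup needs an LRU slot. */
static void reader_start(struct model *m)
{
    m->nr_ios++;
    if (m->lru_left > 0)
        m->lru_left--;             /* slot available, no sleep */
    else
        m->reader_waiting = true;  /* sleeps with nr_ios still elevated */
}

/* Reader: only reachable if the reader was not left sleeping. */
static void reader_finish(struct model *m)
{
    assert(m->nr_ios > 0);
    m->nr_ios--;
}

/*
 * Evictor: must observe nr_ios == 0 before it frees any pages.
 * Returns true if invalidation ran and released pages_cached slots.
 */
static bool evictor_try(struct model *m, long pages_cached)
{
    if (m->nr_ios != 0) {
        m->evictor_waiting = true;
        return false;
    }
    m->evictor_waiting = false;
    m->lru_left += pages_cached;   /* freeing pages is what would wake readers */
    return true;
}

/* Each side is waiting on a condition only the other side can satisfy. */
static bool is_deadlocked(const struct model *m)
{
    return m->reader_waiting && m->evictor_waiting;
}
```

With no slots left, reader_start followed by evictor_try leaves both flags set, which matches the two stacks above. With at least one slot available, the reader can finish and the evictor then proceeds.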
Additionally, the osc_lru_alloc code does not appear to have any way to fail if the import is being evicted. It looks like we have to successfully get a page here before we can unwind back out into the larger I/O path, which will eventually catch the eviction and fail.
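One possible shape for an eviction-aware wait is sketched below in user-space C with a pthread condition variable. This is not the Lustre implementation and not necessarily how this ticket was resolved: struct lru_pool, lru_reserve, lru_release and lru_begin_eviction are hypothetical names, and the choice of -EIO as the failure code is an assumption. The idea is only that the wait predicate also checks an "evicting" flag, and that the eviction path wakes the waiters itself instead of relying on freed pages to do so.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the LRU slot accounting. */
struct lru_pool {
    pthread_mutex_t lock;
    pthread_cond_t  waitq;     /* plays the role of osc_lru_waitq */
    long            left;      /* plays the role of *cli->cl_lru_left */
    bool            evicting;  /* assumed flag: import is being evicted */
};

static void lru_pool_init(struct lru_pool *p, long slots)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->waitq, NULL);
    p->left = slots;
    p->evicting = false;
}

/* Take one slot. Returns 0 on success, or -EIO once eviction has begun. */
static int lru_reserve(struct lru_pool *p)
{
    int rc = 0;

    pthread_mutex_lock(&p->lock);
    /* Sleep only while there is no slot AND no eviction in progress. */
    while (p->left <= 0 && !p->evicting)
        pthread_cond_wait(&p->waitq, &p->lock);
    assert(p->left > 0 || p->evicting);

    if (p->evicting)
        rc = -EIO;   /* fail fast so the caller can drop its I/O reference */
    else
        p->left--;
    pthread_mutex_unlock(&p->lock);
    return rc;
}

/* Return one slot and wake a single waiter. */
static void lru_release(struct lru_pool *p)
{
    pthread_mutex_lock(&p->lock);
    p->left++;
    pthread_cond_signal(&p->waitq);
    pthread_mutex_unlock(&p->lock);
}

/* Eviction path: set the flag and wake every waiter directly. */
static void lru_begin_eviction(struct lru_pool *p)
{
    pthread_mutex_lock(&p->lock);
    p->evicting = true;
    pthread_cond_broadcast(&p->waitq);
    pthread_mutex_unlock(&p->lock);
}
```

In this sketch a thread that would otherwise sleep forever returns -EIO as soon as lru_begin_eviction runs, so its caller can unwind and drop the I/O count that the invalidation side is waiting on.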
Issue Links
- is related to: LU-6271 (osc_cache.c:3150:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed: (Resolved)