[LU-12142] Hang in OSC on eviction - threads stuck in read() and ldlm_bl_NN Created: 01/Apr/19 Updated: 05/Nov/21 Resolved: 06/Apr/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.7, Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Improvement | Priority: | Major |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Wang Shilong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This appears to be related to a hang reported in a private customer ticket. The client was suffering repeated evictions, and its threads all seem to be waiting in two connected places.

First, the eviction:

[<ffffffffc11f8c05>] osc_object_invalidate+0x115/0x290 [osc]
[<ffffffffc11e9f4f>] osc_ldlm_resource_invalidate+0xaf/0x190 [osc]
[<ffffffffc0ce8d10>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[<ffffffffc0cec0a5>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[<ffffffffc11f1427>] osc_import_event+0x497/0x1370 [osc]
[<ffffffffc13b3590>] ptlrpc_invalidate_import+0x220/0x8f0 [ptlrpc]
[<ffffffffc13b50c8>] ptlrpc_invalidate_import_thread+0x48/0x2b0 [ptlrpc]
[<ffffffffa52c1c71>] kthread+0xd1/0xe0
[<ffffffffa5974c1d>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff

And then the other side:

[<ffffffffc11faf65>] osc_lru_alloc+0x265/0x390 [osc]
[<ffffffffc11fb1c2>] osc_page_init+0x132/0x1d0 [osc]
[<ffffffffc0ff146f>] lov_page_init_composite+0x26f/0x4c0 [lov]
[<ffffffffc0fe8b11>] lov_page_init+0x21/0x60 [lov]
[<ffffffffc0e849bd>] cl_page_alloc+0x10d/0x280 [obdclass]
[<ffffffffc0e84ba4>] cl_page_find+0x74/0x280 [obdclass]
[<ffffffffc1111653>] ll_readpage+0x83/0x6e0 [lustre]
[<ffffffffa53b81f0>] generic_file_aio_read+0x3f0/0x790
[<ffffffffc1139037>] vvp_io_read_start+0x4b7/0x600 [lustre]
[<ffffffffc0e87b78>] cl_io_start+0x68/0x130 [obdclass]
[<ffffffffc0e89f5e>] cl_io_loop+0x12e/0xc90 [obdclass]
[<ffffffffc10e43c8>] ll_file_io_generic+0x498/0xc80 [lustre]
[<ffffffffc10e547a>] ll_file_aio_read+0x34a/0x3e0 [lustre]
[<ffffffffc10e55de>] ll_file_read+0xce/0x1e0 [lustre]
[<ffffffffa54414bf>] vfs_read+0x9f/0x170
[<ffffffffa544237f>] SyS_read+0x7f/0xf0
[<ffffffffa5974ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

The eviction side is waiting for:

l_wait_event(osc->oo_io_waitq, atomic_read(&osc->oo_nr_ios) == 0, &lwi);

This is the first action in osc_object_invalidate.
And the other side, in osc_lru_alloc, sleeps with no timeout on the osc_lru_waitq:

struct l_wait_info lwi = LWI_INTR(LWI_ON_SIGNAL_NOOP, NULL);
[.....]
rc = l_wait_event(osc_lru_waitq,
atomic_long_read(cli->cl_lru_left) > 0,
&lwi);
osc_lru_alloc is called after osc_io_iter_init, which increases oo_nr_ios, so it's sleeping here with oo_nr_ios elevated.
The OSC eviction path does not wake osc_lru_waitq directly; it only does so indirectly by freeing pages from objects. So if the first object to be invalidated has threads waiting for pages, I think it will get stuck here. (We would also expect the failure of whatever is holding these LRU pages to free them up, so we may have an ordering issue here.) Additionally, osc_lru_alloc does not appear to have any way to fail if the import is being evicted. It looks like we have to successfully get a page here before we return to the larger I/O path, which would eventually notice the eviction and fail. |
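One possible shape of a fix, sketched below purely for illustration: have the LRU wait in osc_lru_alloc also give up once the import is being invalidated, so the reader returns an error, drops oo_nr_ios, and lets the eviction thread make progress. The imp_invalid test and the -EIO return here are assumptions for the sketch, not necessarily what the patches that eventually landed do:

/* Sketch only: wait for LRU slots, but also give up if the import has
 * been invalidated (eviction), instead of sleeping forever with
 * oo_nr_ios elevated. */
struct l_wait_info lwi = LWI_INTR(LWI_ON_SIGNAL_NOOP, NULL);

rc = l_wait_event(osc_lru_waitq,
                  atomic_long_read(cli->cl_lru_left) > 0 ||
                  cli->cl_import->imp_invalid,
                  &lwi);
if (rc == 0 && cli->cl_import->imp_invalid)
        rc = -EIO;      /* let the caller unwind and drop oo_nr_ios */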
| Comments |
| Comment by Andreas Dilger [ 02/Sep/20 ] |
|
It looks like the root of this problem is that llite.*.max_cached_mb is too small (=128 on the /home filesystem and =2048 on the /scratch filesystem) for multiple threads reading from the same filesystem to reserve enough pages for the RDMA reads at one time. This results in all of the threads being stuck holding some number of pages, each waiting for additional pages before it has enough to send its read RPC. The clients have osc.*.max_pages_per_rpc=16M, so it would only take 9+ threads preparing concurrent 16MB read RPCs from the /home filesystem before the livelock could be hit. With clients having 20-30 or more cores and max_pages_per_rpc increasing, this is increasingly likely to be hit, as seen when unused_mb is stuck at 0, as below:

$ lctl get_param llite.home*.max_cached_mb
llite.home-ffff880c1c30ec00.max_cached_mb=
users: 8
max_cached_mb: 128
used_mb: 128
unused_mb: 0
reclaim_count: 0

Increasing the llite.*.max_cached_mb values for both filesystems allowed the read threads to get the pages they needed and get unstuck. The llite.*.max_cached_mb value had been reduced while debugging another issue related to a memory problem. |
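To make the arithmetic explicit: with max_cached_mb=128 and 16MB read RPCs, the cache covers at most 128 / 16 = 8 full-sized RPCs at once. Nine or more concurrent readers can therefore each end up holding part of an RPC's worth of pages while waiting for the remainder, with no thread able to complete and release its pages, which matches the unused_mb=0 state above. Raising the limit on the affected clients (for example with lctl set_param llite.*.max_cached_mb, sized to the client's RAM and thread count) gives each reader room to assemble a full RPC.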
| Comment by Andreas Dilger [ 02/Sep/20 ] |
|
I think there are a couple of ways to get out of this kind of deadlock situation:
|
| Comment by Wang Shilong (Inactive) [ 03/Sep/20 ] |
|
We do already try to reserve LRU pages before that, see cl_io_iter_init->osc_io_rw_iter_init->osc_lru_reserve(), but this gives no guarantee: it just tries to reserve LRU pages in batch in advance if there are plenty of free pages, and it tries to trigger async reclaim if there are not enough free pages. Maybe we should modify the osc_lru_reserve() logic to block if there are not enough free LRU pages to satisfy the reservation. |
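A rough sketch of what a blocking reservation could look like, purely illustrative: the helper name, the wakeup condition, and the error handling below are assumptions, not the interface of the patch that eventually landed.

/* Sketch only: reserve a whole RPC's worth of LRU slots before starting
 * the I/O, blocking (interruptibly) until the budget is available or the
 * import is invalidated, instead of taking slots one page at a time from
 * inside the I/O.  A real version would need to loop and re-check, since
 * another thread can consume slots between the wakeup and the subtraction. */
static int osc_lru_reserve_blocking(struct client_obd *cli, long npages)
{
        struct l_wait_info lwi = LWI_INTR(LWI_ON_SIGNAL_NOOP, NULL);
        int rc;

        rc = l_wait_event(osc_lru_waitq,
                          atomic_long_read(cli->cl_lru_left) >= npages ||
                          cli->cl_import->imp_invalid,
                          &lwi);
        if (rc == 0 && cli->cl_import->imp_invalid)
                rc = -EIO;
        if (rc == 0)
                atomic_long_sub(npages, cli->cl_lru_left);
        return rc;
}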
| Comment by Andreas Dilger [ 03/Sep/20 ] |
|
Another alternative may be to send a smaller RPC if there are not enough pages to form a full-sized RPC? |
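A minimal sketch of that alternative, using a hypothetical helper name: cap the page count of the RPC being assembled at whatever the LRU budget can supply right now, so the reader sends a short RPC rather than waiting for a full-sized one.

/* Sketch only: clamp the number of pages batched into one read RPC to
 * what the LRU budget can supply at the moment (never below one page),
 * so a reader can make progress with a short RPC instead of livelocking
 * while waiting for a full-sized one. */
static unsigned long osc_read_rpc_pages(struct client_obd *cli,
                                        unsigned long max_pages_per_rpc)
{
        long avail = atomic_long_read(cli->cl_lru_left);

        if (avail >= (long)max_pages_per_rpc)
                return max_pages_per_rpc;

        return avail > 0 ? (unsigned long)avail : 1;
}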
| Comment by Wang Shilong (Inactive) [ 03/Sep/20 ] |
|
adilger, I am not sure about that; it would need some debugging and testing. For example, even if max_cached_mb is bigger, it might still run out. |
| Comment by Gerrit Updater [ 14/Oct/20 ] |
|
Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40237 |
| Comment by Gerrit Updater [ 17/Mar/21 ] |
|
Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/42060 |
| Comment by Gerrit Updater [ 06/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42060/ |
| Comment by Gerrit Updater [ 06/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40237/ |
| Comment by Peter Jones [ 06/Apr/21 ] |
|
Landed for 2.15 |
| Comment by Alex Zhuravlev [ 23/Aug/21 ] |
|
hitting the following deadlock in racer quite often:

schedule,vvp_io_setattr_start,cl_io_start,cl_io_loop,cl_setattr_ost,ll_setattr_raw,do_truncate,path_openat,do_filp_open,do_sys_open
PIDs(2): "dir_create.sh":9061 "dir_create.sh":9134

schedule,wait_for_common,osc_io_setattr_end,cl_io_end,lov_io_end_wrapper,lov_io_call,lov_io_end,cl_io_end,cl_io_loop,cl_setattr_ost,ll_setattr_raw,do_truncate,path_openat,do_filp_open,do_sys_open
PIDs(1): "dir_create.sh":9363

schedule,ldlm_completion_ast,ldlm_cli_enqueue_local,ofd_destroy_by_fid,ofd_destroy_hdl,tgt_request_handle,ptlrpc_main
PIDs(1): "ll_ost00_007":12274

schedule,osc_object_invalidate,osc_ldlm_resource_invalidate,cfs_hash_for_each_relax,cfs_hash_for_each_nolock,osc_import_event,ptlrpc_invalidate_import,ptlrpc_invalidate_import_thread
PIDs(1): "ll_imp_inval":553105 |