Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Labels: None
- Severity: 3
Description
Lustre's write path should not send an enqueue RPC to the MDS while holding an OSC or MDC LDLM lock. This may currently happen via:
cl_io_loop
  cl_io_lock                      <- ldlm lock is taken here
  cl_io_start
    vvp_io_write_start
      ...
      __generic_file_aio_write
        file_remove_privs
          security_inode_need_killpriv
            ...
            ll_xattr_get_common
              ...
              mdc_intent_lock     <- enqueue rpc is sent here
  cl_io_unlock                    <- ldlm lock is released
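For illustration only, here is a minimal, self-contained C sketch of that ordering. The function names (io_lock, xattr_intent_enqueue, io_unlock) are hypothetical stand-ins for the calls in the stack above, not Lustre APIs; the point is simply that the metadata RPC goes out while the LDLM lock is still held.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the calls in the stack above. */
static bool ldlm_lock_held;

static void io_lock(void)   { ldlm_lock_held = true;  puts("cl_io_lock: LDLM lock taken"); }
static void io_unlock(void) { ldlm_lock_held = false; puts("cl_io_unlock: LDLM lock released"); }

static void xattr_intent_enqueue(void)
{
	/* Corresponds to ll_xattr_get_common() -> mdc_intent_lock():
	 * the enqueue RPC needs a request slot, and here it is sent
	 * while the LDLM lock from io_lock() is still held. */
	printf("enqueue RPC sent to MDS (LDLM lock held: %s)\n",
	       ldlm_lock_held ? "yes" : "no");
}

int main(void)
{
	io_lock();               /* cl_io_lock()                    */
	xattr_intent_enqueue();  /* nested RPC inside cl_io_start() */
	io_unlock();             /* cl_io_unlock()                  */
	return 0;
}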
Sending the enqueue under a held lock may lead to client eviction. The following scenario has been observed under write load with DoM (Data-on-MDT) involved; a minimal thread-based model of the resulting circular wait is sketched after this list:
- a write holds the MDC LDLM lock (L1) and is waiting for a free RPC slot in obd_get_request_slot() while trying to do ll_xattr_get_common();
- all the RPC slots are occupied by write processes that are waiting for enqueue RPC completion;
- the MDS, in order to serve those enqueue requests, has sent a blocking AST for lock L1 and eventually evicts the client because it does not cancel L1.
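The circular wait above can be illustrated with ordinary userspace threads. This is only a sketch, not Lustre code: the counting semaphore stands in for the max_rpcs_in_flight request-slot limit behind obd_get_request_slot(), the mutex stands in for the MDC LDLM lock L1, and the thread roles mirror the writes described above. Timeouts are used so the demo reports the stall instead of hanging.

/* Build with: cc -pthread slot_deadlock_sketch.c */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define MAX_RPCS_IN_FLIGHT 8   /* stand-in for the client's RPC slot limit */

static sem_t rpc_slots;                                       /* stand-in for the request slots     */
static pthread_mutex_t lock_L1 = PTHREAD_MUTEX_INITIALIZER;  /* stand-in for the MDC LDLM lock L1  */
static int l1_cancelled;                                      /* would be set once L1 is cancelled  */

/* Try to grab a slot, but give up after two seconds so the demo terminates. */
static int get_slot_with_timeout(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += 2;
	return sem_timedwait(&rpc_slots, &ts);
}

/* The first write: holds L1 and then needs a slot for its xattr RPC. */
static void *write_holding_l1(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_L1);
	if (get_slot_with_timeout() != 0)
		puts("write holding L1: no RPC slot, cannot finish -> cannot cancel L1");
	else
		sem_post(&rpc_slots);
	pthread_mutex_unlock(&lock_L1);
	return NULL;
}

/* Other writes: each occupies a slot with an enqueue that would only be
 * granted once L1 is cancelled, so the slot is not returned in time. */
static void *write_waiting_enqueue(void *arg)
{
	long id = (long)arg;

	sem_wait(&rpc_slots);
	sleep(5);                      /* "waiting for enqueue completion" */
	if (!l1_cancelled)
		printf("enqueue write %ld: slot held, enqueue never completed\n", id);
	sem_post(&rpc_slots);
	return NULL;
}

int main(void)
{
	pthread_t holder, writers[MAX_RPCS_IN_FLIGHT];
	long i;

	sem_init(&rpc_slots, 0, MAX_RPCS_IN_FLIGHT);

	/* Fill every RPC slot with enqueues that cannot complete ... */
	for (i = 0; i < MAX_RPCS_IN_FLIGHT; i++)
		pthread_create(&writers[i], NULL, write_waiting_enqueue, (void *)i);
	sleep(1);
	/* ... then start the write that holds L1 and needs a slot. */
	pthread_create(&holder, NULL, write_holding_l1, NULL);

	pthread_join(holder, NULL);
	for (i = 0; i < MAX_RPCS_IN_FLIGHT; i++)
		pthread_join(writers[i], NULL);
	sem_destroy(&rpc_slots);
	return 0;
}

In this model the write holding L1 cannot get a slot, so it can neither finish nor release (cancel) L1; on a real client the MDS then evicts it.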
Another, more complex scenario caused by this problem has been observed: clients get evicted by OSTs during mdtest+IOR+failover hardware testing.
Issue Links
- is related to: LU-15639 replay-dual test_31 error: set_param: param_path 'at_max': No such file or directory (Open)
Reproduction scenario for the test:
1. Have max_rpcs_in_flight writes to max_rpcs_in_flight files; have them pause somewhere at file_remove_suid->ll_xattr_cache_refill.
2. Have max_rpcs_in_flight writes to the same files from another client. The server notices max_rpcs_in_flight conflicts and sends blocking ASTs to the first client.
3. The first client is unable to cancel the locks, as ll_xattr_cache_refill has to complete first.
4. Start max_rpcs_in_flight new writes that have to enqueue DLM locks (because the existing locks are callback-pending). Those new writes occupy the RPC slots, and their enqueues will complete only after the enqueues from client2 complete.
5. The first writes want to do an enqueue in ll_xattr_find_get_lock, but all slots are occupied.
Patchset 8 of https://review.whamcloud.com/#/c/34977/ contains this test: sanityn:105c.