[LU-12347] lustre write: do not enqueue rpc with osc/mdc ldlm lock held Created: 28/May/19 Updated: 05/Jul/23 Resolved: 30/Nov/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Vladimir Saveliev | Assignee: | Vladimir Saveliev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
lustre's write should not send an enqueue RPC to the MDS while holding an osc or mdc ldlm lock. Currently this may happen via:
cl_io_loop
  cl_io_lock                <- ldlm lock is taken here
  cl_io_start
    vvp_io_write_start
    ...
    __generic_file_aio_write
      file_remove_privs
        security_inode_need_killpriv
        ...
        ll_xattr_get_common
        ...
        mdc_intent_lock     <- enqueue rpc is sent here
  cl_io_unlock              <- ldlm lock is released
That may lead to client eviction. The following scenario has been observed during write load with DoM involved: all the rpc slots are busy with write processes which wait for enqueue rpc completion, so the write holding the ldlm lock cannot send its getxattr enqueue and the lock cannot be cancelled (see the comments below for the full analysis).
Another, more complex scenario caused by this problem has also been observed: clients get evicted by OSTs during mdtest+ior+failover HW testing. |
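To make the constraint concrete, below is a minimal C outline of the bracket above. It is only an illustration: io_lock/io_start/io_unlock and struct io_ctx are hypothetical stand-ins, not the actual cl_io API.

/* Everything between io_lock() and io_unlock() runs with the osc/mdc ldlm
 * lock held, so nothing in that window may block waiting for a new MDS
 * enqueue reply. */
struct io_ctx;                                   /* opaque, sketch only */

/* stubs so the sketch is self-contained; in the real path these are
 * cl_io_lock(), cl_io_start() and cl_io_unlock() */
static int  io_lock(struct io_ctx *io)   { (void)io; return 0; }
static int  io_start(struct io_ctx *io)  { (void)io; return 0; }
static void io_unlock(struct io_ctx *io) { (void)io; }

static int write_io_outline(struct io_ctx *io)
{
        int rc;

        rc = io_lock(io);               /* <- ldlm lock is taken here */
        if (rc != 0)
                return rc;

        /* vvp_io_write_start() -> ... -> file_remove_privs() runs in here;
         * if the security xattr is not cached, ll_xattr_get_common() ends
         * up in mdc_intent_lock() and sends an enqueue RPC while the lock
         * is still held - the pattern this ticket is about. */
        rc = io_start(io);

        io_unlock(io);                  /* <- ldlm lock is released here */
        return rc;
}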
| Comments |
| Comment by Gerrit Updater [ 28/May/19 ] |
|
Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/34977 |
| Comment by Andreas Dilger [ 24/Oct/19 ] |
|
The full trace looks like:
PID: 20755  TASK: ffff8ff1f07fa080  CPU: 1  COMMAND: "xdd"
 #0 [ffff8ff1e3c47328] __schedule at ffffffffab367747
 #1 [ffff8ff1e3c473b0] schedule at ffffffffab367c49
 #2 [ffff8ff1e3c473c0] obd_get_request_slot at ffffffffc0692c24 [obdclass]
 #3 [ffff8ff1e3c47470] ldlm_cli_enqueue at ffffffffc083a7d0 [ptlrpc]
 #4 [ffff8ff1e3c47528] mdc_enqueue_base at ffffffffc0a10e81 [mdc]
 #5 [ffff8ff1e3c47640] mdc_intent_lock at ffffffffc0a12f15 [mdc]
 #6 [ffff8ff1e3c47718] lmv_intent_lock at ffffffffc09824f2 [lmv]
 #7 [ffff8ff1e3c477c8] ll_xattr_cache_refill at ffffffffc0aea395 [lustre]
 #8 [ffff8ff1e3c478a8] ll_xattr_cache_get at ffffffffc0aeb2ab [lustre]
 #9 [ffff8ff1e3c47900] ll_xattr_list at ffffffffc0ae7cdc [lustre]
#10 [ffff8ff1e3c47968] ll_xattr_get_common at ffffffffc0ae83ef [lustre]
#11 [ffff8ff1e3c479a8] ll_xattr_get_common_3_11 at ffffffffc0ae88d8 [lustre]
#12 [ffff8ff1e3c479b8] generic_getxattr at ffffffffaae693d2
#13 [ffff8ff1e3c479e8] cap_inode_need_killpriv at ffffffffaaef6b9f
#14 [ffff8ff1e3c479f8] security_inode_need_killpriv at ffffffffaaef938c
#15 [ffff8ff1e3c47a08] dentry_needs_remove_privs at ffffffffaae5e6bf
#16 [ffff8ff1e3c47a28] file_remove_privs at ffffffffaae5e8f8
#17 [ffff8ff1e3c47aa0] __generic_file_aio_write at ffffffffaadb88a8
#18 [ffff8ff1e3c47b20] __generic_file_write_iter at ffffffffc0afa36b [lustre]
#19 [ffff8ff1e3c47b90] vvp_io_write_start at ffffffffc0afe6ab [lustre]
#20 [ffff8ff1e3c47c00] cl_io_start at ffffffffc06d4828 [obdclass]
#21 [ffff8ff1e3c47c28] cl_io_loop at ffffffffc06d69fc [obdclass]
#22 [ffff8ff1e3c47c58] ll_file_io_generic at ffffffffc0ab4c1b [lustre]
#23 [ffff8ff1e3c47d60] ll_file_aio_write at ffffffffc0ab58b2 [lustre]
#24 [ffff8ff1e3c47dd8] ll_file_write at ffffffffc0ab5aa4 [lustre]
#25 [ffff8ff1e3c47ec0] vfs_write at ffffffffaae41650
#26 [ffff8ff1e3c47f00] sys_pwrite64 at ffffffffaae42632
#27 [ffff8ff1e3c47f50] system_call_fastpath at ffffffffab374ddb |
| Comment by Vladimir Saveliev [ 17/Jan/20 ] |
Patchset 3 is an attempt to implement Patrick's proposal. Unfortunately, it ran into a complication with finding out whether the extra slot is to be used. Below are dumps of the tests included in the patch, run with the fix disabled:
== sanity test 820: write and lfs setstripe race ===================================================== 07:42:51 (1579236171)
fail_loc=0x80000329
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
write: Input/output error
sanity test_820: @@@@@@ FAIL: multiop failed
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Transport endpoint is not connected
error: lt-lfs setstripe: invalid layout
Trace dump:
and
== sanity test 821: write race ======================================================================= 07:52:43 (1579236763)
fail_loc=0x80000329
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
sanity test_821: @@@@@@ FAIL: multiop failed
write: Input/output error
Trace dump: |
| Comment by Mikhail Pershin [ 24/Jan/20 ] |
|
Vladimir, the description says: "all the rpc slots are busy with write processes which wait for enqueue rpc completion" - could you explain this a bit? Are these 'write processes' writing the same file, and what enqueue are they waiting for? I don't yet see what saturates all the rpc slots. |
| Comment by Vladimir Saveliev [ 24/Jan/20 ] |
Yes.
1. clientA:write1 writes file F (taking the mdc ldlm lock, L).
2. clientB:write1 writes file F.
3. clientA receives the blocking AST and marks lock L CB_PENDING, but it cannot cancel it because it is still in use by clientA:write1.
4. 'Other writes' to file F arrive on clientA. They have to enqueue (somewhere in cl_io_lock) because lock L is already CB_PENDING, and they will be granted the lock only after clientB gets and releases it. So they are blocked by the server while occupying rpc slots, possibly all of them.
5. clientA:write1 continues to file_remove_privs()->getxattr and gets blocked because all rpc slots are occupied. |
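For illustration, the circular wait in steps 4-5 can be modelled outside of Lustre with a counting semaphore standing in for the rpc slots. This is a hypothetical sketch, not Lustre code; MAX_RPCS_IN_FLIGHT, other_write and write1 are just names mirroring the steps above. Build with: cc -pthread slots.c

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_RPCS_IN_FLIGHT 4

static sem_t rpc_slots;                 /* obd_get_request_slot() analogue */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  lock_cancelled = PTHREAD_COND_INITIALIZER;
static int cancelled;                   /* becomes true once L is cancelled */

/* step 4: an 'other write' - L is CB_PENDING, so it must enqueue a new lock;
 * the server will not reply until clientA cancels L, so the slot stays busy */
static void *other_write(void *arg)
{
        (void)arg;
        sem_wait(&rpc_slots);                   /* occupy a slot ...        */
        pthread_mutex_lock(&mtx);
        while (!cancelled)                      /* ... and wait for a reply */
                pthread_cond_wait(&lock_cancelled, &mtx);
        pthread_mutex_unlock(&mtx);
        sem_post(&rpc_slots);
        return NULL;
}

int main(void)
{
        pthread_t tid[MAX_RPCS_IN_FLIGHT];
        int i;

        sem_init(&rpc_slots, 0, MAX_RPCS_IN_FLIGHT);
        for (i = 0; i < MAX_RPCS_IN_FLIGHT; i++)
                pthread_create(&tid[i], NULL, other_write, NULL);
        sleep(1);                               /* let them fill every slot */

        /* step 5: write1, the only thread able to make 'cancelled' true
         * (it still holds L), now needs a slot for its getxattr enqueue
         * from file_remove_privs() - but none will ever be freed */
        if (sem_trywait(&rpc_slots) != 0)
                printf("write1: no rpc slot for the getxattr enqueue;\n"
                       "L stays CB_PENDING, server eventually evicts clientA\n");
        return 0;
}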
| Comment by Mikhail Pershin [ 24/Jan/20 ] |
|
OK, but I wonder why there are so many of those writes on A that they block all the slots - if a new lock is needed, that is one slot for one new enqueue. It seems, though, that many slots are being used, and I am trying to understand why. Also, by 'write using slot' do you mean the write lock enqueue or the real BRW? |
| Comment by Vladimir Saveliev [ 24/Jan/20 ] |
lock enqueue in the following code path:
ll_file_io_generic
cl_io_loop
cl_io_lock
cl_lockset_lock
cl_lock_request
cl_lock_enqueue
lov_lock_enqueue
mdc_lock_enqueue
mdc_enqueue_send
ldlm_lock_match <-- here we get miss
ldlm_cli_enqueue
obd_get_request_slot <-- slot is occupied
ptlrpc_queue_wait <-- here all writes get stuck, as the server does not get a cancel from clientA:write1, because that one is waiting for a free slot
Each of those additional writes enqueues its own lock; I believe this is because mdc_enqueue_send()->ldlm_lock_match() fails. All those cli->cl_max_rpcs_in_flight writes show the following:
00000080:00000001:0.0:1579840865.602638:0:30118:0:(file.c:1995:ll_file_write()) Process entered
...
00000002:00000001:0.0:1579840865.602825:0:30118:0:(mdc_dev.c:842:mdc_lock_enqueue()) Process entered
...
00010000:00010000:0.0:1579840865.602838:0:30118:0:(ldlm_lock.c:1505:ldlm_lock_match_with_skip()) ### not matched ns ffff8c2607875c00 type 13 mode 2 res 8589935618/4 (0 0)
... grab rpc slot via ldlm_cli_enqueue()->obd_get_request_slot() ...
00010000:00010000:0.0:1579840865.602912:0:30118:0:(ldlm_request.c:1121:ldlm_cli_enqueue()) ### sending request ns: lustre-MDT0000-mdc-ffff8c2618072000 lock: ffff8c263c315e00/0xdd4ac26a917dc529 lrc: 3/0,1 mode: --/PW res: [0x200000402:0x4:0x0].0x0 bits 0x40/0x0 rrc: 6 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 30118 timeout: 0 lvb_type: 0
... |
| Comment by Mikhail Pershin [ 24/Jan/20 ] |
|
Yes, that is what I thought; probably this is the reason. I would think that if a lock is already being enqueued then there is no need to do the same again and again, especially considering that DoM has a single range for all locks. I will check that part of the code. |
| Comment by Vladimir Saveliev [ 24/Jan/20 ] |
OK, that makes sense. However, even if you fix that, we still need a fix for the rpc slot lockup on ll_file_write->remove_file_suid, because the lockup may happen during concurrent writes to different files as well. |
| Comment by Mikhail Pershin [ 24/Jan/20 ] |
|
could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit. |
| Comment by Mikhail Pershin [ 30/Jan/20 ] |
|
FYI, I have found why all the slots are filled with BRW/glimpse enqueue RPCs. There is an ldlm_lock_match() call in mdc_enqueue_send() which tries to find a granted or waiting lock so as not to enqueue a new, similar lock, but the problem is that we have one CBPENDING lock which cannot be matched, and each new enqueue RPC gets stuck on the server waiting for it. Meanwhile, a new lock is put into the waiting queue on the client side only when it gets the reply from the server, i.e. when the enqueue RPC finishes, and there are no such locks yet: every one of them sits in an RPC slot waiting for the server response and has not been added to the waiting queue, so each new enqueue matches no lock and goes to the server too, consuming more slots. |
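A small standalone model of that race (again hypothetical, not Lustre code; writer, visible_locks and enqueue_rpcs are made-up names) shows why every concurrent writer misses the match and consumes its own slot. Build with: cc -pthread match.c

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NWRITERS 8

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static int visible_locks;       /* locks on the granted/waiting queues */
static int enqueue_rpcs;        /* rpc slots consumed */

static void *writer(void *arg)
{
        int matched;

        (void)arg;
        pthread_mutex_lock(&mtx);
        matched = visible_locks > 0;    /* ldlm_lock_match() analogue */
        if (!matched)
                enqueue_rpcs++;         /* take a slot, send the enqueue */
        pthread_mutex_unlock(&mtx);

        if (!matched) {
                sleep(1);               /* reply delayed: the server waits for
                                         * the uncancellable CBPENDING lock */
                pthread_mutex_lock(&mtx);
                visible_locks++;        /* only now could others match it */
                pthread_mutex_unlock(&mtx);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NWRITERS];
        int i;

        for (i = 0; i < NWRITERS; i++)
                pthread_create(&tid[i], NULL, writer, NULL);
        for (i = 0; i < NWRITERS; i++)
                pthread_join(tid[i], NULL);
        printf("%d writers -> %d enqueue rpcs (slots consumed)\n",
               NWRITERS, enqueue_rpcs);
        return 0;
}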
| Comment by Mikhail Pershin [ 30/Jan/20 ] |
|
So I would propose not to add the extra slot here but to keep all the other changes - the IT_GETXATTR addition, the removal of ols_has_ref, and passing einfo in ldlm_cli_enqueue_fini(). Also, I'd consider excluding DoM locks from consuming RPC slots, similarly to EXTENT locks. |
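For illustration, the last point amounts to deciding from the lock type whether an enqueue should consume a client rpc slot at all. The sketch below is purely hypothetical (made-up names, not the Lustre structures or the merged patch):

enum sketch_lock_type { SKETCH_EXTENT, SKETCH_IBITS_DOM, SKETCH_IBITS_OTHER };

/* EXTENT (OST data) locks already bypass the rpc slot limit; the proposal
 * is to treat DoM locks the same way, so a burst of DoM enqueues cannot
 * starve other metadata rpcs of slots. */
static int enqueue_needs_rpc_slot(enum sketch_lock_type type)
{
        switch (type) {
        case SKETCH_EXTENT:
        case SKETCH_IBITS_DOM:
                return 0;       /* no slot consumed */
        default:
                return 1;       /* other metadata enqueues keep using slots */
        }
}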
| Comment by Vladimir Saveliev [ 31/Jan/20 ] |
1. Have max_rpcs_in_flight writes to max_rpcs_in_flight files, and have them pause somewhere at file_remove_suid->ll_xattr_cache_refill. Patchset 8 of https://review.whamcloud.com/#/c/34977/ contains this test: sanityn:105c. |
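For reference, the load in step 1 looks roughly like the sketch below; the mount point, file names and NWRITERS value are assumptions, and the real test is the sanityn 105c script, not this program. Each writer hits its own file, so all of them can sit in file_remove_privs()->getxattr at the same time. Build with: cc -pthread repro.c

#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NWRITERS 8              /* assumed max_rpcs_in_flight */

static void *do_write(void *arg)
{
        char path[64], buf[4096];
        int fd;

        /* hypothetical path; the real test uses its own test directory */
        snprintf(path, sizeof(path), "/mnt/lustre/d105c/f105c.%ld",
                 (long)(intptr_t)arg);
        memset(buf, 'a', sizeof(buf));
        fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd >= 0) {
                /* the write path goes through file_remove_privs() on entry */
                pwrite(fd, buf, sizeof(buf), 0);
                close(fd);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NWRITERS];
        long i;

        for (i = 0; i < NWRITERS; i++)
                pthread_create(&tid[i], NULL, do_write, (void *)(intptr_t)i);
        for (i = 0; i < NWRITERS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}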
| Comment by Gerrit Updater [ 06/Jul/21 ] |
|
Vladimir Saveliev (vlaidimir.saveliev@hpe.com) uploaded a new patch: https://review.whamcloud.com/44151 |
| Comment by Vladimir Saveliev [ 06/Jul/21 ] |
The following, however, is still possible:
- clientA:write1 writes to file F, holds the mdc ldlm lock (L1) and runs somewhere on the way to file_remove_privs()->ll_xattr_get_common()
- clientB:write is going to write file F and enqueues a DoM lock; the MDS handles the conflict on L1 and sends a blocking AST to clientA
- clientA: max_mod_rpcs_in_flight simultaneous creates occupy all mod rpc slots and get delayed on the MDS side waiting for preallocated objects; preallocation is delayed by OST failover
- clientA:write1 tries to get a mod rpc slot to enqueue the xattr request, but all slots are busy, so lock L1 cannot be cancelled until one of the creates, stuck on preallocation, completes its rpc
- the lock callback timer expires on the MDS first and evicts clientA
This can be fixed by adding IT_GETXATTR in mdc_skip_mod_rpc_slot().
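A sketch of that direction is below, with clearly made-up names; the exact shape of mdc_skip_mod_rpc_slot() in the tree is not quoted here (see https://review.whamcloud.com/44151 for the actual change). The point is that a getxattr intent issued from file_remove_privs() must not have to compete for a mod rpc slot, otherwise lock L1 can never be cancelled.

enum sketch_intent { SK_IT_OPEN, SK_IT_GETATTR, SK_IT_LOOKUP, SK_IT_GETXATTR };

/* return 1 when the intent should bypass the mod rpc slot limit */
static int sketch_skip_mod_rpc_slot(enum sketch_intent it_op)
{
        switch (it_op) {
        case SK_IT_GETATTR:
        case SK_IT_LOOKUP:
        case SK_IT_GETXATTR:    /* <- the addition discussed above */
                return 1;
        default:
                return 0;       /* e.g. open/create keep using mod slots */
        }
}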
ok. see https://review.whamcloud.com/44151
These are already in as part of https://review.whamcloud.com/36903. |
| Comment by Gerrit Updater [ 30/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44151/ |
| Comment by Peter Jones [ 30/Nov/21 ] |
|
Landed for 2.15 |