Lustre / LU-12347

lustre write: do not send enqueue rpc while holding osc/mdc ldlm lock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0

    Description

      Lustre's write should not send an enqueue RPC to the MDS while holding an osc or mdc ldlm lock. This may currently happen via:

          cl_io_loop
            cl_io_lock                    <- ldlm lock is taken here
            cl_io_start
              vvp_io_write_start
              ...
                __generic_file_aio_write
                  file_remove_privs
                    security_inode_need_killpriv
                    ...
                      ll_xattr_get_common
                      ...
                        mdc_intent_lock   <- enqueue rpc is sent here
            cl_io_unlock                  <- ldlm lock is released
      

      That may lead to client eviction. The following scenario has been observed during write load with DoM involved:

      • a write holds an mdc ldlm lock (L1) and is waiting for a free RPC slot in
        obd_get_request_slot, trying to do ll_xattr_get_common().
      • all the RPC slots are occupied by write processes that are waiting for
        enqueue RPC completion.
      • the MDS, in order to serve those enqueue requests, has sent a blocking AST
        for lock L1 and eventually evicts the client, as the client does not
        cancel L1.
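
      To make the cycle concrete, the following is a minimal pthreads model of the scenario above (illustration only, not Lustre code: NSLOTS stands in for max_rpcs_in_flight, and rpc_slots/dlm_lock are invented names). The lock holder needs an RPC slot for its getxattr-style request, while every slot is held by an enqueue that cannot complete until that lock is cancelled.

          /* Illustration only, not Lustre code: a pthreads model of the
           * slot/lock cycle described above. */
          #include <pthread.h>
          #include <semaphore.h>
          #include <stdio.h>
          #include <unistd.h>

          #define NSLOTS 4                      /* stands in for max_rpcs_in_flight */

          static sem_t rpc_slots;               /* counting semaphore: free RPC slots */
          static pthread_mutex_t dlm_lock = PTHREAD_MUTEX_INITIALIZER;

          static void *writer0(void *arg)       /* the write holding lock L1 */
          {
              (void)arg;
              pthread_mutex_lock(&dlm_lock);    /* cl_io_lock: ldlm lock taken */
              sleep(1);                         /* let the enqueues fill the slots */
              printf("writer0: holds the lock, waits for an RPC slot (getxattr)\n");
              sem_wait(&rpc_slots);             /* never succeeds: all slots busy */
              sem_post(&rpc_slots);
              pthread_mutex_unlock(&dlm_lock);
              return NULL;
          }

          static void *enqueue(void *arg)       /* a write enqueueing a conflicting lock */
          {
              sem_wait(&rpc_slots);             /* obd_get_request_slot */
              printf("enqueue %ld: slot taken, waits for the lock to be cancelled\n",
                     (long)arg);
              pthread_mutex_lock(&dlm_lock);    /* "granted only after cancel" */
              pthread_mutex_unlock(&dlm_lock);
              sem_post(&rpc_slots);
              return NULL;
          }

          int main(void)
          {
              pthread_t w, e[NSLOTS];

              sem_init(&rpc_slots, 0, NSLOTS);
              pthread_create(&w, NULL, writer0, NULL);
              usleep(100 * 1000);               /* make sure writer0 takes the lock first */
              for (long i = 0; i < NSLOTS; i++)
                  pthread_create(&e[i], NULL, enqueue, (void *)i);
              sleep(3);
              printf("3s later nothing progressed: this is the cycle; on a real client\n"
                     "the server times out the blocking AST and evicts.\n");
              return 0;                         /* exiting main tears down the stuck threads */
          }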

      Another, more complex scenario caused by this problem has also been observed: clients get evicted by OSTs during mdtest+ior+failover HW testing.



          Activity

            Vladimir Saveliev added a comment (edited)

            "could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit."

            1. have max_rpcs_in_flight writes to max_rpcs_in_flight files. Have them pause somewhere at file_remove_suid->ll_xattr_cache_refill.
            2. have max_rpcs_in_flight writes to the same files from another client. The server will notice max_rpcs_in_flight conflicts and send blocking ASTs to the first client.
            3. The first client is unable to cancel the locks, as ll_xattr_cache_refill has to complete first.
            4. have max_rpcs_in_flight new writes enqueue DLM locks (because the existing locks are callback pending). Those new writes occupy RPC slots, as their enqueues will complete only after the enqueues from client2 are completed.
            5. The first writes want to do an enqueue in ll_xattr_find_get_lock, but all slots are occupied.

            Patchset 8 of https://review.whamcloud.com/#/c/34977/ contains this test: sanityn:105c.


            Mikhail Pershin added a comment

            So I would propose not to add an extra slot here but to keep all the other changes: adding IT_GETXATTR, removing ols_has_ref, and passing einfo in ldlm_cli_enqueue_fini(). Also I'd consider excluding DOM locks from consuming RPC slots, similarly to EXTENT locks.


            Mikhail Pershin added a comment

            FYI, I've found why all slots are filled with BRW/glimpse enqueue RPCs. There is an ldlm_lock_match() call in mdc_enqueue_send() which tries to find a granted or waiting lock so as not to enqueue a new similar lock, but the problem is that we have one CBPENDING lock which cannot be matched, and each new enqueue RPC gets stuck on the server waiting for it. Meanwhile, a new lock is put into the waiting queue on the client side only when it gets a reply from the server, i.e. when the enqueue RPC finishes, so there are no such locks yet: every one stays in an RPC slot waiting for the server response and is not added to the waiting queue, so each new enqueue matches no lock and also goes to the server, consuming a slot.
            I have some observations and proposals about that.
            1) ldlm_request_slot_needed() takes a slot only for FLOCK and IBITS locks but not EXTENT; I suppose that is because IO locks need no flow control, as they are usually the result of a file operation whose other RPCs are already sent under flow control. Maybe there are other reasons too. Anyway, DOM enqueue RPCs could similarly be excluded from taking an RPC slot (see the sketch below).
            2) All MDT locks are ATOMIC on the server, so the server waits for the lock to be granted before replying to the client. That keeps an enqueue RPC in its slot for quite a long time, which can also be a reason for MDC RPC flow control: to limit the number of outgoing locks and not overload the client import. OSC IO locks are asynchronous and the server replies without waiting for the lock to be granted. DOM locks are also 'atomic' right now, so they wait for the lock to be granted on the server. If they were handled asynchronously, they would not be stuck in RPC slots forever waiting for blocking locks. I have such a patch here: https://review.whamcloud.com/36903 and I think it will help with the current issue.
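
            A rough sketch of the check discussed in point 1), for illustration only - this is not the real ldlm_request_slot_needed() (its arguments and the LDLM/inodebits definitions live in the Lustre tree); the enum values, the struct and the DOM bit value below are simplified stand-ins:

            /* Illustration only: simplified stand-ins for the LDLM definitions. */
            #include <stdbool.h>
            #include <stdio.h>

            enum ldlm_type { LDLM_PLAIN, LDLM_EXTENT, LDLM_FLOCK, LDLM_IBITS };

            #define DOM_BIT 0x40UL              /* stand-in for the DoM inodebit */

            struct enqueue_info {
                enum ldlm_type type;
                unsigned long  inodebits;       /* meaningful for LDLM_IBITS only */
            };

            /* FLOCK and IBITS enqueues take an RPC slot; EXTENT (OSC IO) enqueues do
             * not.  The proposal: an IBITS enqueue asking only for the DOM bit is an
             * IO lock too, so let it skip the slot as well. */
            static bool request_slot_needed(const struct enqueue_info *ei)
            {
                if (ei->type == LDLM_FLOCK)
                    return true;
                if (ei->type == LDLM_IBITS)
                    return ei->inodebits != DOM_BIT;
                return false;                   /* LDLM_EXTENT, LDLM_PLAIN */
            }

            int main(void)
            {
                struct enqueue_info dom = { LDLM_IBITS, DOM_BIT };
                struct enqueue_info md  = { LDLM_IBITS, 0x01UL /* e.g. a lookup-like bit */ };
                struct enqueue_info brw = { LDLM_EXTENT, 0 };

                printf("DOM-only IBITS enqueue needs a slot: %d\n", request_slot_needed(&dom));
                printf("regular IBITS enqueue needs a slot:  %d\n", request_slot_needed(&md));
                printf("EXTENT enqueue needs a slot:         %d\n", request_slot_needed(&brw));
                return 0;
            }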

            Mikhail Pershin added a comment (edited)

            could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit.


            Vladimir Saveliev added a comment

            "I would think that if a lock is already enqueued then there is no need to do the same again and again, especially considering that DOM has a single range for all locks. I will check that part of the code."

            Ok, that makes sense; however, even if you fix that, we still need a fix for the RPC slot lockup in ll_file_write->remove_file_suid, because the lockup may happen during concurrent writes to different files as well.

            Mikhail Pershin added a comment (edited)

            Yes, that is what I thought; probably this is the reason. I would think that if a lock is already enqueued then there is no need to do the same again and again, especially considering that DOM has a single range for all locks. I will check that part of the code.


            Vladimir Saveliev added a comment

            "Also, by 'write using slot' do you mean the write lock enqueue or the real BRW?"

            lock enqueue in the following code path:

            ll_file_io_generic
              cl_io_loop
                cl_io_lock
                  cl_lockset_lock
                    cl_lock_request
                      cl_lock_enqueue
                        lov_lock_enqueue
                          mdc_lock_enqueue
                            mdc_enqueue_send
                              ldlm_lock_match            <-- here we get miss
                              ldlm_cli_enqueue
                                obd_get_request_slot  <-- slot is occupied
                                ptlrpc_queue_wait        <-- here all writes stuck, as server does not get cancel from clientA:write 1 because that one waits for free slots
            

            "OK, but I wonder why there are so many of those writes on A that they block all slots; if a new lock is needed, then that is one slot for the new enqueue. But it seems many slots are being used, and I am trying to understand why."

            Each of those additional writes enqueues its own lock; I believe this is because mdc_enqueue_send()->ldlm_lock_match() fails.

            All those cli->cl_max_rpcs_in_flight writes have the following:

            00000080:00000001:0.0:1579840865.602638:0:30118:0:(file.c:1995:ll_file_write()) Process entered
            ...
            00000002:00000001:0.0:1579840865.602825:0:30118:0:(mdc_dev.c:842:mdc_lock_enqueue()) Process entered
            ...
            00010000:00010000:0.0:1579840865.602838:0:30118:0:(ldlm_lock.c:1505:ldlm_lock_match_with_skip()) ### not matched ns ffff8c2607875c00 type 13 mode 2 res 8589935618/4 (0 0)
            ...
            grab rpc slot via ldlm_cli_enqueue ()->obd_get_request_slot()
            ...
            00010000:00010000:0.0:1579840865.602912:0:30118:0:(ldlm_request.c:1121:ldlm_cli_enqueue()) ### sending request ns: lustre-MDT0000-mdc-ffff8c2618072000 lock: ffff8c263c315e00/0xdd4ac26a917dc529 lrc: 3/0,1 mode: --/PW res: [0x200000402:0x4:0x0].0x0 bits 0x40/0x0 rrc: 6 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 30118 timeout: 0 lvb_type: 0
            ...
            

            Mikhail Pershin added a comment

            OK, but I wonder why there are so many of those writes on A that they block all slots; if a new lock is needed, then that is one slot for the new enqueue. But it seems many slots are being used, and I am trying to understand why. Also, by 'write using slot' do you mean the write lock enqueue or the real BRW?


            Vladimir Saveliev added a comment

            "could you explain this a bit: are these 'write processes' about the same file, and what enqueue are they waiting for?"

            Yes.

            1. clientA:write 1 writes file F:
            it gets somewhere around the point of entering file_remove_privs. It holds DLM lock L.

            2. clientB:write1 writes file F:
            it enqueues for a lock, the server sees a conflict with L owned by clientA, so it sends a blocking AST to clientA.

            3. clientA receives the blocking AST and marks the lock CB_PENDING, but it cannot cancel it because it is in use by clientA:write 1.

            4. 'other writes' to file F arrive on clientA. They have to do an enqueue (somewhere in cl_io_lock) because lock L is already CB_PENDING, so those writes will be granted the lock only after clientB gets and releases it. Thus they are blocked by the server while occupying RPC slots, possibly all of them.

            5. clientA:write1 continues to file_remove_privs()->getxattr and gets blocked, as all RPC slots are occupied.


            Mikhail Pershin added a comment

            Vladimir, the description says "all the RPC slots are occupied by write processes that are waiting for enqueue RPC completion" - could you explain this a bit: are these 'write processes' about the same file, and what enqueue are they waiting for? I don't yet get who saturates all the RPC slots.


            Vladimir Saveliev added a comment

            "So I have two suggestions for different, less complicated ways to fix this; I think (2) should definitely work, while I am less sure about (1).
            1. See if getxattr truly needs to be modifying. obd_skip_mod_rpc_slot considers it a modifying RPC, but it seems weird that it is. Perhaps it doesn't need to be - it would be good to know why getxattr is considered modifying. If it's not modifying, the problem is solved.
            2. Alternatively, looking at obd_mod_rpc_slot_avail_locked(), we see that one extra close request is allowed to avoid a deadlock:
              * On the MDC client, to avoid a potential deadlock (see Bugzilla 3462),
              * one close request is allowed above the maximum.
            If we did the same for getxattr, that should fix this as well.
            Either is much simpler than this, and I think we should be able to do one of those...?"

            Patchset 3 is an attempt to implement Patrick's proposal. Unfortunately, it ran into complications with figuring out whether the extra slot should be used.
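
            For reference, the "one extra request above the maximum" exception quoted above can be modeled roughly as follows (a simplified sketch, not the real obd_mod_rpc_slot_avail_locked(), whose locking and per-opcode accounting are omitted; the OPC_* names are invented): close may exceed the limit by one, and the proposal would extend the same exception to getxattr.

            /* Illustration only: a simplified model of the extra-slot exception. */
            #include <stdbool.h>
            #include <stdio.h>

            enum mod_opc { OPC_SETATTR, OPC_CLOSE, OPC_GETXATTR };

            struct client {
                unsigned int mod_rpcs_in_flight;
                unsigned int max_mod_rpcs_in_flight;
            };

            /* A close request may exceed the limit by one (the Bugzilla 3462 deadlock
             * avoidance); the proposal grants the same exception to getxattr sent from
             * the write path. */
            static bool mod_rpc_slot_avail(const struct client *cl, enum mod_opc opc)
            {
                bool close_like = (opc == OPC_CLOSE || opc == OPC_GETXATTR);

                if (cl->mod_rpcs_in_flight < cl->max_mod_rpcs_in_flight)
                    return true;
                return close_like &&
                       cl->mod_rpcs_in_flight == cl->max_mod_rpcs_in_flight;
            }

            int main(void)
            {
                struct client cl = { .mod_rpcs_in_flight = 8, .max_mod_rpcs_in_flight = 8 };

                printf("setattr with all mod slots used:  %d\n", mod_rpc_slot_avail(&cl, OPC_SETATTR));
                printf("close with all mod slots used:    %d\n", mod_rpc_slot_avail(&cl, OPC_CLOSE));
                printf("getxattr with all mod slots used: %d\n", mod_rpc_slot_avail(&cl, OPC_GETXATTR));
                return 0;
            }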

            Below are dumps of the tests included in the patch, run with the fix disabled:

            == sanity test 820: write and lfs setstripe race ===================================================== 07:42:51 (1579236171)
            fail_loc=0x80000329
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            write: Input/output error
             sanity test_820: @@@@@@ FAIL: multiop failed 
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Transport endpoint is not connected
            error: lt-lfs setstripe: invalid layout
              Trace dump:
            

            and

            == sanity test 821: write race ======================================================================= 07:52:43 (1579236763)
            fail_loc=0x80000329
            write: Input/output error
            write: Input/output error
            write: Input/output error
            write: Input/output error
            write: Input/output error
            write: Input/output error
            write: Input/output error
            write: Input/output error
             sanity test_821: @@@@@@ FAIL: multiop failed 
            write: Input/output error
              Trace dump:
            

            People

              Assignee: Vladimir Saveliev
              Reporter: Vladimir Saveliev
              Votes: 0
              Watchers: 7
