[LU-12347] lustre write: do not send enqueue rpc while holding osc/mdc ldlm lock Created: 28/May/19  Updated: 05/Jul/23  Resolved: 30/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Vladimir Saveliev Assignee: Vladimir Saveliev
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-15639 replay-dual test_31 error: set_param... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Lustre's write should not send an enqueue rpc to the mds while holding an osc or mdc ldlm lock. Currently this may happen via:

    cl_io_loop
      cl_io_lock                    <- ldlm lock is taken here
      cl_io_start
        vvp_io_write_start
        ...
          __generic_file_aio_write
            file_remove_privs
              security_inode_need_killpriv
              ...
                ll_xattr_get_common
                ...
                  mdc_intent_lock   <- enqueue rpc is sent here
      cl_io_unlock                  <- ldlm lock is released

That may lead to client eviction. The following scenario has been observed during write load with DoM involved:

  • a write holds mdc ldlm lock (L1) and is waiting for a free rpc slot in
    obd_get_request_slot() while trying to do ll_xattr_get_common().
  • all the rpc slots are occupied by write processes which wait for enqueue
    rpc completion.
  • the mds, in order to serve the enqueue requests, has sent a blocking ast
    for lock L1 and eventually evicts the client as it does not cancel
    L1.

Another, more complex scenario caused by this problem has also been observed: clients get evicted by osts during mdtest+ior+failover hw testing.
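
To make the wait cycle in the DoM scenario above easier to see, here is a toy, userspace C model of the circular dependency. It is purely illustrative: the struct, function and constant names are invented for this sketch and are not Lustre symbols; the real players are obd_get_request_slot(), the mdc ldlm lock L1 and the blocking ast handling.

/* Toy model of the circular wait described above (illustrative only;
 * none of these names are Lustre symbols). */
#include <stdbool.h>
#include <stdio.h>

#define MAX_RPC_SLOTS 8

struct client_state {
    int  slots_used;         /* enqueue rpcs currently holding a slot */
    bool lock_l1_held;       /* writer still holds mdc ldlm lock L1   */
    bool lock_l1_cb_pending; /* server sent a blocking ast for L1     */
};

/* the writer holding L1 needs a slot for the getxattr rpc issued from
 * file_remove_privs() */
static bool writer_can_get_slot(const struct client_state *c)
{
    return c->slots_used < MAX_RPC_SLOTS;
}

/* the other writers' enqueue rpcs are answered by the mds only once L1
 * has been cancelled */
static bool blocked_enqueues_can_complete(const struct client_state *c)
{
    return !c->lock_l1_held;
}

/* L1 can be cancelled only after its holder finishes the getxattr */
static bool lock_l1_can_be_cancelled(const struct client_state *c,
                                     bool getxattr_done)
{
    return c->lock_l1_cb_pending && getxattr_done;
}

int main(void)
{
    struct client_state c = {
        .slots_used = MAX_RPC_SLOTS,   /* all slots already taken */
        .lock_l1_held = true,
        .lock_l1_cb_pending = true,
    };

    printf("holder gets a slot:    %d\n", writer_can_get_slot(&c));
    printf("enqueues can complete: %d\n", blocked_enqueues_can_complete(&c));
    printf("L1 can be cancelled:   %d\n", lock_l1_can_be_cancelled(&c, false));
    /* all three print 0: nobody can make progress, and the mds
     * eventually evicts the client when the cancel of L1 never arrives */
    return 0;
}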



 Comments   
Comment by Gerrit Updater [ 28/May/19 ]

Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/34977
Subject: LU-12347 llite: call file_remove_privs before taking ldlm
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4b7265b8a06ec5dced501b2c25c7fd74894a7f45

Comment by Andreas Dilger [ 24/Oct/19 ]

The full trace looks like:

 PID: 20755  TASK: ffff8ff1f07fa080  CPU: 1   COMMAND: "xdd"
 #0 [ffff8ff1e3c47328] __schedule at ffffffffab367747
 #1 [ffff8ff1e3c473b0] schedule at ffffffffab367c49
 #2 [ffff8ff1e3c473c0] obd_get_request_slot at ffffffffc0692c24 [obdclass]
 #3 [ffff8ff1e3c47470] ldlm_cli_enqueue at ffffffffc083a7d0 [ptlrpc]
 #4 [ffff8ff1e3c47528] mdc_enqueue_base at ffffffffc0a10e81 [mdc]
 #5 [ffff8ff1e3c47640] mdc_intent_lock at ffffffffc0a12f15 [mdc]
 #6 [ffff8ff1e3c47718] lmv_intent_lock at ffffffffc09824f2 [lmv]
 #7 [ffff8ff1e3c477c8] ll_xattr_cache_refill at ffffffffc0aea395 [lustre]
 #8 [ffff8ff1e3c478a8] ll_xattr_cache_get at ffffffffc0aeb2ab [lustre]
 #9 [ffff8ff1e3c47900] ll_xattr_list at ffffffffc0ae7cdc [lustre]
#10 [ffff8ff1e3c47968] ll_xattr_get_common at ffffffffc0ae83ef [lustre]
#11 [ffff8ff1e3c479a8] ll_xattr_get_common_3_11 at ffffffffc0ae88d8 [lustre]
#12 [ffff8ff1e3c479b8] generic_getxattr at ffffffffaae693d2
#13 [ffff8ff1e3c479e8] cap_inode_need_killpriv at ffffffffaaef6b9f
#14 [ffff8ff1e3c479f8] security_inode_need_killpriv at ffffffffaaef938c
#15 [ffff8ff1e3c47a08] dentry_needs_remove_privs at ffffffffaae5e6bf
#16 [ffff8ff1e3c47a28] file_remove_privs at ffffffffaae5e8f8
#17 [ffff8ff1e3c47aa0] __generic_file_aio_write at ffffffffaadb88a8
#18 [ffff8ff1e3c47b20] __generic_file_write_iter at ffffffffc0afa36b [lustre]
#19 [ffff8ff1e3c47b90] vvp_io_write_start at ffffffffc0afe6ab [lustre]
#20 [ffff8ff1e3c47c00] cl_io_start at ffffffffc06d4828 [obdclass]
#21 [ffff8ff1e3c47c28] cl_io_loop at ffffffffc06d69fc [obdclass]
#22 [ffff8ff1e3c47c58] ll_file_io_generic at ffffffffc0ab4c1b [lustre]
#23 [ffff8ff1e3c47d60] ll_file_aio_write at ffffffffc0ab58b2 [lustre]
#24 [ffff8ff1e3c47dd8] ll_file_write at ffffffffc0ab5aa4 [lustre]
#25 [ffff8ff1e3c47ec0] vfs_write at ffffffffaae41650
#26 [ffff8ff1e3c47f00] sys_pwrite64 at ffffffffaae42632
#27 [ffff8ff1e3c47f50] system_call_fastpath at ffffffffab374ddb
Comment by Vladimir Saveliev [ 17/Jan/20 ]

So I have two suggestions for different ways to fix this that would be less complicated, and I think (2) should definitely work; I am less sure about (1).
1. See if getxattr truly needs to be modifying. obd_skip_mod_rpc_slot considers it a modifying RPC, but it seems weird that it is. Perhaps it doesn't need to be; it would be good to know why getxattr is considered modifying. If it's not modifying, the problem is solved.
2. Alternately, looking at obd_mod_rpc_slot_avail_locked(), we see that one extra close request is allowed to avoid a deadlock:
   "On the MDC client, to avoid a potential deadlock (see Bugzilla 3462),
    one close request is allowed above the maximum."
   If we did the same for getxattr, that should fix this as well.
Either is much simpler than this, and I think we should be able to do one of those...?

Patchset 3 is an attempt to implement Patrick's proposal. Unfortunately, it ran into a complication with finding out whether the extra slot is to be used.
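
For context, the "one extra slot above the maximum" behaviour quoted above can be modelled roughly as follows. This is a simplified userspace sketch of the availability check, not the actual obd_mod_rpc_slot_avail_locked() code, and the getxattr branch is only a hypothetical mirror of the close exception; all names are illustrative. The complication mentioned above is deciding, symmetrically, when such an extra getxattr slot counts as in use.

/* Simplified sketch of "allow one close request above the maximum",
 * plus a hypothetical getxattr exception; not the real Lustre code. */
#include <stdbool.h>
#include <stdio.h>

struct mod_rpc_state {
    int max_mod_rpcs;            /* max_mod_rpcs_in_flight              */
    int mod_rpcs_in_flight;      /* modify rpcs currently using a slot  */
    int close_rpcs_in_flight;    /* close rpcs among them               */
    int getxattr_rpcs_in_flight; /* hypothetical: getxattr rpcs in use  */
};

static bool mod_rpc_slot_avail(const struct mod_rpc_state *s,
                               bool close_req, bool getxattr_req)
{
    /* normal case: stay under the limit */
    if (s->mod_rpcs_in_flight < s->max_mod_rpcs)
        return true;
    /* one close request is allowed above the maximum (see the quote) */
    if (close_req && s->close_rpcs_in_flight == 0)
        return true;
    /* proposal (2): the same kind of exception for getxattr */
    if (getxattr_req && s->getxattr_rpcs_in_flight == 0)
        return true;
    return false;
}

int main(void)
{
    struct mod_rpc_state s = {
        .max_mod_rpcs = 8,
        .mod_rpcs_in_flight = 8,   /* all regular slots busy */
        .close_rpcs_in_flight = 0,
        .getxattr_rpcs_in_flight = 0,
    };

    printf("plain modify rpc gets slot: %d\n", mod_rpc_slot_avail(&s, false, false));
    printf("close rpc gets slot:        %d\n", mod_rpc_slot_avail(&s, true, false));
    printf("getxattr rpc gets slot:     %d\n", mod_rpc_slot_avail(&s, false, true));
    return 0;
}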

Below are dumps of the tests included in the patch, run with the fix disabled:

== sanity test 820: write and lfs setstripe race ===================================================== 07:42:51 (1579236171)
fail_loc=0x80000329
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
write: Input/output error
 sanity test_820: @@@@@@ FAIL: multiop failed 
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
error: lt-lfs setstripe: invalid layout
lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Transport endpoint is not connected
error: lt-lfs setstripe: invalid layout
  Trace dump:

and

== sanity test 821: write race ======================================================================= 07:52:43 (1579236763)
fail_loc=0x80000329
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
write: Input/output error
 sanity test_821: @@@@@@ FAIL: multiop failed 
write: Input/output error
  Trace dump:
Comment by Mikhail Pershin [ 24/Jan/20 ]

Vladimir, the description says: "all the rpc slots are occupied by write processes which wait for enqueue rpc completion" - could you explain this a bit: are these 'write processes' writing the same file, and what enqueue are they waiting for? I don't yet see what saturates all the rpc slots.

Comment by Vladimir Saveliev [ 24/Jan/20 ]

could you explain this a bit: are these 'write processes' writing the same file, and what enqueue are they waiting for?

Yes.

1. clientA:write1 writes file F:
it gets to just before entering file_remove_privs. It holds DLM lock L.

2. clientB:write1 writes file F:
it enqueues for the lock, the server sees a conflict with L owned by clientA, so it sends a blocking ast to clientA.

3. clientA receives the blocking ast and marks the lock CB_PENDING, but it cannot cancel it because it is in use by clientA:write1.

4. 'other writes' to file F arrive on clientA. They have to enqueue (somewhere in cl_io_lock) because lock L is already CB_PENDING. Those writes will be granted the lock only after clientB gets and releases it, so they are blocked by the server while occupying rpc slots, possibly all of them.

5. clientA:write1 continues with file_remove_privs()->getxattr and gets blocked, as all rpc slots are occupied.

Comment by Mikhail Pershin [ 24/Jan/20 ]

OK, but I wonder why there are so many of those writes on A that they block all slots: if a new lock is needed, that is one slot for the new enqueue. But it seems many slots are being used, and I am trying to understand why. Also, by 'write using slot' do you mean the write lock enqueue or a real BRW?

Comment by Vladimir Saveliev [ 24/Jan/20 ]

Also, by 'write using slot' do you mean the write lock enqueue or a real BRW?

lock enqueue in the following code path:

ll_file_io_generic
  cl_io_loop
    cl_io_lock
      cl_lockset_lock
        cl_lock_request
          cl_lock_enqueue
            lov_lock_enqueue
              mdc_lock_enqueue
                mdc_enqueue_send
                  ldlm_lock_match            <-- here we get miss
                  ldlm_cli_enqueue
                    obd_get_request_slot  <-- slot is occupied
                    ptlrpc_queue_wait        <-- here all writes stuck, as server does not get cancel from clientA:write 1 because that one waits for free slots

OK, but I wonder why there are so many of those writes on A that they block all slots: if a new lock is needed, that is one slot for the new enqueue. But it seems many slots are being used, and I am trying to understand why.

Each of those additional writes enqueues its own lock; I believe this is because mdc_enqueue_send()->ldlm_lock_match() fails.

All those cli->cl_max_rpcs_in_flight writes have the following:

00000080:00000001:0.0:1579840865.602638:0:30118:0:(file.c:1995:ll_file_write()) Process entered
...
00000002:00000001:0.0:1579840865.602825:0:30118:0:(mdc_dev.c:842:mdc_lock_enqueue()) Process entered
...
00010000:00010000:0.0:1579840865.602838:0:30118:0:(ldlm_lock.c:1505:ldlm_lock_match_with_skip()) ### not matched ns ffff8c2607875c00 type 13 mode 2 res 8589935618/4 (0 0)
...
grab rpc slot via ldlm_cli_enqueue ()->obd_get_request_slot()
...
00010000:00010000:0.0:1579840865.602912:0:30118:0:(ldlm_request.c:1121:ldlm_cli_enqueue()) ### sending request ns: lustre-MDT0000-mdc-ffff8c2618072000 lock: ffff8c263c315e00/0xdd4ac26a917dc529 lrc: 3/0,1 mode: --/PW res: [0x200000402:0x4:0x0].0x0 bits 0x40/0x0 rrc: 6 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 30118 timeout: 0 lvb_type: 0
...
Comment by Mikhail Pershin [ 24/Jan/20 ]

Yes, that is what I thought; probably this is the reason. I would think that if a lock is already enqueued there is no need to do the same again and again, especially considering that DoM has a single range for all locks. I will check that part of the code.

Comment by Vladimir Saveliev [ 24/Jan/20 ]

I would think that if a lock is already enqueued there is no need to do the same again and again, especially considering that DoM has a single range for all locks. I will check that part of the code.

Ok, that makes sense. However, even if you fix that, we still need a fix for the rpc slot lockup on ll_file_write->remove_file_suid, because the lockup may happen during concurrent writes to different files as well.

Comment by Mikhail Pershin [ 24/Jan/20 ]

could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit.

Comment by Mikhail Pershin [ 30/Jan/20 ]

FYI, I've found why all slots are filled with BRW/glimpse enqueue RPCs. There is ldlm_lock_match() in mdc_enqueue_send() which tries to find any granted or waiting lock so as not to enqueue a new similar lock, but the problem is that we have one cbpending lock which can't be matched, and each new enqueue RPC gets stuck on the server waiting for it. Meanwhile a new lock is put on the client-side waiting queue only when it gets a reply from the server, i.e. when the enqueue RPC finishes, and there are no such locks: every one of them stays in an RPC slot waiting for the server response and is not added to the waiting queue yet, so each new enqueue matches no lock and also goes to the server, consuming slots.
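
The matching behaviour described above can be illustrated with a small model. This is only a sketch: lock_matchable() and the states below are invented for illustration, while the real logic lives in ldlm_lock_match() and the ldlm lock flags. A lock whose enqueue rpc is still in flight is not on the granted or waiting queue yet, and a cbpending lock is skipped, so every new write finds nothing to match and sends its own enqueue.

/* Toy model of why the match finds nothing here (illustrative only). */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum lock_state {
    LOCK_ENQUEUING,  /* enqueue rpc in flight, not queued on the client yet */
    LOCK_GRANTED,
    LOCK_WAITING,
    LOCK_CB_PENDING, /* blocking ast received, lock being cancelled */
};

static bool lock_matchable(enum lock_state s)
{
    /* only granted/waiting locks can be matched; cbpending and
     * still-enqueuing locks cannot */
    return s == LOCK_GRANTED || s == LOCK_WAITING;
}

int main(void)
{
    enum lock_state locks[] = {
        LOCK_CB_PENDING, /* the DoM lock held by the first write     */
        LOCK_ENQUEUING,  /* enqueues of the subsequent writes, each  */
        LOCK_ENQUEUING,  /* of them occupying an rpc slot            */
        LOCK_ENQUEUING,
    };
    bool matched = false;
    size_t i;

    for (i = 0; i < sizeof(locks) / sizeof(locks[0]); i++)
        matched |= lock_matchable(locks[i]);

    /* prints 0: no match, so the next write sends yet another enqueue */
    printf("match found: %d\n", matched);
    return 0;
}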
I have some observations and proposals about that.
1) ldlm_request_slot_needed() takes a slot only for FLOCK and IBITS locks but not EXTENT; I suppose that is because IO locks need no flow control, as they are usually the result of a file operation whose other RPCs are already sent under flow control. Maybe there are other reasons too. Anyway, the DoM enqueue RPC can similarly be excluded from taking an RPC slot (see the sketch after this list).
2) All MDT locks are ATOMIC on the server, so the server waits for the lock to be granted before replying to the client. That keeps the enqueue RPC in a slot for quite a long time, and that can also be a reason for MDC RPC flow control to limit the number of outgoing locks and not overload the client import. OSC IO locks are asynchronous and the server replies without waiting for the lock to be granted. DoM locks are also 'atomic' right now, so they wait for the lock to be granted on the server. If they were done in an async manner, they would not be stuck in RPC slots forever waiting for blocked locks. I have such a patch here: https://review.whamcloud.com/36903 and I think it will help with the current issue.
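
A sketch of the slot rule in observation 1) and of the proposed DoM exclusion follows. It is simplified: the real function is ldlm_request_slot_needed() in ptlrpc, the enum values here are placeholders, and the "is_dom" flag used to detect a DoM enqueue is an assumption of this sketch, not an existing parameter.

/* Simplified sketch of which enqueues consume an rpc slot (illustrative). */
#include <stdbool.h>

enum ldlm_type { LDLM_PLAIN, LDLM_EXTENT, LDLM_FLOCK, LDLM_IBITS };

/* current rule as described above: only FLOCK and IBITS enqueues take a
 * slot; EXTENT (IO) enqueues are already flow-controlled elsewhere */
bool request_slot_needed(enum ldlm_type type)
{
    return type == LDLM_FLOCK || type == LDLM_IBITS;
}

/* proposed variant: treat DoM enqueues (IBITS locks protecting file data)
 * like EXTENT locks and let them skip the slot; "is_dom" is a hypothetical
 * flag introduced only for this sketch */
bool request_slot_needed_proposed(enum ldlm_type type, bool is_dom)
{
    if (type == LDLM_IBITS && is_dom)
        return false;
    return type == LDLM_FLOCK || type == LDLM_IBITS;
}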

Comment by Mikhail Pershin [ 30/Jan/20 ]

So I would propose not to add an extra slot here but to keep all the other changes: adding IT_GETXATTR, removing ols_has_ref and passing einfo in ldlm_cli_enqueue_fini(). Also I'd consider excluding DoM locks from consuming RPC slots, similarly to EXTENT locks.

Comment by Vladimir Saveliev [ 31/Jan/20 ]

could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit.

1. Have max_rpcs_in_flight writes to max_rpcs_in_flight files; have them pause somewhere at file_remove_suid->ll_xattr_cache_refill.
2. Have max_rpcs_in_flight writes to the same files from another client. The server will notice max_rpcs_in_flight conflicts and send blocking asts to the first client.
3. The first client is unable to cancel the locks, as ll_xattr_cache_refill has to complete first.
4. Have max_rpcs_in_flight new writes enqueue dlm locks (because the locks are callback pending). Those new writes occupy rpc slots, as their enqueues will complete only after the enqueues from the second client are completed.
5. The first writes want to enqueue in ll_xattr_find_get_lock, but all slots are occupied.

Patchset 8 of https://review.whamcloud.com/#/c/34977/ contains this test: sanityn:105c.

Comment by Gerrit Updater [ 06/Jul/21 ]

Vladimir Saveliev (vlaidimir.saveliev@hpe.com) uploaded a new patch: https://review.whamcloud.com/44151
Subject: LU-12347 llite: do not take mod rpc slot for getxattr
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c68d11fcbdd03113b5618ea93b7662a5e5790dce

Comment by Vladimir Saveliev [ 06/Jul/21 ]

could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit.

With LU-13645 these scenarios became impossible.

The following however is still possible:

- clientA:write1 writes to file F and holds mdc ldlm lock (L1) and
  runs somewhere on the way to
  file_remove_privs()->ll_xattr_get_common()

- clientB:write is going to write file F and enqueues DoM lock. mds
  handles conflict on L1 and sends blocking ast to clientA

- clientA: max_mod_rpcs_in_flight simultaneous creates occupy all mod
  rpc slots and get delayed on mds side waiting for preallocated
  objects. Preallocation is delayed by ost failover.

- clientA:write1 tries to get a mod rpc slot to enqueue the xattr
  request, but all slots are busy, so lock L1 cannot be cancelled until
  one of the creates completes its rpc, which is stuck on preallocation.

- the lock callback timer expires on the mds first and the mds evicts clientA.

This can be fixed by adding IT_GETXATTR to mdc_skip_mod_rpc_slot().
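
A minimal sketch of that direction is below. It is illustrative only: the enum and the helper stand in for the real intent flags and for the slot-skipping check referenced above, and the exact set of intents that skip the slot in the actual tree may differ.

/* Sketch: treat getxattr intents as not needing a modify rpc slot. */
#include <stdbool.h>

enum intent_op { IT_OPEN, IT_CREAT, IT_GETATTR, IT_LOOKUP, IT_GETXATTR };

bool skip_mod_rpc_slot(enum intent_op op)
{
    switch (op) {
    case IT_GETATTR:
    case IT_LOOKUP:
    case IT_GETXATTR:  /* the addition proposed in this ticket */
        return true;
    default:
        return false;
    }
}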

but keep all the other changes: adding IT_GETXATTR, removing ols_has_ref

ok. see  https://review.whamcloud.com/44151

and passing einfo in ldlm_cli_enqueue_fini(). Also I'd consider excluding DoM locks from consuming RPC slots, similarly to EXTENT locks

These are already in as part of https://review.whamcloud.com/36903.

Comment by Gerrit Updater [ 30/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44151/
Subject: LU-12347 llite: do not take mod rpc slot for getxattr
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: eb64594e4473af859e74a0e831316cead0f5c49b

Comment by Peter Jones [ 30/Nov/21 ]

Landed for 2.15
