Lustre / LU-12347

lustre write: do not send enqueue rpc while holding osc/mdc ldlm lock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0

    Description

      lustre's write path should not send an enqueue rpc to the mds while holding an osc or mdc ldlm lock. Currently this may happen via:

          cl_io_loop
            cl_io_lock                    <- ldlm lock is taken here
            cl_io_start
              vvp_io_write_start
              ...
                __generic_file_aio_write
                  file_remove_privs
                    security_inode_need_killpriv
                    ...
                      ll_xattr_get_common
                      ...
                        mdc_intent_lock   <- enqueue rpc is sent here
            cl_io_unlock                  <- ldlm lock is released
      

      That may lead to client eviction. The following scenario has been observed during write load with DoM involved:

      • write holds the mdc ldlm lock (L1) and is waiting for a free rpc slot
        in obd_get_request_slot() while trying to do ll_xattr_get_common().
      • all the rpc slots are busy with write processes that wait for enqueue
        rpc completion.
      • the mds, in order to serve those enqueue requests, has sent a blocking
        ast for lock L1 and eventually evicts the client because it does not
        cancel L1.

      Another, more complex scenario caused by this problem has also been observed: clients get evicted by OSTs during mdtest+ior+failover hw testing.

          Activity
            pjones Peter Jones added a comment -

            Landed for 2.15

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44151/
            Subject: LU-12347 llite: do not take mod rpc slot for getxattr
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: eb64594e4473af859e74a0e831316cead0f5c49b

            vsaveliev Vladimir Saveliev added a comment -

            could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit.

            With LU-13645 these scenarios became impossible.

            The following however is still possible:

            - clientA:write1 writes to file F, holds the mdc ldlm lock (L1) and
              is somewhere on the way to
              file_remove_privs()->ll_xattr_get_common()

            - clientB:write is going to write to file F and enqueues a DoM lock.
              The mds handles the conflict with L1 and sends a blocking ast to
              clientA

            - clientA: max_mod_rpcs_in_flight simultaneous creates occupy all mod
              rpc slots and get delayed on the mds side waiting for preallocated
              objects. Preallocation is delayed by ost failover.

            - clientA:write1 tries to get a mod rpc slot to send the xattr request,
              but all slots are busy, so lock L1 cannot be cancelled until one of
              the creates completes its rpc, which is stuck on preallocation.

            - the lock callback timer on the mds expires first and clientA gets
              evicted.
            

            This can be fixed by adding IT_GETXATTR to mdc_skip_mod_rpc_slot().
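
            For illustration, a minimal sketch of what such a change could look like,
            assuming the helper keeps roughly the shape it has in
            lustre/mdc/mdc_internal.h and that the IT_* intent opcodes and
            struct lookup_intent come from lustre/include/lustre_intent.h. The exact
            set of intents checked in the tree may differ; treat this as a sketch of
            the idea, not the landed patch:

                /* Sketch only: read-only intents should not consume a mod RPC
                 * slot.  Adding IT_GETXATTR lets the
                 * file_remove_privs()->ll_xattr_get_common() path proceed even
                 * when all mod RPC slots are held by stuck creates, so the
                 * conflicting DoM lock L1 can still be cancelled in time. */
                static inline bool mdc_skip_mod_rpc_slot(const struct lookup_intent *it)
                {
                        return it != NULL &&
                               (it->it_op == IT_GETATTR ||
                                it->it_op == IT_LOOKUP ||
                                it->it_op == IT_READDIR ||
                                it->it_op == IT_GETXATTR);   /* the proposed addition */
                }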

            but keep all other changes - IT_GETXATTR adding, removal of ols_has_ref

            ok, see https://review.whamcloud.com/44151

            and passing einfo in ldlm_cli_enqueue_fini(). Also I'd consider exclusion of DOM locks from consuming RPC slots similarly to EXTENT locks

            These are already in as part of https://review.whamcloud.com/36903.

            gerrit Gerrit Updater added a comment -

            Vladimir Saveliev (vlaidimir.saveliev@hpe.com) uploaded a new patch: https://review.whamcloud.com/44151
            Subject: LU-12347 llite: do not take mod rpc slot for getxattr
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c68d11fcbdd03113b5618ea93b7662a5e5790dce

            vsaveliev Vladimir Saveliev added a comment - - edited

            could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit.

            1. Have max_rpcs_in_flight writes to max_rpcs_in_flight files. Have them pause somewhere in file_remove_suid->ll_xattr_cache_refill.
            2. Have max_rpcs_in_flight writes to the same files from another client. The server will notice max_rpcs_in_flight conflicts and send blocking asts to the first client.
            3. The first client is unable to cancel the locks, as ll_xattr_cache_refill has to complete first.
            4. Have max_rpcs_in_flight new writes enqueue dlm locks (because the existing locks are callback pending). Those new writes occupy rpc slots, as their enqueues will complete only after the enqueues from client2 are completed.
            5. The first writes then want to enqueue in ll_xattr_find_get_lock, but all slots are occupied.

            Patchset 8 of https://review.whamcloud.com/#/c/34977/ contains this test: sanityn:105c.


            tappro Mikhail Pershin added a comment -

            So I would propose not to add an extra slot here but keep all other changes - IT_GETXATTR adding, removal of ols_has_ref and passing einfo in ldlm_cli_enqueue_fini(). Also I'd consider exclusion of DOM locks from consuming RPC slots similarly to EXTENT locks.


            tappro Mikhail Pershin added a comment -

            FYI, I've found why all slots are filled with BRW/glimpse enqueue RPCs. There is an ldlm_lock_match() in mdc_enqueue_send() which tries to find any granted or waiting lock so that a new similar lock is not enqueued, but the problem is that we have one callback-pending lock which can't be matched, and each new enqueue RPC gets stuck on the server waiting for it. Meanwhile, a new lock is put into the waiting queue on the client side only when it gets a reply from the server, i.e. when the enqueue RPC finishes, and there are no such locks yet: every one stays in an RPC slot waiting for the server response and is not added to the waiting queue, so each new enqueue matches no lock and goes to the server too, consuming slots.
            I have some observations and proposals about that.
            1) ldlm_request_slot_needed() takes a slot only for FLOCK and IBITS locks but not EXTENT. I suppose that is because IO locks need no flow control: they are usually the result of a file operation whose other RPCs are already sent under flow control. Maybe there are other reasons too. Anyway, the DOM enqueue RPC can also be excluded from taking an RPC slot similarly, as sketched below.
            2) All MDT locks are ATOMIC on the server, so the server waits for the lock to be granted before replying to the client. That keeps the enqueue RPC in a slot for quite a long time, which is also a reason for MDC RPC flow control: to limit the number of outgoing locks and not overload the client import. OSC IO locks are asynchronous and the server replies without waiting for the lock to be granted. DOM locks are also 'atomic' right now, so they wait for the lock to be granted on the server. If they were handled in an async manner, they would not get stuck in RPC slots forever waiting for blocking locks. I have such a patch here: https://review.whamcloud.com/36903 and I think it will help with the current issue.
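
            As a rough illustration of point 1), the DOM exclusion could look something
            like the sketch below. ldlm_request_slot_needed(), LDLM_FLOCK, LDLM_IBITS
            and MDS_INODELOCK_DOM are real Lustre names referenced above; passing the
            enqueue info here and the ei_type/ei_inodebits fields are assumptions about
            how the check would be plumbed through, not the actual content of
            https://review.whamcloud.com/36903:

                /* Sketch only: skip the MDC RPC slot for pure DOM enqueues, the
                 * same way EXTENT (OST IO) locks already skip it, so a DOM
                 * enqueue cannot pin a slot while the server waits for a
                 * conflicting lock to be cancelled. */
                static bool ldlm_request_slot_needed(struct ldlm_enqueue_info *einfo)
                {
                        if (einfo->ei_type == LDLM_IBITS &&
                            einfo->ei_inodebits == MDS_INODELOCK_DOM)
                                return false;   /* treat DOM like an IO lock */

                        /* flow control only for FLOCK and other IBITS enqueues */
                        return einfo->ei_type == LDLM_FLOCK ||
                               einfo->ei_type == LDLM_IBITS;
                }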

            tappro Mikhail Pershin added a comment - - edited

            could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit.


            vsaveliev Vladimir Saveliev added a comment -

            I would think that if lock is enqueued then no need to do the same again and again, especially considering that DOM has single range for all locks. I will check that part of code.

            Ok, that makes sense. However, even if you fix that, we still need a fix for the rpc slot lockup in ll_file_write->remove_file_suid, because the lockup may happen during concurrent writes to different files as well.

            tappro Mikhail Pershin added a comment - - edited

            yes, that is what I thought; probably this is the reason. I would think that if lock is enqueued then no need to do the same again and again, especially considering that DOM has single range for all locks. I will check that part of code.


            People

              Assignee: vsaveliev Vladimir Saveliev
              Reporter: vsaveliev Vladimir Saveliev
              Votes: 0
              Watchers: 7