Lustre / LU-12347

lustre write: do not send enqueue rpc while holding osc/mdc ldlm lock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0

    Description

      Lustre's write should not send an enqueue RPC to the MDS while holding an osc or mdc ldlm lock. This may currently happen via:

          cl_io_loop
            cl_io_lock                    <- ldlm lock is taken here
            cl_io_start
              vvp_io_write_start
              ...
                __generic_file_aio_write
                  file_remove_privs
                    security_inode_need_killpriv
                    ...
                      ll_xattr_get_common
                      ...
                        mdc_intent_lock   <- enqueue rpc is sent here
            cl_io_unlock                  <- ldlm lock is released
      

      That may lead to client eviction. The following scenario has been observed during write load with DoM involved:

      • a write holds an mdc ldlm lock (L1) and is waiting for a free RPC slot in
        obd_get_request_slot, trying to do ll_xattr_get_common().
      • all the RPC slots are occupied by write processes that are waiting for
        enqueue RPC completion.
      • the MDS, in order to serve those enqueue requests, has sent a blocking AST
        for lock L1 and eventually evicts the client, as the client does not
        cancel L1.
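
      To make the cycle concrete, the following is a minimal pthreads model of the scenario above (illustration only, not Lustre code: NSLOTS stands in for max_rpcs_in_flight, and rpc_slots/dlm_lock are invented names). The lock holder needs an RPC slot for its getxattr-style request, while every slot is held by an enqueue that cannot complete until that lock is cancelled.

          /* Illustration only, not Lustre code: a pthreads model of the
           * slot/lock cycle described above. */
          #include <pthread.h>
          #include <semaphore.h>
          #include <stdio.h>
          #include <unistd.h>

          #define NSLOTS 4                      /* stands in for max_rpcs_in_flight */

          static sem_t rpc_slots;               /* counting semaphore: free RPC slots */
          static pthread_mutex_t dlm_lock = PTHREAD_MUTEX_INITIALIZER;

          static void *writer0(void *arg)       /* the write holding lock L1 */
          {
              (void)arg;
              pthread_mutex_lock(&dlm_lock);    /* cl_io_lock: ldlm lock taken */
              sleep(1);                         /* let the enqueues fill the slots */
              printf("writer0: holds the lock, waits for an RPC slot (getxattr)\n");
              sem_wait(&rpc_slots);             /* never succeeds: all slots busy */
              sem_post(&rpc_slots);
              pthread_mutex_unlock(&dlm_lock);
              return NULL;
          }

          static void *enqueue(void *arg)       /* a write enqueueing a conflicting lock */
          {
              sem_wait(&rpc_slots);             /* obd_get_request_slot */
              printf("enqueue %ld: slot taken, waits for the lock to be cancelled\n",
                     (long)arg);
              pthread_mutex_lock(&dlm_lock);    /* "granted only after cancel" */
              pthread_mutex_unlock(&dlm_lock);
              sem_post(&rpc_slots);
              return NULL;
          }

          int main(void)
          {
              pthread_t w, e[NSLOTS];

              sem_init(&rpc_slots, 0, NSLOTS);
              pthread_create(&w, NULL, writer0, NULL);
              usleep(100 * 1000);               /* make sure writer0 takes the lock first */
              for (long i = 0; i < NSLOTS; i++)
                  pthread_create(&e[i], NULL, enqueue, (void *)i);
              sleep(3);
              printf("3s later nothing progressed: this is the cycle; on a real client\n"
                     "the server times out the blocking AST and evicts.\n");
              return 0;                         /* exiting main tears down the stuck threads */
          }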

      Another, more complex scenario caused by this problem has also been observed: clients get evicted by OSTs during mdtest+ior+failover HW testing.



          Activity

            Vladimir Saveliev added a comment (edited)

            "could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit."

            1. have max_rpcs_in_flight writes to max_rpcs_in_flight files. Have them pause somewhere at file_remove_suid->ll_xattr_cache_refill.
            2. have max_rpcs_in_flight writes to the same files from another client. The server will notice max_rpcs_in_flight conflicts and send blocking ASTs to the first client.
            3. The first client is unable to cancel the locks, as ll_xattr_cache_refill has to complete first.
            4. have max_rpcs_in_flight new writes enqueue DLM locks (because the existing locks are callback pending). Those new writes occupy RPC slots, as their enqueues will complete only after the enqueues from client2 are completed.
            5. The first writes want to do an enqueue in ll_xattr_find_get_lock, but all slots are occupied.

            Patchset 8 of https://review.whamcloud.com/#/c/34977/ contains this test: sanityn:105c.


            Mikhail Pershin added a comment

            So I would propose not to add an extra slot here but to keep all the other changes: adding IT_GETXATTR, removing ols_has_ref, and passing einfo in ldlm_cli_enqueue_fini(). Also I'd consider excluding DOM locks from consuming RPC slots, similarly to EXTENT locks.


            Mikhail Pershin added a comment

            FYI, I've found why all slots are filled with BRW/glimpse enqueue RPCs. There is an ldlm_lock_match() call in mdc_enqueue_send() which tries to find a granted or waiting lock so as not to enqueue a new similar lock, but the problem is that we have one CBPENDING lock which cannot be matched, and each new enqueue RPC gets stuck on the server waiting for it. Meanwhile, a new lock is put into the waiting queue on the client side only when it gets a reply from the server, i.e. when the enqueue RPC finishes, so there are no such locks yet: every one stays in an RPC slot waiting for the server response and is not added to the waiting queue, so each new enqueue matches no lock and also goes to the server, consuming a slot.
            I have some observations and proposals about that.
            1) ldlm_request_slot_needed() takes a slot only for FLOCK and IBITS locks but not EXTENT; I suppose that is because IO locks need no flow control, as they are usually the result of a file operation whose other RPCs are already sent under flow control. Maybe there are other reasons too. Anyway, DOM enqueue RPCs could similarly be excluded from taking an RPC slot (see the sketch below).
            2) All MDT locks are ATOMIC on the server, so the server waits for the lock to be granted before replying to the client. That keeps an enqueue RPC in its slot for quite a long time, which can also be a reason for MDC RPC flow control: to limit the number of outgoing locks and not overload the client import. OSC IO locks are asynchronous and the server replies without waiting for the lock to be granted. DOM locks are also 'atomic' right now, so they wait for the lock to be granted on the server. If they were handled asynchronously, they would not be stuck in RPC slots forever waiting for blocking locks. I have such a patch here: https://review.whamcloud.com/36903 and I think it will help with the current issue.
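
            A rough sketch of the check discussed in point 1), for illustration only - this is not the real ldlm_request_slot_needed() (its arguments and the LDLM/inodebits definitions live in the Lustre tree); the enum values, the struct and the DOM bit value below are simplified stand-ins:

            /* Illustration only: simplified stand-ins for the LDLM definitions. */
            #include <stdbool.h>
            #include <stdio.h>

            enum ldlm_type { LDLM_PLAIN, LDLM_EXTENT, LDLM_FLOCK, LDLM_IBITS };

            #define DOM_BIT 0x40UL              /* stand-in for the DoM inodebit */

            struct enqueue_info {
                enum ldlm_type type;
                unsigned long  inodebits;       /* meaningful for LDLM_IBITS only */
            };

            /* FLOCK and IBITS enqueues take an RPC slot; EXTENT (OSC IO) enqueues do
             * not.  The proposal: an IBITS enqueue asking only for the DOM bit is an
             * IO lock too, so let it skip the slot as well. */
            static bool request_slot_needed(const struct enqueue_info *ei)
            {
                if (ei->type == LDLM_FLOCK)
                    return true;
                if (ei->type == LDLM_IBITS)
                    return ei->inodebits != DOM_BIT;
                return false;                   /* LDLM_EXTENT, LDLM_PLAIN */
            }

            int main(void)
            {
                struct enqueue_info dom = { LDLM_IBITS, DOM_BIT };
                struct enqueue_info md  = { LDLM_IBITS, 0x01UL /* e.g. a lookup-like bit */ };
                struct enqueue_info brw = { LDLM_EXTENT, 0 };

                printf("DOM-only IBITS enqueue needs a slot: %d\n", request_slot_needed(&dom));
                printf("regular IBITS enqueue needs a slot:  %d\n", request_slot_needed(&md));
                printf("EXTENT enqueue needs a slot:         %d\n", request_slot_needed(&brw));
                return 0;
            }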

            Mikhail Pershin added a comment (edited)

            could you describe that lockup with an example, please? There were several related scenarios, I've lost track a bit.


            Vladimir Saveliev added a comment

            "I would think that if a lock is already enqueued then there is no need to do the same again and again, especially considering that DOM has a single range for all locks. I will check that part of the code."

            Ok, that makes sense; however, even if you fix that, we still need a fix for the RPC slot lockup in ll_file_write->remove_file_suid, because the lockup may happen during concurrent writes to different files as well.

            Mikhail Pershin added a comment (edited)

            Yes, that is what I thought; probably this is the reason. I would think that if a lock is already enqueued then there is no need to do the same again and again, especially considering that DOM has a single range for all locks. I will check that part of the code.


            Vladimir Saveliev added a comment

            "Also, by 'write using slot' do you mean the write lock enqueue or the real BRW?"

            lock enqueue in the following code path:

            ll_file_io_generic
              cl_io_loop
                cl_io_lock
                  cl_lockset_lock
                    cl_lock_request
                      cl_lock_enqueue
                        lov_lock_enqueue
                          mdc_lock_enqueue
                            mdc_enqueue_send
                              ldlm_lock_match            <-- here we get miss
                              ldlm_cli_enqueue
                                obd_get_request_slot  <-- slot is occupied
                                ptlrpc_queue_wait        <-- here all writes stuck, as server does not get cancel from clientA:write 1 because that one waits for free slots
            

            "OK, but I wonder why there are so many of those writes on A that they block all slots; if a new lock is needed, then that is one slot for the new enqueue. But it seems many slots are being used, and I am trying to understand why."

            Each of those additional writes enqueues its own lock; I believe this is because mdc_enqueue_send()->ldlm_lock_match() fails.

            All those cli->cl_max_rpcs_in_flight writes have the following:

            00000080:00000001:0.0:1579840865.602638:0:30118:0:(file.c:1995:ll_file_write()) Process entered
            ...
            00000002:00000001:0.0:1579840865.602825:0:30118:0:(mdc_dev.c:842:mdc_lock_enqueue()) Process entered
            ...
            00010000:00010000:0.0:1579840865.602838:0:30118:0:(ldlm_lock.c:1505:ldlm_lock_match_with_skip()) ### not matched ns ffff8c2607875c00 type 13 mode 2 res 8589935618/4 (0 0)
            ...
            grab rpc slot via ldlm_cli_enqueue ()->obd_get_request_slot()
            ...
            00010000:00010000:0.0:1579840865.602912:0:30118:0:(ldlm_request.c:1121:ldlm_cli_enqueue()) ### sending request ns: lustre-MDT0000-mdc-ffff8c2618072000 lock: ffff8c263c315e00/0xdd4ac26a917dc529 lrc: 3/0,1 mode: --/PW res: [0x200000402:0x4:0x0].0x0 bits 0x40/0x0 rrc: 6 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 30118 timeout: 0 lvb_type: 0
            ...
            

            Mikhail Pershin added a comment

            OK, but I wonder why there are so many of those writes on A that they block all slots; if a new lock is needed, then that is one slot for the new enqueue. But it seems many slots are being used, and I am trying to understand why. Also, by 'write using slot' do you mean the write lock enqueue or the real BRW?


            Vladimir Saveliev added a comment

            "could you explain this a bit: are these 'write processes' about the same file, and what enqueue are they waiting for?"

            Yes.

            1. clientA:write 1 writes file F:
            it gets somewhere around the point of entering file_remove_privs. It holds DLM lock L.

            2. clientB:write1 writes file F:
            it enqueues for a lock, the server sees a conflict with L owned by clientA, so it sends a blocking AST to clientA.

            3. clientA receives the blocking AST and marks the lock CB_PENDING, but it cannot cancel it because it is in use by clientA:write 1.

            4. 'other writes' to file F arrive on clientA. They have to do an enqueue (somewhere in cl_io_lock) because lock L is already CB_PENDING, so those writes will be granted the lock only after clientB gets and releases it. Thus they are blocked by the server while occupying RPC slots, possibly all of them.

            5. clientA:write1 continues to file_remove_privs()->getxattr and gets blocked, as all RPC slots are occupied.


            Mikhail Pershin added a comment

            Vladimir, the description says "all the RPC slots are occupied by write processes that are waiting for enqueue RPC completion" - could you explain this a bit: are these 'write processes' about the same file, and what enqueue are they waiting for? I don't yet get who saturates all the RPC slots.


            Vladimir Saveliev added a comment

            "So I have two suggestions for different, less complicated ways to fix this; I think (2) should definitely work, while I am less sure about (1).
            1. See if getxattr truly needs to be modifying. obd_skip_mod_rpc_slot considers it a modifying RPC, but it seems weird that it is. Perhaps it doesn't need to be - it would be good to know why getxattr is considered modifying. If it's not modifying, the problem is solved.
            2. Alternatively, looking at obd_mod_rpc_slot_avail_locked(), we see that one extra close request is allowed to avoid a deadlock:
              * On the MDC client, to avoid a potential deadlock (see Bugzilla 3462),
              * one close request is allowed above the maximum.
            If we did the same for getxattr, that should fix this as well.
            Either is much simpler than this, and I think we should be able to do one of those...?"

            Patchset 3 is an attempt to implement Patrick's proposal. Unfortunately, it ran into complications with figuring out whether the extra slot should be used.
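
            For reference, the "one extra request above the maximum" exception quoted above can be modeled roughly as follows (a simplified sketch, not the real obd_mod_rpc_slot_avail_locked(), whose locking and per-opcode accounting are omitted; the OPC_* names are invented): close may exceed the limit by one, and the proposal would extend the same exception to getxattr.

            /* Illustration only: a simplified model of the extra-slot exception. */
            #include <stdbool.h>
            #include <stdio.h>

            enum mod_opc { OPC_SETATTR, OPC_CLOSE, OPC_GETXATTR };

            struct client {
                unsigned int mod_rpcs_in_flight;
                unsigned int max_mod_rpcs_in_flight;
            };

            /* A close request may exceed the limit by one (the Bugzilla 3462 deadlock
             * avoidance); the proposal grants the same exception to getxattr sent from
             * the write path. */
            static bool mod_rpc_slot_avail(const struct client *cl, enum mod_opc opc)
            {
                bool close_like = (opc == OPC_CLOSE || opc == OPC_GETXATTR);

                if (cl->mod_rpcs_in_flight < cl->max_mod_rpcs_in_flight)
                    return true;
                return close_like &&
                       cl->mod_rpcs_in_flight == cl->max_mod_rpcs_in_flight;
            }

            int main(void)
            {
                struct client cl = { .mod_rpcs_in_flight = 8, .max_mod_rpcs_in_flight = 8 };

                printf("setattr with all mod slots used:  %d\n", mod_rpc_slot_avail(&cl, OPC_SETATTR));
                printf("close with all mod slots used:    %d\n", mod_rpc_slot_avail(&cl, OPC_CLOSE));
                printf("getxattr with all mod slots used: %d\n", mod_rpc_slot_avail(&cl, OPC_GETXATTR));
                return 0;
            }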

            Below are dumps of the tests included in the patch, run with the fix disabled:

            == sanity test 820: write and lfs setstripe race ===================================================== 07:42:51 (1579236171)
            fail_loc=0x80000329
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            write: Input/output error
             sanity test_820: @@@@@@ FAIL: multiop failed 
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Input/output error
            error: lt-lfs setstripe: invalid layout
            lt-lfs setstripe setstripe: cannot read layout from '/mnt/lustre/d820.sanity/f820.sanity': Transport endpoint is not connected
            error: lt-lfs setstripe: invalid layout
              Trace dump:
            

            and

            == sanity test 821: write race ======================================================================= 07:52:43 (1579236763)
            fail_loc=0x80000329
            write: Input/output error
            write: Input/output error
            write: Input/output error
            write: Input/output error
            write: Input/output error
            write: Input/output error
            write: Input/output error
            write: Input/output error
             sanity test_821: @@@@@@ FAIL: multiop failed 
            write: Input/output error
              Trace dump:
            

            People

              Assignee: Vladimir Saveliev
              Reporter: Vladimir Saveliev
              Votes: 0
              Watchers: 7
