
don't stop on the first blocked lock in ldlm_reprocess_queue()

Details


    Description

      ldlm_reprocess_queue() stops on the first blocked lock in the waiting queue, but for IBITS locks there may be further waiting locks that do not interfere with that blocked lock and could be granted immediately, because they cover different inode bits. For example, here is a resource dump from a racer run:

      [ 1192.782632] Lustre: 13366:0:(ldlm_resource.c:1728:ldlm_resource_dump()) Granted locks (in reverse order):
      [ 1192.782638] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c7700/0x5c7c511d9b5b4287 lrc: 3/0,0 mode: PW/PW res: [0x200000401:0x395:0x0].0x0 bits 0x40/0x0 rrc: 18 type: IBT flags: 0x60200400000020 nid: 0@lo remote: 0x5c7c511d9b5b4279 expref: 342 pid: 16202 timeout: 1193 lvb_type: 0
      [ 1192.782643] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e209c000/0x5c7c511d9b5b44da lrc: 2/0,0 mode: PR/PR res: [0x200000401:0x395:0x0].0x0 bits 0x1a/0x0 rrc: 18 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x5c7c511d9b5b447f expref: 325 pid: 17956 timeout: 1193 lvb_type: 0
      [ 1192.782647] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801dda96300/0x5c7c511d9b5b2274 lrc: 2/0,0 mode: CR/CR res: [0x200000401:0x395:0x0].0x0 bits 0x8/0x0 rrc: 18 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x5c7c511d9b5b2258 expref: 325 pid: 16202 timeout: 0 lvb_type: 0
      [ 1192.782651] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e5b1e800/0x5c7c511d9b5b1ed8 lrc: 2/0,0 mode: CR/CR res: [0x200000401:0x395:0x0].0x0 bits 0x8/0x0 rrc: 18 type: IBT flags: 0x40000000000000 nid: 0@lo remote: 0x5c7c511d9b5b1ed1 expref: 342 pid: 18163 timeout: 0 lvb_type: 3
      [ 1192.782653] Lustre: 13366:0:(ldlm_resource.c:1742:ldlm_resource_dump()) Waiting locks:
      [ 1192.782657] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801dd895400/0x5c7c511d9b5b4504 lrc: 3/0,1 mode: --/PW res: [0x200000401:0x395:0x0].0x0 bits 0x40/0x0 rrc: 18 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 15795 timeout: 0 lvb_type: 0
      [ 1192.782661] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e240af80/0x5c7c511d9b5b4519 lrc: 3/0,1 mode: --/PW res: [0x200000401:0x395:0x0].0x0 bits 0x40/0x0 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 13377 timeout: 0 lvb_type: 0
      [ 1192.782665] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e6e32300/0x5c7c511d9b5b4806 lrc: 3/0,1 mode: --/EX res: [0x200000401:0x395:0x0].0x0 bits 0x21/0x0 rrc: 18 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 15808 timeout: 0 lvb_type: 0
      [ 1192.782668] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b769c280/0x5c7c511d9b5b488b lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 15827 timeout: 0 lvb_type: 0
      [ 1192.782672] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c6800/0x5c7c511d9b5b48ca lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 15807 timeout: 0 lvb_type: 0
      [ 1192.782679] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c6300/0x5c7c511d9b5b4909 lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 17956 timeout: 0 lvb_type: 0
      [ 1192.782683] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c6080/0x5c7c511d9b5b4b71 lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 13373 timeout: 0 lvb_type: 0
      [ 1192.782686] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e5aa8000/0x5c7c511d9b5b4b9b lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x20/0x0 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 13378 timeout: 0 lvb_type: 0
      

      The first lock in the waiting queue is a DOM lock waiting for another DOM lock, but the EX lock that follows has no bits in common with either DOM lock or with any of the granted locks, so it could be granted immediately. Instead it will wait for the DOM lock to be granted, because we do not allow locks to be granted out of order.
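
      Spelling out the ibits from that dump:

      granted locks:        0x40 | 0x1a | 0x8 | 0x8 = 0x5a
      waiting DOM locks:    0x40, 0x40
      waiting EX lock:      0x21  (0x21 & 0x5a == 0 and 0x21 & 0x40 == 0)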

      With IBITS locks it should be safe to grant a waiting lock whose ibits do not intersect with the ibits of any granted lock or of any waiting lock ahead of it.
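
      A minimal, self-contained sketch of that rule follows. Lock modes are ignored for brevity, and toy_lock and reprocess_ibits_queue() are illustrative names rather than the actual ldlm code (the patch applied later in this ticket adds ldlm_reprocess_inodebits_queue() in ldlm_inodebits.c, as the debug lines in the comments show):

      #include <stdio.h>
      #include <stdint.h>
      #include <stddef.h>

      struct toy_lock {
          uint64_t ibits;   /* inode bits requested by this lock */
          int granted;      /* set once the lock is granted */
      };

      /*
       * Walk the waiting queue in order.  Instead of stopping at the first
       * blocked lock, remember the union of bits of blocked locks and keep
       * scanning: a later waiter may be granted as long as its bits intersect
       * neither the granted bits nor the bits of blocked waiters ahead of it,
       * so it never jumps over a conflicting earlier lock.
       */
      static void reprocess_ibits_queue(uint64_t granted_ibits,
                                        struct toy_lock *waiting, size_t count)
      {
          uint64_t blocked_ibits = 0;

          for (size_t i = 0; i < count; i++) {
              uint64_t bits = waiting[i].ibits;

              if ((bits & granted_ibits) == 0 &&
                  (bits & blocked_ibits) == 0) {
                  waiting[i].granted = 1;
                  granted_ibits |= bits;
              } else {
                  blocked_ibits |= bits;
              }
          }
      }

      int main(void)
      {
          /* the shape of the racer dump above: DOM (0x40) already granted,
           * two DOM waiters blocked behind it, then the EX 0x21 lock */
          struct toy_lock waiting[] = { { 0x40, 0 }, { 0x40, 0 }, { 0x21, 0 } };

          reprocess_ibits_queue(0x40 | 0x1a | 0x8, waiting, 3);
          for (size_t i = 0; i < 3; i++)
              printf("lock with bits %#llx: %s\n",
                     (unsigned long long)waiting[i].ibits,
                     waiting[i].granted ? "grant now" : "still blocked");
          return 0;
      }

      Running it reports both DOM waiters as still blocked while the 0x21 lock can be granted right away, which is exactly the out-of-order grant this improvement is meant to allow.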

      The main motivation for this improvement is DOM locks, because they may block for quite a long time: e.g., CLIO may need to lock a whole file region, taking the DOM lock and all OST locks with their blocking ASTs, so the DOM lock may wait for a long time. That by itself is not a big problem thanks to the prolong mechanism, but while the DOM lock sits in a common waiting queue with other metadata locks, it can stall all lock processing on the resource for a while. With out-of-order lock granting this problem becomes much less severe.


          Activity


            I confirm that it resolves the LU-12017 deadlock.

            askulysh Andriy Skulysh added a comment

            Understood. I was confused because you start taking a DoM lock without releasing the MD locks,
            while that isn't needed and this race could be resolved without changes in the ldlm code, which affect performance.

            shadow Alexey Lyashkov added a comment

            Yes, and that is why we cannot take the discard DOM lock when the unlink starts, combining it with the other child bits; it has to be taken when the object is actually being deleted. That is the answer to your proposal a few comments above to combine the DOM lock with the child bits on unlink.

            tappro Mikhail Pershin added a comment

            The last-unlink case doesn't need to hold any locks in the MD namespace.
            One operation removes the object from the MD namespace and attaches it to the orphan list; the second operation handles the orphan list and destroys the object with the "data" locks.
            It's the same as for any data placement: for OST objects you put the FID in the unlink llog, for DoM objects you can destroy the orphan object yourself. I don't see any difference in this case.

            shadow Alexey Lyashkov added a comment

            Unfortunately we can't do that with DOM: at the moment the lock is taken we don't know whether this is the last unlink or not. That is only known later, with the lock already taken. On the OST we always know the object is being deleted, so there is no such problem there.
            As I mentioned before, the LU-11359 patch solves that problem with the discard lock by using a non-blocking completion AST, so that type of deadlock should already be solved.

            tappro Mikhail Pershin added a comment

            Mike,

            okay. But what about taking the DOM lock with the child bits on unlink? It would save the second lock enqueue, so any possibility of deadlock would be avoided.

            shadow Alexey Lyashkov added a comment

            Alexey, the getattr takes the DOM bit only optionally, if it is available (via trybits), and as you can see it has no DOM bit in the locks dump. The source of blocking is the DOM lock from the unlink process, which is taken to discard data; it is waiting behind the getattr lock in the waiting queue, so the unlink can't finish. This should be resolved by LU-11359, which takes a non-blocking discard lock. But this is also exactly the case where the current reprocessing patch could help: the last DOM lock is clearly waiting behind the getattr for nothing, because they have no common bits, and it could be granted immediately instead of waiting.

            tappro Mikhail Pershin added a comment

            Mike,
            I found another case related to this issue, but it looks like it can be fixed easily, and it may be our case.
            It's an unlink vs. getattr race. Unlink takes an EX lock with bits 0x3 and then wants a DOM lock to flush the data.
            But getattr wants to have 0x3 + the DOM lock, so it blocks the unlink from flushing the data. Corresponding resource lock dump:

            --- Resource: [0x2000098e8:0x2:0].0 (ffff880d5987f500) refcount = 4
            Granted locks (in reverse order):
            ns: NS lock: ffff880634af1680/0x6d1e17d7c609dcc lrc: 2/0,1 mode: EX/EX res: [0x2000098e8:0x2:0].0 bits 0x3/0 rrc: 4 type: IBITS flags: 0x40210400000020 nid: local remote: 0 exp: 0000000000000000 expref: -99 pid: 53694 timeout: 0 lvb_type: 0
            Waiting locks:
            ns: NS lock: ffff880bcf287080/0x6d1e17d7c609e51 lrc: 3/1,0 mode: --/PR res: [0x2000098e8:0x2:0].0 bits 0x13/0 rrc: 4 type: IBITS flags: 0x40210000000000 nid: local remote: 0 exp: 0000000000000000 expref: -99 pid: 53709 timeout: 0 lvb_type: 0
            ns: NS lock: ffff880634af3cc0/0x6d1e17d7c609e90 lrc: 3/0,1 mode: --/PW res: [0x2000098e8:0x2:0].0 bits 0x40/0 rrc: 4 type: IBITS flags: 0x40010080000000 nid: local remote: 0 exp: 0000000000000000 expref: -99 pid: 53694 timeout: 0 lvb_type: 0
            

            backtraces

            Stack :
            __schedule
            schedule
            ldlm_completion_ast
            ldlm_cli_enqueue_local
            mdt_object_local_lock
            mdt_object_lock_internal
            mdt_object_lock_try
            mdt_getattr_name_lock
            mdt_intent_getattr
            mdt_intent_policy
            ldlm_lock_enqueue
            ldlm_handle_enqueue0
            tgt_enqueue
            tgt_request_handle
            ptlrpc_server_handle_request
            ptlrpc_main
            kthread
            ret_from_fork
            Progs:  53709 "mdt02_061"
            
            Stack :
            __schedule
            schedule
            ldlm_completion_ast
            ldlm_cli_enqueue_local
            mdt_dom_discard_data
            mdt_reint_unlink
            mdt_reint_rec
            mdt_reint_internal
            mdt_reint
            tgt_request_handle
            ptlrpc_server_handle_request
            ptlrpc_main
            kthread
            ret_from_fork
            Progs:  53694 "mdt02_046"
            

            It looks like the DOM lock was obtained to get a correct size, but that is not a requirement for the getattr request, as it is mostly used to look up a name. So the DOM bit can be dropped in the getattr case, and that resolves the deadlock.

            shadow Alexey Lyashkov added a comment

            We put his jobs on hold and things are back to normal. The user told us that "In brief, both files are unfortunate temporary files that are side-effects of a process that my current set of jobs calls once every minute or so. (Of course, with on the order of 1k jobs running, those same files are hit substantially more often than 1/min.)"

            He will try to fix that but this is clearly an issue with DOM...

            sthiell Stephane Thiell added a comment

            This user has 1762 jobs running that seem to re-create these empty files rna.ps/dot.ps in this directory on MDT3. I was able to unlink them, but they came back, and that makes Lustre very unhappy...

            After unlinking dot.ps, the file came back with a new FID, but this is the same issue:

            Apr 28 12:44:58 fir-md1-s2 kernel: LustreError: 102428:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556480608, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff933040d386c0/0x4dfba1dda2f1f65b lrc: 3/0,1 mode: --/PW res: [0x28002498b:0x1:0x0].0x0 bits 0x40/0x0 rrc: 603 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 102428 timeout: 0 lvb_type: 0
             
            [root@fir-rbh01 ~]# lfs fid2path /fir 0x28002498b:0x1:0x0
            /fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/dot.ps
            

            So it must be the access pattern and/or many jobs trying to create these files that confuse the locking. We'll try to cancel those jobs.

            sthiell Stephane Thiell added a comment

            In desperation, I applied this patch this morning on all MDSes on our Fir filesystem. It didn't allow us to resume production; we seem to be hitting LU-11751 and are currently blocked... Attached an lctl dk with +dlmtrace as fir-md1-s2-dk-LU-11285.log in case that helps. I see lines like these:

            00010000:00010000:2.0:1556479276.391990:0:97954:0:(ldlm_inodebits.c:103:ldlm_reprocess_inodebits_queue()) ### Skip reprocess, bl_ibits: 40  ns: mdt-fir-MDT0003_UUID lock: ffff931056ef7980/0xc660af4b73ae11d2 lrc: 3/0,1 mode: --/PW res: [0x28001b768:0x1b5d0:0x0].0x0 bits 0x40/0x0 rrc: 481 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 98645 timeout: 0 lvb_type: 0
            

            This seems to indicate the patch doesn't help. This FID comes up very often in our case. It is an empty file... the parent dir is on MDT 3:

            $ lfs fid2path /fir 0x28001b768:0x1b5d0:0x0
            /fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/rna.ps
            
            $ lfs getdirstripe /fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt
            lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
            
            $ stat /fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/dot.ps
              File: ‘/fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/dot.ps’
              Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
            Device: e64e03a8h/3863872424d	Inode: 180145872330405329  Links: 1
            Access: (0644/-rw-r--r--)  Uid: (293564/  amw579)   Gid: (40387/   rhiju)
            Access: 2019-04-28 07:23:54.000000000 -0700
            Modify: 2019-04-28 11:54:08.000000000 -0700
            Change: 2019-04-28 11:54:08.000000000 -0700
             Birth: -
            
            sthiell Stephane Thiell added a comment

            People

              Assignee: tappro Mikhail Pershin
              Reporter: tappro Mikhail Pershin
              Votes: 0
              Watchers: 9
