[LU-11285] don't stop on the first blocked lock in ldlm_reprocess_queue() Created: 25/Aug/18 Updated: 05/Nov/21 Resolved: 12/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Improvement | Priority: | Major |
| Reporter: | Mikhail Pershin | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | DoM2 | | |
| Description |
|
ldlm_reprocess_queue() stops on the first blocked lock in the waiting queue, but for IBITS locks there may be further waiting locks that could be granted immediately because they do not interfere with that blocking lock. This is all about different inode bits; e.g., here is a resource dump from a racer run:

[ 1192.782632] Lustre: 13366:0:(ldlm_resource.c:1728:ldlm_resource_dump()) Granted locks (in reverse order):
[ 1192.782638] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c7700/0x5c7c511d9b5b4287 lrc: 3/0,0 mode: PW/PW res: [0x200000401:0x395:0x0].0x0 bits 0x40/0x0 rrc: 18 type: IBT flags: 0x60200400000020 nid: 0@lo remote: 0x5c7c511d9b5b4279 expref: 342 pid: 16202 timeout: 1193 lvb_type: 0
[ 1192.782643] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e209c000/0x5c7c511d9b5b44da lrc: 2/0,0 mode: PR/PR res: [0x200000401:0x395:0x0].0x0 bits 0x1a/0x0 rrc: 18 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x5c7c511d9b5b447f expref: 325 pid: 17956 timeout: 1193 lvb_type: 0
[ 1192.782647] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801dda96300/0x5c7c511d9b5b2274 lrc: 2/0,0 mode: CR/CR res: [0x200000401:0x395:0x0].0x0 bits 0x8/0x0 rrc: 18 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x5c7c511d9b5b2258 expref: 325 pid: 16202 timeout: 0 lvb_type: 0
[ 1192.782651] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e5b1e800/0x5c7c511d9b5b1ed8 lrc: 2/0,0 mode: CR/CR res: [0x200000401:0x395:0x0].0x0 bits 0x8/0x0 rrc: 18 type: IBT flags: 0x40000000000000 nid: 0@lo remote: 0x5c7c511d9b5b1ed1 expref: 342 pid: 18163 timeout: 0 lvb_type: 3
[ 1192.782653] Lustre: 13366:0:(ldlm_resource.c:1742:ldlm_resource_dump()) Waiting locks:
[ 1192.782657] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801dd895400/0x5c7c511d9b5b4504 lrc: 3/0,1 mode: --/PW res: [0x200000401:0x395:0x0].0x0 bits 0x40/0x0 rrc: 18 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 15795 timeout: 0 lvb_type: 0
[ 1192.782661] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e240af80/0x5c7c511d9b5b4519 lrc: 3/0,1 mode: --/PW res: [0x200000401:0x395:0x0].0x0 bits 0x40/0x0 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 13377 timeout: 0 lvb_type: 0
[ 1192.782665] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e6e32300/0x5c7c511d9b5b4806 lrc: 3/0,1 mode: --/EX res: [0x200000401:0x395:0x0].0x0 bits 0x21/0x0 rrc: 18 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 15808 timeout: 0 lvb_type: 0
[ 1192.782668] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b769c280/0x5c7c511d9b5b488b lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 15827 timeout: 0 lvb_type: 0
[ 1192.782672] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c6800/0x5c7c511d9b5b48ca lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 15807 timeout: 0 lvb_type: 0
[ 1192.782679] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c6300/0x5c7c511d9b5b4909 lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 17956 timeout: 0 lvb_type: 0
[ 1192.782683] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c6080/0x5c7c511d9b5b4b71 lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 13373 timeout: 0 lvb_type: 0
[ 1192.782686] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e5aa8000/0x5c7c511d9b5b4b9b lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x20/0x0 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 13378 timeout: 0 lvb_type: 0

The first lock in the waiting queue is a DOM lock waiting for another DOM lock, but the next one, an EX lock, has no matching bits with either DOM lock or with any of the granted locks, so it could be granted immediately. Instead it will wait for the DOM lock to be granted, because we don't allow locks to be granted out of order. With IBITS locks it should be safe to grant a waiting lock that has no ibits in common with any granted lock and with any waiting lock ahead of it.

The main reason for this improvement is DOM locks, because they may block for quite a long time: CLIO may need to lock a whole file region, so it takes the DOM lock and all the OST locks, with all the blocking ASTs that implies. A DOM lock may therefore wait for quite a long time, which is not a big problem for the lock itself thanks to the prolong mechanism, but while it sits in the common waiting queue with other metadata locks it may stall the whole lock processing for a while. With out-of-order lock granting this problem becomes much less severe. |
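For illustration, here is a small standalone model of the proposed reprocessing (a sketch with made-up simplified structures, not the actual ldlm_reprocess_queue() code): while scanning the waiting queue it accumulates the ibits of every blocked lock, and a later waiter is granted out of order only when its bits overlap neither those accumulated bits nor any conflicting granted lock. Replaying the dump above, only the EX lock with bits 0x21 can jump the queue, exactly as argued.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Lock modes seen in the dump; the compatibility matrix follows the
 * standard DLM conventions (EX conflicts with everything, PW with all
 * but NL/CR, PR with writers, CR only with EX). */
enum mode { NL, CR, CW, PR, PW, EX };

static bool modes_compatible(enum mode a, enum mode b)
{
	static const bool compat[6][6] = {
		/*        NL CR CW PR PW EX */
		/* NL */ { 1, 1, 1, 1, 1, 1 },
		/* CR */ { 1, 1, 1, 1, 1, 0 },
		/* CW */ { 1, 1, 1, 0, 0, 0 },
		/* PR */ { 1, 1, 0, 1, 0, 0 },
		/* PW */ { 1, 1, 0, 0, 0, 0 },
		/* EX */ { 1, 0, 0, 0, 0, 0 },
	};
	return compat[a][b];
}

/* Hypothetical simplified lock: mode plus inode bits.  Bit values as in
 * lustre_idl.h: 0x01 LOOKUP, 0x02 UPDATE, 0x04 OPEN, 0x08 LAYOUT,
 * 0x10 PERM, 0x20 XATTR, 0x40 DOM. */
struct lock { enum mode mode; uint64_t ibits; };

/* Two IBITS locks conflict only if the modes are incompatible AND the
 * inode bits overlap. */
static bool conflicts(const struct lock *a, const struct lock *b)
{
	return !modes_compatible(a->mode, b->mode) && (a->ibits & b->ibits);
}

int main(void)
{
	/* Granted and waiting locks copied from the dump above. */
	struct lock granted[12] = {
		{ PW, 0x40 }, { PR, 0x1a }, { CR, 0x08 }, { CR, 0x08 },
	};
	int ngranted = 4;
	const struct lock waiting[] = {
		{ PW, 0x40 }, { PW, 0x40 }, { EX, 0x21 }, { PR, 0x13 },
		{ PR, 0x13 }, { PR, 0x13 }, { PR, 0x13 }, { PR, 0x20 },
	};
	uint64_t blocked_ibits = 0;	/* union of bits of blocked waiters */

	for (int i = 0; i < 8; i++) {
		const struct lock *w = &waiting[i];
		/* A waiter may jump the queue only if its bits overlap no
		 * blocked waiter ahead of it and no conflicting granted lock. */
		bool blocked = (w->ibits & blocked_ibits) != 0;

		for (int g = 0; !blocked && g < ngranted; g++)
			blocked = conflicts(w, &granted[g]);

		if (blocked) {
			blocked_ibits |= w->ibits;
			printf("waiting[%d] bits %#5llx: still blocked\n",
			       i, (unsigned long long)w->ibits);
		} else {
			granted[ngranted++] = *w;	/* grant out of order */
			printf("waiting[%d] bits %#5llx: granted out of order\n",
			       i, (unsigned long long)w->ibits);
		}
	}
	return 0;
}
```

The "blocked bits first" check is what keeps granting order intact within each inode bit: a waiter never overtakes an earlier waiter it shares bits with, so only genuinely independent locks are granted early.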
| Comments |
| Comment by Gerrit Updater [ 25/Aug/18 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33077 |
| Comment by Stephane Thiell [ 28/Apr/19 ] |
|
In desperation I applied this patch this morning on all MDS on our Fir filesystem. This didn't allow us to resume production; we seem to hit:

00010000:00010000:2.0:1556479276.391990:0:97954:0:(ldlm_inodebits.c:103:ldlm_reprocess_inodebits_queue()) ### Skip reprocess, bl_ibits: 40 ns: mdt-fir-MDT0003_UUID lock: ffff931056ef7980/0xc660af4b73ae11d2 lrc: 3/0,1 mode: --/PW res: [0x28001b768:0x1b5d0:0x0].0x0 bits 0x40/0x0 rrc: 481 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 98645 timeout: 0 lvb_type: 0

Which seems to indicate the patch doesn't help. This FID comes up very often in our case. It is an empty file... parent dir is on MDT 3:

$ lfs fid2path /fir 0x28001b768:0x1b5d0:0x0
/fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/rna.ps

$ lfs getdirstripe /fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt
lmv_stripe_count: 0
lmv_stripe_offset: 3
lmv_hash_type: none

$ stat /fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/dot.ps
  File: ‘/fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/dot.ps’
  Size: 0  Blocks: 0  IO Block: 4194304  regular empty file
Device: e64e03a8h/3863872424d  Inode: 180145872330405329  Links: 1
Access: (0644/-rw-r--r--)  Uid: (293564/ amw579)  Gid: (40387/ rhiju)
Access: 2019-04-28 07:23:54.000000000 -0700
Modify: 2019-04-28 11:54:08.000000000 -0700
Change: 2019-04-28 11:54:08.000000000 -0700
Birth: - |
| Comment by Stephane Thiell [ 28/Apr/19 ] |
|
This user has 1762 jobs running that seem to re-create these empty files rna.ps/dot.ps in this directory on MDT3. I was able to unlink them but they came back, and that makes Lustre very unhappy... After unlinking dot.ps, the file came back with a new FID but this is the same issue:

Apr 28 12:44:58 fir-md1-s2 kernel: LustreError: 102428:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556480608, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff933040d386c0/0x4dfba1dda2f1f65b lrc: 3/0,1 mode: --/PW res: [0x28002498b:0x1:0x0].0x0 bits 0x40/0x0 rrc: 603 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 102428 timeout: 0 lvb_type: 0

[root@fir-rbh01 ~]# lfs fid2path /fir 0x28002498b:0x1:0x0
/fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/dot.ps

So it must be the access pattern and/or the many jobs trying to create these files that confuses the locking. We'll try to cancel those jobs. |
| Comment by Stephane Thiell [ 28/Apr/19 ] |
|
We put his jobs on hold and things are back to normal. The user told us that "In brief, both files are unfortunate temporary files that are side-effects of a process that my current set of jobs calls once every minute or so. (Of course, with on the order of 1k jobs running, those same files are hit substantially more often than 1/min.)" He will try to fix that but this is clearly an issue with DOM... |
| Comment by Alexey Lyashkov [ 09/May/19 ] |
|
Mike,

Resource: [0x2000098e8:0x2:0].0 (ffff880d5987f500) refcount = 4
Granted locks (in reverse order):
ns: NS lock: ffff880634af1680/0x6d1e17d7c609dcc lrc: 2/0,1 mode: EX/EX res: [0x2000098e8:0x2:0].0 bits 0x3/0 rrc: 4 type: IBITS flags: 0x40210400000020 nid: local remote: 0 exp: 0000000000000000 expref: -99 pid: 53694 timeout: 0 lvb_type: 0
Waiting locks:
ns: NS lock: ffff880bcf287080/0x6d1e17d7c609e51 lrc: 3/1,0 mode: --/PR res: [0x2000098e8:0x2:0].0 bits 0x13/0 rrc: 4 type: IBITS flags: 0x40210000000000 nid: local remote: 0 exp: 0000000000000000 expref: -99 pid: 53709 timeout: 0 lvb_type: 0
ns: NS lock: ffff880634af3cc0/0x6d1e17d7c609e90 lrc: 3/0,1 mode: --/PW res: [0x2000098e8:0x2:0].0 bits 0x40/0 rrc: 4 type: IBITS flags: 0x40010080000000 nid: local remote: 0 exp: 0000000000000000 expref: -99 pid: 53694 timeout: 0 lvb_type: 0

Backtraces:

Stack: __schedule schedule ldlm_completion_ast ldlm_cli_enqueue_local mdt_object_local_lock mdt_object_lock_internal mdt_object_lock_try mdt_getattr_name_lock mdt_intent_getattr mdt_intent_policy ldlm_lock_enqueue ldlm_handle_enqueue0 tgt_enqueue tgt_request_handle ptlrpc_server_handle_request ptlrpc_main kthread ret_from_fork
Progs: 53709 "mdt02_061"

Stack: __schedule schedule ldlm_completion_ast ldlm_cli_enqueue_local mdt_dom_discard_data mdt_reint_unlink mdt_reint_rec mdt_reint_internal mdt_reint tgt_request_handle ptlrpc_server_handle_request ptlrpc_main kthread ret_from_fork
Progs: 53694 "mdt02_046"

It looks like the DOM lock was obtained to get a correct size, but that is not a requirement for the getattr request, as it is used to look up a name in most cases. So the DOM bit could be dropped in the getattr case, and that would resolve the deadlock. |
| Comment by Mikhail Pershin [ 09/May/19 ] |
|
Alexey, getattr takes the DOM bit optionally, only if it is available (via trybits), and as you can see it has no DOM bit in the lock dump. The source of the blocking is the DOM lock from the unlink process, which is taken to discard data; it sits behind the getattr lock in the waiting queue, so unlink can't finish. This should be resolved by |
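For this particular dump the out-of-order reprocessing resolves the cycle by simple bit arithmetic (a hypothetical check, using the standard MDS_INODELOCK_* bit values from lustre_idl.h): the unlink thread's DOM discard lock (bits 0x40) overlaps neither the granted EX lock (bits 0x3) nor the getattr PR waiter ahead of it (bits 0x13), so it can be granted immediately; the discard then completes, the EX lock is released, and getattr proceeds.

```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
	/* Bits from the dump above (MDS_INODELOCK_* values). */
	uint64_t ex_unlink  = 0x03;	/* granted EX lock of the unlink thread */
	uint64_t pr_getattr = 0x13;	/* first waiter: getattr PR lock */
	uint64_t pw_dom     = 0x40;	/* second waiter: DOM PW discard lock */

	/* getattr conflicts with the granted EX lock, so it must wait... */
	assert((pr_getattr & ex_unlink) != 0);
	/* ...but the DOM lock conflicts with neither the granted EX lock
	 * nor the waiter ahead of it, so it may be granted out of order,
	 * letting unlink finish and release the EX lock. */
	assert((pw_dom & ex_unlink) == 0);
	assert((pw_dom & pr_getattr) == 0);
	return 0;
}
```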
| Comment by Alexey Lyashkov [ 09/May/19 ] |
|
Mike, okay. But what about taking the DOM lock together with the child bits on unlink? It would save a second lock enqueue, so any possibility of a deadlock would be avoided. |
| Comment by Mikhail Pershin [ 09/May/19 ] |
|
Unfortunately we can't do that with DOM: at the moment the lock is taken we don't know whether this is the last unlink or not. That is known only later, with the lock already taken. On the OST we always know the object is being deleted, so we have no such problem there. |
| Comment by Alexey Lyashkov [ 10/May/19 ] |
|
The last-unlink case doesn't need to hold any locks in the MD namespace. |
| Comment by Mikhail Pershin [ 10/May/19 ] |
|
Yes, and that is why we cannot take the discard DOM lock when unlink starts, combining it with the other child bits; it has to be taken when the object is actually being deleted. That is the answer to your proposal, a few comments above, to combine the DOM lock with the child bits on unlink. |
| Comment by Alexey Lyashkov [ 10/May/19 ] |
|
Understood. I was confused because you take the DOM lock without first releasing the MD locks. |
| Comment by Andriy Skulysh [ 10/May/19 ] |
|
I confirm that it resolves |
| Comment by Gerrit Updater [ 03/Jun/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35045 |
| Comment by Gerrit Updater [ 12/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35045/ |
| Comment by Gerrit Updater [ 28/Aug/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35955 |
| Comment by Gerrit Updater [ 12/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35955/ |
| Comment by Peter Jones [ 12/Sep/19 ] |
|
As per Mike, all necessary fixes have now landed for 2.13 |