[LU-11285] don't stop on the first blocked lock in ldlm_reprocess_queue() Created: 25/Aug/18  Updated: 05/Nov/21  Resolved: 12/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Improvement Priority: Major
Reporter: Mikhail Pershin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: DoM2

Attachments: Text File fir-md1-s2-dk-LU-11285.log    
Issue Links:
Related
is related to LU-10176 Data-on-MDT phase II Open
is related to LU-12834 MDT hung during failover Open
is related to LU-12037 Possible DNE issue leading to hung fi... Resolved
is related to LU-12017 Truncate vs setxattr deadlock with DoM Resolved
is related to LU-11751 racer deadlocks due to DOM glimpse re... Closed

 Description   

The ldlm_reprocess_queue() function stops at the first blocked lock in the waiting queue, but for IBITS locks there may be later waiting locks that could be granted immediately because they do not interfere with the blocked lock, i.e. they request different inodebits. For example, here is a resource dump from a racer run:

[ 1192.782632] Lustre: 13366:0:(ldlm_resource.c:1728:ldlm_resource_dump()) Granted locks (in reverse order):
[ 1192.782638] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c7700/0x5c7c511d9b5b4287 lrc: 3/0,0 mode: PW/PW res: [0x200000401:0x395:0x0].0x0 bits 0x40/0x0 rrc: 18 type: IBT flags: 0x60200400000020 nid: 0@lo remote: 0x5c7c511d9b5b4279 expref: 342 pid: 16202 timeout: 1193 lvb_type: 0
[ 1192.782643] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e209c000/0x5c7c511d9b5b44da lrc: 2/0,0 mode: PR/PR res: [0x200000401:0x395:0x0].0x0 bits 0x1a/0x0 rrc: 18 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x5c7c511d9b5b447f expref: 325 pid: 17956 timeout: 1193 lvb_type: 0
[ 1192.782647] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801dda96300/0x5c7c511d9b5b2274 lrc: 2/0,0 mode: CR/CR res: [0x200000401:0x395:0x0].0x0 bits 0x8/0x0 rrc: 18 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x5c7c511d9b5b2258 expref: 325 pid: 16202 timeout: 0 lvb_type: 0
[ 1192.782651] Lustre: 13366:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e5b1e800/0x5c7c511d9b5b1ed8 lrc: 2/0,0 mode: CR/CR res: [0x200000401:0x395:0x0].0x0 bits 0x8/0x0 rrc: 18 type: IBT flags: 0x40000000000000 nid: 0@lo remote: 0x5c7c511d9b5b1ed1 expref: 342 pid: 18163 timeout: 0 lvb_type: 3
[ 1192.782653] Lustre: 13366:0:(ldlm_resource.c:1742:ldlm_resource_dump()) Waiting locks:
[ 1192.782657] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801dd895400/0x5c7c511d9b5b4504 lrc: 3/0,1 mode: --/PW res: [0x200000401:0x395:0x0].0x0 bits 0x40/0x0 rrc: 18 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 15795 timeout: 0 lvb_type: 0
[ 1192.782661] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e240af80/0x5c7c511d9b5b4519 lrc: 3/0,1 mode: --/PW res: [0x200000401:0x395:0x0].0x0 bits 0x40/0x0 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 13377 timeout: 0 lvb_type: 0
[ 1192.782665] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e6e32300/0x5c7c511d9b5b4806 lrc: 3/0,1 mode: --/EX res: [0x200000401:0x395:0x0].0x0 bits 0x21/0x0 rrc: 18 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 15808 timeout: 0 lvb_type: 0
[ 1192.782668] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b769c280/0x5c7c511d9b5b488b lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 15827 timeout: 0 lvb_type: 0
[ 1192.782672] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c6800/0x5c7c511d9b5b48ca lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 15807 timeout: 0 lvb_type: 0
[ 1192.782679] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c6300/0x5c7c511d9b5b4909 lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 17956 timeout: 0 lvb_type: 0
[ 1192.782683] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8800b58c6080/0x5c7c511d9b5b4b71 lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x13/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 13373 timeout: 0 lvb_type: 0
[ 1192.782686] Lustre: 13366:0:(ldlm_resource.c:1744:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e5aa8000/0x5c7c511d9b5b4b9b lrc: 3/1,0 mode: --/PR res: [0x200000401:0x395:0x0].0x0 bits 0x20/0x0 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 13378 timeout: 0 lvb_type: 0

The first lock in the waiting queue is a DOM lock waiting for another DOM lock, but the next EX lock has no bits in common with either DOM lock or with any of the granted locks, so it could be granted immediately. Instead it waits for the DOM lock to be granted, because locks are not allowed to be granted out of order.
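
To make the overlap concrete, here is a quick check of the ibits from the dump above (using the usual IBITS rule that locks with disjoint bits do not conflict regardless of mode): the waiting EX lock requests bits 0x21, and

    0x21 & 0x40 = 0x0   (granted PW DOM lock and the two waiting PW DOM locks)
    0x21 & 0x1a = 0x0   (granted PR lock)
    0x21 & 0x08 = 0x0   (granted CR locks)

so nothing granted or queued ahead of it actually conflicts with it.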

With IBITS locks it should be safe to grant such a waiting lock when its ibits do not intersect with those of any granted lock or of any waiting lock ahead of it.
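
Below is a minimal sketch of that rule (not the actual Lustre implementation; the function name and try_grant() are hypothetical placeholders): walk the whole waiting queue, accumulate the ibits of waiters that remain blocked, and skip a later waiter only if its bits overlap with those.

/*
 * Minimal sketch of the reprocessing rule above, not the actual Lustre
 * code: try_grant() is a hypothetical stand-in for the usual IBITS
 * compatibility check against the granted list.  Instead of stopping at
 * the first blocked waiter, keep walking the queue and only hold back a
 * waiter whose inodebits overlap with an earlier still-blocked waiter.
 */
static bool try_grant(struct ldlm_lock *lock);  /* placeholder */

static void ibits_reprocess_sketch(struct list_head *waiting)
{
    struct ldlm_lock *lock, *tmp;
    __u64 blocked_ibits = 0;    /* bits of waiters that stayed blocked */

    list_for_each_entry_safe(lock, tmp, waiting, l_res_link) {
        __u64 ibits = lock->l_policy_data.l_inodebits.bits;

        /* an earlier blocked waiter claims one of these bits,
         * so do not grant out of order */
        if (ibits & blocked_ibits)
            continue;

        /* conflicts only with granted locks: it stays blocked and
         * later waiters with common bits must stay behind it */
        if (!try_grant(lock))
            blocked_ibits |= ibits;
    }
}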

The main motivation for this improvement is DOM locks, because they may block for quite a long time: CLIO may need to lock a whole file region, so it takes the DOM lock and all the OST locks, sending all the corresponding blocking ASTs. The DOM lock itself may therefore wait quite a while, which is not a big problem on its own thanks to the prolong mechanism, but while it sits in a common waiting queue with other metadata locks it can stall lock processing on the whole resource. With out-of-order lock granting this problem becomes much less severe.



 Comments   
Comment by Gerrit Updater [ 25/Aug/18 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33077
Subject: LU-11285 ldlm: reprocess whole waiting queue for IBITS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f8ef3d3aef8cffb493ec19d79fb80c8ee90d640f

Comment by Stephane Thiell [ 28/Apr/19 ]

In desperation I applied this patch this morning on all MDSes of our Fir filesystem. It didn't allow us to resume production; we seem to hit LU-11751 and we're currently blocked... I attached an lctl dk with +dlmtrace as fir-md1-s2-dk-LU-11285.log in case that helps. I see lines like these:

00010000:00010000:2.0:1556479276.391990:0:97954:0:(ldlm_inodebits.c:103:ldlm_reprocess_inodebits_queue()) ### Skip reprocess, bl_ibits: 40  ns: mdt-fir-MDT0003_UUID lock: ffff931056ef7980/0xc660af4b73ae11d2 lrc: 3/0,1 mode: --/PW res: [0x28001b768:0x1b5d0:0x0].0x0 bits 0x40/0x0 rrc: 481 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 98645 timeout: 0 lvb_type: 0

This seems to indicate the patch doesn't help. This FID comes up very often in our case. It is an empty file... the parent dir is on MDT 3:

$ lfs fid2path /fir 0x28001b768:0x1b5d0:0x0
/fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/rna.ps

$ lfs getdirstripe /fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none

$ stat /fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/dot.ps
  File: ‘/fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/dot.ps’
  Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
Device: e64e03a8h/3863872424d	Inode: 180145872330405329  Links: 1
Access: (0644/-rw-r--r--)  Uid: (293564/  amw579)   Gid: (40387/   rhiju)
Access: 2019-04-28 07:23:54.000000000 -0700
Modify: 2019-04-28 11:54:08.000000000 -0700
Change: 2019-04-28 11:54:08.000000000 -0700
 Birth: -
Comment by Stephane Thiell [ 28/Apr/19 ]

This user has 1762 jobs running that seem to re-create these empty files rna.ps/dot.ps in this directory on MDT3. I was able to unlink them but they came back, and that makes Lustre very unhappy...

After unlinking dot.ps, the file came back with a new FID but this is the same issue:

Apr 28 12:44:58 fir-md1-s2 kernel: LustreError: 102428:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556480608, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff933040d386c0/0x4dfba1dda2f1f65b lrc: 3/0,1 mode: --/PW res: [0x28002498b:0x1:0x0].0x0 bits 0x40/0x0 rrc: 603 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 102428 timeout: 0 lvb_type: 0
 
[root@fir-rbh01 ~]# lfs fid2path /fir 0x28002498b:0x1:0x0
/fir/users/amw579/SWM_Design_Benchmark/SAM_I_kt/dot.ps

So it must be the access pattern and/or many jobs trying to create these files that confuse the locking. We'll try to cancel those jobs.

Comment by Stephane Thiell [ 28/Apr/19 ]

We put his jobs on hold and things are back to normal. The user told us that "In brief, both files are unfortunate temporary files that are side-effects of a process that my current set of jobs calls once every minute or so. (Of course, with on the order of 1k jobs running, those same files are hit substantially more often than 1/min.)"

He will try to fix that but this is clearly an issue with DOM...

Comment by Alexey Lyashkov [ 09/May/19 ]

Mike,
I found another case related to this issue, but it looks like it can be fixed easily, and it may be our case.
It's an unlink vs getattr race. Unlink takes an EX lock with bits 0x3, and then wants a DOM lock to flush the data.
But getattr wants 0x3 plus the DOM lock, so it blocks the unlink from flushing the data. The corresponding resource lock dump:

--- Resource: [0x2000098e8:0x2:0].0 (ffff880d5987f500) refcount = 4
Granted locks (in reverse order):
ns: NS lock: ffff880634af1680/0x6d1e17d7c609dcc lrc: 2/0,1 mode: EX/EX res: [0x2000098e8:0x2:0].0 bits 0x3/0 rrc: 4 type: IBITS flags: 0x40210400000020 nid: local remote: 0 exp: 0000000000000000 expref: -99 pid: 53694 timeout: 0 lvb_type: 0
Waiting locks:
ns: NS lock: ffff880bcf287080/0x6d1e17d7c609e51 lrc: 3/1,0 mode: --/PR res: [0x2000098e8:0x2:0].0 bits 0x13/0 rrc: 4 type: IBITS flags: 0x40210000000000 nid: local remote: 0 exp: 0000000000000000 expref: -99 pid: 53709 timeout: 0 lvb_type: 0
ns: NS lock: ffff880634af3cc0/0x6d1e17d7c609e90 lrc: 3/0,1 mode: --/PW res: [0x2000098e8:0x2:0].0 bits 0x40/0 rrc: 4 type: IBITS flags: 0x40010080000000 nid: local remote: 0 exp: 0000000000000000 expref: -99 pid: 53694 timeout: 0 lvb_type: 0

backtraces

Stack :
__schedule
schedule
ldlm_completion_ast
ldlm_cli_enqueue_local
mdt_object_local_lock
mdt_object_lock_internal
mdt_object_lock_try
mdt_getattr_name_lock
mdt_intent_getattr
mdt_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
tgt_enqueue
tgt_request_handle
ptlrpc_server_handle_request
ptlrpc_main
kthread
ret_from_fork
Progs:  53709 "mdt02_061"

Stack :
__schedule
schedule
ldlm_completion_ast
ldlm_cli_enqueue_local
mdt_dom_discard_data
mdt_reint_unlink
mdt_reint_rec
mdt_reint_internal
mdt_reint
tgt_request_handle
ptlrpc_server_handle_request
ptlrpc_main
kthread
ret_from_fork
Progs:  53694 "mdt02_046"

It looks like the DOM lock was obtained to get a correct size, but that is not a requirement for the getattr request, since it is mostly used to look up a name. So the DOM bit could be dropped in the getattr case, and that would resolve the deadlock.

Comment by Mikhail Pershin [ 09/May/19 ]

Alexey, getattr takes the DOM bit only optionally, if it is available (via trybits), and it has no DOM bit in the lock dump as you can see. The source of the blocking is the DOM lock from the unlink process, which is taken to discard the data; it is waiting behind the getattr lock in the waiting queue, so the unlink can't finish. This should be resolved by LU-11359, which takes a non-blocking discard lock. But this is also exactly the case where the current reprocessing patch could help: it is clear that the last DOM lock is waiting behind the getattr lock for nothing, because they have no common bits, so it could be granted immediately instead of waiting.
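
For reference, checking the bits from the dump in the previous comment: the waiting PW DOM lock requests bit 0x40, and

    0x40 & 0x03 = 0x0   (granted EX lock held by unlink)
    0x40 & 0x13 = 0x0   (waiting PR getattr lock ahead of it)

so with whole-queue reprocessing the discard lock could be granted right away, the unlink could finish and drop its EX lock, and the getattr would then be granted as well.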

Comment by Alexey Lyashkov [ 09/May/19 ]

Mike,

OK. But what about taking the DOM lock together with the child bits on unlink? That would save the second lock enqueue, so any possibility of a deadlock would be avoided.

Comment by Mikhail Pershin [ 09/May/19 ]

Unfortunately we can't do that with DOM: at the moment the lock is taken we don't know whether this is the last unlink or not. That is known only later, with the lock already taken. On an OST we always know that the object is being deleted, so there is no such problem.
As I mentioned before, the LU-11359 patch solves that problem with the discard lock by using a non-blocking completion AST, so that type of deadlock should already be solved.

Comment by Alexey Lyashkov [ 10/May/19 ]

The last-unlink case doesn't need to hold any locks in the MD namespace.
One operation removes the object from the MD namespace and connects it to the orphan list; the second operation is orphan-list handling, which destroys the data under "data" locks.
It's the same as for any data placement: for OST objects you put a FID in the unlink llog, for DoM objects you can destroy the orphan object yourself. I don't see any difference in this case.

Comment by Mikhail Pershin [ 10/May/19 ]

Yes, and that is why we cannot take the discard DOM lock when the unlink starts, combining it with the other child bits; it has to be taken when the object is actually being deleted. That is the answer to your proposal a few comments above to combine the DOM lock with the child bits on unlink.

Comment by Alexey Lyashkov [ 10/May/19 ]

Understood, I was confused because you start taking the DoM lock without releasing the MD locks, while that isn't needed; releasing them would resolve this race without changes to the ldlm code, which affect performance.

Comment by Andriy Skulysh [ 10/May/19 ]

I confirm that it resolves the LU-12017 deadlock.

Comment by Gerrit Updater [ 03/Jun/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35045
Subject: LU-11285 mdt: improve IBITS lock definitions
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 355db9abe915bd78319a7100777b252934eadb91

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35045/
Subject: LU-11285 mdt: improve IBITS lock definitions
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3611352b699ce479779c0ff92ca558d9321e58a2

Comment by Gerrit Updater [ 28/Aug/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35955
Subject: LU-11285 mdt: improve IBITS lock definitions
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 88bf2ec7953ff34a69f34f9ff86fafbb0454b01e

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35955/
Subject: LU-11285 mdt: improve IBITS lock definitions
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: e71e845156becd1fc7efd676247bc85467881a38

Comment by Peter Jones [ 12/Sep/19 ]

As per Mike, all necessary fixes have now landed for 2.13
