[LU-12708] Cannot access directory and lock timed out Created: 28/Aug/19  Updated: 21/Jul/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.6 Lustre 2.12.0+patches


Attachments: File dlmtrace.big.log.gz     File dlmtrace.log.gz     File dlmtrace_ls.log.gz     Text File fir-md1-s1.full.log    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On our 2.12 Fir filesystem, it looks like a directory is not accessible anymore; it is hosted on MDT0000:

/fir/users/bjing/caspposes/CASP11/taskdir1
FID: 0x200029d02:0x1b59c:0x0

[root@fir-rbh01 ~]# lfs fid2path /fir 0x200029d02:0x1b59c:0x0
/fir/users/bjing/caspposes/CASP11/taskdir1
[root@fir-rbh01 ~]# lfs getdirstripe /fir/users/bjing/caspposes/CASP11/taskdir1
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none

strace of ls:

stat("/fir/users/bjing/caspposes/CASP11/", {st_mode=S_IFDIR|S_ISGID|0775, st_size=12288, ...}) = 0
openat(AT_FDCWD, "/fir/users/bjing/caspposes/CASP11/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3,

Logs showing the FID on the MDS of MDT0000:

[root@fir-md1-s1 ~]# journalctl -n 100000 -k | grep 0x200029d02:0x1b59c:0x0
Aug 11 19:29:07 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 57s: evicting client at 10.8.26.28@o2ib6  ns: mdt-fir-MDT0000_UUID lock: ffff8f32cb2f7740/0x5d9ee6c5054b1779 lrc: 4/0,0 mode: PR/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 40 type: IBT flags: 0x60200400000020 nid: 10.8.26.28@o2ib6 remote: 0xff0b1f607b1120a1 expref: 3155 pid: 97646 timeout: 4692007 lvb_type: 0
Aug 11 19:29:47 fir-md1-s1 kernel: LustreError: 23597:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1565576897, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f3483a0ec00/0x5d9ee6c5055796b8 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 34 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23597 timeout: 0 lvb_type: 0
Aug 11 19:30:50 fir-md1-s1 kernel: LustreError: 21003:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1565576960, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f3245e35580/0x5d9ee6c5057392f2 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21003 timeout: 0 lvb_type: 0
Aug 27 13:10:04 fir-md1-s1 kernel: LustreError: 21452:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566936514, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2d9e083180/0x5d9ee6e65c38e54b lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 28 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21452 timeout: 0 lvb_type: 0
Aug 27 14:34:43 fir-md1-s1 kernel: LustreError: 23645:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566941593, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f28934e18c0/0x5d9ee6e686d8d27d lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 30 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23645 timeout: 0 lvb_type: 0
Aug 27 14:35:13 fir-md1-s1 kernel: LustreError: 50442:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566941623, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ee5499f80/0x5d9ee6e686e6af45 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 30 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 50442 timeout: 0 lvb_type: 0
Aug 27 22:18:01 fir-md1-s1 kernel: LustreError: 10504:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566969391, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f1275822640/0x5d9ee6e70bada6a3 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 31 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 10504 timeout: 0 lvb_type: 0
Aug 27 22:20:00 fir-md1-s1 kernel: LustreError: 20457:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566969510, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f07c5f1ba80/0x5d9ee6e70bdaec42 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 20457 timeout: 0 lvb_type: 0
Aug 27 22:20:30 fir-md1-s1 kernel: LustreError: 23607:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566969540, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ac3114a40/0x5d9ee6e70be681a2 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23607 timeout: 0 lvb_type: 0
Aug 28 10:28:39 fir-md1-s1 kernel: LustreError: 21681:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567013229, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f4136e41b00/0x5d9ee6e786713af9 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 35 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21681 timeout: 0 lvb_type: 0
Aug 28 10:29:09 fir-md1-s1 kernel: LustreError: 23681:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567013259, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ace24c140/0x5d9ee6e78694ca10 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 35 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23681 timeout: 0 lvb_type: 0
Aug 28 10:57:40 fir-md1-s1 kernel: LustreError: 23603:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567014969, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ab0598480/0x5d9ee6e78df1fd5f lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 37 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23603 timeout: 0 lvb_type: 0
Aug 28 10:58:10 fir-md1-s1 kernel: LustreError: 50447:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567014999, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2920b8cec0/0x5d9ee6e78e1a0d54 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 37 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 50447 timeout: 0 lvb_type: 0

My guess is that a client is still holding a lock on it. Is there a way to know which client (knowing the FID)?

Thanks!
Stephane



 Comments   
Comment by Stephane Thiell [ 28/Aug/19 ]

Whoops, I messed up the formatting of the Description, sorry about that. Let me know if there is a way to edit it.

Comment by Patrick Farrell (Inactive) [ 28/Aug/19 ]

Stephane,

The way to do it is to dump the lock tables and look for other locks on that resource.

lctl set_param debug=+dlmtrace
lctl clear
lctl set_param ldlm.dump_namespaces=1
lctl set_param debug=-dlmtrace
lctl dk > /tmp/[log] 

Please attach that here too.
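
Once you have the dump, a quick check (a sketch; the /tmp/dlmtrace.log path is just a placeholder for wherever you saved the dump) is to grep it for the resource by FID:

grep '0x200029d02:0x1b59c' /tmp/dlmtrace.log    # matches both the bare FID and the [res] form in lock lines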

Comment by Stephane Thiell [ 28/Aug/19 ]

Thanks Patrick! Did that but can't seem to find the FID in there, hmmm.

Comment by Stephane Thiell [ 28/Aug/19 ]

Note also that we run with the patch "LU-11285 ldlm: reprocess whole waiting queue for IBITS", which has not landed yet (https://review.whamcloud.com/#/c/33077/).

Comment by Patrick Farrell (Inactive) [ 28/Aug/19 ]

Then that strongly suggests the problem has cleared up, possibly via eviction - Are you still seeing log messages, etc, currently?

Comment by Stephane Thiell [ 28/Aug/19 ]

Unfortunately, the directory is still not accessible. When we try to access it, after a while I can see backtraces like this on the MDS:

Aug 28 11:11:56 fir-md1-s1 kernel: Lustre: 23561:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
                                     req@ffff8f0eaa486c00 x1643052947772848/t0(0) o101->f37a46e0-1e70-6b27-1459-0c7be76fae27@10.0.10.3@o2ib7:1/0 lens 584/3264 e 1 to 0 dl 1567015921 ref 2 fl Interpret:/0/0 rc 0/0
Aug 28 11:13:11 fir-md1-s1 kernel: LustreError: 23653:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567015901, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-M
Aug 28 11:15:01 fir-md1-s1 kernel: LNet: Service thread pid 23653 was inactive for 200.64s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Aug 28 11:15:01 fir-md1-s1 kernel: Pid: 23653, comm: mdt00_097 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Aug 28 11:15:01 fir-md1-s1 kernel: Call Trace:
Aug 28 11:15:01 fir-md1-s1 kernel:  [<ffffffffc0d92e55>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
Aug 28 11:15:01 fir-md1-s1 kernel:  [<ffffffffc0d93cbc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Aug 28 11:15:01 fir-md1-s1 kernel:  [<ffffffffc14744bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Aug 28 11:15:01 fir-md1-s1 kernel:  [<ffffffffc1474b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc1475d2a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc147dbc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc147aa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0d79d46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0da2707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0e296e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0e3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0dd4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0dd863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffff84ec1c31>] kthread+0xd1/0xe0
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffff85574c24>] ret_from_fork_nospec_begin+0xe/0x21
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Aug 28 11:15:02 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1567016102.23653
Comment by Patrick Farrell (Inactive) [ 28/Aug/19 ]

Stephane,

I don't see a FID or the LDLM lock description in those most recent messages; can you provide those if they exist?

Comment by Stephane Thiell [ 28/Aug/19 ]

Ah yes. This one is better I guess:

Aug 28 11:13:11 fir-md1-s1 kernel: LustreError: 23653:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567015901, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f14e3261440/0x5d9ee6e793a06640 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 39 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23653 timeout: 0 lvb_type: 0
Aug 28 11:15:01 fir-md1-s1 kernel: LNet: Service thread pid 23653 was inactive for 200.64s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Aug 28 11:15:01 fir-md1-s1 kernel: Pid: 23653, comm: mdt00_097 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Aug 28 11:15:01 fir-md1-s1 kernel: Call Trace:
Aug 28 11:15:01 fir-md1-s1 kernel:  [<ffffffffc0d92e55>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
Aug 28 11:15:01 fir-md1-s1 kernel:  [<ffffffffc0d93cbc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Aug 28 11:15:01 fir-md1-s1 kernel:  [<ffffffffc14744bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Aug 28 11:15:01 fir-md1-s1 kernel:  [<ffffffffc1474b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc1475d2a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc147dbc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc147aa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0d79d46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0da2707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0e296e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0e3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0dd4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffc0dd863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffff84ec1c31>] kthread+0xd1/0xe0
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffff85574c24>] ret_from_fork_nospec_begin+0xe/0x21
Aug 28 11:15:02 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Aug 28 11:15:02 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1567016102.23653
Comment by Stephane Thiell [ 28/Aug/19 ]

The first 'lock callback timer expired' message in the logs was from August 11:

Aug 11 19:29:07 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 57s: evicting client at 10.8.26.28@o2ib6  ns: mdt-fir-MDT0000_UUID lock: ffff8f32cb2f7740/0x5d9ee6c5054b1779 lrc: 4/0,0 mode: PR/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 40 type: IBT flags: 0x60200400000020 nid: 10.8.26.28@o2ib6 remote: 0xff0b1f607b1120a1 expref: 3155 pid: 97646 timeout: 4692007 lvb_type: 0

So we tried powering off 10.8.26.28@o2ib6 (sh-26-28) just in case. The node was evicted by the MDTs but unfortunately, the directory is still not accessible.

Comment by Patrick Farrell (Inactive) [ 29/Aug/19 ]

Stephane,

Sorry, temporarily lost track of this in the shuffle yesterday.

First, specific thoughts here:
It looks like the lock table dump got cut off due to running out of debug buffer on the CPU doing the dumping.  That's probably why we don't see the resource in the lock tables - Sorry, this didn't immediately occur to me yesterday, but it seems to be what happened.

Let's try it again with a much larger debug_mb - So, save your debug_mb value, then set it to something like 20000 (basically "huge"; you're going to turn it down again in a moment).  Then re-do the namespace dump like before (exactly as before - particularly making sure to include the lctl clear command), and re-set your debug_mb to the original value once done.
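
Put concretely, the sequence would look something like this (a sketch; 20000 and the dump filename are placeholders):

lctl get_param debug_mb                   # note the current value so it can be restored
lctl set_param debug_mb=20000             # temporarily enlarge the debug buffer
lctl set_param debug=+dlmtrace
lctl clear
lctl set_param ldlm.dump_namespaces=1
lctl set_param debug=-dlmtrace
lctl dk > /tmp/dlmtrace.big.log
lctl set_param debug_mb=<original value>  # restore the saved value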

Now, general thoughts:

Not having the full lock tables makes this harder to say (again, wish I had realized yesterday that we didn't have the whole thing - whoops), but:

This could definitely be a bug in LU-11285.  Basically, it kind of looks like the server is waiting for something that it shouldn't be - There was some sort of deadlock/hang/etc and a client holding a lock on that resource was evicted (10.8.26.28@o2ib6, as you noted), which should've freed things up.

We have, at least for now, basically replaced LU-11285 with LU-12017. LU-12017 should address the same issues as LU-11285, and is landed:

https://review.whamcloud.com/#/c/35057/

We are probably still going to land LU-11285 eventually, but it got bogged down a bit, and LU-12017 was ready before it.

So as we move forward here, that's something to consider - We believe LU-12017 will resolve the issues for which we took LU-11285, and we're more confident it's fully ready.  (For example, LU-12017 and not LU-11285 is slated for the next 2.12 release.)

Comment by Stephane Thiell [ 29/Aug/19 ]

Hi Patrick,
Thanks a lot for your response, and no worries! The directory is still blocked. Actually, I think it is more the parent directory that is the problem:

# lfs fid2path /fir 0x200029ce3:0x1b32:0x0
/fir/users/bjing/caspposes/CASP11

I tried to dump the locks with debug_mb=20000 but no luck; I can't find either FID (at least formatted as 0x200029ce3:0x1b32:0x0 or 0x200029d02:0x1b59c:0x0). Still, I'm attaching the result as dlmtrace.big.log.gz just in case.

If I try an ls from a client with +dlmtrace, I can see a few lines referencing FID 0x200029ce3:0x1b32:0x0; perhaps that helps?

[root@fir-md1-s1 ~]# grep 0x200029ce3:0x1b32:0x0 /tmp/dlmtrace_ls.log 
00010000:00010000:7.0:1567102724.224467:0:23718:0:(ldlm_lock.c:766:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: mdt-fir-MDT0000_UUID lock: ffff8f12417569c0/0x5d9ee6e875247e2e lrc: 3/1,0 mode: --/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x0/0x0 rrc: 80 type: IBT flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 23718 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1567102724.224471:0:23718:0:(ldlm_lock.c:1067:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-fir-MDT0000_UUID lock: ffff8f12417569c0/0x5d9ee6e875247e2e lrc: 3/1,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x13/0x0 rrc: 80 type: IBT flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 23718 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1567102724.224476:0:23718:0:(ldlm_request.c:516:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-fir-MDT0000_UUID lock: ffff8f12417569c0/0x5d9ee6e875247e2e lrc: 3/1,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x13/0x0 rrc: 80 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23718 timeout: 0 lvb_type: 0
00000004:00010000:7.0:1567102724.224486:0:23718:0:(mdt_handler.c:1856:mdt_getattr_name_lock()) ### Returning lock to client ns: mdt-fir-MDT0000_UUID lock: ffff8f12417569c0/0x5d9ee6e875247e2e lrc: 3/1,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x13/0x0 rrc: 80 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23718 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1567102724.224554:0:23718:0:(ldlm_lockd.c:1410:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-fir-MDT0000_UUID lock: ffff8f12417569c0/0x5d9ee6e875247e2e lrc: 3/0,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x13/0x0 rrc: 80 type: IBT flags: 0x40200000000000 nid: 10.9.0.1@o2ib4 remote: 0x193907ffa4cfdd71 expref: 14 pid: 23718 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1567102724.224563:0:23718:0:(ldlm_inodebits.c:90:ldlm_reprocess_inodebits_queue()) --- Reprocess resource [0x200029ce3:0x1b32:0x0].0x0 (ffff8f0bf614a780)
00010000:00010000:24.0:1567102724.224788:0:23565:0:(ldlm_lockd.c:1262:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37aaac0/0x5d9ee6e875247e35 lrc: 2/0,0 mode: --/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x0/0x0 rrc: 81 type: IBT flags: 0x40000000000000 nid: local remote: 0x193907ffa4cfdd8d expref: -99 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224796:0:23565:0:(ldlm_lock.c:766:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37ad340/0x5d9ee6e875247e3c lrc: 3/1,0 mode: --/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x0/0x0 rrc: 82 type: IBT flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224801:0:23565:0:(ldlm_lock.c:1067:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37ad340/0x5d9ee6e875247e3c lrc: 3/1,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x20/0x0 rrc: 82 type: IBT flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224806:0:23565:0:(ldlm_request.c:516:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37ad340/0x5d9ee6e875247e3c lrc: 3/1,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x20/0x0 rrc: 82 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224833:0:23565:0:(ldlm_lock.c:210:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37aaac0/0x5d9ee6e875247e35 lrc: 0/0,0 mode: --/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x20/0x0 rrc: 82 type: IBT flags: 0x44000000000000 nid: 10.9.0.1@o2ib4 remote: 0x193907ffa4cfdd8d expref: 16 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224837:0:23565:0:(ldlm_lockd.c:1410:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37ad340/0x5d9ee6e875247e3c lrc: 3/0,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x20/0x0 rrc: 81 type: IBT flags: 0x40200000000000 nid: 10.9.0.1@o2ib4 remote: 0x193907ffa4cfdd8d expref: 15 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224841:0:23565:0:(ldlm_inodebits.c:90:ldlm_reprocess_inodebits_queue()) --- Reprocess resource [0x200029ce3:0x1b32:0x0].0x0 (ffff8f0bf614a780)

I'm attaching dlmtrace_ls.log.gz for the full output (warning: does not contain a lock dump).

I'll review the tickets that you provided before doing anything else. It seems to affect only this specific path for now.

Comment by Stephane Thiell [ 29/Aug/19 ]

Patrick, I'm gonna build a b2_12 + "LU-11761 fld: let's caller to retry FLD_QUERY" (hopefully this one will land for 2.12.3, could you please check? it has a +1).

That way, we will get rid of the current patch for LU-11285 and add the one for LU-12017, and I think all of the other patches that we use have also landed.

Comment by Andreas Dilger [ 30/Aug/19 ]

Another way you might try to find the offending client is to grep in /proc/fs/lustre/mdt/*/exports/*/open_files for the FID(s) in question. It might be that this will list all of the blocked clients trying to read this directory, but it might only show the client that has the directory open?

Comment by Stephane Thiell [ 30/Aug/19 ]

Interesting! One FID shows up:

[root@fir-md1-s1 ~]# grep 0x200029ce3:0x1b32:0x0 /proc/fs/lustre/mdt/*/exports/*/open_files
/proc/fs/lustre/mdt/fir-MDT0000/exports/10.8.0.67@o2ib6/open_files:[0x200029ce3:0x1b32:0x0]
/proc/fs/lustre/mdt/fir-MDT0000/exports/10.8.0.68@o2ib6/open_files:[0x200029ce3:0x1b32:0x0]
/proc/fs/lustre/mdt/fir-MDT0000/exports/10.9.0.1@o2ib4/open_files:[0x200029ce3:0x1b32:0x0]
/proc/fs/lustre/mdt/fir-MDT0000/exports/10.9.0.62@o2ib4/open_files:[0x200029ce3:0x1b32:0x0]
/proc/fs/lustre/mdt/fir-MDT0000/exports/10.9.0.64@o2ib4/open_files:[0x200029ce3:0x1b32:0x0]

It looks like those are the blocked clients: 10.9.0.1@o2ib4 is us testing from a management node (getting blocked), and the other NIDs are login nodes; it might be the user trying to access the directory from different login nodes.
