[LU-12708] Cannot access directory and lock timed out Created: 28/Aug/19 Updated: 21/Jul/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Peter Jones |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6, Lustre 2.12.0+patches |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On our 2.12 Fir filesystem, it looks like a directory is not accessible anymore. It is hosted on MDT0000: /fir/users/bjing/caspposes/CASP11/taskdir1

[root@fir-rbh01 ~]# lfs fid2path /fir 0x200029d02:0x1b59c:0x0
/fir/users/bjing/caspposes/CASP11/taskdir1

[root@fir-rbh01 ~]# lfs getdirstripe /fir/users/bjing/caspposes/CASP11/taskdir1
lmv_stripe_count: 0
lmv_stripe_offset: 0
lmv_hash_type: none

strace of ls:

stat("/fir/users/bjing/caspposes/CASP11/", {st_mode=S_IFDIR|S_ISGID|0775, st_size=12288, ...}) = 0
openat(AT_FDCWD, "/fir/users/bjing/caspposes/CASP11/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3,
Logs showing the FID on the MDS of MDT0000:

[root@fir-md1-s1 ~]# journalctl -n 100000 -k | grep 0x200029d02:0x1b59c:0x0
Aug 11 19:29:07 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 57s: evicting client at 10.8.26.28@o2ib6 ns: mdt-fir-MDT0000_UUID lock: ffff8f32cb2f7740/0x5d9ee6c5054b1779 lrc: 4/0,0 mode: PR/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 40 type: IBT flags: 0x60200400000020 nid: 10.8.26.28@o2ib6 remote: 0xff0b1f607b1120a1 expref: 3155 pid: 97646 timeout: 4692007 lvb_type: 0
Aug 11 19:29:47 fir-md1-s1 kernel: LustreError: 23597:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1565576897, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f3483a0ec00/0x5d9ee6c5055796b8 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 34 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23597 timeout: 0 lvb_type: 0
Aug 11 19:30:50 fir-md1-s1 kernel: LustreError: 21003:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1565576960, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f3245e35580/0x5d9ee6c5057392f2 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21003 timeout: 0 lvb_type: 0
Aug 27 13:10:04 fir-md1-s1 kernel: LustreError: 21452:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566936514, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2d9e083180/0x5d9ee6e65c38e54b lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 28 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21452 timeout: 0 lvb_type: 0
Aug 27 14:34:43 fir-md1-s1 kernel: LustreError: 23645:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566941593, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f28934e18c0/0x5d9ee6e686d8d27d lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 30 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23645 timeout: 0 lvb_type: 0
Aug 27 14:35:13 fir-md1-s1 kernel: LustreError: 50442:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566941623, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ee5499f80/0x5d9ee6e686e6af45 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 30 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 50442 timeout: 0 lvb_type: 0
Aug 27 22:18:01 fir-md1-s1 kernel: LustreError: 10504:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566969391, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f1275822640/0x5d9ee6e70bada6a3 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 31 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 10504 timeout: 0 lvb_type: 0
Aug 27 22:20:00 fir-md1-s1 kernel: LustreError: 20457:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566969510, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f07c5f1ba80/0x5d9ee6e70bdaec42 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 20457 timeout: 0 lvb_type: 0
Aug 27 22:20:30 fir-md1-s1 kernel: LustreError: 23607:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566969540, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ac3114a40/0x5d9ee6e70be681a2 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23607 timeout: 0 lvb_type: 0
Aug 28 10:28:39 fir-md1-s1 kernel: LustreError: 21681:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567013229, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f4136e41b00/0x5d9ee6e786713af9 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 35 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21681 timeout: 0 lvb_type: 0
Aug 28 10:29:09 fir-md1-s1 kernel: LustreError: 23681:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567013259, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ace24c140/0x5d9ee6e78694ca10 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 35 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23681 timeout: 0 lvb_type: 0
Aug 28 10:57:40 fir-md1-s1 kernel: LustreError: 23603:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567014969, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ab0598480/0x5d9ee6e78df1fd5f lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 37 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23603 timeout: 0 lvb_type: 0
Aug 28 10:58:10 fir-md1-s1 kernel: LustreError: 50447:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567014999, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2920b8cec0/0x5d9ee6e78e1a0d54 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 37 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 50447 timeout: 0 lvb_type: 0

My guess is that a client is still holding the lock on it. Is there a way to know which client (knowing the FID)? Thanks! |
| Comments |
| Comment by Stephane Thiell [ 28/Aug/19 ] |
|
Whoops, I messed up the formatting of the Description, sorry about that. Let me know if there is a way to edit it. |
| Comment by Patrick Farrell (Inactive) [ 28/Aug/19 ] |
|
Stephane,

The way to do it is to dump the lock tables and look for other locks on that resource:

lctl set_param debug=+dlmtrace
lctl clear
lctl set_param ldlm.dump_namespaces=1
lctl set_param debug=-dlmtrace
lctl dk > /tmp/[log]

Please attach that here too. |
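[Editor's note] A minimal sketch of the "look for other locks on that resource" step, assuming the dump above was written to a file named /tmp/dlm_dump.log (placeholder name; the FID is the one from the Description):

# Placeholder log path; search the namespace dump for the resource FID.
FID="0x200029d02:0x1b59c:0x0"
grep "$FID" /tmp/dlm_dump.log   # each matching lock line shows mode, flags and a "nid:" field
# The "nid:" field of a granted lock identifies the client (or "local" for server-side locks).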
| Comment by Stephane Thiell [ 28/Aug/19 ] |
|
Thanks Patrick! Did that but can't seem to find the FID in there, hmmm. |
| Comment by Stephane Thiell [ 28/Aug/19 ] |
|
Note also that we run with the patch " |
| Comment by Patrick Farrell (Inactive) [ 28/Aug/19 ] |
|
Then that strongly suggests the problem has cleared up, possibly via eviction. Are you still seeing log messages, etc., currently? |
| Comment by Stephane Thiell [ 28/Aug/19 ] |
|
Unfortunately, the directory is still not accessible. When we try to access it, after a while I can see backtraces like this on the MDS:

Aug 28 11:11:56 fir-md1-s1 kernel: Lustre: 23561:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
req@ffff8f0eaa486c00 x1643052947772848/t0(0) o101->f37a46e0-1e70-6b27-1459-0c7be76fae27@10.0.10.3@o2ib7:1/0 lens 584/3264 e 1 to 0 dl 1567015921 ref 2 fl Interpret:/0/0 rc 0/0
Aug 28 11:13:11 fir-md1-s1 kernel: LustreError: 23653:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567015901, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-M
Aug 28 11:15:01 fir-md1-s1 kernel: LNet: Service thread pid 23653 was inactive for 200.64s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Aug 28 11:15:01 fir-md1-s1 kernel: Pid: 23653, comm: mdt00_097 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Aug 28 11:15:01 fir-md1-s1 kernel: Call Trace:
Aug 28 11:15:01 fir-md1-s1 kernel: [<ffffffffc0d92e55>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
Aug 28 11:15:01 fir-md1-s1 kernel: [<ffffffffc0d93cbc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Aug 28 11:15:01 fir-md1-s1 kernel: [<ffffffffc14744bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Aug 28 11:15:01 fir-md1-s1 kernel: [<ffffffffc1474b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc1475d2a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc147dbc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc147aa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0d79d46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0da2707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0e296e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0e3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0dd4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0dd863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffff84ec1c31>] kthread+0xd1/0xe0
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffff85574c24>] ret_from_fork_nospec_begin+0xe/0x21
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Aug 28 11:15:02 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1567016102.23653
|
| Comment by Patrick Farrell (Inactive) [ 28/Aug/19 ] |
|
Stephane, I don't see a FID or the LDLM lock description in those most recent messages; can you provide those if they exist? |
| Comment by Stephane Thiell [ 28/Aug/19 ] |
|
Ah yes. This one is better I guess:

Aug 28 11:13:11 fir-md1-s1 kernel: LustreError: 23653:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567015901, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f14e3261440/0x5d9ee6e793a06640 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 39 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23653 timeout: 0 lvb_type: 0
Aug 28 11:15:01 fir-md1-s1 kernel: LNet: Service thread pid 23653 was inactive for 200.64s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Aug 28 11:15:01 fir-md1-s1 kernel: Pid: 23653, comm: mdt00_097 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Aug 28 11:15:01 fir-md1-s1 kernel: Call Trace:
Aug 28 11:15:01 fir-md1-s1 kernel: [<ffffffffc0d92e55>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
Aug 28 11:15:01 fir-md1-s1 kernel: [<ffffffffc0d93cbc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Aug 28 11:15:01 fir-md1-s1 kernel: [<ffffffffc14744bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Aug 28 11:15:01 fir-md1-s1 kernel: [<ffffffffc1474b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc1475d2a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc147dbc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc147aa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0d79d46>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0da2707>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0e296e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0e3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0dd4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffc0dd863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffff84ec1c31>] kthread+0xd1/0xe0
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffff85574c24>] ret_from_fork_nospec_begin+0xe/0x21
Aug 28 11:15:02 fir-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Aug 28 11:15:02 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1567016102.23653 |
| Comment by Stephane Thiell [ 28/Aug/19 ] |
|
The first 'lock callback timer expired' message in the logs was from August 11:

Aug 11 19:29:07 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 57s: evicting client at 10.8.26.28@o2ib6 ns: mdt-fir-MDT0000_UUID lock: ffff8f32cb2f7740/0x5d9ee6c5054b1779 lrc: 4/0,0 mode: PR/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 40 type: IBT flags: 0x60200400000020 nid: 10.8.26.28@o2ib6 remote: 0xff0b1f607b1120a1 expref: 3155 pid: 97646 timeout: 4692007 lvb_type: 0

So we just tried to power off 10.8.26.28@o2ib6 (sh-26-28) just in case. The node was evicted by the MDTs but unfortunately, the directory is still not accessible. |
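[Editor's note] As a hedged aside: rather than powering a node off, a single client can normally be evicted from one MDT via the evict_client tunable (assuming the standard server parameter is available on this 2.12 build; the NID is the one from the log above):

# Sketch: evict the suspect client from MDT0000 only, by NID (NID taken from the log above).
lctl set_param mdt.fir-MDT0000.evict_client=10.8.26.28@o2ib6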
| Comment by Patrick Farrell (Inactive) [ 29/Aug/19 ] |
|
Stephane,

Sorry, temporarily lost track of this in the shuffle yesterday.

First, specific thoughts here: Let's try it again with a much larger debug_mb. Save your debug_mb value, then set it to something like 20000 (basically "huge"; you're going to turn it down again in a moment). Then re-do the namespace dump like before (exactly as before, particularly making sure to include the lctl clear command), and re-set your debug_mb to the original value once done.

Now, general thoughts: Not having the full lock tables makes this harder to say (again, wish I had realized yesterday that we didn't have the whole thing - whoops), but: This could definitely be a bug in We have, at least for now, basically replaced [https://review.whamcloud.com/#/c/35057/ ] We are probably still going to land So as we move forward here, that's something to consider - We believe |
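[Editor's note] For reference, a minimal sketch of the save/raise/dump/restore cycle described above, using the same lctl parameters as earlier in the ticket (the dump filename is a placeholder):

# Save the current debug buffer size, enlarge it, redo the namespace dump, then restore it.
OLD_MB=$(lctl get_param -n debug_mb)   # remember the original value
lctl set_param debug_mb=20000          # "huge" buffer so the dump is not truncated
lctl set_param debug=+dlmtrace
lctl clear
lctl set_param ldlm.dump_namespaces=1
lctl set_param debug=-dlmtrace
lctl dk > /tmp/dlm_dump_full.log       # placeholder filename
lctl set_param debug_mb=$OLD_MB        # restore the original buffer size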
| Comment by Stephane Thiell [ 29/Aug/19 ] |
|
Hi Patrick,

# lfs fid2path /fir 0x200029ce3:0x1b32:0x0
/fir/users/bjing/caspposes/CASP11

I tried to dump the locks with debug_mb=20000 but no luck. I can't find either FID (at least formatted as 0x200029ce3:0x1b32:0x0 or 0x200029d02:0x1b59c:0x0). Still, I'm attaching the result as dlmtrace.big.log.gz

If I try an ls from a client with +dlmtrace, I can see a few lines referencing FID 0x200029ce3:0x1b32:0x0, perhaps that helps?

[root@fir-md1-s1 ~]# grep 0x200029ce3:0x1b32:0x0 /tmp/dlmtrace_ls.log
00010000:00010000:7.0:1567102724.224467:0:23718:0:(ldlm_lock.c:766:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: mdt-fir-MDT0000_UUID lock: ffff8f12417569c0/0x5d9ee6e875247e2e lrc: 3/1,0 mode: --/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x0/0x0 rrc: 80 type: IBT flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 23718 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1567102724.224471:0:23718:0:(ldlm_lock.c:1067:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-fir-MDT0000_UUID lock: ffff8f12417569c0/0x5d9ee6e875247e2e lrc: 3/1,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x13/0x0 rrc: 80 type: IBT flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 23718 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1567102724.224476:0:23718:0:(ldlm_request.c:516:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-fir-MDT0000_UUID lock: ffff8f12417569c0/0x5d9ee6e875247e2e lrc: 3/1,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x13/0x0 rrc: 80 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23718 timeout: 0 lvb_type: 0
00000004:00010000:7.0:1567102724.224486:0:23718:0:(mdt_handler.c:1856:mdt_getattr_name_lock()) ### Returning lock to client ns: mdt-fir-MDT0000_UUID lock: ffff8f12417569c0/0x5d9ee6e875247e2e lrc: 3/1,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x13/0x0 rrc: 80 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23718 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1567102724.224554:0:23718:0:(ldlm_lockd.c:1410:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-fir-MDT0000_UUID lock: ffff8f12417569c0/0x5d9ee6e875247e2e lrc: 3/0,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x13/0x0 rrc: 80 type: IBT flags: 0x40200000000000 nid: 10.9.0.1@o2ib4 remote: 0x193907ffa4cfdd71 expref: 14 pid: 23718 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1567102724.224563:0:23718:0:(ldlm_inodebits.c:90:ldlm_reprocess_inodebits_queue()) --- Reprocess resource [0x200029ce3:0x1b32:0x0].0x0 (ffff8f0bf614a780)
00010000:00010000:24.0:1567102724.224788:0:23565:0:(ldlm_lockd.c:1262:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37aaac0/0x5d9ee6e875247e35 lrc: 2/0,0 mode: --/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x0/0x0 rrc: 81 type: IBT flags: 0x40000000000000 nid: local remote: 0x193907ffa4cfdd8d expref: -99 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224796:0:23565:0:(ldlm_lock.c:766:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37ad340/0x5d9ee6e875247e3c lrc: 3/1,0 mode: --/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x0/0x0 rrc: 82 type: IBT flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224801:0:23565:0:(ldlm_lock.c:1067:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37ad340/0x5d9ee6e875247e3c lrc: 3/1,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x20/0x0 rrc: 82 type: IBT flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224806:0:23565:0:(ldlm_request.c:516:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37ad340/0x5d9ee6e875247e3c lrc: 3/1,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x20/0x0 rrc: 82 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224833:0:23565:0:(ldlm_lock.c:210:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37aaac0/0x5d9ee6e875247e35 lrc: 0/0,0 mode: --/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x20/0x0 rrc: 82 type: IBT flags: 0x44000000000000 nid: 10.9.0.1@o2ib4 remote: 0x193907ffa4cfdd8d expref: 16 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224837:0:23565:0:(ldlm_lockd.c:1410:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-fir-MDT0000_UUID lock: ffff8f0db37ad340/0x5d9ee6e875247e3c lrc: 3/0,0 mode: PR/PR res: [0x200029ce3:0x1b32:0x0].0x0 bits 0x20/0x0 rrc: 81 type: IBT flags: 0x40200000000000 nid: 10.9.0.1@o2ib4 remote: 0x193907ffa4cfdd8d expref: 15 pid: 23565 timeout: 0 lvb_type: 0
00010000:00010000:24.0:1567102724.224841:0:23565:0:(ldlm_inodebits.c:90:ldlm_reprocess_inodebits_queue()) --- Reprocess resource [0x200029ce3:0x1b32:0x0].0x0 (ffff8f0bf614a780)

I'm attaching dlmtrace_ls.log.gz

I'll review the tickets that you provided before doing anything else. It seems to affect only this specific path for now. |
| Comment by Stephane Thiell [ 29/Aug/19 ] |
|
Patrick, I'm gonna build a b2_12 + That way, we will get rid of the current patch for |
| Comment by Andreas Dilger [ 30/Aug/19 ] |
|
Another way you might try to find the offending client is to grep in /proc/fs/lustre/mdt/*/exports/*/open_files for the FID(s) in question. It might be that this will list all of the blocked clients trying to read this directory, but it might only show the client that has the directory open? |
| Comment by Stephane Thiell [ 30/Aug/19 ] |
|
Interesting! One FID shows up:
[root@fir-md1-s1 ~]# grep 0x200029ce3:0x1b32:0x0 /proc/fs/lustre/mdt/*/exports/*/open_files
/proc/fs/lustre/mdt/fir-MDT0000/exports/10.8.0.67@o2ib6/open_files:[0x200029ce3:0x1b32:0x0]
/proc/fs/lustre/mdt/fir-MDT0000/exports/10.8.0.68@o2ib6/open_files:[0x200029ce3:0x1b32:0x0]
/proc/fs/lustre/mdt/fir-MDT0000/exports/10.9.0.1@o2ib4/open_files:[0x200029ce3:0x1b32:0x0]
/proc/fs/lustre/mdt/fir-MDT0000/exports/10.9.0.62@o2ib4/open_files:[0x200029ce3:0x1b32:0x0]
/proc/fs/lustre/mdt/fir-MDT0000/exports/10.9.0.64@o2ib4/open_files:[0x200029ce3:0x1b32:0x0]

It looks like those are the blocked clients: 10.9.0.1@o2ib4 is us testing from a management node (getting blocked), and the other NIDs are login nodes; it might be the user trying to access the directory from different login nodes. |