Lustre / LU-12037
Possible DNE issue leading to hung filesystem

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6 - Servers: 2.12.0, Clients: 2.12.0 + patch LU-11964
    • Severity: 2

    Description

      A new issue with 2.12 first manifested as directories becoming inaccessible from a specific client (10.9.0.1@o2ib4) and ended in a full filesystem deadlock. Restarting all MDTs temporarily resolved the issue...

      This is the console of fir-md1-s1 (which serves MDT0000 and MDT0002) when the directories started to become inaccessible:

      [1378713.475739] LNet: Service thread pid 48757 was inactive for 212.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [1378713.488774] LNet: Skipped 3 previous similar messages
      [1378713.494012] LustreError: dumping log to /tmp/lustre-log.1551393702.48757
      [1378716.547830] LustreError: dumping log to /tmp/lustre-log.1551393705.22268
      [1378721.106914] Lustre: fir-MDT0000-osp-MDT0002: Connection to fir-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [1378721.108926] LustreError: 51417:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393409, 300s ago); not entering recovery in server code, just going back to sleep ns:
      [1378721.161814] LustreError: 51459:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393409, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378721.202150] Lustre: fir-MDT0000: Received new LWP connection from 0@lo, removing former export from same NID
      [1378721.943952] LustreError: 22249:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393410, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378721.984222] LustreError: 22249:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
      [1378724.392997] LustreError: 51445:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393413, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378728.847103] LustreError: 22142:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393417, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378729.910121] LustreError: 21796:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393418, 300s ago); not entering recovery in server code, just going back to sleep ns:
      [1378729.949613] LustreError: 21796:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 4 previous similar messages
      [1378736.197272] LustreError: 21546:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393424, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378743.454431] LustreError: 21974:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393432, 300s ago); not entering recovery in server code, just going back to sleep ns:
      [1378753.250665] LustreError: 22233:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393441, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378753.290920] LustreError: 22233:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
      [1378771.845093] LNet: Service thread pid 22241 was inactive for 236.44s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [1378771.858133] LNet: Skipped 1 previous similar message
      [1378771.863282] LustreError: dumping log to /tmp/lustre-log.1551393760.22241
      [1378774.917158] LustreError: dumping log to /tmp/lustre-log.1551393763.21826
      [1378789.253489] LustreError: dumping log to /tmp/lustre-log.1551393777.51431
      [1378796.422652] LustreError: dumping log to /tmp/lustre-log.1551393785.22159
      [1378800.517748] LustreError: dumping log to /tmp/lustre-log.1551393789.47900
      [1378801.468777] LustreError: 48757:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393490, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378801.509027] LustreError: 48757:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 3 previous similar messages
      [1378801.541773] LustreError: dumping log to /tmp/lustre-log.1551393790.21878
      [1378833.603516] LustreError: 22159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393522, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378833.643766] LustreError: 22159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
      [1378878.343542] LNet: Service thread pid 22157 was inactive for 313.10s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [1378878.356579] LNet: Skipped 5 previous similar messages
      [1378878.361831] LustreError: dumping log to /tmp/lustre-log.1551393867.22157
      [1378938.760938] LNet: Service thread pid 21779 was inactive for 362.88s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [1378938.778069] LNet: Skipped 3 previous similar messages
      [1378938.783305] Pid: 21779, comm: mdt01_015 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [1378938.793247] Call Trace:
      [1378938.795883]  [<ffffffffc12340bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
      [1378938.802998]  [<ffffffffc123612f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
      [1378938.810278]  [<ffffffffc12389ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
      [1378938.817141]  [<ffffffffc16507a2>] osp_md_object_lock+0x162/0x2d0 [osp]
      [1378938.823899]  [<ffffffffc1a7be03>] lod_object_lock+0xf3/0x7b0 [lod]
      [1378938.830299]  [<ffffffffc1afcd3e>] mdd_object_lock+0x3e/0xe0 [mdd]
      [1378938.836613]  [<ffffffffc199b2e1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
      [1378938.844053]  [<ffffffffc199b87a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [1378938.850995]  [<ffffffffc19bb022>] mdt_reint_rename_or_migrate.isra.51+0x362/0x860 [mdt]
      [1378938.859211]  [<ffffffffc19bb553>] mdt_reint_rename+0x13/0x20 [mdt]
      [1378938.865606]  [<ffffffffc19bb5e3>] mdt_reint_rec+0x83/0x210 [mdt]
      [1378938.871828]  [<ffffffffc1998133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [1378938.878586]  [<ffffffffc19a3497>] mdt_reint+0x67/0x140 [mdt]
      [1378938.884456]  [<ffffffffc12d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [1378938.891594]  [<ffffffffc127592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1378938.899485]  [<ffffffffc127925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [1378938.905998]  [<ffffffff8e8c1c31>] kthread+0xd1/0xe0
      [1378938.911090]  [<ffffffff8ef74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1378938.917767]  [<ffffffffffffffff>] 0xffffffffffffffff
      [1378938.922966] LustreError: dumping log to /tmp/lustre-log.1551393927.21779
      
      

      The MDT restarts occurred at ~15:47.

      I was able to gather some logs:

      • a tar of the lustre-log files above (automatically generated) is available on the FTP site (fir-md1-s1-lustre-log-20190228.tar.gz)
      • I ran the commands that Patrick provided in LU-11989 twice; the results are available in fir-md1-s1-20190228-1.log.gz and fir-md1-s1-20190228-2.log.gz, attached to this ticket (a sketch of the typical dump procedure follows this list)
      • fir-md1-s1 full kernel syslog dump in fir-md1-s1-kern-syslog-20190228.log
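
      For reference: debug dumps like the attached dk*/dkdlmtrace* files are typically produced with lctl's debug facility. A minimal sketch, not necessarily the exact LU-11989 procedure:

      # Sketch only -- the exact commands came from Patrick in LU-11989.
      lctl set_param debug=+dlmtrace   # add DLM lock tracing to the kernel debug mask
      lctl dk /tmp/fir-md1-s1-dk.log   # dump and clear the Lustre kernel debug buffer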

      A bit more about our DNE configuration:
      We use DNEv2 only to balance group and user directories across different MDTs (each such directory lives on a single MDT), but that's all. Basically:

      $ lfs getdirstripe /fir
      lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
      $ lfs getdirstripe /fir/users
      lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
      mdtidx           FID[seq:oid:ver]
           0           [0x200000400:0x5:0x0]  
           1           [0x240000402:0x5:0x0]  
           2           [0x2c0000400:0x5:0x0]  
           3           [0x280000401:0x5:0x0]  
      $ lfs getdirstripe /fir/groups
      lmv_stripe_count: 4 lmv_stripe_offset: 1 lmv_hash_type: fnv_1a_64
      mdtidx           FID[seq:oid:ver]
           1           [0x240000400:0x8:0x0]  
           2           [0x2c0000401:0x6:0x0]  
           3           [0x280000403:0x2:0x0]  
           0           [0x200000401:0x6:0x0]  
      $ lfs getdirstripe /fir/users/sthiell
      lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
      $ lfs getdirstripe /fir/users/kilian
      lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none
      
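      For reference: striped directories like those shown above are created with lfs mkdir. The following is an assumed reconstruction from the getdirstripe output, not the exact commands used on fir:

      # Assumed reconstruction -- flags inferred from the output above.
      lfs mkdir -c 4 -i 0 -H fnv_1a_64 /fir/users   # 4 stripes, starting at MDT0000
      lfs mkdir -i 1 /fir/users/kilian              # unstriped, placed on MDT0001
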

      Thanks!

      Attachments

        1. dk.fir-md1-s1.gz
          13.62 MB
        2. dk-full.fir-md1-s1.gz
          17.88 MB
        3. fir-md1-s1_20190304-dk.log.gz
          7.55 MB
        4. fir-md1-s1_20190403-dk.log.gz
          3.00 MB
        5. fir-md1-s1_20190412.log
          4.70 MB
        6. fir-md1-s1_console_full_20190309.log.gz
          424 kB
        7. fir-md1-s1_dkdlmtrace_20190314.log.gz
          46 kB
        8. fir-md1-s1_dkdlmtrace_20190327.log.gz
          562 kB
        9. fir-md1-s1_kern_20190314.log.gz
          155 kB
        10. fir-md1-s1_sysrq-t_20190308.log.gz
          343 kB
        11. fir-md1-s1-20190228-1.log.gz
          6.10 MB
        12. fir-md1-s1-20190228-2.log.gz
          747 kB
        13. fir-md1-s1-console-20190403.gz
          375 kB
        14. fir-md1-s1-kern_20190327.log.gz
          172 kB
        15. fir-md1-s1-kern-syslog-20190228.log
          598 kB
        16. fir-md1-s2_20190304-dk.log.gz
          1.75 MB
        17. fir-md1-s2_20190403-dk.log.gz
          1.76 MB
        18. fir-md1-s2_console_full_20190309.log.gz
          516 kB
        19. fir-md1-s2_dkdlmtrace_20190314.log.gz
          2.51 MB
        20. fir-md1-s2_dkdlmtrace_20190327.log.gz
          9.69 MB
        21. fir-md1-s2_kern_20190314.log.gz
          179 kB
        22. fir-md1-s2_sysrq-t_20190308.log.gz
          358 kB
        23. fir-md1-s2-console-20190403.gz
          362 kB
        24. fir-md1-s2-dlmtrace_20190311.log.gz
          4.26 MB
        25. fir-md1-s2-kern_20190327.log.gz
          133 kB


          Activity


            sthiell Stephane Thiell added a comment -

            Because mdt.*.enable_remote_rename=0 was impacting performance (it took a very long time for some users to mv within the filesystem), we decided to set enable_remote_rename=1 again. We did that last week and have survived until now, so I guess it was a wrong track. It was the other fixes that really helped, like async_discard or LU-11285 "ldlm: reprocess whole waiting queue for IBITS". I think this one can be closed as soon as LU-11285 has landed...
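
            For reference: enable_remote_rename is the tunable added by the LU-12037 patches. Assuming the standard lctl interface, toggling it would look like this (a sketch, not the exact commands used at the site):

            # Run set_param on each MDS; the -P form is run on the MGS and persists.
            lctl set_param mdt.*.enable_remote_rename=1      # allow cross-MDT rename
            lctl set_param -P mdt.*.enable_remote_rename=1   # persistent variant
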
            pjones Peter Jones added a comment -

            Yes, I think that switching to LU-11359 makes sense.

            sthiell Stephane Thiell added a comment - edited

            OK, no problem. Should I post updates on the DoM locking issue (mdt_dom_discard_data) in LU-11359 then? We have only applied the patch to ~400 clients so far, so the issue isn't resolved yet. Today was very difficult, with a lot of issues again; tomorrow we'll patch the rest of the ~1000 clients more aggressively with https://review.whamcloud.com/#/c/34071/11/

            Also, we hit LU-12120 while restarting the MDTs multiple times today while trying to resolve the DoM locking issues:

            fir-io1-s1:[Sun Apr 21 17:06:43 2019][6239672.584384] LustreError: 1474:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io1-s1:[Sun Apr 21 18:49:41 2019][ 6062.486131] LustreError: 65943:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io1-s2:[Sun Apr 21 16:46:46 2019][6242707.929072] LustreError: 108555:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io2-s2:[Sun Apr 21 16:50:04 2019][8112352.271586] LustreError: 3051:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io3-s1:[Sun Apr 21 16:49:47 2019][8112770.406551] LustreError: 65916:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io3-s1:[Sun Apr 21 16:49:47 2019][8112770.417849] LustreError: 65963:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io3-s1:[Sun Apr 21 16:49:47 2019][8112770.443818] LustreError: 62284:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io3-s2:[Sun Apr 21 17:06:45 2019][6239507.696718] LustreError: 37106:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            

            After many restart attempts, the filesystem seems to have stabilized at this point, for now...

            pjones Peter Jones added a comment -

            OK, for release tracking purposes, let's mark this ticket resolved, because both patches have landed for 2.13 and will be in 2.12.1. There is no shortage of other tickets if further action is still needed.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34616/
            Subject: LU-12037 mdt: add option for cross-MDT rename
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 9cbefa99f234586c1419243a4d2c136e4f405b29


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34410/
            Subject: LU-12037 mdt: add option for cross-MDT rename
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c6f3d533542b9462b5b8df95183d80321d4d9c34


            sthiell Stephane Thiell added a comment -

            FYI, we had another "event", so we decided to update the servers; now we're slowly starting to redeploy the clients, which will take a while.

            Our first client with async_discard:

            [root@sh-101-60 ~]# lctl get_param mdc.fir-*.import | grep -q async_discard && echo ok
            ok
            
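            A fleet-wide version of the same check, as a sketch using ClusterShell's clush (the node set below is hypothetical; -b folds identical output together):

            clush -w sh-[101-118]-[01-72] -b 'lctl get_param mdc.fir-*.import | grep -q async_discard && echo ok'
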

            Thanks much for your efforts! Fingers crossed.


            sthiell Stephane Thiell added a comment -

            Will try the following on top of 2.12.0:

            servers (fir):

            $ git log --oneline
            1819063 LU-11359 mdt: fix mdt_dom_discard_data() timeouts
            3ed10f4 LU-11964 mdc: prevent glimpse lock count grow
            565011c LU-12037 mdt: add option for cross-MDT rename
            b6be1d9 LU-12065 lnd: increase CQ entries
            6b2c97b LU-12037 mdt: call mdt_dom_discard_data() after rename unlock
            

            clients (sherlock):

            8a47fe6 LU-11359 mdt: fix mdt_dom_discard_data() timeouts
            8bf1d95 LU-12065 lnd: increase CQ entries
            7b8a3b3 LU-11964 mdc: prevent glimpse lock count grow
            

            sthiell Stephane Thiell added a comment -

            Thanks!!

            This does compile against my 2.12.0 branch: https://gist.github.com/thiell/ce2445aa1e37b365e50208ebbb0b5263

            Let me know if that sounds ok to you. I will build new packages after lunch.

            Thanks for the remote lock explanation, this is super interesting, and I'm glad to hear that it doesn't necessarily mean a remote rename!!


            pfarrell Patrick Farrell (Inactive) added a comment - edited

            Stephane,

            You can replace ldlm_is_granted(lock) with (lock->l_req_mode == lock->l_granted_mode).

            About the remote lock:
            If, as in the case from earlier, it's inside mdt_rename_lock and it's on a secondary MDT (not MDT0), then it's not an issue.  [Edit: Sorry, too brief....] It's just trying to get the rename lock on MDT0.  That does indicate some sort of hang if you're getting back traces, but it doesn't actually indicate a remote rename.  There is only a single rename lock ("big file system lock" or BFL, as Andreas labels it) for the whole file system.  This is kind of standard (EXT4, at least, is the same) as it massively simplifies rename locking (which is still quite complicated).


            People

              Assignee: laisiyao Lai Siyao
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: