[LU-12037] Possible DNE issue leading to hung filesystem Created: 01/Mar/19  Updated: 05/Nov/21  Resolved: 21/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.1

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.6 - Servers: 2.12.0 Clients: 2.12.0 + patch LU-11964


Attachments: File dk-full.fir-md1-s1.gz     File dk.fir-md1-s1.gz     File fir-md1-s1-20190228-1.log.gz     File fir-md1-s1-20190228-2.log.gz     File fir-md1-s1-console-20190403.gz     Text File fir-md1-s1-kern-syslog-20190228.log     File fir-md1-s1-kern_20190327.log.gz     File fir-md1-s1_20190304-dk.log.gz     File fir-md1-s1_20190403-dk.log.gz     Text File fir-md1-s1_20190412.log     File fir-md1-s1_console_full_20190309.log.gz     File fir-md1-s1_dkdlmtrace_20190314.log.gz     File fir-md1-s1_dkdlmtrace_20190327.log.gz     File fir-md1-s1_kern_20190314.log.gz     File fir-md1-s1_sysrq-t_20190308.log.gz     File fir-md1-s2-console-20190403.gz     File fir-md1-s2-dlmtrace_20190311.log.gz     File fir-md1-s2-kern_20190327.log.gz     File fir-md1-s2_20190304-dk.log.gz     File fir-md1-s2_20190403-dk.log.gz     File fir-md1-s2_console_full_20190309.log.gz     File fir-md1-s2_dkdlmtrace_20190314.log.gz     File fir-md1-s2_dkdlmtrace_20190327.log.gz     File fir-md1-s2_kern_20190314.log.gz     File fir-md1-s2_sysrq-t_20190308.log.gz    
Issue Links:
Related
is related to LU-11964 Heavy load and soft lockups on MDS wi... Resolved
is related to LU-11359 racer test 1 times out with client hu... Resolved
is related to LU-11285 don't stop on the first blocked lock ... Resolved
is related to LU-12834 MDT hung during failover Open
is related to LU-11989 Global filesystem hangs in 2.12 Resolved
is related to LU-12125 Allow parallel rename of regular files Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

A new issue with 2.12 started with directories becoming inaccessible from a specific client (10.9.0.1@o2ib4) and ended up as a full filesystem deadlock. Restarting all MDTs (temporarily) resolved the issue...

This is the console of fir-md1-s1 (which serves MDT0000 and MDT0002) when the directories started to become inaccessible:

[1378713.475739] LNet: Service thread pid 48757 was inactive for 212.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[1378713.488774] LNet: Skipped 3 previous similar messages
[1378713.494012] LustreError: dumping log to /tmp/lustre-log.1551393702.48757
[1378716.547830] LustreError: dumping log to /tmp/lustre-log.1551393705.22268
[1378721.106914] Lustre: fir-MDT0000-osp-MDT0002: Connection to fir-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[1378721.108926] LustreError: 51417:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393409, 300s ago); not entering recovery in server code, just going back to sleep ns:
[1378721.161814] LustreError: 51459:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393409, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378721.202150] Lustre: fir-MDT0000: Received new LWP connection from 0@lo, removing former export from same NID
[1378721.943952] LustreError: 22249:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393410, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378721.984222] LustreError: 22249:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
[1378724.392997] LustreError: 51445:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393413, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378728.847103] LustreError: 22142:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393417, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378729.910121] LustreError: 21796:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393418, 300s ago); not entering recovery in server code, just going back to sleep ns:
[1378729.949613] LustreError: 21796:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 4 previous similar messages
[1378736.197272] LustreError: 21546:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393424, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378743.454431] LustreError: 21974:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393432, 300s ago); not entering recovery in server code, just going back to sleep ns:
[1378753.250665] LustreError: 22233:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393441, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378753.290920] LustreError: 22233:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
[1378771.845093] LNet: Service thread pid 22241 was inactive for 236.44s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[1378771.858133] LNet: Skipped 1 previous similar message
[1378771.863282] LustreError: dumping log to /tmp/lustre-log.1551393760.22241
[1378774.917158] LustreError: dumping log to /tmp/lustre-log.1551393763.21826
[1378789.253489] LustreError: dumping log to /tmp/lustre-log.1551393777.51431
[1378796.422652] LustreError: dumping log to /tmp/lustre-log.1551393785.22159
[1378800.517748] LustreError: dumping log to /tmp/lustre-log.1551393789.47900
[1378801.468777] LustreError: 48757:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393490, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378801.509027] LustreError: 48757:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 3 previous similar messages
[1378801.541773] LustreError: dumping log to /tmp/lustre-log.1551393790.21878
[1378833.603516] LustreError: 22159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393522, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378833.643766] LustreError: 22159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
[1378878.343542] LNet: Service thread pid 22157 was inactive for 313.10s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[1378878.356579] LNet: Skipped 5 previous similar messages
[1378878.361831] LustreError: dumping log to /tmp/lustre-log.1551393867.22157
[1378938.760938] LNet: Service thread pid 21779 was inactive for 362.88s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[1378938.778069] LNet: Skipped 3 previous similar messages
[1378938.783305] Pid: 21779, comm: mdt01_015 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[1378938.793247] Call Trace:
[1378938.795883]  [<ffffffffc12340bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[1378938.802998]  [<ffffffffc123612f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
[1378938.810278]  [<ffffffffc12389ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
[1378938.817141]  [<ffffffffc16507a2>] osp_md_object_lock+0x162/0x2d0 [osp]
[1378938.823899]  [<ffffffffc1a7be03>] lod_object_lock+0xf3/0x7b0 [lod]
[1378938.830299]  [<ffffffffc1afcd3e>] mdd_object_lock+0x3e/0xe0 [mdd]
[1378938.836613]  [<ffffffffc199b2e1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
[1378938.844053]  [<ffffffffc199b87a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[1378938.850995]  [<ffffffffc19bb022>] mdt_reint_rename_or_migrate.isra.51+0x362/0x860 [mdt]
[1378938.859211]  [<ffffffffc19bb553>] mdt_reint_rename+0x13/0x20 [mdt]
[1378938.865606]  [<ffffffffc19bb5e3>] mdt_reint_rec+0x83/0x210 [mdt]
[1378938.871828]  [<ffffffffc1998133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[1378938.878586]  [<ffffffffc19a3497>] mdt_reint+0x67/0x140 [mdt]
[1378938.884456]  [<ffffffffc12d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[1378938.891594]  [<ffffffffc127592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1378938.899485]  [<ffffffffc127925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[1378938.905998]  [<ffffffff8e8c1c31>] kthread+0xd1/0xe0
[1378938.911090]  [<ffffffff8ef74c24>] ret_from_fork_nospec_begin+0xe/0x21
[1378938.917767]  [<ffffffffffffffff>] 0xffffffffffffffff
[1378938.922966] LustreError: dumping log to /tmp/lustre-log.1551393927.21779

The MDT restarts occurred at ~15:47.

I was able to gather some logs:

  • a tarball of the automatically generated lustre-log files above is available on the FTP server (fir-md1-s1-lustre-log-20190228.tar.gz)
  • I ran the commands that Patrick provided in LU-11989 twice; the results are available in fir-md1-s1-20190228-1.log.gz and fir-md1-s1-20190228-2.log.gz, attached to this ticket
  • fir-md1-s1 full kernel syslog dump in fir-md1-s1-kern-syslog-20190228.log

A bit more about our DNE configuration:
We use DNEv2 only to balance group and user directories across different MDTs (each individual user/group directory is placed on a single MDT), that's all. Basically:

$ lfs getdirstripe /fir
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
$ lfs getdirstripe /fir/users
lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     0           [0x200000400:0x5:0x0]  
     1           [0x240000402:0x5:0x0]  
     2           [0x2c0000400:0x5:0x0]  
     3           [0x280000401:0x5:0x0]  
$ lfs getdirstripe /fir/groups
lmv_stripe_count: 4 lmv_stripe_offset: 1 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     1           [0x240000400:0x8:0x0]  
     2           [0x2c0000401:0x6:0x0]  
     3           [0x280000403:0x2:0x0]  
     0           [0x200000401:0x6:0x0]  
$ lfs getdirstripe /fir/users/sthiell
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
$ lfs getdirstripe /fir/users/kilian
lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none
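
For reference, directories with layouts like the ones above would typically be created with lfs mkdir; a sketch (not our exact provisioning commands):

$ lfs mkdir -c 4 -i 0 /fir/users     # 4-way striped directory, master stripe on MDT0000
$ lfs mkdir -c 4 -i 1 /fir/groups    # 4-way striped directory, master stripe on MDT0001
$ lfs mkdir -i 1 /fir/users/kilian   # unstriped directory placed on a single MDT (here MDT0001)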

Thanks!



 Comments   
Comment by Peter Jones [ 01/Mar/19 ]

Lai

Could you please investigate?

Thanks

Peter

Comment by Stephane Thiell [ 02/Mar/19 ]

One more thing: I've been told by the cluster sysadmin that this problem often starts just after creating a batch of new users/groups in the morning, i.e. some mkdir operations in /fir/users or /fir/groups hang. This is just a hint, but it would indeed suggest a DNEv2-specific issue.

Comment by Stephane Thiell [ 04/Mar/19 ]

We had a new occurrence of the issue last night: the filesystem hung on several clients, including 3 out of 8 login nodes, while the others were fine. Super weird but very impactful. I gathered more logs first:

fir-md1-s1 (MDT0000,MDT0002):
$ lctl set_param debug=+dlmtrace; lctl clear
$ lctl set_param ldlm.dump_namespaces=1
$ lctl set_param debug=-dlmtrace
$ lctl dk > /tmp/fir-md1-s1_20190304-dk.log

fir-md1-s2 (MDT0001,MDT0003):
$ lctl set_param debug=+dlmtrace; lctl clear
$ lctl set_param ldlm.dump_namespaces=1
$ lctl set_param debug=-dlmtrace
$ lctl dk > /tmp/fir-md1-s2_20190304-dk.log

I've attached the corresponding .gz files to this ticket.

Same kinds of call traces again:

Mar 04 04:46:26 fir-md1-s1 kernel: LNet: Service thread pid 106776 was inactive for 1200.49s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 04 04:46:26 fir-md1-s1 kernel: LNet: Skipped 2 previous similar messages
Mar 04 04:46:26 fir-md1-s1 kernel: Pid: 106776, comm: mdt02_025 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 04 04:46:26 fir-md1-s1 kernel: Call Trace:
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffc12340bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffc1234dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffc19bb33a>] mdt_reint_rename_or_migrate.isra.51+0x67a/0x860 [mdt]
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffc19bb553>] mdt_reint_rename+0x13/0x20 [mdt]
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffc19bb5e3>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffc1998133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffc19a3497>] mdt_reint+0x67/0x140 [mdt]
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffc12d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffc127592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffc127925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffff8e8c1c31>] kthread+0xd1/0xe0
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffff8ef74c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 04 04:46:26 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 04 04:46:26 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1551703586.106776
Mar 04 04:46:50 fir-md1-s1 kernel: Pid: 106617, comm: mdt00_006 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 04 04:46:50 fir-md1-s1 kernel: Call Trace:
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffc12340bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffc1234dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffc19bb33a>] mdt_reint_rename_or_migrate.isra.51+0x67a/0x860 [mdt]
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffc19bb553>] mdt_reint_rename+0x13/0x20 [mdt]
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffc19bb5e3>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffc1998133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffc19a3497>] mdt_reint+0x67/0x140 [mdt]
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffc12d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffc127592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffc127925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffff8e8c1c31>] kthread+0xd1/0xe0
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffff8ef74c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 04 04:46:50 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 04 04:46:50 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1551703610.106617
Mar 04 04:48:00 fir-md1-s1 kernel: Pid: 106986, comm: mdt01_079 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 04 04:48:00 fir-md1-s1 kernel: Call Trace:
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc1234031>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc123612f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc12389ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc16507a2>] osp_md_object_lock+0x162/0x2d0 [osp]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc1a7be03>] lod_object_lock+0xf3/0x7b0 [lod]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc1afcd3e>] mdd_object_lock+0x3e/0xe0 [mdd]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc199b2e1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc199b87a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc19bb022>] mdt_reint_rename_or_migrate.isra.51+0x362/0x860 [mdt]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc19bb553>] mdt_reint_rename+0x13/0x20 [mdt]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc19bb5e3>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc1998133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc19a3497>] mdt_reint+0x67/0x140 [mdt]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc12d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc127592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffc127925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffff8e8c1c31>] kthread+0xd1/0xe0
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffff8ef74c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 04 04:48:00 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 04 04:48:00 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1551703680.106986
Mar 04 04:48:08 fir-md1-s1 kernel: Lustre: fir-MDT0002: Client 34c03761-c352-7fa2-a35d-e32b61a041a3 (at 10.9.101.53@o2ib4) reconnecting
Mar 04 04:48:08 fir-md1-s1 kernel: Lustre: Skipped 76 previous similar messages
Mar 04 04:48:08 fir-md1-s1 kernel: Lustre: fir-MDT0002: Connection restored to  (at 10.9.101.53@o2ib4)
Mar 04 04:48:08 fir-md1-s1 kernel: Lustre: Skipped 76 previous similar messages
Mar 04 04:49:40 fir-md1-s1 kernel: Lustre: fir-MDT0000-osp-MDT0002: Connection to fir-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Mar 04 04:49:40 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
Mar 04 04:49:40 fir-md1-s1 kernel: LustreError: 106986:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551703480, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MDT0000-osp-MDT0002 lock: ffff9cb330ebaf40/0xb7044c67f3839567 lrc: 4/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 101 type: IBT flags: 0x1000001000000 nid: local remote: 0xb7044c67f383956e expref: -99 pid: 106986 timeout: 0 lvb_type: 0
Mar 04 04:49:40 fir-md1-s1 kernel: LustreError: 106986:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 3 previous similar messages
Mar 04 04:49:40 fir-md1-s1 kernel: Lustre: fir-MDT0000: Received new LWP connection from 0@lo, removing former export from same NID
Mar 04 04:49:40 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message

Not sure whether it is related, but just in case, here are the LDLM lock counts when this happened:

fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=26975
fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=2491852
fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=7067
fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_count=11995

I then tried lru_size=clear on all nodes without success (the filesystem was still hung); however, the lock counters dropped a bit (see the command sketch after the counters below):

[root@fir-hn01 sthiell.root]# clush -w@mds -L lctl get_param ldlm.*.*mdt*.lock_count 
fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=2229
fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=86937
fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=1719
fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_count=3698
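
For reference, the LRU clear mentioned above is the standard lctl parameter; roughly (a sketch, the exact invocation was not recorded here):

# run on every node (clients and servers)
$ lctl set_param ldlm.namespaces.*.lru_size=clear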

Finally I had to restart all MDTs for the filesystem to come back online.

Comment by Stephane Thiell [ 04/Mar/19 ]

I'm also wondering whether the garbage found in the Lustre dk logs is normal:

00000100:00080000:37.0:1551704355.938320:0:106787:0:(import.c:104:ptlrpc_import_enter_resend()) ffff9cd4f96e1000 °r³iÁ<9c>ÿÿð<87>ýx»<9c>ÿÿ: changing import state from FULL to RECOVER
00000100:00080000:37.0:1551704355.938323:0:106787:0:(import.c:1529:ptlrpc_import_recovery_state_machine()) ffff9cd4f96e1000 °r³iÁ<9c>ÿÿð<87>ýx»<9c>ÿÿ: changing import state from RECOVER to FULL
Comment by Andreas Dilger [ 04/Mar/19 ]

Stephane, when you are creating user/group directories, is it just a simple "mkdir" at the top level, or is there a script and/or command that is run to create those directories? It looks like all of your stack traces involve remote renames, which definitely makes things more complex on the DNE front than a simple mkdir. Is it possible to capture a debug log of a (sample) user/group creation on the client, so we can see what operations are being done there?

Comment by Stephane Thiell [ 04/Mar/19 ]

Hmm, ok. We no longer think it is related to the account creation process, since no new accounts were created over the weekend (the last was on Friday), yet the issue happened last night anyway.

Account creation is basically this:

mkdir -p $user_scratch
chown $user: $user_scratch
chmod 700 $user_scratch
lfs setquota [...]

But given our DNE setup (each user or group directory is randomly striped to a single MDT), if the issue is due to remote renames, that would mean mv operations between different user or group directories, which is not impossible but I would guess is rather unlikely. One possible case, though, would be a user moving data to/from their $SCRATCH (/fir/users/<user>) and $GROUP_SCRATCH (/fir/groups/<group>), since those two directories are not necessarily on the same MDT.

Do you see any interesting FIDs that I could check?
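
(For reference, a FID seen in the MDS logs can be mapped back to a path from a client with lfs fid2path; a sketch:)

$ lfs fid2path /fir [seq:oid:ver]   # substitute a FID copied from an ldlm resource line in the logs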

Comment by Stephane Thiell [ 05/Mar/19 ]

It's happening again, and this time even the recovery failed:

[1801189.800687] Call Trace:
[1801189.803332]  [<ffffffffc12340bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[1801189.810460]  [<ffffffffc1234dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[1801189.817848]  [<ffffffffc19bb33a>] mdt_reint_rename_or_migrate.isra.51+0x67a/0x860 [mdt]
[1801189.826094]  [<ffffffffc19bb553>] mdt_reint_rename+0x13/0x20 [mdt]
[1801189.832508]  [<ffffffffc19bb5e3>] mdt_reint_rec+0x83/0x210 [mdt]
[1801189.838732]  [<ffffffffc1998133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[1801189.845480]  [<ffffffffc19a3497>] mdt_reint+0x67/0x140 [mdt]
[1801189.851358]  [<ffffffffc12d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[1801189.858498]  [<ffffffffc127592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1801189.866414]  [<ffffffffc127925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[1801189.872939]  [<ffffffff8e8c1c31>] kthread+0xd1/0xe0
[1801189.878044]  [<ffffffff8ef74c24>] ret_from_fork_nospec_begin+0xe/0x21
[1801189.884692]  [<ffffffffffffffff>] 0xffffffffffffffff
[1801189.889922] LustreError: dumping log to /tmp/lustre-log.1551816168.55197
[1801206.175151] Pid: 55297, comm: mdt00_053 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[1801206.185065] Call Trace:
[1801206.187727]  [<ffffffffc12340bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[1801206.194852]  [<ffffffffc123612f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
[1801206.202154]  [<ffffffffc12389ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
[1801206.209024]  [<ffffffffc16507a2>] osp_md_object_lock+0x162/0x2d0 [osp]
[1801206.215782]  [<ffffffffc1a7be03>] lod_object_lock+0xf3/0x7b0 [lod]
[1801206.222182]  [<ffffffffc1afcd3e>] mdd_object_lock+0x3e/0xe0 [mdd]
[1801206.228503]  [<ffffffffc199b2e1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
[1801206.235944]  [<ffffffffc199b87a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[1801206.242889]  [<ffffffffc19bb022>] mdt_reint_rename_or_migrate.isra.51+0x362/0x860 [mdt]
[1801206.251111]  [<ffffffffc19bb553>] mdt_reint_rename+0x13/0x20 [mdt]
[1801206.257510]  [<ffffffffc19bb5e3>] mdt_reint_rec+0x83/0x210 [mdt]
[1801206.263738]  [<ffffffffc1998133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[1801206.270487]  [<ffffffffc19a3497>] mdt_reint+0x67/0x140 [mdt]
[1801206.276371]  [<ffffffffc12d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[1801206.283513]  [<ffffffffc127592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1801206.291411]  [<ffffffffc127925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[1801206.297927]  [<ffffffff8e8c1c31>] kthread+0xd1/0xe0
[1801206.303017]  [<ffffffff8ef74c24>] ret_from_fork_nospec_begin+0xe/0x21
[1801206.309695]  [<ffffffffffffffff>] 0xffffffffffffffff
[1801206.314891] LustreError: dumping log to /tmp/lustre-log.1551816185.55297
[1801240.773914] Lustre: Failing over fir-MDT0000
[1801240.778365] Lustre: Skipped 1 previous similar message
[1801240.804022] Lustre: fir-MDT0002: Not available for connect from 10.8.9.4@o2ib6 (stopping)
[1801240.812376] Lustre: Skipped 5 previous similar messages
[1801240.955740] LustreError: 129573:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) ldlm_cancel from 10.8.8.32@o2ib6 arrived at 1551816219 with bad export cookie 13187749618252332009
[1801240.971486] LustreError: 129573:0:(ldlm_lock.c:2689:ldlm_lock_dump_handle()) ### ### ns: mdt-fir-MDT0002_UUID lock: ffff9c9703740d80/0xb7044c69d58ab18b lrc: 3/0,0 mode: PW/PW res: [0x2c0007733:0x183b1:0x0].0x0 bits 0x40/0x0 rrc: 4 type: IBT flags: 0x40200000000000 nid: 10.8.8.32@o2ib6 remote: 0x34f09e608bc3ca46 expref: 18737 pid: 55430 timeout: 0 lvb_type: 0
[1801240.980397] LNet: Service thread pid 55630 completed after 3273.04s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[1801240.980628] LustreError: 55270:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff9c9c7c327000 ns: mdt-fir-MDT0002_UUID lock: ffff9c98af38f980/0xb7044c69e7fe6f3e lrc: 3/0,0 mode: PR/PR res: [0x2c00086f1:0x905:0x0].0x0 bits 0x5b/0x0 rrc: 7 type: IBT flags: 0x50200000000000 nid: 10.9.114.3@o2ib4 remote: 0xc92201bd420f6205 expref: 3 pid: 55270 timeout: 0 lvb_type: 0
[1801240.980632] LustreError: 55270:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) Skipped 2 previous similar messages
[1801241.032966] LustreError: 11-0: fir-MDT0002-osp-MDT0000: operation mds_disconnect to node 0@lo failed: rc = -19
[1801241.032993] LustreError: 83931:0:(osp_dev.c:485:osp_disconnect()) fir-MDT0002-osp-MDT0000: can't disconnect: rc = -19
[1801241.036803] LustreError: 83931:0:(lod_dev.c:265:lod_sub_process_config()) fir-MDT0000-mdtlov: error cleaning up LOD index 2: cmd 0xcf031: rc = -19
[1801241.094313] LustreError: 55430:0:(mdt_reint.c:2603:mdt_reint_rename_or_migrate()) fir-MDT0002: can't lock FS for rename: rc = -5
[1801241.096009] LustreError: 83930:0:(ldlm_resource.c:1146:ldlm_resource_complain()) fir-MDT0000-osp-MDT0002: namespace resource [0x200000004:0x1:0x0].0x0 (ffff9ca1fc66b200) refcount nonzero (89) after lock cleanup; forcing cleanup.
[1801241.096013] LustreError: 83930:0:(ldlm_resource.c:1146:ldlm_resource_complain()) Skipped 1 previous similar message
[1801241.141631] LustreError: 129573:0:(ldlm_lock.c:2689:ldlm_lock_dump_handle()) Skipped 1 previous similar message
[1801241.154050] LustreError: 54975:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9c9f9566b000 x1625962857264336/t0(0) o104->fir-MDT0000@10.8.0.66@o2ib6:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[1801241.175175] LustreError: 54975:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 69 previous similar messages
[1801241.478798] LustreError: 129573:0:(ldlm_lock.c:2689:ldlm_lock_dump_handle()) ### ### ns: mdt-fir-MDT0002_UUID lock: ffff9c98f2557bc0/0xb7044c69c15f2f17 lrc: 3/0,0 mode: PR/PR res: [0x2c0007180:0x83ac:0x0].0x0 bits 0x5b/0x0 rrc: 49 type: IBT flags: 0x40200000000000 nid: 10.9.106.41@o2ib4 remote: 0xd219d38c6b9c697f expref: 33 pid: 80695 timeout: 0 lvb_type: 0
[1801241.510566] LustreError: 129573:0:(ldlm_lock.c:2689:ldlm_lock_dump_handle()) Skipped 2 previous similar messages
[1801242.126082] LustreError: 83931:0:(ldlm_resource.c:1146:ldlm_resource_complain()) mdt-fir-MDT0000_UUID: namespace resource [0x20000674e:0x9cf:0x0].0x0 (ffff9cab8163ba40) refcount nonzero (1) after lock cleanup; forcing cleanup.
[1801242.146322] LustreError: 83931:0:(ldlm_resource.c:1146:ldlm_resource_complain()) Skipped 1 previous similar message
[1801242.208020] LustreError: 129056:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9cb30eb08300 x1625962857266144/t0(0) o41->fir-MDT0000-osp-MDT0002@0@lo:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[1801242.229127] LustreError: 129056:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 29 previous similar messages
[1801242.806106] Lustre: fir-MDT0000: Not available for connect from 10.8.24.14@o2ib6 (stopping)
[1801242.814641] Lustre: Skipped 369 previous similar messages
[1801242.996721] LustreError: 129186:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) ldlm_cancel from 10.9.106.40@o2ib4 arrived at 1551816221 with bad export cookie 13187749618252324512
[1801243.012630] LustreError: 129186:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) Skipped 3 previous similar messages
[1801243.022924] LustreError: 129186:0:(ldlm_lock.c:2689:ldlm_lock_dump_handle()) ### ### ns: mdt-fir-MDT0002_UUID lock: ffff9cbebbbd9680/0xb7044c69d65a19cd lrc: 3/0,0 mode: PR/PR res: [0x2c0003af3:0xf58:0x0].0x0 bits 0x13/0x0 rrc: 12 type: IBT flags: 0x40200000000000 nid: 10.9.106.40@o2ib4 remote: 0x20c2168cebe99145 expref: 40 pid: 55583 timeout: 0 lvb_type: 0
[1801245.392097] LustreError: 129060:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9c9f95668300 x1625962857266240/t0(0) o41->fir-MDT0002-osp-MDT0000@0@lo:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[1801245.413208] LustreError: 129060:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 1 previous similar message
[1801245.694733] LustreError: 83930:0:(osp_object.c:594:osp_attr_get()) fir-MDT0000-osp-MDT0002:osp_attr_get update error [0x20000000a:0x0:0x0]: rc = -5
[1801245.715297] LustreError: 83930:0:(llog_cat.c:424:llog_cat_close()) fir-MDT0000-osp-MDT0002: failure destroying log during cleanup: rc = -5
[1801245.727909] LustreError: 83930:0:(llog_cat.c:424:llog_cat_close()) Skipped 7 previous similar messages
[1801246.816559] Lustre: fir-MDT0000: Not available for connect from 10.9.106.50@o2ib4 (stopping)
[1801246.825173] Lustre: Skipped 442 previous similar messages
[1801247.183525] LustreError: 129573:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) ldlm_cancel from 10.8.9.3@o2ib6 arrived at 1551816226 with bad export cookie 13187749619015517593
[1801247.199164] LustreError: 129573:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) Skipped 5 previous similar messages
[1801247.728137] LustreError: 129056:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9cac2a371200 x1625962857268608/t0(0) o41->fir-MDT0003-osp-MDT0002@10.0.10.52@o2ib7:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[1801247.750276] LustreError: 129056:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 5 previous similar messages
[1801661.148814] Lustre: fir-MDT0000: Denying connection for new client 1cbf77f1-25f5-e39f-f802-ab9685fcc565(at 10.8.3.11@o2ib6), waiting for 1417 known clients (3 recovered, 629 in progress, and 0 evicted) to recover in 8:27
[1801661.168523] Lustre: Skipped 3 previous similar messages
[1801735.596319] Lustre: Failing over fir-MDT0000
[1801735.600771] Lustre: Skipped 1 previous similar message
[1801735.635188] LustreError: 129042:0:(import.c:664:ptlrpc_connect_import()) can't connect to a closed import
[1801735.644931] LustreError: 129042:0:(import.c:664:ptlrpc_connect_import()) Skipped 3 previous similar messages
[1801735.682971] LustreError: 86752:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0002: Aborting recovery
[1801735.683009] Lustre: 85759:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[1801735.683012] Lustre: 85759:0:(ldlm_lib.c:2058:target_recovery_overseer()) Skipped 5 previous similar messages
[1801735.683022] LustreError: 85759:0:(ldlm_lib.c:1638:abort_lock_replay_queue()) @@@ aborted:  req@ffff9cb44ba87200 x1626244246703856/t0(0) o101->ab669dd9-9e1f-c618-1b5e-a1b4edf4a48d@10.9.112.4@o2ib4:140/0 lens 328/0 e 0 to 0 dl 1551817325 ref 1 fl Complete:/40/ffffffff rc 0/-1
[1801735.683024] LustreError: 85759:0:(ldlm_lib.c:1638:abort_lock_replay_queue()) Skipped 4223 previous similar messages
[1801735.684466] Lustre: fir-MDT0000: Not available for connect from 10.9.108.52@o2ib4 (stopping)
[1801735.684468] Lustre: Skipped 719 previous similar messages
[1801735.699302] Lustre: fir-MDT0000: Recovery over after 2:24, of 1417 clients 0 recovered and 1417 were evicted.
[1801735.699304] Lustre: Skipped 1 previous similar message
[1801735.728163] Lustre: 85759:0:(mdt_handler.c:6543:mdt_postrecov()) fir-MDT0000: auto trigger paused LFSCK failed: rc = -6
[1801735.728249] LustreError: 85676:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9ca325aab300 x1625962857356064/t0(0) o5->fir-OST0000-osc-MDT0000@10.0.10.101@o2ib7:28/4 lens 432/432 e 0 to 0 dl 0 ref 2 fl Rpc:N/0/ffffffff rc 0/-1
[1801735.728251] LustreError: 85676:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 97 previous similar messages
[1801735.728271] LustreError: 85720:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) fir-OST0003-osc-MDT0000: cannot cleanup orphans: rc = -5
[1801735.728273] LustreError: 85716:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) fir-OST0001-osc-MDT0000: cannot cleanup orphans: rc = -5
[1801735.728275] LustreError: 85720:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) Skipped 1 previous similar message
[1801735.728276] LustreError: 85716:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) Skipped 1 previous similar message
[1801735.791015] LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.0.61@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
[1801735.791016] LustreError: Skipped 81 previous similar messages
[1801735.893615] LustreError: 86752:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) Skipped 1 previous similar message
[1801737.285829] Lustre: server umount fir-MDT0000 complete
[1801737.291148] Lustre: Skipped 1 previous similar message
[1801761.665074] LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
[1801761.721077] LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
[1801761.853370] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
[1801761.907863] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
[1801762.491332] Lustre: fir-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[1801762.501770] Lustre: Skipped 1 previous similar message
[1801762.635726] Lustre: fir-MDD0000: changelog on
[1801762.642849] Lustre: fir-MDT0000: in recovery but waiting for the first client to connect
[1801762.651119] Lustre: Skipped 1 previous similar message
[1801762.774466] LustreError: 11-0: fir-MDT0000-osp-MDT0002: operation mds_connect to node 0@lo failed: rc = -114
[1801767.498644] Lustre: fir-MDT0000: Will be in recovery for at least 2:30, or until 1417 clients reconnect
[1801767.508221] Lustre: Skipped 1 previous similar message
[1801767.662046] LustreError: 87047:0:(mdt_open.c:1364:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff9cabc3be1800 x1626782472669232/t0(55986962201) o101->73cb0dc7-e397-e758-abcd-289d6f4ca268@10.9.0.61@o2ib4:316/0 lens 1784/3288 e 0 to 0 dl 1551817501 ref 1 fl Interpret:/4/0 rc 0/0
[1801768.737621] LustreError: 87407:0:(mdt_open.c:1364:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff9cb44ba85700 x1626472703432928/t0(55991756253) o101->0f82de9c-d37a-1eea-ad4a-b15f87c0fd06@10.9.0.62@o2ib4:317/0 lens 1784/3288 e 0 to 0 dl 1551817502 ref 1 fl Interpret:/4/0 rc 0/0
[1801768.764346] LustreError: 87407:0:(mdt_open.c:1364:mdt_reint_open()) Skipped 1 previous similar message
[1801772.323585] LustreError: 87362:0:(mdt_open.c:1364:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff9cd45d660600 x1626200132291648/t0(55989738647) o101->ca268eff-8bf8-0ad8-eaa9-213224fab174@10.9.0.63@o2ib4:321/0 lens 1768/3288 e 0 to 0 dl 1551817506 ref 1 fl Interpret:/4/0 rc 0/0
[1801772.350388] LustreError: 87362:0:(mdt_open.c:1364:mdt_reint_open()) Skipped 1 previous similar message
[1801811.679798] Lustre: fir-MDT0000: Denying connection for new client 1cbf77f1-25f5-e39f-f802-ab9685fcc565(at 10.8.3.11@o2ib6), waiting for 1417 known clients (0 recovered, 629 in progress, and 0 evicted) to recover in 8:52
[1801811.699561] Lustre: Skipped 7 previous similar messages
Comment by Stephane Thiell [ 05/Mar/19 ]

Finally, it looks like the restart of all MDTs completed successfully:

Mar 05 12:04:42 fir-md1-s1 kernel: LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
Mar 05 12:04:42 fir-md1-s1 kernel: LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
Mar 05 12:04:42 fir-md1-s1 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Mar 05 12:04:42 fir-md1-s1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Mar 05 12:04:43 fir-md1-s1 kernel: Lustre: fir-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Mar 05 12:04:43 fir-md1-s1 kernel: Lustre: fir-MDD0002: changelog on
Mar 05 12:04:43 fir-md1-s1 kernel: Lustre: fir-MDT0002: in recovery but waiting for the first client to connect
Mar 05 12:04:43 fir-md1-s1 kernel: Lustre: fir-MDT0002: Will be in recovery for at least 2:30, or until 1417 clients reconnect
Mar 05 12:11:54 fir-md1-s1 kernel: LustreError: 85720:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) fir-OST0003-osc-MDT0000: cannot cleanup orphans: rc = -5
Mar 05 12:11:54 fir-md1-s1 kernel: LustreError: 85716:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) fir-OST0001-osc-MDT0000: cannot cleanup orphans: rc = -5
Mar 05 12:11:54 fir-md1-s1 kernel: LustreError: 85720:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) Skipped 1 previous similar message
Mar 05 12:11:54 fir-md1-s1 kernel: LustreError: 85716:0:(osp_precreate.c:940:osp_precreate_cleanup_orphans()) Skipped 1 previous similar message
Mar 05 12:11:54 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.0.61@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
Mar 05 12:11:54 fir-md1-s1 kernel: LustreError: Skipped 81 previous similar messages
Mar 05 12:11:54 fir-md1-s1 kernel: LustreError: 86752:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) Skipped 1 previous similar message
Mar 05 12:11:56 fir-md1-s1 kernel: Lustre: server umount fir-MDT0000 complete
Mar 05 12:11:56 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
Mar 05 12:12:20 fir-md1-s1 kernel: LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
Mar 05 12:12:20 fir-md1-s1 kernel: LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
Mar 05 12:12:20 fir-md1-s1 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Mar 05 12:12:20 fir-md1-s1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Mar 05 12:12:21 fir-md1-s1 kernel: Lustre: fir-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Mar 05 12:12:21 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
Mar 05 12:12:21 fir-md1-s1 kernel: Lustre: fir-MDD0000: changelog on
Mar 05 12:12:21 fir-md1-s1 kernel: Lustre: fir-MDT0000: in recovery but waiting for the first client to connect
Mar 05 12:12:21 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
Mar 05 12:12:21 fir-md1-s1 kernel: LustreError: 11-0: fir-MDT0000-osp-MDT0002: operation mds_connect to node 0@lo failed: rc = -114
Mar 05 12:12:26 fir-md1-s1 kernel: Lustre: fir-MDT0000: Will be in recovery for at least 2:30, or until 1417 clients reconnect
Mar 05 12:12:26 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
Mar 05 12:12:26 fir-md1-s1 kernel: LustreError: 87047:0:(mdt_open.c:1364:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff9cabc3be1800 x1626782472669232/t0(55986962201) o101->73cb0dc7-e397-e758-abcd-289d6f4ca268@10.9.0.61@o2ib4:316/0 lens 1784/3288 e 0 to 0 dl 1551817501 ref 1 fl Interpret:/4/0 rc 0/0
Mar 05 12:12:27 fir-md1-s1 kernel: LustreError: 87407:0:(mdt_open.c:1364:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff9cb44ba85700 x1626472703432928/t0(55991756253) o101->0f82de9c-d37a-1eea-ad4a-b15f87c0fd06@10.9.0.62@o2ib4:317/0 lens 1784/3288 e 0 to 0 dl 1551817502 ref 1 fl Interpret:/4/0 rc 0/0
Mar 05 12:12:27 fir-md1-s1 kernel: LustreError: 87407:0:(mdt_open.c:1364:mdt_reint_open()) Skipped 1 previous similar message
Mar 05 12:12:31 fir-md1-s1 kernel: LustreError: 87362:0:(mdt_open.c:1364:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff9cd45d660600 x1626200132291648/t0(55989738647) o101->ca268eff-8bf8-0ad8-eaa9-213224fab174@10.9.0.63@o2ib4:321/0 lens 1768/3288 e 0 to 0 dl 1551817506 ref 1 fl Interpret:/4/0 rc 0/0
Mar 05 12:12:31 fir-md1-s1 kernel: LustreError: 87362:0:(mdt_open.c:1364:mdt_reint_open()) Skipped 1 previous similar message
Mar 05 12:13:10 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client 1cbf77f1-25f5-e39f-f802-ab9685fcc565(at 10.8.3.11@o2ib6), waiting for 1417 known clients (0 recovered, 629 in progress, and 0 evicted) to recover in 8:52
Mar 05 12:13:10 fir-md1-s1 kernel: Lustre: Skipped 7 previous similar messages
Mar 05 12:17:18 fir-md1-s1 kernel: Lustre: fir-MDT0002: Connection restored to  (at 10.8.27.4@o2ib6)
Mar 05 12:17:18 fir-md1-s1 kernel: Lustre: Skipped 1619 previous similar messages
Mar 05 12:17:28 fir-md1-s1 kernel: LustreError: 87377:0:(mdt_open.c:1364:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff9cac23e80900 x1626108495014368/t0(55988639638) o101->f3a357e0-f66c-d7f2-ee4c-00445e5c0c19@10.8.0.67@o2ib6:618/0 lens 1768/3288 e 0 to 0 dl 1551817803 ref 1 fl Interpret:/4/0 rc 0/0
Mar 05 12:17:28 fir-md1-s1 kernel: LustreError: 87377:0:(mdt_open.c:1364:mdt_reint_open()) Skipped 1 previous similar message
Mar 05 12:17:45 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client 1cbf77f1-25f5-e39f-f802-ab9685fcc565(at 10.8.3.11@o2ib6), waiting for 1417 known clients (786 recovered, 629 in progress, and 0 evicted) to recover in 4:17
Mar 05 12:17:45 fir-md1-s1 kernel: Lustre: Skipped 21 previous similar messages
Mar 05 12:26:31 fir-md1-s1 kernel: Lustre: fir-MDT0000: Denying connection for new client 1cbf77f1-25f5-e39f-f802-ab9685fcc565(at 10.8.3.11@o2ib6), waiting for 1417 known clients (786 recovered, 629 in progress, and 0 evicted) already passed deadline 4:28
Mar 05 12:26:31 fir-md1-s1 kernel: Lustre: Skipped 41 previous similar messages
Mar 05 12:27:02 fir-md1-s1 kernel: Lustre: fir-MDT0002: recovery is timed out, evict stale exports
Mar 05 12:27:02 fir-md1-s1 kernel: Lustre: fir-MDT0000: disconnecting 2 stale clients
Mar 05 12:27:02 fir-md1-s1 kernel: Lustre: 87215:0:(ldlm_lib.c:1771:extend_recovery_timer()) fir-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1
Mar 05 12:27:02 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
Mar 05 12:27:02 fir-md1-s1 kernel: Lustre: fir-MDT0000: Recovery already passed deadline 4:36, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
Mar 05 12:27:02 fir-md1-s1 kernel: Lustre: fir-MDT0000: Recovery over after 14:36, of 1417 clients 1415 recovered and 2 were evicted.
 

But this issue is really annoying. We were creating home dirs when this happened and mkdir was blocked. However, we create new users all the time, so I'm not sure this is really related. To at least do something, we're going to drop the DNEv2 usage for now and use lfs mkdir to create new user dirs, to see if that helps.

Comment by Lai Siyao [ 06/Mar/19 ]

Stephane, can you dump all process backtraces on all MDSes with 'echo t > /proc/sysrq-trigger' next time? That may help determine whether there are real deadlocks on the MDS, or whether it is just caused by some slow clients not cancelling locks, so that the MDT has to wait for client eviction and recovery to finish, while renames between MDTs create more lock dependencies and finally make the system look deadlocked.

I'm checking the LDLM code to see whether we can fail an MDT local lock request after a timeout, instead of waiting indefinitely, to avoid deep lock dependencies. I'll update you if this works.
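
A minimal way to capture and save those backtraces (a sketch; adjust paths as needed):

# on each MDS, as root (requires sysrq enabled, e.g. sysctl kernel.sysrq=1)
echo t > /proc/sysrq-trigger
dmesg > /tmp/$(hostname)-sysrq-t.log   # the traces land in the kernel ring buffer/console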

Comment by Stephane Thiell [ 06/Mar/19 ]

Thanks Lai, will do!

Comment by Stephane Thiell [ 08/Mar/19 ]

Hello,
I wanted to update this ticket because I just discovered that I provided some wrong information before.
We were actually using rename during the account creation process, basically something like:

cd /fir/users
mkdir $tmpdir
mv $tmpdir $user_scratch
chown $user: $user_scratch
chmod 700 $user_scratch
lfs setquota [...]

Because /fir/users is a DNEv2 (striped) directory, that probably explains why we see remote renames in the backtraces.

For now, we have switched to lfs mkdir (with no rename) to avoid DNEv2 and to stabilize the filesystem (because this has been painful). Perhaps at some point next week we can try mkdir + rename again so I can get MDS traces.

Comment by Andreas Dilger [ 08/Mar/19 ]

Using lfs mkdir will still invoke DNE1, which means remote entries for many of the directories, because you won't know in advance what MDT the directory will hash to. Using simple "mkdir" will not invoke DNE because it will only create the name and directory on the same MDT, so long as you don't use rename in the process.
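
Roughly, the distinction is (a sketch, with a hypothetical MDT index):

$ lfs mkdir -i 2 /fir/users/newuser   # DNE1: directory object placed on MDT0002, usually a remote entry in the parent
$ mkdir /fir/users/newuser            # plain mkdir: name and directory created on the same MDT, no remote entry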

Comment by Stephane Thiell [ 08/Mar/19 ]

Thanks Andreas, I think I get it! We'll just use mkdir without rename for now.

Comment by Stephane Thiell [ 09/Mar/19 ]

This happened again a few minutes ago (without us creating any new accounts today), but users can always move data between different group spaces, so...
I had to restart all MDTs to fix the issue. This time, the recovery was fast. I was also able to dump the tasks using SysRq on both MDSes when this happened.
Attaching fir-md1-s1_sysrq-t_20190308.log.gz (MDT0000 and MDT0002) and fir-md1-s2_sysrq-t_20190308.log.gz (MDT0001 and MDT0003). Hope that helps...

Comment by Stephane Thiell [ 09/Mar/19 ]

This issue hit us again and we were mostly down last night. The strange thing is that a few compute nodes could still access the filesystem, so it was not a total deadlock. Anyway, I dumped the tasks and also took a crash dump of both MDSes.

I have gathered the following files to help troubleshoot this issue:

Attached to this ticket:

  • fir-md1-s1_console_full_20190309.log.gz - full console logs of fir-md1-s1 (MDT0000 and MDT0002) from the last MDT startup (Fri Mar 8 16:34:33 2019), including the dump of the tasks with SysRq, the reboot (power diag) and the restart of the MDTs with the recovery.
  • fir-md1-s2_console_full_20190309.log.gz - same for fir-md1-s2 (MDT0001 and MDT0003)

Uploaded to the FTP server, two tarballs of the MDS crash dump:

  • fir-md1-s1-crash-2019-03-09-10-11-07.tar.gz
  • fir-md1-s2-crash-2019-03-09-10-14-21.tar.gz

Kernel version is 3.10.0-957.1.3.el7_lustre.x86_64 and I believe the debuginfo rpms have already been uploaded (kernel-debuginfo-3.10.0-957.1.3.el7_lustre.x86_64). Let me know if anything else is needed.

We really appreciate your help in this matter!

Comment by Stephane Thiell [ 11/Mar/19 ]

The same issue happened 4 or 5 times today; each time, restarting the MDTs unblocked the situation, but sometimes the problem comes back after just a few minutes. This is greatly affecting our users and their research work, and I don't know what to do to avoid it at this point... I hope you'll be able to provide a workaround or patch. Thanks!

Comment by Andreas Dilger [ 11/Mar/19 ]

Stephane, is this always still happening with rename between MDTs? Is it possible that your DNE2 striping is inherited into the subdirectories so that any rename is a cross-MDT rename?

One option to avoid the DNE deadlocks is to return -EXDEV for cross-MDT renames instead of doing the rename, which causes mv to copy the file in userspace. It would be possible to make a patch to allow this behavior to be configured.

There is a single DLM lock that is protecting renames in the filesystem to avoid circular directory trees. At one point we discussed allowing renames of regular files within a directory to happen in parallel with only a local lock so that there isn't so much contention on the global rename lock. However, I'm not sure if that is implemented yet.
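
As an aside, a quick way to check from a client whether a given mv would be a cross-MDT rename (a sketch with hypothetical SRC/DST paths):

$ lfs getdirstripe -i "$(dirname "$SRC")"   # MDT index of the source parent
$ lfs getdirstripe -i "$(dirname "$DST")"   # MDT index of the destination parent
# for unstriped parent directories, differing indexes mean the rename takes the remote (cross-MDT) path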

Comment by Stephane Thiell [ 11/Mar/19 ]

Hi Andreas! I don't think our DNE2 striping is inherited into the subdirectories, unless I missed something (more below). Each time, I can see backtraces involving mdt_reint_rename* and mdt_remote_object_lock*. But it's strange that there are so many renames; they should be pretty rare in our setup.

Typical trace:

[109711.735643]  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[109711.742681]  [<ffffffffc0e0b12f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
[109711.749883]  [<ffffffffc0e0d9ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
[109711.756635]  [<ffffffffc16147a2>] osp_md_object_lock+0x162/0x2d0 [osp]
[109711.763300]  [<ffffffffc1528e03>] lod_object_lock+0xf3/0x7b0 [lod]
[109711.769602]  [<ffffffffc15a9d3e>] mdd_object_lock+0x3e/0xe0 [mdd]
[109711.775846]  [<ffffffffc14182e1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
[109711.783217]  [<ffffffffc141887a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[109711.790052]  [<ffffffffc1438022>] mdt_reint_rename_or_migrate.isra.51+0x362/0x860 [mdt]
[109711.798206]  [<ffffffffc1438553>] mdt_reint_rename+0x13/0x20 [mdt]
[109711.804510]  [<ffffffffc14385e3>] mdt_reint_rec+0x83/0x210 [mdt]
[109711.810638]  [<ffffffffc1415133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[109711.817300]  [<ffffffffc1420497>] mdt_reint+0x67/0x140 [mdt]
[109711.823092]  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[109711.830130]  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[109711.837948]  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[109711.844362]  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
[109711.849387]  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
[109711.855940]  [<ffffffffffffffff>] 0xffffffffffffffff

I checked every subdirectory in each user directory and I didn't find any inconsistency:

$ for u in $(ls /fir/users); do find /fir/users/$u -maxdepth 1 -type d -exec lfs getdirstripe -i {} \; | sort -n | uniq -c | wc -l; done 2>/dev/null | sort -n | uniq -c
   3871 1
$ for g in $(ls /fir/groups); do find /fir/groups/$g -maxdepth 1 -type d -exec lfs getdirstripe -i {} \; | sort -n | uniq -c | wc -l; done 2>/dev/null | sort -n | uniq -c
    668 1

They all report 'none' as the lmv_hash_type, so I guess that means DNE1:

$ for u in $(ls /fir/users); do find /fir/users/$u -maxdepth 1 -type d -exec lfs getdirstripe {} \; | grep -v none; done
$ for g in $(ls /fir/groups); do find /fir/groups/$g -maxdepth 1 -type d -exec lfs getdirstripe {} \; | grep -v none; done
$

Typical subdirs:

$ lfs getdirstripe /fir/users
lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx		 FID[seq:oid:ver]
     0		 [0x200000400:0x5:0x0]		
     1		 [0x240000402:0x5:0x0]		
     2		 [0x2c0000400:0x5:0x0]		
     3		 [0x280000401:0x5:0x0]

$ lfs getdirstripe /fir/users/sthiell
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
$ lfs getdirstripe /fir/users/sthiell/ior
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none

$ lfs getdirstripe /fir/users/kilian
lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none                  
$ lfs getdirstripe /fir/users/kilian/benchs
lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none

It could be that a user is doing renames from $GROUP_SCRATCH to $SCRATCH (from/to a deep subdirectory), but otherwise I don't see why there would be so many remote renames.

Would there be a way to enable traces on the MDT so I could find the FIDs involved in remote renames? That way I may be able to spot a specific user or job.

And yes, having the possibility to enable -EXDEV on remote rename would be great.

Comment by Andreas Dilger [ 11/Mar/19 ]

There are mdt.*.rename_stats files for all the MDTs that can tell you the number of renames, both within a directory and across directories. You can also check mdt.*.exports.*.stats and mdt.*.job_stats to see which clients or jobs are doing a large number of rename operations. It may well be that some new workload is doing a lot of renames.
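
For example (a sketch), to spot which exports report cross-directory renames:

$ lctl get_param mdt.*.exports.*.stats | grep -E '\.stats=|crossdir_rename'
# exports that print no crossdir_rename line can be ignored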

Comment by Stephane Thiell [ 11/Mar/19 ]

Perhaps some specific workload is doing this, yes; today, for example, things are running okay so far...

rename_stats shows quite a lot of crossdir renames:

[root@fir-md1-s1 ~]# lctl get_param mdt.*.rename_stats
mdt.fir-MDT0000.rename_stats=
rename_stats:
- snapshot_time:  1552344763.503420069
- same_dir       
      4KB: { sample: 124757, pct:  65, cum_pct:  65 }
      8KB: { sample:   0, pct:   0, cum_pct:  65 }
      16KB: { sample: 31900, pct:  16, cum_pct:  82 }
      32KB: { sample: 1594, pct:   0, cum_pct:  83 }
      64KB: { sample: 2436, pct:   1, cum_pct:  84 }
      128KB: { sample: 3635, pct:   1, cum_pct:  86 }
      256KB: { sample: 1374, pct:   0, cum_pct:  87 }
      512KB: { sample: 4473, pct:   2, cum_pct:  89 }
      1MB: { sample: 20189, pct:  10, cum_pct: 100 }
- crossdir_src   
      4KB: { sample:  88, pct:  24, cum_pct:  24 }
      8KB: { sample:   0, pct:   0, cum_pct:  24 }
      16KB: { sample: 158, pct:  43, cum_pct:  67 }
      32KB: { sample:  69, pct:  18, cum_pct:  86 }
      64KB: { sample:  35, pct:   9, cum_pct:  95 }
      128KB: { sample:   0, pct:   0, cum_pct:  95 }
      256KB: { sample:  15, pct:   4, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 148, pct:  40, cum_pct:  40 }
      8KB: { sample:   0, pct:   0, cum_pct:  40 }
      16KB: { sample: 139, pct:  38, cum_pct:  78 }
      32KB: { sample:  44, pct:  12, cum_pct:  90 }
      64KB: { sample:  34, pct:   9, cum_pct: 100 }
mdt.fir-MDT0002.rename_stats=
rename_stats:
- snapshot_time:  1552344763.503679596
- same_dir       
      4KB: { sample: 28442, pct:  14, cum_pct:  14 }
      8KB: { sample:   0, pct:   0, cum_pct:  14 }
      16KB: { sample: 18463, pct:   9, cum_pct:  23 }
      32KB: { sample: 116725, pct:  59, cum_pct:  83 }
      64KB: { sample: 2574, pct:   1, cum_pct:  84 }
      128KB: { sample: 5828, pct:   2, cum_pct:  87 }
      256KB: { sample: 23572, pct:  12, cum_pct:  99 }
      512KB: { sample:  14, pct:   0, cum_pct:  99 }
      1MB: { sample:   7, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample:  56, pct:   0, cum_pct:   0 }
      8KB: { sample:   0, pct:   0, cum_pct:   0 }
      16KB: { sample: 136, pct:   0, cum_pct:   0 }
      32KB: { sample:   4, pct:   0, cum_pct:   0 }
      64KB: { sample:   0, pct:   0, cum_pct:   0 }
      128KB: { sample: 3969, pct:  12, cum_pct:  12 }
      256KB: { sample: 7792, pct:  23, cum_pct:  36 }
      512KB: { sample: 11381, pct:  34, cum_pct:  70 }
      1MB: { sample: 9710, pct:  29, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 1607, pct:   4, cum_pct:   4 }
      8KB: { sample:   0, pct:   0, cum_pct:   4 }
      16KB: { sample: 1703, pct:   5, cum_pct:  10 }
      32KB: { sample: 4349, pct:  13, cum_pct:  23 }
      64KB: { sample: 6850, pct:  20, cum_pct:  43 }
      128KB: { sample: 8948, pct:  27, cum_pct:  70 }
      256KB: { sample: 4643, pct:  14, cum_pct:  85 }
      512KB: { sample: 4946, pct:  14, cum_pct:  99 }
      1MB: { sample:   2, pct:   0, cum_pct: 100 }
mdt.fir-MDT0001.rename_stats=
rename_stats:
- snapshot_time:  1552345057.683942486
- same_dir       
      4KB: { sample: 285321, pct:  98, cum_pct:  98 }
      8KB: { sample:   0, pct:   0, cum_pct:  98 }
      16KB: { sample: 639, pct:   0, cum_pct:  98 }
      32KB: { sample: 331, pct:   0, cum_pct:  98 }
      64KB: { sample: 1918, pct:   0, cum_pct:  99 }
      128KB: { sample: 180, pct:   0, cum_pct:  99 }
      256KB: { sample:  26, pct:   0, cum_pct:  99 }
      512KB: { sample:   0, pct:   0, cum_pct:  99 }
      1MB: { sample:   0, pct:   0, cum_pct:  99 }
      2MB: { sample: 1623, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample: 182, pct:   1, cum_pct:   1 }
      8KB: { sample:   0, pct:   0, cum_pct:   1 }
      16KB: { sample:   0, pct:   0, cum_pct:   1 }
      32KB: { sample:   0, pct:   0, cum_pct:   1 }
      64KB: { sample: 1194, pct:   6, cum_pct:   7 }
      128KB: { sample:   0, pct:   0, cum_pct:   7 }
      256KB: { sample:   0, pct:   0, cum_pct:   7 }
      512KB: { sample:   0, pct:   0, cum_pct:   7 }
      1MB: { sample:   0, pct:   0, cum_pct:   7 }
      2MB: { sample:   0, pct:   0, cum_pct:   7 }
      4MB: { sample:   0, pct:   0, cum_pct:   7 }
      8MB: { sample: 16000, pct:  92, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 588, pct:   3, cum_pct:   3 }
      8KB: { sample:   0, pct:   0, cum_pct:   3 }
      16KB: { sample: 297, pct:   1, cum_pct:   5 }
      32KB: { sample: 485, pct:   2, cum_pct:   7 }
      64KB: { sample:   6, pct:   0, cum_pct:   7 }
      128KB: { sample: 2104, pct:  12, cum_pct:  20 }
      256KB: { sample: 2181, pct:  12, cum_pct:  32 }
      512KB: { sample: 4139, pct:  23, cum_pct:  56 }
      1MB: { sample: 7576, pct:  43, cum_pct: 100 }
mdt.fir-MDT0003.rename_stats=
rename_stats:
- snapshot_time:  1552345057.684301349
- same_dir       
      4KB: { sample: 392450, pct:  90, cum_pct:  90 }
      8KB: { sample:   0, pct:   0, cum_pct:  90 }
      16KB: { sample: 11513, pct:   2, cum_pct:  93 }
      32KB: { sample: 6056, pct:   1, cum_pct:  95 }
      64KB: { sample: 17452, pct:   4, cum_pct:  99 }
      128KB: { sample: 2634, pct:   0, cum_pct:  99 }
      256KB: { sample: 1281, pct:   0, cum_pct:  99 }
      512KB: { sample:   1, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample: 188, pct:  10, cum_pct:  10 }
      8KB: { sample:   0, pct:   0, cum_pct:  10 }
      16KB: { sample:   0, pct:   0, cum_pct:  10 }
      32KB: { sample: 1656, pct:  89, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 630, pct:  34, cum_pct:  34 }
      8KB: { sample:   0, pct:   0, cum_pct:  34 }
      16KB: { sample: 530, pct:  28, cum_pct:  62 }
      32KB: { sample: 684, pct:  37, cum_pct: 100 }

But what does '16,000 x 8MB crossdir_src' mean, for example?

I'm going to track the stats per job, thanks for the guidance!
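
(For the record, a minimal sketch of how per-job tracking could be done with Lustre jobstats - assuming Slurm is the scheduler and "fir" is the filesystem name, so adjust both as needed:)

$ lctl conf_param fir.sys.jobid_var=SLURM_JOB_ID   # on the MGS: tag client RPCs with the Slurm job ID
$ lctl get_param mdt.*.job_stats                   # on each MDS: per-job op counters, including rename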

Comment by Andreas Dilger [ 12/Mar/19 ]

The "same_dir" stats count the number of renames vs. the size of the directory for local renames, while "crossdir_src" and "crossdir_tgt" count the number of renames vs. the source and target directory sizes. This allows seeing whether renames are typically in large or small directories, and whether they are typically within a single directory or between directories.

It looks like the 16000 renames from the 8MB directory (which is fairly large) are to smaller target directories, or possibly the same target that is growing larger as it is used? Maybe someone moving a bunch of files around?

I see that these stats do not show if the rename is from a local or remote MDT, which would be useful.

Comment by Stephane Thiell [ 12/Mar/19 ]

Ok, thanks!

The issue just occurred again, so I checked exports.*.stats at that time, and only 4 NIDs were showing crossdir_rename operations:

mdt.fir-MDT0000.exports.10.8.29.7@o2ib6.stats=
snapshot_time             1552351684.242347304 secs.nsecs
open                      657 samples [reqs]
close                     617 samples [reqs]
rename                    1 samples [reqs]
getattr                   258 samples [reqs]
setattr                   181 samples [reqs]
getxattr                  32 samples [reqs]
crossdir_rename           1 samples [reqs]

mdt.fir-MDT0001.exports.10.9.105.33@o2ib4.stats=
snapshot_time             1552351912.501630936 secs.nsecs
open                      2099 samples [reqs]
close                     2097 samples [reqs]
mknod                     19 samples [reqs]
unlink                    7 samples [reqs]
mkdir                     1 samples [reqs]
rmdir                     3 samples [reqs]
rename                    8 samples [reqs]
getattr                   1368 samples [reqs]
setattr                   2 samples [reqs]
getxattr                  21 samples [reqs]
statfs                    9 samples [reqs]
sync                      1 samples [reqs]
samedir_rename            4 samples [reqs]
crossdir_rename           4 samples [reqs]

mdt.fir-MDT0002.exports.10.9.0.63@o2ib4.stats=
snapshot_time             1552351684.290638485 secs.nsecs
open                      34 samples [reqs]
close                     30 samples [reqs]
mkdir                     1 samples [reqs]
rename                    1 samples [reqs]
getattr                   1474 samples [reqs]
crossdir_rename           1 samples [reqs]

mdt.fir-MDT0003.exports.10.8.0.66@o2ib6.stats=
snapshot_time             1552351819.298848549 secs.nsecs
open                      3013 samples [reqs]
close                     3011 samples [reqs]
mknod                     16 samples [reqs]
unlink                    18 samples [reqs]
rmdir                     1 samples [reqs]
rename                    799 samples [reqs]
getattr                   5161 samples [reqs]
setattr                   472 samples [reqs]
getxattr                  599 samples [reqs]
samedir_rename            1 samples [reqs]
crossdir_rename           798 samples [reqs]

The last one, 10.8.0.66@o2ib6, is a login node (sh-ln06), so that makes sense: people usually use them to organize/move their files around. The others are compute nodes. But overall, there don't seem to be many occurrences of crossdir_rename. I'll track these more closely now that the MDTs have been restarted.
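
(For reference, a quick-and-dirty way to repeat this check across all MDTs - a rough sketch using clush and grep; exports without any cross-directory renames will only print their header line:)

$ clush -w@mds -b "lctl get_param mdt.*.exports.*.stats" | grep -E 'exports\..*\.stats=|crossdir_rename'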

Comment by Stephane Thiell [ 12/Mar/19 ]

fir-md1-s2 (serving MDT0001 and MDT0003) has started to show the typical backtraces:

[213508.653137]  [<ffffffffc0f930bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[213508.660170]  [<ffffffffc0f93dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[213508.667476]  [<ffffffffc14e54ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[213508.674394]  [<ffffffffc14e5b30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
[213508.681489]  [<ffffffffc14eee35>] mdt_intent_getxattr+0xb5/0x270 [mdt]
[213508.688137]  [<ffffffffc14eba18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[213508.694724]  [<ffffffffc0f79ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[213508.701573]  [<ffffffffc0fa28a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[213508.708763]  [<ffffffffc1029302>] tgt_enqueue+0x62/0x210 [ptlrpc]
[213508.715015]  [<ffffffffc103035a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[213508.722052]  [<ffffffffc0fd492b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[213508.729853]  [<ffffffffc0fd825c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[213508.736279]  [<ffffffffb36c1c31>] kthread+0xd1/0xe0
[213508.741284]  [<ffffffffb3d74c24>] ret_from_fork_nospec_begin+0xe/0x21
[213508.747879]  [<ffffffffffffffff>] 0xffffffffffffffff

I am attaching a dump of the lock table of this MDS as fir-md1-s2-dlmtrace_20190311.log.gz, perhaps that could help. I'll do it again when the filesystem is really hung. No backtrace can be found yet on fir-md1-s1 (MDT0000 and MDT0002) at this time.
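
(For the record, the dump was produced roughly as follows - a sketch of the usual debug-log route; the exact debug mask and trigger value may differ:)

# on the MDS, while the hang is in progress
$ lctl set_param debug=+dlmtrace          # make sure DLM tracing is in the debug mask
$ lctl set_param ldlm.dump_namespaces=1   # dump all ldlm namespaces and their locks into the debug log
$ lctl dk /tmp/$(hostname)-dlmtrace.log   # extract the kernel debug buffer to a file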

Comment by Stephane Thiell [ 12/Mar/19 ]

Sorry, I just realized that the backtraces are not the typical ones, so it might be another issue... In any case, the filesystem is still responsive at the moment, so perhaps just ignore my last comment!

Comment by Lai Siyao [ 12/Mar/19 ]

Hi Stephane, the backtrace shows one rename thread is stuck in mdt_dom_discard_data():

[Fri Mar  8 16:25:51 2019][2076228.060403] mdt02_095       S ffff9ca3d2959040     0 91665      2 0x00000080
[Fri Mar  8 16:25:51 2019][2076228.067686] Call Trace:
[Fri Mar  8 16:25:51 2019][2076228.070340]  [<ffffffffc0ce7c05>] ? ktime_get_seconds+0x25/0x40 [libcfs]
[Fri Mar  8 16:25:51 2019][2076228.077246]  [<ffffffffc1233860>] ? ldlm_completion_ast_async+0x300/0x300 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.085023]  [<ffffffff8ef67c49>] schedule+0x29/0x70
[Fri Mar  8 16:25:51 2019][2076228.090189]  [<ffffffffc12340bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.097271]  [<ffffffffc121af6f>] ? ldlm_lock_enqueue+0x40f/0xa60 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.104361]  [<ffffffffc0e29f7b>] ? class_handle_hash+0xab/0x2f0 [obdclass]
[Fri Mar  8 16:25:51 2019][2076228.111501]  [<ffffffff8e8d67b0>] ? wake_up_state+0x20/0x20
[Fri Mar  8 16:25:51 2019][2076228.117285]  [<ffffffffc1234dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.124642]  [<ffffffffc1233a80>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.132675]  [<ffffffffc1239830>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.140442]  [<ffffffffc19db771>] mdt_dom_discard_data+0x101/0x130 [mdt]
[Fri Mar  8 16:25:51 2019][2076228.147361]  [<ffffffffc1239830>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.155131]  [<ffffffffc1233a80>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.163151]  [<ffffffffc19b1268>] mdt_reint_rename_internal.isra.46+0xb78/0x2760 [mdt]
[Fri Mar  8 16:25:51 2019][2076228.171288]  [<ffffffffc1233a80>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.179306]  [<ffffffffc19bae5b>] mdt_reint_rename_or_migrate.isra.51+0x19b/0x860 [mdt]
[Fri Mar  8 16:25:51 2019][2076228.187496]  [<ffffffffc19a8761>] ? mdt_root_squash+0x21/0x430 [mdt]
[Fri Mar  8 16:25:51 2019][2076228.194067]  [<ffffffffc19a8092>] ? ucred_set_audit_enabled.isra.13+0x22/0x60 [mdt]
[Fri Mar  8 16:25:51 2019][2076228.201906]  [<ffffffffc19bb553>] mdt_reint_rename+0x13/0x20 [mdt]
[Fri Mar  8 16:25:51 2019][2076228.208278]  [<ffffffffc19bb5e3>] mdt_reint_rec+0x83/0x210 [mdt]
[Fri Mar  8 16:25:51 2019][2076228.214503]  [<ffffffffc1998133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[Fri Mar  8 16:25:51 2019][2076228.221215]  [<ffffffffc19a03f4>] ? mdt_thread_info_init+0xa4/0x1e0 [mdt]
[Fri Mar  8 16:25:51 2019][2076228.228194]  [<ffffffffc19a3497>] mdt_reint+0x67/0x140 [mdt]
[Fri Mar  8 16:25:51 2019][2076228.234088]  [<ffffffffc12d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.241188]  [<ffffffffc12aaa51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.248938]  [<ffffffffc0ce7bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[Fri Mar  8 16:25:51 2019][2076228.256204]  [<ffffffffc127592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.264070]  [<ffffffffc12727b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.271034]  [<ffffffff8e8d67c2>] ? default_wake_function+0x12/0x20
[Fri Mar  8 16:25:51 2019][2076228.277498]  [<ffffffff8e8cba9b>] ? __wake_up_common+0x5b/0x90
[Fri Mar  8 16:25:51 2019][2076228.283558]  [<ffffffffc127925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.290034]  [<ffffffffc1278760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[Fri Mar  8 16:25:51 2019][2076228.297622]  [<ffffffff8e8c1c31>] kthread+0xd1/0xe0
[Fri Mar  8 16:25:51 2019][2076228.302685]  [<ffffffff8e8c1b60>] ? insert_kthread_work+0x40/0x40
[Fri Mar  8 16:25:51 2019][2076228.308961]  [<ffffffff8ef74c24>] ret_from_fork_nospec_begin+0xe/0x21
[Fri Mar  8 16:25:51 2019][2076228.315596]  [<ffffffff8e8c1b60>] ? insert_kthread_work+0x40/0x40

This causes all subsequent renames to wait on rename_lock; I'll commit a fix soon.

Comment by Gerrit Updater [ 12/Mar/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34401
Subject: LU-12037 mdt: call mdt_dom_discard_data() after rename unlock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 521781819bc780f201445c1385eed49051756342

Comment by Mikhail Pershin [ 12/Mar/19 ]

Lai, this can already be fixed by patch https://review.whamcloud.com/#/c/34071/ - if you have a reproducer, can you check that?

Comment by Lai Siyao [ 12/Mar/19 ]

Hi Mike, yes, your patch can fully fix the mdt_dom_discard_data() deadlock, but it may take a while to pass review and land, whereas this fix is trivial and can be used now.

Stephane, you can update the MDS with https://review.whamcloud.com/34401 and see whether such deadlocks still happen.
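
(Roughly, pulling that change from Gerrit onto a 2.12.0 tree looks like this - a sketch assuming patch set 1 and the usual Whamcloud repositories:)

$ git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
$ git checkout v2_12_0
$ git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/01/34401/1
$ git cherry-pick FETCH_HEAD
# then rebuild and reinstall the server RPMs as usual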

Comment by Stephane Thiell [ 12/Mar/19 ]

Awesome! Thanks! I just patched both MDS with Lai's patch. We'll see how it goes, I'll keep you posted!

fir-md1-s1: Mar 12 08:56:18 fir-md1-s1 kernel: Lustre: Lustre: Build Version: 2.12.0_lu12037
fir-md1-s2: Mar 12 09:00:05 fir-md1-s2 kernel: Lustre: Lustre: Build Version: 2.12.0_lu12037
Comment by Andreas Dilger [ 12/Mar/19 ]

Lai, can you also work on a separate patch that adds a new mdt.*.enable_remote_rename parameter (default = 1, but can be disabled if needed) that replaces remote renames with an -EXDEV return value, so that the client will do a userspace copy of the files? This should be relatively straightforward, I think.

We may not need this if your patch is successful, but it may still be useful if there are DNE rename problems even after the patch, or as a workaround for other problems in the future.

Comment by Lai Siyao [ 13/Mar/19 ]

Andreas, okay, I'll work on it later.

Comment by Gerrit Updater [ 13/Mar/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34410
Subject: LU-12037 mdt: add option for cross-MDT rename
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a80e3621b65d93f10510fed8b5916d3387c5dc87

Comment by Stephane Thiell [ 14/Mar/19 ]

Unfortunately, https://review.whamcloud.com/34401 doesn't seem to fix our issue. The filesystem had been working fine (even very well) for ~36 hours, but then the issue occurred again at 2:45 AM; the filesystem was unavailable for the rest of the night, and users have already reported I/O issues from jobs and login nodes.

It seems that the first error message was:

Mar 14 02:42:05 fir-md1-s1 kernel: Lustre: 45565:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
                                     req@ffff8b649db26000 x1627076382032624/t0(0) o36->46ae3b3f-393c-afee-8b81-08da5de57baa@10.9.104.68@o2ib4:10/0 lens 608/2888 e 1 to 0 dl 1552556530 ref 2 fl Interpret:/0/0 rc 0/0
Mar 14 02:42:05 fir-md1-s1 kernel: Lustre: 45565:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages

Then MDT0000 and MDT0002 also lost connectivity (?? they are both running on fir-md1-s1):

Mar 14 02:43:20 fir-md1-s1 kernel: Lustre: fir-MDT0000-osp-MDT0002: Connection to fir-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Mar 14 02:43:20 fir-md1-s1 kernel: LustreError: 45572:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1552556510, 90s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MDT0000-osp-MDT0002 lock: ffff8b6929f4f500/0x974d801fa51d72 lrc: 4/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 26 type: IBT flags: 0x1000001000000 nid: local remote: 0x974d801fa51d79 expref: -99 pid: 45572 timeout: 0 lvb_type: 0

Then the usual call trace:

Mar 14 02:45:10 fir-md1-s1 kernel: LNet: Service thread pid 45572 was inactive for 200.23s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 14 02:45:10 fir-md1-s1 kernel: Pid: 45572, comm: mdt01_099 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 14 02:45:10 fir-md1-s1 kernel: Call Trace:
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc0e0b12f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc0e0d9ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc166b7a2>] osp_md_object_lock+0x162/0x2d0 [osp]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc157fe03>] lod_object_lock+0xf3/0x7b0 [lod]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc1600d3e>] mdd_object_lock+0x3e/0xe0 [mdd]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc14702e1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc147087a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc1483abe>] mdt_rename_lock+0xbe/0x4b0 [mdt]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc1486450>] mdt_reint_rename+0x900/0x2810 [mdt]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc148eb33>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc146d133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc1478497>] mdt_reint+0x67/0x140 [mdt]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 14 02:45:10 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 14 02:45:10 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1552556710.45572

We restarted all MDTs this morning at 7:58. After a few minutes, all I/O resumed. I'm attaching full kernel logs for both MDS servers (fir-md1-s1_kern_20190314.log.gz and fir-md1-s2_kern_20190314.log.gz). Note that this was with Lustre build version 2.12.0_lu12037 (at the end of the log files).

Comment by Stephane Thiell [ 14/Mar/19 ]

It happened again just now.

[436173.232484] LNet: Service thread pid 85822 was inactive for 200.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[436173.249505] LNet: Skipped 4 previous similar messages
[436173.254652] Pid: 85822, comm: mdt03_081 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[436173.264496] Call Trace:
[436173.267056]  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[436173.274096]  [<ffffffffc0e09dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[436173.281391]  [<ffffffffc1483c4b>] mdt_rename_lock+0x24b/0x4b0 [mdt]
[436173.287791]  [<ffffffffc1486450>] mdt_reint_rename+0x900/0x2810 [mdt]
[436173.294376]  [<ffffffffc148eb33>] mdt_reint_rec+0x83/0x210 [mdt]
[436173.300513]  [<ffffffffc146d133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[436173.307193]  [<ffffffffc1478497>] mdt_reint+0x67/0x140 [mdt]
[436173.313009]  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[436173.320063]  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[436173.327865]  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[436173.334293]  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
[436173.339298]  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
[436173.345873]  [<ffffffffffffffff>] 0xffffffffffffffff

This time I was able to dump the ldlm lock tables on both MDS while the filesystem was stuck. Attaching files fir-md1-s1_dkdlmtrace_20190314.log.gz and fir-md1-s2_dkdlmtrace_20190314.log.gz

Comment by Stephane Thiell [ 14/Mar/19 ]

The only "recent" mdt_dom_discard_data is this one:

Mar 14 09:32:47 fir-md1-s1 kernel: LNet: Service thread pid 81450 was inactive for 200.67s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 14 09:32:47 fir-md1-s1 kernel: LNet: Skipped 5 previous similar messages
Mar 14 09:32:47 fir-md1-s1 kernel: Pid: 81450, comm: mdt02_079 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 14 09:32:47 fir-md1-s1 kernel: Call Trace:
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffc0e09dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffc14aecc1>] mdt_dom_discard_data+0x101/0x130 [mdt]
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffc1489bc1>] mdt_reint_unlink+0x331/0x1480 [mdt]
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffc148eb33>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffc146d133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffc1478497>] mdt_reint+0x67/0x140 [mdt]
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 14 09:32:47 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 14 09:32:47 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1552581167.81450
Comment by Andreas Dilger [ 15/Mar/19 ]

Did you try the patch https://review.whamcloud.com/34410 yet?

Comment by Stephane Thiell [ 15/Mar/19 ]

Not yet Andreas, we'll use it as a last resort. Last night was OK, so perhaps Lai's patch is helping make the deadlocks less frequent after all (or the workload has changed, I don't know). What about https://review.whamcloud.com/#/c/34071/ ? Would it be interesting for you if we tried this one first, or is that too risky (too big a change)? Thx!

Comment by Patrick Farrell (Inactive) [ 15/Mar/19 ]

Stephane,

Mike and I talked over that patch this morning and we think you should give it a try.  It is a client & server patch, just FYI, not server only (Just pointing that out because 34410 is server only).

Comment by Stephane Thiell [ 15/Mar/19 ]

Hi Patrick,
Ok, we're willing to try it then, if a rolling upgrade of the clients is possible. So can I patch the two MDS now and then the clients progressively over, say, 1-2 weeks?
Also, just to confirm, the OSS don't need the patch, right?
Thanks!!

Comment by Patrick Farrell (Inactive) [ 15/Mar/19 ]

It should be possible - There is a small chance that an un-upgraded client could get an error when writing to a file that is being deleted (final unlink).  But that should be rare, and since you are unlinking the file, the error does not represent a real problem.

And yes, not needed on the OSS.  Just MDS and clients.

Comment by Mikhail Pershin [ 15/Mar/19 ]

Stephane, I've just checked that - it is better to upgrade clients along with the servers, since clients may hit an assertion with that patch; that is in fact why the client changes were made. It seems a client can also be safely updated before the server. Let me check that, and also think about whether it is possible to upgrade the server first, maybe with a modified patch.

Comment by Stephane Thiell [ 15/Mar/19 ]

Thanks much! I haven't updated anything yet, so just let me know.

Comment by Stephane Thiell [ 16/Mar/19 ]

Filesystem blocked again. I'll need to know ASAP whether it's doable and in what order we can apply your patch, Mike; otherwise I might have to fall back to the -EXDEV patch. Interestingly, this time I don't see any mdt_dom_discard_data trace, and the other traces are not only remote renames... I hope it's not yet another issue with 2.12...

Mar 15 18:21:06 fir-md1-s1 kernel: Pid: 87303, comm: mdt01_078 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 15 18:21:06 fir-md1-s1 kernel: Call Trace:
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc0e09dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc14714ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc1471b30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc1472d1a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc147abb5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc1477a18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc0defec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc0e188a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc0e9f302>] tgt_enqueue+0x62/0x210 [ptlrpc]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 15 18:21:06 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 15 18:21:06 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1552699266.87303
Mar 15 18:21:08 fir-md1-s1 kernel: Pid: 87061, comm: mdt00_018 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 15 18:21:08 fir-md1-s1 kernel: Call Trace:
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e09dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1483c4b>] mdt_rename_lock+0x24b/0x4b0 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1486450>] mdt_reint_rename+0x900/0x2810 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc148eb33>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc146d133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1478497>] mdt_reint+0x67/0x140 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 15 18:21:08 fir-md1-s1 kernel: Pid: 87035, comm: mdt00_012 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 15 18:21:08 fir-md1-s1 kernel: Call Trace:
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e09dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc14714ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1471b30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1472d1a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc147abb5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1477a18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0defec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e188a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e9f302>] tgt_enqueue+0x62/0x210 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 15 18:21:08 fir-md1-s1 kernel: Pid: 87043, comm: mdt00_014 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 15 18:21:08 fir-md1-s1 kernel: Call Trace:
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e0b12f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e0d9ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc166b7a2>] osp_md_object_lock+0x162/0x2d0 [osp]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc157fe03>] lod_object_lock+0xf3/0x7b0 [lod]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1600d3e>] mdd_object_lock+0x3e/0xe0 [mdd]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc14702e1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc147087a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1483abe>] mdt_rename_lock+0xbe/0x4b0 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1486450>] mdt_reint_rename+0x900/0x2810 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc148eb33>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc146d133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1478497>] mdt_reint+0x67/0x140 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 15 18:21:08 fir-md1-s1 kernel: Pid: 87216, comm: mdt01_056 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 15 18:21:08 fir-md1-s1 kernel: Call Trace:
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0e09dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc14714ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1471b30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1472d1a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc147abb5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc1477a18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Mar 15 18:21:08 fir-md1-s1 kernel:  [<ffffffffc0defec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
...
Comment by Andreas Dilger [ 16/Mar/19 ]

I guess one comment worth making in this ticket is that your system seems to be using PFL and DoM in addition to DNE. I missed that initially, since it wasn't actually written anywhere in this ticket, but the reference to LU-11964 and the problems with mdt_dom_discard_data() show that this is the case.

Note that the https://review.whamcloud.com/34410 -EXDEV patch just adds a tunable that can turn on/off cross-MDT renames, so it should be relatively safe to apply this patch and have it available, even if you don't immediately intend to use it. To disable remote renames you would run "lctl set_param mdt.*.enable_remote_rename=0" on the MDS nodes (default is enable_remote_rename=1), which essentially reverts the MDS rename behaviour to DNE1 when remote directory rename was not implemented so that cross-MDT renames are handled in userspace like a cross-filesystem rename.

The main drawback is that it could increase data traffic significantly if they are moving large files between /fir/users and /fir/groups that are on different MDTs on a regular basis, but maybe that is acceptable in the short term to avoid MDS hangs until this issue is resolved.
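
(To keep that setting across MDT remounts, something along these lines should work - a sketch using the persistent form of set_param, available since Lustre 2.5:)

$ lctl set_param mdt.*.enable_remote_rename=0      # one-shot, run on each MDS
$ lctl set_param -P mdt.*.enable_remote_rename=0   # persistent, run once on the MGS node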

Comment by Stephane Thiell [ 16/Mar/19 ]

Yes, sorry, we're definitely looking for trouble by using PFL, DNE and DoM (and the ldiskfs backend, not ZFS). Our goal when designing this system was to improve mixed-workload performance and ideally read I/O performance with small files (e.g. AI training sets with millions of files...), without impacting traditional HPC workloads.

On a positive note, we've seen excellent performance results with these features so far on 2.12, especially PFL and read performance with large files (our filesystem recently reached 48GB/s of sustained read bandwidth with jobs using many large wide-striped files of 300-400GB each). DNE has been helping us scan the filesystem faster (up to 16,000 files/sec with Robinhood using 16 threads). DoM's advantage is more difficult to evaluate, but it is probably helping on multiple fronts (likely one of the main advantages is reducing seeks on the OSTs - which are spinning-disk based - for small files); I also know that in our case DoM makes a Linux kernel compilation on Lustre a bit faster (when the filesystem is not loaded).

Anyway, I digress, as our main issue now is to fix this MDT deadlock. So you're right... I have prepared Lai's patch that adds enable_remote_rename (https://review.whamcloud.com/34410). This patch seems to depend on Lai's previous patch (https://review.whamcloud.com/34401), which I added too (we're already running with it). The -EXDEV patch doesn't apply cleanly, but I was able to fix it up for 2.12 easily; it looks like an mdt_readonly flag was added to master that doesn't exist in 2.12. So I'm ready to switch to this new patched version on our 2 MDS, either at the next occurrence of the issue or maybe later tonight. I'll keep you posted. Thanks a lot (all of you) for the help, as always...

Comment by Mikhail Pershin [ 16/Mar/19 ]

Stephane, for now the only way to apply patch https://review.whamcloud.com/#/c/34071/ is the 'client first' approach. I am still thinking about how to resolve this in a 'server first' way; that will be needed for compatibility anyway, but I have no solution right now.

As for the other stack traces you've seen - I think I saw them as well, the rename and getattr stacks, so that could be a different issue with rename.

Comment by Stephane Thiell [ 17/Mar/19 ]

Ok thanks Mike! For now, in order to buy more time and avoid (I hope!) the MDS hangs, I have just switched off remote rename. Lai's patch seems to work as expected:

1. remote_rename switched off on all 4 MDTs:

$ clush -w@mds lctl get_param mdt.*.enable_remote_rename
fir-md1-s2: mdt.fir-MDT0001.enable_remote_rename=0
fir-md1-s2: mdt.fir-MDT0003.enable_remote_rename=0
fir-md1-s1: mdt.fir-MDT0000.enable_remote_rename=0
fir-md1-s1: mdt.fir-MDT0002.enable_remote_rename=0

2. On the cluster, an example with mv between two dirs that are not on the same MDT:

$ pwd
/fir/groups/ruthm
$ lfs getdirstripe .
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
$ lfs getdirstripe /fir/users/kilian/
lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none

$ ls -l remotetest
total 0
-rw-r--r-- 1 root ruthm 1048576000 Mar 17 09:07 bar

$ time strace -e renameat2 -Tf mv remotetest /fir/users/kilian/
renameat2(AT_FDCWD, "remotetest", AT_FDCWD, "/fir/users/kilian/remotetest", 0) = -1 EXDEV (Invalid cross-device link) <0.000956>
renameat2(AT_FDCWD, "remotetest/bar", AT_FDCWD, "/fir/users/kilian/remotetest/bar", 0) = -1 EXDEV (Invalid cross-device link) <0.008032>
+++ exited with 0 +++

real	0m1.133s
user	0m0.088s
sys	0m0.961s
Comment by Mikhail Pershin [ 18/Mar/19 ]

Stephane, please hold on with patch https://review.whamcloud.com/#/c/34071/; I am concerned about one thing with it and have to check that first.

Comment by Stephane Thiell [ 19/Mar/19 ]

Ok Mike, I won't use it, no worries. So far it's working without major issues (with -EXDEV). The only backtrace I noticed on an MDS was this one last night:

Mar 19 01:48:14 fir-md1-s1 kernel: Lustre: fir-MDT0000: haven't heard from client 4af51c32-5433-8946-8fc2-4f1ad9c274b4 (at 10.8.14.8@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8b6c113f1c00, cur 1552985294 expire 1552985144 last 1552985067
Mar 19 01:48:32 fir-md1-s1 kernel: LNet: Service thread pid 93052 was inactive for 200.35s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 19 01:48:32 fir-md1-s1 kernel: Pid: 93052, comm: mdt00_041 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 19 01:48:32 fir-md1-s1 kernel: Call Trace:
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc0dea890>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc0da8185>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc0dc986b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc14dccf9>] mdt_do_glimpse+0x1e9/0x4c0 [mdt]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc14dd573>] mdt_glimpse_enqueue+0x3d3/0x4f0 [mdt]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc148e4ef>] mdt_intent_glimpse+0x1f/0x30 [mdt]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc14a6a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc0da8ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc0dd18a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc0e58302>] tgt_enqueue+0x62/0x210 [ptlrpc]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc0e5f35a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc0e0392b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffc0e0725c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 19 01:48:32 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Mar 19 01:48:32 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1552985312.93052
Mar 19 01:48:35 fir-md1-s1 kernel: LustreError: 92952:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.14.8@o2ib6) failed to reply to blocking AST (req@ffff8b8cd5493c00 x1628270300268816 status 0 rc -110), evict it ns: mdt-fir-MDT0002_UUID lock: ffff8b68b6ee1d40/0xe9a38a334924c658 lrc: 4/0,0 mode: PR/PR res: [0x2c0001679:0x2:0x0].0x0 bits 0x13/0x0 rrc: 22 type: IBT flags: 0x60200400000020 nid: 10.8.14.8@o2ib6 remote: 0x693efb2d8519c447 expref: 347 pid: 92798 timeout: 830253 lvb_type: 0
Mar 19 01:48:35 fir-md1-s1 kernel: LustreError: 138-a: fir-MDT0002: A client on nid 10.8.14.8@o2ib6 was evicted due to a lock blocking callback time out: rc -110

But I don't think it had an impact on the whole filesystem; we have no reports of issues at this point. If it is confirmed that DoM+DNE2 on 2.12 is usable in production this way, it would be nice to have both of Lai's patches included in 2.12.1.

Comment by Andreas Dilger [ 19/Mar/19 ]

I'd consider the -EXDEV patch more of a workaround than a long-term solution. For large files this can have a significant overhead in terms of data copies. I think anything we could do with -EXDEV to punt data copies to userspace (e.g. heuristics to copy small files between MDTs instead of creating remote entries) could also be done inside Lustre.

Comment by Stephane Thiell [ 19/Mar/19 ]

Ok, but stability is paramount at this point to restore our users' trust. I really hope Mike will find out what's wrong with the current implementation, because I agree that cross-MDT rename is a great feature. Thanks!

Comment by Andreas Dilger [ 20/Mar/19 ]

Definitely yes, getting back to a stable working environment is critical. I'm not suggesting that the -EXDEV patch shouldn't land, since it is configurable, I just wanted it to be clear that landing that patch is not the conclusion of this issue.

Comment by Stephane Thiell [ 20/Mar/19 ]

Yep, I'm totally in line with you Andreas.

Additionally, I wanted to report more issues from last night; I don't know if they are directly related, so I opened LU-12091 to provide more details. They seemed to have impacted only a few clients.

Comment by Stephane Thiell [ 26/Mar/19 ]

After about a week, our two 2.12 MDS seem stable with the two patches from Lai, including -EXDEV with enable_remote_rename=0. Not ideal, but stable! Fingers crossed!

Comment by Stephane Thiell [ 27/Mar/19 ]

ARGH! I spoke too soon... The filesystem was completely stuck about 30 minutes ago, and I found a lot of call traces involving mdt_reint_rename on both MDS.

As a reminder, we are running with the -EXDEV patch and have set enable_remote_rename=0 on all MDTs:

fir-md1-s2: mdt.fir-MDT0001.enable_remote_rename=0
fir-md1-s2: mdt.fir-MDT0003.enable_remote_rename=0
fir-md1-s1: mdt.fir-MDT0000.enable_remote_rename=0
fir-md1-s1: mdt.fir-MDT0002.enable_remote_rename=0

So I was able (I hope) to dump the ldlm locks on both servers while the filesystem was stuck:

  • fir-md1-s1_dkdlmtrace_20190327.log.gz
  • fir-md1-s2_dkdlmtrace_20190327.log.gz

And kernel logs are also attached:

  • fir-md1-s1-kern_20190327.log.gz
  • fir-md1-s2-kern_20190327.log.gz

A typical call trace is:

Mar 27 15:06:33 fir-md1-s1 kernel: Pid: 6518, comm: mdt03_076 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 27 15:06:33 fir-md1-s1 kernel: Call Trace:
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc0dc20bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc0dc412f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc0dc69ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc169a7a2>] osp_md_object_lock+0x162/0x2d0 [osp]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc15aee03>] lod_object_lock+0xf3/0x7b0 [lod]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc162fd3e>] mdd_object_lock+0x3e/0xe0 [mdd]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc149f2f1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc149f88a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc14b2ace>] mdt_rename_lock+0xbe/0x4b0 [mdt]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc14b5470>] mdt_reint_rename+0x910/0x28f0 [mdt]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc14bdc53>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc149c143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc14a74a7>] mdt_reint+0x67/0x140 [mdt]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc0e5f35a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc0e0392b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffc0e0725c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Mar 27 15:06:33 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Comment by Andreas Dilger [ 28/Mar/19 ]

It looks like there may still be some improvements needed for the patch https://review.whamcloud.com/34410 "LU-12037 mdt: add option for cross-MDT rename", since it is still taking the "Big Filesystem Lock" (BFL) on MDT0000 for every rename. The -EXDEV return could be done before any locking to reduce contention on the MDS and avoid whatever deadlock (or overload and timeout) is being seen in that code.

Separately, it should be possible to completely avoid the use of BFL for renaming regular files within a single directory on the same MDS, and between different directories on the same MDS by locking only the directory itself. According to the rename_stats data previously posted in comment-243700, there were 1107408 same-directory renames vs. only 52629 cross-directory renames (4.75%). While this is not conclusive in terms of local vs. remote MDT or file vs. directory, it is likely that most same-directory renames are regular files on the same MDT, unless striped directories are widely used, so we could avoid the BFL for over 95% of renames, and possibly more. This improvement is being tracked via LU-12125.

Comment by Lai Siyao [ 28/Mar/19 ]

Stephane, I updated patch https://review.whamcloud.com/34410; next time you see a deadlock, it would be better to dump process backtraces on all MDS nodes to help identify which process is the root of the deadlock.

Comment by Stephane Thiell [ 28/Mar/19 ]

Thanks much to both of you! I'm glad that you were able to identify the problem there.

Andreas, striped directories are definitely not widely used in our case (and we plan to keep it that way). And I'll follow LU-12125, sounds interesting!

Lai, ok, I'll make sure to dump all tasks with sysrq next time. I'll keep an eye on the patch and update our MDS once it has positive reviews.
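
(For completeness, the all-task dump can be triggered with the standard kernel sysrq mechanism - a sketch, run as root, assuming sysrq is enabled or can be enabled:)

# on each MDS, while the hang is in progress
$ echo 1 > /proc/sys/kernel/sysrq         # enable sysrq if it isn't already
$ echo t > /proc/sysrq-trigger            # dump backtraces of all tasks to the kernel log
$ dmesg > /tmp/$(hostname)-sysrq-t.log    # capture the result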

Comment by Gerrit Updater [ 01/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34401/
Subject: LU-12037 mdt: call mdt_dom_discard_data() after rename unlock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 480a221aa7e76f001288c8b1127e14e7922ce497

Comment by Stephane Thiell [ 01/Apr/19 ]

We've upgraded the two MDS of Fir with Lai's latest patch 5 from
https://review.whamcloud.com/#/c/34410/
and we set mdt.*.enable_remote_rename=0

However, after a few hours of production, I'm still seeing a few crossdir renames in the stats on MDT0000 and MDT0002.

$ clush -w@mds -b 'lctl get_param mdt.*.rename_stats'
---------------
fir-md1-s1
---------------
mdt.fir-MDT0000.rename_stats=
rename_stats:
- snapshot_time:  1554133203.915113857
- same_dir       
      4KB: { sample: 12405, pct:  98, cum_pct:  98 }
      8KB: { sample:   0, pct:   0, cum_pct:  98 }
      16KB: { sample:  69, pct:   0, cum_pct:  99 }
      32KB: { sample:   2, pct:   0, cum_pct:  99 }
      64KB: { sample:  78, pct:   0, cum_pct:  99 }
      128KB: { sample:  34, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample:  26, pct:  89, cum_pct:  89 }
      8KB: { sample:   0, pct:   0, cum_pct:  89 }
      16KB: { sample:   0, pct:   0, cum_pct:  89 }
      32KB: { sample:   3, pct:  10, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample:   9, pct:  31, cum_pct:  31 }
      8KB: { sample:   0, pct:   0, cum_pct:  31 }
      16KB: { sample:   0, pct:   0, cum_pct:  31 }
      32KB: { sample:  20, pct:  68, cum_pct: 100 }
mdt.fir-MDT0002.rename_stats=
rename_stats:
- snapshot_time:  1554133203.915166095
- same_dir       
      4KB: { sample: 2110, pct:   1, cum_pct:   1 }
      8KB: { sample:   0, pct:   0, cum_pct:   1 }
      16KB: { sample: 102480, pct:  91, cum_pct:  92 }
      32KB: { sample: 7875, pct:   6, cum_pct:  99 }
      64KB: { sample:   1, pct:   0, cum_pct:  99 }
      128KB: { sample: 102, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample:  14, pct: 100, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample:  12, pct:  85, cum_pct:  85 }
      8KB: { sample:   0, pct:   0, cum_pct:  85 }
      16KB: { sample:   0, pct:   0, cum_pct:  85 }
      32KB: { sample:   0, pct:   0, cum_pct:  85 }
      64KB: { sample:   0, pct:   0, cum_pct:  85 }
      128KB: { sample:   0, pct:   0, cum_pct:  85 }
      256KB: { sample:   2, pct:  14, cum_pct: 100 }
---------------
fir-md1-s2
---------------
mdt.fir-MDT0001.rename_stats=
rename_stats:
- snapshot_time:  1554133410.893851229
- same_dir       
      4KB: { sample: 1593, pct:  77, cum_pct:  77 }
      8KB: { sample:   0, pct:   0, cum_pct:  77 }
      16KB: { sample: 469, pct:  22, cum_pct: 100 }
mdt.fir-MDT0003.rename_stats=
rename_stats:
- snapshot_time:  1554133410.893892657
- same_dir       
      4KB: { sample: 173, pct:  12, cum_pct:  12 }
      8KB: { sample:   0, pct:   0, cum_pct:  12 }
      16KB: { sample:   0, pct:   0, cum_pct:  12 }
      32KB: { sample: 336, pct:  24, cum_pct:  36 }
      64KB: { sample: 860, pct:  62, cum_pct:  99 }
      128KB: { sample:   5, pct:   0, cum_pct:  99 }
      256KB: { sample:   3, pct:   0, cum_pct: 100 }

Just FYI, on Fir's MDS, we're now running 2.12.0 with the following patches:

  • LU-12065 lnd: increase CQ entries
  • LU-12037 mdt: add option for cross-MDT rename (patchset 5 ported to 2.12.0)
  • LU-12037 mdt: call mdt_dom_discard_data() after rename unlock (the one that just landed in master)

and the following config changes:

  • osc.*.idle_timeout=0 (LU-11976)
  • mdt.*.enable_remote_rename=0 (-EXDEV)

and a workaround to minimize LU-11989:

options ptlrpc ldlm_enqueue_min=30
options ptlrpc at_max=20
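
(These are ptlrpc module parameters, so they live in a modprobe configuration file on the servers and take effect the next time the Lustre modules are loaded - a sketch assuming the usual location:)

# /etc/modprobe.d/lustre.conf on the MDS nodes
options ptlrpc ldlm_enqueue_min=30
options ptlrpc at_max=20
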
Comment by Andreas Dilger [ 02/Apr/19 ]

Note that patch https://review.whamcloud.com/34410 "LU-12037 mdt: add option for cross-MDT rename" relates to disabling renames to a remote MDT. It does not disable renames between directories on the same MDT (which would typically be the case for renames within the per-user or per-group directories). Even in a striped directory, some fraction of renames may be on the same MDT.

Comment by Stephane Thiell [ 03/Apr/19 ]

Ok Andreas, thanks!

 

We just got a new occurrence of a hung filesystem, even with the patches... I was able to dump the kernel tasks and also the Lustre ldlm locks, using the commands that Patrick provided before, while everything was hung.

Looks like the issue started at around 15:45 today.

There is a first backtrace in mdt_dom_discard_data, then others with mdt_remote_object_lock:

[Wed Apr  3 15:45:54 2019][2176460.674055] Pid: 103934, comm: mdt00_025 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018^M
[Wed Apr  3 15:45:54 2019][2176460.684073] Call Trace:^M
[Wed Apr  3 15:45:54 2019][2176460.686712]  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]^M
[Wed Apr  3 15:45:54 2019][2176460.693843]  [<ffffffffc0e09dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]^M
[Wed Apr  3 15:45:54 2019][2176460.701226]  [<ffffffffc14aeeb1>] mdt_dom_discard_data+0x101/0x130 [mdt]^M
[Wed Apr  3 15:45:54 2019][2176460.708153]  [<ffffffffc14864c9>] mdt_reint_rename+0x9d9/0x2870 [mdt]^M
[Wed Apr  3 15:45:54 2019][2176460.714811]  [<ffffffffc148ec13>] mdt_reint_rec+0x83/0x210 [mdt]^M
[Wed Apr  3 15:45:54 2019][2176460.721056]  [<ffffffffc146d143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]^M
[Wed Apr  3 15:45:54 2019][2176460.727802]  [<ffffffffc14784a7>] mdt_reint+0x67/0x140 [mdt]^M
[Wed Apr  3 15:45:54 2019][2176460.733699]  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
[Wed Apr  3 15:45:54 2019][2176460.740845]  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
[Wed Apr  3 15:45:54 2019][2176460.748764]  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]^M
[Wed Apr  3 15:45:54 2019][2176460.755270]  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0^M
[Wed Apr  3 15:45:54 2019][2176460.760364]  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21^M
[Wed Apr  3 15:45:54 2019][2176460.767033]  [<ffffffffffffffff>] 0xffffffffffffffff^M
[Wed Apr  3 15:45:54 2019][2176460.772236] LustreError: dumping log to /tmp/lustre-log.1554331615.103934^M
[Wed Apr  3 15:45:55 2019][2176462.177573] LNet: Service thread pid 104476 was inactive for 201.47s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:^M
[Wed Apr  3 15:45:55 2019][2176462.194771] Pid: 104476, comm: mdt01_082 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018^M
[Wed Apr  3 15:45:55 2019][2176462.204769] Call Trace:^M
[Wed Apr  3 15:45:55 2019][2176462.207413]  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]^M
[Wed Apr  3 15:45:55 2019][2176462.214536]  [<ffffffffc0e0b12f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]^M
[Wed Apr  3 15:45:55 2019][2176462.221854]  [<ffffffffc0e0d9ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]^M
[Wed Apr  3 15:45:56 2019][2176462.228720]  [<ffffffffc166b7a2>] osp_md_object_lock+0x162/0x2d0 [osp]^M
[Wed Apr  3 15:45:56 2019][2176462.235469]  [<ffffffffc157fe03>] lod_object_lock+0xf3/0x7b0 [lod]^M
[Wed Apr  3 15:45:56 2019][2176462.241859]  [<ffffffffc1600d3e>] mdd_object_lock+0x3e/0xe0 [mdd]^M
[Wed Apr  3 15:45:56 2019][2176462.248179]  [<ffffffffc14702f1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]^M
[Wed Apr  3 15:45:56 2019][2176462.255619]  [<ffffffffc147088a>] mdt_remote_object_lock+0x2a/0x30 [mdt]^M
[Wed Apr  3 15:45:56 2019][2176462.262563]  [<ffffffffc1483a5e>] mdt_rename_lock+0xbe/0x4b0 [mdt]^M
[Wed Apr  3 15:45:56 2019][2176462.268959]  [<ffffffffc1485f30>] mdt_reint_rename+0x440/0x2870 [mdt]^M
[Wed Apr  3 15:45:56 2019][2176462.275630]  [<ffffffffc148ec13>] mdt_reint_rec+0x83/0x210 [mdt]^M
[Wed Apr  3 15:45:56 2019][2176462.281872]  [<ffffffffc146d143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]^M
[Wed Apr  3 15:45:56 2019][2176462.288619]  [<ffffffffc14784a7>] mdt_reint+0x67/0x140 [mdt]^M
[Wed Apr  3 15:45:56 2019][2176462.294502]  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
[Wed Apr  3 15:45:56 2019][2176462.301642]  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
[Wed Apr  3 15:45:56 2019][2176462.309530]  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]^M
[Wed Apr  3 15:45:56 2019][2176462.316044]  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0^M
[Wed Apr  3 15:45:56 2019][2176462.321150]  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21^M
[Wed Apr  3 15:45:56 2019][2176462.327814]  [<ffffffffffffffff>] 0xffffffffffffffff^M

Files attached:

  • fir-md1-s1-console-20190403.gz full fs lifecycle console logs of MDS with MDT0000 and MDT0002
  • fir-md1-s2-console-20190403.gz full fs lifecycle console logs of MDS with MDT0001 and MDT0003
  • fir-md1-s1_20190403-dk.log.gz ldlm.dump_namespaces for fir-md1-s1
  • fir-md1-s2_20190403-dk.log.gz ldlm.dump_namespaces for fir-md1-s2
     

 

Comment by Stephane Thiell [ 03/Apr/19 ]

Perhaps an issue with the unlink done as part of rename() (when the target exists) - which should be legit.

 mdt_dom_discard_data is called by mdt_reint_rename in that case:

[Wed Apr  3 15:45:54 2019][2176460.701226]  [<ffffffffc14aeeb1>] mdt_dom_discard_data+0x101/0x130 [mdt]^M
[Wed Apr  3 15:45:54 2019][2176460.708153]  [<ffffffffc14864c9>] mdt_reint_rename+0x9d9/0x2870 [mdt]^M 
Comment by Gerrit Updater [ 08/Apr/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34615
Subject: LU-12037 mdt: call mdt_dom_discard_data() after rename unlock
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: e6a9a9ed6bc1dc4ee92fdf75c03da2dae45eeab3

Comment by Gerrit Updater [ 08/Apr/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34616
Subject: LU-12037 mdt: add option for cross-MDT rename
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 4f9b5eead1616db8cb1878194b6c2131b8a2aa65

Comment by Stephane Thiell [ 08/Apr/19 ]

Issue occurring again even with the EXDEV patch:

[2591924.489534] Lustre: fir-MDT0003: Connection restored to  (at 10.8.18.11@o2ib6)
[2591924.496940] Lustre: Skipped 1104 previous similar messages
[2592014.258583] LNet: Service thread pid 13476 was inactive for 200.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[2592014.275692] LNet: Skipped 9 previous similar messages
[2592014.280925] Pid: 13476, comm: mdt01_039 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[2592014.290836] Call Trace:
[2592014.293479]  [<ffffffffc0f930bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[2592014.300592]  [<ffffffffc0f93dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[2592014.307960]  [<ffffffffc14e64bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[2592014.314971]  [<ffffffffc14e6b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
[2592014.322145]  [<ffffffffc14e6f67>] mdt_object_lock_try+0x27/0xb0 [mdt]
[2592014.328828]  [<ffffffffc14e86a7>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt]
[2592014.335996]  [<ffffffffc14efbc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[2592014.342746]  [<ffffffffc14eca28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[2592014.349397]  [<ffffffffc0f79ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[2592014.356338]  [<ffffffffc0fa28a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[2592014.363618]  [<ffffffffc1029302>] tgt_enqueue+0x62/0x210 [ptlrpc]
[2592014.369970]  [<ffffffffc103035a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[2592014.377103]  [<ffffffffc0fd492b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2592014.384990]  [<ffffffffc0fd825c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[2592014.391532]  [<ffffffffb36c1c31>] kthread+0xd1/0xe0
[2592014.396620]  [<ffffffffb3d74c24>] ret_from_fork_nospec_begin+0xe/0x21
[2592014.403284]  [<ffffffffffffffff>] 0xffffffffffffffff

 

[root@fir-md1-s2 ~]# lctl get_param ldlm.namespaces.mdt-fir-MDT000*_UUID.lock_count
ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=2813920
ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_count=1110534
[root@fir-md1-s2 ~]# lctl get_param mdt.*.rename_stats
mdt.fir-MDT0001.rename_stats=
rename_stats:
- snapshot_time:  1554747505.901862704
- same_dir       
      4KB: { sample: 440634, pct:  89, cum_pct:  89 }
      8KB: { sample:   0, pct:   0, cum_pct:  89 }
      16KB: { sample: 7257, pct:   1, cum_pct:  91 }
      32KB: { sample: 5923, pct:   1, cum_pct:  92 }
      64KB: { sample: 32738, pct:   6, cum_pct:  99 }
      128KB: { sample: 3016, pct:   0, cum_pct:  99 }
      256KB: { sample: 1779, pct:   0, cum_pct:  99 }
      512KB: { sample:   0, pct:   0, cum_pct:  99 }
      1MB: { sample:   0, pct:   0, cum_pct:  99 }
      2MB: { sample:   0, pct:   0, cum_pct:  99 }
      4MB: { sample:   2, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample: 609, pct:   1, cum_pct:   1 }
      8KB: { sample:   0, pct:   0, cum_pct:   1 }
      16KB: { sample: 2925, pct:   6, cum_pct:   8 }
      32KB: { sample: 11090, pct:  25, cum_pct:  33 }
      64KB: { sample: 25807, pct:  59, cum_pct:  93 }
      128KB: { sample:   0, pct:   0, cum_pct:  93 }
      256KB: { sample: 3000, pct:   6, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 11631, pct:  26, cum_pct:  26 }
      8KB: { sample:   0, pct:   0, cum_pct:  26 }
      16KB: { sample: 11490, pct:  26, cum_pct:  53 }
      32KB: { sample: 18106, pct:  41, cum_pct:  94 }
      64KB: { sample: 1587, pct:   3, cum_pct:  98 }
      128KB: { sample: 617, pct:   1, cum_pct: 100 }
mdt.fir-MDT0003.rename_stats=
rename_stats:
- snapshot_time:  1554747505.902158388
- same_dir       
      4KB: { sample: 69919, pct:  28, cum_pct:  28 }
      8KB: { sample:   0, pct:   0, cum_pct:  28 }
      16KB: { sample: 19406, pct:   7, cum_pct:  36 }
      32KB: { sample: 27998, pct:  11, cum_pct:  48 }
      64KB: { sample: 27143, pct:  11, cum_pct:  59 }
      128KB: { sample: 9504, pct:   3, cum_pct:  63 }
      256KB: { sample: 6856, pct:   2, cum_pct:  66 }
      512KB: { sample: 5559, pct:   2, cum_pct:  68 }
      1MB: { sample: 7179, pct:   2, cum_pct:  71 }
      2MB: { sample: 14867, pct:   6, cum_pct:  77 }
      4MB: { sample: 31429, pct:  12, cum_pct:  90 }
      8MB: { sample: 22948, pct:   9, cum_pct: 100 }
- crossdir_src   
      4KB: { sample: 193, pct:   2, cum_pct:   2 }
      8KB: { sample:   0, pct:   0, cum_pct:   2 }
      16KB: { sample:   0, pct:   0, cum_pct:   2 }
      32KB: { sample: 1877, pct:  25, cum_pct:  28 }
      64KB: { sample: 4301, pct:  58, cum_pct:  86 }
      128KB: { sample:   0, pct:   0, cum_pct:  86 }
      256KB: { sample: 986, pct:  13, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 976, pct:  13, cum_pct:  13 }
      8KB: { sample:   0, pct:   0, cum_pct:  13 }
      16KB: { sample: 985, pct:  13, cum_pct:  26 }
      32KB: { sample: 2449, pct:  33, cum_pct:  59 }
      64KB: { sample: 2449, pct:  33, cum_pct:  93 }
      128KB: { sample: 498, pct:   6, cum_pct: 100 }

I'll restart the MDTs

Comment by Stephane Thiell [ 10/Apr/19 ]

Patrick,

We just applied https://review.whamcloud.com/#/c/34410/6 on both of our MDS nodes on Fir (note to self: 2.12.0.pl4).

It will take some time to confirm it helps, I think.

# clush -w@mds lctl get_param mdt.*.enable_remote_rename
fir-md1-s1: mdt.fir-MDT0000.enable_remote_rename=0
fir-md1-s1: mdt.fir-MDT0002.enable_remote_rename=0
fir-md1-s2: mdt.fir-MDT0001.enable_remote_rename=0
fir-md1-s2: mdt.fir-MDT0003.enable_remote_rename=0

# clush -w@mds -b  lctl get_param mdt.*.rename_stats
---------------
fir-md1-s1
---------------
mdt.fir-MDT0000.rename_stats=
rename_stats:
- snapshot_time:  1554907784.182156441
- same_dir       
      4KB: { sample: 249, pct:  99, cum_pct:  99 }
      8KB: { sample:   0, pct:   0, cum_pct:  99 }
      16KB: { sample:   0, pct:   0, cum_pct:  99 }
      32KB: { sample:   2, pct:   0, cum_pct: 100 }
mdt.fir-MDT0002.rename_stats=
rename_stats:
- snapshot_time:  1554907784.182205333
---------------
fir-md1-s2
---------------
mdt.fir-MDT0001.rename_stats=
rename_stats:
- snapshot_time:  1554907784.181266117
- same_dir       
      4KB: { sample:   5, pct:  27, cum_pct:  27 }
      8KB: { sample:   0, pct:   0, cum_pct:  27 }
      16KB: { sample:   9, pct:  50, cum_pct:  77 }
      32KB: { sample:   4, pct:  22, cum_pct: 100 }
mdt.fir-MDT0003.rename_stats=
rename_stats:
- snapshot_time:  1554907784.181311011
- same_dir       
      4KB: { sample:  33, pct:  27, cum_pct:  27 }
      8KB: { sample:   0, pct:   0, cum_pct:  27 }
      16KB: { sample:   2, pct:   1, cum_pct:  29 }
      32KB: { sample:   5, pct:   4, cum_pct:  33 }
      64KB: { sample:  80, pct:  66, cum_pct: 100 }
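
To see whether the crossdir counters keep growing once the tunable is in place, a simple before/after snapshot is enough. This is only a sketch, reusing the @mds clush group from the commands above; the interval is arbitrary.

# Sketch: take two rename_stats snapshots ten minutes apart and compare the
# crossdir_src/crossdir_tgt histograms.
clush -w@mds -b 'lctl get_param mdt.*.rename_stats' > /tmp/rename_stats.1
sleep 600
clush -w@mds -b 'lctl get_param mdt.*.rename_stats' > /tmp/rename_stats.2
diff -u /tmp/rename_stats.1 /tmp/rename_stats.2
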
Comment by Patrick Farrell (Inactive) [ 10/Apr/19 ]

Thanks Stephane - please keep us updated, particularly if you still see any cross-MDT renames...

Comment by Stephane Thiell [ 11/Apr/19 ]

Doesn't seem to work and on top of that, we got another MDS deadlock (I opened LU-12178).

Stats for MDT0000 and MDT0002 (they were on the MDS that didn't crash):

mdt.fir-MDT0000.rename_stats=
rename_stats:
- snapshot_time:  1554957753.619746357
- same_dir       
      4KB: { sample: 101272, pct:  77, cum_pct:  77 }
      8KB: { sample:   0, pct:   0, cum_pct:  77 }
      16KB: { sample: 3282, pct:   2, cum_pct:  80 }
      32KB: { sample: 7074, pct:   5, cum_pct:  85 }
      64KB: { sample: 6166, pct:   4, cum_pct:  90 }
      128KB: { sample: 11437, pct:   8, cum_pct:  99 }
      256KB: { sample: 860, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample: 283, pct:  56, cum_pct:  56 }
      8KB: { sample:   0, pct:   0, cum_pct:  56 }
      16KB: { sample: 153, pct:  30, cum_pct:  87 }
      32KB: { sample:  62, pct:  12, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 454, pct:  91, cum_pct:  91 }
      8KB: { sample:   0, pct:   0, cum_pct:  91 }
      16KB: { sample:   0, pct:   0, cum_pct:  91 }
      32KB: { sample:  44, pct:   8, cum_pct: 100 }
mdt.fir-MDT0001.rename_stats=
rename_stats:
- snapshot_time:  1554957753.619960398
- same_dir       
      4KB: { sample: 309, pct:  38, cum_pct:  38 }
      8KB: { sample:   0, pct:   0, cum_pct:  38 }
      16KB: { sample:  41, pct:   5, cum_pct:  43 }
      32KB: { sample:  41, pct:   5, cum_pct:  48 }
      64KB: { sample: 414, pct:  51, cum_pct: 100 }
- crossdir_src   
      4KB: { sample:   2, pct:   0, cum_pct:   0 }
      8KB: { sample:   0, pct:   0, cum_pct:   0 }
      16KB: { sample:   0, pct:   0, cum_pct:   0 }
      32KB: { sample: 421, pct:  99, cum_pct:  99 }
      64KB: { sample:   1, pct:   0, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 104, pct:  24, cum_pct:  24 }
      8KB: { sample:   0, pct:   0, cum_pct:  24 }
      16KB: { sample: 109, pct:  25, cum_pct:  50 }
      32KB: { sample: 210, pct:  49, cum_pct:  99 }
      64KB: { sample:   1, pct:   0, cum_pct: 100 }
mdt.fir-MDT0002.rename_stats=
rename_stats:
- snapshot_time:  1554957753.620148602
- same_dir       
      4KB: { sample: 43443, pct:  75, cum_pct:  75 }
      8KB: { sample:   0, pct:   0, cum_pct:  75 }
      16KB: { sample: 10513, pct:  18, cum_pct:  94 }
      32KB: { sample: 1518, pct:   2, cum_pct:  96 }
      64KB: { sample: 1284, pct:   2, cum_pct:  99 }
      128KB: { sample: 488, pct:   0, cum_pct:  99 }
      256KB: { sample:   0, pct:   0, cum_pct:  99 }
      512KB: { sample:   3, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample:  65, pct:   1, cum_pct:   1 }
      8KB: { sample:   0, pct:   0, cum_pct:   1 }
      16KB: { sample: 184, pct:   3, cum_pct:   4 }
      32KB: { sample:   0, pct:   0, cum_pct:   4 }
      64KB: { sample: 3098, pct:  60, cum_pct:  65 }
      128KB: { sample: 1782, pct:  34, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 328, pct:   6, cum_pct:   6 }
      8KB: { sample:   0, pct:   0, cum_pct:   6 }
      16KB: { sample: 390, pct:   7, cum_pct:  13 }
      32KB: { sample: 2263, pct:  44, cum_pct:  58 }
      64KB: { sample: 2148, pct:  41, cum_pct: 100 }
Comment by Patrick Farrell (Inactive) [ 11/Apr/19 ]

Stephane,

That's confusing...  Can you double check you're running the patch on the MDS?  If this is indeed happening, then I guess there must be another gap in the patch logic.

Comment by Stephane Thiell [ 11/Apr/19 ]

I think so; I just checked, and I did pick the correct patch set (6).
I had to port it to 2.12.0, but it was not very difficult. This is the patch I'm using in my 2.12.0.pl4 on both of Fir's MDS nodes:
https://gist.github.com/thiell/9d9f9046660dc9748cd695e64d1c6f21

[root@fir-md1-s1 ~]# cat /sys/fs/lustre/version 
2.12.0.pl4
[root@fir-md1-s1 ~]# lctl get_param mdt.*.enable_remote_rename
mdt.fir-MDT0000.enable_remote_rename=0
mdt.fir-MDT0001.enable_remote_rename=0
mdt.fir-MDT0002.enable_remote_rename=0
mdt.fir-MDT0003.enable_remote_rename=0
[root@fir-md1-s1 ~]# lctl get_param mdt.*.rename_stats
mdt.fir-MDT0000.rename_stats=
rename_stats:
- snapshot_time:  1555006234.455267326
- same_dir       
      4KB: { sample: 114045, pct:  79, cum_pct:  79 }
      8KB: { sample:   0, pct:   0, cum_pct:  79 }
      16KB: { sample: 3444, pct:   2, cum_pct:  82 }
      32KB: { sample: 7227, pct:   5, cum_pct:  87 }
      64KB: { sample: 6203, pct:   4, cum_pct:  91 }
      128KB: { sample: 11437, pct:   7, cum_pct:  99 }
      256KB: { sample: 860, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample: 453, pct:   4, cum_pct:   4 }
      8KB: { sample:   0, pct:   0, cum_pct:   4 }
      16KB: { sample: 362, pct:   3, cum_pct:   7 }
      32KB: { sample: 226, pct:   2, cum_pct:   9 }
      64KB: { sample:   1, pct:   0, cum_pct:   9 }
      128KB: { sample:   0, pct:   0, cum_pct:   9 }
      256KB: { sample:   0, pct:   0, cum_pct:   9 }
      512KB: { sample:   0, pct:   0, cum_pct:   9 }
      1MB: { sample: 10000, pct:  90, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 1025, pct:   9, cum_pct:   9 }
      8KB: { sample:   0, pct:   0, cum_pct:   9 }
      16KB: { sample: 148, pct:   1, cum_pct:  10 }
      32KB: { sample: 292, pct:   2, cum_pct:  13 }
      64KB: { sample: 403, pct:   3, cum_pct:  16 }
      128KB: { sample: 1103, pct:   9, cum_pct:  26 }
      256KB: { sample: 2327, pct:  21, cum_pct:  47 }
      512KB: { sample: 3986, pct:  36, cum_pct:  84 }
      1MB: { sample: 1758, pct:  15, cum_pct: 100 }
mdt.fir-MDT0001.rename_stats=
rename_stats:
- snapshot_time:  1555006234.455585213
- same_dir       
      4KB: { sample: 3376, pct:  67, cum_pct:  67 }
      8KB: { sample:   0, pct:   0, cum_pct:  67 }
      16KB: { sample: 173, pct:   3, cum_pct:  71 }
      32KB: { sample: 158, pct:   3, cum_pct:  74 }
      64KB: { sample: 1268, pct:  25, cum_pct: 100 }
- crossdir_src   
      4KB: { sample: 113, pct:   8, cum_pct:   8 }
      8KB: { sample:   0, pct:   0, cum_pct:   8 }
      16KB: { sample:   0, pct:   0, cum_pct:   8 }
      32KB: { sample: 421, pct:  30, cum_pct:  38 }
      64KB: { sample: 859, pct:  61, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 416, pct:  29, cum_pct:  29 }
      8KB: { sample:   0, pct:   0, cum_pct:  29 }
      16KB: { sample: 330, pct:  23, cum_pct:  53 }
      32KB: { sample: 638, pct:  45, cum_pct:  99 }
      64KB: { sample:   9, pct:   0, cum_pct: 100 }
mdt.fir-MDT0002.rename_stats=
rename_stats:
- snapshot_time:  1555006234.455769900
- same_dir       
      4KB: { sample: 100705, pct:  76, cum_pct:  76 }
      8KB: { sample:   0, pct:   0, cum_pct:  76 }
      16KB: { sample: 13872, pct:  10, cum_pct:  87 }
      32KB: { sample: 13518, pct:  10, cum_pct:  97 }
      64KB: { sample: 1709, pct:   1, cum_pct:  98 }
      128KB: { sample: 1439, pct:   1, cum_pct:  99 }
      256KB: { sample:   0, pct:   0, cum_pct:  99 }
      512KB: { sample:   3, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample:  76, pct:   1, cum_pct:   1 }
      8KB: { sample:   0, pct:   0, cum_pct:   1 }
      16KB: { sample: 184, pct:   2, cum_pct:   3 }
      32KB: { sample: 425, pct:   5, cum_pct:   9 }
      64KB: { sample: 3098, pct:  43, cum_pct:  53 }
      128KB: { sample: 3303, pct:  46, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 453, pct:   6, cum_pct:   6 }
      8KB: { sample:   0, pct:   0, cum_pct:   6 }
      16KB: { sample: 528, pct:   7, cum_pct:  13 }
      32KB: { sample: 2436, pct:  34, cum_pct:  48 }
      64KB: { sample: 2148, pct:  30, cum_pct:  78 }
      128KB: { sample: 1521, pct:  21, cum_pct: 100 }
mdt.fir-MDT0003.rename_stats=
rename_stats:
- snapshot_time:  1555006234.456015071
- same_dir       
      4KB: { sample: 1683, pct:  88, cum_pct:  88 }
      8KB: { sample:   0, pct:   0, cum_pct:  88 }
      16KB: { sample: 111, pct:   5, cum_pct:  94 }
      32KB: { sample:  31, pct:   1, cum_pct:  96 }
      64KB: { sample:  70, pct:   3, cum_pct:  99 }
      128KB: { sample:   0, pct:   0, cum_pct:  99 }
      256KB: { sample:   4, pct:   0, cum_pct: 100 }
- crossdir_src   
      4KB: { sample: 115, pct:  28, cum_pct:  28 }
      8KB: { sample:   0, pct:   0, cum_pct:  28 }
      16KB: { sample:   0, pct:   0, cum_pct:  28 }
      32KB: { sample: 287, pct:  71, cum_pct: 100 }
- crossdir_tgt   
      4KB: { sample: 115, pct:  28, cum_pct:  28 }
      8KB: { sample:   0, pct:   0, cum_pct:  28 }
      16KB: { sample:   0, pct:   0, cum_pct:  28 }
      32KB: { sample: 287, pct:  71, cum_pct: 100 }

(all 4 MDTs are currently running on fir-md1-s1)

Comment by Patrick Farrell (Inactive) [ 11/Apr/19 ]

Hm, thanks for double checking.  Your patch looks just fine.  Still a mystery, then.

Comment by Stephane Thiell [ 11/Apr/19 ]

Could it be an issue with the stats?

It ran OK last night and is still OK at the moment, but I just noticed two call traces (one with mdt_dom_discard_data). No apparent hang, however. Just in case:

Apr 11 11:25:48 fir-md1-s1 kernel: LustreError: 118177:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1555007058, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0002_UUID lock: ffff8b54ccb96300/0x53d0f21983285b0b lrc: 3/1,0 mode: --/PR res: [0x2c0010636:0x604a:0x0].0x0 bits 0x13/0x8 rrc: 11 type: IBT flags: 0x40210000000000 nid: local remote
Apr 11 11:25:48 fir-md1-s1 kernel: LustreError: 118177:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Apr 11 11:27:04 fir-md1-s1 kernel: LustreError: 119078:0:(mdt_io.c:442:mdt_preprw_write()) fir-MDT0002: BRW to missing obj [0x2c00107a5:0x5650:0x0]
Apr 11 11:27:38 fir-md1-s1 kernel: LNet: Service thread pid 118308 was inactive for 200.30s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 11 11:27:38 fir-md1-s1 kernel: LNet: Skipped 1 previous similar message
Apr 11 11:27:38 fir-md1-s1 kernel: Pid: 118308, comm: mdt00_058 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 11 11:27:38 fir-md1-s1 kernel: Call Trace:
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffc0dc20bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffc0dc2dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffc14ddf51>] mdt_dom_discard_data+0x101/0x130 [mdt]
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffc14b8cd1>] mdt_reint_unlink+0x331/0x1480 [mdt]
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffc14bdcb3>] mdt_reint_rec+0x83/0x210 [mdt]
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffc149c143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffc14a74a7>] mdt_reint+0x67/0x140 [mdt]
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffc0e5f35a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffc0e0392b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffc0e0725c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 11 11:27:38 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 11 11:27:38 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1555007258.118308
Apr 11 11:27:40 fir-md1-s1 kernel: Pid: 118177, comm: mdt00_017 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 11 11:27:40 fir-md1-s1 kernel: Call Trace:
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc0dc20bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc0dc2dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc14a04bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc14a0b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc14a0f67>] mdt_object_lock_try+0x27/0xb0 [mdt]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc14a26a7>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc14a9bc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc14a6a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc0da8ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc0dd18a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc0e58302>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc0e5f35a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc0e0392b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffc0e0725c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 11 11:27:40 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 11 11:28:00 fir-md1-s1 kernel: Lustre: fir-MDT0002: Connection restored to b8776385-d179-aafa-3da1-de7b6320d2e0 (at 10.8.18.18@o2ib6)
Apr 11 11:28:00 fir-md1-s1 kernel: Lustre: Skipped 1552 previous similar messages
Apr 11 11:28:11 fir-md1-s1 kernel: Lustre: fir-MDT0002: Client 86eef2fc-6fc9-9a07-71d2-411730e19074 (at 10.9.102.18@o2ib4) reconnecting
Apr 11 11:28:11 fir-md1-s1 kernel: Lustre: Skipped 1537 previous similar messages
Apr 11 11:35:32 fir-md1-s1 kernel: Lustre: 118928:0:(tgt_handler.c:562:tgt_handle_recovery()) @@@ rq_xid 1628640699697392 matches saved xid, expected REPLAY or RESENT flag (0)  req@ffff8b551cef3450 x1628640699697392/t0(0) o4->398eea62-82fb-c936-bc1a-b5cbed90bbf5@10.8.8.12@o2ib6:8/0 lens 1136/0 e 0 to 0 dl 1555007738 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Apr 11 11:35:51 fir-md1-s1 kernel: LustreError: 119037:0:(ldlm_lib.c:3207:target_bulk_io()) @@@ bulk READ failed: rc -107  req@ffff8b52af30e300 x1629298046237264/t0(0) o37->f5140abb-0b33-a9a1-9459-14bf253ba61e@10.9.106.60@o2ib4:27/0 lens 448/440 e 0 to 0 dl 1555007757 ref 1 fl Interpret:/0/0 rc 0/0
Apr 11 11:38:01 fir-md1-s1 kernel: Lustre: fir-MDT0002: Connection restored to fe0dfafd-913e-8f07-208e-abc118273e9e (at 10.8.8.8@o2ib6)
Apr 11 11:38:01 fir-md1-s1 kernel: Lustre: Skipped 1556 previous similar messages
Apr 11 11:38:11 fir-md1-s1 kernel: Lustre: fir-MDT0002: Client 267d8456-47e2-363d-5081-c23100dd5f9b (at 10.8.27.20@o2ib6) reconnecting
Apr 11 11:38:11 fir-md1-s1 kernel: Lustre: Skipped 1562 previous similar messages

 

Comment by Patrick Farrell (Inactive) [ 11/Apr/19 ]

An issue with the stats is a good guess, but I checked that.  The only place the rename stats are modified is this call:

 mdt_rename_counter_tally(info, info->mti_mdt, req, msrcdir, mtgtdir);

which is only called from mdt_reint_rename, and only (as far as I can see) in the section we skip when we recognize this case and return -EXDEV.

Hmm, so we have an explanation for your other MDS deadlock...  Since applying the most recent version of the patch, have you seen the rename deadlock specifically?

The hung threads in mdt_dom_discard_data that you're reporting are concerning...  Could you dump the stacks for all your threads, perhaps?  Maybe we can see whether others are also in that state.  Though it looks like it may have evicted and unwound already.

Comment by Stephane Thiell [ 11/Apr/19 ]

So far, no Patrick, I cannot confirm that we've seen the same deadlock yet (like we did 3 days ago before the patch).

Yep I've seen your updates for yesterday's MDS deadlock and wanted to thank you for looking at it so quickly. This sounds like a very good lead.

And OK - I'm going to dump the stacks for all threads when I notice new call traces, before possible evictions. This happens pretty often.

Comment by Andreas Dilger [ 12/Apr/19 ]

Note that the stats do not have any specific details for remote renames, only same-dir and cross-dir renames. With DNE1 the same-dir renames would not be remote, but with DNE2 it would be possible that even same-dir renames could be remote. It probably makes sense to update the stats to include local vs. remote rename stats.

Comment by Stephane Thiell [ 12/Apr/19 ]

Oh, I finally got it!! Thanks for explaining twice, Andreas. So for us, cross-dir renames should be common and normal, because all subdirectories within a given /fir/users/$u or /fir/groups/$g are on the same MDT, and it's perfectly normal for users to move things across subdirectories. All same-dir renames should be on the same MDT too; we only use DNE2 in the base directories /fir/users and /fir/groups, and we don't do any renames there. Beyond that, we might have an occasional rename between MDTs (remote renames), but now we have the -EXDEV patch to avoid that.

And yes, it would be great to have a way to track remote renames indeed!

At this point, to help validate the -EXDEV patch, I will report back if I see any call trace with mdt_remote_object_lock. Right now I don't see any.
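
A crude way to watch for that, sketched below, is to grep the MDS kernel log for the symbol; /var/log/messages is assumed to carry the console messages, so adjust the path if syslog is routed elsewhere.

# Sketch: check for any stuck-thread trace mentioning mdt_remote_object_lock.
grep -c 'mdt_remote_object_lock' /var/log/messages
# or watch live:
dmesg -w | grep --line-buffered 'mdt_remote_object_lock'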

Comment by Patrick Farrell (Inactive) [ 12/Apr/19 ]

Ack, reading comprehension fail on my part!  Thank you for pointing that out, Andreas.  Duh.

So actually, no reason to doubt the updated patch.  That's good.

Comment by Stephane Thiell [ 12/Apr/19 ]

Hi Patrick,
Our 2.12 filesystem just became partially unresponsive and I can see the following traces on the MDS:

[Fri Apr 12 11:05:55 2019][2937277.371515] Lustre: fir-MDT0000-osp-MDT0001: Connection to fir-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[Fri Apr 12 11:05:55 2019][2937277.386679] LustreError: 118245:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1555092265, 90s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MDT0000-osp-MDT0001 lock: ffff8b75f8648240/0x53d0f223f98d7135 lrc: 4/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 24 type: IBT flags: 0x1000001000000 nid: local remote: 0x53d0f223f98d7231 expref: -99 pid: 118245 timeout: 0 lvb_type: 0
[Fri Apr 12 11:05:55 2019][2937277.427024] LustreError: 118245:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 3 previous similar messages
[Fri Apr 12 11:05:56 2019][2937278.413559] LNet: Service thread pid 118364 was inactive for 200.30s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Fri Apr 12 11:05:56 2019][2937278.430779] Pid: 118364, comm: mdt02_044 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[Fri Apr 12 11:05:56 2019][2937278.440802] Call Trace:
[Fri Apr 12 11:05:56 2019][2937278.443463]  [<ffffffffc0dc20bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[Fri Apr 12 11:05:56 2019][2937278.450593]  [<ffffffffc0dc2dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[Fri Apr 12 11:05:56 2019][2937278.457989]  [<ffffffffc14a04bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[Fri Apr 12 11:05:56 2019][2937278.465001]  [<ffffffffc14a0b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
[Fri Apr 12 11:05:56 2019][2937278.472176]  [<ffffffffc14a1d2a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
[Fri Apr 12 11:05:56 2019][2937278.479268]  [<ffffffffc14a9bc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[Fri Apr 12 11:05:56 2019][2937278.486009]  [<ffffffffc14a6a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[Fri Apr 12 11:05:56 2019][2937278.492684]  [<ffffffffc0da8ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[Fri Apr 12 11:05:56 2019][2937278.499643]  [<ffffffffc0dd18a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[Fri Apr 12 11:05:56 2019][2937278.506939]  [<ffffffffc0e58302>] tgt_enqueue+0x62/0x210 [ptlrpc]
[Fri Apr 12 11:05:56 2019][2937278.513296]  [<ffffffffc0e5f35a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[Fri Apr 12 11:05:56 2019][2937278.520436]  [<ffffffffc0e0392b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[Fri Apr 12 11:05:56 2019][2937278.528357]  [<ffffffffc0e0725c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[Fri Apr 12 11:05:56 2019][2937278.534928]  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
[Fri Apr 12 11:05:56 2019][2937278.540048]  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
[Fri Apr 12 11:05:56 2019][2937278.546696]  [<ffffffffffffffff>] 0xffffffffffffffff
[Fri Apr 12 11:05:56 2019][2937278.551914] LustreError: dumping log to /tmp/lustre-log.1555092356.118364

So I dumped all tasks using sysrq-t.

Please find logs + sysrq traces in fir-md1-s1_20190412.log

I may need to restart all MDTs if it doesn't recover by itself.
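
For reference, the sysrq-t dump was captured roughly as sketched below; the exact destination of the output depends on how the console and syslog are set up on the MDS.

# Sketch: dump the stack of every task into the kernel log, then save it.
echo t > /proc/sysrq-trigger
dmesg > /tmp/$(hostname -s)_sysrq-t_$(date +%Y%m%d).log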

Comment by Patrick Farrell (Inactive) [ 12/Apr/19 ]

If you're able, dumping the lock tables could be helpful.  I'll see if I can take a quick look...

Comment by Stephane Thiell [ 12/Apr/19 ]
[root@fir-md1-s1 ~]# lctl get_param ldlm.*.*mdt*.*
ldlm.namespaces.mdt-fir-MDT0000_UUID.contended_locks=32
ldlm.namespaces.mdt-fir-MDT0000_UUID.contention_seconds=2
ldlm.namespaces.mdt-fir-MDT0000_UUID.ctime_age_limit=10
ldlm.namespaces.mdt-fir-MDT0000_UUID.dirty_age_limit=10
ldlm.namespaces.mdt-fir-MDT0000_UUID.early_lock_cancel=0
ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=2920395
ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_timeouts=5
ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_unused_count=0
ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_max_age=3900000
ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_size=4800
ldlm.namespaces.mdt-fir-MDT0000_UUID.max_nolock_bytes=0
ldlm.namespaces.mdt-fir-MDT0000_UUID.max_parallel_ast=1024
ldlm.namespaces.mdt-fir-MDT0000_UUID.resource_count=184784
ldlm.namespaces.mdt-fir-MDT0001_UUID.contended_locks=32
ldlm.namespaces.mdt-fir-MDT0001_UUID.contention_seconds=2
ldlm.namespaces.mdt-fir-MDT0001_UUID.ctime_age_limit=10
ldlm.namespaces.mdt-fir-MDT0001_UUID.dirty_age_limit=10
ldlm.namespaces.mdt-fir-MDT0001_UUID.early_lock_cancel=0
ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=2760390
ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_timeouts=1
ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_unused_count=0
ldlm.namespaces.mdt-fir-MDT0001_UUID.lru_max_age=3900000
ldlm.namespaces.mdt-fir-MDT0001_UUID.lru_size=4800
ldlm.namespaces.mdt-fir-MDT0001_UUID.max_nolock_bytes=0
ldlm.namespaces.mdt-fir-MDT0001_UUID.max_parallel_ast=1024
ldlm.namespaces.mdt-fir-MDT0001_UUID.resource_count=333925
ldlm.namespaces.mdt-fir-MDT0002_UUID.contended_locks=32
ldlm.namespaces.mdt-fir-MDT0002_UUID.contention_seconds=2
ldlm.namespaces.mdt-fir-MDT0002_UUID.ctime_age_limit=10
ldlm.namespaces.mdt-fir-MDT0002_UUID.dirty_age_limit=10
ldlm.namespaces.mdt-fir-MDT0002_UUID.early_lock_cancel=0
ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=3512236
ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_timeouts=7
ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_unused_count=0
ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_max_age=3900000
ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_size=4800
ldlm.namespaces.mdt-fir-MDT0002_UUID.max_nolock_bytes=0
ldlm.namespaces.mdt-fir-MDT0002_UUID.max_parallel_ast=1024
ldlm.namespaces.mdt-fir-MDT0002_UUID.resource_count=296776
ldlm.namespaces.mdt-fir-MDT0003_UUID.contended_locks=32
ldlm.namespaces.mdt-fir-MDT0003_UUID.contention_seconds=2
ldlm.namespaces.mdt-fir-MDT0003_UUID.ctime_age_limit=10
ldlm.namespaces.mdt-fir-MDT0003_UUID.dirty_age_limit=10
ldlm.namespaces.mdt-fir-MDT0003_UUID.early_lock_cancel=0
ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_count=368677
ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_timeouts=0
ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_unused_count=0
ldlm.namespaces.mdt-fir-MDT0003_UUID.lru_max_age=3900000
ldlm.namespaces.mdt-fir-MDT0003_UUID.lru_size=4800
ldlm.namespaces.mdt-fir-MDT0003_UUID.max_nolock_bytes=0
ldlm.namespaces.mdt-fir-MDT0003_UUID.max_parallel_ast=1024
ldlm.namespaces.mdt-fir-MDT0003_UUID.resource_count=189951
 

dumping the lock tables now
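
For the record, the lock-table dump was produced roughly like this (a sketch of the usual procedure: ldlm.dump_namespaces writes into the Lustre debug log, which lctl dk then extracts to a file):

# Sketch: dump all LDLM namespaces into the debug log and save/compress it.
lctl set_param ldlm.dump_namespaces=1
lctl dk /tmp/dk.$(hostname -s)
gzip /tmp/dk.$(hostname -s)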

Comment by Stephane Thiell [ 12/Apr/19 ]

dk.fir-md1-s1.gz

Comment by Stephane Thiell [ 12/Apr/19 ]

I'm about to restart all MDTs, but it's interesting that some of the latest traces still contain mdt_remote_object_lock:

Apr 12 11:28:42 fir-md1-s1 kernel: Lustre: fir-MDT0002: Connection restored to 9af20acb-4644-66cf-ffdb-64cc6425272a (at 10.9.107.28@o2ib4)
Apr 12 11:28:42 fir-md1-s1 kernel: Lustre: Skipped 4201 previous similar messages
Apr 12 11:28:50 fir-md1-s1 kernel: Pid: 118374, comm: mdt02_047 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 12 11:28:50 fir-md1-s1 kernel: Call Trace:
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc0dc20bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc0dc412f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc0dc69ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc169a7a2>] osp_md_object_lock+0x162/0x2d0 [osp]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc15aee03>] lod_object_lock+0xf3/0x7b0 [lod]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc162fd3e>] mdd_object_lock+0x3e/0xe0 [mdd]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc149f2f1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc149f88a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc14b2aae>] mdt_rename_lock+0xbe/0x4b0 [mdt]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc14b4e15>] mdt_reint_rename+0x2d5/0x28e0 [mdt]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc14bdcb3>] mdt_reint_rec+0x83/0x210 [mdt]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc149c143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc14a74a7>] mdt_reint+0x67/0x140 [mdt]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc0e5f35a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc0e0392b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffc0e0725c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 12 11:28:50 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 12 11:28:50 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1555093730.118374
Apr 12 11:28:55 fir-md1-s1 kernel: Lustre: fir-MDT0002: Client 39b26b25-3151-384d-34d9-da20c3dc7f78 (at 10.8.7.35@o2ib6) reconnecting
Apr 12 11:28:55 fir-md1-s1 kernel: Lustre: Skipped 4181 previous similar messages
Apr 12 11:29:22 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1555093762.117533
 
[root@fir-md1-s1 ~]# lctl set_param mdt.*.enable_remote_rename=0
mdt.fir-MDT0000.enable_remote_rename=0
mdt.fir-MDT0001.enable_remote_rename=0
mdt.fir-MDT0002.enable_remote_rename=0
mdt.fir-MDT0003.enable_remote_rename=0
Comment by Stephane Thiell [ 12/Apr/19 ]

Attached a few seconds of full debug (-1) just in case as dk-full.fir-md1-s1.gz
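
(The capture itself was roughly as sketched below: save the current debug mask, raise it to -1 for a few seconds, dump the buffer, then restore the mask. A sketch only; the sleep and file name are arbitrary.)

# Sketch: grab a few seconds of full (-1) Lustre debug on the MDS.
prev=$(lctl get_param -n debug)
lctl set_param debug=-1
lctl clear
sleep 5
lctl dk /tmp/dk-full.$(hostname -s)
lctl set_param debug="$prev"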

It is back online

fir-md1-s2: Apr 12 11:40:49 fir-md1-s2 kernel: Lustre: fir-MDT0001: Recovery over after 0:51, of 1322 clients 1322 recovered and 0 were evicted.
fir-md1-s2: Apr 12 11:40:50 fir-md1-s2 kernel: Lustre: fir-MDT0003: Recovery over after 0:51, of 1322 clients 1322 recovered and 0 were evicted.
fir-md1-s1: Apr 12 11:40:56 fir-md1-s1 kernel: Lustre: fir-MDT0002: Recovery over after 1:01, of 1322 clients 1322 recovered and 0 were evicted.
fir-md1-s1: Apr 12 11:44:57 fir-md1-s1 kernel: Lustre: fir-MDT0000: Recovery over after 5:01, of 1323 clients 1322 recovered and 1 was evicted.

There are already a few call traces though:

Apr 12 11:44:09 fir-md1-s2 kernel: LNet: Service thread pid 43198 was inactive for 200.52s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 12 11:44:09 fir-md1-s2 kernel: Pid: 43198, comm: mdt02_015 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 12 11:44:09 fir-md1-s2 kernel: Call Trace:
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc0fe8890>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc0fe8ce3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc17f8aa3>] osp_remote_sync+0xd3/0x200 [osp]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc17dec63>] osp_attr_get+0x463/0x730 [osp]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc17db8cd>] osp_object_init+0x16d/0x2d0 [osp]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc0dbbfd5>] lu_object_alloc+0xe5/0x320 [obdclass]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc0dbc2e6>] lu_object_find_at+0x76/0x280 [obdclass]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc0dbc506>] lu_object_find+0x16/0x20 [obdclass]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc15f7fcb>] mdt_object_find+0x4b/0x170 [mdt]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc160ba77>] mdt_rename_lock+0x87/0x4b0 [mdt]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc160de15>] mdt_reint_rename+0x2d5/0x28e0 [mdt]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc1616cb3>] mdt_reint_rec+0x83/0x210 [mdt]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc15f5143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc16004a7>] mdt_reint+0x67/0x140 [mdt]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc105d35a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc100192b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffc100525c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffbc8c1c31>] kthread+0xd1/0xe0
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffbcf74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 12 11:44:09 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 12 11:44:09 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555094649.43198

We'll see...

Comment by Stephane Thiell [ 12/Apr/19 ]

I restarted all MDTs again after setting enable_remote_rename=0 permanently using

 lctl set_param -P mdt.fir-*.enable_remote_rename=0

instead of a "late" tuning. It's much quieter now, but we'll see.

Comment by Patrick Farrell (Inactive) [ 12/Apr/19 ]

The OSP call trace suggests possible OST issues.  Something to check for.

Comment by Stephane Thiell [ 12/Apr/19 ]

Those were actually seen only shortly after I started the MDTs, so they are perhaps due to the load when they reconnect to the OSTs.

The definitely more pressing issue, I guess, is to find out why we get these so often:

Apr 12 15:43:55 fir-md1-s1 kernel: LustreError: 52120:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1555108945, 90s ago); not entering recovery in server code, just going 
Apr 12 15:43:55 fir-md1-s1 kernel: LustreError: 50859:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1555108945, 90s ago); not entering recovery in server code, just going 
Apr 12 15:45:45 fir-md1-s1 kernel: LNet: Service thread pid 50859 was inactive for 200.41s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purpose
Apr 12 15:45:45 fir-md1-s1 kernel: Pid: 50859, comm: mdt03_000 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 12 15:45:45 fir-md1-s1 kernel: Call Trace:
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc0e09dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc14714bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc1471b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc1471f67>] mdt_object_lock_try+0x27/0xb0 [mdt]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc14736a7>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc147abc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc1477a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc0defec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc0e188a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc0e9f302>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 12 15:45:45 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 12 15:45:45 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1555109145.50859
Apr 12 15:45:46 fir-md1-s1 kernel: Pid: 52120, comm: mdt03_057 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 12 15:45:46 fir-md1-s1 kernel: Call Trace:
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffc0e09dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffc14aef51>] mdt_dom_discard_data+0x101/0x130 [mdt]
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffc1489cd1>] mdt_reint_unlink+0x331/0x1480 [mdt]
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffc148ecb3>] mdt_reint_rec+0x83/0x210 [mdt]
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffc146d143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffc14784a7>] mdt_reint+0x67/0x140 [mdt]
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 12 15:45:46 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Comment by Mikhail Pershin [ 13/Apr/19 ]

Stephane, I think this is a result of the LU-11359 issue; we were already discussing patch https://review.whamcloud.com/34071. I am updating it to handle non-updated clients gracefully, so you will be able to apply the patch in any order on servers/clients. It should be ready today, I think.

Comment by Stephane Thiell [ 13/Apr/19 ]

Mike, thanks for the update! This looks like it, yes. We'll likely apply the patch on the servers first. If we can do that sometime next week, that would be great, so let me know when you're confident in this patch.

Comment by Andreas Dilger [ 13/Apr/19 ]

Stephane, the current patch is not quite ready, since it doesn't properly check whether the client has the fix. Hopefully Mike will update the patch with a proper feature fix soon.

Comment by Stephane Thiell [ 14/Apr/19 ]

Hi Andreas,
Thanks, I'll wait for you to confirm that the patch is ready.
But this is quite urgent, as we have more and more users reporting simple mv commands hanging.
One recent report was just a move of a directory into another one, on the same MDT.
The report was that the following command was hanging:

$ mv code ignore/ 

I checked and all directories are on the same MDT:

$ pwd
/fir/users/beelaj/lab
$ ls -ld code
drwxr-xr-x 5 beelaj kornberg 4096 Apr  9 16:48 code
$ ls -ld ignore
drwxr-xr-x 2 beelaj kornberg 4096 Apr 13 22:26 ignore
[root@sh-101-59 lab]# lfs getdirstripe .
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
$ lfs getdirstripe code
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
$ lfs getdirstripe ignore
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
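
As an aside, every component of the path can be checked in one go with a small loop like the one below (just a sketch; it relies on the lfs getdirstripe output format shown above and stops at the /fir mount point):

# Sketch: print the lmv stripe placement for each directory component of a path.
d=/fir/users/beelaj/lab/code
while [ "$d" != "/fir" ]; do
    echo "== $d"
    lfs getdirstripe "$d" | grep lmv_stripe_offset
    d=$(dirname "$d")
done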

The mv still hadn't succeeded at that time. This is impacting research work. We end up with no other solution but to recommend using other filesystems (either our NFS appliance or Oak, which is running 2.10).

I see plenty of backtraces on the MDS where MDT2 is located:

Apr 13 22:41:46 fir-md1-s1 kernel: Pid: 51848, comm: mdt00_068 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 13 22:41:46 fir-md1-s1 kernel: Call Trace:
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffc0e090bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffc0e09dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffc1483c3b>] mdt_rename_lock+0x24b/0x4b0 [mdt]
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffc1485e15>] mdt_reint_rename+0x2d5/0x28e0 [mdt]
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffc148ecb3>] mdt_reint_rec+0x83/0x210 [mdt]
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffc146d143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffc14784a7>] mdt_reint+0x67/0x140 [mdt]
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffc0ea635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffc0e4a92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffc0e4e25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffa3ec1c31>] kthread+0xd1/0xe0
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffa4574c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 13 22:41:46 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 13 22:41:46 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1555220506.51848
Comment by Stephane Thiell [ 14/Apr/19 ]

If I understand the commit message of Mike's patch correctly, that is likely also related to the mdt_dom_discard_data locking issue, as it says "leading to cascading timeouts for any other locks waiting on the same resource and parent directory."

After restarting the MDT, the mv command completed. I think we need to perform some scheduled MDT restart for now, until the patch is ready.

Comment by Gerrit Updater [ 15/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34615/
Subject: LU-12037 mdt: call mdt_dom_discard_data() after rename unlock
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: d557ce56bee3267fbe7b29a47e4ce7a99062e42c

Comment by Stephane Thiell [ 17/Apr/19 ]

More locking issues today (it was pretty quiet the last two days, though we restart all MDTs every night). Is Patch Set 11 ready, and if so, can I upgrade the servers first? Will the issue only be fixed once clients and servers are all up to date?

Apr 17 16:40:00 fir-md1-s2 kernel: Lustre: 11914:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
                                     req@ffff98cd3cf7e600 x1630626855117776/t0(0) o101->b4407670-b333-88f8-738e-cbd276e333a3@10.9.108.55@o2ib4:5/0 lens 576/3264 e 1 to 0 dl 1555544405 ref 2 fl Interpret:/0/0 rc 
Apr 17 16:41:15 fir-md1-s2 kernel: LustreError: 12542:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1555544385, 90s ago); not entering recovery in server code, just going 
Apr 17 16:41:15 fir-md1-s2 kernel: LustreError: 12542:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Apr 17 16:42:09 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client b310d3db-4fa8-39c6-f54e-664a3cfcc55e (at 10.9.108.53@o2ib4) reconnecting
Apr 17 16:42:09 fir-md1-s2 kernel: Lustre: Skipped 117 previous similar messages
Apr 17 16:42:09 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to b310d3db-4fa8-39c6-f54e-664a3cfcc55e (at 10.9.108.53@o2ib4)
Apr 17 16:42:09 fir-md1-s2 kernel: Lustre: Skipped 121 previous similar messages
Apr 17 16:43:05 fir-md1-s2 kernel: LNet: Service thread pid 12542 was inactive for 200.45s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purpose
Apr 17 16:43:05 fir-md1-s2 kernel: LNet: Skipped 1 previous similar message
Apr 17 16:43:05 fir-md1-s2 kernel: Pid: 12542, comm: mdt03_025 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 17 16:43:05 fir-md1-s2 kernel: Call Trace:
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc0f790bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc0f79dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc16004bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc1600b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc1600f67>] mdt_object_lock_try+0x27/0xb0 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc16026a7>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc1609bc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc1606a28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc0f5fec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc0f888a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc100f302>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc101635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc0fba92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc0fbe25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffbc8c1c31>] kthread+0xd1/0xe0
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffbcf74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 17 16:43:05 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555544585.12542
Apr 17 16:43:05 fir-md1-s2 kernel: Pid: 12726, comm: mdt00_060 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 17 16:43:05 fir-md1-s2 kernel: Call Trace:
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc0f790bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc0f79dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc163df51>] mdt_dom_discard_data+0x101/0x130 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc1618cd1>] mdt_reint_unlink+0x331/0x1480 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc161dcb3>] mdt_reint_rec+0x83/0x210 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc15fc143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc16074a7>] mdt_reint+0x67/0x140 [mdt]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc101635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc0fba92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffc0fbe25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffbc8c1c31>] kthread+0xd1/0xe0
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffbcf74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 17 16:43:05 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Comment by Patrick Farrell (Inactive) [ 18/Apr/19 ]

Stephane,

I would say it's ready.  I hope tappro will correct me if I'm wrong, but in the interests of getting you unblocked on trying the patch, my understanding is:  You can upgrade the servers first - they now recognize older clients and change behavior accordingly.  The issue will not be fully resolved until everything is upgraded, yeah.

Comment by Stephane Thiell [ 18/Apr/19 ]

Hi Patrick,

Thanks I'll try to apply the patch for the DOM locks.

This morning, however, we saw this remote lock issue again. You told me to report any new issue related to that, so here it is...

First of all, as a reminder, we're running with enable_remote_rename=0:

fir-md1-s2: mdt.fir-MDT0001.enable_remote_rename=0
fir-md1-s2: mdt.fir-MDT0003.enable_remote_rename=0
fir-md1-s1: mdt.fir-MDT0000.enable_remote_rename=0
fir-md1-s1: mdt.fir-MDT0002.enable_remote_rename=0

I was able to reproduce the issue, mv easily hangs (sh-ln01 is 10.9.0.61@o2ib4):

[sthiell@sh-ln01 login /fir/users/sthiell]$ mkdir testdir1
[sthiell@sh-ln01 login /fir/users/sthiell]$ mv testdir1 testdir2
(hanging)

Then cd from another node hangs too:

[root@sh-hn01 sthiell.root]# cd /fir/users/sthiell
(hanging)
Apr 18 11:23:33 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555611813.40139
Apr 18 11:23:52 fir-md1-s2 kernel: Pid: 40212, comm: mdt01_062 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 18 11:23:52 fir-md1-s2 kernel: Call Trace:
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc0f790bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc0f7b12f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc0f7d9ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc17fa7a2>] osp_md_object_lock+0x162/0x2d0 [osp]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc170ee03>] lod_object_lock+0xf3/0x7b0 [lod]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc178fd3e>] mdd_object_lock+0x3e/0xe0 [mdd]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc15ff2f1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc15ff88a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc1612aae>] mdt_rename_lock+0xbe/0x4b0 [mdt]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc1614e15>] mdt_reint_rename+0x2d5/0x28e0 [mdt]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc161dcb3>] mdt_reint_rec+0x83/0x210 [mdt]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc15fc143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc16074a7>] mdt_reint+0x67/0x140 [mdt]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc101635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc0fba92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffc0fbe25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffbc8c1c31>] kthread+0xd1/0xe0
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffbcf74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 18 11:23:52 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 18 11:23:52 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555611832.40212
Apr 18 11:24:11 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 01fefea8-43c5-f114-cf69-e9c5be19a5dd (at 10.8.2.15@o2ib6)
Apr 18 11:24:11 fir-md1-s2 kernel: Lustre: Skipped 1786 previous similar messages
Apr 18 11:24:13 fir-md1-s2 kernel: Pid: 39036, comm: mdt01_008 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 18 11:24:13 fir-md1-s2 kernel: Call Trace:
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc0f790bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc0f7b12f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc0f7d9ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc17fa7a2>] osp_md_object_lock+0x162/0x2d0 [osp]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc170ee03>] lod_object_lock+0xf3/0x7b0 [lod]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc178fd3e>] mdd_object_lock+0x3e/0xe0 [mdd]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc15ff2f1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc15ff88a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc1612aae>] mdt_rename_lock+0xbe/0x4b0 [mdt]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc1614e15>] mdt_reint_rename+0x2d5/0x28e0 [mdt]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc161dcb3>] mdt_reint_rec+0x83/0x210 [mdt]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc15fc143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc16074a7>] mdt_reint+0x67/0x140 [mdt]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc101635a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc0fba92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffc0fbe25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffbc8c1c31>] kthread+0xd1/0xe0
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffbcf74c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 18 11:24:13 fir-md1-s2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Apr 18 11:24:13 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555611853.39036

sh-ln01 can be found there:

[root@fir-md1-s2 ~]# strings /tmp/lustre-log.1555611853.39036 | grep 10.9.0.61@o2ib4
fir-MDT0003: Client d688f6d7-017e-8817-dbcd-6b016c3633d8 (at 10.9.0.61@o2ib4) reconnecting
fir-MDT0003: Connection restored to d688f6d7-017e-8817-dbcd-6b016c3633d8 (at 10.9.0.61@o2ib4)
fir-MDT0001: Client d688f6d7-017e-8817-dbcd-6b016c3633d8 (at 10.9.0.61@o2ib4) reconnecting
fir-MDT0001: Connection restored to d688f6d7-017e-8817-dbcd-6b016c3633d8 (at 10.9.0.61@o2ib4)

However, /fir/users/sthiell is on MDT0:

[root@sh-ln01 ~]# lfs getdirstripe /fir/users/sthiell
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none

This is weird because the backtrace seems to come from MDT1 or MDT3. Anyway, something is still wrong. So I'm thinking we should fix the DOM locks first, and then we can see whether we still hit the remote-lock problem...

Comment by Patrick Farrell (Inactive) [ 18/Apr/19 ]

Wait, you can reproduce the issue with just mv?  Can you describe the directory the mv is happening in?  DNE involved somehow?

Upon reflection, I'm guessing you're only able to reproduce it with mv after you've already got a problem?

The trace you're showing here isn't a problem - In this case, the secondary MDT needs the rename lock.  The entire file system has one rename lock, and it's on MDT0.  (mdt_rename_lock is asking for that).  So this sequence is expected even with Lai's patch.

Hm, can you give getdirstripe of "users" and of the directory you're trying to mv?

Comment by Stephane Thiell [ 18/Apr/19 ]

Yes yes sorry, I was able to reproduce a hanging rename because the filesystem was already in a weird state... I was seeing traces on both MDS already (but not any recent mdt_dom_discard_data-related traces).

Comment by Stephane Thiell [ 18/Apr/19 ]
[root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users
lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx		 FID[seq:oid:ver]
     0		 [0x200000400:0x5:0x0]		
     1		 [0x240000402:0x5:0x0]		
     2		 [0x2c0000400:0x5:0x0]		
     3		 [0x280000401:0x5:0x0]	
	
[root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users/sthiell
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
 
Comment by Stephane Thiell [ 18/Apr/19 ]

After restarting the MDTs, the mv has completed:

[root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users/sthiell/testdir1
lfs getdirstripe: cannot open '/fir/users/sthiell/testdir1': No such file or directory (2)
error: getdirstripe failed for /fir/users/sthiell/testdir1.
[root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users/sthiell/testdir2
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
Comment by Stephane Thiell [ 18/Apr/19 ]

And then, after the recovery of the MDTs, the only backtraces I can see are related to mdt_dom_discard_data (see below). So again, I think we need to focus on fixing this issue first. But the fact that we could see mdt_remote_object_lock_try at all is still disturbing to me.

[657310.752760] Lustre: Skipped 3 previous similar messages
[657325.266727] LNet: Service thread pid 42430 was inactive for 200.28s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[657325.283780] Pid: 42430, comm: mdt00_005 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[657325.293610] Call Trace:
[657325.296166]  [<ffffffffc0fc00bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[657325.303203]  [<ffffffffc0fc0dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[657325.310523]  [<ffffffffc15f94bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[657325.317440]  [<ffffffffc15f9b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
[657325.324552]  [<ffffffffc15fb43d>] mdt_getattr_name_lock+0x101d/0x1c30 [mdt]
[657325.331646]  [<ffffffffc1602bc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[657325.338324]  [<ffffffffc15ffa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[657325.344912]  [<ffffffffc0fa6ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[657325.351777]  [<ffffffffc0fcf8a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[657325.358971]  [<ffffffffc1056302>] tgt_enqueue+0x62/0x210 [ptlrpc]
[657325.365253]  [<ffffffffc105d35a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[657325.372285]  [<ffffffffc100192b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[657325.380135]  [<ffffffffc100525c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[657325.386578]  [<ffffffffbc8c1c31>] kthread+0xd1/0xe0
[657325.391586]  [<ffffffffbcf74c24>] ret_from_fork_nospec_begin+0xe/0x21
[657325.398192]  [<ffffffffffffffff>] 0xffffffffffffffff
[657325.403331] LustreError: dumping log to /tmp/lustre-log.1555612752.42430
[657325.410851] Pid: 42431, comm: mdt01_004 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[657325.420686] Call Trace:
[657325.423237]  [<ffffffffc0fc00bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[657325.430289]  [<ffffffffc0fc0dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[657325.437589]  [<ffffffffc15f94bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[657325.444504]  [<ffffffffc15f9b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
[657325.451599]  [<ffffffffc15f9f67>] mdt_object_lock_try+0x27/0xb0 [mdt]
[657325.458186]  [<ffffffffc15fb6a7>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt]
[657325.465293]  [<ffffffffc1602bc5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[657325.471952]  [<ffffffffc15ffa28>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[657325.478551]  [<ffffffffc0fa6ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[657325.485403]  [<ffffffffc0fcf8a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[657325.492612]  [<ffffffffc1056302>] tgt_enqueue+0x62/0x210 [ptlrpc]
[657325.498862]  [<ffffffffc105d35a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[657325.505910]  [<ffffffffc100192b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[657325.513742]  [<ffffffffc100525c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[657325.520172]  [<ffffffffbc8c1c31>] kthread+0xd1/0xe0
[657325.525173]  [<ffffffffbcf74c24>] ret_from_fork_nospec_begin+0xe/0x21
[657325.531751]  [<ffffffffffffffff>] 0xffffffffffffffff
[657325.536854] Pid: 42699, comm: mdt02_033 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[657325.546711] Call Trace:
[657325.549259]  [<ffffffffc0fc00bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[657325.556280]  [<ffffffffc0fc0dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[657325.563588]  [<ffffffffc1636f51>] mdt_dom_discard_data+0x101/0x130 [mdt]
[657325.570441]  [<ffffffffc1611cd1>] mdt_reint_unlink+0x331/0x1480 [mdt]
[657325.577039]  [<ffffffffc1616cb3>] mdt_reint_rec+0x83/0x210 [mdt]
[657325.583175]  [<ffffffffc15f5143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[657325.589844]  [<ffffffffc16004a7>] mdt_reint+0x67/0x140 [mdt]
[657325.595636]  [<ffffffffc105d35a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[657325.602681]  [<ffffffffc100192b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[657325.610500]  [<ffffffffc100525c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[657325.616913]  [<ffffffffbc8c1c31>] kthread+0xd1/0xe0
[657325.621930]  [<ffffffffbcf74c24>] ret_from_fork_nospec_begin+0xe/0x21
[657325.628493]  [<ffffffffffffffff>] 0xffffffffffffffff
Comment by Stephane Thiell [ 18/Apr/19 ]

Patrick, the patch https://review.whamcloud.com/#/c/34071/11/ does apply with a few minor changes, but it doesn't compile against 2.12.0 because:

In ldlm_request.c, there is a reference to ldlm_is_granted(), which doesn't exist in 2.12.0:

lustre-release/lustre/ptlrpc/../../lustre/ldlm/ldlm_request.c: In function ‘is_granted_or_cancelled_nolock’:
lustre-release/lustre/ptlrpc/../../lustre/ldlm/ldlm_request.c:157:2: error: implicit declaration of function ‘ldlm_is_granted’ [-Werror=implicit-function-declaration]
  if (ldlm_is_granted(lock) && !ldlm_is_cp_reqd(lock))
  ^

I might need some help porting this patch to 2.12.

Comment by Patrick Farrell (Inactive) [ 18/Apr/19 ]

Stephane,

You can replace ldlm_is_granted(lock) with (lock->l_req_mode == lock->l_granted_mode).
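
For reference, a minimal backport sketch (my assumption about how to wire it in, not part of the original patch) would be to add a local helper with that open-coded check, so the rest of the backported patch can keep calling ldlm_is_granted():

/*
 * Backport sketch for 2.12.0, which has no ldlm_is_granted(): a local
 * equivalent built from the existing l_req_mode/l_granted_mode fields of
 * struct ldlm_lock, matching the open-coded check suggested above.
 */
static inline bool ldlm_is_granted(struct ldlm_lock *lock)
{
	return lock->l_req_mode == lock->l_granted_mode;
}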

About the remote lock:
If, as in the case from earlier, it's inside mdt_rename_lock and it's on a secondary MDT (not MDT0), then it's not an issue. [Edit: Sorry, too brief....] It's just trying to get the rename lock on MDT0. That does indicate some sort of hang if you're getting backtraces, but it doesn't actually indicate a remote rename. There is only a single rename lock ("big filesystem lock" or BFL, as Andreas labels it) for the whole filesystem. This is fairly standard (ext4, at least, does the same), as it massively simplifies rename locking (which is still quite complicated).
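
To illustrate the idea (a userspace analogy only, with made-up names such as fs_rename and big_filesystem_lock, not the actual mdt_rename_lock() code): because there is a single BFL, two renames in completely unrelated directories still serialize on the same lock, which is why a secondary MDT can show up waiting in mdt_rename_lock even when nothing about the rename is really remote.

#include <stdio.h>
#include <pthread.h>

/* Userspace analogy of the BFL: one lock shared by every rename in the filesystem. */
static pthread_mutex_t big_filesystem_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper: any rename, in any directory, takes the same global lock. */
static int fs_rename(const char *oldpath, const char *newpath)
{
	int rc;

	pthread_mutex_lock(&big_filesystem_lock);   /* serializes against all other renames */
	rc = rename(oldpath, newpath);              /* the actual namespace update */
	pthread_mutex_unlock(&big_filesystem_lock);
	return rc;
}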

Comment by Stephane Thiell [ 18/Apr/19 ]

Thanks!!

This does compile against my 2.12.0 branch: https://gist.github.com/thiell/ce2445aa1e37b365e50208ebbb0b5263

Let me know if that sounds ok to you. I will build new packages after lunch.

Thanks for the remote lock explanation, this is super interesting and I'm glad to hear that it doesn't necessarily mean a remote rename!!

Comment by Stephane Thiell [ 18/Apr/19 ]

Will try the following on top of 2.12.0:

servers (fir):

$ git log --oneline
1819063 LU-11359 mdt: fix mdt_dom_discard_data() timeouts
3ed10f4 LU-11964 mdc: prevent glimpse lock count grow
565011c LU-12037 mdt: add option for cross-MDT rename
b6be1d9 LU-12065 lnd: increase CQ entries
6b2c97b LU-12037 mdt: call mdt_dom_discard_data() after rename unlock

clients (sherlock):

8a47fe6 LU-11359 mdt: fix mdt_dom_discard_data() timeouts
8bf1d95 LU-12065 lnd: increase CQ entries
7b8a3b3 LU-11964 mdc: prevent glimpse lock count grow
Comment by Stephane Thiell [ 19/Apr/19 ]

FYI, we had another "event", so we decided to update the servers, and now we're slowly starting to redeploy the clients; this will take a while.

Our first client with async_discard:

[root@sh-101-60 ~]# lctl get_param mdc.fir-*.import | grep -q async_discard && echo ok
ok

Thanks much for your efforts! Fingers crossed.

Comment by Gerrit Updater [ 21/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34410/
Subject: LU-12037 mdt: add option for cross-MDT rename
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c6f3d533542b9462b5b8df95183d80321d4d9c34

Comment by Gerrit Updater [ 21/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34616/
Subject: LU-12037 mdt: add option for cross-MDT rename
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 9cbefa99f234586c1419243a4d2c136e4f405b29

Comment by Peter Jones [ 21/Apr/19 ]

Ok, for release tracking purposes, let's mark this ticket resolved because both patches have landed for 2.13 and will be in 2.12.1. There seems to be no shortage of other tickets if further action is still needed.

Comment by Stephane Thiell [ 22/Apr/19 ]

Ok, no problem. Should I post updates about the DoM locking issue (mdt_dom_discard_data) in LU-11359, then? We have only applied the patch on ~400 clients so far, so the issue isn't resolved yet. Today was very difficult, with a lot of issues again; we'll patch the rest of the ~1000 clients more aggressively tomorrow with https://review.whamcloud.com/#/c/34071/11/

Also, we hit LU-12120 while restarting the MDTs multiple times today to try to resolve the DOM locking issues:

fir-io1-s1:[Sun Apr 21 17:06:43 2019][6239672.584384] LustreError: 1474:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
fir-io1-s1:[Sun Apr 21 18:49:41 2019][ 6062.486131] LustreError: 65943:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
fir-io1-s2:[Sun Apr 21 16:46:46 2019][6242707.929072] LustreError: 108555:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
fir-io2-s2:[Sun Apr 21 16:50:04 2019][8112352.271586] LustreError: 3051:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
fir-io3-s1:[Sun Apr 21 16:49:47 2019][8112770.406551] LustreError: 65916:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
fir-io3-s1:[Sun Apr 21 16:49:47 2019][8112770.417849] LustreError: 65963:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
fir-io3-s1:[Sun Apr 21 16:49:47 2019][8112770.443818] LustreError: 62284:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
fir-io3-s2:[Sun Apr 21 17:06:45 2019][6239507.696718] LustreError: 37106:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG

After many restart attempts, the filesystem seems to have stabilized at this point, for now...

Comment by Peter Jones [ 22/Apr/19 ]

Yes I think that switching to LU-11359 makes sense

Comment by Stephane Thiell [ 23/Jul/19 ]

Because mdt.*.enable_remote_rename=0 was impacting performance (it took a very long time for some users to mv within the filesystem), we decided to set enable_remote_rename=1 again. We did that last week and have survived until now. It was the wrong track, I guess; it's the other fixes that really helped, like async_discard or LU-11285 "ldlm: reprocess whole waiting queue for IBITS". I think this one can be closed as soon as LU-11285 has landed...
