[LU-4570] Metadata slowdowns on production filesystem at ORNL Created: 01/Feb/14  Updated: 09/Jul/14  Resolved: 09/Jul/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Major
Reporter: Jason Hill (Inactive) Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL 6.4, 2.6.32-358.18.1.el6.x86_64 kernel


Attachments: File atlas-mds1-sysrq-t.tgz     Text File ldlm_lockd.c.lustrekernel.txt     File rhea-lustre-logs-LU4570.tgz     Text File rpctrace-mds1.bz2     File titan-lustre-logs-LU4570.tgz    
Issue Links:
Related
is related to LU-4579 Timeout system horribly broken Resolved
Severity: 3
Rank (Obsolete): 12490

 Description   

Clients:
Some 1.8.6, 1.8.9 clients
Majority 2.4.X clients

Server (noted above)

Users are seeing metadata operation times of several minutes (interactive performance is usually the canary in the coal mine). We are unable to enable DNE due to 1.8 client interactions and the need to have the data available.

We run periodic lctl dk with +rpctrace and I can provide information.

To date we've found one application that was issuing ~30k LDLM_ENQUEUE RPCs in a 30-45s range (lctl dk overflowed, so we only got 45/60s of trace data). We've put that application on hold, but are still seeing some slowdowns.

We have some "ls timers" scattered around the compute clusters and have been seeing responses in the 60-80s range for an ls of a pre-determined directory.

Questions:

1. My current guess is that we are not getting any of the metadata improvements in 2.4. Can you confirm?

2. We are running with options mdt mds_num_threads=1024. We are thinking of changing that to 2048. Recommendations?

3. What further data would be of interest in tracking this down? I have the raw rpctrace data. Would it be useful to add +nettrace or +dlmtrace and send that along? (I think +dlmtrace, but others may be good too)

4. If we were to get rid of all the 1.8 clients would the 2.4 metadata improvements be realized without a reboot of the servers? We are waiting on Cray to provide a software stack for some of our login servers that supports 2.4 clients, and are highly blocked on this.

5. If we cannot get rid of the 1.8 clients, are there any other suggestions for increasing the metadata throughput? We are not blocking on host IO to the MDT device (less than 6MB/s as recorded by iostat); we have had some past memory pressure on the MDS (we recently reduced ldlm.namespaces.*.lru_size on cluster (non-Cray) clients from the default of unlimited to 128 on compute nodes and 2000 on login nodes), but in general the MDS host seems responsive and "normal". (A rough sketch of the relevant commands follows this list.)
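
For reference, a rough sketch of the tunables referenced in questions 2, 3 and 5 above (values are the ones quoted in this ticket, not recommendations; the modprobe.d file name is an assumption about how the option is set here):

mds# cat /etc/modprobe.d/lustre.conf        # hypothetical file name; the mdt module option takes effect at the next module load
options mdt mds_num_threads=1024
mds# lctl set_param debug=+dlmtrace         # add lock tracing on top of the existing +rpctrace before the next lctl dk
client# lctl set_param ldlm.namespaces.*.lru_size=128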

Happy to provide more information, just ask and we'll get it to you.

Thanks!


-Jason



 Comments   
Comment by Peter Jones [ 01/Feb/14 ]

Oleg

Please can you advise how to proceed here?

Thanks

Peter

Comment by Oleg Drokin [ 01/Feb/14 ]

I guess the most important question here would be: when you experience these issues, are all MDT threads busy (and most likely blocked on something at the same time)?
Take the big job that is now disabled, for example: did it happen to open a single file (or a small number of files) from a lot of clients with O_CREAT?

Comment by Jason Hill (Inactive) [ 01/Feb/14 ]

Oleg,

I've not been on the system during one of these events. I can work on getting another job to run tomorrow sometime, or maybe Monday. Currently on the MDS there are no processes (mdt*) in any state other than sleeping or running. I do see a high number of kiblnd processes (both in count and CPU utilization; that's from my historical memory, nothing concrete). We've put the offending user/code into scheduler jail, so they can't run and we're seeing if the pitchforks and torches go away for the weekend. We started seeing issues on Wednesday, and our traditional tools identified a particular code that was issuing ~50k LDLM_ENQUEUEs every 30 seconds. The app developer changed some IO and dropped it to ~25K LDLM_ENQUEUEs every 30s, but we were still getting complaints. There's not much interactive use at midnight on a Friday, so I don't have a good measuring stick just yet. I'm working on automating more of that.

We don't know much about what the application is doing, just what we see from our homegrown tools. Will try to find out more about the app over the weekend.

Will keep you posted. Thanks to you and Peter for commenting at this hour on a Friday.

Comment by Oleg Drokin [ 01/Feb/14 ]

So those 25K enqueues - what are they for? Opens? What mode, same file or different files?
Specific details here would help.

Comment by Jason Hill (Inactive) [ 01/Feb/14 ]

Specifics. Here's the data I have from during a period when the application we identified was running:

00000100:00100000:1.0:1391197263.389662:0:12671:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn02_007:69482ed8-e0da-f1ce-52fa-1f94fb17f79b+248664:16552:x1457850727013322:12345-10.36.207.211@o2ib:103
00000100:00100000:1.0:1391197263.389676:0:12671:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn02_007:69482ed8-e0da-f1ce-52fa-1f94fb17f79b+248663:16552:x1457850727013322:12345-10.36.207.211@o2ib:103 Request procesed in 17us (37us total) trans 0 rc 0/0
00010000:00080000:6.0:1391197263.391911:0:20306:0:(ldlm_lib.c:2515:target_send_reply()) Schedule reply immediately
00010000:00080000:6.0:1391197263.394777:0:20306:0:(ldlm_lib.c:2515:target_send_reply()) Schedule reply immediately
00000100:00100000:10.0:1391197263.399170:0:12670:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn01_006:7affdafd-903d-c4e0-829e-93e7deb30326+236769:19570:x1457850685779849:12345-10.36.207.212@o2ib:103
00000100:00100000:10.0:1391197263.399185:0:12670:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn01_006:7affdafd-903d-c4e0-829e-93e7deb30326+236768:19570:x1457850685779849:12345-10.36.207.212@o2ib:103 Request procesed in 18us (39us total) trans 0 rc 0/0
00010000:00080000:20.0:1391197263.399332:0:20469:0:(ldlm_lib.c:2515:target_send_reply()) Schedule reply immediately
00010000:00080000:6.0:1391197263.401981:0:20306:0:(ldlm_lib.c:2515:target_send_reply()) Schedule reply immediately
00000100:00100000:2.0:1391197263.405698:0:19999:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn00_000:f77d78c9-ff95-ba2a-857d-91ec128ac2b0+732343:10304:x1457850698872414:12345-10.36.207.210@o2ib:103
00000100:00100000:2.0:1391197263.405712:0:19999:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn00_000:f77d78c9-ff95-ba2a-857d-91ec128ac2b0+732342:10304:x1457850698872414:12345-10.36.207.210@o2ib:103 Request procesed in 18us (37us total) trans 0 rc 0/0
00010000:00080000:20.0:1391197263.406692:0:20469:0:(ldlm_lib.c:2515:target_send_reply()) Schedule reply immediately

I don't have specific knowledge of mapping those sorts of requests back to operations (open, close, stat, unlink, etc.). Happy to get educated more on this. Again, the data was generated with +rpctrace (no dlmtrace), and I just grepped for ldlm from that output to get the snippet above. The snippet is not guaranteed to be coherent or consistent. Uploading the entire rpctrace now.

Comment by Oleg Drokin [ 01/Feb/14 ]

opc 103 is cancel, so a lot of cancel activity.

There's no way to easily see the file mode and operation in more detail from this sort of trace, but since you isolated the app, you can just ask the app author what sort of things it does.

Comment by Matt Ezell [ 01/Feb/14 ]

$ grep "Handling RPC" rpctrace-mds1|awk -F':' '

{print $20}

' | sort -n |uniq -c |sort -nr
52301 101
8037 35
6469 36
1255 103
245 37
181 400
25 49
11 41
1 44

101 = LDLM_ENQUEUE
35 = MDS_CLOSE
36 = MDS_REINT
103 = LDLM_CANCEL
37 = MDS_READPAGE
400 = OBD_PING

$ grep "Handling RPC" rpctrace-mds1|awk -F':' '

{print $19}

'|sort|uniq -c|sort -nr|head -10
5298 12345-10.36.205.208@o2ib
5241 12345-10.36.205.201@o2ib
3151 12345-10.36.205.207@o2ib
2618 12345-10.36.205.200@o2ib
2324 12345-10.36.207.213@o2ib
1583 12345-10.36.202.130@o2ib
1169 12345-10.36.207.211@o2ib
1104 12345-10.36.207.210@o2ib
589 12345-10.36.207.212@o2ib
411 12345-10.36.205.198@o2ib

10.36.205.208@o2ib = titan-ext7
10.36.205.201@o2ib = titan-ext4
10.36.205.207@o2ib = titan-ext6
10.36.205.200@o2ib = titan-ext3
10.36.207.213@o2ib = lens38
10.36.202.130@o2ib = dtn03
10.36.207.211@o2ib = lens36
10.36.207.210@o2ib = lens35
10.36.207.212@o2ib = lens37
10.36.205.198@o2ib = titan-ext1

Looks like this is coming mostly from interactive use on the external login nodes.

Comment by Andreas Dilger [ 01/Feb/14 ]

Is this filesystem upgraded from 1.x or is it newly formatted with 2.x? Please check what the output from the following command is:

mds# dumpe2fs -h <MDT device> | grep feature

The reason I ask is because the dir_data feature (which should be in the above list) is not automatically enabled on upgraded filesystems (it would cause problems if the filesystem needed to be downgraded), but it improves 2.x performance significantly. Note that this most significantly affects files created with 2.x, and particularly "ls -l" and other readdir() workloads.

Please see LU-1248 for more details. Not sure if this is the case, but it was my first thought when I read this bug.

Comment by Matt Ezell [ 01/Feb/14 ]

It was formatted under 2.4.

We have some clients that are acting up. I tried to return all unused locks (by echo'ing clear into lru_size)
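
(For the record, a minimal sketch of that operation, assuming the lctl set_param form rather than a raw echo into /proc:)

client# lctl set_param ldlm.namespaces.*.lru_size=clear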

I see the following messages:
2014-02-01T18:21:06.280508-05:00 c24-0c0s6n0 LustreError: 11222:0:(import.c:324:ptlrpc_invalidate_import()) atlas2-MDT0000_UUID: rc = -110 waiting for callback (4 != 0)
2014-02-01T18:21:06.310918-05:00 c24-0c0s6n0 LustreError: 11222:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff880306cacc00 x1458495644509504/t0(0) o101->atlas2-MDT0000-mdc-ffff88033e1f1800@10.36.226.74@o2ib200:12/10 lens 592/25176 e 0 to 0 dl 1391272868 ref 1 fl Rpc:RE/0/ffffffff rc -5/-1
2014-02-01T18:21:06.310968-05:00 c24-0c0s6n0 LustreError: 11222:0:(import.c:350:ptlrpc_invalidate_import()) Skipped 3 previous similar messages
2014-02-01T18:21:06.310984-05:00 c24-0c0s6n0 LustreError: 11222:0:(import.c:366:ptlrpc_invalidate_import()) atlas2-MDT0000_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.

Comment by Jason Hill (Inactive) [ 02/Feb/14 ]

Answering Andreas for completeness:

[root@atlas-mds1 ~]# dumpe2fs -h /dev/mapper/atlas1-mdt1 | grep feature
dumpe2fs 1.42.7.wc1 (12-Apr-2013)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg ea_inode dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota

Tonight I have seen two messages in the logs that make me think there is something definitely wrong; the first on the MDS:

[962845.511495] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 5269s: evicting client at 9307@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881a2c51fd80/0x62d1aa23e9a027df lrc: 3/0,0 mode: PR/PR res: [0x20005c272:0x2:0x0].0 bits 0x1b rrc: 2 type: IBT flags: 0x200000000020 nid: 9307@gni100 remote: 0xca436432b3a155cc expref: 32 pid: 20112 timeout: 5258943031 lvb_type: 0

So it took us 5269 seconds to try to cancel a lock and then we evicted the client (am I reading that correctly?)

Next on an OSS:

Feb 1 22:28:52 atlas-oss2g3 kernel: [1881119.843613] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 10034s: evicting client at 9307@gni102 ns: filter-atlas1-OST02ba_UUID lock: ffff88064926b480/0x6acfe1d5cba7b850 lrc: 3/0,0 mode: PR/PR res: [0x111e85:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 9307@gni102 remote: 0xca436432b3a15682 expref: 4 pid: 26076 timeout: 6179874642 lvb_type: 1

Whoa. 10,034s.

Any thoughts on where to go digging in greater detail to help guide us more?

Comment by Jason Hill (Inactive) [ 02/Feb/14 ]

Also of use from Matt's analysis last night (sorry, just reading this now): those top 10 clients issuing RPCs are all Lustre 1.8.X clients (titan-ext* is Cray 1.8.6, lens* and dtn03 are Intel 1.8.9 compiled at ORNL). Thanks for the analysis, Matt.

Comment by Jason Hill (Inactive) [ 02/Feb/14 ]

Just for posterity, it looks like Cray has not put LU-532 into their base 1.8.6 client as we see messages like:

Feb 1 23:53:56 atlas-mds1 kernel: [ 4357.493520] Lustre: 14082:0:(mdt_xattr.c:363:mdt_reint_setxattr()) atlas1-MDT0000: client miss to set OBD_MD_FLCTIME when setxattr system.posix_acl_access: [object [0x20005b506:0x19b79:0x0]] [valid 68719476736]
Feb 1 23:53:56 atlas-mds1 kernel: [ 4357.514700] Lustre: 14082:0:(mdt_xattr.c:363:mdt_reint_setxattr()) Skipped 204 previous similar messages

A lot. From reading the bug it's a client issue, and my guess is that our non-Intel clients don't have the patch.

Comment by James A Simmons [ 02/Feb/14 ]

A new ticket was opened for the port of the LU-532 patches to b1_8. That ticket is LU-1838 and those patches have landed to 1.8.9.

Comment by Jason Hill (Inactive) [ 03/Feb/14 ]

I wrote some bash to look for the longest RPC from each event since 02-01-2014 where we noted that a metadata operation took longer than 1s. In almost every case the operation that took the longest was a lock enqueue:

[root@lustre-mgmt1]# /tmp/mdstrace-parse.sh &> /tmp/rpclist.txt
[root@lustre-mgmt1]# cat /tmp/rpclist.txt | awk -F\: '{print $21}' | awk '{print $1}' | sort -u
101
36

Comment by Oleg Drokin [ 03/Feb/14 ]

If you have lock timeouts in the thousands of seconds, that's sure to block any competing lock enqueues for just as long as well.

So I wonder if these events are correlated?

Comment by James Nunez (Inactive) [ 04/Feb/14 ]

Jason,

I've spent the day researching and speaking with our engineers about this issue. We think the most fruitful approach is to start by understanding the workload that experienced the slowdown. Do you have any information on the application that was issuing the 25-30K LDLM_ENQUEUE RPCs and what it is doing in terms of metadata? For example, is it opening a single file or a small number of files from many processes, or opening multiple files, and with what modes, etc.?

Thanks,
James

Comment by Jason Hill (Inactive) [ 04/Feb/14 ]

James,

The code that was mentioned on Friday has been held since we discovered that the change the developer made did not result in lower impact. We are continuing to see long response times for metadata operations in the absence of that code – so I think we have something more systemic going on.

Cray has an open bug (I mentioned this in the email to you and Peter) for their client (in Cray's bugzilla), and also thinks that LU-4509 might be something to look at for their client. We are actively working to get the 1.8 clients off the filesystem but that will take some time still.

-Jason

Comment by Jason Hill (Inactive) [ 04/Feb/14 ]

On the advice of Oleg I'm going to spend some time tonight trying to correlate the long response times to lock timeout and eviction notices on the clients. This will take some time as I'll need to basically create a coherent stream of log data from disparate sources and then try to follow messages. Hoping to have something by morning.

Comment by Matt Ezell [ 04/Feb/14 ]

After reading LU-4579, I think our issues may be similar.

Jason, I look forward to seeing your analysis - I think it will be really helpful.

We have lots of threads being hung:

$ grep -c "was inactive for" kern.log 
47
Feb  3 22:40:17 atlas-mds1 kernel: [172488.242493] LNet: Service thread pid 15401 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb  3 22:40:17 atlas-mds1 kernel: [172488.261927] LNet: Skipped 4 previous similar messages
Feb  3 22:40:17 atlas-mds1 kernel: [172488.267771] Pid: 15401, comm: mdt03_237
Feb  3 22:40:17 atlas-mds1 kernel: [172488.272252] 
Feb  3 22:40:17 atlas-mds1 kernel: [172488.272253] Call Trace:
Feb  3 22:40:17 atlas-mds1 kernel: [172488.277084]  [<ffffffffa06948b5>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.281355] Lustre: lock timed out (enqueued at 1391484017, 1200s ago)
Feb  3 22:40:17 atlas-mds1 kernel: [172488.281357] Lustre: Skipped 5 previous similar messages
Feb  3 22:40:17 atlas-mds1 kernel: [172488.298432]  [<ffffffff8150f120>] ? _cond_resched+0x30/0x40
Feb  3 22:40:17 atlas-mds1 kernel: [172488.304870]  [<ffffffffa03b265e>] ? cfs_cond_resched+0xe/0x10 [libcfs]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.312380]  [<ffffffffa03c85bf>] ? cfs_hash_for_each_relax+0x17f/0x360 [libcfs]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.320973]  [<ffffffffa069d850>] ? ldlm_res_hash_dump+0x0/0x60 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.328685]  [<ffffffffa069d850>] ? ldlm_res_hash_dump+0x0/0x60 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.336381]  [<ffffffffa03c9f6f>] ? cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.344975]  [<ffffffffa069da6c>] ? ldlm_namespace_dump+0xec/0x1c0 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.352978]  [<ffffffffa06b4a3f>] ? ldlm_expired_completion_wait+0x2af/0x390 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.362048]  [<ffffffffa06b9038>] ? ldlm_completion_ast+0x508/0x960 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.370144]  [<ffffffffa06b4790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.378995]  [<ffffffff81063600>] ? default_wake_function+0x0/0x20
Feb  3 22:40:17 atlas-mds1 kernel: [172488.386125]  [<ffffffffa06b8758>] ? ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.394614]  [<ffffffffa06b8b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.402507]  [<ffffffffa0d11a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.409813]  [<ffffffffa0d17c7b>] ? mdt_object_lock0+0x33b/0xaf0 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.417312]  [<ffffffffa0d11a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.424640]  [<ffffffffa06b8b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.432533]  [<ffffffffa0d184f4>] ? mdt_object_lock+0x14/0x20 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.439741]  [<ffffffffa0d275a9>] ? mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.447861]  [<ffffffffa06e1135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.455290]  [<ffffffffa0709646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.463197]  [<ffffffffa06e33c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.471184]  [<ffffffffa0d283ad>] ? mdt_intent_getattr+0x29d/0x490 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.478879]  [<ffffffffa0d14f1e>] ? mdt_intent_policy+0x39e/0x720 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.486491]  [<ffffffffa0699831>] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.494398]  [<ffffffffa06c01ef>] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.502771]  [<ffffffffa0d153a6>] ? mdt_enqueue+0x46/0xe0 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.509582]  [<ffffffffa0d1ba97>] ? mdt_handle_common+0x647/0x16d0 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.517282]  [<ffffffffa0d553f5>] ? mds_regular_handle+0x15/0x20 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.524805]  [<ffffffffa06f23c8>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.533864]  [<ffffffffa03b25de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.541076]  [<ffffffffa03c3d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.548893]  [<ffffffffa06e9729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.556704]  [<ffffffffa06f375e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.564121]  [<ffffffffa06f2c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.571324]  [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Feb  3 22:40:17 atlas-mds1 kernel: [172488.577289]  [<ffffffffa06f2c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.584510]  [<ffffffffa06f2c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.591709]  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

We are getting evictions due to lock callback timers expiring:

$ grep -c "lock callback timer expired" kern.log 
52

I understand these big numbers below may be bogus. Some of these are showing over 7 hours. Does it even make sense for a client to keep a metadata lock that long?

Feb  3 22:09:31 atlas-mds1 kernel: [170644.624185] LustreError: 13226:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 25769s: evicting client at 108@gni3  ns: mdt-atlas1-MDT0000_UUID lock: ffff883f0ca506c0/0xce18f06f09f073de lrc: 3/0,0 mode: PR/PR res: [0x200044aee:0x52d6:0x0].0 bits 0x13 rrc: 409 type: IBT flags: 0x200400000020 nid: 108@gni3 remote: 0x115740db98f79fb1 expref: 117 pid: 14806 timeout: 4465563044 lvb_type: 0
Feb  3 22:20:01 atlas-mds1 kernel: [171273.688384] LustreError: 14898:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 26399s: evicting client at 181@gni3  ns: mdt-atlas1-MDT0000_UUID lock: ffff883ae3910480/0xce18f06f09f0741d lrc: 3/0,0 mode: PR/PR res: [0x200044aee:0x52d6:0x0].0 bits 0x13 rrc: 404 type: IBT flags: 0x200400000020 nid: 181@gni3 remote: 0xddf404e7006e98bb expref: 117 pid: 15497 timeout: 4466193960 lvb_type: 0
Feb  3 22:30:51 atlas-mds1 kernel: [171922.722297] LustreError: 15176:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 27049s: evicting client at 179@gni3  ns: mdt-atlas1-MDT0000_UUID lock: ffff88356b380900/0xce18f06f09f07582 lrc: 3/0,0 mode: PR/PR res: [0x200044aee:0x52d6:0x0].0 bits 0x13 rrc: 401 type: IBT flags: 0x200400000020 nid: 179@gni3 remote: 0x4c53c41d524a00b6 expref: 117 pid: 14857 timeout: 4466843371 lvb_type: 0

Many of these are from gni3, which is our XC30 that uses Lustre 2.3 clients.

Comment by Oleg Drokin [ 04/Feb/14 ]

Clients hold locks as long as they wish if there is no contention, since it's essentially free and saves an RPC if the lock is actually needed again. Cached data is tied to locks.
I believe the long timeouts printed are bogus - it's the time since the successful enqueue, as I discovered. We probably want to fix that to have a better idea of how long the waiting actually was.

The thread sitting there idle for 1200 seconds IS real time. The stack trace does indicate it was waiting for some lock to be canceled. What's your at_max?

Comment by Matt Ezell [ 04/Feb/14 ]

All of our clients should be setting at_max to 600 seconds. This also appears to be the default, so it's probably safe to say that's a global setting.
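
(For reference, a quick way to confirm the effective value on a given node; assuming the usual /proc location for this vintage, which lctl get_param should resolve to the same file:)

node# lctl get_param at_max
node# cat /proc/sys/lustre/at_max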

Comment by Matt Ezell [ 04/Feb/14 ]

> Clients hold locks as long as they wish if there is no contention, since it's essentially free, but saves an RPC if it's actually needed again.

Sure, if the same node needs it again. But if a different node needs the lock, it has to wait on the blocking AST. Our Cray clients have lru_max_age set to 9000000 (104 days), which is better than the default value of 36000000 (416 days!). We might want to look into reducing this, but in practice the locks will be returned earlier because of lru_size. We set the MDT lru_size to 2000 to (try to) minimize memory usage on the MDS.
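
(A sketch of how these could be inspected or lowered on a client; the set_param value below is arbitrary and the units of lru_max_age have differed between client versions, so treat it purely as an illustration:)

client# lctl get_param ldlm.namespaces.*.lru_max_age ldlm.namespaces.*.lru_size
client# lctl set_param ldlm.namespaces.*.lru_max_age=600000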

Comment by Oleg Drokin [ 04/Feb/14 ]

I see. Enqueue timeout is defined as ldlm_get_enq_timeout(lock) * 2. ldlm_get_enq_timeout would likely return at_max in your case, hence 1200 seconds.
It's probable that the lock it was trying to get conflicted on some other lock held by some other server thread too and you might have gotten a whole lot of traces like that in the end.
Now, we should have evicted the client much sooner, at around 600 seconds after we sent it the blocking callback (you can probably see an eviction about -600 seconds from this dump, I'd hazard to guess), but if there were several locks like that, we'd exceed the 1200 seconds like what happened here, I guess.

Understanding why the clients don't send the cancels in (or they also might be just dropped somewhere along the way) might be the key here, I guess.
Stuff should not be stuck in ELC cancels either since then we'd have ASTs return with EINVAL and server would just drop such locks effortlessly.

Comment by Oleg Drokin [ 04/Feb/14 ]

If a different node needs the lock, that's ok; it needs to send an RPC anyway and that's not fast either, so we might as well allow the holder to keep the lock as long as it needs.

Currently all the heuristics assume the resources are not really contended. What we need to do better eventually is see which resources are actually highly contended and only release those locks (possibly as soon as we are done with them too).

Comment by Jason Hill (Inactive) [ 04/Feb/14 ]

Tangential to the current analysis of timeouts, evictions, and long responses; the application we identified had ~2000 MPI ranks finishing at the same time and opening a file each (in the same directory, not pre-created files) for writing output. The mitigation we have put in place is to only allow 100 ranks to write at a time, and iterate until all ranks have written. There are other tactics to advise the user to take, but this is one we're grabbing for a short term test.

This in no way resolves the issues (as I stated above, the user has been in scheduler jail since Friday and we continued to see issues), but I did want to follow up since some were interested in how the application was able to generate ~50K lock enqueues in 30s.

Comment by Oleg Drokin [ 05/Feb/14 ]

Copying my comment from LU-4579 here as well, since what you have here seems to be very related and it might make sense to run with the same patch:

Ok, I uploaded http://review.whamcloud.com/9127 that I tested lightly here.

It fixes the lock eviction message to actually specify how long the wait since sending the RPC was.
It adds a few timeout checks just in case (I see some refresh calls from e.g. ofd may not use at_max).
It prints the number of busy lock prolongations (a bit hackish, only for ibit locks - i.e. MDS).

This only makes sense to install on servers, primarily the MDS. Though I tested it, and clients don't die either in case you install it there by mistake.

I imagine it would be interesting to run with this and, once the first 1200s timeout hits, to have a log of this timeout plus the preceding hour of messages.

Comment by Matt Ezell [ 05/Feb/14 ]

Thanks Oleg.

Right now, our theory is that clients are getting stuck with LU-4300. So clients are actually holding locks for an extended period of time and fail to respond to blocking ASTs.

We have two possible workarounds that we are trying; one today and the second if the first doesn't help:

  • Enabling DLMTRACE to perturb timing. Even if it doesn't prevent the problem, at least we get some good logs to look at
  • Disable early_lock_cancel. This was recommended by DDN, but I don't yet understand how it affects the situation

We hope Alexey will have a client-side patch we can try in the next day or two.

Jason Hill: I think it makes sense to watch for a couple of days with the above workarounds before we consider taking an MDS patch. Do you agree?

Comment by Jason Hill (Inactive) [ 05/Feb/14 ]

All,

Based on the findings by Cray in their bugs for the client and Matt's information above, I highly recommend changing the priority from Major to Minor. Thank you all for your detailed attention. We have talked about possibly enabling the debugging patches when Titan gets an updated client from Cray/Xyratex; we will keep you informed.


-Jason

Comment by James Nunez (Inactive) [ 05/Feb/14 ]

Jason,

Thanks for the update. I've changed the priority of this ticket at your request.

James

Comment by Oleg Drokin [ 05/Feb/14 ]

Matt, LU-4300 might explain long lock cancel times for OSTs, but not for MDTs, so I'd still want to better understand what goes on with MDT locks here.

Comment by Jason Hill (Inactive) [ 08/Feb/14 ]

Oleg,

We are rolling titan's clients back to 1.8.6 from Cray on Monday. If we integrate your debugging patch and roll the client back, will you get the data you need to analyze?


-Jason

Comment by Oleg Drokin [ 08/Feb/14 ]

Yes.
The patch is server side.
As long as you still have the same problem, the data with the patch should be more useful than not having it.

Comment by Matt Ezell [ 11/Feb/14 ]

Debug patch was installed during maintenance yesterday. Many hits of the new "Long-waiting lock" message.

Feb 11 05:34:45 atlas-mds1 kernel: [66331.830426] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 172 sec
Feb 11 05:44:05 atlas-mds1 kernel: [66892.365498] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 213 sec
Feb 11 05:54:12 atlas-mds1 kernel: [67499.945964] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 213 sec
Feb 11 06:16:29 atlas-mds1 kernel: [68838.222714] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 78 sec
Feb 11 06:24:04 atlas-mds1 kernel: [69293.656216] LustreError: 20107:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec
Feb 11 06:34:05 atlas-mds1 kernel: [69895.230302] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec
Feb 11 06:44:06 atlas-mds1 kernel: [70496.803999] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec
Feb 11 06:54:21 atlas-mds1 kernel: [71112.391365] LustreError: 18817:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 210 sec
Feb 11 07:04:28 atlas-mds1 kernel: [71719.971048] LustreError: 18852:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 204 sec
Feb 11 07:15:48 atlas-mds1 kernel: [72400.620205] LustreError: 19317:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 125 sec
Feb 11 07:24:43 atlas-mds1 kernel: [72936.132036] LustreError: 18544:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 191 sec
Feb 11 07:34:41 atlas-mds1 kernel: [73534.702299] LustreError: 20151:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 195 sec
Feb 11 07:44:15 atlas-mds1 kernel: [74109.251117] LustreError: 20446:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 222 sec
Feb 11 07:54:16 atlas-mds1 kernel: [74710.824274] LustreError: 18539:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 222 sec
Feb 11 08:04:39 atlas-mds1 kernel: [75334.420048] LustreError: 20352:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 219 sec
Feb 11 08:15:47 atlas-mds1 kernel: [76003.057160] LustreError: 19125:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 161 sec
Feb 11 08:26:26 atlas-mds1 kernel: [76642.667237] LustreError: 18540:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 123 sec
Feb 11 08:34:55 atlas-mds1 kernel: [77152.154357] LustreError: 19405:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 216 sec
Feb 11 08:45:50 atlas-mds1 kernel: [77807.778867] LustreError: 19036:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 162 sec
Feb 11 08:55:17 atlas-mds1 kernel: [78375.320158] LustreError: 25518:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 209 sec
Feb 11 09:05:03 atlas-mds1 kernel: [78961.879719] LustreError: 19181:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec
Feb 11 09:15:04 atlas-mds1 kernel: [79563.453708] LustreError: 19691:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec
Feb 11 09:25:12 atlas-mds1 kernel: [80172.034167] LustreError: 25536:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 217 sec
Feb 11 09:35:07 atlas-mds1 kernel: [80767.602473] LustreError: 12379:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 223 sec
Feb 11 09:45:07 atlas-mds1 kernel: [81368.175242] LustreError: 19396:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec

No hits of "requested timeout %d, more than at_max" or "Adding a lock, but the front position is scheduled in %d seconds"

Also seeing messages like:

Feb 11 09:48:52 atlas-mds1 kernel: [81593.435989] LNet: Service thread pid 19926 completed after 16665.22s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Feb 11 09:56:36 atlas-mds1 kernel: [82058.639480] Lustre: atlas1-MDT0000: Client c7c41ec7-373b-1a4a-8f79-d741cf79e22a (at 12589@gni100) refused reconnection, still busy with 2 active RPCs

Comment by Oleg Drokin [ 11/Feb/14 ]

Hm, so only the extra waiting, huh. And the numbers are not all that far off.

The slow service thread and the still busy messages could not be new ones, you should have gotten them in the past too, right?
Also I imagine you still get lock timeouts on mdt with corresponding client evictions, can you paste a few here, so that we can see what is the actual timeout time now?

Comment by Matt Ezell [ 11/Feb/14 ]

Feb 11 10:18:55 atlas-mds1 kernel: [83398.111851] LustreError: 17613:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 18469s: evicting client at 14346@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881463276000/0xd030cd07cae2a16b lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1847 type: IBT flags: 0x200400000020 nid: 14346@gni100 remote: 0xc3ce954c318ff7ca expref: 37 pid: 18996 timeout: 4377983092 lvb_type: 0 used 0
Feb 11 10:28:56 atlas-mds1 kernel: [83999.685683] LustreError: 19235:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 19070s: evicting client at 14349@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881f796cb000/0xd030cd07cae2a205 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1833 type: IBT flags: 0x200400000020 nid: 14349@gni100 remote: 0x4c3f68e1e5538822 expref: 36 pid: 19097 timeout: 4378584127 lvb_type: 0 used 0
Feb 11 10:38:57 atlas-mds1 kernel: [84601.260349] LustreError: 17617:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 19671s: evicting client at 16378@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff88155ad90480/0xd030cd07cae2a259 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1828 type: IBT flags: 0x200400000020 nid: 16378@gni100 remote: 0xc7d8717e54253dc2 expref: 36 pid: 19226 timeout: 4379185137 lvb_type: 0 used 0
Feb 11 10:48:58 atlas-mds1 kernel: [85202.833436] LustreError: 20236:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 20272s: evicting client at 16310@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881a9a7d46c0/0xd030cd07cae2a2b4 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1824 type: IBT flags: 0x200400000020 nid: 16310@gni100 remote: 0xbd5c384bff16401 expref: 36 pid: 19337 timeout: 4379786098 lvb_type: 0 used 0
Feb 11 10:58:59 atlas-mds1 kernel: [85804.407052] LustreError: 19336:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 20873s: evicting client at 13309@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff883fb36c7480/0xd030cd07cae2a3e8 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1820 type: IBT flags: 0x200400000020 nid: 13309@gni100 remote: 0x33c7e22dc8d25f25 expref: 36 pid: 19745 timeout: 4380387245 lvb_type: 0 used 0
Feb 11 11:09:00 atlas-mds1 kernel: [86405.980993] LustreError: 19342:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 21474s: evicting client at 16366@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff883ba9de5240/0xd030cd07cae2a489 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1811 type: IBT flags: 0x200400000020 nid: 16366@gni100 remote: 0x1e95102493397b80 expref: 37 pid: 20353 timeout: 4380988521 lvb_type: 0 used 0
Feb 11 11:19:02 atlas-mds1 kernel: [87008.555529] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 22076s: evicting client at 10380@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881f7f49f900/0xd030cd07cae2a577 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1788 type: IBT flags: 0x200400000020 nid: 10380@gni100 remote: 0x758d3352cfc91da4 expref: 36 pid: 19368 timeout: 4381590703 lvb_type: 0 used 0
Feb 11 11:29:04 atlas-mds1 kernel: [87611.130362] LustreError: 17619:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 22678s: evicting client at 11922@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881448cdad80/0xd030cd07cae2a61f lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1780 type: IBT flags: 0x200400000020 nid: 11922@gni100 remote: 0x9f490f3250dcfa26 expref: 40 pid: 19083 timeout: 4382192237 lvb_type: 0 used 0
Feb 11 11:39:05 atlas-mds1 kernel: [88212.705133] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 23279s: evicting client at 16399@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881a19db7d80/0xd030cd07cae2a657 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1775 type: IBT flags: 0x200400000020 nid: 16399@gni100 remote: 0xae3b55b3b0e8fcaf expref: 37 pid: 19294 timeout: 4382793129 lvb_type: 0 used 0
Feb 11 11:49:06 atlas-mds1 kernel: [88814.277996] LustreError: 18622:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 23880s: evicting client at 13288@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff883fb1568b40/0xd030cd07cae2a6a4 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1772 type: IBT flags: 0x200400000020 nid: 13288@gni100 remote: 0xbc7520cd6a29b2c2 expref: 37 pid: 19662 timeout: 4383394123 lvb_type: 0 used 0
Feb 11 11:59:07 atlas-mds1 kernel: [89415.852733] LustreError: 25491:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 24481s: evicting client at 12596@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff883fac92e900/0xd030cd07cae2a745 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1769 type: IBT flags: 0x200400000020 nid: 12596@gni100 remote: 0xb235a780d20e5531 expref: 38 pid: 20296 timeout: 4383995131 lvb_type: 0 used 0
Feb 11 12:09:08 atlas-mds1 kernel: [90017.426664] LustreError: 20036:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 25082s: evicting client at 11054@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff883fa39f1d80/0xd030cd07cae2a895 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1755 type: IBT flags: 0x200400000020 nid: 11054@gni100 remote: 0x39056ad1e7dcf9ff expref: 39 pid: 20374 timeout: 4384596118 lvb_type: 0 used 0
Feb 11 12:19:11 atlas-mds1 kernel: [90621.002347] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 25684s: evicting client at 12598@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881aac5e0480/0xd030cd07cae2ab12 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1742 type: IBT flags: 0x200400000020 nid: 12598@gni100 remote: 0xbc132fc49289524 expref: 40 pid: 18871 timeout: 4385199738 lvb_type: 0 used 0

Comment by Matt Ezell [ 11/Feb/14 ]

Of the 48 evictions I've seen today, most of them refer to the same FID:

$ grep "lock callback" kern.log |awk '

{print $29}

'|sort|uniq -c | sort -nr
46 [0x20007d38d:0xf5:0x0].0
1 [0x200064129:0x1e493:0x0].0
1 [0x200064129:0x1e45f:0x0].0

Using 'lfs fid2path', all 48 of these correspond to a directory. [0x20007d38d:0xf5:0x0] is a directory with 1262 files in it.
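
(For completeness, the lookup was of this form, assuming the /lustre/atlas1 mount point shown later in this ticket; the second command's path is just a placeholder for whatever fid2path returns:)

client# lfs fid2path /lustre/atlas1 '[0x20007d38d:0xf5:0x0]'
client# ls -f /lustre/atlas1/path/returned/above | wc -l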

Comment by Oleg Drokin [ 11/Feb/14 ]

How long is the timeout listed in the lock eviction message?

1262 files is not all that many, either.

Comment by Oleg Drokin [ 11/Feb/14 ]

Oh, I see you pasted the eviction messages too. The time is still in the 20k-second range, which cannot be right, sigh.

Comment by Matt Ezell [ 11/Feb/14 ]

It's logging every 10 minutes. The time listed increases by about 600s (10 minutes) each time. It seems like the lock is bouncing between clients. If a client gets evicted due to a callback timer expiring, and that lock is re-granted to the next waiting client, do you update l_last_activity?

Comment by Oleg Drokin [ 11/Feb/14 ]

Yes, the re-granting is actually an entirely new enqueue, so it gets counted from zero, of course.
It's almost like last_activity is somehow zeroed out instead.

I also note that all the expired locks listed are of PR mode. Since this is a directory, this could only be a result of readdir.
I wonder if the lock is held up because the client that got this lock tries to obtain the directory pages, but cannot for some reason, hence the lock is never released. Do you happen to get messages like "error reading dir ..." on clients, or any other errors from llite/dir.c?

Comment by Matt Ezell [ 11/Feb/14 ]

Here are the logs from dir.c from the Cray clients. c0-0c0s2n2 is 90@gni100 and c24-0c0s6n0 is 9228@gni100. Both of those are login nodes.

console-20140130:2014-01-30T18:23:55.673423-05:00 c0-0c0s2n2 LustreError: 4961:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200037b5b:0x307f:0x0] at 0: rc -108
console-20140130:2014-01-30T18:23:55.673430-05:00 c0-0c0s2n2 LustreError: 4961:0:(dir.c:595:ll_dir_read()) error reading dir [0x200037b5b:0x307f:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:11.186576-05:00 c24-0c0s6n0 LustreError: 14610:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:11.186640-05:00 c24-0c0s6n0 LustreError: 14610:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:13.317079-05:00 c24-0c0s6n0 LustreError: 14611:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:13.317127-05:00 c24-0c0s6n0 LustreError: 14611:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:16.408287-05:00 c24-0c0s6n0 LustreError: 14612:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:16.408328-05:00 c24-0c0s6n0 LustreError: 14612:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -108

Here are some client logs from 14346@gni100 (the first client evicted in the log above). 10 minutes before the eviction, it dropped a PUT from the MDS. How do I debug why it was dropped?

console-20140211:2014-02-11T09:08:07.891822-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred while communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T09:08:07.891830-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T09:26:32.479117-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred while communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T09:26:32.479133-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T09:44:52.493914-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred while communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T09:44:52.493935-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T10:03:13.998980-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred while communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T10:03:13.999004-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T10:08:54.976373-05:00 c13-5c0s5n0 LNet: 15569:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.226.72@o2ib200 portal 10 match 1459678641612461 offset 192 length 536: 2
console-20140211:2014-02-11T10:21:38.458816-05:00 c13-5c0s5n0 LustreError: 167-0: This client was evicted by atlas1-MDT0000; in progress operations using this service will fail.
console-20140211:2014-02-11T10:21:38.458837-05:00 c13-5c0s5n0 LustreError: Skipped 10 previous similar messages
console-20140211:2014-02-11T10:21:38.458844-05:00 c13-5c0s5n0 Lustre: Server atlas1-MDT0000_UUID version (2.4.2.0) is much newer than client version (1.8.6)
console-20140211:2014-02-11T10:21:38.458851-05:00 c13-5c0s5n0 Lustre: Skipped 10 previous similar messages
console-20140211:2014-02-11T10:21:38.458859-05:00 c13-5c0s5n0 Lustre: atlas1-MDT0000-mdc-ffff88080dd5c800: Connection restored to service atlas1-MDT0000 using nid 10.36.226.72@o2ib200.

Comment by Oleg Drokin [ 12/Feb/14 ]

Hm, the mds_connect failing with -16 before the dropped PUT means the client is already disconnected from the MDS, and the MDS does not let it back in because it believes it is already serving a request from this client (the server-side messages are "refused reconnection, still busy with X active RPCs").

Dropping the PUT is because something was sent to a client that the client did not expect. Portal 10 is MDC_REPLY. What likely happened here is that the request on the MDS took a really, really long time, enough for the client to time out its original request and try to reconnect (only to be rejected by the server as above); eventually the MDS thread completed (visible as a "This indicates the system was overloaded" message) and sent a response, alas the client was no longer listening for it. But because the server is no longer processing anything from this client, the next reconnect succeeds (and the client notices it actually was evicted).

Earlier in that same log you'll see there was a request-timed-out message of some sort; what was it and when (or if there was more than one, all of them)? How synced up are the clocks on the servers and clients (for log purposes)? 10:18:55 is before the dropped PUT reply. Could it be we had a deadlock-like situation where a client holds the lock, then readdir (or whatever the request was, but it must have had the lock held to prevent cancellation) caused the MDS to perform an operation that wanted that lock cancelled?
In the MDS log, before the 10:08 time stamp of the "This indicates the system was overloaded" message, there was likely a watchdog trigger for the same pid; what was the stack trace?

Additionally, this seems to be a fairly recent problem. I believe ORNL was running 2.4 servers + 1.8 clients for quite a while before, so can you briefly describe what changed before the problems appeared? Just the fs reformat?

If you can dump a sysrq-t from the first node that develops the 'The mds_connect operation failed with -16' symptom, that might be interesting, to see what other stuff was happening there, though I guess the problem might be due to several nodes' interoperation.

I wonder if something like this is a possibility:
thread 1: lock the lock, issue readdir
thread 2: do some sort of chmod/create that would invalidate the directory lock.
....
On the server: the create arrived and started to be executed to the point of getting locks first, and sent a blocking AST back to the client that's waiting for the readdir to finish.
The readdir arrived and blocked on something from the create (because the readdir RPC does not take any ldlm locks). (Another option: the readdir is just stuck in the list waiting to be serviced, but cannot be due to a lack of free threads? That should not be possible because readdir goes to a separate portal.)
Can you also grab a sysrq-t from the MDT when this happens (the "refused reconnection, still busy with X active RPCs")?
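
(For reference, the usual way to capture that on the MDS; the dump can exceed the dmesg ring buffer, so pulling it from console/netconsole logging is safer:)

mds# echo 1 > /proc/sys/kernel/sysrq       # make sure sysrq is enabled (assumption about the current setting)
mds# echo t > /proc/sysrq-trigger          # dump all task stacks to the kernel log
mds# dmesg > /tmp/atlas-mds1-sysrq-t.txt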

Comment by Matt Ezell [ 12/Feb/14 ]

I've been having problems stitching together the logs. The MDS suppressed dumping stack traces to 3 every 300 seconds, so it seems like I'm missing different pieces for different occurrences at different times. I'll look into it some more tonight.

Interestingly, it looks like this behavior has calmed down today. I'm guessing it's workload-related.

Time should be pretty well synchronized. Everything uses NTP.

ORNL was running with (Cray) 1.8.6 clients against multiple file systems. Most users were still on Widow (1.8.x based) and we brought in our new Atlas system (2.4.x). In December and January, users migrated from Widow to Atlas. As more users moved to Atlas, it seemed like our slowdowns got more common/worse. A couple weeks ago we decommissioned Widow and moved most clients to 2.4 (since we only had 2.4 servers). We hit some client-side issues and decided to downgrade clients to (Cray) 1.8.6 this Monday.

Comment by Oleg Drokin [ 13/Feb/14 ]

It could well be lock related.
If the MDC readpage/readdir RPC is slowed down for some reason, it might lead to locks timing out; this feeds on itself, since while a client holds that readdir lock, other RPCs (like creates in that same dir) would be blocked and would clog the incoming request queue for the normal RPC portal, at the very least.

If you can do a sysrq-t when you are in this situation, it should be useful; at least we should be able to get some idea of what's waiting where (and especially where the readdir threads wait).

Comment by Jason Hill (Inactive) [ 14/Feb/14 ]

We attempted to crash dump the MDS today (memory pressure was indicated in the logs by messages like "page allocation failure"). The last time I looked there was 175GB in slabs; I did not grab a slabtop to see where we were holding memory. The crash dump process was stuck for ~20 minutes and the admin decided to kill the crash dump (ipmi power off), so we don't have additional data. The performance of metadata operations continues to be a large complaint from our user base, so we'd like to raise the priority back up. We ran with the 1.8.6 client for 1 week and the logs on titan are much quieter. Please indicate the next steps that we should take in terms of gathering data. Current status:

Server: Lustre 2.4.2 + LU-4006 + LU-4579 (patch 9127)
Clients:
18,696 Cray Lustre 1.8.6
747 Cray Lustre 2.3.?
316 Intel Lustre 2.4.2
12 Intel Lustre 1.8.9
1 Intel Lustre 2.4.1

Can we raise the priority?

Thanks,
-Jason

Comment by James Nunez (Inactive) [ 14/Feb/14 ]

Jason,

I've returned the priority of this ticket to Major.

James

Comment by Oleg Drokin [ 14/Feb/14 ]

Jason:
So, a sysrq-t from the MDT when you have lock timeouts would be great, so we know what waits where. This should also be safer for your MDT than a crashdump.

The previous analysis where most of lock timeouts were attributed to 2.3 clients was also interesting. Is this still the pattern?

The page allocation failure, was that a "tage" allocation failure by any chance? If not, do you have a sample?

Comment by Jason Hill (Inactive) [ 17/Feb/14 ]

Oleg,

We've not seen the "refused connection" messages since 02/11/14. However since introducing the patch from LU-4579 and LU-4006 (minor upgrade to the kernel in the RHEL 6.4 stream also), we've been having memory issues. Nothing like "tage" allocation, but we're consuming ~100GB (of 256GB) in ldiskfs_inode_cache. MDS was rebooted on Friday (noted above), and since then we've consumed 95GB in ldiskfs_inode_cache. Output from slabtop below. Could there be any reason in those two patches for this behavior to have started?

[root@atlas-mds1 ~]# uname -a
Linux atlas-mds1 2.6.32-358.23.2.el6.atlas.x86_64 #1 SMP Thu Jan 23 08:28:06 EST 2014 x86_64 x86_64 x86_64 GNU/Linux

[slabtop]
Active / Total Objects (% used) : 464456162 / 554753575 (83.7%)
Active / Total Slabs (% used) : 47894092 / 47894105 (100.0%)
Active / Total Caches (% used) : 134 / 238 (56.3%)
Active / Total Size (% used) : 154353997.52K / 165485911.12K (93.3%)
Minimum / Average / Maximum Object : 0.02K / 0.30K / 4096.00K

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
71585049 71500585 39% 1.03K 23861683 3 95446732K ldiskfs_inode_cache
258255738 257781953 0% 0.21K 14347541 18 57390164K osp_obj
33404544 19776814 59% 0.24K 2087784 16 8351136K mdt_obj
34789660 20228836 58% 0.19K 1739483 20 6957932K size-192
11802420 11362818 96% 0.56K 1686060 7 6744240K ldlm_locks
36565504 19776806 54% 0.12K 1142672 32 4570688K lod_obj
38187240 19776783 51% 0.09K 954681 40 3818724K mdd_obj
10246056 9800319 95% 0.31K 853838 12 3415352K ldlm_resources
15167743 15146729 99% 0.10K 409939 37 1639756K buffer_head
40185488 15123748 37% 0.03K 358799 112 1435196K size-32
61638 61626 99% 8.00K 61638 1 493104K size-8192
2075220 1957851 94% 0.12K 69174 30 276696K size-128
99638 97555 97% 2.00K 49819 2 199276K size-2048
347039 346726 99% 0.55K 49577 7 198308K radix_tree_node
274722 274712 99% 0.62K 45787 6 183148K proc_inode_cache
318336 310874 97% 0.50K 39792 8 159168K size-512
102492 102443 99% 1.02K 34164 3 136656K nfs_inode_cache
98316 91419 92% 1.00K 24579 4 98316K size-1024
22746 22746 100% 4.00K 22746 1 90984K size-4096
396660 395855 99% 0.19K 19833 20 79332K dentry
2354 2354 100% 16.00K 2354 1 37664K size-16384
280 280 100% 128.00K 280 1 35840K size-131072
492414 279596 56% 0.06K 8346 59 33384K size-64

Comment by Jason Hill (Inactive) [ 17/Feb/14 ]

for completeness:

[root@atlas-mds1 ~]# rpm -qi lustre
Name : lustre Relocations: (not relocatable)
Version : 2.4.2 Vendor: (none)
Release : 2.6.32_358.23.2.el6.atlas.x86_64 Build Date: Thu Feb 6 15:15:56 2014
Install Date: Mon Feb 10 09:39:10 2014 Build Host: lgmgmt1.ccs.ornl.gov
Group : Utilities/System Source RPM: lustre-2.4.2-2.6.32_358.23.2.el6.atlas.x86_64.src.rpm
Size : 4062709 License: GPL
Signature : (none)
URL : http://wiki.whamcloud.com/
Summary : Lustre File System
Description :
Userspace tools and files for the Lustre file system.

Comment by Jason Hill (Inactive) [ 17/Feb/14 ]

Also to correlate, we've consumed a large number of MDT inodes for atlas1:

[root@lfs-mgmt1 scratch]# df -i /lustre/atlas1
Filesystem Inodes IUsed IFree IUse% Mounted on
10.36.226.77@o2ib:/atlas1
1073741824 427809242 645932582 40% /lustre/atlas1
[root@lfs-mgmt1 scratch]# df -i /lustre/atlas2
Filesystem Inodes IUsed IFree IUse% Mounted on
10.36.226.77@o2ib:/atlas2
1073741824 88963619 984778205 9% /lustre/atlas2

Current slab usage for atlas-mds3 (atlas2's MDS) is ~30GB.

The next test is to fire up the TDS and create ~500M files to see if the slab usage follows the number of files in the filesystem.
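
(If the question is whether that memory is reclaimable cache rather than a leak, one hedged check on a quiet MDS is the standard kernel knob that drops clean dentries and inodes, then watching slabtop again:)

mds# sync
mds# echo 2 > /proc/sys/vm/drop_caches
mds# slabtop -o | head -20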

Comment by Matt Ezell [ 18/Feb/14 ]

We ran oprofile on the mds and got the following top-consumers:

samples  %        image name               app name symbol name
4753117  40.5645  vmlinux                  vmlinux __write_lock_failed
1276800  10.8966  libcfs.ko                libcfs.ko cfs_percpt_lock
869346    7.4193  oprofiled                oprofiled /usr/bin/oprofiled
470403    4.0146  vmlinux                  vmlinux __vmalloc_node
411024    3.5078  vmlinux                  vmlinux rb_get_reader_page
277246    2.3661  vmlinux                  vmlinux ring_buffer_consume
276266    2.3577  vmlinux                  vmlinux poll_idle
239955    2.0478  lod.ko                   lod.ko lod_alloc_qos.clone.0
211548    1.8054  vmlinux                  vmlinux intel_idle
198649    1.6953  vmlinux                  vmlinux remove_vm_area
146320    1.2487  vmlinux                  vmlinux smp_call_function_many
141106    1.2042  vmlinux                  vmlinux generic_smp_call_function_interrupt

That looks a lot like the results from LU-4008, focusing on vmalloc. I know vmalloc has a lock; is the __write_lock_failed indicating lock contention in vmalloc?

Comment by Oleg Drokin [ 18/Feb/14 ]

Yes, you are definitely seeing contention on some lock, but it could be any lock that's compiled into the kernel. Without call-trace analysis it's hard to tell more.

The patch from LU-4579 should not really affect any memory allocations whatsoever.

As for slab usage, are you sure this did not happen before? I would imagine the cache will grow to fill entire available memory as you access more and more files. This is the purpose of the cache.

Comment by Matt Ezell [ 19/Feb/14 ]

We'll do some more analysis tomorrow to try to get call graphs to see which locks we are contending on. We think it's vmalloc at this point. We plan to try the LU-3338 patch (http://review.whamcloud.com/#/c/6339) in the testbed to see how it impacts behavior. We'll also turn on +malloc to watch as things are allocated and freed.
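
(A rough sketch of one way to collect those call graphs, assuming the legacy opcontrol interface that produced the profile above; the vmlinux path is a placeholder:)

mds# opcontrol --init
mds# opcontrol --setup --vmlinux=/path/to/debug/vmlinux --callgraph=16
mds# opcontrol --start
# ... reproduce the slowdown ...
mds# opcontrol --dump && opcontrol --stop
mds# opreport --callgraph > /tmp/mds-callgraph.txt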

What are the implications of increasing OBD_ALLOC_BIG? Are we just more likely to hit fragmentation, allocation failures, or slowdowns? Is the following a really bad idea:

diff --git a/lustre/include/obd_support.h b/lustre/include/obd_support.h
index 24671c3..83d3c62 100644
--- a/lustre/include/obd_support.h
+++ b/lustre/include/obd_support.h
@@ -704,7 +704,8 @@ do {                                                                              \
  * since vmalloc in Linux doesn't perform well on multi-cores system, calling
  * vmalloc in critical path would hurt peformance badly. See LU-66.
  */
-#define OBD_ALLOC_BIG (4 * PAGE_CACHE_SIZE)
+#define OBD_ALLOC_BIG_ORDER 4
+#define OBD_ALLOC_BIG (PAGE_CACHE_SIZE << OBD_ALLOC_BIG_ORDER)
 
 #define OBD_ALLOC_LARGE(ptr, size)                                            \
 do {                                                                          \

Or instead try kmalloc() without __GFP_WAIT and fall back to vmalloc() if it fails? Then we can compare the pointer to see if it's between VMALLOC_START and VMALLOC_END to know which deallocation function to use (except it looks like those are being deprecated in the kernel). We might only want to do that when the size is over 4*PAGE_CACHE_SIZE, to avoid extra checks in the fast path for small allocations.

Comment by Jason Hill (Inactive) [ 19/Feb/14 ]

Oleg - slab usage got a separate issue, LU-4641. We think we have that tracked down to a handful of nodes running mlocate against Lustre. As Matt said, we should have some more data for you in the morning.

Comment by Jason Hill (Inactive) [ 19/Feb/14 ]

Oleg:

Your debug patch did trip a few times in the last two days. I failed to look for it yesterday; sorry about that. Here are the messages, and I will work on getting you more info from around these times:

  1. grep "Adding a lock, but the front position is scheduled in" lustrekernel
    Feb 18 15:43:54 atlas-oss2b2 kernel: [708001.133886] LustreError: 30384:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 15:43:55 atlas-oss2b2 kernel: [708001.265487] LustreError: 24590:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 15:44:11 atlas-oss2b2 kernel: [708017.670420] LustreError: 86222:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:12:36 atlas-oss2a1 kernel: [709728.444442] LustreError: 27317:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:12:38 atlas-oss2a1 kernel: [709730.285546] LustreError: 27317:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:20:00 atlas-oss2b2 kernel: [710167.876157] LustreError: 69582:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:20:00 atlas-oss2b2 kernel: [710168.014783] LustreError: 85586:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 16:20:09 atlas-oss2b2 kernel: [710175.926957] LustreError: 30398:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:20:23 atlas-oss2b2 kernel: [710190.363610] LustreError: 25260:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:40:10 atlas-oss2i7 kernel: [711378.566730] LustreError: 70567:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:40:52 atlas-oss2b1 kernel: [711423.942634] LustreError: 24667:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:40:52 atlas-oss2b1 kernel: [711424.073057] LustreError: 21012:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 16:41:01 atlas-oss2i7 kernel: [711430.285640] LustreError: 20357:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:41:08 atlas-oss2b1 kernel: [711440.138592] LustreError: 21114:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:41:10 atlas-oss2i7 kernel: [711438.859467] LustreError: 21096:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:41:15 atlas-oss2i7 kernel: [711443.961101] LustreError: 20357:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:44:54 atlas-oss1c2 kernel: [711666.257479] LustreError: 27271:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:45:01 atlas-oss1c2 kernel: [711672.979860] LustreError: 27246:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:45:04 atlas-oss2i7 kernel: [711673.026049] LustreError: 26061:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:45:04 atlas-oss1c2 kernel: [711676.191623] LustreError: 23342:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 16:45:17 atlas-oss1c2 kernel: [711688.908916] LustreError: 56747:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 16:45:32 atlas-oss2i7 kernel: [711701.477851] LustreError: 26410:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 17:02:02 atlas-oss1c1 kernel: [712041.677197] LustreError: 20599:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 17:16:59 atlas-oss1i7 kernel: [713596.079487] LustreError: 25774:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 17:17:00 atlas-oss1i7 kernel: [713596.219754] LustreError: 87438:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:09:59 atlas-oss2a1 kernel: [716774.401901] LustreError: 23027:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:09:59 atlas-oss2a1 kernel: [716774.571574] LustreError: 23068:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:10:05 atlas-oss1c1 kernel: [716126.241513] LustreError: 24944:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:10:05 atlas-oss1c1 kernel: [716126.372856] LustreError: 24732:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:10:06 atlas-oss1c1 kernel: [716127.799754] LustreError: 20009:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:10:07 atlas-oss2i7 kernel: [716777.904758] LustreError: 3700:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:10:13 atlas-oss2a1 kernel: [716787.846448] LustreError: 27319:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:10:13 atlas-oss1c2 kernel: [716787.372402] LustreError: 23223:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:10:14 atlas-oss1c1 kernel: [716135.339937] LustreError: 21260:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:10:20 atlas-oss1c2 kernel: [716793.891680] LustreError: 24121:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:10:21 atlas-oss2a1 kernel: [716795.708684] LustreError: 23071:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:10:28 atlas-oss1c1 kernel: [716149.801314] LustreError: 20529:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:10:29 atlas-oss1c2 kernel: [716802.920675] LustreError: 23217:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:10:30 atlas-oss2a1 kernel: [716805.563884] LustreError: 23266:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:11:01 atlas-oss1c2 kernel: [716835.642508] LustreError: 56747:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:11:16 atlas-oss1c2 kernel: [716849.686177] LustreError: 22306:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:11:31 atlas-oss1c2 kernel: [716865.522612] LustreError: 70981:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:11:32 atlas-oss2i7 kernel: [716862.873513] LustreError: 26426:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:11:33 atlas-oss2a1 kernel: [716868.256716] LustreError: 23017:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:11:40 atlas-oss1c1 kernel: [716221.117935] LustreError: 20598:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:11:47 atlas-oss1c2 kernel: [716880.889169] LustreError: 39989:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:12:18 atlas-oss1c2 kernel: [716912.213317] LustreError: 24105:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:44:56 atlas-oss1i7 kernel: [718874.544200] LustreError: 25633:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:44:56 atlas-oss1i7 kernel: [718874.674722] LustreError: 25737:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 18:45:02 atlas-oss1i7 kernel: [718880.495318] LustreError: 26485:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:45:04 atlas-oss1i7 kernel: [718882.996827] LustreError: 26485:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 18:45:40 atlas-oss1i7 kernel: [718919.070183] LustreError: 25664:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 21:45:19 atlas-oss1c2 kernel: [729697.635596] LustreError: 70926:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 21:45:26 atlas-oss1c2 kernel: [729704.727897] LustreError: 56654:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 21:45:36 atlas-oss1c2 kernel: [729715.381797] LustreError: 24092:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 22:29:39 atlas-oss1g5 kernel: [732361.024093] LustreError: 24155:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 22:29:47 atlas-oss1g5 kernel: [732368.605072] LustreError: 30981:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 22:30:02 atlas-oss1g5 kernel: [732384.000878] LustreError: 30989:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 22:43:57 atlas-oss1g5 kernel: [733219.848018] LustreError: 25273:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 22:43:58 atlas-oss1g5 kernel: [733220.017588] LustreError: 29909:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 22:44:00 atlas-oss1g5 kernel: [733222.307693] LustreError: 30989:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 22:44:23 atlas-oss1g5 kernel: [733245.326600] LustreError: 24198:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 23:21:33 atlas-oss1g5 kernel: [735476.569174] LustreError: 30970:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 23:21:35 atlas-oss1g5 kernel: [735478.669772] LustreError: 30985:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 23:21:42 atlas-oss1g5 kernel: [735485.443866] LustreError: 24422:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 23:52:02 atlas-oss1g5 kernel: [737305.746429] LustreError: 30970:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 23:52:02 atlas-oss1g5 kernel: [737305.882243] LustreError: 84274:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 18 23:52:12 atlas-oss1g5 kernel: [737316.153863] LustreError: 24458:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 23:52:20 atlas-oss1g5 kernel: [737323.616670] LustreError: 30098:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 23:52:28 atlas-oss1g5 kernel: [737332.071392] LustreError: 29882:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 23:56:00 atlas-oss1g5 kernel: [737544.361153] LustreError: 24445:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 18 23:56:16 atlas-oss1g5 kernel: [737559.817256] LustreError: 24228:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 19 00:22:15 atlas-oss1g5 kernel: [739119.386062] LustreError: 30683:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 19 00:22:23 atlas-oss1g5 kernel: [739127.548279] LustreError: 70024:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds
    Feb 19 00:22:30 atlas-oss1g5 kernel: [739134.417534] LustreError: 30853:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds
    Feb 19 01:11:26 atlas-oss3h4 kernel: [742068.910961] LustreError: 44224:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 673 seconds
    Feb 19 01:11:27 atlas-oss3h4 kernel: [742069.285045] LustreError: 26648:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 673 seconds
    Feb 19 01:11:28 atlas-oss3h4 kernel: [742070.594347] LustreError: 28404:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 671 seconds
    Feb 19 01:11:29 atlas-oss3h4 kernel: [742071.713368] LustreError: 25522:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 670 seconds
    Feb 19 01:11:31 atlas-oss3h4 kernel: [742073.793994] LustreError: 25522:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 668 seconds
    Feb 19 01:26:13 atlas-oss3h4 kernel: [742955.826446] LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 752 seconds
    Feb 19 01:26:18 atlas-oss3h4 kernel: [742960.947155] LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 747 seconds
    Feb 19 01:26:19 atlas-oss3h4 kernel: [742962.066280] LustreError: 27899:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 746 seconds
    Feb 19 01:26:21 atlas-oss3h4 kernel: [742964.134639] LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 744 seconds
    Feb 19 01:26:29 atlas-oss3h4 kernel: [742971.895279] LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 736 seconds
    Feb 19 01:26:37 atlas-oss3h4 kernel: [742980.276772] LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 728 seconds
    Feb 19 01:26:58 atlas-oss3h4 kernel: [743000.696535] LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 708 seconds
  1. grep "more than at_max" lustrekernel
    Feb 18 15:43:46 atlas-oss2b2 kernel: [707992.551979] LustreError: 69652:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 15:43:46 atlas-oss2b2 kernel: [707992.687373] LustreError: 69652:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 15:43:48 atlas-oss2b2 kernel: [707994.249556] LustreError: 69726:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 15:43:54 atlas-oss2b2 kernel: [708000.981415] LustreError: 49076:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 15:43:57 atlas-oss2b2 kernel: [708003.467150] LustreError: 49076:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 15:44:11 atlas-oss2b2 kernel: [708017.513082] LustreError: 69652:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 15:51:17 atlas-oss2a1 kernel: [708448.700510] LustreError: 117487:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 15:54:23 atlas-oss2a1 kernel: [708634.942513] LustreError: 117487:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 15:55:43 atlas-oss2a1 kernel: [708715.089797] LustreError: 56030:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 15:57:43 atlas-oss2a1 kernel: [708834.978812] LustreError: 56028:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:05:22 atlas-oss2b2 kernel: [709289.540122] LustreError: 69652:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:12:30 atlas-oss2a1 kernel: [709722.059256] LustreError: 68203:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:12:36 atlas-oss2a1 kernel: [709728.287823] LustreError: 47664:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:12:38 atlas-oss2a1 kernel: [709730.128986] LustreError: 47664:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:16:35 atlas-oss2a1 kernel: [709967.619831] LustreError: 42467:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:17:54 atlas-oss2b2 kernel: [710041.296582] LustreError: 49907:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:17:54 atlas-oss2b2 kernel: [710041.480052] LustreError: 49907:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:18:56 atlas-oss2a1 kernel: [710108.933421] LustreError: 42467:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:20:00 atlas-oss2b2 kernel: [710167.759432] LustreError: 69701:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:20:08 atlas-oss2b2 kernel: [710175.774219] LustreError: 69620:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:20:13 atlas-oss2b2 kernel: [710180.416937] LustreError: 69701:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:20:23 atlas-oss2b2 kernel: [710190.212341] LustreError: 69701:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:23:23 atlas-oss2b2 kernel: [710370.936123] LustreError: 41623:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:27:25 atlas-oss2a1 kernel: [710617.716497] LustreError: 68321:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:40:10 atlas-oss2i7 kernel: [711378.435362] LustreError: 70646:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:40:10 atlas-oss2i7 kernel: [711378.706590] LustreError: 70646:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:40:52 atlas-oss2b1 kernel: [711423.821480] LustreError: 75044:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:40:52 atlas-oss2i7 kernel: [711420.949475] LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:40:52 atlas-oss2b1 kernel: [711424.203728] LustreError: 75044:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:41:00 atlas-oss2b1 kernel: [711431.432813] LustreError: 75057:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:41:01 atlas-oss2i7 kernel: [711430.139913] LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:41:08 atlas-oss2b1 kernel: [711439.988737] LustreError: 45021:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:41:10 atlas-oss2i7 kernel: [711438.714549] LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:41:15 atlas-oss2i7 kernel: [711443.816261] LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:41:15 atlas-oss2b1 kernel: [711447.327967] LustreError: 45021:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:44:48 atlas-oss1c2 kernel: [711660.584115] LustreError: 70981:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:44:48 atlas-oss1c2 kernel: [711660.707221] LustreError: 56619:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:44:54 atlas-oss1c2 kernel: [711666.099490] LustreError: 70928:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:45:01 atlas-oss1c2 kernel: [711672.820288] LustreError: 70919:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:45:04 atlas-oss1c2 kernel: [711676.039009] LustreError: 70889:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:45:04 atlas-oss2i7 kernel: [711672.871923] LustreError: 70502:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:45:11 atlas-oss1c2 kernel: [711683.037316] LustreError: 70889:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:45:19 atlas-oss1c2 kernel: [711691.220293] LustreError: 56624:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 16:45:32 atlas-oss2i7 kernel: [711701.329287] LustreError: 70502:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:01:25 atlas-oss1c1 kernel: [712004.769014] LustreError: 66807:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:01:25 atlas-oss1c1 kernel: [712004.906977] LustreError: 66729:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:02:02 atlas-oss1c1 kernel: [712041.523613] LustreError: 66798:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:15:25 atlas-oss1i7 kernel: [713502.163436] LustreError: 48723:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:15:26 atlas-oss1i7 kernel: [713502.305135] LustreError: 48723:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:16:14 atlas-oss1i7 kernel: [713550.619589] LustreError: 87465:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:16:59 atlas-oss1i7 kernel: [713595.933761] LustreError: 87465:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:18:14 atlas-oss1i7 kernel: [713671.242717] LustreError: 87465:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:22:40 atlas-oss2a1 kernel: [713933.735218] LustreError: 40709:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:47:35 atlas-oss2a1 kernel: [715429.087633] LustreError: 68321:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 17:47:50 atlas-oss2a1 kernel: [715444.189061] LustreError: 68321:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:09:51 atlas-oss1c1 kernel: [716112.054344] LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:09:51 atlas-oss1c1 kernel: [716112.222414] LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:09:51 atlas-oss2i7 kernel: [716762.131266] LustreError: 39146:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:09:51 atlas-oss1c2 kernel: [716765.381385] LustreError: 22345:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:09:57 atlas-oss2i7 kernel: [716767.800955] LustreError: 70437:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:09:59 atlas-oss1c1 kernel: [716120.700564] LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:09:59 atlas-oss2a1 kernel: [716774.249011] LustreError: 68308:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:00 atlas-oss2a1 kernel: [716774.711337] LustreError: 68308:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:05 atlas-oss1c1 kernel: [716126.087293] LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:05 atlas-oss1c2 kernel: [716778.716316] LustreError: 56669:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:07 atlas-oss2a1 kernel: [716781.898424] LustreError: 68227:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:07 atlas-oss2i7 kernel: [716777.756610] LustreError: 36709:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:07 atlas-oss1c1 kernel: [716128.449354] LustreError: 44839:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:12 atlas-oss1c1 kernel: [716133.872074] LustreError: 66779:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:13 atlas-oss2a1 kernel: [716787.693697] LustreError: 68227:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:13 atlas-oss1c2 kernel: [716787.212697] LustreError: 56669:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:15 atlas-oss2a1 kernel: [716790.056572] LustreError: 68227:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:20 atlas-oss2a1 kernel: [716795.549134] LustreError: 68227:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:22 atlas-oss1c2 kernel: [716795.736584] LustreError: 56659:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:23 atlas-oss2i7 kernel: [716793.876972] LustreError: 36709:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:28 atlas-oss1c1 kernel: [716149.640965] LustreError: 44839:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:30 atlas-oss2a1 kernel: [716805.411905] LustreError: 68339:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:10:59 atlas-oss1c2 kernel: [716833.350545] LustreError: 70955:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:11:01 atlas-oss2a1 kernel: [716835.801342] LustreError: 68308:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:11:01 atlas-oss1c1 kernel: [716182.847151] LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:11:08 atlas-oss2i7 kernel: [716838.563276] LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:11:31 atlas-oss1c2 kernel: [716865.369423] LustreError: 70955:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:11:33 atlas-oss2a1 kernel: [716868.103793] LustreError: 68339:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:11:39 atlas-oss1c1 kernel: [716220.959921] LustreError: 66779:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:20:59 atlas-oss1i7 kernel: [717437.327217] LustreError: 44941:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:44:56 atlas-oss1i7 kernel: [718874.423889] LustreError: 59983:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:44:56 atlas-oss1i7 kernel: [718874.813740] LustreError: 59983:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:45:02 atlas-oss1i7 kernel: [718880.336502] LustreError: 87429:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:45:04 atlas-oss1i7 kernel: [718882.844594] LustreError: 87429:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:45:10 atlas-oss1i7 kernel: [718889.094591] LustreError: 87520:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:45:25 atlas-oss1i7 kernel: [718903.807626] LustreError: 58796:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 18:45:40 atlas-oss1i7 kernel: [718918.917446] LustreError: 87396:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 19:29:02 atlas-oss1i7 kernel: [721522.017991] LustreError: 87301:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 21:24:02 atlas-oss1c2 kernel: [728420.566978] LustreError: 70979:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 21:45:18 atlas-oss1c2 kernel: [729697.474808] LustreError: 70959:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 21:45:26 atlas-oss1c2 kernel: [729704.569173] LustreError: 43863:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 21:45:28 atlas-oss1c2 kernel: [729707.282184] LustreError: 70959:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 21:45:36 atlas-oss1c2 kernel: [729715.224382] LustreError: 43863:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 21:45:51 atlas-oss1c2 kernel: [729730.293744] LustreError: 70959:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:29:30 atlas-oss1g5 kernel: [732351.785694] LustreError: 70013:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:29:30 atlas-oss1g5 kernel: [732351.917490] LustreError: 70013:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:29:39 atlas-oss1g5 kernel: [732360.901707] LustreError: 49635:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:29:46 atlas-oss1g5 kernel: [732368.452562] LustreError: 49635:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:30:02 atlas-oss1g5 kernel: [732383.839945] LustreError: 49635:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:30:09 atlas-oss1g5 kernel: [732391.346122] LustreError: 70013:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:41:28 atlas-oss1g5 kernel: [733069.735891] LustreError: 84264:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:43:52 atlas-oss1g5 kernel: [733213.899990] LustreError: 70179:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:43:57 atlas-oss1g5 kernel: [733219.697281] LustreError: 70111:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:44:14 atlas-oss1g5 kernel: [733236.420909] LustreError: 70127:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 22:44:23 atlas-oss1g5 kernel: [733245.170739] LustreError: 70111:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 23:21:33 atlas-oss1g5 kernel: [735476.415442] LustreError: 70140:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 23:21:35 atlas-oss1g5 kernel: [735478.508845] LustreError: 70140:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 23:21:42 atlas-oss1g5 kernel: [735485.290958] LustreError: 70171:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 23:52:02 atlas-oss1g5 kernel: [737305.577723] LustreError: 70013:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 23:52:12 atlas-oss1g5 kernel: [737315.993998] LustreError: 70140:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 23:52:20 atlas-oss1g5 kernel: [737323.457799] LustreError: 84264:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 23:52:28 atlas-oss1g5 kernel: [737331.921511] LustreError: 70140:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 23:56:00 atlas-oss1g5 kernel: [737544.208456] LustreError: 70127:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 18 23:56:18 atlas-oss1g5 kernel: [737562.320020] LustreError: 70140:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 19 00:22:15 atlas-oss1g5 kernel: [739119.233426] LustreError: 70171:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 19 00:22:23 atlas-oss1g5 kernel: [739127.393845] LustreError: 70013:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 19 00:22:31 atlas-oss1g5 kernel: [739135.399082] LustreError: 23539:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 19 00:26:56 atlas-oss1g5 kernel: [739400.814619] LustreError: 70102:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 19 01:10:05 atlas-oss3h4 kernel: [741987.229563] LustreError: 96360:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 19 01:11:26 atlas-oss3h4 kernel: [742068.781671] LustreError: 44224:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 19 01:15:45 atlas-oss3h4 kernel: [742327.452668] LustreError: 37185:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 19 01:26:11 atlas-oss3h4 kernel: [742953.606805] LustreError: 10346:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
    Feb 19 01:32:19 atlas-oss3h4 kernel: [743322.403606] LustreError: 10346:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600
Comment by Andreas Dilger [ 19/Feb/14 ]

For OBD_FREE_LARGE() it can use is_vmalloc_addr() to determine if a passed pointer was allocated via kmalloc() or vmalloc(). That would allow OBD_ALLOC_LARGE() to try kmalloc() first (maybe without __GFP_WAIT), and only fall back to vmalloc() in case of failure.
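
As an illustration only, here is a minimal sketch of the kmalloc-first/vmalloc-fallback approach discussed above, with is_vmalloc_addr() deciding the free path. The helper names obd_alloc_large_try()/obd_free_large_try() are hypothetical and this is not the actual OBD_ALLOC_LARGE()/OBD_FREE_LARGE() implementation:

#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>	/* is_vmalloc_addr() */

/* Hypothetical helpers; the real Lustre macros differ in detail. */
static void *obd_alloc_large_try(size_t size)
{
	/* Try a physically contiguous allocation first, without retrying
	 * or warning, so a fragmented allocator fails fast rather than
	 * stalling the service thread. */
	void *ptr = kmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_NORETRY);

	if (ptr == NULL)
		/* Only fall back to the slower, lock-contended vmalloc
		 * path when kmalloc cannot satisfy the request. */
		ptr = vmalloc(size);
	return ptr;
}

static void obd_free_large_try(void *ptr)
{
	/* is_vmalloc_addr() tells us which allocator produced the
	 * pointer, so the caller does not have to remember it. */
	if (is_vmalloc_addr(ptr))
		vfree(ptr);
	else
		kfree(ptr);
}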

Comment by James A Simmons [ 19/Feb/14 ]

I was just discussing that on LU-4008. I have some data showing exactly what is causing the vmalloc lock contention. The only drawback to that approach is memory fragmentation.

Comment by Jason Hill (Inactive) [ 19/Feb/14 ]

Oleg, I'm attaching the Lustre logs from atlas and Titan for the time ranges above (we hit some more this morning around 9:40 as well). Curiously, there are no Titan entries for most of that time. I will also look at our smaller Cray XC30 (no Lustre messages during any of these time frames) and at our 200-node Linux cluster (those are our three biggest consumers of Lustre) to get their log data from these time frames.

hilljj@titan-smw0:/tmp$ for i in $(ls titan-lustre-2014-02-1*); do wc -l $i; done
114989 titan-lustre-2014-02-18-hr10.txt
0 titan-lustre-2014-02-18-hr15.txt
8 titan-lustre-2014-02-18-hr16.txt
0 titan-lustre-2014-02-18-hr17.txt
45 titan-lustre-2014-02-18-hr18.txt
0 titan-lustre-2014-02-18-hr21.txt
0 titan-lustre-2014-02-18-hr22.txt
0 titan-lustre-2014-02-18-hr23.txt
0 titan-lustre-2014-02-19-hr00.txt
0 titan-lustre-2014-02-19-hr01.txt
4 titan-lustre-2014-02-19-hr09.txt
hilljj@titan-smw0:/tmp$ md5sum titan-lustre-logs-LU4570.tgz
79571cb30d28a42e969461eeed947279 titan-lustre-logs-LU4570.tgz

[root@rhea-mgmt2 tmp]# wc -l rhea-lustre-2014-02-1*
0 rhea-lustre-2014-02-18-hr10.txt
714 rhea-lustre-2014-02-18-hr15.txt
1347 rhea-lustre-2014-02-18-hr16.txt
1179 rhea-lustre-2014-02-18-hr17.txt
1605 rhea-lustre-2014-02-18-hr18.txt
318 rhea-lustre-2014-02-18-hr21.txt
337 rhea-lustre-2014-02-18-hr22.txt
364 rhea-lustre-2014-02-18-hr23.txt
293 rhea-lustre-2014-02-19-hr00.txt
200 rhea-lustre-2014-02-19-hr01.txt
787 rhea-lustre-2014-02-19-hr09.txt
7144 total

[root@rhea-mgmt2 tmp]# md5sum rhea-lustre-logs-LU4570.tgz
cfdf017d0d5b4b7ef813b74f21f687cb rhea-lustre-logs-LU4570.tgz

hilljj@hill-desktop:~$ wc -l ccs_homedir/ldlm_lockd.c.lustrekernel.txt
397 ccs_homedir/ldlm_lockd.c.lustrekernel.txt

Comment by Oleg Drokin [ 20/Feb/14 ]

Hm, so all the trips of my at_max message seem to be at 755 seconds.
This seems to be coming from at_est2timeout(), which can potentially inflate the timeout to 125% of the estimate plus 5 seconds. I guess clients won't really appreciate certain lock operations taking longer than at_max when such a conflicting lock is encountered.
But this only happens on OSTs, so it would not explain any timeouts on the MDS.

In the client logs we see timeouts both to OSTs and MDTs, but sadly the reduced printing only shows the xid, not the opc, so there's no way to know which requests are failing. Interestingly, there are quite a few timeouts in the low range - 17s, 19s, 31s - that cannot be explained by overly long timeouts or anything like that. I wonder why the server did not reply to those in time, if only to send an early reply?
Timeouts also go as high as 829 seconds and more, which interestingly is way higher than at_max (is a high network-latency setting at play?).
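
For reference, the 755-second figure in the logs is consistent with at_est2timeout() being applied to an estimate capped at at_max = 600; a sketch of that helper (as it appears in lustre_net.h in code of this era), shown only to explain the arithmetic:

/* Sketch of at_est2timeout(), shown only to explain the 755s figure. */
static inline int at_est2timeout(unsigned int val)
{
	/* add an arbitrary margin: 125% of the estimate plus 5 seconds */
	return (val + (val >> 2) + 5);
}

/* With the adaptive-timeout estimate pinned at at_max = 600:
 *   600 + (600 >> 2) + 5 = 600 + 150 + 5 = 755
 * which matches the "requested timeout 755" messages above. */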

Comment by Oleg Drokin [ 22/Feb/14 ]

So, I've been digging around the code, and it looks like we now send close RPCs to the readdir portal. If you have heavy close traffic, and close is affected by LU-4008 so that it burns tons of CPU time, it is quite feasible that all of the threads on that portal are consumed.
This would probably also mean that quite a lot of close RPCs are present once the problem hits (which only exacerbates the lock contention on vmalloc).
Hopefully with LU-4008 we will kill two birds with one stone: general close processing time should improve, and readdir processing time should improve along with it, freeing readdir locks much faster.

Comment by Jason Hill (Inactive) [ 28/Feb/14 ]

console output of echo t > /proc/sysrq-trigger under load.

Comment by James A Simmons [ 28/Feb/14 ]

I uploaded the MDS stack traces to ftp.whamcloud.com/uploads/LU-4570/mds1.20140227.console.log.bz2

Comment by Oleg Drokin [ 03/Mar/14 ]

Hm, looking at the traces I see 134 threads waiting for object allocations from OSTs.
355 threads are waiting for locks to be cancelled.

This is nothing too critical, I imagine.
There's nothing waiting in readdir, so that angle did not pan out, but all the evicted locks are still PR locks pointing at this readdir lock somehow.

If you can manage to dump a sysrq-t trace from a node that's trying to reconnect and gets refused with 'refused reconnection, still busy with X active RPCs' while this is ongoing, I am interested in getting that.

Comment by Oleg Drokin [ 03/Mar/14 ]

I guess another useful bit of info would be to cross-reference the jobs run on the nodes that enter this stuck state and could not reconnect due to BUSY on the MDT.
Is it by any chance the same sort of job?

Comment by Matt Ezell [ 12/Mar/14 ]

We had two "still busy" messages since we last patched, but there were special circumstances:

Mar  8 09:47:54 atlas-mds1 kernel: [159420.993577] Lustre: atlas1-MDT0000: Client b4ea3026-13aa-b280-3f2e-94a4bab53a72 (at 95@gni100) refused reconnection, still busy with 1 active RPCs
Mar  8 09:47:55 atlas-mds1 kernel: [159421.712317] Lustre: atlas1-MDT0000: Client 9fc952d9-a299-68de-a80e-21913bc6e90d (at 9311@gni100) refused reconnection, still busy with 1 active RPCs

Around the same time, a Gemini network error on our Cray caused the network to quiesce and re-route.

2014-03-08T09:47:27.602931-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) @@@ Request x1459682730665633 sent from MGC10.36.226.77@o2ib200 to NID 10.36.226.77@o2ib200 17s ago has timed out (17s prior to deadline).
2014-03-08T09:47:27.602945-05:00 c24-0c0s0n3  req@ffff8803ff9e3000 x1459682730665633/t0 o400->MGS@MGC10.36.226.77@o2ib200_0:26/25 lens 192/384 e 0 to 1 dl 1394290046 ref 2 fl Rpc:N/0/0 rc 0/0
2014-03-08T09:47:27.602952-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
2014-03-08T09:47:27.602961-05:00 c24-0c0s0n3 LustreError: 166-1: MGC10.36.226.77@o2ib200: Connection to service MGS via nid 10.36.226.77@o2ib200 was lost; in progress operations using this service will fail.
2014-03-08T09:47:27.602970-05:00 c24-0c0s0n3 Lustre: linkfarm-MDT0000-mdc-ffff880407978c00: Connection to service linkfarm-MDT0000 via nid 10.36.226.83@o2ib200 was lost; in progress operations using this service will wait for recovery to complete.
2014-03-08T09:47:33.651947-05:00 c24-0c0s0n3 LNet: Waiting for hardware quiesce flag to clear
2014-03-08T09:47:57.909338-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.169@o2ib218 portal 4 match 1459682730665920 offset 192 length 192: 2
2014-03-08T09:47:57.909351-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1 previous similar message
2014-03-08T09:47:57.909358-05:00 c24-0c0s0n3 Lustre: atlas2-MDT0000-mdc-ffff8802ad97c400: Connection restored to service atlas2-MDT0000 using nid 10.36.226.74@o2ib200.
2014-03-08T09:47:57.909366-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.245@o2ib219 portal 4 match 1459682730666861 offset 192 length 192: 2
2014-03-08T09:47:57.909425-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.42@o2ib203 portal 4 match 1459682730665937 offset 192 length 192: 2
2014-03-08T09:47:57.909432-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1516 previous similar messages

I think the clients sent requests, but the responses didn't get back to them before the network timed them out. I'm not sure why they timed out; I thought gnilnd would extend timeouts during a quiesce. When a client timed out a request, it unlinked the MD, so the responses from the MDS were dropped. The MDS didn't get a NACK, so it thinks it's still waiting for that request to complete.

This is probably a special situation, and likely cannot explain all the occurrences we've seen of this previously.

Comment by Oleg Drokin [ 20/Mar/14 ]

I think you are right and this instance is different from before due to network issues.

Comment by James Nunez (Inactive) [ 02/Jul/14 ]

Jason,

The two patches referenced in this ticket have landed: http://review.whamcloud.com/#/c/10601/ and http://review.whamcloud.com/#/c/6339.

Are you still experiencing this problem? If not, should we close this ticket?

Comment by Jason Hill (Inactive) [ 07/Jul/14 ]

James:

Yes, let's go ahead and resolve this issue.


-Jason

Comment by James A Simmons [ 07/Jul/14 ]

As a note, I back-ported both of the above patches to 2.5. They are at:

http://review.whamcloud.com/#/c/10962
http://review.whamcloud.com/#/c/10997

Comment by James Nunez (Inactive) [ 09/Jul/14 ]

We will track the b2_5 patches separately.
