[LU-4570] Metadata slowdowns on production filesystem at ORNL Created: 01/Feb/14 Updated: 09/Jul/14 Resolved: 09/Jul/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Jason Hill (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 6.4, 2.6.32-358.18.1.el6.x86_64 kernel |
||
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 12490 |
| Description |
|
Clients:
Server: (noted above)

Users are seeing several-minute metadata operation times (interactive performance is usually the canary in the coal mine). We are unable to enable DNE due to 1.8 client interactions and the need to have the data available. We run periodic lctl dk with +rpctrace and I can provide that information. To date we've found one application that was issuing ~30k LDLM_ENQUEUE RPCs in a 30-45s range (lctl dk overflowed, so we only got 45/60s of trace data). We've put that application on hold, but are still seeing some slowdowns. We have some "ls timers" scattered around the compute clusters and have been seeing responses in the 60-80s range for an ls of a pre-determined directory.

Questions:
1. My current guess is that we are not getting any of the metadata improvements in 2.4. Can you confirm?
2. We are running with "options mdt mds_num_threads=1024". We are thinking of changing that to 2048. Recommendations?
3. What further data would be of interest in tracking this down? I have the raw rpctrace data. Would it be useful to add +nettrace or +dlmtrace and send that along? (I think +dlmtrace, but others may be good too.)
4. If we were to get rid of all the 1.8 clients, would the 2.4 metadata improvements be realized without a reboot of the servers? We are waiting on Cray to provide a software stack for some of our login servers that supports 2.4 clients, and are highly blocked on this.
5. If we cannot get rid of the 1.8 clients, are there any other suggestions for increasing the metadata throughput?

We are not blocking on host IO to the MDT device (less than 6 MB/s as recorded by iostat). We have had some past memory pressure on the MDS (we recently reduced ldlm.namespaces.*.lru_size on cluster (non-Cray) clients from the default of unlimited to 128 on compute nodes and 2000 on login nodes), but in general the MDS host seems responsive and "normal".

Happy to provide more information, just ask and we'll get it to you. Thanks!

– |
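For reference, a minimal sketch (not part of the original report) of the kind of commands used above to gather traces and apply the lock-LRU tuning described; the 512 MB debug buffer size is only an illustrative assumption, and parameter names should be checked against the installed Lustre release:

# enlarge the kernel debug buffer so lctl dk is less likely to overflow (size is an example)
lctl set_param debug_mb=512
# add dlmtrace (and optionally nettrace) to the existing rpctrace debug mask
lctl set_param debug=+rpctrace
lctl set_param debug=+dlmtrace
# dump and clear the trace buffer during a slowdown
lctl dk /tmp/dk.$(hostname).$(date +%s)
# cap the client lock LRU as described above (128 on compute, 2000 on login nodes)
lctl set_param ldlm.namespaces.*.lru_size=128 |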
| Comments |
| Comment by Peter Jones [ 01/Feb/14 ] |
|
Oleg, can you please advise how to proceed here? Thanks, Peter |
| Comment by Oleg Drokin [ 01/Feb/14 ] |
|
I guess the most important question here would be: when you experience these issues, are all MDT threads busy (and most likely blocked on something at the same time)? |
| Comment by Jason Hill (Inactive) [ 01/Feb/14 ] |
|
Oleg,

I've not been on the system during one of these events. I can work on getting another job to run tomorrow sometime, or maybe Monday. Currently on the MDS there are no mdt* processes in any state other than sleeping or running. I do see a high number of kiblnd processes (by both count and CPU utilization), though that is from memory, nothing concrete.

We've put the offending user/code into scheduler jail, so they can't run, and we're seeing if the pitchforks and torches go away for the weekend. We started seeing issues on Wednesday, and our traditional tools identified a particular code that was issuing ~50k LDLM_ENQUEUEs every 30 seconds. The app developer changed some IO and dropped it to ~25k LDLM_ENQUEUEs every 30s, but we were still getting complaints. There is not much interactive use at midnight on a Friday, so I don't have a good measuring stick just yet. Working on automating some more of that.

We don't know much about what the application is doing, just what we see from our homegrown tools. I will try to find out more about the app over the weekend and will keep you posted. Thanks to you and Peter for commenting at this hour on a Friday. |
| Comment by Oleg Drokin [ 01/Feb/14 ] |
|
So those 25k enqueues - what are they for? Opens? What mode, same file or different files? |
| Comment by Jason Hill (Inactive) [ 01/Feb/14 ] |
|
Specifics. Here's the data I have from a period when the application we identified was running:

00000100:00100000:1.0:1391197263.389662:0:12671:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn02_007:69482ed8-e0da-f1ce-52fa-1f94fb17f79b+248664:16552:x1457850727013322:12345-10.36.207.211@o2ib:103

I don't have specific knowledge of how to map those sorts of requests back to operations (open, close, stat, unlink, etc.); happy to get educated on this. Again, the data was generated with +rpctrace (no dlmtrace), and I just grepped for ldlm from that output to get the snippet above, so the snippet is not guaranteed to be coherent or consistent. Uploading the entire rpctrace now. |
| Comment by Oleg Drokin [ 01/Feb/14 ] |
|
opc 103 is cancel, so a lot of cancel activity. There's no way to easily see the file, mode and operation in more detail from this sort of trace, but since you isolated the app, you can just ask the app author what sort of things it does. |
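For reference, a hedged sketch of tallying the opc field from those rpctrace lines and translating the common opcode numbers to names; the numeric values in the BEGIN block are taken from lustre_idl.h for Lustre of this vintage and should be verified against the running source tree:

# Count server-side RPCs by opcode (the 20th ':'-separated field, as in the
# analysis in the next comment) and label the well-known ones.
grep "Handling RPC" rpctrace-mds1 | awk -F':' '
    BEGIN { n[33]="MDS_GETATTR"; n[34]="MDS_GETATTR_NAME"; n[35]="MDS_CLOSE";
            n[36]="MDS_REINT"; n[37]="MDS_READPAGE";
            n[101]="LDLM_ENQUEUE"; n[103]="LDLM_CANCEL"; n[400]="OBD_PING" }
    { count[$20]++ }
    END { for (o in count) printf "%8d  opc %s (%s)\n", count[o], o, (o in n ? n[o] : "?") }
' | sort -nr |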
| Comment by Matt Ezell [ 01/Feb/14 ] |
|
$ grep "Handling RPC" rpctrace-mds1|awk -F':' ' {print $20}' | sort -n |uniq -c |sort -nr 101 = LDLM_ENQUEUE $ grep "Handling RPC" rpctrace-mds1|awk -F':' ' {print $19}'|sort|uniq -c|sort -nr|head -10 10.36.205.208@o2ib = titan-ext7 Looks like this is coming mostly from interactive use on the external login nodes. |
| Comment by Andreas Dilger [ 01/Feb/14 ] |
|
Was this filesystem upgraded from 1.x, or was it newly formatted with 2.x? Please check the output of the following command:

mds# dumpe2fs -h <MDT device> | grep feature

The reason I ask is that the dir_data feature (which should be in the above list) is not automatically enabled on upgraded filesystems (it would cause problems if the filesystem needed to downgrade), but it improves 2.x performance significantly. Note that this affects files created with 2.x most significantly, and particularly "ls -l" and other readdir() workloads. Please see |
| Comment by Matt Ezell [ 01/Feb/14 ] |
|
It was formatted under 2.4. We have some clients that are acting up. I tried to return all unused locks (by echo'ing "clear" into lru_size), and I see the following messages: |
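For completeness, a minimal sketch (assumed equivalent to the echo described above) of dropping all unused DLM locks on a client:

# drop all unused locks in every namespace on this client
lctl set_param ldlm.namespaces.*.lru_size=clear
# or the raw /proc equivalent on older (e.g. 1.8) clients
for f in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do echo clear > "$f"; done |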
| Comment by Jason Hill (Inactive) [ 02/Feb/14 ] |
|
Answering Andreas for completeness:

[root@atlas-mds1 ~]# dumpe2fs -h /dev/mapper/atlas1-mdt1 | grep feature

Tonight I have seen two messages in the logs that make me think there is something definitely wrong. The first, on the MDS:

[962845.511495] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 5269s: evicting client at 9307@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881a2c51fd80/0x62d1aa23e9a027df lrc: 3/0,0 mode: PR/PR res: [0x20005c272:0x2:0x0].0 bits 0x1b rrc: 2 type: IBT flags: 0x200000000020 nid: 9307@gni100 remote: 0xca436432b3a155cc expref: 32 pid: 20112 timeout: 5258943031 lvb_type: 0

So it took us 5269 seconds to try to cancel a lock and then we evicted the client (am I reading that correctly?). Next, on an OSS:

Feb 1 22:28:52 atlas-oss2g3 kernel: [1881119.843613] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 10034s: evicting client at 9307@gni102 ns: filter-atlas1-OST02ba_UUID lock: ffff88064926b480/0x6acfe1d5cba7b850 lrc: 3/0,0 mode: PR/PR res: [0x111e85:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 9307@gni102 remote: 0xca436432b3a15682 expref: 4 pid: 26076 timeout: 6179874642 lvb_type: 1

Whoa. 10,034s. Any thoughts on where to go digging in greater detail to help guide us more? |
| Comment by Jason Hill (Inactive) [ 02/Feb/14 ] |
|
Also of note from Matt's analysis last night (sorry, just reading this now): the top 10 clients issuing RPCs are all Lustre 1.8.x clients (titan-ext* is Cray 1.8.6; lens* and dtn03 are Intel 1.8.9 compiled @ ORNL). Thanks for the analysis, Matt. |
| Comment by Jason Hill (Inactive) [ 02/Feb/14 ] |
|
Just for posterity, it looks like Cray has not put the fix from the referenced ticket into their client. We are seeing this message:

Feb 1 23:53:56 atlas-mds1 kernel: [ 4357.493520] Lustre: 14082:0:(mdt_xattr.c:363:mdt_reint_setxattr()) atlas1-MDT0000: client miss to set OBD_MD_FLCTIME when setxattr system.posix_acl_access: [object [0x20005b506:0x19b79:0x0]] [valid 68719476736]

A lot. From reading the bug, it's a client issue, and my guess is that our non-Intel clients don't have the patch. |
| Comment by James A Simmons [ 02/Feb/14 ] |
|
A new ticket was opened for the port of the patch referenced above. |
| Comment by Jason Hill (Inactive) [ 03/Feb/14 ] |
|
I wrote some bash to go look for the longest RPC from each event where we noted that it took longer than 1s to perform a metadata operation, going back to 02-01-2014. In almost every case the operation that took the longest was a lock enqueue:

[root@lustre-mgmt1]# /tmp/mdstrace-parse.sh &> /tmp/rpclist.txt
' | awk '{print $1}' | sort -u |
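A hypothetical sketch of the kind of parsing /tmp/mdstrace-parse.sh could be doing (the script itself is not attached here): it pairs each "Handling RPC" line with its "Handled RPC" completion line on the xid and reports the slowest requests. It assumes both message types are present in the lctl dk output and that the 4th and 18th ':'-separated fields are the timestamp and xid, as in the trace snippet above.

#!/bin/bash
# Usage: ./slowest-rpcs.sh rpctrace-mds1
# Prints "xid opc seconds" for the 20 longest-running RPCs seen in the trace.
awk -F':' '
    /Handling RPC/ { start[$18] = $4; opc[$18] = $20 }
    /Handled RPC/  { if ($18 in start)
                         printf "%s %s %.3f\n", $18, opc[$18], $4 - start[$18] }
' "$1" | sort -k3 -nr | head -20 |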
| Comment by Oleg Drokin [ 03/Feb/14 ] |
|
If you have lock timeouts in thousands of seconds, that's sure to block any competing lock enqueues for just as long. So I wonder if these events are correlated? |
| Comment by James Nunez (Inactive) [ 04/Feb/14 ] |
|
Jason, I've spent the day researching and speaking with our engineers about this issue. We think the most fruitful approach is to start by understanding the workload that experienced the slowdown. Do you have any information on the application, the one that was issuing the 25-30k LDLM_ENQUEUE RPCs, and what it is doing in terms of metadata? For example, is it opening a single file or a small number of files from many processes, or many different files, and with what modes, etc.? Thanks, |
| Comment by Jason Hill (Inactive) [ 04/Feb/14 ] |
|
James, The code that was mentioned on Friday has been held since we discovered that the change the developer made did not result in lower impact. We are continuing to see long response times for metadata operations in the absence of that code – so I think we have something more systemic going on. Cray has an open bug (I mentioned this in the email to you and Peter) for their client (in Cray's bugzilla), and also thinks that -Jason |
| Comment by Jason Hill (Inactive) [ 04/Feb/14 ] |
|
On the advice of Oleg I'm going to spend some time tonight trying to correlate the long response times to lock timeout and eviction notices on the clients. This will take some time as I'll need to basically create a coherent stream of log data from disparate sources and then try to follow messages. Hoping to have something by morning. |
| Comment by Matt Ezell [ 04/Feb/14 ] |
|
After reading that, Jason, I look forward to seeing your analysis - I think it will be really helpful. We have lots of threads being hung:

$ grep -c "was inactive for" kern.log
47
Feb 3 22:40:17 atlas-mds1 kernel: [172488.242493] LNet: Service thread pid 15401 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb 3 22:40:17 atlas-mds1 kernel: [172488.261927] LNet: Skipped 4 previous similar messages
Feb 3 22:40:17 atlas-mds1 kernel: [172488.267771] Pid: 15401, comm: mdt03_237
Feb 3 22:40:17 atlas-mds1 kernel: [172488.272252]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.272253] Call Trace:
Feb 3 22:40:17 atlas-mds1 kernel: [172488.277084] [<ffffffffa06948b5>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.281355] Lustre: lock timed out (enqueued at 1391484017, 1200s ago)
Feb 3 22:40:17 atlas-mds1 kernel: [172488.281357] Lustre: Skipped 5 previous similar messages
Feb 3 22:40:17 atlas-mds1 kernel: [172488.298432] [<ffffffff8150f120>] ? _cond_resched+0x30/0x40
Feb 3 22:40:17 atlas-mds1 kernel: [172488.304870] [<ffffffffa03b265e>] ? cfs_cond_resched+0xe/0x10 [libcfs]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.312380] [<ffffffffa03c85bf>] ? cfs_hash_for_each_relax+0x17f/0x360 [libcfs]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.320973] [<ffffffffa069d850>] ? ldlm_res_hash_dump+0x0/0x60 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.328685] [<ffffffffa069d850>] ? ldlm_res_hash_dump+0x0/0x60 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.336381] [<ffffffffa03c9f6f>] ? cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.344975] [<ffffffffa069da6c>] ? ldlm_namespace_dump+0xec/0x1c0 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.352978] [<ffffffffa06b4a3f>] ? ldlm_expired_completion_wait+0x2af/0x390 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.362048] [<ffffffffa06b9038>] ? ldlm_completion_ast+0x508/0x960 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.370144] [<ffffffffa06b4790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.378995] [<ffffffff81063600>] ? default_wake_function+0x0/0x20
Feb 3 22:40:17 atlas-mds1 kernel: [172488.386125] [<ffffffffa06b8758>] ? ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.394614] [<ffffffffa06b8b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.402507] [<ffffffffa0d11a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.409813] [<ffffffffa0d17c7b>] ? mdt_object_lock0+0x33b/0xaf0 [mdt]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.417312] [<ffffffffa0d11a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.424640] [<ffffffffa06b8b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.432533] [<ffffffffa0d184f4>] ? mdt_object_lock+0x14/0x20 [mdt]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.439741] [<ffffffffa0d275a9>] ? mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.447861] [<ffffffffa06e1135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.455290] [<ffffffffa0709646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.463197] [<ffffffffa06e33c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.471184] [<ffffffffa0d283ad>] ? mdt_intent_getattr+0x29d/0x490 [mdt]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.478879] [<ffffffffa0d14f1e>] ? mdt_intent_policy+0x39e/0x720 [mdt]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.486491] [<ffffffffa0699831>] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.494398] [<ffffffffa06c01ef>] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.502771] [<ffffffffa0d153a6>] ? mdt_enqueue+0x46/0xe0 [mdt]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.509582] [<ffffffffa0d1ba97>] ? mdt_handle_common+0x647/0x16d0 [mdt]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.517282] [<ffffffffa0d553f5>] ? mds_regular_handle+0x15/0x20 [mdt]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.524805] [<ffffffffa06f23c8>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.533864] [<ffffffffa03b25de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.541076] [<ffffffffa03c3d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.548893] [<ffffffffa06e9729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.556704] [<ffffffffa06f375e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.564121] [<ffffffffa06f2c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.571324] [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Feb 3 22:40:17 atlas-mds1 kernel: [172488.577289] [<ffffffffa06f2c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.584510] [<ffffffffa06f2c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb 3 22:40:17 atlas-mds1 kernel: [172488.591709] [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

We are getting evictions due to lock callback timers expiring:

$ grep -c "lock callback timer expired" kern.log
52
I understand these big numbers below may be bogus. Some of these are showing over 7 hours. Does it even make sense for a client to keep a metadata lock that long?

Feb 3 22:09:31 atlas-mds1 kernel: [170644.624185] LustreError: 13226:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 25769s: evicting client at 108@gni3 ns: mdt-atlas1-MDT0000_UUID lock: ffff883f0ca506c0/0xce18f06f09f073de lrc: 3/0,0 mode: PR/PR res: [0x200044aee:0x52d6:0x0].0 bits 0x13 rrc: 409 type: IBT flags: 0x200400000020 nid: 108@gni3 remote: 0x115740db98f79fb1 expref: 117 pid: 14806 timeout: 4465563044 lvb_type: 0
Feb 3 22:20:01 atlas-mds1 kernel: [171273.688384] LustreError: 14898:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 26399s: evicting client at 181@gni3 ns: mdt-atlas1-MDT0000_UUID lock: ffff883ae3910480/0xce18f06f09f0741d lrc: 3/0,0 mode: PR/PR res: [0x200044aee:0x52d6:0x0].0 bits 0x13 rrc: 404 type: IBT flags: 0x200400000020 nid: 181@gni3 remote: 0xddf404e7006e98bb expref: 117 pid: 15497 timeout: 4466193960 lvb_type: 0
Feb 3 22:30:51 atlas-mds1 kernel: [171922.722297] LustreError: 15176:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 27049s: evicting client at 179@gni3 ns: mdt-atlas1-MDT0000_UUID lock: ffff88356b380900/0xce18f06f09f07582 lrc: 3/0,0 mode: PR/PR res: [0x200044aee:0x52d6:0x0].0 bits 0x13 rrc: 401 type: IBT flags: 0x200400000020 nid: 179@gni3 remote: 0x4c53c41d524a00b6 expref: 117 pid: 14857 timeout: 4466843371 lvb_type: 0

Many of these are from gni3, which is our XC30 that uses Lustre 2.3 clients. |
| Comment by Oleg Drokin [ 04/Feb/14 ] |
|
Clients hold locks as long as they wish if there is no contention, since holding them is essentially free and saves an RPC if the lock is actually needed again. Cache is tied to locks. The thread sitting there idle for 1200 seconds IS real time; the stack trace does indicate it was waiting for some lock to be canceled. What's your at_max? |
| Comment by Matt Ezell [ 04/Feb/14 ] |
|
All of our clients should be setting at_max to 600 seconds. This also appears to be the default, so it's probably safe to say that's a global setting. |
| Comment by Matt Ezell [ 04/Feb/14 ] |
|
> Clients hold locks as long as they wish if there is no contention, since it's essentially free, but saves an RPC if it's actually needed again.

Sure, if the same node needs it again. But if a different node needs the lock, it has to wait on the blocking AST. Our Cray clients have lru_max_age set to 9000000 (104 days), which is better than the default value of 36000000 (416 days!). We might want to look into reducing this, but really the locks will be returned earlier because of lru_size. We set the MDT lru_size to 2000 to (try to) minimize memory usage on the MDS. |
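A small sketch of inspecting and lowering that age limit on a client via lctl; the example value is an arbitrary illustration, and the units reported by the client version in use should be checked before changing it:

# current age limit for unused locks, per namespace
lctl get_param ldlm.namespaces.*.lru_max_age
# example only: age unused locks out much sooner than the shipped default
lctl set_param ldlm.namespaces.*.lru_max_age=900000 |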
| Comment by Oleg Drokin [ 04/Feb/14 ] |
|
I see. The enqueue timeout is defined as ldlm_get_enq_timeout(lock) * 2, and ldlm_get_enq_timeout would likely return at_max in your case, hence 1200 seconds. Understanding why the clients don't send the cancels in (or why the cancels might just be getting dropped somewhere along the way) might be the key here, I guess. |
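As a worked example of the arithmetic Oleg describes (assuming the adaptive service-time estimate has grown to the at_max cap):

# the enqueue wait is twice the (at_max-capped) adaptive timeout estimate
at_max=600
echo $(( 2 * at_max ))   # -> 1200, matching the "was inactive for 1200.00s" watchdog messages |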
| Comment by Oleg Drokin [ 04/Feb/14 ] |
|
If a different node needs the lock, that's OK: it needs to send an RPC anyway, and that's not fast either, so we might as well allow the holder to keep the lock as long as it needs. Currently the whole heuristic assumes that resources are not really contended. What we eventually need to do better is see which resources are actually highly contended and only release those locks early (possibly as soon as we are done with them). |
| Comment by Jason Hill (Inactive) [ 04/Feb/14 ] |
|
Tangential to the current analysis of timeouts, evictions, and long responses: the application we identified had ~2000 MPI ranks finishing at the same time and each opening a file for writing output (in the same directory, not pre-created files). The mitigation we have put in place is to only allow 100 ranks to write at a time, iterating until all ranks have written. There are other tactics we could advise the user to take, but this is the one we're grabbing for a short-term test. This in no way resolves the issues (as I stated above, the user has been in scheduler jail since Friday and we continued to see issues), but I did want to follow up, since some were interested in how the application was able to generate ~50k lock enqueues in 30s. |
| Comment by Oleg Drokin [ 05/Feb/14 ] |
|
Copying my comment from the other ticket: OK, I uploaded http://review.whamcloud.com/9127 , which I tested lightly here. It fixes the lock eviction message to actually specify how long the wait has been since sending the RPC. This only makes sense to install on servers, primarily the MDS, though I tested it and clients don't die either in case you install it there by mistake. I imagine it would be interesting to run with this and, once the first 1200s timeout hits, get a log of that timeout plus the preceding hour of messages. |
| Comment by Matt Ezell [ 05/Feb/14 ] |
|
Thanks Oleg. Right now, our theory is that clients are getting stuck with the issue in the referenced ticket. We have two possible workarounds that we are trying; one today and the second if the first doesn't help:
We hope Alexey will have a client-side patch we can try in the next day or two. Jason Hill: I think it makes sense to watch for a couple of days with the above workarounds before we consider taking an MDS patch. Do you agree? |
| Comment by Jason Hill (Inactive) [ 05/Feb/14 ] |
|
All,

Based on the findings by Cray in their bugs for the client and Matt's information above, I recommend changing the priority from Major to Minor. Thank you all for your detailed attention. We have talked about possibly enabling the debugging patches when Titan gets an updated client from Cray/Xyratex; we will keep you informed. – |
| Comment by James Nunez (Inactive) [ 05/Feb/14 ] |
|
Jason, Thanks for the update. I've changed the priority of this ticket at your request. James |
| Comment by Oleg Drokin [ 05/Feb/14 ] |
|
Matt, the 4300 might explain long lock cancel times for OSTs, but not for MDTs, so I'd still want to better understand what goes on with MDT locks here. |
| Comment by Jason Hill (Inactive) [ 08/Feb/14 ] |
|
Oleg, We are rolling titan's clients back to 1.8.6 from Cray on Monday. If we integrate your debugging patch and roll the client back, will you get the data you need to analyze? – |
| Comment by Oleg Drokin [ 08/Feb/14 ] |
|
yes. |
| Comment by Matt Ezell [ 11/Feb/14 ] |
|
Debug patch was installed during maintenance yesterday. Many hits of the new "Long-waiting lock" message:

Feb 11 05:34:45 atlas-mds1 kernel: [66331.830426] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 172 sec
Feb 11 05:44:05 atlas-mds1 kernel: [66892.365498] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 213 sec
Feb 11 05:54:12 atlas-mds1 kernel: [67499.945964] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 213 sec
Feb 11 06:16:29 atlas-mds1 kernel: [68838.222714] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 78 sec
Feb 11 06:24:04 atlas-mds1 kernel: [69293.656216] LustreError: 20107:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec
Feb 11 06:34:05 atlas-mds1 kernel: [69895.230302] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec
Feb 11 06:44:06 atlas-mds1 kernel: [70496.803999] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec
Feb 11 06:54:21 atlas-mds1 kernel: [71112.391365] LustreError: 18817:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 210 sec
Feb 11 07:04:28 atlas-mds1 kernel: [71719.971048] LustreError: 18852:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 204 sec
Feb 11 07:15:48 atlas-mds1 kernel: [72400.620205] LustreError: 19317:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 125 sec
Feb 11 07:24:43 atlas-mds1 kernel: [72936.132036] LustreError: 18544:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 191 sec
Feb 11 07:34:41 atlas-mds1 kernel: [73534.702299] LustreError: 20151:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 195 sec
Feb 11 07:44:15 atlas-mds1 kernel: [74109.251117] LustreError: 20446:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 222 sec
Feb 11 07:54:16 atlas-mds1 kernel: [74710.824274] LustreError: 18539:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 222 sec
Feb 11 08:04:39 atlas-mds1 kernel: [75334.420048] LustreError: 20352:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 219 sec
Feb 11 08:15:47 atlas-mds1 kernel: [76003.057160] LustreError: 19125:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 161 sec
Feb 11 08:26:26 atlas-mds1 kernel: [76642.667237] LustreError: 18540:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 123 sec
Feb 11 08:34:55 atlas-mds1 kernel: [77152.154357] LustreError: 19405:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 216 sec
Feb 11 08:45:50 atlas-mds1 kernel: [77807.778867] LustreError: 19036:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 162 sec
Feb 11 08:55:17 atlas-mds1 kernel: [78375.320158] LustreError: 25518:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 209 sec
Feb 11 09:05:03 atlas-mds1 kernel: [78961.879719] LustreError: 19181:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec
Feb 11 09:15:04 atlas-mds1 kernel: [79563.453708] LustreError: 19691:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec
Feb 11 09:25:12 atlas-mds1 kernel: [80172.034167] LustreError: 25536:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 217 sec
Feb 11 09:35:07 atlas-mds1 kernel: [80767.602473] LustreError: 12379:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 223 sec
Feb 11 09:45:07 atlas-mds1 kernel: [81368.175242] LustreError: 19396:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### Long-waiting lock 224 sec

No hits of "requested timeout %d, more than at_max" or "Adding a lock, but the front position is scheduled in %d seconds". Also seeing messages like:

Feb 11 09:48:52 atlas-mds1 kernel: [81593.435989] LNet: Service thread pid 19926 completed after 16665.22s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Feb 11 09:56:36 atlas-mds1 kernel: [82058.639480] Lustre: atlas1-MDT0000: Client c7c41ec7-373b-1a4a-8f79-d741cf79e22a (at 12589@gni100) refused reconnection, still busy with 2 active RPCs |
| Comment by Oleg Drokin [ 11/Feb/14 ] |
|
Hm, so only the extra waiting, huh. And the numbers are not all that far off. The slow-service-thread and "still busy" messages can't be new ones - you should have gotten them in the past too, right? |
| Comment by Matt Ezell [ 11/Feb/14 ] |
Feb 11 10:18:55 atlas-mds1 kernel: [83398.111851] LustreError: 17613:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 18469s: evicting client at 14346@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881463276000/0xd030cd07cae2a16b lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1847 type: IBT flags: 0x200400000020 nid: 14346@gni100 remote: 0xc3ce954c318ff7ca expref: 37 pid: 18996 timeout: 4377983092 lvb_type: 0 used 0 Feb 11 10:28:56 atlas-mds1 kernel: [83999.685683] LustreError: 19235:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 19070s: evicting client at 14349@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881f796cb000/0xd030cd07cae2a205 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1833 type: IBT flags: 0x200400000020 nid: 14349@gni100 remote: 0x4c3f68e1e5538822 expref: 36 pid: 19097 timeout: 4378584127 lvb_type: 0 used 0 Feb 11 10:38:57 atlas-mds1 kernel: [84601.260349] LustreError: 17617:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 19671s: evicting client at 16378@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff88155ad90480/0xd030cd07cae2a259 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1828 type: IBT flags: 0x200400000020 nid: 16378@gni100 remote: 0xc7d8717e54253dc2 expref: 36 pid: 19226 timeout: 4379185137 lvb_type: 0 used 0 Feb 11 10:48:58 atlas-mds1 kernel: [85202.833436] LustreError: 20236:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 20272s: evicting client at 16310@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881a9a7d46c0/0xd030cd07cae2a2b4 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1824 type: IBT flags: 0x200400000020 nid: 16310@gni100 remote: 0xbd5c384bff16401 expref: 36 pid: 19337 timeout: 4379786098 lvb_type: 0 used 0 Feb 11 10:58:59 atlas-mds1 kernel: [85804.407052] LustreError: 19336:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 20873s: evicting client at 13309@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff883fb36c7480/0xd030cd07cae2a3e8 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1820 type: IBT flags: 0x200400000020 nid: 13309@gni100 remote: 0x33c7e22dc8d25f25 expref: 36 pid: 19745 timeout: 4380387245 lvb_type: 0 used 0 Feb 11 11:09:00 atlas-mds1 kernel: [86405.980993] LustreError: 19342:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 21474s: evicting client at 16366@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff883ba9de5240/0xd030cd07cae2a489 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1811 type: IBT flags: 0x200400000020 nid: 16366@gni100 remote: 0x1e95102493397b80 expref: 37 pid: 20353 timeout: 4380988521 lvb_type: 0 used 0 Feb 11 11:19:02 atlas-mds1 kernel: [87008.555529] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 22076s: evicting client at 10380@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881f7f49f900/0xd030cd07cae2a577 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1788 type: IBT flags: 0x200400000020 nid: 10380@gni100 remote: 0x758d3352cfc91da4 expref: 36 pid: 19368 timeout: 4381590703 lvb_type: 0 used 0 Feb 11 11:29:04 atlas-mds1 kernel: [87611.130362] LustreError: 17619:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 22678s: evicting client at 11922@gni100 ns: mdt-atlas1-MDT0000_UUID lock: 
ffff881448cdad80/0xd030cd07cae2a61f lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1780 type: IBT flags: 0x200400000020 nid: 11922@gni100 remote: 0x9f490f3250dcfa26 expref: 40 pid: 19083 timeout: 4382192237 lvb_type: 0 used 0 Feb 11 11:39:05 atlas-mds1 kernel: [88212.705133] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 23279s: evicting client at 16399@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881a19db7d80/0xd030cd07cae2a657 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1775 type: IBT flags: 0x200400000020 nid: 16399@gni100 remote: 0xae3b55b3b0e8fcaf expref: 37 pid: 19294 timeout: 4382793129 lvb_type: 0 used 0 Feb 11 11:49:06 atlas-mds1 kernel: [88814.277996] LustreError: 18622:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 23880s: evicting client at 13288@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff883fb1568b40/0xd030cd07cae2a6a4 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1772 type: IBT flags: 0x200400000020 nid: 13288@gni100 remote: 0xbc7520cd6a29b2c2 expref: 37 pid: 19662 timeout: 4383394123 lvb_type: 0 used 0 Feb 11 11:59:07 atlas-mds1 kernel: [89415.852733] LustreError: 25491:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 24481s: evicting client at 12596@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff883fac92e900/0xd030cd07cae2a745 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1769 type: IBT flags: 0x200400000020 nid: 12596@gni100 remote: 0xb235a780d20e5531 expref: 38 pid: 20296 timeout: 4383995131 lvb_type: 0 used 0 Feb 11 12:09:08 atlas-mds1 kernel: [90017.426664] LustreError: 20036:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 25082s: evicting client at 11054@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff883fa39f1d80/0xd030cd07cae2a895 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1755 type: IBT flags: 0x200400000020 nid: 11054@gni100 remote: 0x39056ad1e7dcf9ff expref: 39 pid: 20374 timeout: 4384596118 lvb_type: 0 used 0 Feb 11 12:19:11 atlas-mds1 kernel: [90621.002347] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 25684s: evicting client at 12598@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881aac5e0480/0xd030cd07cae2ab12 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1742 type: IBT flags: 0x200400000020 nid: 12598@gni100 remote: 0xbc132fc49289524 expref: 40 pid: 18871 timeout: 4385199738 lvb_type: 0 used 0 |
| Comment by Matt Ezell [ 11/Feb/14 ] |
|
Of the 48 evictions I've seen today, most of them refer to the same FID:

$ grep "lock callback" kern.log | awk '{print $29}' | sort | uniq -c | sort -nr

Using 'lfs fid2path', all 48 of these correspond to a directory: [0x20007d38d:0xf5:0x0] is a directory with 1262 files in it. |
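For reference, a small sketch of the FID-to-path lookup used above; the /lustre/atlas1 mount point is taken from a later comment, and the path in the second command is only a hypothetical illustration:

# resolve the hot FID back to a pathname (run on a client that mounts the filesystem)
lfs fid2path /lustre/atlas1 '[0x20007d38d:0xf5:0x0]'
# and, going the other way, confirm a directory's FID (path is a placeholder)
lfs path2fid /lustre/atlas1/path/to/that/directory |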
| Comment by Oleg Drokin [ 11/Feb/14 ] |
|
How long is the timeout listed in the lock eviction message? 1262 files is not all that many, either. |
| Comment by Oleg Drokin [ 11/Feb/14 ] |
|
Oh, I see you pasted the eviction messages too. The time is still in the 20k-seconds range, which cannot be right, sigh. |
| Comment by Matt Ezell [ 11/Feb/14 ] |
|
It's logging every 10 minutes. The time listed increases by about 600s (10 minutes) each time. It seems like the lock is bouncing between clients. If a client gets evicted due to a callback timer expiring, and that lock is re-granted to the next waiting client, do you update l_last_activity? |
| Comment by Oleg Drokin [ 11/Feb/14 ] |
|
Yes, the re-granting is actually an entirely new enqueue, so it gets counted from zero, of course. I also note that all the expired locks listed are of PR mode. Since this is a directory, this could only be a result of readdir. |
| Comment by Matt Ezell [ 11/Feb/14 ] |
|
Here's the logs from dir.c from the Cray clients. c0-0c0s2n2 is 90@gni100 and c24-0c0s6n0 is 9228@gni100. Both of those are login nodes.

console-20140130:2014-01-30T18:23:55.673423-05:00 c0-0c0s2n2 LustreError: 4961:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200037b5b:0x307f:0x0] at 0: rc -108
console-20140130:2014-01-30T18:23:55.673430-05:00 c0-0c0s2n2 LustreError: 4961:0:(dir.c:595:ll_dir_read()) error reading dir [0x200037b5b:0x307f:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:11.186576-05:00 c24-0c0s6n0 LustreError: 14610:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:11.186640-05:00 c24-0c0s6n0 LustreError: 14610:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:13.317079-05:00 c24-0c0s6n0 LustreError: 14611:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:13.317127-05:00 c24-0c0s6n0 LustreError: 14611:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:16.408287-05:00 c24-0c0s6n0 LustreError: 14612:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:16.408328-05:00 c24-0c0s6n0 LustreError: 14612:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -108

Here are some client logs from 14346@gni100 (the first client evicted in the log above). 10 minutes before the eviction, it dropped a PUT from the MDS. How do I debug why it was dropped?

console-20140211:2014-02-11T09:08:07.891822-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred while communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T09:08:07.891830-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T09:26:32.479117-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred while communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T09:26:32.479133-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T09:44:52.493914-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred while communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T09:44:52.493935-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T10:03:13.998980-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred while communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T10:03:13.999004-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T10:08:54.976373-05:00 c13-5c0s5n0 LNet: 15569:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.226.72@o2ib200 portal 10 match 1459678641612461 offset 192 length 536: 2
console-20140211:2014-02-11T10:21:38.458816-05:00 c13-5c0s5n0 LustreError: 167-0: This client was evicted by atlas1-MDT0000; in progress operations using this service will fail.
console-20140211:2014-02-11T10:21:38.458837-05:00 c13-5c0s5n0 LustreError: Skipped 10 previous similar messages
console-20140211:2014-02-11T10:21:38.458844-05:00 c13-5c0s5n0 Lustre: Server atlas1-MDT0000_UUID version (2.4.2.0) is much newer than client version (1.8.6)
console-20140211:2014-02-11T10:21:38.458851-05:00 c13-5c0s5n0 Lustre: Skipped 10 previous similar messages
console-20140211:2014-02-11T10:21:38.458859-05:00 c13-5c0s5n0 Lustre: atlas1-MDT0000-mdc-ffff88080dd5c800: Connection restored to service atlas1-MDT0000 using nid 10.36.226.72@o2ib200. |
| Comment by Oleg Drokin [ 12/Feb/14 ] |
|
Hm, the mds_connect failing with -16 before the dropped PUT means the client is already disconnected from the MDS, and the MDS does not let it back in because it believes it is already serving a request from this client (the server-side messages are "refused reconnection, still busy with X active RPCs"). The dropped PUT is because something was sent to a client that the client did not expect; portal 10 is MDC_REPLY. What likely happened here is that the request on the MDS took a really, really long time - long enough for the client to time out its original request and try to reconnect (only to be rejected by the server as above) - and then eventually the MDS thread completed (visible as a "This indicates the system was overloaded" message) and sent a response, alas to a client that is no longer listening for that response. But because the server is then no longer processing anything from this client, the next reconnect succeeds (and the client notices it actually was evicted).

Earlier in that same log you'll see there was a request-timed-out message of some sort; what was it and when (or, if there was more than one, all of them)? How synced up are the clocks on the servers and clients (for log purposes)? 10:18:55 is before the dropped PUT reply. Could it be we had a deadlock-like situation where a client holds the lock, then a readdir (or whatever the request was, but it must have had the lock held to prevent cancellation) caused the MDS to perform an operation that wanted that lock canceled?

Additionally, this seems to be a fairly recent problem. I believe ORNL was running 2.4 servers + 1.8 clients for quite a while before, so can you briefly describe what changed before the problems appeared? Just the fs reformat? If you can dump a sysrq-t from the first node that develops the 'The mds_connect operation failed with -16' symptom, that might be interesting, to see what other stuff was happening there, though I guess the problem might be due to the interaction of several nodes. I wonder if something like this is a possibility: |
| Comment by Matt Ezell [ 12/Feb/14 ] |
|
I've been having problems stitching together the logs. The MDS suppressed dumping stack traces to 3 every 300 seconds, so it seems like I'm missing different pieces for different occurrences at different times. I'll look into it some more tonight. Interestingly, it looks like this behavior has calmed down today. I'm guessing it's workload-related. Time should be pretty-well synchronized. Everything uses NTP. ORNL was running with (Cray) 1.8.6 clients against multiple file systems. Most users were still on Widow (1.8.x based) and we brought in our new Atlas system (2.4.x). In December and January, users migrated from Widow to Atlas. As more users moved to Atlas, it seemed like our slowdowns got more common/worse. A couple weeks ago we decommissioned Widow and moved most clients to 2.4 (since we only had 2.4 servers). We hit some client-side issues and decided to downgrade clients to (Cray) 1.8.6 this Monday. |
| Comment by Oleg Drokin [ 13/Feb/14 ] |
|
It could well be lock related. If you can do a sysrq-t when you are in this situation, it should be useful; at least we should be able to get some idea of what's waiting where (and especially where the readdir threads wait). |
| Comment by Jason Hill (Inactive) [ 14/Feb/14 ] |
|
We attempted to crash dump the MDS today (memory pressure was indicated in the logs by messages like "page allocation failure"). Last time I looked there was 175GB in slabs; I did not grab a slabtop to see where we were holding memory. The crash dump process was stuck for ~20 minutes and the admin decided to kill the crash dump (ipmi power off), so we don't have additional data.

The performance of metadata operations continues to be a large complaint from our user base, so we'd like to raise the priority back. We ran with the 1.8.6 client for 1 week and the logs on Titan are much quieter. Please indicate the next steps that we should take in terms of gathering data.

Current status: Server: Lustre 2.4.2 + the debug patch referenced above.

Can we raise the priority? Thanks, |
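A hypothetical sketch (not from the ticket) of capturing the slab information mentioned above the next time memory pressure is observed on the MDS, so the 175GB can be attributed to specific caches:

# one-shot snapshot of the largest slab caches, sorted by cache size
slabtop -o -s c | head -30 > /tmp/slabtop.$(date +%s)
# raw numbers for later comparison
grep -i slab /proc/meminfo
cat /proc/slabinfo > /tmp/slabinfo.$(date +%s) |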
| Comment by James Nunez (Inactive) [ 14/Feb/14 ] |
|
Jason, I've returned the priority of this ticket to Major. James |
| Comment by Oleg Drokin [ 14/Feb/14 ] |
|
Jason: The previous analysis, where most of the lock timeouts were attributed to 2.3 clients, was also interesting. Is this still the pattern? The page allocation failure - was that a "tage" allocation failure by any chance, and do you have a sample if not? |
| Comment by Jason Hill (Inactive) [ 17/Feb/14 ] |
|
Oleg, We've not seen the "refused connection" messages since 02/11/14. However since introducing the patch from [root@atlas-mds1 ~]# uname -a [slabtop] OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME |
| Comment by Jason Hill (Inactive) [ 17/Feb/14 ] |
|
for completeness: [root@atlas-mds1 ~]# rpm -qi lustre |
| Comment by Jason Hill (Inactive) [ 17/Feb/14 ] |
|
Also, to correlate, we've consumed a large number of MDT inodes for atlas1:

[root@lfs-mgmt1 scratch]# df -i /lustre/atlas1

Current slab usage for atlas-mds3 (atlas2's MDS) is ~30GB. The next test is to fire up the TDS and create ~500M files to see whether the slab issue tracks the number of files in the filesystem. |
| Comment by Matt Ezell [ 18/Feb/14 ] |
|
We ran oprofile on the MDS and got the following top consumers:

samples  %        image name  app name   symbol name
4753117  40.5645  vmlinux     vmlinux    __write_lock_failed
1276800  10.8966  libcfs.ko   libcfs.ko  cfs_percpt_lock
869346    7.4193  oprofiled   oprofiled  /usr/bin/oprofiled
470403    4.0146  vmlinux     vmlinux    __vmalloc_node
411024    3.5078  vmlinux     vmlinux    rb_get_reader_page
277246    2.3661  vmlinux     vmlinux    ring_buffer_consume
276266    2.3577  vmlinux     vmlinux    poll_idle
239955    2.0478  lod.ko      lod.ko     lod_alloc_qos.clone.0
211548    1.8054  vmlinux     vmlinux    intel_idle
198649    1.6953  vmlinux     vmlinux    remove_vm_area
146320    1.2487  vmlinux     vmlinux    smp_call_function_many
141106    1.2042  vmlinux     vmlinux    generic_smp_call_function_interrupt

That looks a lot like the results from |
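A hedged sketch of the call-graph follow-up discussed below, so that __write_lock_failed can be attributed to its callers; the flags are from the legacy opcontrol/opreport tooling and the vmlinux debuginfo path is an assumption to adjust for the installed kernel-debuginfo package:

# profile with call-graph collection during a slowdown window
opcontrol --init
opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux --callgraph=16
opcontrol --start
sleep 300            # capture a window of the slowdown
opcontrol --dump
opcontrol --stop
opreport --callgraph --symbols | head -100 |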
| Comment by Oleg Drokin [ 18/Feb/14 ] |
|
Yes, you are definitely seeing contention on some lock, but it could be any lock that's compiled into the kernel; without call-trace analysis it's hard to tell more. The patch from LU-4579 should not really affect any memory allocations whatsoever. As for slab usage, are you sure this did not happen before? I would imagine the cache will grow to fill the entire available memory as you access more and more files; that is the purpose of the cache. |
| Comment by Matt Ezell [ 19/Feb/14 ] |
|
We'll do some more analysis tomorrow to try to get call graphs to see what locks we are contending on. We think it's vmalloc at this point. We plan to try the patch from the referenced ticket. What are the implications of increasing OBD_ALLOC_BIG? Just more likely to hit fragmentation/allocation failures/slowdowns? Is the following a really bad idea:

diff --git a/lustre/include/obd_support.h b/lustre/include/obd_support.h
index 24671c3..83d3c62 100644
--- a/lustre/include/obd_support.h
+++ b/lustre/include/obd_support.h
@@ -704,7 +704,8 @@ do { \
  * since vmalloc in Linux doesn't perform well on multi-cores system, calling
  * vmalloc in critical path would hurt peformance badly. See LU-66. */
-#define OBD_ALLOC_BIG (4 * PAGE_CACHE_SIZE)
+#define OBD_ALLOC_BIG_ORDER 4
+#define OBD_ALLOC_BIG (PAGE_CACHE_SIZE << OBD_ALLOC_BIG_ORDER)
 #define OBD_ALLOC_LARGE(ptr, size) \
 do { \

Or should we instead try to kmalloc without GFP_WAIT and fall back to vmalloc if it fails? Then we can compare the pointer to see if it's between VMALLOC_START and VMALLOC_END to know which deallocation function to use (except it looks like those are being deprecated in the kernel). We might only want to do that if the size is over 4*PAGE_CACHE_SIZE, to avoid extra checks in the fast path for small allocations. |
| Comment by Jason Hill (Inactive) [ 19/Feb/14 ] |
|
Oleg - slab usage got a separate issue – |
| Comment by Jason Hill (Inactive) [ 19/Feb/14 ] |
|
Oleg: We did get a few trips of your debug patch in the last two days. I failed to look for it yesterday; sorry about that. Here are the messages. I will work on getting you more info from around these times:
|
| Comment by Andreas Dilger [ 19/Feb/14 ] |
|
For OBD_FREE_LARGE() it can use is_vmalloc_addr() to determine if a passed pointer was allocated via kmalloc() or vmalloc(). That would allow OBD_ALLOC_LARGE() to try kmalloc() first (maybe without __GFP_WAIT), and only fall back to vmalloc() in case of failure. |
| Comment by James A Simmons [ 19/Feb/14 ] |
|
I was just discussing that on |
| Comment by Jason Hill (Inactive) [ 19/Feb/14 ] |
|
Oleg,

I'm attaching the Lustre logs from atlas and Titan during the time ranges above (we hit some more this morning around 9:40 also). Curiously, there are no Titan entries for most of that time. I will also look over our smaller Cray XC30 (no Lustre messages during any of these time frames) and our 200-node Linux cluster - those are our three biggest consumers of Lustre - to get their log data from these time frames.

hilljj@titan-smw0:/tmp$ for i in $(ls titan-lustre-2014-02-1*); do wc -l $i; done

[root@rhea-mgmt2 tmp]# wc -l rhea-lustre-2014-02-1*

[root@rhea-mgmt2 tmp]# md5sum rhea-lustre-logs-LU4570.tgz

hilljj@hill-desktop:~$ wc -l ccs_homedir/ldlm_lockd.c.lustrekernel.txt |
| Comment by Oleg Drokin [ 20/Feb/14 ] |
|
Hm, so all the trips of my at_max message seem to be at most 755 seconds. In the client logs we see timeouts to both OSTs and MDTs, but sadly the reduced printing only shows the xid, not the opc, so there's no way to know what requests are failing. Interesting that there's quite a bunch of timeouts in the low range - 17s, 19s, 31s - that could not be explained by overly long timeouts or anything like that. I wonder why those were not replied to by the server in time, if only to send an early reply? |
| Comment by Oleg Drokin [ 22/Feb/14 ] |
|
So, I've been digging around the code and it looks like we send close RPCs to readdir portal now, so if you have heavy close traffic, and close is affected by |
| Comment by Jason Hill (Inactive) [ 28/Feb/14 ] |
|
console output of echo t > /proc/sysrq-trigger under load. |
| Comment by James A Simmons [ 28/Feb/14 ] |
|
I uploaded the MDS stack traces to ftp.whamcloud.com/uploads/ |
| Comment by Oleg Drokin [ 03/Mar/14 ] |
|
Hm, looking at the traces I see 134 threads waiting for object allocations from OSTs. This is nothing too critical, I imagine. If you can manage to dump a sysrq-t trace from a node that's trying to reconnect and gets refused with 'refused reconnection, still busy with X active RPCs' while this is ongoing, I am interested in getting that. |
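A hypothetical helper for catching that state on a suspect client (the trigger string is taken from the client console logs earlier in this ticket; the 10-second poll interval is arbitrary): it watches the kernel ring buffer and fires a sysrq-t as soon as the reconnect failure appears.

#!/bin/bash
# When the MDS starts refusing reconnection, the client logs
# "The mds_connect operation failed with -16"; capture a task dump right then.
while true; do
    if dmesg | grep -q "mds_connect operation failed with -16"; then
        echo t > /proc/sysrq-trigger        # dump all task stacks to the console/dmesg
        dmesg > /tmp/sysrq-t.$(hostname).$(date +%s)
        break
    fi
    sleep 10
done |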
| Comment by Oleg Drokin [ 03/Mar/14 ] |
|
I guess another bit of useful info would be if you can cross reference jobs run on the nodes that enter this stuck state and could not reconnect due to BUSY on MDT. |
| Comment by Matt Ezell [ 12/Mar/14 ] |
|
We had two "still busy" messages since we last patched, but there were special circumstances: Mar 8 09:47:54 atlas-mds1 kernel: [159420.993577] Lustre: atlas1-MDT0000: Client b4ea3026-13aa-b280-3f2e-94a4bab53a72 (at 95@gni100) refused reconnection, still busy with 1 active RPCs Mar 8 09:47:55 atlas-mds1 kernel: [159421.712317] Lustre: atlas1-MDT0000: Client 9fc952d9-a299-68de-a80e-21913bc6e90d (at 9311@gni100) refused reconnection, still busy with 1 active RPCs Around the same time, a Gemini network error on our Cray caused the network to quiesce and re-route. 2014-03-08T09:47:27.602931-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) @@@ Request x1459682730665633 sent from MGC10.36.226.77@o2ib200 to NID 10.36.226.77@o2ib200 17s ago has timed out (17s prior to deadline). 2014-03-08T09:47:27.602945-05:00 c24-0c0s0n3 req@ffff8803ff9e3000 x1459682730665633/t0 o400->MGS@MGC10.36.226.77@o2ib200_0:26/25 lens 192/384 e 0 to 1 dl 1394290046 ref 2 fl Rpc:N/0/0 rc 0/0 2014-03-08T09:47:27.602952-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) Skipped 19 previous similar messages 2014-03-08T09:47:27.602961-05:00 c24-0c0s0n3 LustreError: 166-1: MGC10.36.226.77@o2ib200: Connection to service MGS via nid 10.36.226.77@o2ib200 was lost; in progress operations using this service will fail. 2014-03-08T09:47:27.602970-05:00 c24-0c0s0n3 Lustre: linkfarm-MDT0000-mdc-ffff880407978c00: Connection to service linkfarm-MDT0000 via nid 10.36.226.83@o2ib200 was lost; in progress operations using this service will wait for recovery to complete. 2014-03-08T09:47:33.651947-05:00 c24-0c0s0n3 LNet: Waiting for hardware quiesce flag to clear 2014-03-08T09:47:57.909338-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.169@o2ib218 portal 4 match 1459682730665920 offset 192 length 192: 2 2014-03-08T09:47:57.909351-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1 previous similar message 2014-03-08T09:47:57.909358-05:00 c24-0c0s0n3 Lustre: atlas2-MDT0000-mdc-ffff8802ad97c400: Connection restored to service atlas2-MDT0000 using nid 10.36.226.74@o2ib200. 2014-03-08T09:47:57.909366-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.245@o2ib219 portal 4 match 1459682730666861 offset 192 length 192: 2 2014-03-08T09:47:57.909425-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.42@o2ib203 portal 4 match 1459682730665937 offset 192 length 192: 2 2014-03-08T09:47:57.909432-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1516 previous similar messages I think the clients sent requests, but the response didn't get back to them before the network timed them out. I'm not sure why it timed out, I thought during a quiesce situation gnilnd would extend timeouts. When the client timed out the request, it unlinked the MD, so the responses from the MDS were dropped. The MDS didn't get a NACK, so it thinks it's still waiting on that request to be completed. This is probably a special situation, and likely cannot explain all the occurrences we've seen of this previously. |
| Comment by Oleg Drokin [ 20/Mar/14 ] |
|
I think you are right and this instance is different from before due to network issues. |
| Comment by James Nunez (Inactive) [ 02/Jul/14 ] |
|
Jason, The two patches referenced in this ticket have landed: http://review.whamcloud.com/#/c/10601/ and http://review.whamcloud.com/#/c/6339 . Are you still experiencing this problem? If not, should we close this ticket? |
| Comment by Jason Hill (Inactive) [ 07/Jul/14 ] |
|
James: yes, let's go ahead and resolve this issue. – |
| Comment by James A Simmons [ 07/Jul/14 ] |
|
As a note, I back-ported both of the above patches to 2.5. They are at: http://review.whamcloud.com/#/c/10962 |
| Comment by James Nunez (Inactive) [ 09/Jul/14 ] |
|
We will track the b2_5 patches separately. |