[LU-1504] the /lustre filesystem was unusable for an extended period due to a single OST's dropping out of service Created: 11/Jun/12 Updated: 15/Feb/13 Resolved: 15/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0, Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major |
| Reporter: | Archie Dizon | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | performance | ||
| Environment: |
Clustering |
||
| Attachments: |
|
| Epic: | hang |
| Rank (Obsolete): | 4043 |
| Description |
|
Hello Support, One of our customers, the University of Delaware, has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service with:

Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs

The hang was so bad for one of them (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself; sometimes it does not, and we need to reboot one or more OSS nodes.

"Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's." You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser): "https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com" Full information for the drop-off: Claim ID: vuAFoSBUoReVuaje

Please review the attached log files and provide us with the next course of action, since this is a very critical issue and is impacting their environment. Also, please let me know if you need any further information. Thanks |
| Comments |
| Comment by Cliff White (Inactive) [ 11/Jun/12 ] |
|
First, as explained in the Lustre Manual, we can't use the lustre-log information unless you pre-process it. |
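(For reference, the binary lustre-log dumps are normally converted to readable text with lctl debug_file before analysis; a minimal sketch, where the dump filename below is only an illustration of the /tmp/lustre* naming mentioned in the description:)

lctl debug_file /tmp/lustre-log.1339394407.9443 /tmp/lustre-log.1339394407.9443.txt   # convert binary debug log to text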
| Comment by Archie Dizon [ 11/Jun/12 ] |
|
Hello Cliff, "Pre-processed lustre-log files, per request of WhamCloud." You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser): "https://pandora.nss.udel.edu//pickup.php?claimID=j47zLKRJwQMqRFkA&claimPasscode=iNY2YwUBujk2hGpY&emailAddr=tsingh%40penguincomputing.com" Can you be more specific w.r.t. what system logs you need? I sent along the "/var/log/messages*" and "/var/log/dmesg" files, what additional ones do you need? Thanks |
| Comment by Cliff White (Inactive) [ 11/Jun/12 ] |
|
As I said, we need the complete /var/log/messages, just for the day of the incident. Your logs contained only Lustre messages; we really need to see the complete logs. |
| Comment by Cliff White (Inactive) [ 11/Jun/12 ] |
|
Looking at these logs, I am quite confused. It looks like your system has not run cleanly for quite some time.

May 20 04:04:55 oss4 kernel: Lustre: 8468:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1401607688579645 sent from MGC10.55.31.1@o2ib to NID 10.55.31.2@o2ib 0s ago has failed due to network error (5s prior to deadline).

I can't really tell where your problem starts; as near as I can see, the system has been running in some degraded fashion at least since May 20th. |
| Comment by Archie Dizon [ 11/Jun/12 ] |
|
Hello Cliff, I sent /var/log/messages from the Mills head node to you via Dropbox. The compute nodes do not have local syslog files, they output via rsyslogd to the head node, where it drops into /var/log/messages along with the syslog from the head node itself. So the file I sent essentially includes syslog data from all the LUSTRE clients over the last ~24 hours. You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser): "https://pandora.nss.udel.edu//pickup.php?claimID=46ce3AvsMgKuk0KQ&claimPasscode=Jv6u3TXkTXR14ctJ&emailAddr=tsingh%40penguincomputing.com" Thanks |
| Comment by Cliff White (Inactive) [ 11/Jun/12 ] |
|
In the future, please use our anonymous FTP site for uploads (ftp.whamcloud.com) and indicate the exact file name uploaded in the bug comment. When was the filesystem last restarted? |
| Comment by Cliff White (Inactive) [ 11/Jun/12 ] |
|
Also, are the system clocks on the servers and clients in sync? |
| Comment by Archie Dizon [ 12/Jun/12 ] |
|
Hello Cliff, Please find the customer response in-line below. Also, would it be possible for you to work with the customer directly to avoid any delays? If so, let me know and I'll provide you the contact info.

> Looking at these logs, I am quite confused. It looks like your system has not run cleanly for quite some time.

That's an understatement. We've been rebooting OSS's and the MDS at least 3 times a week for the last 3 months. This system has never run in a way that we'd term "clean."

> When was the last time the filesystem was restarted?

Last week we did the disk and controller firmware updates – so that was the last reboot. The filesystem ran without major issue for a couple of days before we again had to restart OSS's due to hung RPCs, etc.

> May 20 04:04:55 oss4 kernel: Lustre: 8468:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1401607688579645 sent from MGC10.55.31.1@o2ib to NID 10.55.31.2@o2ib 0s ago has failed due to network error (5s prior to deadline).

Yes, I agree. But "network error" really doesn't say much; that's why we asked WhamCloud to analyze the situation. 10.55.31.2 is "mds0b-ib", the standby MDS server. 10.55.31.1 is "mds0a-ib", the active MDS server. So why is "mds0a" trying to talk to "mds0b"? Since "mds0b" is standby under the heartbeat service, there's no reason to expect that LUSTRE communications to it should succeed – there are no active LUSTRE services on it.

> I also see other indications of things being overloaded:

So the same error message is indicative of both "network errors" and "overload"? Or did they mean to cite a different error message? The last ticket we opened with WhamCloud was w.r.t. the "overload" we were observing under the default LUSTRE configuration as the system was built by Penguin. Originally our idea of lowering the max thread limits was deemed not likely to help anything; a few weeks later WhamCloud's solution was to lower the max thread limits, just as we had originally proposed. Obviously that hasn't helped matters.

> I can't really tell where your problem starts; as near as I can see, the system has been running in some degraded fashion at least since May 20th.

This is also very vague: "some degraded fashion" does not tell us how this cluster of services is degraded, and that's the kind of info we were hoping WhamCloud could provide.

> Are you seeing errors in logs now? Can you validate that your network is healthy? I would consider a restart of the filesystem, to get yourself into a known stable state.

As you can see in the logs we sent, errors are a somewhat continuous affair on these systems. As mentioned, we just did a full restart last week; should we expect LUSTRE 1.8.6 to stay stable for no more than 3 or 4 days, and plan on doing a full restart that often? A restart temporarily alleviates problems, but doesn't truly fix anything.

> Also, are the system clocks on the servers and clients in sync?

Yes, they are all sync'ed to the head node's NTP service.

[root@mills 1001]# for n in mds0a oss{1,2,3,4}; do ssh $n /usr/sbin/ntpq -c peer; done

Thanks |
| Comment by Cliff White (Inactive) [ 12/Jun/12 ] |
|
Please send me the contact information. We're always glad to help. |
| Comment by Archie Dizon [ 13/Jun/12 ] |
|
Here is the customer's contact information. Contact Name: Jeffrey Frey |
| Comment by Archie Dizon [ 14/Jun/12 ] |
|
Hello Cliff, Were you able to contact Jeffrey Frey at UoD? Also, do you need any further info to resolve this issue? I don't see any update on the case. Thanks |
| Comment by Peter Jones [ 14/Jun/12 ] |
|
Archie, I have tried to reach Jeffrey by both email and phone, but with no luck so far. Please contact me directly at peter@whamcloud.com with any further questions of this nature. Thanks, Peter |
| Comment by Archie Dizon [ 15/Jun/12 ] |
|
Hi Peter, I am following up with Jeff and his team at U Del. It was brought to my Thanks – Penguin Computing Inc. |
| Comment by Minh Diep [ 15/Jun/12 ] |
|
Could you provide the following information?

1. modprobe.conf or lnet/lustre.conf

Thanks |
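(For context, on these CentOS 5 servers the LNET options normally live in /etc/modprobe.conf or /etc/modprobe.d/lustre.conf; a typical single-fabric o2ib entry looks like the sketch below. The interface name ib0 is an assumption – the actual file contents are what is being requested here.)

# /etc/modprobe.conf (CentOS 5) or /etc/modprobe.d/lustre.conf
options lnet networks=o2ib(ib0)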
| Comment by Peter Jones [ 16/Jun/12 ] |
|
From the customer site " 2. lctl list_nids foreach server and 1 client 3. lctl ping between servers and headnode headnode to 10.55.31.1@o2ib 4. output of lctl show_route Also, we are using a Qlogic QDR Infiniband network if that wasn't mentioned before. " |
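(A minimal sketch of gathering items 2-4 on each server and one client; the NID shown is the active MDS NID quoted earlier in this ticket.)

lctl list_nids               # NIDs configured on this node
lctl ping 10.55.31.1@o2ib    # verify LNET reachability to the active MDS
lctl show_route              # any configured LNET routes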
| Comment by Peter Jones [ 16/Jun/12 ] |
|
More info from customer site: "I'll try to provide some context now if that will help. We have two MDS servers in an HA configuration. We also have four OSS servers (oss[1-4]), where oss1/2 are one HA pair and oss3/4 are another pair. Each OSS has six OSTs.

The initial problem we started seeing in February was that the load would spike on one OSS (as high as 500 at times) and the other OSS would take over. We lowered the thread count (max of 100) and we are still seeing the load spike as high as 99 at times, and one OSS takes over for another. This is probably more an issue with the HA solution provided by Penguin than with Lustre per se, though.

We then started to have problems where disks would go offline in the DotHill enclosures. This seems to have been fixed finally with a recent Seagate firmware update. We continue to see OSTs go offline, however. Sometimes it is temporary and will recover. Other times it is not temporary, and our solution has been to reboot one or both of the OSS pair involved. The recovery after a reboot is quick at times (around 5 minutes); other times it is more like an hour. A lot of the logs Jeff sent you involve the errors when these events happen. Can you tell from the logs what might be a cause?

We are running Lustre 1.8.6. Are there known bugs with this version that we are hitting? Would an upgrade (1.8.7 or 2.x) solve any problems we are seeing? We also had some questions earlier about what the stripe size should be and are thinking about that again now also. We are also exploring providing a non-lustre solution to our users so they can get some productive work done. The many lustre outages have been affecting our users greatly." |
| Comment by Cliff White (Inactive) [ 18/Jun/12 ] |
|
When you say 'the system takes an hour to recover' after a server failover, what data are you using to measure this? Is this server load factor, or some other measure? We need to understand how you define 'recover' in this case.

We need to gather some more data on block IO performance. We can do this directly if you can give us login access to the system. Otherwise, from each server, please obtain and attach the following: |
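(Later comments reference the brw_stats data the site furnished in response to this request; a minimal way to capture a per-OSS snapshot of block IO behavior is sketched below. The output filenames are illustrative, and iostat is included only as a typical companion measure, not as something the source confirms was requested.)

lctl get_param obdfilter.*.brw_stats > /tmp/brw_stats.$(hostname).txt   # per-OST IO size histograms
iostat -x 5 20 > /tmp/iostat.$(hostname).txt                            # block-device utilization over ~100s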
| Comment by Ben Miller [ 19/Jun/12 ] |
|
I define recovery time as the time from when I notice that an OSS has taken over for its partner until the OSTs are available to clients again. OSS4 took over for OSS3 last night and the load on OSS4 was at 100 this morning:

Jun 19 08:03:08 oss4 kernel: LustreError:

OST000c is on OSS3 and is available to clients. OST00{12-17} currently are not. I have attached a gzipped log of the data (prior to rebooting this morning).

thanks, Ben |
| Comment by Ben Miller [ 19/Jun/12 ] |
|
And then right after I sent this email, the OSTs finally became available. Ben |
| Comment by Cliff White (Inactive) [ 19/Jun/12 ] |
|
Please attach the system log for oss4, starting from a time prior to this morning's reboot. Also please attach the logs from the MDS and at least one client, for the same time period. |
| Comment by Ben Miller [ 19/Jun/12 ] |
|
Logs are attached. The headnode-messages file has the syslogs for the head node and the clients. Ben |
| Comment by Cliff White (Inactive) [ 19/Jun/12 ] |
|
Can you run "vmstat 5 100 > vmstat.log" on one of the busy OSTs and attach the output? |
| Comment by Ben Miller [ 19/Jun/12 ] |
|
Here it is for oss3. Ben |
| Comment by Cliff White (Inactive) [ 19/Jun/12 ] |
|
Thanks for providing the information so far. Analyzing the results has been useful in understanding what is going on.

From the IO statistics, we are seeing a great deal of small file IO (4k io size), and very little parallel IO (most of the time only 1 IO in flight). This is not an optimal IO model for Lustre - any steps you can take on the application side to increase IO size, eliminate excessive flush() or sync() calls, or otherwise allow the filesystem to aggregate larger IO will help to improve your performance.

Given this IO pattern the Lustre read cache – which is on by default - may be doing more harm than good. To turn it off please run the command "lctl set_param obdfilter.*.read_cache_enable=0" on all OSS nodes.

Finally, we recommend an immediate upgrade to Lustre 1.8.8-wc1 as this release contains optimizations for small file IO (see http://jira.whamcloud.com/browse/LU-983). The Lustre 1.8.6-wc1 and 1.8.8-wc1 releases are completely compatible, so you can do a rolling upgrade of your systems without needing any downtime. Please let me know if you have any further questions about any of the above and let us know whether this advice helps. |
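(A minimal sketch of applying and checking that setting on each OSS; the get_param call is simply a way to confirm every OST reports the new value and is not something the source prescribes.)

lctl set_param obdfilter.*.read_cache_enable=0   # disable the OSS read cache on all local OSTs
lctl get_param obdfilter.*.read_cache_enable     # confirm each OST now reports 0

Note that lctl set_param is not persistent across a server restart, so the setting would need to be reapplied after a reboot.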
| Comment by Ben Miller [ 19/Jun/12 ] |
|
OK, thanks. I turned off the read cache. We'll let you know if it helps. Ben |
| Comment by Jeffrey Frey [ 20/Jun/12 ] |
|
Cliff: With respect to upgrading the LUSTRE release on these production servers:

(1) Since these systems have QLogic's own OFED installed on them, we would want to build the LUSTRE 1.8.8 server components from scratch, correct? It appears this is what Penguin did when the system was built:

[root@oss1 src]# cd /usr/src

(2) For rolling upgrades, we're assuming the general order of operations would be:
We'd appreciate citations of any important/helpful materials on the subject of LUSTRE rolling upgrades. Thanks for any and all information/feedback you can provide. |
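(A hedged sketch of building the 1.8.8 server packages against an external OFED tree; the source directory, kernel-source path, and OFED path are assumptions and would need to match the QLogic install on these servers.)

cd /usr/src/lustre-1.8.8-wc1
./configure --with-linux=/usr/src/kernels/$(uname -r)-$(uname -m) \
            --with-o2ib=/usr/src/ofa_kernel      # point at the external OFED kernel sources
make rpms                                        # produces lustre, lustre-modules, etc. packages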
| Comment by Cliff White (Inactive) [ 20/Jun/12 ] |
|
As far as we know, Qlogic cards are supported by the OFED supplied with the RedHat kernel - was there a reason given for using external OFED? |
| Comment by Jeffrey Frey [ 20/Jun/12 ] |
|
QLogic's OFED stack is a superset of the stock OFED distribution. It includes value-added utilities/libraries and bug-fixes [that OFED hypothetically rolls back into the baseline distribution]. When we initially set up the head node w/ ScientificLinux 6.1, the stock RDMA kernel module kept the machine from booting (kernel panic every time). Wiping the RHEL IB stack and replacing it with the QLogic one for RHEL 6.1 fixed the issue. As configured by Penguin, the LUSTRE servers are all running CentOS 5.x.

[root@oss1 ~]# ibstat

So in your (Whamcloud's) experience there have never been any instances of data loss due to upgrading like this on-the-fly while the filesystem is still online? We're using the loadable module variant of the client rather than building a LUSTRE kernel for the clients or using a Whamcloud kernel. Is there anything about the 1.8.8 server that demands 1.8.8 clients? Upgrading all clients = lots of downtime for our users, and they've already experienced a LOT of downtime/lost time thanks to LUSTRE. |
| Comment by Cliff White (Inactive) [ 20/Jun/12 ] |
|
We have never had data loss from a rolling upgrade, or from a point upgrade of this type. We routinely test these point-release upgrades/downgrades as part of the release process for each new Lustre release. If there are other dependencies involved (such as Qlogic tools) then it would be advisable to involve Penguin in the decision-making process about how to proceed. While upgrading your Lustre version provides the benefit of the other fixes included, it is quite possible to apply the exact fix you need to your existing version of Lustre, which may be less disruptive to other layers of the stack.

Our clients use a stock RedHat kernel, and you can certainly do a gradual upgrade of the clients as your schedule permits; there is no problem running with a mix of client versions. Before embarking on such an upgrade path you could even experiment with a single client to see whether the pay-off warrants the effort. To do this effectively you would need to identify an example of the kind of workload that triggers this issue so that you can measure the before and after effect of applying this fix. Do you have any idea as to a likely candidate for such a test? |
| Comment by Cliff White (Inactive) [ 27/Jun/12 ] |
|
What is your current state? What else can we do to assist? |
| Comment by Cliff White (Inactive) [ 05/Jul/12 ] |
|
Please update us as to your status. What else can we do to help on this issue? Should we close this bug? |
| Comment by Ben Miller [ 06/Jul/12 ] |
|
We continue to have temporary OST unavailability with the read cache disabled. Ben |
| Comment by Jeffrey Frey [ 12/Jul/12 ] |
|
Cliff: Further research on your Jira site has shown that the issue we're seeing is EXACTLY the situation reported in http://jira.whamcloud.com/browse/LU-416. To what version of Lustre does that incident correspond – 1.8.6? – and is it resolved in 1.8.8? |
| Comment by Oleg Drokin [ 12/Jul/12 ] |
|
I believe your problem does not look like LU-416, which is a Lustre 2.x issue. |
| Comment by Jeffrey Frey [ 12/Jul/12 ] |
|
Interesting that this is a 2.x problem; the fact remains, though, that the symptoms we observe on the client side are EXACTLY the same as the ticket I mentioned, so we were hoping that would provide you folks with a better idea of what's going on. The problem seems like it has to be partly client-oriented, since 185 of the 200 Lustre clients were using the OST without issue while 15 of the clients showed it as unavailable. On the cluster head node where we noticed this recent outage:

[root@mills lustre-OST0006-osc-ffff880d13d29000]# pwd
[root@mills lustre-OST0006-osc-ffff880d13d29000]# more lock_*

Should we read this statement as "use Lustre for applications with a couple of large files to access"? There is no single application here, since this is a community cluster with many different applications, and Lustre was intended by us to be used as working space for users' jobs on the cluster (à la scratch). |
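(For watching the client-side lock counts discussed below without reading the proc files directly, the standard ldlm namespace counters can be queried the same way; a sketch, with the wildcard matching the OSC namespace for the OST quoted above.)

lctl get_param ldlm.namespaces.*OST0006*.lock_count          # locks currently held for that OST
lctl get_param ldlm.namespaces.*OST0006*.lock_unused_count   # locks cached but currently idle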
| Comment by Cliff White (Inactive) [ 12/Jul/12 ] |
|
We have already recommended that you upgrade to 1.8.8 to improve small file IO. We understand that you run a mix of workloads, we believe if you could identify which workload is doing the random 4k IO, changes to that workload would help the situation. |
| Comment by Jeffrey Frey [ 12/Jul/12 ] |
|
Thanks, Cliff. We cannot "jump" into the 1.8.8 upgrade for two important reasons:

(1) We like to be very aware of what's changed in a software package before we install it on a production system – so we have some reading ahead of us to get familiar with the 1.8.6 to 1.8.8 roadmap.

(2) This is a production system with users' computation work running out of the Lustre filesystem – any disruption is a major issue for our users, so we try to minimize those.

In theory a rolling upgrade works flawlessly – but in theory the hardware we were sold behind this Lustre filesystem is enterprise-class, fast storage, too :-\ So our users have to be notified of the possibility of failure, which means that if there could be downtime involved we may as well do other patching at the same time. It just so happens QLogic released a firmware update for our switch recently. We plan to do the 1.8.8 upgrade within the next month; we just have some planning and scheduling to do first.

As for the random 4k i/o, for our sake can you show the applicable portions of the logs/debug traces that led to that diagnosis? In context, is it all block-level i/o to/from files, or could it be standard filesystem ops (ls, cp, tar, etc.), e.g. also fetching size metadata from the OST(s)? |
| Comment by Cliff White (Inactive) [ 12/Jul/12 ] |
|
The random 4k IO was identified in the brw_stats which you furnished to us - to quote that comment (18/Jun/12 11:18 AM): You can examine these files using that method at any time to see how your IO is performing, and can clear the data by echoing 'clear' into the file.

# cat lustre-OST0012
snapshot_time: 1340104119.496271 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
1: 491976 73 73 | 408911 59 59
2: 4588 0 74 | 16816 2 62
4: 1431 0 74 | 10062 1 63
8: 1156 0 74 | 19221 2 66
16: 1077 0 74 | 7433 1 67
32: 1731 0 74 | 13036 1 69
64: 2898 0 75 | 8544 1 70
128: 5149 0 76 | 12009 1 72
256: 160523 23 100 | 187956 27 100
A 'page' in this case is a 4k memory page. So we can see here that 73% of your read IO and 59% of your write IO was 4k in size. This is rather more than we would expect from the odd ls, etc., and likely indicates an application behavior. Again, this is why we advise the 1.8.8 upgrade. The 1.8.8 release has been out for quite some time, and is stable in numerous production deployments. |
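(To re-check the pattern after the read-cache change or any application-side adjustments, the histogram can be cleared and re-sampled as the quoted comment describes; a sketch using the same OST as the output above, with the sampling interval purely illustrative.)

echo clear > /proc/fs/lustre/obdfilter/lustre-OST0012/brw_stats   # reset the accumulated histogram
sleep 600                                                         # let the workload run for a while
cat /proc/fs/lustre/obdfilter/lustre-OST0012/brw_stats            # re-read the IO size distribution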
| Comment by Oleg Drokin [ 12/Jul/12 ] |
|
I wonder about the "185 of 200 clients have no problem accessing the OST" case you mention. Could it be the case that only 15 "problematic" clients are actaully doing any IO on that OST and the rest happen to do IO on other OSTs? People in big government labs have these problems too and the solutions range from changing hardware to limiting "bad-behaving" applications to only certain parts of the machine to actually rewriting the apps to form a better IO pattern. The 100k lock count should not be affecting anything as long as you don't run out of memory on OSS nodes, but looking at the vmstat output its evident you don't have this problem. brw_stat output shows a lot of 4k io, and it only tracks data IO by applications. |
| Comment by Cliff White (Inactive) [ 15/Feb/13 ] |
|
Please reopen this issue if you have further comments/questions. |