LU-1504: the /lustre filesystem was unusable for an extended period due to a single OST dropping out of service

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
    • Clustering
    • 4043

    Description

      Hello Support,

      One of our customers, at the University of Delaware, has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service with:

      Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs

      The hang was so bad for one of them (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself, sometimes it does not and we need to reboot one or more OSS nodes.
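
      For reference, a minimal way to check from a client which OSTs it can currently reach (a hedged sketch, not part of the original report; assumes root on a Lustre 1.8 client, and the "headnode" prompt is only illustrative):

      # Ping every OST this client knows about; inactive targets are flagged
      [root@headnode ~]# lfs check osts
      # List the local Lustre devices and their current state
      [root@headnode ~]# lctl dl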

      "Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."

      You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):

      "https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"

      Full information for the drop-off:

      Claim ID: vuAFoSBUoReVuaje
      Claim Passcode: RfTmXJZFVdUGzbLk
      Date of Drop-Off: 2012-06-11 12:23:20-0400

      Please review the attached log files and advise us on the next course of action, since this is a very critical issue and is impacting their environment. Also, please let me know if you need any further information.

      Thanks
      Terry
      Penguin Tech Support
      Ph: 415-954-2833

      Attachments

        1. headnode-messages.gz
          623 kB
        2. lustre-failure-120619-1.gz
          21 kB
        3. mds0a-messages.gz
          5 kB
        4. oss3-vmstat.log
          9 kB
        5. oss4-messages.gz
          52 kB


          Activity

            green Oleg Drokin added a comment -

            I believe your problem does not look like LU-416, which was a client-side problem (and one specific to 2.x releases only, at that).
            You are clearly having a problem on the server side.
            It looks like your disk devices are overloaded with small IO, and there is very little Lustre can do about it, because apparently this is what the application in question wants.


            frey@udel.edu Jeffrey Frey added a comment -

            Cliff:

            Further research on your Jira site has shown that the issue we're seeing is EXACTLY the situation reported in LU-416:

            http://jira.whamcloud.com/browse/LU-416

            To what version of Lustre does that incident correspond – 1.8.6? – and is it resolved in 1.8.8?

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::

            millerbe@udel.edu Ben Miller added a comment -

            We continue to have temporary OST unavailability with the read cache disabled. We haven't had a chance to upgrade to 1.8.8 yet. Sometimes the OST will recover within a few minutes; other times, after several minutes (or hours), we end up rebooting an OSS or two to get Lustre available again.

            Ben


            cliffw Cliff White (Inactive) added a comment -

            Please update us as to your status. What else can we do to help on this issue? Should we close this bug?


            cliffw Cliff White (Inactive) added a comment -

            What is your current state? What else can we do to assist?


            cliffw Cliff White (Inactive) added a comment -

            We have never had data loss from a rolling upgrade, or from a point upgrade of this type. We routinely test these point-release upgrades/downgrades as part of the release process for each new Lustre release.

            If there are other dependencies involved (such as QLogic tools) then it would be advisable to involve Penguin in the decision-making process about how to proceed.

            While upgrading your Lustre version provides the benefit of the other fixes it includes, it is quite possible to apply the exact fix you need to your existing version of Lustre, which may be less disruptive to other layers of the stack.

            Our clients use a stock RedHat kernel, and you can certainly do a gradual upgrade of the clients as your schedule permits; there is no problem running with a mix of client versions.

            Before embarking on such an upgrade path you could even experiment with a single client to see whether the pay-off warrants the effort. To do this effectively you would need to identify an example of the kind of workload that triggers this issue so that you can measure the before-and-after effect of applying this fix. Do you have any idea as to a likely candidate for such a test?
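
            As an illustration of the kind of before-and-after measurement suggested above, here is a minimal small-file timing loop (a sketch only; the /lustre/iotest path, the file count, and the "headnode" prompt are illustrative and not from this ticket). It roughly mimics the 4k, sync-heavy write pattern described in the IO-statistics analysis:

            # Create 1000 single-4k-block files with an fsync per file, and time it
            [root@headnode ~]# mkdir -p /lustre/iotest
            [root@headnode ~]# time sh -c 'for i in $(seq 1 1000); do dd if=/dev/zero of=/lustre/iotest/f$i bs=4k count=1 conv=fsync 2>/dev/null; done'
            # Time the metadata side as well (listing the newly created files)
            [root@headnode ~]# time ls -l /lustre/iotest > /dev/null

            Running the same loop before and after a change (read cache off, 1.8.8 upgrade, etc.) gives a rough but repeatable comparison.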

            frey@udel.edu Jeffrey Frey added a comment -

            QLogic's OFED stack is a superset of the stock OFED distribution. It includes value-added utilities/libraries and bug-fixes [that OFED hypothetically rolls back into the baseline distribution].

            When we initially set up the head node with ScientificLinux 6.1, the stock RDMA kernel module kept the machine from booting (kernel panic every time). Wiping the RHEL IB stack and replacing it with the QLogic one for RHEL 6.1 fixed the issue.

            As configured by Penguin, the LUSTRE servers are all running CentOS 5.x.

            [root@oss1 ~]# ibstat
            CA 'qib0'
            CA type: InfiniPath_QLE7340
            Number of ports: 1
            Firmware version:
            Hardware version: 2
            Node GUID: 0x001175000077592a
            System image GUID: 0x001175000077592a
            Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 40
            Base lid: 219
            LMC: 0
            SM lid: 1
            Capability mask: 0x07610868
            Port GUID: 0x001175000077592a
            Link layer: IB

            So in your (Whamcloud's) experience there have never been any instances of data loss due to upgrading like this on-the-fly while the filesystem is still online?

            We're using the loadable module variant of the client rather than building a LUSTRE kernel for the clients or using a Whamcloud kernel. Is there anything about the 1.8.8 server that demands 1.8.8 clients? Upgrading all clients = lots of downtime for our users and they've already experienced a LOT of downtime/lost time thanks to LUSTRE.

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::


            cliffw Cliff White (Inactive) added a comment -

            As far as we know, QLogic cards are supported by the OFED supplied with the RedHat kernel - was there a reason given for using an external OFED?
            Please give us the model numbers of the cards.
            Your upgrade procedure is correct, and identical to that in the Lustre Manual. The change from 1.8.6 to 1.8.8 is a point release; there are absolutely no concerns about compatibility and no need for any special work. The clients may require a RedHat kernel update to match our client kernel.
            There is, as far as I know, not much information beyond the Lustre Manual; the upgrade should not be difficult.
            Please let us know when you are planning this, and we'd be glad to have a conference call or furnish other help.
            Has disabling the read cache produced any change in the performance?
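
            As a side note on mixed client/server versions, a quick way to confirm which Lustre version each node is actually running (a sketch; either form should work on 1.8.x):

            # Query the version through lctl
            [root@oss1 ~]# lctl get_param version
            # Or read it directly from procfs
            [root@oss1 ~]# cat /proc/fs/lustre/version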

            frey@udel.edu Jeffrey Frey added a comment -

            Cliff:

            With respect to upgrading the LUSTRE release on these production servers:

            (1) Since these systems have QLogic's own OFED installed on them, we would want to build the LUSTRE 1.8.8 server components from scratch, correct? It appears this is what Penguin did when the system was built:

            [root@oss1 src]# cd /usr/src
            [root@oss1 src]# ls
            aacraid-1.1.7.28000 debug kernels ofa_kernel ofa_kernel-1.5.3 openib redhat
            [root@oss1 src]# ls kernels/
            2.6.18-238.12.1.el5_lustre.g266a955-x86_64 2.6.18-238.el5-x86_64

            (2) For rolling upgrades, we're assuming the general order of operations would be:

            • upgrade spare MDS
            • failover MDS to upgraded spare, upgrade primary MDS
            • failover oss1 => oss2; upgrade oss1; failover oss2 => oss1; upgrade oss2
            • failover oss3 => oss4; upgrade oss3; failover oss4 => oss3; upgrade oss4

            We'd appreciate citations of any important/helpful materials on the subject of LUSTRE rolling upgrades.

            Thanks for any and all information/feedback you can provide.

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
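
            For reference, a rough sketch of the out-of-tree server build implied by the /usr/src listing above: point Lustre's configure at the Lustre-patched kernel headers and at the external (QLogic) OFED tree. The lustre-1.8.8-wc1 source directory name is illustrative, and this assumes the matching lustre-patched kernel-devel tree is already installed; it is an assumption about the build, not a verified recipe:

            [root@oss1 ~]# cd /usr/src/lustre-1.8.8-wc1
            # --with-linux selects the (patched) kernel to build against; --with-o2ib points at the external OFED tree
            [root@oss1 lustre-1.8.8-wc1]# ./configure \
                --with-linux=/usr/src/kernels/2.6.18-238.12.1.el5_lustre.g266a955-x86_64 \
                --with-o2ib=/usr/src/ofa_kernel
            [root@oss1 lustre-1.8.8-wc1]# make rpms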

            millerbe@udel.edu Ben Miller added a comment -

            OK, thanks. I turned off the read cache. We'll let you know if it helps. We'll discuss upgrading internally soon.

            Ben

            cliffw Cliff White (Inactive) added a comment - edited

            Thanks for providing the information so far. Analyzing the results has been useful in understanding what is going on.

            From the IO statistics, we are seeing a great deal of small file IO (4k io size), and very little parallel IO (most of the time only 1 IO in flight). This is not an optimal IO model for Lustre - any steps you can take on the application side to increase IO size, eliminate excessive flush() or sync() calls, or otherwise allow the filesystem to aggregate larger IO will help to improve your performance.

            Given this IO pattern the Lustre read cache – which is on by default - may be doing more harm than good. To turn it off please run the command "lctl set_param obdfilter.*.read_cache_enable=0" on all OSS nodes.

            Finally, we recommend an immediate upgrade to Lustre 1.8.8-wc1 as this release contains optimizations for small file IO (see http://jira.whamcloud.com/browse/LU-983).  The Lustre 1.8.6-wc1 and 1.8.8-wc1 releases are completely compatible, so you can do a rolling upgrade of your systems without needing any downtime.

            Please let me know if you have any further questions about any of the above and let us know whether this advice helps.
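
            As a small follow-up to the read-cache suggestion above, a sketch of applying and then verifying the setting on each OSS (only the set_param command comes from the comment above; the get_param line is simply the corresponding read-back):

            # Disable the OSS read cache on every OST served by this node
            [root@oss1 ~]# lctl set_param obdfilter.*.read_cache_enable=0
            # Confirm it took effect; every obdfilter instance should now report 0
            [root@oss1 ~]# lctl get_param obdfilter.*.read_cache_enable

            Note that set_param changes are not persistent across an OSS reboot, so the setting would need to be reapplied (or made permanent) after a restart.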


            People

              cliffw Cliff White (Inactive)
              adizon Archie Dizon