Lustre / LU-1504

the /lustre filesystem was unusable for an extended period due to a single OST's dropping out of service

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
    • Component/s: Clustering
    • 4043

    Description

      Hello Support,

      One of our customers at the University of Delaware has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service with:

      Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs

      The hang was so bad in one instance (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself; sometimes it does not, and we need to reboot one or more OSS nodes.
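
      (For reference, the standard client tools can show which OST an affected client currently considers unavailable; a minimal shell sketch, run on a client with lfs and lctl installed:)

      # List OSTs with their usage; an unavailable OST is reported as inactive or errors out:
      lfs df /lustre
      # Ping the MDS and every OST from this client:
      lfs check servers
      # Show this client's device list, including the state of each OSC:
      lctl dl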

      "Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."

      You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):

      "https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"

      Full information for the drop-off:

      Claim ID: vuAFoSBUoReVuaje
      Claim Passcode: RfTmXJZFVdUGzbLk
      Date of Drop-Off: 2012-06-11 12:23:20-0400

      Please review the attached log files and provide us with the next course of action, since this is a very critical issue and is impacting their environment. Also, please let me know if you need any further information.

      Thanks
      Terry
      Penguin Tech Support
      Ph: 415-954-2833

      Attachments

        1. headnode-messages.gz
          623 kB
        2. lustre-failure-120619-1.gz
          21 kB
        3. mds0a-messages.gz
          5 kB
        4. oss3-vmstat.log
          9 kB
        5. oss4-messages.gz
          52 kB


          Activity

            [LU-1504] the /lustre filesystem was unusable for an extended period due to a single OST's dropping out of service
            cliffw Cliff White (Inactive) added a comment - - edited

            The random 4k IO was identified in the brw_stats output which you furnished to us. To quote that comment (18/Jun/12 11:18 AM):
            "2. On each server for each OST, there is a 'brw_stats' file located on the path
            /proc/fs/lustre/obdfilter/<OST ID>/brw_stats where 'OST ID' is of the format lustre-OST00XX. for each of your OSTs, from a login on the OSS please issue the command:
            '# cat /proc/fs/lustre/obdfilter/<OST ID>/brw_stats > <file>'
            where <file> is the OST ID. Please do this when the server load is high. "

            You can examine these files using that method at any time to see how your IO is performing, and can clear the data by echoing 'clear' into the file.
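
            As a convenience, both the collection and the reset can be scripted on each OSS; a minimal shell sketch, assuming the stock /proc paths above and writing one file per OST under /tmp:

            # Capture brw_stats for every OST served by this OSS (run while load is high):
            for ost in /proc/fs/lustre/obdfilter/lustre-OST*; do
                cat "$ost/brw_stats" > /tmp/"$(basename "$ost")".brw_stats
            done
            # Reset the counters before starting a new measurement window:
            for ost in /proc/fs/lustre/obdfilter/lustre-OST*; do
                echo clear > "$ost/brw_stats"
            done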
            In the data you furnished to us previously, we see this, as an example:

            # cat lustre-OST0012
            snapshot_time:         1340104119.496271 (secs.usecs)
            
                                       read      |     write
            pages per bulk r/w     rpcs  % cum % |  rpcs  % cum %
            1:                  491976  73  73   | 408911  59  59
            2:                    4588   0  74   | 16816   2  62
            4:                    1431   0  74   | 10062   1  63
            8:                    1156   0  74   | 19221   2  66
            16:                   1077   0  74   | 7433   1  67
            32:                   1731   0  74   | 13036   1  69
            64:                   2898   0  75   | 8544   1  70
            128:                  5149   0  76   | 12009   1  72
            256:                160523  23 100   | 187956  27 100
            

            A 'page' in this case is a 4k memory page. So we can see here that 73% of your read IO and 59% of your write IO was 4k in size. This is rather more than we would expect from the odd ls, etc., and likely indicates an application behavior. Again, this is why we advise the 1.8.8 upgrade. The 1.8.8 release has been out for quite some time and is stable in numerous production deployments.
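
            If it helps to track this over time, the single-page row can be pulled out of every OST's brw_stats in one pass; a small shell sketch, assuming the 1.8.x layout shown above (column positions may differ on other releases):

            # Report the count and percentage of single-page (4k) bulk RPCs per OST;
            # the first "1:" row in brw_stats is the 'pages per bulk r/w' histogram:
            for f in /proc/fs/lustre/obdfilter/lustre-OST*/brw_stats; do
                echo "== $f"
                awk '$1 == "1:" { printf "  4k reads: %s RPCs (%s%%)   4k writes: %s RPCs (%s%%)\n", $2, $3, $6, $7; exit }' "$f"
            done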

            frey@udel.edu Jeffrey Frey added a comment -

            Thanks, Cliff. We cannot "jump" into the 1.8.8 upgrade for two important reasons:

            (1) We like to be very aware of what's changed in a software package before we install it on a production system – so we have some reading ahead of us to get familiar with the 1.8.6 to 1.8.8 roadmap

            (2) This is a production system with users' computation work running out of the Lustre filesystem – any disruption is a major issue for our users, so we try to minimize those.

            In theory a rolling upgrade works flawlessly – but in theory the hardware we were sold behind this Lustre filesystem is enterprise-class, fast storage, too :-\ So our users have to be notified of the possibility of failure, which means that if there could be downtime involved we may as well do other patching at the same time. It just so happens that QLogic released a firmware update for our switch recently. We plan to do the 1.8.8 upgrade within the next month; we just have some planning and scheduling to do first.

            As for the random 4k i/o: for our own understanding, can you show the applicable portions of the logs/debug traces that led to that diagnosis? In context, is it all block-level i/o to/from files, or could it also be standard filesystem ops (ls, cp, tar, etc.), e.g. fetching size metadata from the OST(s)?

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::


            cliffw Cliff White (Inactive) added a comment -

            We have already recommended that you upgrade to 1.8.8 to improve small-file IO. We understand that you run a mix of workloads; we believe that if you could identify which workload is doing the random 4k IO, changes to that workload would help the situation.
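
            One way to narrow that down from the server side is to compare the per-client counters on a busy OST; a rough shell sketch, assuming the per-export stats files under obdfilter are available on your 1.8.x servers (lustre-OST0012 is used here only as an example):

            # List this OST's clients by NID with their read/write call counts,
            # sorted by write count (highest first):
            for s in /proc/fs/lustre/obdfilter/lustre-OST0012/exports/*/stats; do
                nid=$(basename "$(dirname "$s")")
                awk -v nid="$nid" '
                    $1 == "read_bytes"  { r = $2 }
                    $1 == "write_bytes" { w = $2 }
                    END { printf "%-30s reads=%-10s writes=%s\n", nid, r+0, w+0 }' "$s"
            done | sort -t= -k3 -nr | head -20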

            frey@udel.edu Jeffrey Frey added a comment -

            Interesting that this is a 2.x problem; the fact remains, though, that the symptoms we observe on the client side are EXACTLY the same as the ticket I mentioned, so we were hoping that would provide you folks with a better idea of what's going on. The problem seems like it has to be partly client-oriented, since 185 of the 200 Lustre clients were using the OST without issue while 15 of the clients showed it as unavailable. On the cluster head node where we noticed this recent outage:

            [root@mills lustre-OST0006-osc-ffff880d13d29000]# pwd
            /proc/fs/lustre/ldlm/namespaces/lustre-OST0006-osc-ffff880d13d29000

            [root@mills lustre-OST0006-osc-ffff880d13d29000]# more lock_*
            ::::::::::::::
            lock_count
            ::::::::::::::
            99554
            ::::::::::::::
            lock_unused_count
            ::::::::::::::
            0
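
            (For reference, the same two counters can be read for every OST namespace on a client in one pass; a small shell sketch against the /proc layout shown above:)

            # Print ldlm lock counts for each OSC namespace on this client:
            for ns in /proc/fs/lustre/ldlm/namespaces/lustre-OST*-osc-*; do
                printf "%-50s lock_count=%-10s lock_unused_count=%s\n" \
                    "$(basename "$ns")" "$(cat "$ns/lock_count")" "$(cat "$ns/lock_unused_count")"
            done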

            Should we read this statement as "use Lustre for applications with a couple of large files to access"? There is no single application here, since this is a community cluster with many different applications, and Lustre was intended by us to be used as working space for users' jobs on the cluster (à la scratch).

            On Jul 12, 2012, at 12:51 PM, Oleg Drokin (JIRA) wrote:

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::

            green Oleg Drokin added a comment -

            I believe your problem does not look like LU-416, which was a client-side problem (and one specific to 2.x releases only, at that).
            You are clearly having a problem on the server side.
            It looks like your disk devices are overloaded with small IO, and there is very little Lustre can do about it, because apparently this is what the application in question wants.


            frey@udel.edu Jeffrey Frey added a comment -

            Cliff:

            Further research on your Jira site has shown that the issue we're seeing is EXACTLY the situation reported in LU-416:

            http://jira.whamcloud.com/browse/LU-416

            To what version of Lustre does that incident correspond – 1.8.6? – and is it resolved in 1.8.8?

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::

            millerbe@udel.edu Ben Miller added a comment -

            We continue to have temporary OST unavailability with the read cache disabled. We haven't had a chance to upgrade to 1.8.8 yet. Sometimes the OST will recover within a few minutes; other times, after several minutes (or hours), we end up rebooting an OSS or two to get Lustre available again.

            Ben
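
            For reference, the read cache state mentioned above can be confirmed on each OSS with the 1.8.x obdfilter parameters; a small shell sketch:

            # Show the current read cache setting for every OST on this OSS
            # (1 = enabled, 0 = disabled):
            lctl get_param obdfilter.*.read_cache_enable
            # It can be toggled again with set_param if needed, e.g.:
            # lctl set_param obdfilter.*.read_cache_enable=0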


            cliffw Cliff White (Inactive) added a comment -

            Please update us as to your status. What else can we do to help on this issue? Should we close this bug?


            cliffw Cliff White (Inactive) added a comment -

            What is your current state? What else can we do to assist?


            cliffw Cliff White (Inactive) added a comment -

            We have never had data loss from a rolling upgrade, or from a point upgrade of this type. We routinely test these point-release upgrades and downgrades as part of the release process for each new Lustre release.

            If there are other dependencies involved (such as QLogic tools), then it would be advisable to involve Penguin in the decision-making process about how to proceed.

            While upgrading your Lustre version provides the benefit from other fixes included, it is quite possible to apply the exact fix you need to your existing version of Lustre, which may be less disruptive on other layers of the stack.

            Our clients use a stock Red Hat kernel, and you can certainly do a gradual upgrade of the clients as your schedule permits; there is no problem running with a mix of client versions.

            Before embarking on such an upgrade path you could even experiment with a single client to see whether the pay-off warrants the effort. To do this effectively you would need to identify an example of the kind of workload that triggers this issue so that you can measure the before and after effect of applying this fix. Do you have any idea as to a likely candidate for such a test?
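
            One simple candidate for such a before/after measurement would be a timed small-file write loop run from a single client; a sketch only, with the directory path and file count as placeholders:

            # Time the creation of 1000 small (4k) files from one client, before and
            # after the upgrade, ideally while the filesystem is otherwise quiet:
            testdir=/lustre/iotest.$$        # placeholder path on the Lustre mount
            mkdir -p "$testdir"
            time for i in $(seq 1 1000); do
                dd if=/dev/zero of="$testdir/file.$i" bs=4k count=1 conv=fsync 2>/dev/null
            done
            rm -rf "$testdir"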

            frey@udel.edu Jeffrey Frey added a comment -

            QLogic's OFED stack is a superset of the stock OFED distribution. It includes value-added utilities/libraries and bug-fixes [that OFED hypothetically rolls back into the baseline distribution].

            When we initially set up the head node with Scientific Linux 6.1, the stock RDMA kernel module kept the machine from booting (kernel panic every time). Wiping the RHEL IB stack and replacing it with the QLogic one for RHEL 6.1 fixed the issue.

            As configured by Penguin, the LUSTRE servers are all running CentOS 5.x.

            [root@oss1 ~]# ibstat
            CA 'qib0'
            CA type: InfiniPath_QLE7340
            Number of ports: 1
            Firmware version:
            Hardware version: 2
            Node GUID: 0x001175000077592a
            System image GUID: 0x001175000077592a
            Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 40
            Base lid: 219
            LMC: 0
            SM lid: 1
            Capability mask: 0x07610868
            Port GUID: 0x001175000077592a
            Link layer: IB

            So in your (Whamcloud's) experience there have never been any instances of data loss due to upgrading like this on-the-fly while the filesystem is still online?

            We're using the loadable module variant of the client rather than building a LUSTRE kernel for the clients or using a Whamcloud kernel. Is there anything about the 1.8.8 server that demands 1.8.8 clients? Upgrading all clients = lots of downtime for our users and they've already experienced a LOT of downtime/lost time thanks to LUSTRE.

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::


            People

              cliffw Cliff White (Inactive)
              adizon Archie Dizon
              Votes: 0
              Watchers: 5
