Lustre / LU-1504

the /lustre filesystem was unusable for an extended period due to a single OST's dropping out of service

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
    • Clustering
    • 4043

    Description

      Hello Support,

      One of our customers at the University of Delaware has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service with:

      Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs

      The hang was so bad for one of them (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself, sometimes it does not and we need to reboot one or more OSS nodes.

      "Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."

      You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):

      "https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"

      Full information for the drop-off:

      Claim ID: vuAFoSBUoReVuaje
      Claim Passcode: RfTmXJZFVdUGzbLk
      Date of Drop-Off: 2012-06-11 12:23:20-0400

      Please review the attached log files and advise us on the next course of action, since this is a very critical issue and is impacting their environment. Also, please let me know if you need any further information.

      Thanks
      Terry
      Penguin Tech Support
      Ph: 415-954-2833

      Attachments

        1. headnode-messages.gz
          623 kB
        2. lustre-failure-120619-1.gz
          21 kB
        3. mds0a-messages.gz
          5 kB
        4. oss3-vmstat.log
          9 kB
        5. oss4-messages.gz
          52 kB

        Issue Links

          Activity

            [LU-1504] the /lustre filesystem was unusable for an extended period due to a single OST's dropping out of service

            cliffw Cliff White (Inactive) added a comment -

            Please reopen this issue if you have further comments/questions.
            green Oleg Drokin added a comment -

            I wonder about the "185 of 200 clients have no problem accessing the OST" case you mention. Could it be the case that only 15 "problematic" clients are actually doing any IO on that OST and the rest happen to do IO on other OSTs?
            After all, an overloaded OST is an overloaded OST. Lustre cannot really improve disk controller properties, and if the controller hates lots of small IO (and frankly, a lot of them do), there is very little we can do about it.

            People in big government labs have these problems too, and the solutions range from changing hardware, to limiting "bad-behaving" applications to only certain parts of the machine, to actually rewriting the apps to form a better IO pattern.

            The 100k lock count should not be affecting anything as long as you don't run out of memory on the OSS nodes, but looking at the vmstat output it's evident you don't have this problem.

            brw_stats output shows a lot of 4k IO, and it only tracks data IO by applications.
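
            One way to check which clients are actually issuing IO against a given OST is to look at the per-export statistics on the OSS. A minimal sketch, assuming the 1.8.x per-export stats files are present under /proc/fs/lustre/obdfilter/<OST>/exports/ (directory naming and stats layout can differ between releases; the OST name is illustrative):

            # List per-client read/write byte counters for one OST on this OSS.
            OST=lustre-OST0016
            for exp in /proc/fs/lustre/obdfilter/$OST/exports/*/stats; do
                client=$(basename "$(dirname "$exp")")
                io=$(grep -E '^(read|write)_bytes' "$exp" 2>/dev/null)
                # only print clients that have actually done data IO
                [ -n "$io" ] && printf '%s\n%s\n\n' "$client" "$io"
            done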

            cliffw Cliff White (Inactive) added a comment - - edited

            The random 4k IO was identified in the brw_stats which you furnished to us; to quote that comment (18/Jun/12 11:18 AM):
            "2. On each server for each OST, there is a 'brw_stats' file located on the path
            /proc/fs/lustre/obdfilter/<OST ID>/brw_stats where 'OST ID' is of the format lustre-OST00XX. for each of your OSTs, from a login on the OSS please issue the command:
            '# cat /proc/fs/lustre/obdfilter/<OST ID>/brw_stats > <file>'
            where <file> is the OST ID. Please do this when the server load is high. "

            You can examine these files using that method at any time to see how your IO is performing, and can clear the data by echoing 'clear' into the file.
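
            For example, a minimal shell sketch of the collection step quoted above: snapshot brw_stats for every OST on an OSS into per-OST files, and optionally reset the counters afterwards (the output directory name is illustrative):

            # Capture brw_stats for all OSTs on this OSS while load is high.
            outdir=/tmp/brw_stats.$(date +%Y%m%d-%H%M)
            mkdir -p "$outdir"
            for f in /proc/fs/lustre/obdfilter/lustre-OST*/brw_stats; do
                ost=$(basename "$(dirname "$f")")      # e.g. lustre-OST0012
                cat "$f" > "$outdir/$ost"
            done
            # To start a fresh measurement window, clear the counters:
            # for f in /proc/fs/lustre/obdfilter/lustre-OST*/brw_stats; do echo clear > "$f"; done
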
            In the data you previously furnished to us, we see this, as an example:

            # cat lustre-OST0012
            snapshot_time:         1340104119.496271 (secs.usecs)
            
                                       read      |     write
            pages per bulk r/w     rpcs  % cum % |  rpcs  % cum %
            1:                  491976  73  73   | 408911  59  59
            2:                    4588   0  74   | 16816   2  62
            4:                    1431   0  74   | 10062   1  63
            8:                    1156   0  74   | 19221   2  66
            16:                   1077   0  74   | 7433   1  67
            32:                   1731   0  74   | 13036   1  69
            64:                   2898   0  75   | 8544   1  70
            128:                  5149   0  76   | 12009   1  72
            256:                160523  23 100   | 187956  27 100
            

            A 'page' in this case is a 4k memory page. So we can see here that 73% of your read IO and 59% of your write IO was 4k in size. This is rather more than we would expect from the odd ls, etc., and likely indicates application behavior. Again, this is why we advise the 1.8.8 upgrade. The 1.8.8 release has been out for quite some time and is stable in numerous production deployments.
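
            To pull that single-page share out of a captured file without reading the whole table, a minimal awk sketch (column positions follow the lustre-OST0012 sample above; later sections of brw_stats are ignored):

            # Print the percentage of 1-page (4k) read and write RPCs.
            awk '
                /pages per bulk/ { in_tbl = 1; next }   # header of the first table
                in_tbl && $1 == "1:" {                  # 1-page bucket = 4k RPCs
                    printf "4k reads: %s%%  4k writes: %s%%\n", $3, $7
                    exit
                }
            ' lustre-OST0012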

            frey@udel.edu Jeffrey Frey added a comment -

            Thanks, Cliff. We cannot "jump" into the 1.8.8 upgrade for two important reasons:

            (1) We like to be very aware of what's changed in a software package before we install it on a production system – so we have some reading ahead of us to get familiar with the 1.8.6 to 1.8.8 roadmap

            (2) This is a production system with users' computation work running out of the Lustre filesystem; any disruption is a major issue for our users, so we try to minimize those

            In theory a rolling upgrade works flawlessly; but in theory the hardware we were sold behind this Lustre filesystem is enterprise-class, fast storage, too :-\ So our users have to be notified of the possibility of failure, which means that if there could be downtime involved we may as well do other patching at the same time. It just so happens that QLogic released a firmware update for our switch recently. We plan to do the 1.8.8 upgrade within the next month; we just have some planning and scheduling to do first.

            As for the random 4k I/O, for our sake can you show the applicable portions of the logs/debug traces that led to that diagnosis? In context, is it all block-level I/O to/from files, or could it be standard filesystem ops (ls, cp, tar, etc.), e.g. also fetching size metadata from the OST(s)?

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::


            cliffw Cliff White (Inactive) added a comment -

            We have already recommended that you upgrade to 1.8.8 to improve small-file IO. We understand that you run a mix of workloads; we believe that if you could identify which workload is doing the random 4k IO, changes to that workload would help the situation.
            frey@udel.edu Jeffrey Frey added a comment -

            Interesting that this is a 2.x problem; the fact remains, though, that the symptoms we observe on the client side are EXACTLY the same as the ticket I mentioned, so we were hoping that would provide you folks with a better idea of what's going on. The problem seems like it has to be partly client-oriented, since 185 of the 200 Lustre clients were using the OST without issue while 15 of the clients showed it as unavailable. On the cluster head node where we noticed this recent outage:

            [root@mills lustre-OST0006-osc-ffff880d13d29000]# pwd
            /proc/fs/lustre/ldlm/namespaces/lustre-OST0006-osc-ffff880d13d29000

            [root@mills lustre-OST0006-osc-ffff880d13d29000]# more lock_*
            ::::::::::::::
            lock_count
            ::::::::::::::
            99554
            ::::::::::::::
            lock_unused_count
            ::::::::::::::
            0
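
            To see whether those ~100k locks are concentrated on a single OSC namespace, the same counters can be surveyed for every OST from the client. A minimal sketch using the paths shown above (the ffff... suffix varies per mount):

            # Print DLM lock counts for each OST as seen from this client.
            for ns in /proc/fs/lustre/ldlm/namespaces/lustre-OST*-osc-*; do
                printf '%-45s locks=%-8s unused=%s\n' "$(basename "$ns")" \
                    "$(cat "$ns/lock_count")" "$(cat "$ns/lock_unused_count")"
            done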

            Should we read this statement as "use Lustre for applications with a couple of large files to access"? There is no single application here, since this is a community cluster with many different applications, and Lustre was intended by us to be used as working space for users' jobs on the cluster (à la scratch).

            On Jul 12, 2012, at 12:51 PM, Oleg Drokin (JIRA) wrote:

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::

            green Oleg Drokin added a comment -

            I believe your problem does not look like LU-416, which was a client-side problem (and one specific to 2.x releases only, at that).
            You are clearly having a problem on the server side.
            It looks like your disk devices are overloaded with small IO, and there is very little Lustre can do about it, because apparently this is what the application in question wants.
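
            One way to confirm that the backing devices behind a busy OST are saturated is to watch extended iostat output on the OSS while the hang is happening. A minimal sketch (iostat comes from the standard sysstat package):

            # Sample extended device stats every 5 seconds; watch the rows for
            # the OST backing devices.
            iostat -xk 5
            # Sustained %util near 100, high await, and a small avgrq-sz are
            # consistent with the random-4k pattern seen in brw_stats.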


            frey@udel.edu Jeffrey Frey added a comment -

            Cliff:

            Further research on your Jira site has shown that the issue we're seeing is EXACTLY the situation reported in LU-416:

            http://jira.whamcloud.com/browse/LU-416

            To what version of Lustre does that incident correspond – 1.8.6? – and is it resolved in 1.8.8?

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::

            millerbe@udel.edu Ben Miller added a comment -

            We continue to have temporary OST unavailability with the read cache
            disabled. We haven't had a chance to upgrade to 1.8.8 yet. Sometimes
            the OST will recover within a few minutes; other times, after several
            minutes (or hours), we end up rebooting an OSS or two to get Lustre
            available again.

            Ben


            cliffw Cliff White (Inactive) added a comment -

            Please update us as to your status. What else can we do to help on this issue? Should we close this bug?

            People

              Assignee: cliffw Cliff White (Inactive)
              Reporter: adizon Archie Dizon
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: