Loading...

XML

Word

Printable

Details

Type: Task
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
Labels:
- performance
Environment:
Clustering

Epic:
- hang
Rank (Obsolete):
4043

Description

Hello Support,

One of customer at University of Delaware had at least three separate instances where the /lustre filesystem was unusable for an extended period due to a single OST's dropping out of service due to:

Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs

The hang was so bad for one of them (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself, sometimes it does not and we need to reboot one or more OSS nodes.

"Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."

You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):

"https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"

Full information for the drop-off:

Claim ID: vuAFoSBUoReVuaje
Claim Passcode: RfTmXJZFVdUGzbLk
Date of Drop-Off: 2012-06-11 12:23:20-0400

Please review the attached log files and provide us the next course of action since it's very critical issue and impacting their environment? Also please let me know
if you need any further info?

Thanks
Terry
Penguin Tech Support
Ph: 415-954-2833

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

headnode-messages.gz
19/Jun/12 10:27 AM
623 kB
Ben Miller
lustre-failure-120619-1.gz
19/Jun/12 8:08 AM
21 kB
Ben Miller
mds0a-messages.gz
19/Jun/12 10:26 AM
5 kB
Ben Miller
oss3-vmstat.log
19/Jun/12 11:38 AM
9 kB
Ben Miller
oss4-messages.gz
19/Jun/12 10:26 AM
52 kB
Ben Miller

Issue Links

Trackbacks

Lustre 1.8.x known issues tracker While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA

Activity

People

Assignee:: Cliff White (Inactive)

Reporter:: Archie Dizon

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 11/Jun/12 1:04 PM

Updated:: 15/Feb/13 5:32 PM

Resolved:: 15/Feb/13 5:32 PM