Details
-
Task
-
Resolution: Fixed
-
Major
-
None
-
Lustre 2.0.0, Lustre 1.8.6
-
Clustering
-
4043
Description
Hello Support,
One of customer at University of Delaware had at least three separate instances where the /lustre filesystem was unusable for an extended period due to a single OST's dropping out of service due to:
Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs
The hang was so bad for one of them (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself, sometimes it does not and we need to reboot one or more OSS nodes.
"Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."
You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):
"https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"
Full information for the drop-off:
Claim ID: vuAFoSBUoReVuaje
Claim Passcode: RfTmXJZFVdUGzbLk
Date of Drop-Off: 2012-06-11 12:23:20-0400
Please review the attached log files and provide us the next course of action since it's very critical issue and impacting their environment? Also please let me know
if you need any further info?
Thanks
Terry
Penguin Tech Support
Ph: 415-954-2833
Attachments
Issue Links
- Trackbacks
-
Lustre 1.8.x known issues tracker While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA