Details
Type: Task
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
Component/s: Clustering
4043
Description
Hello Support,
One of our customers at the University of Delaware has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service with:
Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs
The hang was so bad in one instance (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself; sometimes it does not, and we need to reboot one or more OSS nodes.
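For reference, the stuck OST is easy to confirm from an affected client before rebooting anything; the following is a minimal sketch using standard Lustre client tools (paths and parameter names assume a stock 1.8/2.x client):
# Ping every MDS/OSS target from this client; a hung OST reports an error
lfs check servers
# Inspect the per-OST import state; anything other than FULL
# (e.g. DISCONN, CONNECTING) identifies the bad connection
lctl get_param osc.*.state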
"Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."
You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):
"https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"
Full information for the drop-off:
Claim ID: vuAFoSBUoReVuaje
Claim Passcode: RfTmXJZFVdUGzbLk
Date of Drop-Off: 2012-06-11 12:23:20-0400
Please review the attached log files and advise us on the next course of action, since this is a very critical issue impacting their environment. Also, please let me know if you need any further information.
Thanks
Terry
Penguin Tech Support
Ph: 415-954-2833
Attachments
Issue Links
- Trackbacks
- Lustre 1.8.x known issues tracker: While testing against the Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla (https://bugzilla.lustre.org/). In order to move away from relying on Bugzilla, we would create a JIRA
Interesting that this is a 2.x problem; the fact remains, though, that the symptoms we observe on the client side are EXACTLY the same as the ticket I mentioned, so we were hoping that would provide you folks with a better idea of what's going on. The problem seems to be at least partly client-side, since 185 of the 200 Lustre clients were using the OST without issue while 15 of the clients showed it as unavailable. On the cluster head node where we noticed this recent outage:
[root@mills lustre-OST0006-osc-ffff880d13d29000]# pwd
/proc/fs/lustre/ldlm/namespaces/lustre-OST0006-osc-ffff880d13d29000
[root@mills lustre-OST0006-osc-ffff880d13d29000]# more lock_*
::::::::::::::
lock_count
::::::::::::::
99554
::::::::::::::
lock_unused_count
::::::::::::::
0
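The same counters can be pulled for every OST connection at once, which makes an outlier like the 99554 granted locks above stand out; a small sketch using the standard ldlm parameter names:
# Granted vs. unused LDLM lock counts for all OSC namespaces on this client
lctl get_param ldlm.namespaces.*osc*.lock_count \
               ldlm.namespaces.*osc*.lock_unused_count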
Should we read this statement as "use Lustre for applications with a couple of large files to access"? There is no single application here, since this is a community cluster with many different applications, and we intended Lustre to be used as working space for users' jobs on the cluster (à la scratch).
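If the huge per-client lock counts turn out to be part of the problem, one standard knob worth noting here (a hedged sketch, not a fix prescribed in this ticket; the value is illustrative) is the client-side LDLM lock LRU size:
# Cap each OSC namespace's lock LRU at 2000 locks (illustrative value)
lctl set_param ldlm.namespaces.*osc*.lru_size=2000
# Setting lru_size back to 0 re-enables dynamic LRU sizing
lctl set_param ldlm.namespaces.*osc*.lru_size=0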
On Jul 12, 2012, at 12:51 PM, Oleg Drokin (JIRA) wrote:
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer IV / Cluster Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034 Mobile: (302) 419-4976
http://turin.nss.udel.edu/
::::::::::::::::::::::::::::::::::::::::::::::::::::::