Details
- Type: Task
- Resolution: Fixed
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
- Labels: Clustering
- 4043
Description
Hello Support,
One of the customers at the University of Delaware has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service with:
Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs
The hang was so bad for one of them (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself, sometimes it does not and we need to reboot one or more OSS nodes.
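For reference, on the OSS side the symptom shows up as the "refuse reconnection" message above; a quick check such as the following (a sketch, assuming syslog is written to the default /var/log/messages) shows whether it is recurring and how often:
'# grep -c "refuse reconnection" /var/log/messages'
'# grep "still busy with" /var/log/messages | tail -n 5'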
"Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."
You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):
"https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"
Full information for the drop-off:
Claim ID: vuAFoSBUoReVuaje
Claim Passcode: RfTmXJZFVdUGzbLk
Date of Drop-Off: 2012-06-11 12:23:20-0400
Please review the attached log files and advise us on the next course of action, since this is a very critical issue and is impacting their environment. Also, please let me know if you need any further information.
Thanks
Terry
Penguin Tech Support
Ph: 415-954-2833
Attachments
Issue Links
- Trackbacks
Lustre 1.8.x known issues tracker: While testing against the Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla (https://bugzilla.lustre.org/). In order to move away from relying on Bugzilla, we would create a JIRA...
The random 4k IO was identified in the brw_stats output which you furnished to us; to quote that comment (18/Jun/12 11:18 AM):
"2. On each server for each OST, there is a 'brw_stats' file located on the path
/proc/fs/lustre/obdfilter/<OST ID>/brw_stats where 'OST ID' is of the format lustre-OST00XX. for each of your OSTs, from a login on the OSS please issue the command:
'# cat /proc/fs/lustre/obdfilter/<OST ID>/brw_stats > <file>'
where <file> is the OST ID. Please do this when the server load is high. "
You can examine these files using that method at any time to see how your IO is performing, and can clear the data by echoing 'clear' into the file.
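As a convenience, a loop along the following lines (a minimal sketch, assuming the standard /proc layout quoted above, the fsname "lustre" seen in your logs, and /tmp as the output directory) takes one snapshot per OST and then resets the counters for the next sample:

for ost in /proc/fs/lustre/obdfilter/lustre-OST*; do
    name=$(basename "$ost")                        # e.g. lustre-OST0016
    cat "$ost/brw_stats" > /tmp/"$name".brw_stats  # snapshot taken while load is high
    echo clear > "$ost/brw_stats"                  # clear the histogram for the next sample
done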
In the data you previously furnished to us, we see this, as an example:
A 'page' in this case is a 4k memory page, so we can see here that 73% of your read IO and 59% of your write IO was 4k in size. This is rather more than we would expect from the odd ls, etc., and likely indicates an application behavior. Again, this is why we advise the 1.8.8 upgrade. The 1.8.8 release has been out for quite some time and is stable in numerous production deployments.
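To pull that same figure out of any saved snapshot, the 'pages per bulk r/w' section at the top of the file can be printed directly (a sketch; the file name is the hypothetical per-OST snapshot written above, and the line count after the header can be adjusted to taste):
'# grep -A 12 "pages per bulk r/w" /tmp/lustre-OST0016.brw_stats'
The "1:" row is the single-page (4k) bucket, with rpcs, %, and cumulative % for reads followed by the same columns for writes; those are the read and write percentages quoted above.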