Lustre / LU-5724

IR recovery doesn't behave properly with Lustre 2.5

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Environment: MDS server running RHEL 6.5 with the ORNL 2.5.3 branch plus about 12 patches.
    • Severity: 2
    • 16076

    Description

      Today we experienced a hardware failure with our MDS. The MDS rebooted and then came back. We restarted the MDS, but IR behaved strangely. Four clients got evicted, but when the recovery timer counted down to zero, IR restarted all over again. Then, once the timer got into the 700-second range, it started to go up. It did this a few times before the timer was allowed to run out. Once the timer did finally reach zero, the recovery state was still reported as being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all, it took 54 minutes to recover.


          Activity

            yujian Jian Yu added a comment -

            Closing this ticket as a duplicate of LU-4119.

            yujian Jian Yu added a comment -

            Large-scale testing passed.

            Hi James, can we close this ticket now?

            yujian Jian Yu added a comment -

            With the patch for LU-4119, the recovery issue did not occur in small-scale testing at ORNL. Large-scale testing will be performed.

            hongchao.zhang Hongchao Zhang added a comment - - edited

            Is the failover mode the same for both tests, on Dec 31, 2014 and on Jan 14, 2015? That is, a separate node running only the MGS, to which the MDS, OSSs, and client nodes connect, with the MDS and OSSs failed over together?


            simmonsja James A Simmons added a comment -

            No, the MGS is left up. We failed over the MDS and OSS together.

            hongchao.zhang Hongchao Zhang added a comment -

            Are both the MGS and MDS failed over in this test?

            The IR status will be set to IR_STARTUP after the MGS is started, and will change to IR_FULL after "ir_timeout" seconds
            (the default is OBD_IR_MGS_TIMEOUT = 4 * obd_timeout). A target (MDT or OST) registered with the MGS will only be marked
            "LDD_F_IR_CAPABLE" if the IR status is IR_FULL; otherwise "IR" will be printed as "DISABLED".

            On the client side, imperative_recovery will be marked "Enabled" if the connection with the server supports recovery
            (i.e., imp->imp_connect_data.ocd_connect_flags & OBD_CONNECT_IMP_RECOV is set).
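
            These states can also be checked from the command line; a minimal sketch follows (mgc.*.ir_state is the same file that is cat'ed from /proc later in this thread, while the fsname "atlastds" and the mgs.MGS.ir_timeout parameter name are assumptions to adjust for your setup):

            # On the MGS: per-filesystem status, including imperative recovery state
            # ("atlastds" stands in for the real fsname).
            lctl get_param mgs.MGS.live.atlastds
            # On any node with an MGC (MDS, OSS, or client): the current IR state.
            lctl get_param mgc.*.ir_state
            # ir_timeout (assumed parameter name; the default is 4 * obd_timeout,
            # and obd_timeout itself is exposed as the top-level "timeout" parameter).
            lctl get_param mgs.MGS.ir_timeout
            lctl get_param timeout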

            simmonsja James A Simmons added a comment - - edited

            Testing failover, we see the following with just the MDS being failed over:

            Every 1.0s: cat recovery_status    Wed Jan 14 14:11:11 2015

            status: COMPLETE
            recovery_start: 1421262582
            recovery_duration: 60
            completed_clients: 82/82
            replayed_requests: 0
            last_transno: 90194315377
            VBR: DISABLED
            IR: DISABLED

            All clients are 2.5+, so there should be no reason for IR to be disabled. Can you reproduce this problem on your side?

            On the MGS we see the following during failover:

            [root@atlas-tds-mds1 MGC10.36.226.79@o2ib]# cat ir_state
            imperative_recovery: ENABLED
            client_state:

            { client: atlastds-MDT0000, nidtbl_version: 957 }
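            For reference, the "Every 1.0s: cat recovery_status" header above is watch(1) output. A minimal sketch of the same kind of monitoring through lctl (the wildcards resolve to the actual target names, which will differ per system):

            # On the MDS: watch MDT recovery progress once per second.
            watch -n 1 'lctl get_param mdt.*.recovery_status'
            # On an OSS: the equivalent for OST targets.
            watch -n 1 'lctl get_param obdfilter.*.recovery_status'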

            hongchao.zhang Hongchao Zhang added a comment -

            Sorry for the misunderstanding! Yes, Lustre supports failing over both the MDT and OSS.
            yujian Jian Yu added a comment -

            In the Lustre test suite, the following sub-tests in insanity.sh test failing the MDS and OSS at the same time:

            run_test 2 "Second Failure Mode: MDS/OST `date`"
            run_test 4 "Fourth Failure Mode: OST/MDS `date`"
            

            The basic test steps for sub-test 2 are:

            fail MDS
            fail OSS
            start OSS
            start MDS
            

            And the basic test steps for sub-test 4 are:

            fail OSS
            fail MDS
            start OSS
            start MDS
            

            Here is the insanity test report for Lustre b2_5 build #107: https://testing.hpdd.intel.com/test_sets/bfd812b0-8a4d-11e4-a10b-5254006e85c2
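
            For anyone reproducing these failure modes, a minimal sketch (assuming a lustre/tests checkout with cfg/local.sh already describing the MDS, OSS, and client nodes):

            # Run only insanity.sh sub-tests 2 and 4, the combined MDS/OSS failure modes.
            cd lustre/tests
            ONLY="2 4" sh insanity.sh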

            green Oleg Drokin added a comment -

            I think what Hongchao is trying to say is that when the MDS and OST both go down, then since the MDS is a client of the OST, the OST recovery can never complete because of the missing client.

            But on the other hand, we took two steps to help with this. First, the MDS client UUID should always be the same, so even after a restart it should still be allowed to reconnect as the old known client (this assumes it actually got up and into a reconnecting state in time for OST recovery; if your MDS takes ages to reboot, for example, it might miss this window, especially if it's a shortened window thanks to IR).

            Second, we have VBR to deal with missing clients during recovery, which is especially easy with the MDT client, since it never has any outstanding uncommitted transactions to replay.


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 16
