  Lustre / LU-5724

IR recovery doesn't behave properly with Lustre 2.5

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.5.3
    • MDS server running RHEL 6.5 with the ORNL 2.5.3 branch plus about 12 patches.
    • 2
    • 16076

    Description

      Today we experienced a hardware failure with our MDS. The MDS rebooted and came back. We restarted the MDS, but IR behaved strangely. Four clients got evicted, but when the completion timer counted down to zero, IR restarted all over again. Then, once the timer got into the 700-second range, it started to go back up. It did this a few times before finally letting the timer run out. Once the timer did finally reach zero, the recovery state was still reported as being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all it took 54 minutes to recover.
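
      For reference, the recovery countdown described above can be watched from the servers. This is a minimal sketch, assuming the standard Lustre 2.5 proc layout; the targets are wildcarded rather than naming the actual atlastds targets:

        # On the MDS: watch the recovery countdown, completed-client counts,
        # and the VBR/IR flags during recovery.
        watch -n 1 'lctl get_param -n mdt.*.recovery_status'

        # On each OSS: same view for the OST targets.
        watch -n 1 'lctl get_param -n obdfilter.*.recovery_status'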

      Attachments

        Issue Links

          Activity

            [LU-5724] IR recovery doesn't behave properly with Lustre 2.5
            yujian Jian Yu added a comment -

            With the patch for LU-4119, the recovery issue did not occur in small-scale testing at ORNL. Large-scale testing will be performed.

            hongchao.zhang Hongchao Zhang added a comment - - edited

            Is the failover mode the same for both tests, on Dec 31, 2014 and on Jan 14, 2015? That is, is there a separate node
            running only the MGS, to which the MDS, OSSs and client nodes connect, with the MDS and OSSs failed over together?


            simmonsja James A Simmons added a comment -

            No, the MGS is left up. We failed over the MDS and OSS together.

            hongchao.zhang Hongchao Zhang added a comment -

            Are both MGS and MDS failed over in this test?

            The IR status will be set to IR_STARTUP after the MGS starts and will change to IR_FULL after "ir_timeout" seconds
            (the default is OBD_IR_MGS_TIMEOUT = "4*obd_timeout"). A target (MDT or OST) registering with the MGS will only be marked
            "LDD_F_IR_CAPABLE" if the IR status is IR_FULL; otherwise "IR" will be printed as "DISABLED" for that target.

            On the client side, imperative_recovery will be marked "ENABLED" if the connection with the server supports recovery
            (i.e. imp->imp_connect_data & OBD_CONNECT_IMP_RECOV is set).

            simmonsja James A Simmons added a comment - - edited

            Testing failover, we see the following with just the MDS being failed over:

            Every 1.0s: cat recovery_status                                Wed Jan 14 14:11:11 2015

            status: COMPLETE
            recovery_start: 1421262582
            recovery_duration: 60
            completed_clients: 82/82
            replayed_requests: 0
            last_transno: 90194315377
            VBR: DISABLED
            IR: DISABLED

            All clients are 2.5+, so there should be no reason for IR to be disabled. Can you reproduce this problem on your side?

            On the MGS we see the following during failover:

            [root@atlas-tds-mds1 MGC10.36.226.79@o2ib]# cat ir_state
            imperative_recovery: ENABLED
            client_state:

            • { client: atlastds-MDT0000, nidtbl_version: 957 }
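
            As a cross-check from the MGS side, the per-filesystem IR state and the IR timeout window can also be dumped. This is a sketch assuming the standard mgs.MGS.* parameters of Lustre 2.5; "atlastds" is the filesystem name used here:

            # Per-filesystem nidtbl/IR state on the MGS, including the
            # imperative_recovery_state section and any non-IR clients.
            lctl get_param mgs.MGS.live.atlastds

            # Window (seconds) before the MGS leaves IR_STARTUP; default 4 * obd_timeout.
            lctl get_param mgs.MGS.ir_timeout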

            hongchao.zhang Hongchao Zhang added a comment -

            Sorry for the misunderstanding! Yes, Lustre does support failing over both the MDT and the OSS.
            yujian Jian Yu added a comment -

            In Lustre test suite, the following sub-tests in insanity.sh test failing MDS and OSS at the same time:

            run_test 2 "Second Failure Mode: MDS/OST `date`"
            run_test 4 "Fourth Failure Mode: OST/MDS `date`"
            

            The basic test steps for sub-test 2 are:

            fail MDS
            fail OSS
            start OSS
            start MDS
            

            And the basic test steps for sub-test 4 are:

            fail OSS
            fail MDS
            start OSS
            start MDS
            

            Here is the insanity test report for Lustre b2_5 build #107: https://testing.hpdd.intel.com/test_sets/bfd812b0-8a4d-11e4-a10b-5254006e85c2
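
            For reference, these sub-tests can be run individually from lustre/tests using the usual ONLY variable; this is a sketch that assumes a multi-node test configuration with failover pairs (e.g. an adapted cfg/local.sh) is already in place:

            cd lustre/tests
            # Run only the MDS/OST and OST/MDS failure-mode sub-tests.
            ONLY="2 4" sh insanity.sh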

            green Oleg Drokin added a comment -

            I think what HongChao is trying to say is that when the MDS and OST both go down, then since the MDS is a client of the OST, the OST recovery can never complete because of the missing client.

            On the other hand, we took two steps to help with this. First, the MDS client UUID should always be the same, so even after a restart it should still be allowed to reconnect as the old known client (this assumes it actually got up and into a reconnecting state in time for OST recovery; if your MDS takes ages to reboot, for example, it might miss this window, especially if it is a shortened window thanks to IR).

            Second, we have VBR to deal with missing clients during recovery, which is especially easy with the MDT client, since it never has any outstanding uncommitted transactions to replay.

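            A hedged way to see whether a restarted MDS made it back into the OST recovery window is to look at the OST export list and recovery status on the OSS. The parameter names below are the standard 2.5 ones and are an assumption here; the MDS shows up as an export with an MDT UUID:

            # On the OSS: UUIDs of clients (including the MDS) known to each OST.
            lctl get_param obdfilter.*.exports.*.uuid

            # Recovery progress, including completed-client counts.
            lctl get_param obdfilter.*.recovery_status
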
            simmonsja James A Simmons added a comment - - edited

            We have had success with MDS+OSS failing over at the same time in the past. We would really like to have that functionality restored.


            hongchao.zhang Hongchao Zhang added a comment -

            Is the client (rhea513, 10.38.146.45) the Lustre client connected to "atlastds"?
            This client mounted at "Dec 29 12:55:33" and unmounted at "Dec 29 14:04:54", but the MDT failed over at "Dec 29 13:55:12" and Lustre was only started again at "Dec 29 14:26:55".

            At the client (rhea513, 10.38.146.45):

            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: client wants to enable acl, but mdt not!
            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Layout lock feature supported.
            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Mounted atlastds-client
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879033/real 1419879033]  req@ffff881000328000 x1487757986577588/t0(0) o400->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 224/224 e 0 to 1 dl 1419879600 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST0023-osc-ffff880820a28c00: Connection to atlastds-OST0023 (at 10.36.226.67@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST002e-osc-ffff880820a28c00: Connection to atlastds-OST002e (at 10.36.226.70@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: LustreError: 166-1: MGC10.36.226.79@o2ib: Connection to MGS (at 10.36.226.79@o2ib) was lost; in progress operations using this service will fail
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 52 previous similar messages
            Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879600/real 1419879600]  req@ffff88102a64b800 x1487757986596960/t0(0) o8->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 400/544 e 0 to 1 dl 1419879880 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            Dec 29 14:04:54 rhea513.ccs.ornl.gov kernel: Lustre: Unmounted atlastds-client
            

            At the MDT (10.36.226.69):

            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.814766]  sdbd:
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.814952] sd 7:0:0:98: [sdaj] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.835086]  unknown partition table
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.845364]  unknown partition table
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.864189] sd 6:0:0:71: [sdau] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.864257] sd 6:0:0:101: [sdbd] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.970252] device-mapper: multipath round-robin: version 1.0.0 loaded
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.955421] LNet: HW CPU cores: 16, npartitions: 4
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.963523] alg: No test for crc32 (crc32-table)
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.977845] alg: No test for adler32 (adler32-zlib)
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.988788] alg: No test for crc32 (crc32-pclmul)
            Dec 29 14:26:59 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1962.022800] padlock: VIA PadLock Hash Engine not detected.
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.277906] LNet: Added LNI 10.36.226.69@o2ib [63/2560/0/180]
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.814213] LNet: Added LNI 10.36.226.69@o2ib200 [63/2560/0/180]
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1967.054096] Lustre: Lustre: Build Version: 2.5.3-g6158f83-CHANGED-2.6.32-431.29.2.el6.atlas.x86_64
            

            By the way, in this failover test both the MDT and OSS were rebooted, and recovery will be slow because the OSSs will wait out the timeout to evict the client from the MDT, which won't reconnect to recover.
            Also, Lustre seems not to support failing over both the MDS and OSS together, IIRC.


            simmonsja James A Simmons added a comment -

            Here are the kern logs for a client and a router. If you want the logs for all the clients, let me know.

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 16

              Dates

                Created:
                Updated:
                Resolved: