LU-5724

IR recovery doesn't behave properly with Lustre 2.5

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Environment: MDS server running RHEL 6.5 with the ORNL 2.5.3 branch and about 12 patches.
    • Severity: 2
    • 16076

    Description

      Today we experienced a hardware failure with our MDS. The MDS rebooted and then came back, and we restarted the MDS, but IR (Imperative Recovery) behaved strangely. Four clients got evicted, but when the timer to completion got down to zero, IR restarted all over again. Then, once it got into the 700-second range, the timer started to go up. It did this a few times before finally letting the timer run out. Once the timer did reach zero, the recovery state was still reported as being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all it took 54 minutes to recover.
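      (How the recovery state described above is reported by the servers can be watched with lctl; a minimal sketch, assuming the filesystem name atlastds that appears in the logs later in this ticket, and noting that parameter paths can differ slightly between Lustre versions:)

          # On the MDS: recovery status of the MDT (status, time remaining, clients recovered/evicted)
          lctl get_param mdt.atlastds-MDT0000.recovery_status
          # On each OSS: recovery status of its OSTs
          lctl get_param obdfilter.atlastds-OST*.recovery_status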

      Attachments

        Issue Links

          Activity

            [LU-5724] IR recovery doesn't behave properly with Lustre 2.5

            simmonsja James A Simmons added a comment -

            Here are the kern logs for a client and a router. If you want the logs for all the clients, let me know.
            hongchao.zhang Hongchao Zhang added a comment - edited

            Is there only one Lustre client at 10.38.144.11 in this configuration? Are these logs from the same failover test described above?

            [ 2267.379541] Lustre: atlastds-MDT0000: Will be in recovery for at least 30:00, or until 1 client reconnects
            Dec 29 14:31:02 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2267.409294] Lustre: atlastds-MDT0000: Denying connection for new client 3ae0ecec-84ef-cf8f-c128-51873c53d1ad (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:59
            Dec 29 14:31:08 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2272.910080] Lustre: atlastds-MDT0000: Denying connection for new client 5116891d-0ace-dffd-7497-218db0b23e98 (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:54
            

            The MDT and OSSs are waiting for the client to reconnect for recovery, but it somehow failed to reconnect and appears to be connecting as a new Lustre client, which the MDS and OSSs denied because they were still recovering from the failover.

            Could you please attach the console and syslog of the client? Thanks!

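            (The *dump*.log files attached in the following comments are Lustre debug logs; a minimal sketch of how such logs are typically captured with lctl, assuming the default debug mask plus RPC tracing:)

                # Optionally widen the debug mask to include RPC tracing
                lctl set_param debug=+rpctrace
                # Dump the in-kernel Lustre debug buffer to a file ("dk" is short for debug_kernel)
                lctl dk /tmp/lustre-debug.log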

            simmonsja James A Simmons added a comment -

            Here you go. These are the logs from the clients and servers.

            simmonsja James A Simmons added a comment -

            The OSS reconnected to the MDS, but none of the clients ever reconnected. The clients appeared stuck. The client logs are from the client nodes we used. As for the configuration, the MGS is a standalone node and we tested with 4 nodes. Will grab the logs.

            hongchao.zhang Hongchao Zhang added a comment -

            As per the log "dump_atlas-tds-mds1-after-recovery.log", 3 out of 4 clients completed the recovery at the MDT:

            00010000:02000000:13.0:1419964653.561987:0:15786:0:(ldlm_lib.c:1392:target_finish_recovery()) atlastds-MDT0000: Recovery over after 30:00, of 4 clients 3 recovered and 1 was evicted.

            Which nodes does the client log "client-dump.log" cover? No eviction record was found in that log.

            By the way, did you use 4 clients and a separate MGS in this test? And could you please attach the console/sys logs along with those debug logs?

            Thanks!

            simmonsja James A Simmons added a comment -

            We did another test run of recovery for the case where both the MDS and an OSS fail. I collected logs and placed them at ftp.whamcloud.com/uploads/LU-5724/*.log. The OSSs seemed to recover, but the MDS did not recover properly.

            simmonsja James A Simmons added a comment -

            No. Only the MDS and OSS were restarted.

            Does "single server node" mean that the MGS was also restarted in the test?

            jay Jinshan Xiong (Inactive) added a comment - Does "single server node" mean that the MGS was also restarted in the test?

            simmonsja James A Simmons added a comment -

            Some more info from today's testing: the failure to recover occurred when both the MDS and an OSS were failed over. If we failed over just the MDS or just an OSS, recovery would complete. When we did the second round of testing with a single server node, we noticed that IR was reported as disabled even though we have no non-IR clients. We checked that on the MGS.
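            (The IR state checked on the MGS above is normally visible through lctl; a minimal sketch, assuming the filesystem name atlastds and that the first command is run on the MGS node:)

                # Imperative Recovery state as seen by the MGS (includes the non-IR client count)
                lctl get_param mgs.MGS.live.atlastds
                # IR state as seen by an individual client or server via its MGC
                lctl get_param mgc.*.ir_state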

            simmonsja James A Simmons added a comment -

            Today we tested the latest 2.5 Lustre code with the following patches:

            LU-793
            LU-3338
            LU-5485
            LU-5651
            LU-5740

            With 500 client nodes, recovery completely failed to complete. After an hour and 22 minutes we gave up and ended recovery. During recovery we lost an OSS node; I have attached the Lustre log it dumped. We also have a core from that OSS that I can post as well.

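            (Ending recovery by hand, as described above, is normally done by aborting recovery on the stuck target; a minimal sketch, assuming it is run on the MDS and using the MDT device name seen in the logs:)

                # Abort recovery on the MDT so it stops waiting for the remaining clients to replay
                lctl --device atlastds-MDT0000 abort_recovery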

            simmonsja James A Simmons added a comment -

            The cause of our recovery issues was three things: LU-5079, LU-5287, and lastly LU-5651. Of those, only LU-5651 is left to be merged to b2_5, so this ticket should remain open until that patch lands.

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 16
