
IR recovery doesn't behave properly with Lustre 2.5

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.5.3
    • Environment: MDS server running RHEL 6.5 with the ORNL 2.5.3 branch plus about 12 patches.
    • 2
    • 16076

    Description

      Today we experienced a hardware failure with our MDS. The MDS rebooted and then came back. We restarted the MDS, but IR behaved strangely. Four clients got evicted, but when the recovery timer counted down to zero, IR restarted all over again. Then, once it got down to around 700 seconds, the timer started to go up. It did this a few times before finally letting the timer run out. Once the timer did reach zero, the recovery state was still reported as being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all it took 54 minutes to recover.
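
      For reference, a minimal monitoring sketch in Python (an illustration, not part of the original report): it polls the MDT recovery status via lctl so the timer and state behaviour described above can be captured over time. The target name and the field names (status, time_remaining, completed_clients, evicted_clients) are assumptions based on the usual recovery_status output and may differ between Lustre versions.

      #!/usr/bin/env python
      # Sketch: poll the MDT recovery_status parameter on the MDS and log how the
      # recovery timer and state evolve. Parameter path and field names are
      # assumptions and may differ between Lustre versions.
      import subprocess
      import time

      TARGET = "mdt.atlastds-MDT0000.recovery_status"  # adjust to the local target

      def recovery_status():
          """Return the recovery_status output as a dict of key -> value."""
          out = subprocess.check_output(["lctl", "get_param", "-n", TARGET])
          status = {}
          for line in out.decode().splitlines():
              if ":" in line:
                  key, _, value = line.partition(":")
                  status[key.strip()] = value.strip()
          return status

      if __name__ == "__main__":
          while True:
              s = recovery_status()
              print("%s status=%s time_remaining=%s completed=%s evicted=%s" % (
                  time.strftime("%H:%M:%S"), s.get("status"),
                  s.get("time_remaining"), s.get("completed_clients"),
                  s.get("evicted_clients")))
              if s.get("status") == "COMPLETE":
                  break
              time.sleep(10)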

          Activity

            [LU-5724] IR recovery doesn't behave properly with Lustre 2.5
            simmonsja James A Simmons added a comment - - edited

            We have had success with MDS+OSS failing over at the same time in the past. We would really like to have that functionality restored.


            hongchao.zhang Hongchao Zhang added a comment

            Is the client (rhea513, 10.38.146.45) the Lustre client connected to "atlastds"?
            This client mounted at "Dec 29 12:55:33" and unmounted at "Dec 29 14:04:54", but the MDT failed over at "Dec 29 13:55:12" and Lustre was started again only at "Dec 29 14:26:55".

            at Client (rhea513, 10.38.146.45)

             
            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: client wants to enable acl, but mdt not!
            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Layout lock feature supported.
            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Mounted atlastds-client
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879033/real 1419879033]  req@ffff881000328000 x1487757986577588/t0(0) o400->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 224/224 e 0 to 1 dl 1419879600 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST0023-osc-ffff880820a28c00: Connection to atlastds-OST0023 (at 10.36.226.67@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST002e-osc-ffff880820a28c00: Connection to atlastds-OST002e (at 10.36.226.70@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: LustreError: 166-1: MGC10.36.226.79@o2ib: Connection to MGS (at 10.36.226.79@o2ib) was lost; in progress operations using this service will fail
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 52 previous similar messages
            Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879600/real 1419879600]  req@ffff88102a64b800 x1487757986596960/t0(0) o8->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 400/544 e 0 to 1 dl 1419879880 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            Dec 29 14:04:54 rhea513.ccs.ornl.gov kernel: Lustre: Unmounted atlastds-client
            

            at MDT (10.36.226.69)

            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.814766]  sdbd:
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.814952] sd 7:0:0:98: [sdaj] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.835086]  unknown partition table
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.845364]  unknown partition table
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.864189] sd 6:0:0:71: [sdau] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.864257] sd 6:0:0:101: [sdbd] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.970252] device-mapper: multipath round-robin: version 1.0.0 loaded
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.955421] LNet: HW CPU cores: 16, npartitions: 4
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.963523] alg: No test for crc32 (crc32-table)
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.977845] alg: No test for adler32 (adler32-zlib)
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.988788] alg: No test for crc32 (crc32-pclmul)
            Dec 29 14:26:59 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1962.022800] padlock: VIA PadLock Hash Engine not detected.
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.277906] LNet: Added LNI 10.36.226.69@o2ib [63/2560/0/180]
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.814213] LNet: Added LNI 10.36.226.69@o2ib200 [63/2560/0/180]
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1967.054096] Lustre: Lustre: Build Version: 2.5.3-g6158f83-CHANGED-2.6.32-431.29.2.el6.atlas.x86_64
            

            By the way, in this failover test both the MDT and the OSS were rebooted, and recovery will be slow because the OSSs will wait some time to evict the client from the MDT, which will not reconnect to recover.
            And Lustre seems not to support failing over both the MDS and OSS at the same time, IIRC.
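
            For illustration, a read-only sketch (an addition, not from this comment), assuming the usual per-target recovery_time_soft/recovery_time_hard tunables are exposed via lctl on the server nodes (exact paths may differ by version): it prints the recovery windows, which bound how long a target can wait for a client that will never reconnect.

            # Sketch: dump the per-target recovery window tunables on a Lustre
            # server node. Parameter paths are assumptions for a 2.5-era setup.
            import subprocess

            PATTERNS = (
                "mdt.*.recovery_time_soft", "mdt.*.recovery_time_hard",
                "obdfilter.*.recovery_time_soft", "obdfilter.*.recovery_time_hard",
            )

            def get_param(pattern):
                """Return {parameter_name: value} for an lctl get_param pattern."""
                out = subprocess.check_output(["lctl", "get_param", pattern])
                params = {}
                for line in out.decode().splitlines():
                    if "=" in line:
                        name, _, value = line.partition("=")
                        params[name.strip()] = value.strip()
                return params

            if __name__ == "__main__":
                for pattern in PATTERNS:
                    try:
                        for name, value in sorted(get_param(pattern).items()):
                            print("%s = %s" % (name, value))
                    except subprocess.CalledProcessError:
                        pass  # no matching targets on this node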


            simmonsja James A Simmons added a comment

            Here are the kern logs for a client and a router. If you want the logs for all the clients let me know.

            hongchao.zhang Hongchao Zhang added a comment - - edited

            Is there only one Lustre client, at 10.38.144.11, in this configuration? Are these logs from the same failover test as above?

            [ 2267.379541] Lustre: atlastds-MDT0000: Will be in recovery for at least 30:00, or until 1 client reconnects
            Dec 29 14:31:02 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2267.409294] Lustre: atlastds-MDT0000: Denying connection for new client 3ae0ecec-84ef-cf8f-c128-51873c53d1ad (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:59
            Dec 29 14:31:08 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2272.910080] Lustre: atlastds-MDT0000: Denying connection for new client 5116891d-0ace-dffd-7497-218db0b23e98 (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:54
            

            The MDT and OSSs are waiting for the client to reconnect for recovery, but it somehow failed to reconnect and appears to be connecting as a new Lustre client, which was denied by the MDS and OSSs because they were recovering from the failover.

            Could you please attach the console and syslog logs of the client? Thanks!
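
            A small hypothetical helper (an addition, assuming the "Denying connection for new client ... (at ...)" console message format shown above): it scans an MDS log and lists the distinct client UUID/NID pairs being denied, so one can see whether a node keeps retrying with fresh UUIDs, i.e. as a brand-new mount, instead of reconnecting as the known client.

            # Sketch: count "Denying connection for new client" messages per
            # (NID, UUID) pair in an MDS console/syslog file.
            import re
            import sys
            from collections import Counter

            DENY_RE = re.compile(
                r"Denying connection for new client (?P<uuid>\S+) \(at (?P<nid>[^)]+)\)")

            def scan(path):
                seen = Counter()
                with open(path) as log:
                    for line in log:
                        m = DENY_RE.search(line)
                        if m:
                            seen[(m.group("nid"), m.group("uuid"))] += 1
                for (nid, uuid), count in seen.most_common():
                    print("%-24s %-40s denied %d times" % (nid, uuid, count))

            if __name__ == "__main__":
                scan(sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages")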


            simmonsja James A Simmons added a comment

            Here you go. These are the logs from the clients and servers.


            simmonsja James A Simmons added a comment

            The OSS reconnected to the MDS but none of the clients ever reconnected. The clients appeared stuck. The client logs are from the client nodes we used. As for the configuration, the MGS is a standalone node and we tested with 4 nodes. Will grab the logs.


            hongchao.zhang Hongchao Zhang added a comment

            According to the log "dump_atlas-tds-mds1-after-recovery.log", 3 out of 4 clients completed the recovery at the MDT.

            00010000:02000000:13.0:1419964653.561987:0:15786:0:(ldlm_lib.c:1392:target_finish_recovery()) atlastds-MDT0000: Recovery over after 30:00, of 4 clients 3 recovered and 1 was evicted.
            

            Which nodes does the client log "client-dump.log" contain? No eviction record was found in this log.

            By the way, did you use 4 clients and a separate MGS in this test? And could you please attach the console/syslog logs along with those debug logs?

            Thanks!


            simmonsja James A Simmons added a comment

            We did another test run for recovery in the case where both the MDS and an OSS fail. I collected logs and placed them at ftp.whamcloud.com/uploads/LU-5724/*.log. The OSSs seemed to recover but the MDS did not recover properly.


            simmonsja James A Simmons added a comment

            No. Only the MDS and OSS were restarted.


            Does "single server node" mean that the MGS was also restarted in the test?


            simmonsja James A Simmons added a comment

            Some more info from today's testing: the failure to recover occurred when both the MDS and an OSS were failed over. If we failed over just the MDS or an OSS, recovery would complete. When we did the second round of testing with a single server node, we noticed that IR was reported as disabled even though we have no non-IR clients. We checked that on the MGS.
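
            For reference, a minimal sketch of how that MGS check can be scripted, assuming the 2.5-era mgs.MGS.live.<fsname> layout (an imperative_recovery section reporting the IR state and the non-IR client count); exact field names may differ by version.

            # Sketch: print the imperative recovery section of mgs.MGS.live.<fsname>
            # on the MGS node. The filesystem name below matches this ticket.
            import subprocess

            FSNAME = "atlastds"

            def ir_state(fsname):
                out = subprocess.check_output(
                    ["lctl", "get_param", "-n", "mgs.MGS.live.%s" % fsname])
                printing = False
                for line in out.decode().splitlines():
                    if "imperative_recovery" in line:
                        printing = True
                    if printing:
                        print(line)

            if __name__ == "__main__":
                ir_state(FSNAME)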


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 16

              Dates

                Created:
                Updated:
                Resolved: