[LU-5724] IR recovery doesn't behave properly with Lustre 2.5 Created: 10/Oct/14  Updated: 20/Feb/15  Resolved: 20/Feb/15

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: llnl, ornl, p4o
Environment:

MDS server running RHEL6.5 running ORNL 2.5.3 branch with about 12 patches.


Attachments: Text File atlas-mds1.log     File atlas-tds-kernel-logs_20141229.tar.gz     Text File atlas-tds-oss1_recovery_lustre-log.1418679242.16958     Text File rhea-rtr1_kern_12292014.log     Text File rhea513_kern_12292014.log    
Issue Links:
Duplicate
duplicates LU-4119 recovery time hard doesn't limit reco... Resolved
Related
is related to LU-5079 conf-sanity test_47 timeout Resolved
Severity: 2
Rank (Obsolete): 16076

 Description   

Today we experienced a hardware failure with our MDS. The MDS rebooted and then came back. We restarted the MDS, but IR behaved strangely. Four clients got evicted, but when the completion timer got down to zero, IR restarted all over again. Then, once it got into the 700-second range, the timer started going back up. It did this a few times before letting the timer run out. Once the timer finally reached zero, the state was still reported as recovering. It remained this way for several more minutes before finally reaching a recovered state. In all it took 54 minutes to recover.



 Comments   
Comment by James A Simmons [ 10/Oct/14 ]

I attached the dmesg, but sorry, we didn't collect any other data this time.

Comment by James A Simmons [ 10/Oct/14 ]

We noticed that when a client reconnects, the replay timer goes up a bit. Is this proper behavior?

Comment by Peter Jones [ 10/Oct/14 ]

Jinshan

Could you please comment?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 10/Oct/14 ]

From the dmesg, IR is enabled on this server:

Oct 10 12:00:56 atlas-mds1.ccs.ornl.gov kernel: [  280.655680] Lustre: atlas1-MDT0000: Imperative Recovery enabled, recovery window shrunk from 1800-5400 down to 900-2700
Oct 10 12:00:56 atlas-mds1.ccs.ornl.gov kernel: [  280.661720] Lustre: atlas1-MDT0000: Will be in recovery for at least 15:00, or until 20178 clients reconnect

After 15 minutes, the recovery timed out and 4 clients were evicted.

Oct 10 12:15:56 atlas-mds1.ccs.ornl.gov kernel: [ 1181.536843] Lustre: atlas1-MDT0000: recovery is timed out, evict stale exports
Oct 10 12:15:56 atlas-mds1.ccs.ornl.gov kernel: [ 1181.547322] Lustre: atlas1-MDT0000: disconnecting 4 stale clients

A few clients couldn't reconnect to the server because an active RPC still existed:

Oct 10 12:17:53 atlas-mds1.ccs.ornl.gov kernel: [ 1297.846751] Lustre: atlas1-MDT0000: Client 31fe9dcf-fbb8-41ea-72d2-cdae9b6dd941 (at 15449@gni100) refused reconnection, still busy with 1 active RPCs

I suspect this is the exact reason why it couldn't recover. There are a few bugs about this `active RPCs' problem; we will have to investigate whether this issue has already been fixed.

Finally, the MDT lost its temper and those clients were evicted.

Oct 10 12:51:28 atlas-mds1.ccs.ornl.gov kernel: [ 3315.271837] Lustre: 19764:0:(ldlm_lib.c:2088:target_recovery_thread()) too long recovery - read logs
Oct 10 12:51:28 atlas-mds1.ccs.ornl.gov kernel: [ 3315.282307] LustreError: dumping log to /tmp/lustre-log.1412959888.19764
Oct 10 12:51:28 atlas-mds1.ccs.ornl.gov kernel: [ 3315.282310] Lustre: atlas1-MDT0000: Recovery over after 50:32, of 20178 clients 20171 recovered and 7 were evicted.

From the log, there are 20178 clients in total; 4 clients couldn't reconnect in the first place, so they were evicted after the IR recovery window timed out. The other 3 got into the `active RPCs' trouble, and the MDT had to wait for them to finish recovery. I feel that IR was working well in this example. As you know, IR can only notify the clients of the server restart; obviously the recovery itself took far too long to finish.
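
For reference, a minimal way to watch these counters from the MDS while recovery is running (assuming the standard Lustre proc interface on a 2.5 server; exact parameter names may vary slightly by version):

# connected/recovered/evicted client counts and the remaining recovery window
lctl get_param mdt.*.recovery_status
# number of exports currently known to the MDT
lctl get_param mdt.*.num_exports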

Another message showed up many times in the log:

Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).

It seems like the MDT couldn't handle the RPCs in time, so the requests expired. Can you please verify whether the CPU was very busy at that time?

Comment by Matt Ezell [ 11/Oct/14 ]

Hi Jinshan- Thanks for the analysis.

After we recently upgraded clients and servers to 2.5.x, we hit LU-5651 (issue covered in LU-5719). To resolve, we backed out the patch from LU-793. This could have caused new instances of the "still busy with 1 active RPC" messages.

While the MDS was trying to recover, I was logged in looking at 'top' and 'perf top' to see the current load. Neither became excessive, so I wonder if there's a situation where the "server is not able to keep up with request traffic" message may be emitted for reasons other than excessive CPU usage. It is emitted when the server tries to send early replies but it's already past the request deadline. When in recovery, does that metric even make sense? I fear that it's not CPU usage but some other issue (deadlock?) causing the problem.
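
If we ever want to test the CPU theory, the knob that console message points at can be raised; a hedged sketch, assuming at_early_margin is exposed under /proc/sys/lustre on these 2.5 servers (the value 10 is only an example):

# current early-reply margin (default 5s, per the console message)
cat /proc/sys/lustre/at_early_margin
# give the server more headroom to send early replies before the deadline
echo 10 > /proc/sys/lustre/at_early_margin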

Maybe I don't understand the normal recovery sequence, but at 12:15:56 when recovery timed out and 4 clients were evicted, why did recovery continue and run into the "still busy" clients? Then it had to wait until 12:51:28 for the 3 additional ones? Why wasn't every non-compliant client evicted at that time so the MDS could move on with its life?

Anyway, I think we need to get a stable fix for LU-5651 and re-enable the patch from LU-793 to see if it improves the recovery process. Other issues may persist, and that's what we need to track in this ticket.

Comment by Hongchao Zhang [ 13/Oct/14 ]

Hi Matt,

The extra recovery period is caused by VBR (version-based recovery). "check_and_start_recovery_timer" starts the first recovery period when a client connects, and that period is 3*obd_timeout = 1800s = 30m, which IR reduces to 900s = 15m. As per the log ("check_for_clients" in the stack), the MDT was waiting for all clients to connect during this period (4 clients didn't connect).

tgt_recov     D 000000000000000f     0 19764      2 0x00000000
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.914484]  ffff881fcaed1da0 0000000000000046 0000000000000000 ffff881fcaed1d64
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.923068]  0000009100000000 ffff88207fc28800 ffff88011c456880 0000000000000400
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.931650]  ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.940231] Call Trace:
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.943131]  [<ffffffffa07ca620>] ? check_for_clients+0x0/0x70 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.950680]  [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.959026]  [<ffffffffa07ca310>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.966778]  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.974140]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.982456]  [<ffffffffa07d2a90>] target_recovery_thread+0x540/0x1920 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.990824]  [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.997983]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1085.006277]  [<ffffffff8109ab56>] kthread+0x96/0xa0
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1085.011867]  [<ffffffff8100c20a>] child_rip+0xa/0x20
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1085.017531]  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1085.023222]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
Oct 10 12:00:56 atlas-mds1.ccs.ornl.gov kernel: [  280.655680] Lustre: atlas1-MDT0000: Imperative Recovery enabled, recovery window shrunk from 1800-5400 down to 900-2700
Oct 10 12:00:56 atlas-mds1.ccs.ornl.gov kernel: [  280.661720] Lustre: atlas1-MDT0000: Will be in recovery for at least 15:00, or until 20178 clients reconnect
...
Oct 10 12:15:56 atlas-mds1.ccs.ornl.gov kernel: [ 1181.536843] Lustre: atlas1-MDT0000: recovery is timed out, evict stale exports
Oct 10 12:15:56 atlas-mds1.ccs.ornl.gov kernel: [ 1181.547322] Lustre: atlas1-MDT0000: disconnecting 4 stale clients

After the first recovery timer expired, obd_device->obd_version_recov was set and an extra recovery period (VBR, version-based recovery, with a deadline of 2700s = 45m) was started by calling "extend_recovery_timer" from "check_and_start_recovery_timer" and "handle_recovery_req". The recovery timer then expired while waiting for the request with the next transno to be replayed ("check_for_next_transno" in the stack), and 1 more client failed.

Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.578366] tgt_recov     D 0000000000000015     0 19764      2 0x00000000
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.586215]  ffff881fcaed1da0 0000000000000046 ffff881fcaed1d00 ffff881fcaed1d64
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.594772]  ffffc9007f72b470 ffff88402a8af430 0000000000004ed2 0000000000004ece
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.603336]  ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.611889] Call Trace:
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.614787]  [<ffffffffa07cea60>] ? check_for_next_transno+0x0/0x590 [ptlrpc]
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.622914]  [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.631242]  [<ffffffffa07ca330>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.639246]  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.646589]  [<ffffffffa07d2cba>] target_recovery_thread+0x76a/0x1920 [ptlrpc]
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.654880]  [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.662034]  [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.670330]  [<ffffffff8109ab56>] kthread+0x96/0xa0
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.675928]  [<ffffffff8100c20a>] child_rip+0xa/0x20
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.681593]  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.687255]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
Oct 10 12:45:20 atlas-mds1.ccs.ornl.gov kernel: [ 2946.578608] Lustre: atlas1-MDT0000: recovery is timed out, evict stale exports
Oct 10 12:45:20 atlas-mds1.ccs.ornl.gov kernel: [ 2946.588579] Lustre: atlas1-MDT0000: disconnecting 1 stale clients

The load on the system should not be high: there is only one thread processing these replay requests and the client count is relatively high (20178), so some requests can take longer and trigger the log message:

Oct 10 12:16:27 atlas-mds1.ccs.ornl.gov kernel: [ 1212.683035] Lustre: 16227:0:(service.c:1304:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-62s), not sending early reply. Consider increasing at_early_margin (5)?  req@ffff88404991d000 x1481333400177244/t0(0) o400->dbcda88f-598d-9915-c1a2-3272067b8e42@18919@gni100:0/0 lens 224/0 e 1 to 0 dl 1412957725 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
Oct 10 12:16:27 atlas-mds1.ccs.ornl.gov kernel: [ 1212.718387] Lustre: 16227:0:(service.c:1304:ptlrpc_at_send_early_reply()) Skipped 6595 previous similar messages
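
For completeness, the 900s/2700s figures above follow from the site's obd_timeout (3*obd_timeout for the first window, halved by IR). A quick way to confirm the values on the MDS, assuming these tunables are exposed in this 2.5 build:

# global obd timeout; 3x this is the initial recovery window before IR halves it
lctl get_param timeout
# per-target soft/hard recovery limits, if present in this version
lctl get_param mdt.*.recovery_time_soft mdt.*.recovery_time_hard
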
Comment by Jason Hill (Inactive) [ 16/Oct/14 ]

After talking with James Nunez on the call today some more background information might be useful.

There were 2 outages on Friday 10/10. The first was the MDS rebooting spuriously with no known cause. After we returned to service, more hardware validation and debugging occurred and a hardware fault was found. A second outage was taken to swap the MDS hardware (not the MDT). In the first outage there were ~1.5 hours between when the MDS became unresponsive and when it became responsive again; the second outage was no more than 20 minutes. The first outage had the issue described in this ticket; the second outage recovered successfully and cleanly in a much shorter time (less than 30 minutes).

Comment by Jinshan Xiong (Inactive) [ 16/Oct/14 ]

30 minutes was pretty good for this size of cluster, in my personal opinion.

Comment by Jason Hill (Inactive) [ 16/Oct/14 ]

jinshan:

We are happy with the second outage's performance. The 1.5 hours for the first outage is where we have a problem.

Comment by Jason Hill (Inactive) [ 16/Oct/14 ]

We're currently in the holding pattern of the MDS having 0s remaining on the recovery timer but not exiting recovery.

status: RECOVERING
recovery_start: 1413488962
time_remaining: 0
connected_clients: 20177/20178
req_replay_clients: 409
lock_replay_clients: 481
completed_clients: 19696
evicted_clients: 1
replayed_requests: 269
queued_requests: 408
next_transno: 148691449625

Oct 16 17:05:17 atlas-mds3.ccs.ornl.gov kernel: [ 4769.501597] Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
Oct 16 17:05:17 atlas-mds3.ccs.ornl.gov kernel: [ 4769.503568] Lustre: 14922:0:(service.c:1507:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=250, delay=0(jiff)
Oct 16 17:05:17 atlas-mds3.ccs.ornl.gov kernel: [ 4769.503571] Lustre: 14922:0:(service.c:1507:ptlrpc_at_check_timed()) Skipped 8 previous similar messages
Oct 16 17:05:17 atlas-mds3.ccs.ornl.gov kernel: [ 4769.503579] Lustre: 14922:0:(service.c:1304:ptlrpc_at_send_early_reply()) @@@ Already past deadline (62s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff883ff8de8400 x1481332130881448/t0(148691450005) o36>1304f9d3-b2ef-59fa-e0f8-5662fc869d82@173@gni3:0/0 lens 488/0 e 1 to 0 dl 1413493455 ref 2 fl Complete:/6/ffffffff rc 0/-1

Comment by Matt Ezell [ 16/Oct/14 ]

I just uploaded some debug logs to your FTP server:
/uploads/LU-5724/atlas-mds3-recovery-20141016-all.gz

Comment by Hongchao Zhang [ 17/Oct/14 ]

According to the log "atlas-mds3-recovery-20141016-all.gz", one client failed to connect during the first phase of recovery (900s), and the processing of replay requests was very slow (and there were a lot of reconnection requests from clients), which prolonged the recovery beyond the "hard" limit (obd_recovery_time_hard); that is why "time_remaining" in "recovery_status" is "0".

Hi Matt,
Is there any difference between this recovery and the first one in the ticket (such as more verbose logging, etc.)? That one was much faster in the second phase (the VBR phase), and its recovery expired at the hard recovery limit. But in this new instance, replay requests are processed slowly and extend the recovery timer step by step past the "hard" limit; the recovery is not aborted because there are still healthy clients to recover (they are just very slow).

For a given "sys.timeout" parameter the recovery speed depends on the clients: if some clients fail to connect for recovery, the recovery can need a long time to confirm that those clients have failed and to evict them.
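
If you want to experiment with that, "sys.timeout" is the filesystem-wide setting applied through the MGS; a hedged example (the value 300 is only a placeholder):

# run on the MGS node; applies to all servers and clients of the atlas1 filesystem
lctl conf_param atlas1.sys.timeout=300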

Comment by Matt Ezell [ 17/Oct/14 ]

We run with:

[root@atlas-mds3 ~]# cat /proc/sys/lnet/debug
ioctl neterror warning error emerg ha config console
[root@atlas-mds3 ~]# cat /proc/sys/lnet/printk 
warning error emerg console

In the previous scenario, we only provided the output of what printk()'ed. For the more recent instance, I used 'lctl dk' regularly to create the logs I provided. I think for diagnosing recovery, the D_HA messages are helpful.
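
Roughly, the collection loop looks like this (a sketch; our debug mask above already includes "ha", and the dump filename is arbitrary):

# make sure D_HA messages are being collected (the '+' adds to the current mask)
echo +ha > /proc/sys/lnet/debug
# periodically dump and clear the kernel debug buffer to a file
lctl dk /tmp/ldebug.$(hostname).$(date +%s)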

In both instances, we had hardware issues that caused the MDS to go down. I think we followed the same procedures when we tried to bring them back up. It's possible that in the most recent instance there were still some lingering hardware issues.

Is it normal for the stale client to be reported as just a UUID and not a NID?

atlas2-MDT0000: disconnect stale client 57ffbae2-6467-b467-fad5-58b51832006c@<unknown>
atlas2-MDT0000: EVICTING ffff881eaf853800 57ffbae2-6467-b467-fad5-58b51832006c (no nid) 2 (0 0 0) 0 0 1 0: (null)  146091266917

After the fact, can we tell which node that was? (Looking in the exports now, that UUID doesn't exist.) I'd like to see what messages it logged.

Comment by James A Simmons [ 17/Oct/14 ]

On our smaller-scale test bed system, applying the patch for LU-5079 seems to help resolve the recovery issues. Next week we will move to testing on a larger system. It looks like the patch from LU-4578 is what is causing the breakage. It is also impacting Lustre 2.7, as reported in LU-5077. We might have a fix.

Comment by Peter Jones [ 28/Nov/14 ]

James

Are you comfortable to close this as a duplicate of LU-5079?

Peter

Comment by James A Simmons [ 01/Dec/14 ]

The cause of our recovery issues was three things: LU-5079, LU-5287, and lastly LU-5651. Of those, only LU-5651 is left to be merged to b2_5, so this ticket should remain open until that patch lands.

Comment by James A Simmons [ 15/Dec/14 ]

Today we tested the latest 2.5 lustre code with the following patches:

LU-793
LU-3338
LU-5485
LU-5651
LU-5740

with 500 client nodes. Recovery failed to complete: after an hour and 22 minutes we gave up and ended recovery. During recovery we lost an OSS node; I attached the Lustre log it dumped. We also have a core from that OSS that I can post as well.

Comment by James A Simmons [ 16/Dec/14 ]

Some more info from today's testing. The failure to recover occurred when both the MDS and an OSS were failed over. If we failed over just the MDS or just an OSS, recovery would complete. When we did the second round of testing with a single server node, we noticed that IR was reported as disabled even though we have no non-IR clients. We checked that on the MGS.

Comment by Jinshan Xiong (Inactive) [ 16/Dec/14 ]

Does "single server node" mean that the MGS was also restarted in the test?

Comment by James A Simmons [ 16/Dec/14 ]

No. Only the MDS and OSS were restarted.

Comment by James A Simmons [ 30/Dec/14 ]

We did another test run for recovery in the case where both the MDS and an OSS fail. I collected logs and placed them at ftp.whamcloud.com/uploads/LU-5724/*.log. The OSS seemed to recover but the MDS did not recover properly.

Comment by Hongchao Zhang [ 05/Jan/15 ]

As per the log "dump_atlas-tds-mds1-after-recovery.log", 3 out of 4 clients completed the recovery at the MDT.

00010000:02000000:13.0:1419964653.561987:0:15786:0:(ldlm_lib.c:1392:target_finish_recovery()) atlastds-MDT0000: Recovery over after 30:00, of 4 clients 3 recovered and 1 was evicted.

Which nodes does the client log "client-dump.log" cover? No eviction record was found in this log.

By the way, did you use 4 clients and a separate MGS in this test? And could you please attach the console/syslog logs along with
those debug logs?

Thanks!

Comment by James A Simmons [ 05/Jan/15 ]

The OSS reconnected to the MDS but none of the clients ever reconnected. The clients appeared stuck. The client logs are from the client nodes we used. As for the configuration, the MGS is a standalone node and we tested with 4 nodes. Will grab the logs.

Comment by James A Simmons [ 05/Jan/15 ]

Here you go. These are the logs from the clients and servers.

Comment by Hongchao Zhang [ 07/Jan/15 ]

Is there only one Lustre client, at 10.38.144.11, in this configuration? Are these logs from the same failover test described above?

[ 2267.379541] Lustre: atlastds-MDT0000: Will be in recovery for at least 30:00, or until 1 client reconnects
Dec 29 14:31:02 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2267.409294] Lustre: atlastds-MDT0000: Denying connection for new client 3ae0ecec-84ef-cf8f-c128-51873c53d1ad (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:59
Dec 29 14:31:08 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2272.910080] Lustre: atlastds-MDT0000: Denying connection for new client 5116891d-0ace-dffd-7497-218db0b23e98 (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:54

The MDT and OSSs were waiting for the client to reconnect for recovery, but it somehow failed to reconnect and appears to have connected as
a new Lustre client, which was denied by the MDS and OSSs because they were recovering from the failover.

Could you please attach the console&sys logs of the client? Thanks!

Comment by James A Simmons [ 07/Jan/15 ]

Here are the kern logs for a client and a router. If you want the logs for all the clients let me know.

Comment by Hongchao Zhang [ 08/Jan/15 ]

Is the client (rhea513, 10.38.146.45) the Lustre client connected to "atlastds"?
This client mounted at "Dec 29 12:55:33" and unmounted at "Dec 29 14:04:54", but the MDT failed over at "Dec 29 13:55:12" and Lustre on it was started again only at "Dec 29 14:26:55".

at Client (rhea513, 10.38.146.45)

 
Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: client wants to enable acl, but mdt not!
Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Layout lock feature supported.
Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Mounted atlastds-client
Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879033/real 1419879033]  req@ffff881000328000 x1487757986577588/t0(0) o400->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 224/224 e 0 to 1 dl 1419879600 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST0023-osc-ffff880820a28c00: Connection to atlastds-OST0023 (at 10.36.226.67@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST002e-osc-ffff880820a28c00: Connection to atlastds-OST002e (at 10.36.226.70@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: LustreError: 166-1: MGC10.36.226.79@o2ib: Connection to MGS (at 10.36.226.79@o2ib) was lost; in progress operations using this service will fail
Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 52 previous similar messages
Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879600/real 1419879600]  req@ffff88102a64b800 x1487757986596960/t0(0) o8->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 400/544 e 0 to 1 dl 1419879880 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Dec 29 14:04:54 rhea513.ccs.ornl.gov kernel: Lustre: Unmounted atlastds-client

at MDT (10.36.226.69)

Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.814766]  sdbd:
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.814952] sd 7:0:0:98: [sdaj] Attached SCSI disk
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.835086]  unknown partition table
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.845364]  unknown partition table
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.864189] sd 6:0:0:71: [sdau] Attached SCSI disk
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.864257] sd 6:0:0:101: [sdbd] Attached SCSI disk
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.970252] device-mapper: multipath round-robin: version 1.0.0 loaded
Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.955421] LNet: HW CPU cores: 16, npartitions: 4
Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.963523] alg: No test for crc32 (crc32-table)
Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.977845] alg: No test for adler32 (adler32-zlib)
Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.988788] alg: No test for crc32 (crc32-pclmul)
Dec 29 14:26:59 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1962.022800] padlock: VIA PadLock Hash Engine not detected.
Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.277906] LNet: Added LNI 10.36.226.69@o2ib [63/2560/0/180]
Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.814213] LNet: Added LNI 10.36.226.69@o2ib200 [63/2560/0/180]
Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1967.054096] Lustre: Lustre: Build Version: 2.5.3-g6158f83-CHANGED-2.6.32-431.29.2.el6.atlas.x86_64

By the way, in this failover test both the MDT and the OSS were rebooted, so recovery will be slow because the OSSs will have to wait to evict the MDT's client, which won't reconnect to recover.
And Lustre seems not to support failing over both the MDS and an OSS at the same time, IIRC.

Comment by James A Simmons [ 08/Jan/15 ]

We have had success with the MDS and OSS failing over at the same time in the past. We would really like to have that functionality restored.

Comment by Oleg Drokin [ 08/Jan/15 ]

I think what Hongchao is trying to say is that when the MDS and OST both go down, then since the MDS is a client of the OST, the OST recovery can never complete because of the missing client.

But on the other hand, we took two steps to help with this. First, the MDS client UUID should always be the same, so even after a restart it should still be allowed to reconnect as the old known client (this assumes it actually got up and into the reconnecting state in time for OST recovery; if your MDS takes ages to reboot, for example, it might miss this window, especially if it's a shortened window thanks to IR).

Second, we have VBR to deal with missing clients during recovery, which is especially easy with the MDT's client, since it never has any outstanding uncommitted transactions to replay.
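
A quick way to check the first point after the MDS comes back is to look at the OSTs' export lists and confirm that the MDS's (constant) UUID is back; a sketch, assuming the usual exports proc tree on the OSSs:

# list client UUIDs currently exported by the OSTs; the MDT's own entry
# (for example atlastds-MDT0000-mdtlov_UUID) should reappear here
lctl get_param obdfilter.*.exports.*.uuid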

Comment by Jian Yu [ 08/Jan/15 ]

In the Lustre test suite, the following sub-tests in insanity.sh test failing the MDS and OSS at the same time:

run_test 2 "Second Failure Mode: MDS/OST `date`"
run_test 4 "Fourth Failure Mode: OST/MDS `date`"

The basic test steps for sub-test 2 are:

fail MDS
fail OSS
start OSS
start MDS

And the basic test steps for sub-test 4 are:

fail OSS
fail MDS
start OSS
start MDS

Here is the insanity test report for Lustre b2_5 build #107: https://testing.hpdd.intel.com/test_sets/bfd812b0-8a4d-11e4-a10b-5254006e85c2
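
A sketch of how one could run just those two sub-tests from a Lustre tree, assuming the usual lustre/tests configuration (cfg/local.sh or a site config describing the MDS/OSS/client facets) is already in place:

# from the test node, in the lustre/tests directory of the build
cd lustre/tests
ONLY="2 4" ./insanity.sh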

Comment by Hongchao Zhang [ 09/Jan/15 ]

Sorry for the misunderstanding! Yes, Lustre supports failing over both the MDT and OSS.

Comment by James A Simmons [ 14/Jan/15 ]

Testing failover, we see the following with just the MDS being failed over:

Every 1.0s: cat recovery_status                              Wed Jan 14 14:11:11 2015

status: COMPLETE
recovery_start: 1421262582
recovery_duration: 60
completed_clients: 82/82
replayed_requests: 0
last_transno: 90194315377
VBR: DISABLED
IR: DISABLED

All clients are 2.5+, so there should be no reason for IR to be disabled. Can you reproduce this problem on your side?

On the MGS we see the following during failover:

[root@atlas-tds-mds1 MGC10.36.226.79@o2ib]# cat ir_state
imperative_recovery: ENABLED
client_state:

    - { client: atlastds-MDT0000, nidtbl_version: 957 }

Comment by Hongchao Zhang [ 15/Jan/15 ]

Are both MGS and MDS failed over in this test?

The IR status will be set to IR_STARTUP after the MGS starts and will change to IR_FULL after "ir_timeout" seconds
(the default is OBD_IR_MGS_TIMEOUT = 4*obd_timeout). A target (MDT or OST) registering with the MGS will only be marked
"LDD_F_IR_CAPABLE" if the IR status is IR_FULL, and "IR" will be printed as "DISABLED" in that case.

On the client side, imperative_recovery is marked "Enabled" if the connection with the server supports IR
(imp->imp_connect_data & OBD_CONNECT_IMP_RECOV is set).
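
To check this on the MGS side, something like the following should show the IR timeout and the per-filesystem IR state (parameter names as I recall them for 2.5, so please adjust if they differ in this build):

# on the MGS node: the IR startup timeout, and the live state for the test
# filesystem, including the IR state (STARTUP/FULL) and the non-IR client count
lctl get_param mgs.MGS.ir_timeout
lctl get_param mgs.MGS.live.atlastds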

Comment by James A Simmons [ 15/Jan/15 ]

No, the MGS is left up. We failed over the MDS and OSS together.

Comment by Hongchao Zhang [ 16/Jan/15 ]

Is the failover mode the same for both tests (Dec 31, 2014 and Jan 14, 2015), i.e. a separate node running only the MGS,
which the MDS, OSSs and client nodes connect to, with the MDS and OSSs failed over together?

Comment by Jian Yu [ 23/Jan/15 ]

With the patch for LU-4119, the recovery issue did not occur in small-scale testing at ORNL. Large-scale testing will be performed.

Comment by Jian Yu [ 03/Feb/15 ]

Large scale testing passed.

Hi James, can we close this ticket now?

Comment by Jian Yu [ 05/Feb/15 ]

Close this ticket as a duplicate of LU-4119.
