[LU-5724] IR recovery doesn't behave properly with Lustre 2.5 Created: 10/Oct/14 Updated: 20/Feb/15 Resolved: 20/Feb/15 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl, ornl, p4o | ||
| Environment: |
MDS server running RHEL 6.5 with the ORNL 2.5.3 branch plus about 12 patches. |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 2 |
| Rank (Obsolete): | 16076 |
| Description |
|
Today we experienced a hardware failure with our MDS. The MDS rebooted and then came back. We restarted the MDS, but IR behaved strangely. Four clients got evicted, but when the timer to completion got down to zero, IR restarted all over again. Then, once it got to the 700-second range, the timer started to go up. It did this a few times before letting the timer run out. Once the timer did finally get to zero, the recovery state was reported as still being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all, it took 54 minutes to recover. |
| Comments |
| Comment by James A Simmons [ 10/Oct/14 ] |
|
I attached the dmesg, but sorry, we didn't collect any other data this time. |
| Comment by James A Simmons [ 10/Oct/14 ] |
|
It was noticed that when a client reconnects, the replay timer goes up a bit. Is this proper behavior? |
| Comment by Peter Jones [ 10/Oct/14 ] |
|
Jinshan, could you please comment? Thanks, Peter |
| Comment by Jinshan Xiong (Inactive) [ 10/Oct/14 ] |
|
From the dmesg, IR is enabled on this server:
Oct 10 12:00:56 atlas-mds1.ccs.ornl.gov kernel: [ 280.655680] Lustre: atlas1-MDT0000: Imperative Recovery enabled, recovery window shrunk from 1800-5400 down to 900-2700
Oct 10 12:00:56 atlas-mds1.ccs.ornl.gov kernel: [ 280.661720] Lustre: atlas1-MDT0000: Will be in recovery for at least 15:00, or until 20178 clients reconnect
After 15 minutes, the recovery timed out and 4 clients were evicted:
Oct 10 12:15:56 atlas-mds1.ccs.ornl.gov kernel: [ 1181.536843] Lustre: atlas1-MDT0000: recovery is timed out, evict stale exports
Oct 10 12:15:56 atlas-mds1.ccs.ornl.gov kernel: [ 1181.547322] Lustre: atlas1-MDT0000: disconnecting 4 stale clients
A few clients couldn't reconnect to the server because an active RPC still existed:
Oct 10 12:17:53 atlas-mds1.ccs.ornl.gov kernel: [ 1297.846751] Lustre: atlas1-MDT0000: Client 31fe9dcf-fbb8-41ea-72d2-cdae9b6dd941 (at 15449@gni100) refused reconnection, still busy with 1 active RPCs
I suspect this is the exact reason why it couldn't be recovered. There are a few bugs about this `active RPCs' problem; we will have to investigate whether this issue has already been fixed. Eventually, the MDT lost its temper and those clients were evicted:
Oct 10 12:51:28 atlas-mds1.ccs.ornl.gov kernel: [ 3315.271837] Lustre: 19764:0:(ldlm_lib.c:2088:target_recovery_thread()) too long recovery - read logs
Oct 10 12:51:28 atlas-mds1.ccs.ornl.gov kernel: [ 3315.282307] LustreError: dumping log to /tmp/lustre-log.1412959888.19764
Oct 10 12:51:28 atlas-mds1.ccs.ornl.gov kernel: [ 3315.282310] Lustre: atlas1-MDT0000: Recovery over after 50:32, of 20178 clients 20171 recovered and 7 were evicted.
From the log, there are 20178 clients in total, and 4 clients couldn't reconnect in the first place, so they were evicted after the IR recovery window timed out. The other 3 got into `active RPCs' trouble, and the MDT had to wait for them to finish recovery. I feel that IR was working well in this example. As you know, IR can only help notify the clients of a server restart; obviously the recovery itself took far too long to finish. Another message popped up many times in the log:
Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
It seems like the MDT couldn't handle the RPCs in time, so the RPCs expired. Can you please verify that the CPU was super busy at that time? |
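|
For reference, the same recovery and IR state can be read directly from the servers while this is happening; a minimal sketch, where the target/host names are just examples and the second command only applies on the MGS node:
[root@atlas-mds1 ~]# lctl get_param mdt.atlas1-MDT0000.recovery_status
[root@atlas-mgs ~]# lctl get_param mgs.MGS.live.atlas1
The live file on the MGS includes an imperative_recovery_state section showing whether IR is in its startup or full state and how many non-IR clients are connected. |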
| Comment by Matt Ezell [ 11/Oct/14 ] |
|
Hi Jinshan, thanks for the analysis. After we recently upgraded clients and servers to 2.5.x, we hit
While the MDS was trying to recover, I was logged in looking at 'top' and 'perf top' to see the current load. Neither became excessive, so I wonder if there's a situation where the "server is not able to keep up with request traffic" message may be emitted for reasons other than excessive CPU usage. It is emitted when the server tries to send early replies but is already past the request deadline. When in recovery, does that metric even make sense? I fear that it's not CPU usage but some other issue (deadlock?) causing the problem.
Maybe I don't understand the normal recovery sequence, but at 12:15:56, when recovery timed out and 4 clients were evicted, why did recovery continue and run into the "still busy" clients? Then it had to wait until 12:51:28 for the 3 additional ones? Why wasn't every non-compliant client evicted at that time so the MDS could move on with its life?
Anyway, I think we need to get a stable fix for |
| Comment by Hongchao Zhang [ 13/Oct/14 ] |
|
Hi Matt, the extra recovery period is caused by VBR (version based recovery). "check_and_start_recovery_timer" will start the first period of recovery; during that period, the recovery thread was waiting in:
tgt_recov D 000000000000000f 0 19764 2 0x00000000
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.914484] ffff881fcaed1da0 0000000000000046 0000000000000000 ffff881fcaed1d64
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.923068] 0000009100000000 ffff88207fc28800 ffff88011c456880 0000000000000400
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.931650] ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.940231] Call Trace:
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.943131] [<ffffffffa07ca620>] ? check_for_clients+0x0/0x70 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.950680] [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.959026] [<ffffffffa07ca310>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.966778] [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.974140] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.982456] [<ffffffffa07d2a90>] target_recovery_thread+0x540/0x1920 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.990824] [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1084.997983] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1085.006277] [<ffffffff8109ab56>] kthread+0x96/0xa0
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1085.011867] [<ffffffff8100c20a>] child_rip+0xa/0x20
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1085.017531] [<ffffffff8109aac0>] ? kthread+0x0/0xa0
Oct 10 12:14:20 atlas-mds1.ccs.ornl.gov kernel: [ 1085.023222] [<ffffffff8100c200>] ? child_rip+0x0/0x20
Oct 10 12:00:56 atlas-mds1.ccs.ornl.gov kernel: [ 280.655680] Lustre: atlas1-MDT0000: Imperative Recovery enabled, recovery window shrunk from 1800-5400 down to 900-2700
Oct 10 12:00:56 atlas-mds1.ccs.ornl.gov kernel: [ 280.661720] Lustre: atlas1-MDT0000: Will be in recovery for at least 15:00, or until 20178 clients reconnect
...
Oct 10 12:15:56 atlas-mds1.ccs.ornl.gov kernel: [ 1181.536843] Lustre: atlas1-MDT0000: recovery is timed out, evict stale exports
Oct 10 12:15:56 atlas-mds1.ccs.ornl.gov kernel: [ 1181.547322] Lustre: atlas1-MDT0000: disconnecting 4 stale clients
After the recovery timer expired, another recovery period started up (VBR – version based recovery), and the recovery thread was then waiting in:
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.578366] tgt_recov D 0000000000000015 0 19764 2 0x00000000
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.586215] ffff881fcaed1da0 0000000000000046 ffff881fcaed1d00 ffff881fcaed1d64
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.594772] ffffc9007f72b470 ffff88402a8af430 0000000000004ed2 0000000000004ece
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.603336] ffff881fcaecb058 ffff881fcaed1fd8 000000000000fbc8 ffff881fcaecb058
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.611889] Call Trace:
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.614787] [<ffffffffa07cea60>] ? check_for_next_transno+0x0/0x590 [ptlrpc]
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.622914] [<ffffffffa07cbc8d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.631242] [<ffffffffa07ca330>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.639246] [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.646589] [<ffffffffa07d2cba>] target_recovery_thread+0x76a/0x1920 [ptlrpc]
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.654880] [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.662034] [<ffffffffa07d2550>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.670330] [<ffffffff8109ab56>] kthread+0x96/0xa0
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.675928] [<ffffffff8100c20a>] child_rip+0xa/0x20
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.681593] [<ffffffff8109aac0>] ? kthread+0x0/0xa0
Oct 10 12:44:21 atlas-mds1.ccs.ornl.gov kernel: [ 2887.687255] [<ffffffff8100c200>] ? child_rip+0x0/0x20
Oct 10 12:45:20 atlas-mds1.ccs.ornl.gov kernel: [ 2946.578608] Lustre: atlas1-MDT0000: recovery is timed out, evict stale exports
Oct 10 12:45:20 atlas-mds1.ccs.ornl.gov kernel: [ 2946.588579] Lustre: atlas1-MDT0000: disconnecting 1 stale clients
The load of the system should not be high, for there is only one thread processing these replay requests, and the count of the following messages is large:
Oct 10 12:16:27 atlas-mds1.ccs.ornl.gov kernel: [ 1212.683035] Lustre: 16227:0:(service.c:1304:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-62s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff88404991d000 x1481333400177244/t0(0) o400->dbcda88f-598d-9915-c1a2-3272067b8e42@18919@gni100:0/0 lens 224/0 e 1 to 0 dl 1412957725 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
Oct 10 12:16:27 atlas-mds1.ccs.ornl.gov kernel: [ 1212.718387] Lustre: 16227:0:(service.c:1304:ptlrpc_at_send_early_reply()) Skipped 6595 previous similar messages |
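|
If the early-reply deadline warnings are a concern, the adaptive-timeout margins can be inspected and adjusted with lctl; a minimal sketch, where the value 10 is purely an illustration and not a recommendation:
[root@atlas-mds1 ~]# lctl get_param at_early_margin at_min at_max
[root@atlas-mds1 ~]# lctl set_param at_early_margin=10 |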
| Comment by Jason Hill (Inactive) [ 16/Oct/14 ] |
|
After talking with James Nunez on the call today, some more background information might be useful. There were 2 outages on Friday 10/10. The first was that the MDS rebooted spuriously with no known cause. After we returned to service, more hardware validation and debugging occurred and a hardware fault was found. A second outage was taken to swap the MDS hardware (not the MDT). The first outage had ~1.5 hours between when the MDS became unresponsive and when it became responsive again. The second outage was no more than 20 minutes. The first outage had the issue described in this ticket; the second outage recovered successfully and cleanly in a much shorter time (less than 30 minutes). |
| Comment by Jinshan Xiong (Inactive) [ 16/Oct/14 ] |
|
30 minutes was pretty good for this size of cluster, in my personal opinion. |
| Comment by Jason Hill (Inactive) [ 16/Oct/14 ] |
|
Jinshan: We are happy with the second outage's performance. The 1.5 hours for the first outage is where we have problems. |
| Comment by Jason Hill (Inactive) [ 16/Oct/14 ] |
|
We're currently in the holding pattern of the MDS having 0s remaining on the recovery timer but not exiting recovery.
status: RECOVERING
Oct 16 17:05:17 atlas-mds3.ccs.ornl.gov kernel: [ 4769.501597] Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound). |
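|
A minimal way to keep an eye on this from the MDS, assuming the usual proc layout (the target name below is an example):
[root@atlas-mds3 ~]# watch -n 1 "lctl get_param -n mdt.atlas2-MDT0000.recovery_status" |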
| Comment by Matt Ezell [ 16/Oct/14 ] |
|
I just uploaded some debug logs to your FTP server: |
| Comment by Hongchao Zhang [ 17/Oct/14 ] |
|
According to the log "atlas-mds3-recovery-20141016-all.gz", there is one client that failed to connect during the first phase of recovery (900s).
Hi Matt, the recovery speed depends on the clients: given the same "sys.timeout" parameter, if some clients fail to connect for recovery, the whole recovery will take longer to complete. |
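|
For context, the timeout in question can be read on a server and, if needed, changed persistently from the MGS; a hedged sketch, where the value 300 and the MGS hostname are only examples:
[root@atlas-mds3 ~]# lctl get_param timeout
[root@mgs ~]# lctl conf_param atlas2.sys.timeout=300 |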
| Comment by Matt Ezell [ 17/Oct/14 ] |
|
We run with:
[root@atlas-mds3 ~]# cat /proc/sys/lnet/debug
ioctl neterror warning error emerg ha config console
[root@atlas-mds3 ~]# cat /proc/sys/lnet/printk
warning error emerg console
In the previous scenario, we only provided the output of what printk()'ed. For the more recent instance, I used 'lctl dk' regularly to create the logs I provided. I think for diagnosing recovery, the D_HA messages are helpful.
In both instances, we had hardware issues that caused the MDS to go down. I think we followed the same procedures when we tried to bring them back up. It's possible that in the most recent instance there were still some lingering hardware issues.
Is it normal for the stale client to be reported as just a UUID and not a NID?
atlas2-MDT0000: disconnect stale client 57ffbae2-6467-b467-fad5-58b51832006c@<unknown>
atlas2-MDT0000: EVICTING ffff881eaf853800 57ffbae2-6467-b467-fad5-58b51832006c (no nid) 2 (0 0 0) 0 0 1 0: (null) 146091266917
After the fact now, can we tell which node that was? (Looking in the exports now, that UUID doesn't exist.) I'd like to see what messages it logged. |
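|
For future incidents, the D_HA recovery messages can be captured in the kernel debug buffer and dumped periodically; a minimal sketch, where the buffer size and output file name are just examples:
[root@atlas-mds3 ~]# lctl set_param debug=+ha
[root@atlas-mds3 ~]# lctl set_param debug_mb=512
[root@atlas-mds3 ~]# lctl dk /tmp/mds-recovery-$(date +%s).log |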
| Comment by James A Simmons [ 17/Oct/14 ] |
|
On our smaller scale test bed system applying the patch for |
| Comment by Peter Jones [ 28/Nov/14 ] |
|
James Are you comfortable to close this as a duplicate of Peter |
| Comment by James A Simmons [ 01/Dec/14 ] |
|
The cause of our recovery issues was three things. They are |
| Comment by James A Simmons [ 15/Dec/14 ] |
|
Today we tested the latest 2.5 lustre code with the following patches:
with 500 client nodes. Recovery completely failed to complete. After an hour and 22 minutes we gave up and ended recovery. During recovery we lost an OSS node, for which I attached the Lustre log it dumped. We also have a core I can post from that OSS as well. |
| Comment by James A Simmons [ 16/Dec/14 ] |
|
Some more info from today's testing. The failure to recover occurred when both the MDS and an OSS were failed over. If we failed over just the MDS or an OSS, recovery would complete. When we did the second round of testing with a single server node, we noticed that IR was reported as disabled even though we have no non-IR clients. We checked that on the MGS. |
| Comment by Jinshan Xiong (Inactive) [ 16/Dec/14 ] |
|
Does "single server node" mean that the MGS was also restarted in the test? |
| Comment by James A Simmons [ 16/Dec/14 ] |
|
No. Only the MDS and OSS were restarted. |
| Comment by James A Simmons [ 30/Dec/14 ] |
|
We did another test run for recovery in the case where both the MDS and OSS fail. I collected logs and placed them at ftp.whamcloud.com/uploads/ |
| Comment by Hongchao Zhang [ 05/Jan/15 ] |
|
As per the log "dump_atlas-tds-mds1-after-recovery.log", 3 out of 4 clients completed the recovery at the MDT:
00010000:02000000:13.0:1419964653.561987:0:15786:0:(ldlm_lib.c:1392:target_finish_recovery()) atlastds-MDT0000: Recovery over after 30:00, of 4 clients 3 recovered and 1 was evicted.
Which nodes does the client log "client-dump.log" contain? No eviction record was found in this log. Btw, do you use 4 clients and a separate MGS in this test? And could you please attach the console/sys logs along with
Thanks! |
| Comment by James A Simmons [ 05/Jan/15 ] |
|
The OSS reconnected to the MDS, but none of the clients ever reconnected. The clients appeared stuck. The client logs are from the client nodes we used. As for the configuration, the MGS is a standalone node and we tested with 4 nodes. Will grab the logs. |
| Comment by James A Simmons [ 05/Jan/15 ] |
|
Here you go. These are the logs from the clients and servers. |
| Comment by Hongchao Zhang [ 07/Jan/15 ] |
|
Is there only one Lustre client at 10.38.144.11 in this configuration? Are these logs from the same failover test above?
[ 2267.379541] Lustre: atlastds-MDT0000: Will be in recovery for at least 30:00, or until 1 client reconnects
Dec 29 14:31:02 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2267.409294] Lustre: atlastds-MDT0000: Denying connection for new client 3ae0ecec-84ef-cf8f-c128-51873c53d1ad (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:59
Dec 29 14:31:08 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2272.910080] Lustre: atlastds-MDT0000: Denying connection for new client 5116891d-0ace-dffd-7497-218db0b23e98 (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:54
The MDT & OSSs are waiting for the client to reconnect for recovery, but it somehow failed to reconnect and seems to connect as a new client. Could you please attach the console & sys logs of the client? Thanks! |
| Comment by James A Simmons [ 07/Jan/15 ] |
|
Here are the kern logs for a client and a router. If you want the logs for all the clients let me know. |
| Comment by Hongchao Zhang [ 08/Jan/15 ] |
|
Is the client (rhea513, 10.38.146.45) the Lustre client connected to "atlastds"?
At Client (rhea513, 10.38.146.45):
Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: client wants to enable acl, but mdt not!
Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Layout lock feature supported.
Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Mounted atlastds-client
Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879033/real 1419879033] req@ffff881000328000 x1487757986577588/t0(0) o400->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 224/224 e 0 to 1 dl 1419879600 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST0023-osc-ffff880820a28c00: Connection to atlastds-OST0023 (at 10.36.226.67@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST002e-osc-ffff880820a28c00: Connection to atlastds-OST002e (at 10.36.226.70@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: LustreError: 166-1: MGC10.36.226.79@o2ib: Connection to MGS (at 10.36.226.79@o2ib) was lost; in progress operations using this service will fail
Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 52 previous similar messages
Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879600/real 1419879600] req@ffff88102a64b800 x1487757986596960/t0(0) o8->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 400/544 e 0 to 1 dl 1419879880 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Dec 29 14:04:54 rhea513.ccs.ornl.gov kernel: Lustre: Unmounted atlastds-client
At MDT (10.36.226.69):
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [ 117.814766] sdbd:
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [ 117.814952] sd 7:0:0:98: [sdaj] Attached SCSI disk
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [ 117.835086] unknown partition table
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [ 117.845364] unknown partition table
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [ 117.864189] sd 6:0:0:71: [sdau] Attached SCSI disk
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [ 117.864257] sd 6:0:0:101: [sdbd] Attached SCSI disk
Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [ 117.970252] device-mapper: multipath round-robin: version 1.0.0 loaded
Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.955421] LNet: HW CPU cores: 16, npartitions: 4
Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.963523] alg: No test for crc32 (crc32-table)
Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.977845] alg: No test for adler32 (adler32-zlib)
Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.988788] alg: No test for crc32 (crc32-pclmul)
Dec 29 14:26:59 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1962.022800] padlock: VIA PadLock Hash Engine not detected.
Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.277906] LNet: Added LNI 10.36.226.69@o2ib [63/2560/0/180]
Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.814213] LNet: Added LNI 10.36.226.69@o2ib200 [63/2560/0/180]
Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1967.054096] Lustre: Lustre: Build Version: 2.5.3-g6158f83-CHANGED-2.6.32-431.29.2.el6.atlas.x86_64
Btw, in this failover test both the MDT and an OSS were rebooted, and recovery will be slow because the OSSs will wait for a time before evicting the client from the MDT, which won't reconnect to recover. |
| Comment by James A Simmons [ 08/Jan/15 ] |
|
We have had success with MDS+OSS failing over at the same time in the past. We would really like to have that functionality restored. |
| Comment by Oleg Drokin [ 08/Jan/15 ] |
|
I think what Hongchao is trying to say is that when the MDS and OST both go down, then since the MDS is a client of the OST, the OST recovery can never complete because of the missing client. But on the other hand, we took two steps to help with this. First, the MDS client UUID should always be the same, so even after a restart it should still be allowed to reconnect as the old known client (this assumes it actually got up and into a reconnecting state in time for OST recovery; if your MDS takes ages to reboot, for example, it might miss this window, esp. if it's a shortened window thanks to IR). Second, we have VBR to deal with missing clients during recovery, which is especially easy with the MDT client, since it never has any outstanding uncommitted transactions to replay. |
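|
Whether the MDS export made it back into an OST's recovery window can be checked from the OSS side; a minimal sketch, where the host and target names are illustrative only:
[root@atlas-tds-oss6 ~]# lctl get_param obdfilter.atlastds-OST*.recovery_status
[root@atlas-tds-oss6 ~]# lctl get_param obdfilter.atlastds-OST*.exports.*.uuid | grep -i mdt
The first shows connected_clients and completed_clients per OST; the second lists the export UUIDs, where the MDT's own export should appear if it reconnected. |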
| Comment by Jian Yu [ 08/Jan/15 ] |
|
In the Lustre test suite, the following sub-tests in insanity.sh test failing the MDS and OSS at the same time:
run_test 2 "Second Failure Mode: MDS/OST `date`"
run_test 4 "Fourth Failure Mode: OST/MDS `date`"
The basic test steps for sub-test 2 are: fail MDS, fail OSS, start OSS, start MDS. And the basic test steps for sub-test 4 are: fail OSS, fail MDS, start OSS, start MDS.
Here is the insanity test report for Lustre b2_5 build #107: https://testing.hpdd.intel.com/test_sets/bfd812b0-8a4d-11e4-a10b-5254006e85c2 |
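|
For anyone reproducing this on a test system, those sub-tests can be run individually from lustre/tests with the usual test-framework environment (cfg file, failover setup) already configured; a hedged sketch:
[root@testnode lustre/tests]# ONLY="2 4" sh insanity.sh |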
| Comment by Hongchao Zhang [ 09/Jan/15 ] |
|
Sorry for the misunderstanding! Yes, Lustre supports failing over both the MDT and OSS. |
| Comment by James A Simmons [ 14/Jan/15 ] |
|
Testing failover, and we see the following with just the MDS being failed over:
Every 1.0s: cat recovery_status    Wed Jan 14 14:11:11 2015
status: COMPLETE
All clients are 2.5+, so there should be no reason for IR to be disabled. Can you reproduce this problem on your side? On the MGS we see the following during failover:
[root@atlas-tds-mds1 MGC10.36.226.79@o2ib]# cat ir_state
|
| Comment by Hongchao Zhang [ 15/Jan/15 ] |
|
Are both the MGS and MDS failed over in this test? The IR status will be set to IR_STARTUP after the MGS is started and will be changed to IR_FULL after "ir_timeout" seconds. On the client side, imperative_recovery will be marked as "Enabled" if the connection with the server supports recovery. |
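|
The IR state described above can be inspected directly; a minimal sketch, assuming the atlastds file system name and a separate MGS node (the MGS hostname below is a placeholder):
[root@mgs ~]# lctl get_param mgs.MGS.live.atlastds
[root@rhea513 ~]# lctl get_param mgc.*.ir_state
The first shows the MGS-side imperative_recovery_state (including non-IR client counts); the second reads the same ir_state file that was queried with cat earlier, from any node running an MGC. |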
| Comment by James A Simmons [ 15/Jan/15 ] |
|
No the MGS is left up. We failed over the MDS and OSS together. |
| Comment by Hongchao Zhang [ 16/Jan/15 ] |
|
Is the failover mode the same for both tests, the one on Dec 31, 2014 and the one on Jan 14, 2015? That is, is there a separate node |
| Comment by Jian Yu [ 23/Jan/15 ] |
|
With the patch for |
| Comment by Jian Yu [ 03/Feb/15 ] |
|
Large scale testing passed. Hi James, can we close this ticket now? |
| Comment by Jian Yu [ 05/Feb/15 ] |
|
Close this ticket as a duplicate of |