[LU-2965] MDS evicted OST after attempting recovery. Created: 14/Mar/13  Updated: 23/Apr/13  Resolved: 23/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: LB

Attachments: c4-0c0s5n0.log, client.log, mds-test-shot-whole.log, mds-test-shot.log, oss-test-shot.log, spare.log
Severity: 3
Rank (Obsolete): 7230

 Comments   
Comment by James A Simmons [ 14/Mar/13 ]

During the large-scale test we lost an OSS due to another bug. The OSS came back and entered recovery, but at some point during the recovery process the MDS evicted the OSS imports. This then resulted in clients being evicted.
I have the logs from a client, the OSS, and the MDS.

Comment by Peter Jones [ 15/Mar/13 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 18/Mar/13 ]

I saw these LNet error messages in the OST log:

Mar  8 21:02:19 widow-oss8b4 kernel: [  680.829820] LNet: No route to 12345-15903@gni via 10.36.227.92@o2ib (all routers down)
Mar  8 21:02:19 widow-oss8b4 kernel: [  680.860759] LNet: No route to 12345-2632@gni via 10.36.227.92@o2ib (all routers down)
Mar  8 21:02:19 widow-oss8b4 kernel: [  680.876671] LNet: Skipped 12 previous similar messages
Mar  8 21:02:20 widow-oss8b4 kernel: [  681.633701] LNet: No route to 12345-19172@gni via 10.36.227.92@o2ib (all routers down)
Mar  8 21:02:20 widow-oss8b4 kernel: [  681.649998] LNet: Skipped 1 previous similar message
Mar  8 21:02:22 widow-oss8b4 kernel: [  683.930515] LNet: No route to 12345-16216@gni via 10.36.227.92@o2ib (all routers down)
Mar  8 21:02:22 widow-oss8b4 kernel: [  683.946526] LNet: Skipped 3 previous similar messages
Mar  8 21:02:25 widow-oss8b4 kernel: [  686.430669] LNet: No route to 12345-245@gni via 10.36.227.92@o2ib (all routers down)
Mar  8 21:02:25 widow-oss8b4 kernel: [  686.446617] LNet: Skipped 18 previous similar messages
Mar  8 21:02:29 widow-oss8b4 kernel: [  691.152386] LNet: No route to 12345-1391@gni via 10.36.227.92@o2ib (all routers down)
Mar  8 21:02:29 widow-oss8b4 kernel: [  691.175839] LNet: Skipped 10 previous similar messages
Mar  8 21:03:27 widow-oss8b4 kernel: [  749.157594] LustreError: 11-0: routed1-MDT0000-lwp-OST00eb: Communicating with 10.36.227.211@o2ib, operation mds_connect failed with -114.
Mar  8 21:03:27 widow-oss8b4 kernel: [  749.198062] LustreError: Skipped 5 previous similar messages
Mar  8 21:05:57 widow-oss8b4 kernel: [  899.157523] LustreError: 11-0: routed1-MDT0000-lwp-OST008b: Communicating with 10.36.227.211@o2ib, operation mds_connect failed with -114.

Was the network on this OSS working during the recovery?

Comment by James A Simmons [ 22/Mar/13 ]

Yes, it was working. The clients managed to reconnect with no problems. I believe that those down routes are for the other file systems in the fabric that we took offline.

Comment by Zhenyu Xu [ 23/Mar/13 ]

The MDS log shows that the OSS reconnection used a different connection cookie, and the MDS rejected it. I'm still investigating what could cause this situation. When you lost the OSS, what happened to it, and what did you do to this OSS before you tried to mount its OSTs?

already connected client routed1-MDT0000-lwp-OST008b_UUID (at 10.36.227.92@o2ib) with handle 0x1e3d407c957c8e7b. Rejecting client with the same UUID trying to reconnect with handle 0xd64e61dedb5d3691
Mar 8 20:58:25 widow-mds1 kernel: [26179.518445] format at ldlm_lib.c:999:target_handle_connect doesn't end in newline
Mar 8 20:58:25 widow-mds1 kernel: [26179.518448] Lustre: routed1-MDT0000: Rejecting reconnect from the known client routed1-MDT0000-lwp-OST008b_UUID (at 10.36.227.92@o2ib) because it is indicating it is a new client
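
For illustration only, here is a toy sketch of the check behind that rejection (this is not the actual target_handle_connect() code in ldlm_lib.c; the class and function names below are made up): the target already holds an export for this UUID under one connection cookie, and a reconnect that presents a different cookie looks like a brand-new client, so it is refused until the stale export is cleaned up.

# Toy illustration of the reconnect check described above. Names
# (Export, handle_connect) are invented for this sketch; the real logic
# lives in target_handle_connect() in ldlm_lib.c.

class Export:
    def __init__(self, uuid, cookie):
        self.uuid = uuid          # e.g. "routed1-MDT0000-lwp-OST008b_UUID"
        self.cookie = cookie      # connection handle, e.g. 0x1e3d407c957c8e7b

known_exports = {}                # uuid -> Export held by the target (MDS)

def handle_connect(uuid, cookie):
    exp = known_exports.get(uuid)
    if exp is None:
        known_exports[uuid] = Export(uuid, cookie)
        return "connected"
    if cookie == exp.cookie:
        return "reconnected"      # same connection handle, normal reconnect
    # Known UUID but a different cookie: the peer claims to be a new
    # client, so it is rejected until the stale export is cleaned up.
    return "rejected: known client presenting a new handle"

print(handle_connect("routed1-MDT0000-lwp-OST008b_UUID", 0x1e3d407c957c8e7b))
print(handle_connect("routed1-MDT0000-lwp-OST008b_UUID", 0xd64e61dedb5d3691))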

Comment by Zhenyu Xu [ 25/Mar/13 ]

James,

What is your initial issue? Is it that a client should not have been evicted from the OSS? The following is the situation as I interpret it from the logs.

When the OSS reconnected to the MDS, there was a stale export remaining on the MDS, so the MDS evicted the OST (the MDS then clears the stale export, and the OST's next reconnection to the MDS should be fine; this is a normal situation).

And in the OSS logs:

Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.540784] Lustre: routed1-OST00eb: Recovery over after 7:32, of 8982 clients 8982 recovered and 0 were evicted.
Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.569906] Lustre: Skipped 5 previous similar messages
Mar 8 21:31:27 widow-oss8b4 kernel: [ 2428.640322] Lustre: routed1-OST00eb: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1354 seconds. I think it's dead, and I am evicting it. exp ffff880397414000, cur 1362796287 expire 1362795387 last 1362794933
Mar 8 21:31:30 widow-oss8b4 kernel: [ 2432.006547] Lustre: routed1-OST014b: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1357 seconds. I think it's dead, and I am evicting it. exp ffff88039ae54000, cur 1362796290 expire 1362795390 last 1362794933
Mar 8 21:31:32 widow-oss8b4 kernel: [ 2433.311741] Lustre: routed1-OST008b: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1358 seconds. I think it's dead, and I am evicting it. exp ffff880398ddd400, cur 1362796291 expire 1362795391 last 1362794933

The client was evicted by the OSS because the OSS had not heard a message from it for a long time; this is a problem outside of the recovery itself (the client's network could be in trouble).
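
As a quick cross-check of those numbers, a minimal sketch (assuming only the syslog wording shown above) that pulls cur/expire/last out of such eviction lines and reports how long the export had been silent:

# Minimal sketch: extract cur/expire/last from the eviction lines above
# and report how long each export had been silent. Assumes the exact
# "haven't heard from client" wording shown in this log.
import re

PATTERN = re.compile(
    r"haven't heard from client (?P<uuid>\S+) \(at (?P<nid>\S+)\).*"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)

def check_eviction(line):
    m = PATTERN.search(line)
    if not m:
        return None
    cur, expire, last = (int(m.group(k)) for k in ("cur", "expire", "last"))
    return {
        "nid": m.group("nid"),
        "silent_for": cur - last,        # matches the "in N seconds" value
        "past_expiry": cur - expire,     # how far past the eviction deadline
    }

line = ("Lustre: routed1-OST00eb: haven't heard from client "
        "61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1354 "
        "seconds. ... cur 1362796287 expire 1362795387 last 1362794933")
print(check_eviction(line))
# -> silent_for 1354s (last contact ~21:08:53), 900s past the expiry window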

The other part of the OSS log:

Mar 8 22:43:04 widow-oss8b4 kernel: [ 6726.231467] LustreError: 2212:0:(obd_mount.c:2115:server_put_super()) routed1-OST002b: failed to disconnect lwp. (rc=-110)
Mar 8 22:43:04 widow-oss8b4 kernel: [ 6726.262375] Lustre: Failing over routed1-OST002b
Mar 8 22:43:05 widow-oss8b4 kernel: [ 6726.700008] Lustre: server umount routed1-OST002b complete

This part indicates that OST002b could not connect to the MDS (for the same reason I described: the MDS held a stale export for OST002b), so the OSS failed to mount OST002b and then had to unmount it.

Comment by James A Simmons [ 25/Mar/13 ]

The issue was the drop-off of clients after recovery was "done". As you can see, all clients did recover, but then they started to get evicted.

Comment by James A Simmons [ 25/Mar/13 ]

A client that was evicted after recovery. This log covers the node for the whole test shot, so for the eviction problem look at the 21:30 area.

Comment by Zhenyu Xu [ 25/Mar/13 ]

Can you specify a client example for me?

Comment by James A Simmons [ 25/Mar/13 ]

I attached a log for one of the nodes that was evicted. Due to these problems recovery took the full 30 minutes instead of finishing quickly.

Comment by Zhenyu Xu [ 26/Mar/13 ]

Can you specify the timestamp of the troubled recovery process? From the c4-0c0s5n0.log:

[2013-03-08 21:03:28][c4-0c0s5n0]Lustre: 7635:0:(client.c:1866:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[2013-03-08 21:06:04][c4-0c0s5n0]Lustre: routed1-OST00eb-osc-ffff88081e7e4000: Connection restored to routed1-OST00eb (at 10.36.227.92@o2ib)
[2013-03-08 21:06:04][c4-0c0s5n0]Lustre: routed1-OST008b-osc-ffff88081e7e4000: Connection restored to routed1-OST008b (at 10.36.227.92@o2ib)
[2013-03-08 21:28:08][c4-0c0s5n0]LustreError: 11-0: routed1-MDT0000-mdc-ffff88081e7e4000: Communicating with 10.36.227.211@o2ib, operation obd_ping failed with -107.
[2013-03-08 21:28:08][c4-0c0s5n0]Lustre: routed1-MDT0000-mdc-ffff88081e7e4000: Connection to routed1-MDT0000 (at 10.36.227.211@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[2013-03-08 21:28:08][c4-0c0s5n0]LustreError: 167-0: routed1-MDT0000-mdc-ffff88081e7e4000: This client was evicted by routed1-MDT0000; in progress operations using this service will fail.
[2013-03-08 21:28:08][c4-0c0s5n0]Lustre: routed1-MDT0000-mdc-ffff88081e7e4000: Connection restored to routed1-MDT0000 (at 10.36.227.211@o2ib)

I only see that the client restored its connection to OST00[eb|8b] at 2013-03-08 21:06:04, lost its connection with the MDS about 22 minutes later, and restored it again soon after. Then at 22:32 it was unmounted (manually, I think).

Comment by James A Simmons [ 28/Mar/13 ]

We saw the problems around 21:07 for the client evictions. The MDS had these errors, which led to the evictions:

Mar 8 21:12:54 widow-mds1 kernel: [27048.704079] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 3167@gni ns: mdt-ffff880e55b20000 lock: ffff880c3f748600/0x4453d516b35bd568 lrc: 3/0,0 mode: CR/CR res: 8589939214/48084 bits 0x9 rrc: 14 type: IBT flags: 0x200000000020 nid: 3167@gni remote: 0x1f865dc509f72511 expref: 11 pid: 28552 timeout: 4321715043 lvb_type: 0

To make clear what happened I have attached the MDS logs for the entire test shot.

Comment by Zhenyu Xu [ 01/Apr/13 ]

The OSS completed recovery around 21:06, but 25 minutes later the MDS and OSS both report that they lost connection with some clients.

OSS log:

Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.540784] Lustre: routed1-OST00eb: Recovery over after 7:32, of 8982 clients 8982 recovered and 0 were evicted.
Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.569906] Lustre: Skipped 5 previous similar messages
Mar 8 21:31:27 widow-oss8b4 kernel: [ 2428.640322] Lustre: routed1-OST00eb: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1354 seconds. I think it's dead, and I am evicting it. exp ffff880397414000, cur 1362796287 expire 1362795387 last 1362794933

None of the provided logs show details during those 25 minutes. Could something have happened to the clients' network?

Comment by Oleg Drokin [ 02/Apr/13 ]

I just confirmed with James that the actual problem is "even though recovery completed just fine after a server failure, clients were evicted later for some reason".

Given the timing of 1354 seconds - that's exactly how much time passed since recovery - it appears the clients were not pinging?
Were you already playing with suppress pinging at the time, I wonder?

Comment by James A Simmons [ 02/Apr/13 ]

Nope, I didn't test suppress pings at that time. I started the suppress-pings test two hours after all the recovery problems happened.

Comment by James A Simmons [ 02/Apr/13 ]

Looking at the logs, some clients stopped pinging the servers 175 seconds after every client had recovered.

Comment by Li Wei (Inactive) [ 03/Apr/13 ]

James,

According to the oss-test-shot.log, only one client was evicted by the OSS around 21:30: "61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib)". It seems c4-0c0s5n0 was "1546@gni", not the o2ib client. Is the log on the o2ib client still available?

Also, how did you infer that some clients had stopped pinging the servers, please?

Comment by James A Simmons [ 03/Apr/13 ]

10.36.227.198 was our 2.4 non-Cray client. I have attached its log (spare.log) here.

Comment by Oleg Drokin [ 03/Apr/13 ]

I guess the assumption that the clients stopped pinging comes from the fact that they were evicted for not pinging.

Comment by Li Wei (Inactive) [ 08/Apr/13 ]

From spare.log (why is it called "spare", by the way?):

Mar 8 21:06:01 widow-spare06 kernel: [ 7478.823574] Lustre: routed1-OST00eb-osc-ffff8804319c7000: Connection restored to routed1-OST00eb (at 10.36.227.92@o2ib)
Mar 8 21:06:01 widow-spare06 kernel: [ 7478.845592] Lustre: routed1-OST01ab-osc-ffff8804319c7000: Connection restored to routed1-OST01ab (at 10.36.227.92@o2ib)
Mar 8 21:12:33 widow-spare06 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Mar 8 21:12:33 widow-spare06 kernel: [ 0.000000] Initializing cgroup subsys cpuset
[...]
Mar 8 21:13:34 widow-spare06 kernel: [ 120.446824] ipmi device interface
[EOF]

The client completed recovery at least with OST00eb and OST01ab, but restarted for some reason. OST00eb last heard from the client at around 21:08:53, two to three minutes after the recovery finished. If this was not something else, the timing was close to the 150s ping interval, suggesting the request heard was probably a ping. The next ping should have happened 150s to 300s later, which corresponds to 21:11:23 and 21:13:53. Could this eviction be merely because the client was simply not remounted after the restart? Is any log available after 21:13:34?
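
A quick back-of-the-envelope check of that timing, assuming the 150s ping interval mentioned above (taken from this comment, not read from any configuration):

# Back-of-the-envelope check of the timing above. The 150s ping interval
# is the value assumed in this comment, not read from any configuration.
from datetime import datetime, timedelta

ping = timedelta(seconds=150)

last_heard = datetime(2013, 3, 8, 21, 8, 53)    # OST00eb last heard the client
evicted    = datetime(2013, 3, 8, 21, 31, 27)   # eviction in oss-test-shot.log

# The next ping was due somewhere in (last_heard + 150s, last_heard + 300s].
print("next ping window:", (last_heard + ping).time(), "-",
      (last_heard + 2 * ping).time())                     # 21:11:23 - 21:13:53
print("silent at eviction:", (evicted - last_heard).total_seconds(), "s")  # 1354.0

# spare.log shows imklog restarting at 21:12:33 (inside that window) and ends
# at 21:13:34, so nothing in the attached logs shows the client remounting
# and pinging again after its reboot.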

Comment by Jodi Levi (Inactive) [ 08/Apr/13 ]

Reducing priority until more information is available to understand the issue.

Comment by James A Simmons [ 23/Apr/13 ]

For our last test shot IR worked flawlessly. I think we can close this ticket. If we encounter the bug again this ticket can be reopened.

Comment by Peter Jones [ 23/Apr/13 ]

Thanks James - that is good news!
