[LU-2965] MDS evicted OST after attempting recovery. Created: 14/Mar/13 Updated: 23/Apr/13 Resolved: 23/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Zhenyu Xu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | LB | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7230 |
| Comments |
| Comment by James A Simmons [ 14/Mar/13 ] |
|
During the large-scale test we lost an OSS due to another bug. The OSS came back and entered recovery, but at some point during the recovery process the MDS evicted the OSS imports. This resulted in clients then being evicted. |
| Comment by Peter Jones [ 15/Mar/13 ] |
|
Bobijam, could you please look into this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 18/Mar/13 ] |
|
I saw these LNet error messages in the OSS log:

Mar 8 21:02:19 widow-oss8b4 kernel: [ 680.829820] LNet: No route to 12345-15903@gni via 10.36.227.92@o2ib (all routers down)
Mar 8 21:02:19 widow-oss8b4 kernel: [ 680.860759] LNet: No route to 12345-2632@gni via 10.36.227.92@o2ib (all routers down)
Mar 8 21:02:19 widow-oss8b4 kernel: [ 680.876671] LNet: Skipped 12 previous similar messages
Mar 8 21:02:20 widow-oss8b4 kernel: [ 681.633701] LNet: No route to 12345-19172@gni via 10.36.227.92@o2ib (all routers down)
Mar 8 21:02:20 widow-oss8b4 kernel: [ 681.649998] LNet: Skipped 1 previous similar message
Mar 8 21:02:22 widow-oss8b4 kernel: [ 683.930515] LNet: No route to 12345-16216@gni via 10.36.227.92@o2ib (all routers down)
Mar 8 21:02:22 widow-oss8b4 kernel: [ 683.946526] LNet: Skipped 3 previous similar messages
Mar 8 21:02:25 widow-oss8b4 kernel: [ 686.430669] LNet: No route to 12345-245@gni via 10.36.227.92@o2ib (all routers down)
Mar 8 21:02:25 widow-oss8b4 kernel: [ 686.446617] LNet: Skipped 18 previous similar messages
Mar 8 21:02:29 widow-oss8b4 kernel: [ 691.152386] LNet: No route to 12345-1391@gni via 10.36.227.92@o2ib (all routers down)
Mar 8 21:02:29 widow-oss8b4 kernel: [ 691.175839] LNet: Skipped 10 previous similar messages
Mar 8 21:03:27 widow-oss8b4 kernel: [ 749.157594] LustreError: 11-0: routed1-MDT0000-lwp-OST00eb: Communicating with 10.36.227.211@o2ib, operation mds_connect failed with -114.
Mar 8 21:03:27 widow-oss8b4 kernel: [ 749.198062] LustreError: Skipped 5 previous similar messages
Mar 8 21:05:57 widow-oss8b4 kernel: [ 899.157523] LustreError: 11-0: routed1-MDT0000-lwp-OST008b: Communicating with 10.36.227.211@o2ib, operation mds_connect failed with -114.

Was the network on this OSS working during the recovery? |
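A minimal sketch of how these "No route to" messages could be tallied per destination and per router when triaging a console log like the one above; the script and the file name oss-console.log are illustrative assumptions, not anything referenced in this ticket:

```python
# Hypothetical helper: tally LNet "No route to <nid> via <router>" console
# messages so it is easy to see which destinations and routers are affected.
# The log path "oss-console.log" is a placeholder.
import re
from collections import Counter

NO_ROUTE = re.compile(r"LNet: No route to (\S+) via (\S+) \(all routers down\)")

def tally_no_route(path):
    dests, routers = Counter(), Counter()
    with open(path) as log:
        for line in log:
            m = NO_ROUTE.search(line)
            if m:
                dests[m.group(1)] += 1    # destination NID that was unreachable
                routers[m.group(2)] += 1  # router reported as down
    return dests, routers

if __name__ == "__main__":
    dests, routers = tally_no_route("oss-console.log")
    print("unreachable destinations:", dests.most_common(10))
    print("routers reported down:  ", routers.most_common(10))
```

If the unreachable destinations all belong to other file systems on the fabric, that would support the explanation given in the next comment.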
| Comment by James A Simmons [ 22/Mar/13 ] |
|
Yes, it was working. The clients managed to reconnect with no problems. I believe those down routes are for the other file systems in the fabric that we took offline. |
| Comment by Zhenyu Xu [ 23/Mar/13 ] |
|
The MDS log shows that the OSS reconnection was using a different connection cookie, and the MDS rejected it. I'm still investigating what could cause this situation. When you lost the OSS, what happened to it, and what did you do to this OSS before you tried to mount its OST?
|
| Comment by Zhenyu Xu [ 25/Mar/13 ] |
|
James, what is your initial issue? Is it that a client should not be evicted from the OSS? The following is the situation as I interpret it from the logs. When the OSS reconnected to the MDS, a stale export remained on the MDS, so the MDS evicted the OST (the MDS then clears the stale export, and the OST's next reconnection to the MDS should be OK; this is a normal situation). And in the OSS logs,
the client got evicted by the OSS because the OSS had not heard a message from the client for a long time; this is a problem outside of the recovery (the client's network could be in trouble). The other part of the OSS log
indicates that OST002b could not connect to the MDS (for the same reason I described, the MDS had a stale export for OST002b), so the OSS failed to mount OST002b and then had to umount it. |
| Comment by James A Simmons [ 25/Mar/13 ] |
|
The issue was the drop-off of clients after recovery was "done". As you can see, all clients did recover but then started to get evicted. |
| Comment by James A Simmons [ 25/Mar/13 ] |
|
Client that was evicted after recovery. This log covers the node for the whole test shot, so for the eviction problem look around the 21:30 area. |
| Comment by Zhenyu Xu [ 25/Mar/13 ] |
|
Can you specify a client example for me? |
| Comment by James A Simmons [ 25/Mar/13 ] |
|
I attached a log for one of the nodes that was evicted. Due to these problems, recovery took the full 30 minutes instead of finishing quickly. |
| Comment by Zhenyu Xu [ 26/Mar/13 ] |
|
Can you specify the timestamp of the troubled recovery process? From the c4-0c0s5n0.log
I only see that the client restored its connection to OST00[eb|8b] at 2013-03-08 21:06:04, lost its connection with the MDS 24 minutes later, and soon restored it again. Then at 22:32 it was umounted (manually, I think). |
| Comment by James A Simmons [ 28/Mar/13 ] |
|
We saw the problems around 21:07 for the client evictions. The MDS had these errors, which led to the evictions:

Mar 8 21:12:54 widow-mds1 kernel: [27048.704079] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 3167@gni ns: mdt-ffff880e55b20000 lock: ffff880c3f748600/0x4453d516b35bd568 lrc: 3/0,0 mode: CR/CR res: 8589939214/48084 bits 0x9 rrc: 14 type: IBT flags: 0x200000000020 nid: 3167@gni remote: 0x1f865dc509f72511 expref: 11 pid: 28552 timeout: 4321715043 lvb_type: 0

To make clear what happened, I have attached the MDS logs for the entire test shot. |
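A minimal sketch, under the assumption of standard kernel console formatting like the line quoted above, of pulling the "lock callback timer expired" evictions out of an MDS log so the evicted NIDs and timeouts can be listed in one place; the path mds-console.log is a placeholder:

```python
# Hypothetical helper: list evictions reported by waiting_locks_callback()
# in an MDS console log. "mds-console.log" is a placeholder path.
import re

EVICT = re.compile(
    r"^(?P<stamp>\w+ +\d+ \d+:\d+:\d+) .*waiting_locks_callback.*"
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)"
)

def find_evictions(path):
    hits = []
    with open(path) as log:
        for line in log:
            m = EVICT.search(line)
            if m:
                hits.append((m.group("stamp"), int(m.group("secs")), m.group("nid")))
    return hits

if __name__ == "__main__":
    for stamp, secs, nid in find_evictions("mds-console.log"):
        print(f"{stamp}: evicted {nid} after {secs}s without a lock callback reply")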
| Comment by Zhenyu Xu [ 01/Apr/13 ] |
|
The OSS completed recovery around 21:06, but 25 minutes later both the MDS and the OSS report that they lost connection with some clients. OSS log:

Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.540784] Lustre: routed1-OST00eb: Recovery over after 7:32, of 8982 clients 8982 recovered and 0 were evicted.

None of the logs provided show details during those 25 minutes. Could something have happened to the clients' network? |
| Comment by Oleg Drokin [ 02/Apr/13 ] |
|
I just confirmed with James that the actual problem is "even though recovery completed just fine after a server failure, clients were evicted later for some reason". Given the timing of 1354 seconds (that's exactly how much time passed since recovery), it appears the clients were not pinging? |
| Comment by James A Simmons [ 02/Apr/13 ] |
|
Nope. I wasn't testing suppressed pings at that time. I started the suppress-pings test 2 hours after all the recovery problems happened. |
| Comment by James A Simmons [ 02/Apr/13 ] |
|
Looking at the logs, at 175 seconds after every client recovered, some stopped pinging the servers. |
| Comment by Li Wei (Inactive) [ 03/Apr/13 ] |
|
James, according to the oss-test-shot.log, only one client was evicted by the OSS around 21:30: "61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib)". It seems c4-0c0s5n0 was "1546@gni", not the o2ib client. Is the log on the o2ib client still available? Also, how did you infer that some clients had stopped pinging the servers, please? |
| Comment by James A Simmons [ 03/Apr/13 ] |
|
10.36.227.198 was our 2.4 non-Cray client. I attached its log (spare.log) here. |
| Comment by Oleg Drokin [ 03/Apr/13 ] |
|
I guess the assumption that clients stopped pinging comes from the fact that they were evicted for not pinging. |
| Comment by Li Wei (Inactive) [ 08/Apr/13 ] |
|
From spare.log (why is it called "spare", by the way?):
The client completed recoveries at least with OST00eb and OST01ab, but restarted for some reason. OST00eb last heard from the client at around 21:08:53, two to three minutes after the recovery finished. If this was not something else, the timing was close to the 150s ping interval, suggesting the request heard was probably a ping. The next ping should have happened 150s to 300s later, which corresponds to 21:11:23 and 21:13:53. Could this eviction be merely because the client was simply not remounted after the restart? Is any log available after 21:13:34? |
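A worked check of the ping-window arithmetic above; the only inputs taken from the comment are the 150 s ping interval and the 21:08:53 last-heard time, everything else is plain arithmetic:

```python
# Sketch of the timing argument: if the last request heard was a ping at
# 21:08:53 and the ping interval is 150 s (as stated in the comment), the
# next ping should arrive between one and two intervals later.
from datetime import datetime, timedelta

PING_INTERVAL = timedelta(seconds=150)          # interval cited in the comment
last_heard = datetime(2013, 3, 8, 21, 8, 53)    # OST00eb last heard the client

earliest_next_ping = last_heard + PING_INTERVAL      # 21:11:23
latest_next_ping = last_heard + 2 * PING_INTERVAL    # 21:13:53

print("next ping expected between",
      earliest_next_ping.time(), "and", latest_next_ping.time())
```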
| Comment by Jodi Levi (Inactive) [ 08/Apr/13 ] |
|
Reducing priority until more information is available to understand the issue. |
| Comment by James A Simmons [ 23/Apr/13 ] |
|
For our last test shot IR worked flawlessly. I think we can close this ticket. If we encounter the bug again this ticket can be reopened. |
| Comment by Peter Jones [ 23/Apr/13 ] |
|
Thanks James - that is good news! |