Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- Fix Version: None
- Affects Version: Lustre 2.4.0
Attachments
- c4-0c0s5n0.log (33 kB)
- client.log (4 kB)
- mds-test-shot-whole.log (1.27 MB)
- spare.log (869 kB)
Activity
The OSS completed recovery around 21:06, but 25 minutes later both the MDS and the OSS report that they have lost connection with some clients:
Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.540784] Lustre: routed1-OST00eb: Recovery over after 7:32, of 8982 clients 8982 recovered and 0 were evicted.
Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.569906] Lustre: Skipped 5 previous similar messages
Mar 8 21:31:27 widow-oss8b4 kernel: [ 2428.640322] Lustre: routed1-OST00eb: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1354 seconds. I think it's dead, and I am evicting it. exp ffff880397414000, cur 1362796287 expire 1362795387 last 1362794933
None of the provided logs show any detail for those 25 minutes. Could something have happened to the clients' network?
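One way to confirm the gap is to list everything the client console logged in that window. A minimal sketch in Python, assuming the attached c4-0c0s5n0.log keeps the "[YYYY-MM-DD HH:MM:SS][host]" prefix seen in the excerpts quoted later in this ticket:

#!/usr/bin/env python
# Print every console-log line that falls between the end of recovery
# (21:06:02) and the first eviction (21:31:27) on 2013-03-08.
import re
from datetime import datetime

START = datetime(2013, 3, 8, 21, 6, 2)
END = datetime(2013, 3, 8, 21, 31, 27)
STAMP = re.compile(r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]')

with open('c4-0c0s5n0.log') as log:
    for line in log:
        m = STAMP.match(line)
        if m:
            when = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
            if START <= when <= END:
                print(line.rstrip())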
We saw the problems with the client evictions starting around 21:07. The MDS had these errors, which led to the evictions:
Mar 8 21:12:54 widow-mds1 kernel: [27048.704079] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 3167@gni ns: mdt-ffff880e55b20000 lock: ffff880c3f748600/0x4453d516b35bd568 lrc: 3/0,0 mode: CR/CR res: 8589939214/48084 bits 0x9 rrc: 14 type: IBT flags: 0x200000000020 nid: 3167@gni remote: 0x1f865dc509f72511 expref: 11 pid: 28552 timeout: 4321715043 lvb_type: 0
To make it clear what happened, I have attached the MDS log for the entire test shot.
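To pull all of the evictions out of that log at once, a minimal sketch in Python (assuming the attached mds-test-shot-whole.log uses the syslog wording quoted above):

#!/usr/bin/env python
# List every "lock callback timer expired" eviction in the MDS log:
# timestamp, evicted client NID, and how long the callback timer ran.
import re

PATTERN = re.compile(
    r'^(?P<stamp>\w+\s+\d+ \d{2}:\d{2}:\d{2}).*'
    r'lock callback timer expired after (?P<secs>\d+)s: '
    r'evicting client at (?P<nid>\S+)'
)

with open('mds-test-shot-whole.log') as log:
    for line in log:
        m = PATTERN.search(line)
        if m:
            print('%s  evicted %s after %ss' %
                  (m.group('stamp'), m.group('nid'), m.group('secs')))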
Can you specify the timestamp of the troubled recovery process? From c4-0c0s5n0.log:
[2013-03-08 21:03:28][c4-0c0s5n0]Lustre: 7635:0:(client.c:1866:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[2013-03-08 21:06:04][c4-0c0s5n0]Lustre: routed1-OST00eb-osc-ffff88081e7e4000: Connection restored to routed1-OST00eb (at 10.36.227.92@o2ib)
[2013-03-08 21:06:04][c4-0c0s5n0]Lustre: routed1-OST008b-osc-ffff88081e7e4000: Connection restored to routed1-OST008b (at 10.36.227.92@o2ib)
[2013-03-08 21:28:08][c4-0c0s5n0]LustreError: 11-0: routed1-MDT0000-mdc-ffff88081e7e4000: Communicating with 10.36.227.211@o2ib, operation obd_ping failed with -107.
[2013-03-08 21:28:08][c4-0c0s5n0]Lustre: routed1-MDT0000-mdc-ffff88081e7e4000: Connection to routed1-MDT0000 (at 10.36.227.211@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[2013-03-08 21:28:08][c4-0c0s5n0]LustreError: 167-0: routed1-MDT0000-mdc-ffff88081e7e4000: This client was evicted by routed1-MDT0000; in progress operations using this service will fail.
[2013-03-08 21:28:08][c4-0c0s5n0]Lustre: routed1-MDT0000-mdc-ffff88081e7e4000: Connection restored to routed1-MDT0000 (at 10.36.227.211@o2ib)
I only see that the client restored its connection to OST00[eb|8b] at 2013-03-08 21:06:04, lost its connection to the MDS about 22 minutes later, and restored it again shortly afterwards. Then at 22:32 it was unmounted (manually, I think).
I have attached a log from one of the nodes that was evicted. Due to these problems, recovery took the full 30 minutes instead of finishing quickly.
This is a client that was evicted after recovery. The log covers the whole test shot for that node, so for the eviction problem look around the 21:30 area.
The issue is the drop-off of clients after recovery was "done". As you can see, all clients did recover, but then they started to get evicted.
James,
what is your original issue? Is it that a client should not be evicted by the OSS? The following is the situation as I interpret it from the logs.
When the OSS reconnects to the MDS, a stale export remains on the MDS, so the MDS evicts the OST (the MDS then clears the stale export, and the OST's next reconnection to the MDS should succeed; this is a normal situation).
And in the OSS logs:
Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.540784] Lustre: routed1-OST00eb: Recovery over after 7:32, of 8982 clients 8982 recovered and 0 were evicted.
Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.569906] Lustre: Skipped 5 previous similar messages
Mar 8 21:31:27 widow-oss8b4 kernel: [ 2428.640322] Lustre: routed1-OST00eb: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1354 seconds. I think it's dead, and I am evicting it. exp ffff880397414000, cur 1362796287 expire 1362795387 last 1362794933
Mar 8 21:31:30 widow-oss8b4 kernel: [ 2432.006547] Lustre: routed1-OST014b: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1357 seconds. I think it's dead, and I am evicting it. exp ffff88039ae54000, cur 1362796290 expire 1362795390 last 1362794933
Mar 8 21:31:32 widow-oss8b4 kernel: [ 2433.311741] Lustre: routed1-OST008b: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1358 seconds. I think it's dead, and I am evicting it. exp ffff880398ddd400, cur 1362796291 expire 1362795391 last 1362794933
The client is evicted by the OSS because the OSS has not heard any message from the client for a long time; this is a problem outside of recovery (the client's network could be in trouble).
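The numbers in the eviction message above spell out the timing; a small worked computation in Python, using only the values quoted in this ticket:

#!/usr/bin/env python
# Values taken from the OST00eb eviction message and the "Recovery over" line.
cur, expire, last = 1362796287, 1362795387, 1362794933  # from the eviction line
recovery_over_uptime = 903.54   # kernel uptime at "Recovery over" (21:06:02)
eviction_uptime = 2428.64       # kernel uptime at the eviction (21:31:27)

print('quiet period (cur - last):        %d s' % (cur - last))      # 1354
print('grace after last msg (expire - last): %d s' % (expire - last))  # 454
print('overdue at eviction (cur - expire): %d s' % (cur - expire))  # 900
print('recovery end to eviction:         %d s' %
      round(eviction_uptime - recovery_over_uptime))                # ~1525
# So the client was last heard roughly 1525 - 1354 = 171 s after recovery
# completed, and then sent nothing (not even a ping) until it was evicted.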
The other part of the OSS log:
Mar 8 22:43:04 widow-oss8b4 kernel: [ 6726.231467] LustreError: 2212:0:(obd_mount.c:2115:server_put_super()) routed1-OST002b: failed to disconnect lwp. (rc=-110)
Mar 8 22:43:04 widow-oss8b4 kernel: [ 6726.262375] Lustre: Failing over routed1-OST002b
Mar 8 22:43:05 widow-oss8b4 kernel: [ 6726.700008] Lustre: server umount routed1-OST002b complete
This part indicates that OST002b cannot connect to the MDS (for the same reason I described: the MDS holds a stale export for OST002b), so the OSS failed to mount OST002b and then had to unmount it.
The MDS log shows that the OSS reconnection was using a different connection cookie, and the MDS rejected it. I'm still investigating what could cause this situation. When you lost the OSS, what happened to it, and what did you do to this OSS before you tried to mount its OST?
already connected client routed1-MDT0000-lwp-OST008b_UUID (at 10.36.227.92@o2ib) with handle 0x1e3d407c957c8e7b. Rejecting client with the same UUID trying to reconnect with handle 0xd64e61dedb5d3691
Mar 8 20:58:25 widow-mds1 kernel: [26179.518445] format at ldlm_lib.c:999:target_handle_connect doesn't end in newline
Mar 8 20:58:25 widow-mds1 kernel: [26179.518448] Lustre: routed1-MDT0000: Rejecting reconnect from the known client routed1-MDT0000-lwp-OST008b_UUID (at 10.36.227.92@o2ib) because it is indicating it is a new client
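To see every rejection of this kind in the MDS log, a minimal sketch in Python (assuming the attached mds-test-shot-whole.log carries the same "already connected client ... Rejecting client with the same UUID ..." wording quoted above):

#!/usr/bin/env python
# Report each UUID the MDS refused to let reconnect, with the handle the MDS
# holds and the handle the reconnecting client presented.
import re

REJECT = re.compile(
    r'already connected client (?P<uuid>\S+) \(at (?P<nid>\S+)\) with handle '
    r'(?P<old>0x[0-9a-f]+)\. Rejecting client with the same UUID trying to '
    r'reconnect with handle (?P<new>0x[0-9a-f]+)'
)

with open('mds-test-shot-whole.log') as log:
    for line in log:
        m = REJECT.search(line)
        if m:
            print('%s at %s: known handle %s, reconnect handle %s' %
                  (m.group('uuid'), m.group('nid'),
                   m.group('old'), m.group('new')))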
Yes, it was working. The clients managed to reconnect with no problems. I believe those down routes are for the other file systems in the fabric that we took offline.
I just confirmed with James that the actual problem is "even though recovery completed just fine after a server failure, clients were evicted later for some reason".
Given the timing of 1354 seconds - that's exactly how much time has passed since recovery - it appears the clients were not pinging?
Were you already playing with suppress pinging at the time, I wonder?
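If it helps to rule that out, a minimal sketch for checking a node in Python (assuming the build exposes the ptlrpc "suppress_pings" module parameter under /sys/module; adjust the path if yours differs):

#!/usr/bin/env python
# Report whether ping suppression is enabled on this node, if the parameter
# exists on this build at all.
import os

PATH = '/sys/module/ptlrpc/parameters/suppress_pings'
if os.path.exists(PATH):
    with open(PATH) as f:
        print('suppress_pings = %s' % f.read().strip())
else:
    print('suppress_pings parameter not present on this build')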