Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- Fix Version: None
- Affects Version: Lustre 2.4.0
Attachments
- c4-0c0s5n0.log (33 kB)
- client.log (4 kB)
- mds-test-shot-whole.log (1.27 MB)
- spare.log (869 kB)
Activity
The OSS completed recovery around 21:06, but 25 minutes later both the MDS and the OSS report that they have lost connection with some clients:
Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.540784] Lustre: routed1-OST00eb: Recovery over after 7:32, of 8982 clients 8982 recovered and 0 were evicted.
Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.569906] Lustre: Skipped 5 previous similar messages
Mar 8 21:31:27 widow-oss8b4 kernel: [ 2428.640322] Lustre: routed1-OST00eb: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1354 seconds. I think it's dead, and I am evicting it. exp ffff880397414000, cur 1362796287 expire 1362795387 last 1362794933
None of the provided logs show any detail for those 25 minutes. Could something have happened to the clients' network?
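One way to confirm the gap is to list everything the client console logged in that window. A minimal sketch in Python, assuming the attached c4-0c0s5n0.log keeps the "[YYYY-MM-DD HH:MM:SS][host]" prefix seen in the excerpts quoted later in this ticket:

#!/usr/bin/env python
# Print every console-log line that falls between the end of recovery
# (21:06:02) and the first eviction (21:31:27) on 2013-03-08.
import re
from datetime import datetime

START = datetime(2013, 3, 8, 21, 6, 2)
END = datetime(2013, 3, 8, 21, 31, 27)
STAMP = re.compile(r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]')

with open('c4-0c0s5n0.log') as log:
    for line in log:
        m = STAMP.match(line)
        if m:
            when = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
            if START <= when <= END:
                print(line.rstrip())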
We saw the problems with the client evictions starting around 21:07. The MDS had these errors, which led to the evictions:
Mar 8 21:12:54 widow-mds1 kernel: [27048.704079] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 3167@gni ns: mdt-ffff880e55b20000 lock: ffff880c3f748600/0x4453d516b35bd568 lrc: 3/0,0 mode: CR/CR res: 8589939214/48084 bits 0x9 rrc: 14 type: IBT flags: 0x200000000020 nid: 3167@gni remote: 0x1f865dc509f72511 expref: 11 pid: 28552 timeout: 4321715043 lvb_type: 0
To make it clear what happened, I have attached the MDS log for the entire test shot.
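To pull all of the evictions out of that log at once, a minimal sketch in Python (assuming the attached mds-test-shot-whole.log uses the syslog wording quoted above):

#!/usr/bin/env python
# List every "lock callback timer expired" eviction in the MDS log:
# timestamp, evicted client NID, and how long the callback timer ran.
import re

PATTERN = re.compile(
    r'^(?P<stamp>\w+\s+\d+ \d{2}:\d{2}:\d{2}).*'
    r'lock callback timer expired after (?P<secs>\d+)s: '
    r'evicting client at (?P<nid>\S+)'
)

with open('mds-test-shot-whole.log') as log:
    for line in log:
        m = PATTERN.search(line)
        if m:
            print('%s  evicted %s after %ss' %
                  (m.group('stamp'), m.group('nid'), m.group('secs')))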
Can you specify the timestamp of the troubled recovery process? From c4-0c0s5n0.log:
[2013-03-08 21:03:28][c4-0c0s5n0]Lustre: 7635:0:(client.c:1866:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[2013-03-08 21:06:04][c4-0c0s5n0]Lustre: routed1-OST00eb-osc-ffff88081e7e4000: Connection restored to routed1-OST00eb (at 10.36.227.92@o2ib)
[2013-03-08 21:06:04][c4-0c0s5n0]Lustre: routed1-OST008b-osc-ffff88081e7e4000: Connection restored to routed1-OST008b (at 10.36.227.92@o2ib)
[2013-03-08 21:28:08][c4-0c0s5n0]LustreError: 11-0: routed1-MDT0000-mdc-ffff88081e7e4000: Communicating with 10.36.227.211@o2ib, operation obd_ping failed with -107.
[2013-03-08 21:28:08][c4-0c0s5n0]Lustre: routed1-MDT0000-mdc-ffff88081e7e4000: Connection to routed1-MDT0000 (at 10.36.227.211@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[2013-03-08 21:28:08][c4-0c0s5n0]LustreError: 167-0: routed1-MDT0000-mdc-ffff88081e7e4000: This client was evicted by routed1-MDT0000; in progress operations using this service will fail.
[2013-03-08 21:28:08][c4-0c0s5n0]Lustre: routed1-MDT0000-mdc-ffff88081e7e4000: Connection restored to routed1-MDT0000 (at 10.36.227.211@o2ib)
I only see that the client restored its connection to OST00[eb|8b] at 2013-03-08 21:06:04, lost its connection to the MDS about 22 minutes later, and restored it again shortly afterwards. Then at 22:32 it was unmounted (manually, I think).
I have attached a log from one of the nodes that was evicted. Due to these problems, recovery took the full 30 minutes instead of finishing quickly.
This is a client that was evicted after recovery. The log covers the whole test shot for that node, so for the eviction problem look around the 21:30 area.
The issue is the drop-off of clients after recovery was "done". As you can see, all clients did recover, but then they started to get evicted.
James,
what is your original issue? Is it that a client should not be evicted by the OSS? The following is the situation as I interpret it from the logs.
When the OSS reconnects to the MDS, a stale export remains on the MDS, so the MDS evicts the OST (the MDS then clears the stale export, and the OST's next reconnection to the MDS should succeed; this is a normal situation).
And in the OSS logs:
Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.540784] Lustre: routed1-OST00eb: Recovery over after 7:32, of 8982 clients 8982 recovered and 0 were evicted.
Mar 8 21:06:02 widow-oss8b4 kernel: [ 903.569906] Lustre: Skipped 5 previous similar messages
Mar 8 21:31:27 widow-oss8b4 kernel: [ 2428.640322] Lustre: routed1-OST00eb: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1354 seconds. I think it's dead, and I am evicting it. exp ffff880397414000, cur 1362796287 expire 1362795387 last 1362794933
Mar 8 21:31:30 widow-oss8b4 kernel: [ 2432.006547] Lustre: routed1-OST014b: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1357 seconds. I think it's dead, and I am evicting it. exp ffff88039ae54000, cur 1362796290 expire 1362795390 last 1362794933
Mar 8 21:31:32 widow-oss8b4 kernel: [ 2433.311741] Lustre: routed1-OST008b: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1358 seconds. I think it's dead, and I am evicting it. exp ffff880398ddd400, cur 1362796291 expire 1362795391 last 1362794933
The client is evicted by the OSS because the OSS has not heard any message from the client for a long time; this is a problem outside of recovery (the client's network could be in trouble).
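The numbers in the eviction message above spell out the timing; a small worked computation in Python, using only the values quoted in this ticket:

#!/usr/bin/env python
# Values taken from the OST00eb eviction message and the "Recovery over" line.
cur, expire, last = 1362796287, 1362795387, 1362794933  # from the eviction line
recovery_over_uptime = 903.54   # kernel uptime at "Recovery over" (21:06:02)
eviction_uptime = 2428.64       # kernel uptime at the eviction (21:31:27)

print('quiet period (cur - last):        %d s' % (cur - last))      # 1354
print('grace after last msg (expire - last): %d s' % (expire - last))  # 454
print('overdue at eviction (cur - expire): %d s' % (cur - expire))  # 900
print('recovery end to eviction:         %d s' %
      round(eviction_uptime - recovery_over_uptime))                # ~1525
# So the client was last heard roughly 1525 - 1354 = 171 s after recovery
# completed, and then sent nothing (not even a ping) until it was evicted.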
The other part of the OSS log:
Mar 8 22:43:04 widow-oss8b4 kernel: [ 6726.231467] LustreError: 2212:0:(obd_mount.c:2115:server_put_super()) routed1-OST002b: failed to disconnect lwp. (rc=-110)
Mar 8 22:43:04 widow-oss8b4 kernel: [ 6726.262375] Lustre: Failing over routed1-OST002b
Mar 8 22:43:05 widow-oss8b4 kernel: [ 6726.700008] Lustre: server umount routed1-OST002b complete
This part indicates that OST002b cannot connect to the MDS (for the same reason I described: the MDS holds a stale export for OST002b), so the OSS failed to mount OST002b and then had to unmount it.
The MDS log shows that the OSS reconnection was using a different connection cookie, and the MDS rejected it. I'm still investigating what could cause this situation. When you lost the OSS, what happened to it, and what did you do to this OSS before you tried to mount its OST?
already connected client routed1-MDT0000-lwp-OST008b_UUID (at 10.36.227.92@o2ib) with handle 0x1e3d407c957c8e7b. Rejecting client with the same UUID trying to reconnect with handle 0xd64e61dedb5d3691
Mar 8 20:58:25 widow-mds1 kernel: [26179.518445] format at ldlm_lib.c:999:target_handle_connect doesn't end in newline
Mar 8 20:58:25 widow-mds1 kernel: [26179.518448] Lustre: routed1-MDT0000: Rejecting reconnect from the known client routed1-MDT0000-lwp-OST008b_UUID (at 10.36.227.92@o2ib) because it is indicating it is a new client
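To see every rejection of this kind in the MDS log, a minimal sketch in Python (assuming the attached mds-test-shot-whole.log carries the same "already connected client ... Rejecting client with the same UUID ..." wording quoted above):

#!/usr/bin/env python
# Report each UUID the MDS refused to let reconnect, with the handle the MDS
# holds and the handle the reconnecting client presented.
import re

REJECT = re.compile(
    r'already connected client (?P<uuid>\S+) \(at (?P<nid>\S+)\) with handle '
    r'(?P<old>0x[0-9a-f]+)\. Rejecting client with the same UUID trying to '
    r'reconnect with handle (?P<new>0x[0-9a-f]+)'
)

with open('mds-test-shot-whole.log') as log:
    for line in log:
        m = REJECT.search(line)
        if m:
            print('%s at %s: known handle %s, reconnect handle %s' %
                  (m.group('uuid'), m.group('nid'),
                   m.group('old'), m.group('new')))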
Yes, it was working. The clients managed to reconnect with no problems. I believe those down routes are for the other file systems in the fabric that we took offline.
I just confirmed with James that the actual problem is "even though recovery completed just fine after a server failure, clients were evicted later for some reason".
Given the timing of 1354 seconds - that's exactly how much time has passed since recovery - it appears the clients were not pinging?
Were you already playing with suppress pinging at the time, I wonder?
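If it helps to rule that out, a minimal sketch for checking a node in Python (assuming the build exposes the ptlrpc "suppress_pings" module parameter under /sys/module; adjust the path if yours differs):

#!/usr/bin/env python
# Report whether ping suppression is enabled on this node, if the parameter
# exists on this build at all.
import os

PATH = '/sys/module/ptlrpc/parameters/suppress_pings'
if os.path.exists(PATH):
    with open(PATH) as f:
        print('suppress_pings = %s' % f.read().strip())
else:
    print('suppress_pings parameter not present on this build')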