Thanks for the timeline, Bobi!
The reason the client thinks the server evicted it is that this is exactly what happened.
Recovery window closed on the server at 1350224331:
In fact the vm2 client reconnected much earlier and went through recovery as far as it could (up to the REPLAY_LOCKS phase) until it hit the wall, with the server waiting for the other client.
Then it proceeds to replay locks:
Note how x1415810084000533 completes right away, but x1415810084000532 does not.
Switching to the server side, we did get the x1415810084000532 request there and sent the reply:
So req x1415810084000532 is "lost" - the test is working as expected, timing out lock recovery.
Now, some 60 seconds later on the client:
By which time the server has already timed out the recovery.
So the problem at hand is that the server did not extend the deadline. Reading through the code, it appears that we only extend the timer for requests placed in the various recovery queues, but to get there a lock replay must be sent with the MSG_REQ_REPLAY_DONE flag set, and only 2.x clients set it.
1.8 clients don't set the flag and so the timer is not extended.
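To make the suspected mechanism concrete, here is a toy model of it, not the actual server code. Everything in it is an assumption made for illustration: the class and method names, the 30-second extension, and the flag value (only the flag's name, MSG_REQ_REPLAY_DONE, comes from the code in question). It shows that a lock replay without the flag never reaches a recovery queue, so the deadline stays where it was.

```python
from dataclasses import dataclass
from enum import Flag, auto


class MsgFlag(Flag):
    NONE = 0
    # Stands in for MSG_REQ_REPLAY_DONE; the numeric value is illustrative only.
    REQ_REPLAY_DONE = auto()


@dataclass
class RecoveryTimer:
    """Hypothetical model of the server-side recovery deadline."""
    deadline: int        # absolute time (seconds) at which recovery is aborted
    extension: int = 30  # assumed extension granted per queued request

    def handle_lock_replay(self, now: int, flags: MsgFlag) -> bool:
        """Return True if the replay was queued and the deadline extended."""
        if now >= self.deadline:
            return False  # recovery window has already closed
        if MsgFlag.REQ_REPLAY_DONE not in flags:
            # The request bypasses the recovery queues, so no extension.
            return False
        self.deadline = max(self.deadline, now + self.extension)
        return True


# A 2.x-style client sets the flag; a 1.8-style client does not.
new_client = RecoveryTimer(deadline=100)
new_client.handle_lock_replay(now=90, flags=MsgFlag.REQ_REPLAY_DONE)

old_client = RecoveryTimer(deadline=100)
old_client.handle_lock_replay(now=90, flags=MsgFlag.NONE)
```

In this model the deadline for the flag-setting client moves out to 120, while the one for the 1.8-style client stays at 100, so a reconnect arriving after that point finds recovery already timed out.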
In the 1.8 server case there are no separate lock queues at all, and I suspect the initial timeout is just much higher (I don't have a log nearby, will try to look for something in maloo shortly), which leaves enough margin for the reconnect to still succeed.
Hm, actually I just looked into the maloo report and it seems that there the reconnect arrives ~50 seconds after the lost message, instead of 60 seconds as in the 2.1 interop case. Sadly we don't collect any debug logs on success, so I cannot compare the results. I'll try to repeat the 1.8 test locally and see what's inside.
The entire test is somewhat of a fringe case where we have a double failure of sorts: first the server dies and then one of the lock replies is lost, so in my opinion it does not warrant much attention. But I am curious to see why the 1.8 to 1.8 case replays sooner than during interop, so I'll make a final decision after that.
Close old ticket.