[LU-868] Add early reply and netlatency into the timeout of waiting next replay. Created: 20/Nov/11  Updated: 03/Jun/16  Resolved: 03/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.2.0

Type: Improvement Priority: Minor
Reporter: Di Wang Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None

Rank (Obsolete): 4771

 Description   

In the current handle_recovery_req(), the timeout for waiting for the next replay is set as follows:

to = lustre_msg_get_timeout(req->rq_reqmsg);
extend_recovery_timer(class_exp2obd(req->rq_export), to);

This does not account for network latency, or for the fact that an early reply might extend the request timeout on the client side. As a result, some replay tests may fail, such as replay-dual 9 and 10.
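
A minimal sketch of how the wait could account for both effects, assuming the server knows its current service-time estimate and the client's network latency. The function name, its parameters, and the factor of two on latency are hypothetical and only illustrate the idea; this is not the Lustre API and not necessarily what the landed patch does.

```c
/*
 * Hypothetical helper (not a Lustre function). All values are in seconds.
 *
 * req_timeout:         timeout carried in the replayed request
 * est_service_timeout: timeout the client would derive from the server's
 *                      current service estimate after an early reply
 * net_latency:         one-way network latency estimate
 */
int replay_wait_timeout(int req_timeout, int est_service_timeout,
                        int net_latency)
{
        /* An early reply may stretch the client timeout, so take the
         * larger of the two candidate timeouts. */
        int to = req_timeout > est_service_timeout ?
                 req_timeout : est_service_timeout;

        /* Assumption: allow one latency for the client deadline and one
         * for the next request to reach the server. */
        return to + 2 * net_latency;
}
```

With a 60s request timeout, a 45s estimate, and 3s latency, this yields 66s, compared with the 60s that the current code would pass to extend_recovery_timer().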



 Comments   
Comment by Jinshan Xiong (Inactive) [ 22/Nov/11 ]

There exists a quite similar scenario at: https://maloo.whamcloud.com/test_sets/eb23c4ba-1510-11e1-b669-52540025f9af

The failure on patch set 2 is quite similar to LU-868.

0. server started the recovery window at 1321951273.840132, and the recovery timeout was set to 60s;
1. client replayed the open request at 1321951273.829810, with a timeout value of 60 seconds;
2. server received this request at 1321951273.840592, and the reply was dropped at 1321951273.845728;
3. server evicted this client at 1321951333.841817 because it couldn't finish recovery in time;
4. the replayed request expired on the client side at 1321951334.831938;

It looks like it is not enough to extend the recovery window by rq_timeout alone. We should add the RECONNECT time as well, so that the client has enough time to reconnect and replay again. Another question is why rq_timeout of the replayed request was set to 60 seconds; I think it should be much shorter.
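
The gap in the timeline above can be checked with a small calculation. This is only a sketch: the struct and function names are hypothetical, and it assumes the client and server clocks are close enough to compare directly. Timestamps are kept as integer seconds and microseconds to avoid floating-point rounding.

```c
#include <stdint.h>

/* Hypothetical timestamp type: whole seconds plus microseconds. */
struct ts {
        int64_t sec;
        int64_t usec;
};

/* Difference (later - earlier) in microseconds. */
int64_t ts_diff_usec(struct ts later, struct ts earlier)
{
        return (later.sec - earlier.sec) * 1000000 +
               (later.usec - earlier.usec);
}

/* Timestamps copied from steps 0, 1, 3 and 4 of the timeline. */
const struct ts recovery_start = { 1321951273, 840132 };
const struct ts client_replay  = { 1321951273, 829810 };
const struct ts server_evict   = { 1321951333, 841817 };
const struct ts client_expire  = { 1321951334, 831938 };
```

Here ts_diff_usec(server_evict, recovery_start) is 60001685 (about 60.0s), ts_diff_usec(client_expire, client_replay) is 61002128 (about 61.0s), and ts_diff_usec(client_expire, server_evict) is 990121. In other words, the server evicted the client roughly one second before the client's own request expired, so the client never had a chance to reconnect and replay again.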

Comment by Mikhail Pershin [ 24/Nov/11 ]

I agree that the reconnect time is good to include, but then it should also be added to the initial recovery timeout. As for the 60s timeout, this may be due to network latency; I recall from a discussion with Eric that delays of up to 60s are normal, e.g. with routers. In the current situation, a recovery time of 60s looks more unnatural than the request timeout; it is just too short.

Comment by Mikhail Pershin [ 24/Nov/11 ]

check_and_start_recovery_timer() looks pretty reasonable, but at the end it has:

        service_time -= obd->obd_recovery_timeout;
        if (service_time > 0)
                extend_recovery_timer(obd, service_time);

I think the Maloo issue you are referring to predates your fixes for extend_recovery_timer(), so I suspect this code can produce an incorrect total recovery time, making it smaller. What do you think?

Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/include/lustre/lustre_idl.h
  • lustre/ptlrpc/client.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/include/lustre/lustre_idl.h
  • lustre/ptlrpc/client.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/include/lustre/lustre_idl.h
  • lustre/ptlrpc/client.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/include/lustre/lustre_idl.h
  • lustre/ptlrpc/client.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/ptlrpc/client.c
  • lustre/include/lustre/lustre_idl.h
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/ptlrpc/client.c
  • lustre/ldlm/ldlm_lib.c
  • lustre/include/lustre/lustre_idl.h
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/include/lustre/lustre_idl.h
  • lustre/ptlrpc/client.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/ptlrpc/client.c
  • lustre/include/lustre/lustre_idl.h
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/include/lustre/lustre_idl.h
  • lustre/ptlrpc/client.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/include/lustre/lustre_idl.h
  • lustre/ldlm/ldlm_lib.c
  • lustre/ptlrpc/client.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » i686,server,el5,ofa #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/include/lustre/lustre_idl.h
  • lustre/ptlrpc/client.c
  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/ptlrpc/client.c
  • lustre/include/lustre/lustre_idl.h
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/include/lustre/lustre_idl.h
  • lustre/ptlrpc/client.c
Comment by Build Master (Inactive) [ 11/Jan/12 ]

Integrated in lustre-master » i686,client,el5,ofa #422
LU-868 ptlrpc: Fix the timeout for waiting next replay (Revision 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c)

Result = SUCCESS
Oleg Drokin : 06f41b84901ad3bb4901b51a24e1553f9eeb6f1c
Files :

  • lustre/ldlm/ldlm_lib.c
  • lustre/include/lustre/lustre_idl.h
  • lustre/ptlrpc/client.c