[LU-8420] unexpected? client eviction after bulk transfer timeout Created: 20/Jul/16  Updated: 07/Feb/17  Resolved: 07/Feb/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: Vladimir Saveliev Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The following scenario, which leads to a client eviction, has been observed in acceptance testing:

1) client 1 owns a PW lock on file A and sends a write rpc to the ost
2) the ost initiates a bulk transfer, which gets lost somewhere in the network
3) client 2 enqueues a PR lock on file A
4) the server sees the incompatible lock, sends a blocking ast to client 1, and waits for client 1 to cancel the lock
5) the bulk transfer times out, but client 1 does not get a reply in that case.

int tgt_brw_write(struct tgt_session_info *tsi)
...
        rc = target_bulk_io(exp, desc, &lwi);
        no_reply = rc != 0;
...

6) blocking ast callback timer expires and the server evicts client 1
7) write rpc on client 1 times out, and client 1 finds itself evicted

The adaptive timeout (AT) settings made client 1's rpc timeout larger than the blocking ast callback timeout.
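The race in the steps above can be sketched as a comparison of two deadlines. This is a minimal model, not Lustre code: the struct, its fields, and the function names are all hypothetical, and it assumes the eviction only happens when the server's callback timer fires before client 1 gives up on its write rpc.

```c
#include <stdbool.h>

/* Hypothetical timeline of the scenario; times are in seconds. */
struct write_race {
	long rpc_sent;           /* when client 1 sent the write rpc */
	long client_rpc_timeout; /* client-side timeout for that rpc */
	long ast_sent;           /* when the server sent the blocking ast */
	long callback_timeout;   /* length of the blocking ast callback timer */
};

/* Moment at which client 1 stops waiting for the write reply. */
static inline long client_gives_up_at(const struct write_race *r)
{
	return r->rpc_sent + r->client_rpc_timeout;
}

/* Moment at which the callback timer expires on the server. */
static inline long server_evicts_at(const struct write_race *r)
{
	return r->ast_sent + r->callback_timeout;
}

/* Assumption: client 1 is evicted when the server timer fires first. */
static inline bool client_evicted_first(const struct write_race *r)
{
	return server_evicts_at(r) < client_gives_up_at(r);
}
```

With an rpc timeout of 100 and a callback timer of 50 started at second 10, the server timer expires at second 60, long before the client stops waiting at second 100.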



 Comments   
Comment by Gerrit Updater [ 20/Jul/16 ]

Vladimir Saveliev (vladimir_saveliev@xyratex.com) uploaded a new patch: http://review.whamcloud.com/21448
Subject: LU-8420 tests: yet another test for bulk transfer timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d4fe365d7f8ea2c9e02810ea7f48da0fce496fee

Comment by Vladimir Saveliev [ 02/Dec/16 ]

2 important points were not mentioned in this scenario:

1) client 1 owns a PW lock on file A and sends a write rpc to the ost
2) the ost initiates a bulk transfer, which gets lost somewhere in the network
3) client 2 enqueues a PR lock on file A
4) the server sees the incompatible lock, sends a blocking ast to client 1, and waits for client 1 to cancel the lock

4.2) at_history seconds pass since the worst rpc took place, and the service estimate drops.

5) the bulk transfer times out, but client 1 does not get a reply in that case.

5.2) prolong tries to extend the lock callback timer using the decreased service estimate, so the prolong has no effect.

6) blocking ast callback timer expires and the server evicts client 1
7) write rpc on client 1 times out, and client 1 finds itself evicted
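Steps 4.2 and 5.2 can be illustrated with a small sketch. Everything here is hypothetical: `struct callback_timer`, `prolong()`, and `prolong_timeout()` are illustrative names, the timer is assumed to only ever move forward, and taking the larger of the service estimate and the request's own timeout is just one plausible way to keep a shrunken estimate from cancelling out the prolong. It is not presented as the change that landed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a lock callback timer; not a Lustre structure. */
struct callback_timer {
	long deadline; /* absolute time at which the client would be evicted */
};

/*
 * Assumed prolong behaviour: the deadline only moves forward.
 * Returns true if the timer was actually extended.
 */
static inline bool prolong(struct callback_timer *t, long now, long timeout)
{
	long candidate = now + timeout;

	if (candidate <= t->deadline)
		return false; /* estimate too small: prolong has no effect */
	t->deadline = candidate;
	return true;
}

/*
 * Assumed mitigation: never use a timeout smaller than the one the
 * in-flight request was given when it arrived.
 */
static inline long prolong_timeout(long service_estimate, long request_timeout)
{
	return service_estimate > request_timeout ? service_estimate
						  : request_timeout;
}
```

For example, with a deadline of 100, a prolong at second 90 using a decreased estimate of 5 yields a candidate of 95 and changes nothing, while using the request's own timeout of 60 moves the deadline to 150.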

Comment by Gerrit Updater [ 07/Feb/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21448/
Subject: LU-8420 ldlm: take at_current change into account on prolong
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 18c95c436a55a2c7c8b8f71c0935e8d92c70c42f

Comment by Peter Jones [ 07/Feb/17 ]

Landed for 2.10
