[LU-5528] Race - connect vs resend Created: 21/Aug/14  Updated: 14/Jun/18  Resolved: 29/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1, Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Li Wei (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-7558 niobuf.c:721:ptl_send_rpc() LASSERT(A... Resolved
is related to LU-5533 Wrong buffer for field `dlm_rep' (1 o... Resolved
Severity: 3
Rank (Obsolete): 15386

 Description   

Buggy code in ptlrpc_connect_interpret():
finish:
rc = ptlrpc_import_recovery_state_machine(imp);
...
Set import connection flags
When the import reaches the FULL state, ptlrpc_import_recovery_state_machine() wakes up all waiters on the import, including delayed requests, which are then resent. Because the import connection flags are set only after this call, a request can be sent before the flags are updated, i.e. with AT (adaptive timeouts) disabled. The server may then drop the resent request if it is already processing the original, and send an early reply to the client based on the first incarnation of the request. The client receives an early reply for a request that was sent without AT, becomes confused, touches the buffer outside the reply, and fails with EPROTO.



 Comments   
Comment by Alexander Boyko [ 21/Aug/14 ]

Xyratex: MRP-2034
patch http://review.whamcloud.com/11540 for b2_5

Comment by Alexander Boyko [ 02/Sep/14 ]

We hit this issue during testing:

[12485.898910] Lustre: 10447:0:(ldlm_lib.c:1004:target_handle_connect()) lustre-MDT0000: connection from 588e19fc-8e99-b13a-cf77-3d993fb6631e@0@lo t0 exp ffff880127570048 cur 1409434074 last 1409434073
[12485.903011] Lustre: 10447:0:(ldlm_lib.c:1004:target_handle_connect()) Skipped 2 previous similar messages
[12485.905135] Lustre: lustre-MDT0000-mdc-ffff88008d3700c8: Connection restored to lustre-MDT0000 (at 0@lo)
[12485.905809] LustreError: 10440:0:(layout.c:1687:__req_capsule_get()) @@@ Wrong buffer for field `mdt_body' (1 of 1) in format `MDS_REINT_SETATTR': 0 vs. 216 (server)
[12485.905810]   req@ffff8800bb670a20 x1477898561126671/t0(0) o36->lustre-MDT0000-mdc-ffff88008d3700c8@0@lo:12/10 lens 456/192 e 0 to 0 dl 1409434081 ref 1 fl Complete:R/2/0 rc 0/0
[12485.905827] LustreError: 10440:0:(llite_lib.c:1224:ll_md_setattr()) md_setattr fails: rc = -71
Comment by Alexander Boyko [ 02/Sep/14 ]

Patch for master: http://review.whamcloud.com/11723

Comment by Gerrit Updater [ 23/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11723/
Subject: LU-5528 ptlrpc: fix race between connect vs resend
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8645c6f7e95b81dedbc5d47a9ab76947343ed05e

Comment by Jodi Levi (Inactive) [ 29/Dec/14 ]

Patch landed to Master.
b2_5 patch tracked externally to land.
