[LU-11131] resent reint rpc failure due to reused reply data slot Created: 09/Jul/18  Updated: 15/Oct/19  Resolved: 18/Jul/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: Vladimir Saveliev Assignee: Vladimir Saveliev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The following scenario leads to failure of recent reint rpc:

1. mdt server has number of rpcs being handled, rpc 1 from client A
and rpc 2 from client B.

2. shutdown for the server starts

3. rpc 1 is processed, reply data is added, but client A gets ENODEV
in reply (ptlrpc_send_reply()) as shutdown is running

3. shutdown reaches class_disconnect_exports() and links an export A
to the list of zombie exports

4. obd_zombid thread wakes up and destroy the export A, which includes
freeing of reply data list with clearing bits in
lut->lut_reply_bitmap (tgt_free_reply_data())

5. export B is still processing the rpc 2 and looks for free bit in
the lut->lut_reply_bitmap to store reply data
(tgt_add_reply_data()). If it finds a bit which has been just freed by
obd_zombid thread, then reply data from export A will get overwritten
in reply_data file with reply data from export B

6. after failover, reply data gets restored with
tgt_reply_data_init(). The reply data of rpc1 from client A is missing

7. client A reconnects and resends its rpc 1. Server does not find
reply data and processes the rpc as if it has not been seen yet. In
case of unlink, the directory entry already does not exist so rpc 1
fails



 Comments   
Comment by Gerrit Updater [ 09/Jul/18 ]

Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/32798
Subject: LU-11131 target: keep reply data bit set on failover
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fe72c36f256d061862adb8bcf14a596eb0709c31

Comment by Gerrit Updater [ 18/Jul/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32798/
Subject: LU-11131 target: keep reply data bit set on failover
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 58edec38160f44be0ef784fecfab830a43f92fa8

Comment by Peter Jones [ 18/Jul/18 ]

Landed for 2.12

Generated at Sat Feb 10 02:41:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.