[LU-14876] OUT: possible concurrent execution of UPDATE request and its resent Created: 21/Jul/21  Updated: 10/Mar/22  Resolved: 18/Aug/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7, Lustre 2.15.0
Fix Version/s: Lustre 2.12.8, Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Mikhail Pershin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

An LBUG() is possible in out_reconstruct():

lustre_update.h:246:object_update_result_insert()) LBUG

The bug happened because the export's lcd_last_xid became equal to the request's rq_xid in the middle of processing a resent OUT UPDATE request.

1. For the first update, index 0, the request XID was not equal to lcd_last_xid, so the update was added as a normal update to be processed.
2. For the second update, index 1, lcd_last_xid was found to be the same, so out_reconstruct() was entered for the first time. There, object_update_result_get() found no ourp_lens[0] for the previous index 0 and returned NULL as the result, triggering the assertion.
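
The sequence above can be modelled in user space. This is only a simplified sketch, not the Lustre code: reply_model, result_insert_ok() and process_per_update_check() are hypothetical names, and the array last_xid_seen[] stands in for the value of lcd_last_xid observed at each per-update check, which is how the concurrent change is simulated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_UPDATES 4

/* Hypothetical stand-in for the reply buffer: lens[i] == 0 means that
 * no result has been stored for update index i yet. */
struct reply_model {
	size_t lens[MAX_UPDATES];
};

/* Models the precondition behind the assertion: a result can be stored
 * at 'index' only if every previous index already has one. */
static bool result_insert_ok(struct reply_model *reply, int index)
{
	for (int i = 0; i < index; i++)
		if (reply->lens[i] == 0)
			return false;
	reply->lens[index] = sizeof(uint64_t);
	return true;
}

/* Walks 'count' updates (count <= MAX_UPDATES) and repeats the XID check
 * for each of them. last_xid_seen[i] is the value of lcd_last_xid observed
 * when update i is checked. Returns the index at which the assertion would
 * fire, or -1 if processing is consistent. */
static int process_per_update_check(uint64_t rq_xid,
				    const uint64_t *last_xid_seen, int count)
{
	struct reply_model reply = { { 0 } };

	for (int i = 0; i < count; i++) {
		if (rq_xid == last_xid_seen[i]) {
			/* Reconstruct path: needs results for earlier indices. */
			if (!result_insert_ok(&reply, i))
				return i;
		}
		/* Otherwise the update is queued for normal execution and
		 * its result is filled in later, so lens[i] stays 0 here. */
	}
	return -1;
}
```

With last_xid_seen = { 99, 100 } and rq_xid = 100, the model reports index 1 as the failing update, matching the scenario. When the observed value does not change between checks, both paths stay consistent.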

I am fairly sure about this sequence, and the log messages confirm it. It revealed at least two issues with OUT UPDATE resent handling.

1. The req_xid_is_last() check shouldn't be done for each update. It is always the same request with the same XID, so it is either the last one or not, and that should be checked only once, before the updates are processed.

2. In step #2 of the scenario, lcd_last_xid changed and became the same as the request's XID. This is the real problem: it means the original request was still being processed when its resent also started processing. For ordinary clients this is prevented by checking exp_rpc_count in target_handle_connect(), but for MDS-MDS reconnection the check appears to be flawed somewhere.
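
Both points can be illustrated with a small user-space sketch. It is not the actual patch: export_model, out_choose_path() and reconnect_allowed() are hypothetical names, the XID comparison is a simplification of what req_xid_is_last() does, and returning -EBUSY is an assumption made for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of the per-export state discussed above. */
struct export_model {
	uint64_t last_xid;	/* models lcd_last_xid */
	int rpc_count;		/* models exp_rpc_count */
};

enum out_path {
	OUT_EXECUTE,		/* new request: execute every update */
	OUT_RECONSTRUCT,	/* resent of the last request: rebuild the reply */
};

/* Issue 1: choose the path once per request, before any update is
 * processed, so all updates of the request follow the same path even
 * if last_xid changes afterwards. */
static enum out_path out_choose_path(const struct export_model *exp,
				     uint64_t rq_xid)
{
	return rq_xid == exp->last_xid ? OUT_RECONSTRUCT : OUT_EXECUTE;
}

/* Issue 2: refuse a reconnect while the export still has requests in
 * flight, so a resent cannot start before the original has finished.
 * Returns 0 if the reconnect may proceed, -EBUSY otherwise. */
static int reconnect_allowed(const struct export_model *exp)
{
	return exp->rpc_count > 0 ? -EBUSY : 0;
}
```

A caller would take the result of out_choose_path() once and apply it to every update in the request, rather than repeating the comparison inside the per-update loop.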



 Comments   
Comment by Gerrit Updater [ 21/Jul/21 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44362
Subject: LU-14876 out: improved check and debug for OUT resent
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 27f1605e00f75a913bd072b0b3def17d79394fb0

Comment by Mikhail Pershin [ 22/Jul/21 ]

As I assumed in the description, there is a flaw in target_handle_connect() that allows concurrent execution of the original request and its resent on the server. I am preparing a patch for this.

Comment by Gerrit Updater [ 23/Jul/21 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44390
Subject: LU-14876 out: don't connect to busy MDS-MDS export
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a15ae51e5df962726bb5b2b801b1ae26e8c918f1

Comment by Gerrit Updater [ 18/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44390/
Subject: LU-14876 out: don't connect to busy MDS-MDS export
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 301d76a71176c186129231ddd1323bae21100165

Comment by Peter Jones [ 18/Aug/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 13/Sep/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44362/
Subject: LU-14876 out: don't connect to busy MDS-MDS export
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 67a24ac97553b684195d210b0db1d5bfad0fa5d7
