[LU-7319] OUT: continue updates processing upon an error Created: 20/Oct/15  Updated: 23/Dec/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Blocker
is blocking LU-4215 Some expected improvements for OUT Open
Related
is related to LU-12310 MDT Device-level Replication/Mirroring Open
Rank (Obsolete): 9223372036854775807

 Description   

In some cases it would be useful to continue processing updates even if some of them fail. For example, when an MDT synchronizes with an OST, it needs to send a batch of attr_set/destroy updates, up to a few thousand in a single RPC. It makes no sense to have to send another batch just because some of the destroys failed (say, with -ENOENT).
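To make the requested behaviour concrete, here is a minimal sketch in C of a batch executor with a "keep going" switch. It is not Lustre code: `struct upd`, `struct toy_ost`, `upd_apply()` and `batch_run()` are hypothetical names invented for illustration, and the in-memory object table only stands in for an OST.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical update types; not the real OUT wire format. */
enum upd_op { UPD_ATTR_SET, UPD_DESTROY };

struct upd {
	enum upd_op op;
	size_t      obj;	/* index into the toy object table */
};

#define NOBJ 8

/* Toy stand-in for an OST: exists[i] is true while object i exists. */
struct toy_ost {
	bool exists[NOBJ];
};

/* Apply one update; returns 0 or a negative errno. */
static int upd_apply(struct toy_ost *ost, const struct upd *u)
{
	if (u->obj >= NOBJ || !ost->exists[u->obj])
		return -ENOENT;
	if (u->op == UPD_DESTROY)
		ost->exists[u->obj] = false;
	return 0;
}

/*
 * Run a batch of n updates and store the result of update i in rc[i].
 * With keep_going == false the batch stops at the first failure;
 * with keep_going == true every update is attempted.
 * Returns the number of updates attempted.
 */
static size_t batch_run(struct toy_ost *ost, const struct upd *u, size_t n,
			int *rc, bool keep_going)
{
	size_t i;

	assert(ost != NULL && u != NULL && rc != NULL);
	for (i = 0; i < n; i++) {
		rc[i] = upd_apply(ost, &u[i]);
		if (rc[i] != 0 && !keep_going)
			return i + 1;
	}
	return n;
}
```

In this sketch, a batch of three destroys whose second object is already gone stops after two updates when keep_going is false, so the initiator would have to resend the third. With keep_going set, the same batch completes in one pass and the caller receives one result code per update.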



 Comments   
Comment by Alex Zhuravlev [ 20/Oct/15 ]

The major problem here is how to reconstruct the reply. Say we have an RPC with 3 transactions (1 update in each, for simplicity). We executed 2 transactions, then crashed. Ideally, during recovery we would skip those 2 transactions, execute the missing one, and reconstruct the reply with the appropriate result codes. But there is not enough space to store all the codes in a last_rcvd slot. I think there are some obvious options here:
1) OUT stores the result codes in an object of its own.
2) Stop execution upon an error and store the XID/batchid in the last_rcvd slot; essentially, never proceed past an error and force the initiator to resubmit the remaining part. This in turn can result in a silly sequence of huge requests, each returning an error after every executed update (say, the MDT wants to synchronize OST object destroys, but the objects have already been destroyed).
3) Apply this logic only to idempotent updates, so that we can execute them again instead of reconstructing the reply.
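
The crash scenario and option 1 can be sketched in C as follows. This is only a toy model under stated assumptions: `struct toy_target`, `destroy_obj()` and `rpc_handle()` are hypothetical names, the per-RPC result log lives in memory instead of an on-disk object, and the `limit` parameter merely simulates a crash by cutting execution short.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_UPD 4
#define NOBJ    4

/* Hypothetical persistent state: the objects plus a result log for one RPC. */
struct toy_target {
	bool               exists[NOBJ];
	unsigned long long xid;		/* RPC the log below belongs to */
	size_t             done;	/* transactions committed so far */
	int                rc[MAX_UPD];	/* result code of each committed one */
};

/* Destroy one object; returns 0 or -ENOENT. */
static int destroy_obj(struct toy_target *t, size_t obj)
{
	if (obj >= NOBJ || !t->exists[obj])
		return -ENOENT;
	t->exists[obj] = false;
	return 0;
}

/*
 * Handle (or replay) RPC xid carrying n destroys, one per transaction.
 * At most `limit` transactions have a result when the call returns;
 * a limit below n simulates a crash partway through the batch.
 * Committed transactions are not executed again: their result codes
 * are copied from the log into reply[].
 * Returns the number of entries filled in reply[].
 */
static size_t rpc_handle(struct toy_target *t, unsigned long long xid,
			 const size_t *objs, size_t n, size_t limit,
			 int *reply)
{
	size_t i;

	assert(n <= MAX_UPD);
	if (t->xid != xid) {		/* new RPC: start a fresh log */
		t->xid = xid;
		t->done = 0;
	}
	/* Reconstruct the part of the reply that was already committed. */
	for (i = 0; i < t->done; i++)
		reply[i] = t->rc[i];
	/* Execute what is missing, logging each result as it commits. */
	for (; i < n && i < limit; i++) {
		t->rc[i] = destroy_obj(t, objs[i]);
		t->done = i + 1;
		reply[i] = t->rc[i];
	}
	return i;
}
```

In this toy model, calling rpc_handle() with limit 2 and then again with limit 3 for the same XID yields the original result code for every update. Without the log, re-executing the first destroy would return -ENOENT rather than 0. That is the case option 3 sidesteps by restricting the logic to updates that are safe to run again.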
