|
the major problem here is how to reconstruct the reply. say we've got an RPC with 3 transactions (1 update in each, for simplicity). we executed 2 of the transactions, then crashed. ideally, during recovery we'd like to skip those 2 transactions, execute the missing one and reconstruct the reply with the appropriate result codes. but we don't have enough space to store all the codes in a last_rcvd slot. I think there are a few obvious options here:
1) have OUT store the result codes in a separate object of its own
2) stop execution upon an error and store the XID/batchid in the last_rcvd slot: essentially, never continue execution after an error and force the initiator to resubmit the remaining part. this in turn can result in a silly sequence of huge requests, each returning an error after a single executed update (say, the MDT wants to synchronize OST object destroys, but the objects have already been destroyed).
3) apply this logic only to idempotent updates, so we can simply execute them again instead of reconstructing the reply
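
to make option 2 a bit more concrete, here's a rough toy model in Python, not real Lustre code. `Slot`, `run_batch` and `submit_all` are made-up names, and the slot layout (one xid, a count of executed updates, one result code) is only an assumption about what could fit in a last_rcvd slot. it shows the skip-and-continue behaviour after a crash, and the "one RPC per failing update" problem:

```python
import errno
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical stand-in for a last_rcvd slot: room for ONE result code."""
    xid: int = 0        # RPC this slot describes
    executed: int = 0   # how many updates of that RPC are already committed
    rc: int = 0         # result code of the last executed update


def run_batch(slot, xid, updates, apply):
    """Option 2: execute updates in order and stop at the first error.

    A resend with the same xid skips whatever the slot says was executed.
    """
    if slot.xid != xid:
        # New RPC: the slot is reused from scratch.
        slot.xid, slot.executed, slot.rc = xid, 0, 0
    while slot.rc == 0 and slot.executed < len(updates):
        rc = apply(updates[slot.executed])
        # Assumption: a real server would commit the slot change
        # in the same transaction as the update itself.
        slot.executed += 1
        slot.rc = rc
    # The reply is reconstructed from the slot alone:
    # updates before the last executed one succeeded, the last returned rc.
    return slot.executed, slot.rc


def submit_all(updates, apply):
    """Toy initiator: resubmit the remaining part after every error."""
    slot, xid, start, rpcs = Slot(), 1, 0, 0
    while start < len(updates):
        executed, _rc = run_batch(slot, xid, updates[start:], apply)
        start += executed
        xid += 1
        rpcs += 1
    return rpcs


# Crash after 2 of 3 updates: on resend only the third one is executed.
applied = []
def ok(update):
    applied.append(update)
    return 0

crashed_slot = Slot(xid=7, executed=2, rc=0)
reply = run_batch(crashed_slot, 7, ["a", "b", "c"], ok)

# Destroys of already-destroyed objects: every update fails with -ENOENT,
# so the initiator needs one RPC per update.
def already_destroyed(update):
    return -errno.ENOENT

rpcs_all_fail = submit_all(["obj1", "obj2", "obj3"], already_destroyed)
rpcs_all_ok = submit_all(["obj1", "obj2", "obj3"], lambda update: 0)
```

in this model the slot never needs more than one result code, which is the whole point of option 2, but the last two lines show the cost: the same three updates take a single RPC when everything succeeds and one RPC each when every update fails.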
|