[LU-1008] Improper error handling in trans_stop if some failure occurred during the transaction
Created: 17/Jan/12
Updated: 28/Feb/18
Resolved: 28/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: nasf (Inactive)
Assignee: WC Triage
Resolution: Won't Fix
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 10515

 Description   

On the master branch, a failure during a transaction is recorded through thandle::th_result in mdd_trans_stop(), but osd_trans_stop() neither passes that errno down to the lower JBD(2) layer nor does any error processing of its own. As a result, JBD(2) does not know what happened to the transaction.



 Comments   
Comment by Mikhail Pershin [ 17/Jan/12 ]

Why should jbd(2) know about that? Can you explain in more detail what is wrong now? th_result is not meant to inform the OSD about the result of the operation but to pass the return code to the last_rcvd file, which is written in mdt_txn_stop_cb(). Note that last_rcvd should be updated even if the operation failed, so we must still finish the started transaction, and osd_trans_stop() does that no matter what the result of the operation in MDD is.

Comment by nasf (Inactive) [ 18/Jan/12 ]

When a failure occurs during the transaction, the transaction handle may or may not be aborted, depending on where the failure occurred, so the result of is_handle_aborted() is uncertain. Even though we expect "last_rcvd" to be updated regardless of failures, that cannot be guaranteed.

In such a case, we expect all the earlier sub-operations to be rolled back and "last_rcvd" to be updated according to the failure. Rolling back step by step may itself fail, and such double failures are difficult to handle. So it is relatively simple to abort the transaction explicitly (in osd_trans_stop()) without committing anything, and then restart the transaction to update "last_rcvd".

It is just my idea. Please correct me if I have missed anything.

Comment by Andreas Dilger [ 28/Feb/18 ]

It is not possible to "abort" transactions at the JBD2 layer to "undo" them. The only possible actions are to abort the whole journal and make the filesystem read-only, to manually undo the changes to the filesystem (if possible), or to leave the changes in place and fix them afterward with e2fsck/LFSCK (this should be avoided if possible).
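
As a rough illustration of the "manually undo the changes" option, and of the double-failure concern raised earlier in the thread, the sketch below unwinds already-applied steps in reverse order when a later step fails. It is only a model under stated assumptions: the `sketch_*` names, the step table, and the integer `state` are hypothetical and do not correspond to any Lustre or JBD2 interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcomes for illustration only. */
enum sketch_outcome {
    SKETCH_OK,          /* every step applied */
    SKETCH_UNDONE,      /* a step failed, earlier steps were undone */
    SKETCH_NEEDS_FSCK   /* an undo also failed, offline repair is needed */
};

/* One sub-operation and its manual inverse; both return 0 on success. */
struct sketch_step {
    int (*apply)(int *state);
    int (*undo)(int *state);
};

/*
 * Apply steps in order. On the first failure, undo the steps that were
 * already applied, newest first. If an undo fails too (a double failure),
 * stop and report that the state must be repaired afterward.
 */
static enum sketch_outcome sketch_run_steps(const struct sketch_step *steps,
                                            size_t n, int *state)
{
    assert(steps != NULL && state != NULL);

    for (size_t i = 0; i < n; i++) {
        if (steps[i].apply(state) == 0)
            continue;
        /* Step i failed: unwind steps i-1 down to 0. */
        while (i-- > 0) {
            if (steps[i].undo(state) != 0)
                return SKETCH_NEEDS_FSCK;
        }
        return SKETCH_UNDONE;
    }
    return SKETCH_OK;
}

/* Toy sub-operations acting on a counter. */
static int sketch_inc(int *s)  { *s += 1; return 0; }
static int sketch_dec(int *s)  { *s -= 1; return 0; }
static int sketch_fail(int *s) { (void)s; return -1; }
```

With two `sketch_inc`/`sketch_dec` steps followed by a failing step, the counter returns to its starting value and the outcome is `SKETCH_UNDONE`. If an undo function fails as well, the outcome is `SKETCH_NEEDS_FSCK`, which corresponds to leaving the changes in place for later repair.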
