[LU-2640] deactivate OSD_EXEC_OP() operation accounting if operation is being undone Created: 18/Jan/13 Updated: 13/Sep/17 Resolved: 18/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Andreas Dilger | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB | ||
| Issue Links: |
|
||||||||
| Severity: | 4 | ||||||||
| Rank (Obsolete): | 6174 | ||||||||
| Description |
|
In http://review.whamcloud.com/5046 there was a change to OSD_EXEC_OP() to address problems in MDD operations that fail part-way through, where the undo of earlier operations causes internal osd-ldiskfs declare/execute accounting to fail. For example, in mdd_create()::cleanup, the failed create calls __mdd_index_delete(), mdo_ref_del(), mdo_destroy() in order to clean up the newly created object. One proposal is to have some kind of OSD API call/flag/method to mark the transaction handle as being used for rollback, and to disable the ot_declare_op LASSERT() checking in OSD_EXEC_OP() for this case. |
| Comments |
| Comment by Andreas Dilger [ 18/Jan/13 ] |
|
To clarify, this bug is to track this fix separately from the LFSCK 1.5 project, instead of submitting the fix merged into that patch series. |
| Comment by Alex Zhuravlev [ 21/Jan/13 ] |
|
given all this information is required only from create to stop (and not longer), we can reserve quite amount of memory in osd_thread_info to track per-inode declaration. this way we could avoid the need in additional method. |
| Comment by nasf (Inactive) [ 21/Jan/13 ] |
|
Currently, I think that changing OSD APIs to enhance transaction debug for tracing modification rollback is some overreaction: not only the declare APIs, but also some exec APIs, otherwise, the OSD cannot distinguish whether a modification, such as destroy, belongs to a normal unlink or create rollback. So I prefer to simplify the requirement and processing as that: 1) When declare for the modification, the transaction trace mechanism will record both the transaction itself and its possible rollback. For example, when declare for ref_add, both "osd_thandle::ot_declare_ops[OSD_OT_REF_ADD]" and "osd_thandle::ot_declare_ops_rb[OSD_OT_REF_ADD]" will be increased. 2) The OSD needs not to strictly distinguish whether a modification belongs to a normal transaction or a rollback. For example, when ref_del is called, it will try to check "osd_thandle::ot_declare_ops[OSD_OT_REF_DEL]" firstly, if failed to decrease, then it quite possible belongs to former ref_add rollback, so the "osd_thandle::ot_declare_ops_rb[OSD_OT_REF_ADD]" will be decreased. Although it is not perfect, it is enough to resolve the issues I met during LFSCK 1.5 test. I prefer to use this patch (http://review.whamcloud.com/#change,5138) as a solution, at least a temporary solution for It does not belong to the LFSCK, and not depends on the LFSCK. But some sanity tests may fail if without this issue fixed, it should be landed to master in priority. |
| Comment by Alex Zhuravlev [ 21/Jan/13 ] |
|
well, I think Andreas is expecting a bit more than just fixing this assertion (otherwise we could just disable the whole thing which is just silly sanity check at the moment). instead we should think how we can improve credits accounting as well. I think this can be done on per-inode/op basis. |
| Comment by nasf (Inactive) [ 21/Jan/13 ] |
|
Well, it is different from transaction debug disable, it still has original functionality, additional to allow rollback cases. About more accurate credits accounting, we need consider more, but it seems not urgent requirement. To not block sanity tests, can we consider some temporary and simple solution? |
| Comment by Alex Zhuravlev [ 21/Jan/13 ] |
|
not sure I follow. we do pass sanity tests, there might be some cases, but they do not block testing. |
| Comment by nasf (Inactive) [ 21/Jan/13 ] |
|
We passed sanity test, just because we did not correctly rollback something. Please check mdd_dir.c changes in the patch http://review.whamcloud.com/#change,5138 |
| Comment by Alex Zhuravlev [ 21/Jan/13 ] |
|
this doesn't sound like a blocker for testing, because 5138 itself can be postponed. for release, sure. |
| Comment by nasf (Inactive) [ 21/Jan/13 ] |
|
The issues fixed in mdd_dir.c will affect LFSCK behavior. If we do not fix it, then there may be some dangling reference name entries in the system. |
| Comment by Alex Zhuravlev [ 21/Jan/13 ] |
|
well, i didn't say "do not fix it". but i think there is no point to introduce some temporary solution because for a while it's enough to just disable this sanity check. |
| Comment by nasf (Inactive) [ 21/Jan/13 ] |
|
The "dangling reference" will not affect sanity test, but it will affect the LFSCK behavior. |
| Comment by Alex Zhuravlev [ 21/Jan/13 ] |
|
not sure what "policy" means here. the exact order should not matter much. if current implementation is missing some undo, then it's to be fixed, of course. |
| Comment by nasf (Inactive) [ 21/Jan/13 ] |
|
Yes, in theory, for existing system, the dandling reference name entry may exist. But we should try to avoid that in new system. As my understand, we prefer to space lost rather than dangling reference, which has been explained in mdd_create() comment. I think we should follow the similar policy for link/unlink cases. |
| Comment by Alex Zhuravlev [ 21/Jan/13 ] |
|
we sort of having transactions to avoid this.. if some change can't be undone, then i'd suppose all the processing should be aborted (read-only and failover in case of ldiskfs). until the change is committed with ->trans_stop() on-disk state is safe (or not worse than before ->trans_start()) |
| Comment by nasf (Inactive) [ 22/Jan/13 ] |
|
I have updated the patch (http://review.whamcloud.com/#change,5138) with more consideration. |
| Comment by Alexander Boyko [ 24/Jan/13 ] |
|
Hi Nasf, I have checked the latest patch and got fail, pls look at http://jira.whamcloud.com/browse/LU-2668?focusedCommentId=51027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-51027 |
| Comment by nasf (Inactive) [ 24/Jan/13 ] |
|
Miss to call "OSD_EXEC_OP(th, insert)" in "osd_index_ea_insert()". It exists in master for a long time. Thanks for trigger it! I will update my patches to fix it. |
| Comment by Andreas Dilger [ 24/Jan/13 ] |
|
Alexander, what are you using for testing this problem? It looks like the MDT is running out of space and fails in the error cleanup path. |
| Comment by Alexey Lyashkov [ 24/Jan/13 ] |
|
Andreas, Alexander worked on backport a directory size limits patch from upstream to lustre to avoid problems from |
| Comment by Alexander Boyko [ 24/Jan/13 ] |
|
I try patchset 3 00000004:00000001:1.0:1359025663.382680:0:14733:0:(mdd_dir.c:523:__mdd_index_insert_only()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4) 00000004:00000001:1.0:1359025663.382681:0:14733:0:(mdd_dir.c:540:__mdd_index_insert()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4) 00000004:00000001:1.0:1359025663.382682:0:14733:0:(mdd_dir.c:1794:mdd_create()) Process leaving via cleanup (rc=18446744073709551588 : -28 : 0xffffffffffffffe4) 00000004:00040000:1.0:1359025663.382684:0:14733:0:(osd_internal.h:412:osd_trans_exec_op()) ASSERTION( oh->ot_declare_ops_rb[rb] > 0 ) failed: 5 00000004:00040000:1.0:1359025663.382799:0:14733:0:(osd_internal.h:412:osd_trans_exec_op()) LBUG Pid: 14733, comm: mdt01_001 Call Trace: [<ffffffffa073e905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [<ffffffffa073ef07>] lbug_with_loc+0x47/0xb0 [libcfs] [<ffffffffa0e7d6e0>] osd_it_ea_get+0x0/0x140 [osd_ldiskfs] [<ffffffffa0e86c75>] osd_object_ref_del+0x135/0x210 [osd_ldiskfs] [<ffffffffa0bb132b>] lod_ref_del+0x3b/0xd0 [lod] [<ffffffffa0dbf5fd>] mdo_ref_del+0xad/0xb0 [mdd] [<ffffffffa0dc5ce2>] mdd_create+0xd02/0x1500 [mdd] [<ffffffffa074f371>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [<ffffffffa0f27321>] mdt_reint_open+0x1191/0x1900 [mdt] [<ffffffffa074f371>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [<ffffffffa0f121a1>] mdt_reint_rec+0x41/0xe0 [mdt] [<ffffffffa0f0b843>] mdt_reint_internal+0x4e3/0x7d0 [mdt] [<ffffffffa0f0bdfd>] mdt_intent_reint+0x1ed/0x4f0 [mdt] [<ffffffffa0f074fe>] mdt_intent_policy+0x3ae/0x750 [mdt] mdd_declare_create() has one declare_ref_add, but mdd_create() cleanup has two ref_del(). |
| Comment by Alex Zhuravlev [ 24/Jan/13 ] |
|
> mdd_declare_create() has one declare_ref_add, but mdd_create() cleanup has two ref_del(). right, because mdd_create() does not declare initial nlink set to 1 by object creation |
| Comment by nasf (Inactive) [ 24/Jan/13 ] |
|
Seems we do not need two ref_del() when rollback for mkdir, the osd_object_destroy() can handle the last nlink for dir object. Alex, how do you think? |
| Comment by Alex Zhuravlev [ 24/Jan/13 ] |
|
well, this is not quite correct. osd_object_destroy() does its own hack because of limits on nlink in ldiskfs. we should not bring this assumption into the API. |
| Comment by nasf (Inactive) [ 24/Jan/13 ] |
|
Yes, it's hack. Then what do you prefer to resolve it? We can hack in OSD, and let osd_thandle::ot_declare_ops_rb[OSD_OT_REF_ADD] increase when osd_trans_exec_op(OSD_OT_CREATE) called, can you accept it? |
| Comment by Alex Zhuravlev [ 24/Jan/13 ] |
|
of course, it's better. though I was actually hoping for a breakthrough solution to improve checks and credits calculation. |
| Comment by nasf (Inactive) [ 25/Jan/13 ] |
|
Alexander, would you please to try this one? Thanks. |
| Comment by Alexander Boyko [ 28/Jan/13 ] |
|
With patch set 4 test looks fine. |
| Comment by Andreas Dilger [ 30/Jan/13 ] |
|
Alex, the 5138 patch was landed as-is to allow later LFSCK patch landings to move forward. Oleg wants a better comment explaining what the transaction declaration accounting is doing, so that it can be more easily understood by others. Other open questions that need to be discussed:
|
| Comment by Alex Zhuravlev [ 30/Jan/13 ] |
|
if I remember DMU code correctly, it's fine to declare some range to be modified and then modify it few times - we should be passing internal sanity checks. |
| Comment by nasf (Inactive) [ 17/Feb/13 ] |
|
One idea about the "rollback" notification: introduce a new member thandle::th_rollback, which can be set by MDD and checked by OSD. How do you think? |
| Comment by Alex Zhuravlev [ 17/Feb/13 ] |
|
the thing is that neither ldiskfs nor dmu need this.. |
| Comment by nasf (Inactive) [ 17/Feb/13 ] |
|
That depends on how "need" to be defined. From the view of transaction tracing and debugging, it is useful. We may implement such purposes through other ways, but the "thandle::th_rollback" is simple and clean. In fact, the "thandle::th_local" is also used for notification among layers, but not the functional requirement of ldiskfs or DUM. So... |
| Comment by Alex Zhuravlev [ 17/Feb/13 ] |
|
this debugging is in turn needed only because low-level debugging in ldiskfs is rather poor. this is the only reason to have it in the first place. |
| Comment by Andreas Dilger [ 19/Feb/13 ] |
|
Alex, while I agree that th_rollback is not strictly needed, I think you two are arguing in circles. Clearly we want to improve the ldiskfs transaction credit debugging in some way. The current mechanism is rather weak in detecting when rollback has started, while setting a flag in the handle is positively indicating that rollback has started. Unless you have some better mechanism that we could use in short order, I'm inclined to go with this method for now. The only other method I can think of is adding detailed accounting for every item being modified (e.g. tracking inode numbers, group descriptors, bitmaps, etc) so that the same item can be modified multiple times. However the complexity of that is high and I think it could be deferred until 2.5 if we want to implement it at that point. For now, I think th_rollback is an acceptable solution for the time we have left in 2.4, and it can be clearly commented that it is a workaround and reference this bug for a more complete solution in the future. |
| Comment by Alex Zhuravlev [ 19/Feb/13 ] |
|
it doesn't need to be that detailed, imo. just an array of object and (depending on object type) optype with some argument (hash in case of index, range in case of regular file, xattr) would be enough. I am against th_rollback because we'll have to modify number of functions in MDD to implement this. and then this will stay there for long as no one seem to be interested enough. at the same time 2.x is doing pretty well w/o this mechanism at all (obviously). I'm tempted to say i regret for introducing this in the first place |
| Comment by Alex Zhuravlev [ 19/Feb/13 ] |
|
given we do not hit this (except undo cases) so regular paths should be fine from this point of view, I'd suggest to disable the mechanism. |
| Comment by Andreas Dilger [ 19/Feb/13 ] |
|
Do you think the current mechanism is worse than none at all? I think the current accounting can still find some code defects, even if it does not find as many as with th_rollback being set by the caller. In that case, I'd rather leave it in place during development instead of removing it entirely. Maybe make this debugging conditional on (LUSTRE_PATCH >= 50 && LUSTRE_PATCH < 90)? That allows us to catch problems during development, but relies only on ldiskfs/jbd accounting during production. |
| Comment by Alex Zhuravlev [ 19/Feb/13 ] |
|
well, if it doesn't work well on its own, i'd rather agree it's bad. in orion i made the most trivial implementation just to avoid some silly bugs and frankly i didn't meant to keep it this way for long. unfortunately it survived. i'm absolutely fine with your idea to have it enabled in pre-release versions or something like that. i'm not actually going to block the patch if you think this is required. but again my concern is that we keep filling generic code with more and more very specific handling (another example is fid_is_client_mdt_visible()) |
| Comment by nasf (Inactive) [ 21/Feb/13 ] |
|
The patch for thandle::th_rollback idea: |
| Comment by nasf (Inactive) [ 13/Mar/13 ] |
|
The patch for conditional enable transaction debug: |
| Comment by Peter Jones [ 18/Mar/13 ] |
|
Landed for 2.4 |