Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2640

deactivate OSD_EXEC_OP() operation accounting if operation is being undone

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 4
    • 6174

    Description

      In http://review.whamcloud.com/5046 there was a change to OSD_EXEC_OP() to address problems in MDD operations that fail part-way through, where the undo of earlier operations causes internal osd-ldiskfs declare/execute accounting to fail. For example, in mdd_create()::cleanup, the failed create calls __mdd_index_delete(), mdo_ref_del(), mdo_destroy() in order to clean up the newly created object.

      One proposal is to have some kind of OSD API call/flag/method to mark the transaction handle as being used for rollback, and to disable the ot_declare_op LASSERT() checking in OSD_EXEC_OP() for this case.

      Attachments

        Issue Links

          Activity

            [LU-2640] deactivate OSD_EXEC_OP() operation accounting if operation is being undone
            pjones Peter Jones added a comment -

            Landed for 2.4

            pjones Peter Jones added a comment - Landed for 2.4

            The patch for conditional enable transaction debug:

            http://review.whamcloud.com/#change,5698

            yong.fan nasf (Inactive) added a comment - The patch for conditional enable transaction debug: http://review.whamcloud.com/#change,5698

            The patch for thandle::th_rollback idea:

            http://review.whamcloud.com/#change,5501

            yong.fan nasf (Inactive) added a comment - The patch for thandle::th_rollback idea: http://review.whamcloud.com/#change,5501

            well, if it doesn't work well on its own, i'd rather agree it's bad. in orion i made the most trivial implementation just to avoid some silly bugs and frankly i didn't meant to keep it this way for long. unfortunately it survived.

            i'm absolutely fine with your idea to have it enabled in pre-release versions or something like that. i'm not actually going to block the patch if you think this is required. but again my concern is that we keep filling generic code with more and more very specific handling (another example is fid_is_client_mdt_visible())
            which shouldn't be there at all.

            bzzz Alex Zhuravlev added a comment - well, if it doesn't work well on its own, i'd rather agree it's bad. in orion i made the most trivial implementation just to avoid some silly bugs and frankly i didn't meant to keep it this way for long. unfortunately it survived. i'm absolutely fine with your idea to have it enabled in pre-release versions or something like that. i'm not actually going to block the patch if you think this is required. but again my concern is that we keep filling generic code with more and more very specific handling (another example is fid_is_client_mdt_visible()) which shouldn't be there at all.

            Do you think the current mechanism is worse than none at all? I think the current accounting can still find some code defects, even if it does not find as many as with th_rollback being set by the caller. In that case, I'd rather leave it in place during development instead of removing it entirely.

            Maybe make this debugging conditional on (LUSTRE_PATCH >= 50 && LUSTRE_PATCH < 90)? That allows us to catch problems during development, but relies only on ldiskfs/jbd accounting during production.

            adilger Andreas Dilger added a comment - Do you think the current mechanism is worse than none at all? I think the current accounting can still find some code defects, even if it does not find as many as with th_rollback being set by the caller. In that case, I'd rather leave it in place during development instead of removing it entirely. Maybe make this debugging conditional on (LUSTRE_PATCH >= 50 && LUSTRE_PATCH < 90)? That allows us to catch problems during development, but relies only on ldiskfs/jbd accounting during production.

            given we do not hit this (except undo cases) so regular paths should be fine from this point of view, I'd suggest to disable the mechanism.
            we still have ldiskfs's internal checks.

            bzzz Alex Zhuravlev added a comment - given we do not hit this (except undo cases) so regular paths should be fine from this point of view, I'd suggest to disable the mechanism. we still have ldiskfs's internal checks.

            it doesn't need to be that detailed, imo. just an array of object and (depending on object type) optype with some argument (hash in case of index, range in case of regular file, xattr) would be enough.

            I am against th_rollback because we'll have to modify number of functions in MDD to implement this. and then this will stay there for long as no one seem to be interested enough.

            at the same time 2.x is doing pretty well w/o this mechanism at all (obviously). I'm tempted to say i regret for introducing this in the first place

            bzzz Alex Zhuravlev added a comment - it doesn't need to be that detailed, imo. just an array of object and (depending on object type) optype with some argument (hash in case of index, range in case of regular file, xattr) would be enough. I am against th_rollback because we'll have to modify number of functions in MDD to implement this. and then this will stay there for long as no one seem to be interested enough. at the same time 2.x is doing pretty well w/o this mechanism at all (obviously). I'm tempted to say i regret for introducing this in the first place

            Alex, while I agree that th_rollback is not strictly needed, I think you two are arguing in circles. Clearly we want to improve the ldiskfs transaction credit debugging in some way. The current mechanism is rather weak in detecting when rollback has started, while setting a flag in the handle is positively indicating that rollback has started.

            Unless you have some better mechanism that we could use in short order, I'm inclined to go with this method for now. The only other method I can think of is adding detailed accounting for every item being modified (e.g. tracking inode numbers, group descriptors, bitmaps, etc) so that the same item can be modified multiple times. However the complexity of that is high and I think it could be deferred until 2.5 if we want to implement it at that point.

            For now, I think th_rollback is an acceptable solution for the time we have left in 2.4, and it can be clearly commented that it is a workaround and reference this bug for a more complete solution in the future.

            adilger Andreas Dilger added a comment - Alex, while I agree that th_rollback is not strictly needed, I think you two are arguing in circles. Clearly we want to improve the ldiskfs transaction credit debugging in some way. The current mechanism is rather weak in detecting when rollback has started, while setting a flag in the handle is positively indicating that rollback has started. Unless you have some better mechanism that we could use in short order, I'm inclined to go with this method for now. The only other method I can think of is adding detailed accounting for every item being modified (e.g. tracking inode numbers, group descriptors, bitmaps, etc) so that the same item can be modified multiple times. However the complexity of that is high and I think it could be deferred until 2.5 if we want to implement it at that point. For now, I think th_rollback is an acceptable solution for the time we have left in 2.4, and it can be clearly commented that it is a workaround and reference this bug for a more complete solution in the future.

            this debugging is in turn needed only because low-level debugging in ldiskfs is rather poor. this is the only reason to have it in the first place.
            with better debugging (which we want to calculate credits better) we likely won't need this flag.

            bzzz Alex Zhuravlev added a comment - this debugging is in turn needed only because low-level debugging in ldiskfs is rather poor. this is the only reason to have it in the first place. with better debugging (which we want to calculate credits better) we likely won't need this flag.

            That depends on how "need" to be defined. From the view of transaction tracing and debugging, it is useful. We may implement such purposes through other ways, but the "thandle::th_rollback" is simple and clean. In fact, the "thandle::th_local" is also used for notification among layers, but not the functional requirement of ldiskfs or DUM. So...

            yong.fan nasf (Inactive) added a comment - That depends on how "need" to be defined. From the view of transaction tracing and debugging, it is useful. We may implement such purposes through other ways, but the "thandle::th_rollback" is simple and clean. In fact, the "thandle::th_local" is also used for notification among layers, but not the functional requirement of ldiskfs or DUM. So...

            People

              yong.fan nasf (Inactive)
              adilger Andreas Dilger
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: