
osd-ldiskfs to truncate outside of main transaction

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.13.0
    • Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5
    • None
    • 9223372036854775807

    Description

      This is needed to implement a "transaction first, locking next" order, so that locking is unified among MDT/OST/OUT.
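As an illustration of why a single acquisition order matters, here is a minimal sketch (not Lustre code). It models two resources with hypothetical names, `transaction` and `object_lock`, and checks whether the orders used by different code paths can form a cycle, which is the usual precondition for an ABBA deadlock. Which Lustre path used which order before the fix is not stated in this ticket, so the two paths below are assumptions.

```python
def lock_order_has_cycle(paths):
    """Return True if the acquisition orders in `paths` conflict.

    Each path is a list of resource names in the order one code path
    acquires them. An edge a -> b means "a is held while b is acquired".
    """
    graph = {}
    for path in paths:
        for i, held in enumerate(path):
            graph.setdefault(held, set()).update(path[i + 1:])

    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge: the orders form a cycle
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


# Hypothetical paths: one starts the transaction first, the other locks first.
mixed = [["transaction", "object_lock"], ["object_lock", "transaction"]]
# Unified order: every path starts the transaction, then takes the lock.
unified = [["transaction", "object_lock"], ["transaction", "object_lock"]]

print(lock_order_has_cycle(mixed))    # True
print(lock_order_has_cycle(unified))  # False
```

With one order shared by all paths the graph has no cycle, so this particular class of deadlock cannot occur.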

          Activity

            [LU-10048] osd-ldiskfs to truncate outside of main transaction

            Hello,

            We have applied https://review.whamcloud.com/43277 ("LU-10048 ofd: take local locks within transaction") and https://review.whamcloud.com/43278 ("LU-13093 osd: fix osd_attr_set race") on the problematic filesystem. The issue has not occurred since.

            We were able to reproduce this issue within 5 to 10 minutes by creating many small files while several threads performed file migrations ("lfs migrate" between OSTs).

            eaujames Etienne Aujames added a comment

            Hello Alex,

            Did you have time to look at our backtrace?

            If you need more data from the crashdump I can get you some.

            eaujames Etienne Aujames added a comment

            I have added our backtrace to this ticket: crash_bt_jbd2_locked_20210412.log

            eaujames Etienne Aujames added a comment

            The issue seems to occur relatively often with a lot of migrate threads (4 times today, on different OSTs).

            We will test the backports soon in a production environment (maybe this week).

            I will try to get a backtrace from the crashdump (manually triggered) tomorrow.

            eaujames Etienne Aujames added a comment

            Please generate a full backtrace with

            echo t >/proc/sysrq-trigger

            and attach it to the ticket.

            bzzz Alex Zhuravlev added a comment
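The request above can be scripted so that the dump is saved in one step. This is a minimal sketch, assuming root privileges and that `dmesg` is available; the function names and the output file name are hypothetical and not part of any Lustre tooling.

```python
import subprocess


def trigger_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to log the state of all tasks (same as `echo t`)."""
    with open(trigger_path, "w") as trigger:
        trigger.write("t\n")


def save_kernel_log(out_path):
    """Save the kernel ring buffer, which should now contain the task dump."""
    log = subprocess.run(
        ["dmesg"], check=True, capture_output=True, text=True
    ).stdout
    with open(out_path, "w") as out:
        out.write(log)


# Typical use on the affected server, as root:
#   trigger_task_dump()
#   save_kernel_log("sysrq_t_backtrace.log")  # hypothetical file name
```

On a server with many threads the ring buffer may be too small to hold the whole dump, in which case the system log or serial console is likely a more complete source.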

            The "LU-10048 osd: async truncate" has already landed on b2_12.

            eaujames Etienne Aujames added a comment

            As for the fast journal: we have a number of nodes running all the tests on RAM-backed devices 24 hours a day.

            bzzz Alex Zhuravlev added a comment

            There is yet another patch under LU-10048: "LU-10048 osd: async truncate", commit cf29a5e7bfa254ccfcea023028fe7da80503c512 in master.

            bzzz Alex Zhuravlev added a comment (edited)

            Hello,

            I have backported https://review.whamcloud.com/43277 ("LU-10048 ofd: take local locks within transaction") along with https://review.whamcloud.com/43278 ("LU-13093 osd: fix osd_attr_set race").

            It seems we trigger this bug in 2.12.6 with an external journal (on a flash device) for rotational disks while several migrate threads are running on a robinhood client (512 threads).

            The jbd2 journal seems to be locked while the transaction is in the T_LOCKED state.

            transaction_t.t_state = T_LOCKED
            transaction_t.t_handle_count = 40
            transaction_t.t_updates = 1
            number of tasks in j_wait_transaction_locked = 324
            number of tasks in j_wait_updates = 0
            
            eaujames Etienne Aujames added a comment (edited)
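One way to read the counters in the comment above, based on how jbd2 generally behaves: once a transaction is in T_LOCKED, the commit waits for `t_updates` to drop to zero, and tasks that try to start a new handle sleep on `j_wait_transaction_locked` until then. The sketch below only encodes that reading. The helper name and the dictionary keys are hypothetical, and the values are the ones reported in the comment.

```python
def describe_jbd2_stall(state):
    """Summarize a possible T_LOCKED stall from values read out of a crash dump."""
    if state["t_state"] != "T_LOCKED":
        return "transaction is not in T_LOCKED"
    if state["t_updates"] > 0:
        return (
            f"commit waits for {state['t_updates']} open handle(s); "
            f"{state['waiters_transaction_locked']} task(s) wait to start a handle"
        )
    return "no open handles remain; commit can proceed"


# Values reported in the comment; the key names are illustrative only.
observed = {
    "t_state": "T_LOCKED",
    "t_handle_count": 40,
    "t_updates": 1,
    "waiters_transaction_locked": 324,
    "waiters_updates": 0,
}

print(describe_jbd2_stall(observed))
# commit waits for 1 open handle(s); 324 task(s) wait to start a handle
```

If the one outstanding handle belongs to a thread that is itself blocked on a lock held by one of the waiting tasks, nothing can make progress. Whether that is what happened here is an assumption that only the full backtrace could confirm.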

            Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/43277
            Subject: LU-10048 ofd: take local locks within transaction
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 4dc5634dcdf3381a7ea698605a3c7a345f689a4c

            gerrit Gerrit Updater added a comment

            People

              bzzz Alex Zhuravlev
              bzzz Alex Zhuravlev
              Votes: 0
              Watchers: 19
