Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • None
    • None
    • local testing
    • 3
    • 9223372036854775807

    Description

      w/o the patch only 3 from 36 racer runs timed out. w/ the patch - 35 from 36 did.

      Attachments

        Issue Links

          Activity

            [LU-14187] LU-14134 causes many racer timeouts

            bzzzyup, i saw your pushed patch, i'll try to fix by moving restart to ofd layer.

            wshilong Wang Shilong (Inactive) added a comment - bzzz yup, i saw your pushed patch, i'll try to fix by moving restart to ofd layer.

            It looks currently, some ofd_write_lock is held outside transaction, some inside, so maybe that is reason why it could timeout even without patch

            I found this only in fallocate path which was landed just recently and this is clearly a bug.

            bzzz Alex Zhuravlev added a comment - It looks currently, some ofd_write_lock is held outside transaction, some inside, so maybe that is reason why it could timeout even without patch I found this only in fallocate path which was landed just recently and this is clearly a bug.

            It looks currently, some ofd_write_lock is held outside transaction, some inside, so maybe that is reason why it could timeout even without patch

            wshilong Wang Shilong (Inactive) added a comment - It looks currently, some ofd_write_lock is held outside transaction, some inside, so maybe that is reason why it could timeout even without patch

            What do you think we move IO restart upper layer(in OFD layer), or we should implement similar way like truncate?

            wshilong Wang Shilong (Inactive) added a comment - What do you think we move IO restart upper layer(in OFD layer), or we should implement similar way like truncate?

            while trunc lock was involved in this specific example, I think the issue is not directly related to that one - regular dt locks (e.g. ofd_write_lock() and ofd_read_lock()) are taken within a running transaction and transaction restart breaks this. actually that's the reason why truncate was implemented via osd_trans_stop().

            bzzz Alex Zhuravlev added a comment - while trunc lock was involved in this specific example, I think the issue is not directly related to that one - regular dt locks (e.g. ofd_write_lock() and ofd_read_lock()) are taken within a running transaction and transaction restart breaks this. actually that's the reason why truncate was implemented via osd_trans_stop().

            bzzz Thanks for helping track it down.
            Would you be specific a bit more? It looks now we hold trunc lock before start transaction, and for restart case, trunc lock is still held?

            wshilong Wang Shilong (Inactive) added a comment - bzzz Thanks for helping track it down. Would you be specific a bit more? It looks now we hold trunc lock before start transaction, and for restart case, trunc lock is still held?
            PID: 1677   TASK: ffff88012f27e000  CPU: 1   COMMAND: "ll_ost_io00_030"
             #0 [ffff880105c5ba80] __schedule at ffffffff816af9ad
                /home/lustre/linux-4.18.0-32.el8/kernel/sched/core.c: 2853
             #1 [ffff880105c5bb18] schedule at ffffffff816b0234
                /home/lustre/linux-4.18.0-32.el8/./arch/x86/include/asm/preempt.h: 81
             #2 [ffff880105c5bb28] rwsem_down_write_failed at ffffffff816b479b
                /home/lustre/linux-4.18.0-32.el8/./arch/x86/include/asm/current.h: 15
             #3 [ffff880105c5bbb8] call_rwsem_down_write_failed at ffffffff816abb13
                /home/lustre/linux-4.18.0-32.el8/arch/x86/lib/rwsem.S: 118
             #4 [ffff880105c5bbf8] down_write at ffffffff816b3a34
                /home/lustre/linux-4.18.0-32.el8/./arch/x86/include/asm/rwsem.h: 142
             #5 [ffff880105c5bc18] osd_trunc_lock at ffffffffa0b2da21 [osd_ldiskfs]
                /home/lustre/master-mine/lustre/osd-ldiskfs/osd_io.c: 2530
             #6 [ffff880105c5bc48] osd_declare_punch at ffffffffa0b2e0fe [osd_ldiskfs]
                /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
             #7 [ffff880105c5bc70] ofd_object_punch at ffffffffa0e824f9 [ofd]
                /home/lustre/master-mine/lustre/include/dt_object.h: 2570
             #8 [ffff880105c5bd48] ofd_punch_hdl at ffffffffa0e6c54c [ofd]
                /home/lustre/master-mine/lustre/ofd/ofd_dev.c: 2125
            
            PID: 6515   TASK: ffff88012c996000  CPU: 1   COMMAND: "ll_ost_io00_006"
             #0 [ffff88010d40b9a8] __schedule at ffffffff816af9ad
                /home/lustre/linux-4.18.0-32.el8/kernel/sched/core.c: 2853
             #1 [ffff88010d40ba40] schedule at ffffffff816b0234
                /home/lustre/linux-4.18.0-32.el8/./arch/x86/include/asm/preempt.h: 81
             #2 [ffff88010d40ba50] wait_transaction_locked at ffffffff812ef568
                /home/lustre/linux-4.18.0-32.el8/fs/jbd2/transaction.c: 160
             #3 [ffff88010d40baa8] add_transaction_credits at ffffffff812ef700
                /home/lustre/linux-4.18.0-32.el8/fs/jbd2/transaction.c: 188
             #4 [ffff88010d40bb00] start_this_handle at ffffffff812efa1c
                /home/lustre/linux-4.18.0-32.el8/fs/jbd2/transaction.c: 357
             #5 [ffff88010d40bb80] osd_write_commit at ffffffffa0b2ccc0 [osd_ldiskfs]
                /home/lustre/master-mine/lustre/osd-ldiskfs/osd_io.c: 1476
             #6 [ffff88010d40bbf8] ofd_commitrw at ffffffffa0e892df [ofd]
                /home/lustre/master-mine/lustre/include/dt_object.h: 2530
             #7 [ffff88010d40bcb0] tgt_brw_write at ffffffffa05c756c [ptlrpc]
                /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
            

            basically LU-14134 breaks transaction-locks ordering as restart (which is just start in essense) is done with the lock held

            bzzz Alex Zhuravlev added a comment - PID: 1677 TASK: ffff88012f27e000 CPU: 1 COMMAND: "ll_ost_io00_030" #0 [ffff880105c5ba80] __schedule at ffffffff816af9ad /home/lustre/linux-4.18.0-32.el8/kernel/sched/core.c: 2853 #1 [ffff880105c5bb18] schedule at ffffffff816b0234 /home/lustre/linux-4.18.0-32.el8/./arch/x86/include/asm/preempt.h: 81 #2 [ffff880105c5bb28] rwsem_down_write_failed at ffffffff816b479b /home/lustre/linux-4.18.0-32.el8/./arch/x86/include/asm/current.h: 15 #3 [ffff880105c5bbb8] call_rwsem_down_write_failed at ffffffff816abb13 /home/lustre/linux-4.18.0-32.el8/arch/x86/lib/rwsem.S: 118 #4 [ffff880105c5bbf8] down_write at ffffffff816b3a34 /home/lustre/linux-4.18.0-32.el8/./arch/x86/include/asm/rwsem.h: 142 #5 [ffff880105c5bc18] osd_trunc_lock at ffffffffa0b2da21 [osd_ldiskfs] /home/lustre/master-mine/lustre/osd-ldiskfs/osd_io.c: 2530 #6 [ffff880105c5bc48] osd_declare_punch at ffffffffa0b2e0fe [osd_ldiskfs] /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155 #7 [ffff880105c5bc70] ofd_object_punch at ffffffffa0e824f9 [ofd] /home/lustre/master-mine/lustre/include/dt_object.h: 2570 #8 [ffff880105c5bd48] ofd_punch_hdl at ffffffffa0e6c54c [ofd] /home/lustre/master-mine/lustre/ofd/ofd_dev.c: 2125 PID: 6515 TASK: ffff88012c996000 CPU: 1 COMMAND: "ll_ost_io00_006" #0 [ffff88010d40b9a8] __schedule at ffffffff816af9ad /home/lustre/linux-4.18.0-32.el8/kernel/sched/core.c: 2853 #1 [ffff88010d40ba40] schedule at ffffffff816b0234 /home/lustre/linux-4.18.0-32.el8/./arch/x86/include/asm/preempt.h: 81 #2 [ffff88010d40ba50] wait_transaction_locked at ffffffff812ef568 /home/lustre/linux-4.18.0-32.el8/fs/jbd2/transaction.c: 160 #3 [ffff88010d40baa8] add_transaction_credits at ffffffff812ef700 /home/lustre/linux-4.18.0-32.el8/fs/jbd2/transaction.c: 188 #4 [ffff88010d40bb00] start_this_handle at ffffffff812efa1c /home/lustre/linux-4.18.0-32.el8/fs/jbd2/transaction.c: 357 #5 [ffff88010d40bb80] osd_write_commit at ffffffffa0b2ccc0 [osd_ldiskfs] /home/lustre/master-mine/lustre/osd-ldiskfs/osd_io.c: 1476 #6 [ffff88010d40bbf8] ofd_commitrw at ffffffffa0e892df [ofd] /home/lustre/master-mine/lustre/include/dt_object.h: 2530 #7 [ffff88010d40bcb0] tgt_brw_write at ffffffffa05c756c [ptlrpc] /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155 basically LU-14134 breaks transaction-locks ordering as restart (which is just start in essense) is done with the lock held

            People

              wshilong Wang Shilong (Inactive)
              bzzz Alex Zhuravlev
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: