
LU-15117: ofd_read_lock vs transaction deadlock while allocating buffers

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0
    • Severity: 3

    Description

      PID: 154236  TASK: ffff9ab9b2f330c0  CPU: 9   COMMAND: "ll_ost_io01_002"
       #0 [ffff9ab9b2f6af58] __schedule at ffffffff9876ab17
       #1 [ffff9ab9b2f6afe0] schedule at ffffffff9876b019
       #2 [ffff9ab9b2f6aff0] wait_transaction_locked at ffffffffc0760085 [jbd2]
       #3 [ffff9ab9b2f6b048] add_transaction_credits at ffffffffc0760368 [jbd2]
       #4 [ffff9ab9b2f6b0a8] start_this_handle at ffffffffc07605e1 [jbd2]
       #5 [ffff9ab9b2f6b140] jbd2__journal_start at ffffffffc0760a93 [jbd2]
       #6 [ffff9ab9b2f6b188] __ldiskfs_journal_start_sb at ffffffffc19c1c79 [ldiskfs]
       #7 [ffff9ab9b2f6b1c8] ldiskfs_release_dquot at ffffffffc19b92ec [ldiskfs]
       #8 [ffff9ab9b2f6b1e8] dqput at ffffffff982aeb5d
       #9 [ffff9ab9b2f6b210] __dquot_drop at ffffffff982b0215
      #10 [ffff9ab9b2f6b248] dquot_drop at ffffffff982b0285
      #11 [ffff9ab9b2f6b258] ldiskfs_clear_inode at ffffffffc19bdcf2 [ldiskfs]
      #12 [ffff9ab9b2f6b270] ldiskfs_evict_inode at ffffffffc19dccdf [ldiskfs]
      #13 [ffff9ab9b2f6b2b0] evict at ffffffff9825ee14
      #14 [ffff9ab9b2f6b2d8] dispose_list at ffffffff9825ef1e
      #15 [ffff9ab9b2f6b300] prune_icache_sb at ffffffff9825ff2c
      #16 [ffff9ab9b2f6b368] prune_super at ffffffff98244323
      #17 [ffff9ab9b2f6b3a0] shrink_slab at ffffffff981ca105
      #18 [ffff9ab9b2f6b440] do_try_to_free_pages at ffffffff981cd3c2
      #19 [ffff9ab9b2f6b4b8] try_to_free_pages at ffffffff981cd5dc
      #20 [ffff9ab9b2f6b550] __alloc_pages_slowpath at ffffffff987601ef
      #21 [ffff9ab9b2f6b640] __alloc_pages_nodemask at ffffffff981c1465
      #22 [ffff9ab9b2f6b6f0] alloc_pages_current at ffffffff9820e2c8
      #23 [ffff9ab9b2f6b738] new_slab at ffffffff982192d5
      #24 [ffff9ab9b2f6b770] ___slab_alloc at ffffffff9821ad4c
      #25 [ffff9ab9b2f6b840] __slab_alloc at ffffffff9876160c
      #26 [ffff9ab9b2f6b880] kmem_cache_alloc at ffffffff9821c3eb
      #27 [ffff9ab9b2f6b8c0] __radix_tree_preload at ffffffff9837b7b9
      #28 [ffff9ab9b2f6b8f0] radix_tree_maybe_preload at ffffffff9837bd0e
      #29 [ffff9ab9b2f6b900] __add_to_page_cache_locked at ffffffff981b734a
      #30 [ffff9ab9b2f6b940] add_to_page_cache_lru at ffffffff981b74b7
      #31 [ffff9ab9b2f6b970] find_or_create_page at ffffffff981b783e
      #32 [ffff9ab9b2f6b9b0] osd_bufs_get at ffffffffc1a773c3 [osd_ldiskfs]
      #33 [ffff9ab9b2f6ba10] ofd_preprw_write at ffffffffc144f156 [ofd]
      #34 [ffff9ab9b2f6ba90] ofd_preprw at ffffffffc14500ce [ofd]
      #35 [ffff9ab9b2f6bb28] tgt_brw_write at ffffffffc0ece6e9 [ptlrpc]
      #36 [ffff9ab9b2f6bca0] tgt_request_handle at ffffffffc0eccd4a [ptlrpc]
      #37 [ffff9ab9b2f6bd30] ptlrpc_server_handle_request at ffffffffc0e72586 [ptlrpc]
      #38 [ffff9ab9b2f6bde8] ptlrpc_main at ffffffffc0e7625a [ptlrpc]
      #39 [ffff9ab9b2f6bec8] kthread at ffffffff980c1f81
      #40 [ffff9ab9b2f6bf50] ret_from_fork_nospec_begin at ffffffff98777c1d
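
      For context: in the trace above, find_or_create_page() allocates page-cache pages with a GFP mask that still allows filesystem reclaim, so under memory pressure the allocation recurses into inode eviction and from there into jbd2 (ldiskfs_release_dquot -> jbd2__journal_start) while the ofd object lock is held. Below is a minimal sketch of the generic kernel-side guard against that recursion, assuming a kernel that provides memalloc_nofs_save()/memalloc_nofs_restore(); the helper name is hypothetical, and this is not what the LU-15117 patches do (they instead move the allocation out from under the lock).

      #include <linux/pagemap.h>
      #include <linux/sched/mm.h>   /* memalloc_nofs_save()/memalloc_nofs_restore() */

      /* Hypothetical helper, for illustration only: allocate a page-cache page
       * while a lock is held, without letting the allocator recurse into
       * filesystem reclaim (and from there into jbd2) on this thread. */
      static struct page *find_or_create_page_nofs(struct address_space *mapping,
                                                   pgoff_t index)
      {
              unsigned int nofs = memalloc_nofs_save(); /* strip __GFP_FS from this task's allocations */
              struct page *page;

              page = find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
              memalloc_nofs_restore(nofs);
              return page;
      }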
      

          Activity


            bzzz Alex Zhuravlev added a comment - sthiell I think it's LU-16044; unfortunately, LU-15117 increases the likelihood of the former one. Would you mind trying https://review.whamcloud.com/#/c/48033/ along with https://review.whamcloud.com/47925 ?

            sthiell Stephane Thiell added a comment - Alex, any update on this new patch? We have been using https://review.whamcloud.com/47925 on top of 2.12.9, but last night, the same kind of deadlock occurred on an OSS (attaching the output of "foreach bt" as fir-io7-s1_crash_foreach_bt_20220831.log in case you're interested). This is still our most annoying Lustre 2.12 issue... Thanks!

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48209
            Subject: LU-15117 ofd: no lock for dt_bufs_get() in read path
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3cd2507bf8211cc662056c48bbe5537fd0a3f5a1

            eaujames Etienne Aujames added a comment - edited

            Hello,

            We hit this issue in production at the CEA, on a Clusterstore system running a 2.12 version, after activating the OST cache.
            We have isolated a job that reproduces the issue:

            1. One OST thread is doing an fsync, which forces the current transaction to commit.
            2. The jbd2 thread is waiting for another thread to finish updating its buffer (t_updates == 1).
            3. The thread doing the update (after ofd_trans_start) is waiting for the write lock on the OST object (ofd_write_lock) for a punch.
            4. The thread holding a read lock hangs in "ofd_preprw_write -> osd_bufs_get -> .. -> ldiskfs_release_dquot -> wait_transaction_locked" because of memory pressure and the ongoing transaction commit (see the sketch after this list).
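
            The cycle may be easier to see as a userspace model (purely illustrative C using pthreads; nothing below is Lustre or jbd2 code, and t_updates/committing only mirror the jbd2 concepts). Running it hangs by design, with the three threads waiting on each other exactly as in points 1-4 above:

            #include <pthread.h>
            #include <stdio.h>
            #include <unistd.h>

            static pthread_rwlock_t object_lock = PTHREAD_RWLOCK_INITIALIZER; /* ofd_{read,write}_lock */
            static pthread_mutex_t journal = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t journal_cv = PTHREAD_COND_INITIALIZER;
            static int t_updates;   /* handles joined to the running transaction */
            static int committing;  /* fsync asked jbd2 to commit the transaction */

            static void journal_start(void)              /* models jbd2__journal_start() */
            {
                    pthread_mutex_lock(&journal);
                    while (committing)                   /* models wait_transaction_locked() */
                            pthread_cond_wait(&journal_cv, &journal);
                    t_updates++;
                    pthread_mutex_unlock(&journal);
            }

            static void *punch_thread(void *arg)         /* point 3: ll_ost punch/setattr */
            {
                    (void)arg;
                    journal_start();                     /* ofd_trans_start(): t_updates == 1 */
                    sleep(1);                            /* sleeps only sequence the demo */
                    pthread_rwlock_wrlock(&object_lock); /* ofd_write_lock(): blocks behind io_thread */
                    pthread_rwlock_unlock(&object_lock);
                    return NULL;
            }

            static void *commit_thread(void *arg)        /* points 1-2: fsync + jbd2 commit */
            {
                    (void)arg;
                    sleep(1);
                    pthread_mutex_lock(&journal);
                    committing = 1;                      /* new handles must now wait */
                    while (t_updates > 0)                /* waits forever for punch_thread's handle */
                            pthread_cond_wait(&journal_cv, &journal);
                    pthread_mutex_unlock(&journal);
                    return NULL;
            }

            static void *io_thread(void *arg)            /* point 4: ll_ost_io in ofd_preprw_write */
            {
                    (void)arg;
                    pthread_rwlock_rdlock(&object_lock); /* ofd_read_lock() */
                    sleep(2);                            /* osd_bufs_get() hits memory pressure... */
                    journal_start();                     /* ...reclaim -> ldiskfs_release_dquot(): blocks */
                    pthread_rwlock_unlock(&object_lock);
                    return NULL;
            }

            int main(void)
            {
                    pthread_t t[3];
                    int i;

                    pthread_create(&t[0], NULL, punch_thread, NULL);
                    pthread_create(&t[1], NULL, io_thread, NULL);
                    pthread_create(&t[2], NULL, commit_thread, NULL);
                    puts("threads started; the process now hangs in the three-way cycle");
                    for (i = 0; i < 3; i++)
                            pthread_join(t[i], NULL);
                    return 0;
            }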

            So for now we have disabled the writethrough_cache, but without "LU-12071 osd-ldiskfs: bypass pagecache if requested" (https://review.whamcloud.com/34422) this doesn't seem to be sufficient.

            I have read https://review.whamcloud.com/47925.
            Can someone tell me why this issue shouldn't happen with ofd_preprw_read()? The ofd_read_lock is also taken there when dt_bufs_get is called.


            "Stephane Thiell <sthiell@stanford.edu>" uploaded a new patch: https://review.whamcloud.com/47925
            Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 2fdf51055b76d47e24464cc93ebeabdc050dbd3a


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47029/
            Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7c4a7c59ed9c6185da326d6df6223f4818b57769
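
            For readers following the patch: the idea behind "don't take lock for dt_bufs_get()" is to stop holding the ofd object lock across the buffer/page allocation, so that a thread stalled in memory reclaim can no longer block a transaction holder waiting for the same lock. A generic userspace illustration of that reordering follows (hypothetical helper names; this is not the actual ofd_preprw_write() code):

            #include <pthread.h>
            #include <stdlib.h>

            static pthread_rwlock_t obj_lock = PTHREAD_RWLOCK_INITIALIZER;

            /* Problematic ordering: the allocation (which may block, or in the kernel
             * case recurse into reclaim) happens while the object lock is held. */
            void *prep_buffers_locked(size_t n)
            {
                    void *buf;

                    pthread_rwlock_rdlock(&obj_lock);
                    buf = malloc(n);           /* the lock is pinned for the whole allocation */
                    pthread_rwlock_unlock(&obj_lock);
                    return buf;
            }

            /* Fixed ordering: allocate first, take the lock only for the part of the
             * work that actually needs the object to be stable. */
            void *prep_buffers_unlocked(size_t n)
            {
                    void *buf = malloc(n);     /* the allocation no longer pins the lock */

                    pthread_rwlock_rdlock(&obj_lock);
                    /* ... work that requires the object to be stable ... */
                    pthread_rwlock_unlock(&obj_lock);
                    return buf;
            }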


            sthiell Stephane Thiell added a comment -

            Hello,

            We hit this problem on another 2.12 filesystem last night. One OSS had a load of 400+; it was still up and answering quota requests, but it was not processing I/Os, so many jobs were hung. I see that Alex's patch is still in review for master. Just wanted to raise awareness that this issue is currently our most impactful Lustre issue. Thanks for your help with this!

            bzzz Alex Zhuravlev added a comment - sthiell you're correct.

            sthiell Stephane Thiell added a comment - I believe we hit this same issue yesterday with 2.12.8 on Oak at Stanford. The OSS was deadlocked. Alex, do you think it can be backported to b2_12?

            thread #1: zone_reclaim -> start_this_handle

            PID: 64334  TASK: ffff98040eb26300  CPU: 10  COMMAND: "ll_ost_io00_105"
             #0 [ffff980ca7ff31d8] __schedule at ffffffffacb86d07
             #1 [ffff980ca7ff3260] schedule at ffffffffacb87229
             #2 [ffff980ca7ff3270] wait_transaction_locked at ffffffffc026a085 [jbd2]
             #3 [ffff980ca7ff32c8] add_transaction_credits at ffffffffc026a378 [jbd2]
             #4 [ffff980ca7ff3328] start_this_handle at ffffffffc026a601 [jbd2]
             #5 [ffff980ca7ff33c0] jbd2__journal_start at ffffffffc026aab3 [jbd2]
             #6 [ffff980ca7ff3408] __ldiskfs_journal_start_sb at ffffffffc12f02b9 [ldiskfs]
             #7 [ffff980ca7ff3448] ldiskfs_release_dquot at ffffffffc132839c [ldiskfs]
             #8 [ffff980ca7ff3468] dqput at ffffffffac6bd16d
             #9 [ffff980ca7ff3490] __dquot_drop at ffffffffac6be865
            #10 [ffff980ca7ff34c8] dquot_drop at ffffffffac6be8d5
            #11 [ffff980ca7ff34d8] ldiskfs_clear_inode at ffffffffc132cf02 [ldiskfs]
            #12 [ffff980ca7ff34f0] ldiskfs_evict_inode at ffffffffc131601f [ldiskfs]
            #13 [ffff980ca7ff3530] evict at ffffffffac66c194
            #14 [ffff980ca7ff3558] dispose_list at ffffffffac66c29e
            #15 [ffff980ca7ff3580] prune_icache_sb at ffffffffac66d38c
            #16 [ffff980ca7ff35e8] prune_super at ffffffffac65071b
            #17 [ffff980ca7ff3618] shrink_slab at ffffffffac5d18c5
            #18 [ffff980ca7ff36b8] zone_reclaim at ffffffffac5d46c9
            #19 [ffff980ca7ff3760] get_page_from_freelist at ffffffffac5c8788
            #20 [ffff980ca7ff3878] __alloc_pages_nodemask at ffffffffac5c8ae6
            #21 [ffff980ca7ff3920] alloc_pages_current at ffffffffac618a18
            #22 [ffff980ca7ff3968] __page_cache_alloc at ffffffffac5bdb87
            #23 [ffff980ca7ff39a0] find_or_create_page at ffffffffac5bed25
            #24 [ffff980ca7ff39e0] osd_bufs_get at ffffffffc1433523 [osd_ldiskfs]
            #25 [ffff980ca7ff3a40] ofd_preprw_write at ffffffffc1582346 [ofd]
            #26 [ffff980ca7ff3ab8] ofd_preprw at ffffffffc15831ff [ofd]
            #27 [ffff980ca7ff3b60] tgt_brw_write at ffffffffc0f7be89 [ptlrpc]
            #28 [ffff980ca7ff3cd0] tgt_request_handle at ffffffffc0f7df1a [ptlrpc]
            #29 [ffff980ca7ff3d58] ptlrpc_server_handle_request at ffffffffc0f22bfb [ptlrpc]
            #30 [ffff980ca7ff3df8] ptlrpc_main at ffffffffc0f26564 [ptlrpc]
            #31 [ffff980ca7ff3ec8] kthread at ffffffffac4c5c21
            #32 [ffff980ca7ff3f50] ret_from_fork_nospec_begin at ffffffffacb94ddd
            

             

            thread #2: ofd_attr_set

            PID: 233404  TASK: ffff97f3e2fee300  CPU: 13  COMMAND: "ll_ost01_009"
             #0 [ffff984ee284ba58] __schedule at ffffffffacb86d07
             #1 [ffff984ee284bae0] schedule at ffffffffacb87229
             #2 [ffff984ee284baf0] rwsem_down_write_failed at ffffffffacb88965
             #3 [ffff984ee284bb88] call_rwsem_down_write_failed at ffffffffac797767
             #4 [ffff984ee284bbd0] down_write at ffffffffacb8655d
             #5 [ffff984ee284bbe8] osd_write_lock at ffffffffc1409c9c [osd_ldiskfs]
             #6 [ffff984ee284bc10] ofd_attr_set at ffffffffc157c053 [ofd]
             #7 [ffff984ee284bc78] ofd_setattr_hdl at ffffffffc156b95d [ofd]
             #8 [ffff984ee284bcd0] tgt_request_handle at ffffffffc0f7df1a [ptlrpc]
             #9 [ffff984ee284bd58] ptlrpc_server_handle_request at ffffffffc0f22bfb [ptlrpc]
            #10 [ffff984ee284bdf8] ptlrpc_main at ffffffffc0f26564 [ptlrpc]
            #11 [ffff984ee284bec8] kthread at ffffffffac4c5c21
            #12 [ffff984ee284bf50] ret_from_fork_nospec_begin at ffffffffacb94ddd
            

             


            bzzz Alex Zhuravlev added a comment - mnishizawa I'm done with the local testing for the patch above. Would you be able to test it at scale?

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47029
            Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c5db35bca5b42c40b4a5675a5d8b8230018d5138


            People

              Assignee: askulysh Andriy Skulysh
              Reporter: askulysh Andriy Skulysh
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: