[LU-15117] ofd_read_lock vs transaction deadlock while allocating buffers

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0

    Description

      PID: 154236  TASK: ffff9ab9b2f330c0  CPU: 9   COMMAND: "ll_ost_io01_002"
       #0 [ffff9ab9b2f6af58] __schedule at ffffffff9876ab17
       #1 [ffff9ab9b2f6afe0] schedule at ffffffff9876b019
       #2 [ffff9ab9b2f6aff0] wait_transaction_locked at ffffffffc0760085 [jbd2]
       #3 [ffff9ab9b2f6b048] add_transaction_credits at ffffffffc0760368 [jbd2]
       #4 [ffff9ab9b2f6b0a8] start_this_handle at ffffffffc07605e1 [jbd2]
       #5 [ffff9ab9b2f6b140] jbd2__journal_start at ffffffffc0760a93 [jbd2]
       #6 [ffff9ab9b2f6b188] __ldiskfs_journal_start_sb at ffffffffc19c1c79 [ldiskfs]
       #7 [ffff9ab9b2f6b1c8] ldiskfs_release_dquot at ffffffffc19b92ec [ldiskfs]
       #8 [ffff9ab9b2f6b1e8] dqput at ffffffff982aeb5d
       #9 [ffff9ab9b2f6b210] __dquot_drop at ffffffff982b0215
      #10 [ffff9ab9b2f6b248] dquot_drop at ffffffff982b0285
      #11 [ffff9ab9b2f6b258] ldiskfs_clear_inode at ffffffffc19bdcf2 [ldiskfs]
      #12 [ffff9ab9b2f6b270] ldiskfs_evict_inode at ffffffffc19dccdf [ldiskfs]
      #13 [ffff9ab9b2f6b2b0] evict at ffffffff9825ee14
      #14 [ffff9ab9b2f6b2d8] dispose_list at ffffffff9825ef1e
      #15 [ffff9ab9b2f6b300] prune_icache_sb at ffffffff9825ff2c
      #16 [ffff9ab9b2f6b368] prune_super at ffffffff98244323
      #17 [ffff9ab9b2f6b3a0] shrink_slab at ffffffff981ca105
      #18 [ffff9ab9b2f6b440] do_try_to_free_pages at ffffffff981cd3c2
      #19 [ffff9ab9b2f6b4b8] try_to_free_pages at ffffffff981cd5dc
      #20 [ffff9ab9b2f6b550] __alloc_pages_slowpath at ffffffff987601ef
      #21 [ffff9ab9b2f6b640] __alloc_pages_nodemask at ffffffff981c1465
      #22 [ffff9ab9b2f6b6f0] alloc_pages_current at ffffffff9820e2c8
      #23 [ffff9ab9b2f6b738] new_slab at ffffffff982192d5
      #24 [ffff9ab9b2f6b770] ___slab_alloc at ffffffff9821ad4c
      #25 [ffff9ab9b2f6b840] __slab_alloc at ffffffff9876160c
      #26 [ffff9ab9b2f6b880] kmem_cache_alloc at ffffffff9821c3eb
      #27 [ffff9ab9b2f6b8c0] __radix_tree_preload at ffffffff9837b7b9
      #28 [ffff9ab9b2f6b8f0] radix_tree_maybe_preload at ffffffff9837bd0e
      #29 [ffff9ab9b2f6b900] __add_to_page_cache_locked at ffffffff981b734a
      #30 [ffff9ab9b2f6b940] add_to_page_cache_lru at ffffffff981b74b7
      #31 [ffff9ab9b2f6b970] find_or_create_page at ffffffff981b783e
      #32 [ffff9ab9b2f6b9b0] osd_bufs_get at ffffffffc1a773c3 [osd_ldiskfs]
      #33 [ffff9ab9b2f6ba10] ofd_preprw_write at ffffffffc144f156 [ofd]
      #34 [ffff9ab9b2f6ba90] ofd_preprw at ffffffffc14500ce [ofd]
      #35 [ffff9ab9b2f6bb28] tgt_brw_write at ffffffffc0ece6e9 [ptlrpc]
      #36 [ffff9ab9b2f6bca0] tgt_request_handle at ffffffffc0eccd4a [ptlrpc]
      #37 [ffff9ab9b2f6bd30] ptlrpc_server_handle_request at ffffffffc0e72586 [ptlrpc]
      #38 [ffff9ab9b2f6bde8] ptlrpc_main at ffffffffc0e7625a [ptlrpc]
      #39 [ffff9ab9b2f6bec8] kthread at ffffffff980c1f81
      #40 [ffff9ab9b2f6bf50] ret_from_fork_nospec_begin at ffffffff98777c1d
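
      Reading the trace from the bottom up: the write path (ofd_preprw_write -> osd_bufs_get) asks for a page-cache page, the allocation falls into direct reclaim (try_to_free_pages -> shrink_slab -> prune_icache_sb), and evicting an inode drops its dquot, which starts a jbd2 handle and blocks in wait_transaction_locked. Below is a minimal Python sketch for spotting that pattern in saved crash "bt" output. It assumes frames use the "#N [addr] symbol at addr" layout shown above; the helper names frames() and reclaim_reenters_journal() are illustrative and not part of crash or Lustre.

```python
import re

# One frame per match: "#N [stack_addr] symbol at text_addr"
FRAME_RE = re.compile(r"#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")


def frames(bt_text):
    """Return the symbol names in trace order (frame #0 first)."""
    return [m.group(2) for m in FRAME_RE.finditer(bt_text)]


def reclaim_reenters_journal(symbols):
    """True if a journal wait sits above direct reclaim in the same stack."""
    if "wait_transaction_locked" not in symbols:
        return False
    if "try_to_free_pages" not in symbols:
        return False
    # A lower index means a more recent call (closer to frame #0).
    return symbols.index("wait_transaction_locked") < symbols.index("try_to_free_pages")


# A few frames copied from the trace in this ticket.
SAMPLE = """
 #2 [ffff9ab9b2f6aff0] wait_transaction_locked at ffffffffc0760085 [jbd2]
 #7 [ffff9ab9b2f6b1c8] ldiskfs_release_dquot at ffffffffc19b92ec [ldiskfs]
#19 [ffff9ab9b2f6b4b8] try_to_free_pages at ffffffff981cd5dc
#32 [ffff9ab9b2f6b9b0] osd_bufs_get at ffffffffc1a773c3 [osd_ldiskfs]
"""

if __name__ == "__main__":
    print(reclaim_reenters_journal(frames(SAMPLE)))
```

      The check is deliberately narrow: it only says that reclaim re-entered the journal in one stack, not that the system is deadlocked.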
      

          Activity

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51362
            Subject: LU-15117 ofd: no lock for dt_bufs_get() in read path
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 560c007fcd53692dd1e90bd56e73e29aa28bff2b


            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51361
            Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: a567f5232073d2968970bb81b165fa1808c2bb83


            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49469
            Subject: LU-15117 ofd: no lock for dt_bufs_get() in read path
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 8ab8435c5bb28d9814d79bb31f46848754cfd6a8


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48209/
            Subject: LU-15117 ofd: no lock for dt_bufs_get() in read path
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 85941b9fb9ef5c27870550469f2e088c4e690603


            Stephane Thiell (sthiell) added a comment:

            Thanks Alex, that makes sense!

            Alex Zhuravlev (bzzz) added a comment:

            sthiell, I think it's LU-16044; unfortunately, LU-15117 increases the likelihood of the former. Would you mind trying https://review.whamcloud.com/#/c/48033/ along with https://review.whamcloud.com/47925 ?

            Stephane Thiell (sthiell) added a comment:

            Alex, any update on this new patch? We have been using https://review.whamcloud.com/47925 on top of 2.12.9, but last night the same kind of deadlock occurred on an OSS (attaching the output of "foreach bt" as fir-io7-s1_crash_foreach_bt_20220831.log in case you're interested). This is still our most annoying Lustre 2.12 issue... Thanks!

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48209
            Subject: LU-15117 ofd: no lock for dt_bufs_get() in read path
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3cd2507bf8211cc662056c48bbe5537fd0a3f5a1

            Etienne Aujames (eaujames) added a comment (edited):

            Hello,

            We hit this issue in production at the CEA on a Clusterstore 2.12 version after activating the OST cache.
            We have isolated a job that reproduces this issue:

            1. One OST thread is doing an fsync, forcing a commit of the current transaction.
            2. The jbd2 thread is waiting for another thread to finish updating its buffer (t_updates == 1).
            3. The thread doing the update (after ofd_trans_start) is waiting for the write lock on the OST object (ofd_write_lock) for a punch.
            4. The thread holding the read lock hangs in "ofd_preprw_write -> osd_bufs_get -> .. -> ldiskfs_release_dquot -> wait_transaction_locked" because of memory pressure and the ongoing transaction commit.
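
            The four steps above can be read as a wait-for graph. Below is a minimal Python sketch, assuming illustrative node labels (fsync_thread, jbd2_commit, punch_thread and brw_thread are names chosen here, not Lustre identifiers), that records each "waits for" relation from the list and searches for a cycle. The second graph is only an assumption about what the "don't take lock for dt_bufs_get()" patches change: the brw thread no longer holds the object lock while allocating buffers, so the punch thread has nothing to wait for.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes (first node repeated last), or None."""
    visiting = []   # nodes on the current DFS path, in order
    done = set()    # nodes fully explored without finding a cycle

    def dfs(node):
        if node in visiting:
            return visiting[visiting.index(node):] + [node]
        if node in done:
            return None
        visiting.append(node)
        for nxt in graph.get(node, []):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = dfs(start)
        if cycle:
            return cycle
    return None


# Edges mean "waits for"; one edge per step in the list above.
WAITS_FOR = {
    "fsync_thread": ["jbd2_commit"],   # step 1: fsync waits for the commit
    "jbd2_commit": ["punch_thread"],   # step 2: commit waits for t_updates to drop
    "punch_thread": ["brw_thread"],    # step 3: punch waits for ofd_write_lock
    "brw_thread": ["jbd2_commit"],     # step 4: lock holder waits in wait_transaction_locked
}

# Assumed effect of the fix: no lock is held across buffer allocation.
WAITS_FOR_FIXED = dict(WAITS_FOR, punch_thread=[])

if __name__ == "__main__":
    print(find_cycle(WAITS_FOR))
    print(find_cycle(WAITS_FOR_FIXED))
```

            In this model the fsync thread is a victim rather than a participant: it waits on the commit but is not part of the cycle.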

            So for now, we have disabled the writethrough_cache, but without "LU-12071 osd-ldiskfs: bypass pagecache if requested" (https://review.whamcloud.com/34422) this doesn't seem to be sufficient.

            I have read https://review.whamcloud.com/47925.
            Can someone tell me why this issue shouldn't happen with ofd_preprw_read() (ofd_read_lock is also held when dt_bufs_get is called)?


            "Stephane Thiell <sthiell@stanford.edu>" uploaded a new patch: https://review.whamcloud.com/47925
            Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 2fdf51055b76d47e24464cc93ebeabdc050dbd3a


            People

              askulysh Andriy Skulysh
              askulysh Andriy Skulysh
