
LU-15117: ofd_read_lock vs transaction deadlock while allocating buffers

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0
    • Severity: 3

    Description

      PID: 154236  TASK: ffff9ab9b2f330c0  CPU: 9   COMMAND: "ll_ost_io01_002"
       #0 [ffff9ab9b2f6af58] __schedule at ffffffff9876ab17
       #1 [ffff9ab9b2f6afe0] schedule at ffffffff9876b019
       #2 [ffff9ab9b2f6aff0] wait_transaction_locked at ffffffffc0760085 [jbd2]
       #3 [ffff9ab9b2f6b048] add_transaction_credits at ffffffffc0760368 [jbd2]
       #4 [ffff9ab9b2f6b0a8] start_this_handle at ffffffffc07605e1 [jbd2]
       #5 [ffff9ab9b2f6b140] jbd2__journal_start at ffffffffc0760a93 [jbd2]
       #6 [ffff9ab9b2f6b188] __ldiskfs_journal_start_sb at ffffffffc19c1c79 [ldiskfs]
       #7 [ffff9ab9b2f6b1c8] ldiskfs_release_dquot at ffffffffc19b92ec [ldiskfs]
       #8 [ffff9ab9b2f6b1e8] dqput at ffffffff982aeb5d
       #9 [ffff9ab9b2f6b210] __dquot_drop at ffffffff982b0215
      #10 [ffff9ab9b2f6b248] dquot_drop at ffffffff982b0285
      #11 [ffff9ab9b2f6b258] ldiskfs_clear_inode at ffffffffc19bdcf2 [ldiskfs]
      #12 [ffff9ab9b2f6b270] ldiskfs_evict_inode at ffffffffc19dccdf [ldiskfs]
      #13 [ffff9ab9b2f6b2b0] evict at ffffffff9825ee14
      #14 [ffff9ab9b2f6b2d8] dispose_list at ffffffff9825ef1e
      #15 [ffff9ab9b2f6b300] prune_icache_sb at ffffffff9825ff2c
      #16 [ffff9ab9b2f6b368] prune_super at ffffffff98244323
      #17 [ffff9ab9b2f6b3a0] shrink_slab at ffffffff981ca105
      #18 [ffff9ab9b2f6b440] do_try_to_free_pages at ffffffff981cd3c2
      #19 [ffff9ab9b2f6b4b8] try_to_free_pages at ffffffff981cd5dc
      #20 [ffff9ab9b2f6b550] __alloc_pages_slowpath at ffffffff987601ef
      #21 [ffff9ab9b2f6b640] __alloc_pages_nodemask at ffffffff981c1465
      #22 [ffff9ab9b2f6b6f0] alloc_pages_current at ffffffff9820e2c8
      #23 [ffff9ab9b2f6b738] new_slab at ffffffff982192d5
      #24 [ffff9ab9b2f6b770] ___slab_alloc at ffffffff9821ad4c
      #25 [ffff9ab9b2f6b840] __slab_alloc at ffffffff9876160c
      #26 [ffff9ab9b2f6b880] kmem_cache_alloc at ffffffff9821c3eb
      #27 [ffff9ab9b2f6b8c0] __radix_tree_preload at ffffffff9837b7b9
      #28 [ffff9ab9b2f6b8f0] radix_tree_maybe_preload at ffffffff9837bd0e
      #29 [ffff9ab9b2f6b900] __add_to_page_cache_locked at ffffffff981b734a
      #30 [ffff9ab9b2f6b940] add_to_page_cache_lru at ffffffff981b74b7
      #31 [ffff9ab9b2f6b970] find_or_create_page at ffffffff981b783e
      #32 [ffff9ab9b2f6b9b0] osd_bufs_get at ffffffffc1a773c3 [osd_ldiskfs]
      #33 [ffff9ab9b2f6ba10] ofd_preprw_write at ffffffffc144f156 [ofd]
      #34 [ffff9ab9b2f6ba90] ofd_preprw at ffffffffc14500ce [ofd]
      #35 [ffff9ab9b2f6bb28] tgt_brw_write at ffffffffc0ece6e9 [ptlrpc]
      #36 [ffff9ab9b2f6bca0] tgt_request_handle at ffffffffc0eccd4a [ptlrpc]
      #37 [ffff9ab9b2f6bd30] ptlrpc_server_handle_request at ffffffffc0e72586 [ptlrpc]
      #38 [ffff9ab9b2f6bde8] ptlrpc_main at ffffffffc0e7625a [ptlrpc]
      #39 [ffff9ab9b2f6bec8] kthread at ffffffff980c1f81
      #40 [ffff9ab9b2f6bf50] ret_from_fork_nospec_begin at ffffffff98777c1d
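
      For context: in the trace above, find_or_create_page() allocates page-cache pages with a GFP mask that still allows filesystem reclaim, so under memory pressure the allocation recurses into inode eviction and from there into jbd2 (ldiskfs_release_dquot -> jbd2__journal_start) while the ofd object lock is held. Below is a minimal sketch of the generic kernel-side guard against that recursion, assuming a kernel that provides memalloc_nofs_save()/memalloc_nofs_restore(); the helper name is hypothetical, and this is not what the LU-15117 patches do (they instead move the allocation out from under the lock).

      #include <linux/pagemap.h>
      #include <linux/sched/mm.h>   /* memalloc_nofs_save()/memalloc_nofs_restore() */

      /* Hypothetical helper, for illustration only: allocate a page-cache page
       * while a lock is held, without letting the allocator recurse into
       * filesystem reclaim (and from there into jbd2) on this thread. */
      static struct page *find_or_create_page_nofs(struct address_space *mapping,
                                                   pgoff_t index)
      {
              unsigned int nofs = memalloc_nofs_save(); /* strip __GFP_FS from this task's allocations */
              struct page *page;

              page = find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
              memalloc_nofs_restore(nofs);
              return page;
      }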
      

          Activity


            bzzz Alex Zhuravlev added a comment - sthiell I think it's LU-16044; unfortunately, LU-15117 increases the likelihood of the former one. Would you mind trying https://review.whamcloud.com/#/c/48033/ along with https://review.whamcloud.com/47925 ?

            sthiell Stephane Thiell added a comment - Alex, any update on this new patch? We have been using https://review.whamcloud.com/47925 on top of 2.12.9, but last night, the same kind of deadlock occurred on an OSS (attaching the output of "foreach bt" as fir-io7-s1_crash_foreach_bt_20220831.log in case you're interested). This is still our most annoying Lustre 2.12 issue... Thanks!

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48209
            Subject: LU-15117 ofd: no lock for dt_bufs_get() in read path
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3cd2507bf8211cc662056c48bbe5537fd0a3f5a1

            eaujames Etienne Aujames added a comment - edited

            Hello,

            We hit this issue in production at the CEA, on a Clusterstore system running a 2.12 version, after activating the OST cache.
            We have isolated a job that reproduces the issue:

            1. One OST thread is doing an fsync, which forces the current transaction to commit.
            2. The jbd2 thread is waiting for another thread to finish updating its buffer (t_updates == 1).
            3. The thread doing the update (after ofd_trans_start) is waiting for the write lock on the OST object (ofd_write_lock) for a punch.
            4. The thread holding a read lock hangs in "ofd_preprw_write -> osd_bufs_get -> .. -> ldiskfs_release_dquot -> wait_transaction_locked" because of memory pressure and the ongoing transaction commit (see the sketch after this list).
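
            The cycle may be easier to see as a userspace model (purely illustrative C using pthreads; nothing below is Lustre or jbd2 code, and t_updates/committing only mirror the jbd2 concepts). Running it hangs by design, with the three threads waiting on each other exactly as in points 1-4 above:

            #include <pthread.h>
            #include <stdio.h>
            #include <unistd.h>

            static pthread_rwlock_t object_lock = PTHREAD_RWLOCK_INITIALIZER; /* ofd_{read,write}_lock */
            static pthread_mutex_t journal = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t journal_cv = PTHREAD_COND_INITIALIZER;
            static int t_updates;   /* handles joined to the running transaction */
            static int committing;  /* fsync asked jbd2 to commit the transaction */

            static void journal_start(void)              /* models jbd2__journal_start() */
            {
                    pthread_mutex_lock(&journal);
                    while (committing)                   /* models wait_transaction_locked() */
                            pthread_cond_wait(&journal_cv, &journal);
                    t_updates++;
                    pthread_mutex_unlock(&journal);
            }

            static void *punch_thread(void *arg)         /* point 3: ll_ost punch/setattr */
            {
                    (void)arg;
                    journal_start();                     /* ofd_trans_start(): t_updates == 1 */
                    sleep(1);                            /* sleeps only sequence the demo */
                    pthread_rwlock_wrlock(&object_lock); /* ofd_write_lock(): blocks behind io_thread */
                    pthread_rwlock_unlock(&object_lock);
                    return NULL;
            }

            static void *commit_thread(void *arg)        /* points 1-2: fsync + jbd2 commit */
            {
                    (void)arg;
                    sleep(1);
                    pthread_mutex_lock(&journal);
                    committing = 1;                      /* new handles must now wait */
                    while (t_updates > 0)                /* waits forever for punch_thread's handle */
                            pthread_cond_wait(&journal_cv, &journal);
                    pthread_mutex_unlock(&journal);
                    return NULL;
            }

            static void *io_thread(void *arg)            /* point 4: ll_ost_io in ofd_preprw_write */
            {
                    (void)arg;
                    pthread_rwlock_rdlock(&object_lock); /* ofd_read_lock() */
                    sleep(2);                            /* osd_bufs_get() hits memory pressure... */
                    journal_start();                     /* ...reclaim -> ldiskfs_release_dquot(): blocks */
                    pthread_rwlock_unlock(&object_lock);
                    return NULL;
            }

            int main(void)
            {
                    pthread_t t[3];
                    int i;

                    pthread_create(&t[0], NULL, punch_thread, NULL);
                    pthread_create(&t[1], NULL, io_thread, NULL);
                    pthread_create(&t[2], NULL, commit_thread, NULL);
                    puts("threads started; the process now hangs in the three-way cycle");
                    for (i = 0; i < 3; i++)
                            pthread_join(t[i], NULL);
                    return 0;
            }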

            So for now we have disabled the writethrough_cache, but without "LU-12071 osd-ldiskfs: bypass pagecache if requested" (https://review.whamcloud.com/34422) this doesn't seem to be sufficient.

            I have read https://review.whamcloud.com/47925.
            Can someone tell me why this issue shouldn't happen with ofd_preprw_read()? The ofd_read_lock is also taken there when dt_bufs_get is called.


            "Stephane Thiell <sthiell@stanford.edu>" uploaded a new patch: https://review.whamcloud.com/47925
            Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 2fdf51055b76d47e24464cc93ebeabdc050dbd3a


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47029/
            Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7c4a7c59ed9c6185da326d6df6223f4818b57769
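
            For readers following the patch: the idea behind "don't take lock for dt_bufs_get()" is to stop holding the ofd object lock across the buffer/page allocation, so that a thread stalled in memory reclaim can no longer block a transaction holder waiting for the same lock. A generic userspace illustration of that reordering follows (hypothetical helper names; this is not the actual ofd_preprw_write() code):

            #include <pthread.h>
            #include <stdlib.h>

            static pthread_rwlock_t obj_lock = PTHREAD_RWLOCK_INITIALIZER;

            /* Problematic ordering: the allocation (which may block, or in the kernel
             * case recurse into reclaim) happens while the object lock is held. */
            void *prep_buffers_locked(size_t n)
            {
                    void *buf;

                    pthread_rwlock_rdlock(&obj_lock);
                    buf = malloc(n);           /* the lock is pinned for the whole allocation */
                    pthread_rwlock_unlock(&obj_lock);
                    return buf;
            }

            /* Fixed ordering: allocate first, take the lock only for the part of the
             * work that actually needs the object to be stable. */
            void *prep_buffers_unlocked(size_t n)
            {
                    void *buf = malloc(n);     /* the allocation no longer pins the lock */

                    pthread_rwlock_rdlock(&obj_lock);
                    /* ... work that requires the object to be stable ... */
                    pthread_rwlock_unlock(&obj_lock);
                    return buf;
            }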


            sthiell Stephane Thiell added a comment -

            Hello,

            We hit this problem on another 2.12 filesystem last night. One OSS had a load of 400+; it was still up and answering quota requests, but it was not processing I/Os, so many jobs were hung. I see that Alex's patch is still in review for master. Just wanted to raise awareness that this issue is currently our most impactful Lustre issue. Thanks for your help with this!

            bzzz Alex Zhuravlev added a comment - sthiell you're correct.

            sthiell Stephane Thiell added a comment - I believe we hit this same issue yesterday with 2.12.8 on Oak at Stanford. The OSS was deadlocked. Alex, do you think it can be backported to b2_12?

            thread #1: zone_reclaim -> start_this_handle

            PID: 64334  TASK: ffff98040eb26300  CPU: 10  COMMAND: "ll_ost_io00_105"
             #0 [ffff980ca7ff31d8] __schedule at ffffffffacb86d07
             #1 [ffff980ca7ff3260] schedule at ffffffffacb87229
             #2 [ffff980ca7ff3270] wait_transaction_locked at ffffffffc026a085 [jbd2]
             #3 [ffff980ca7ff32c8] add_transaction_credits at ffffffffc026a378 [jbd2]
             #4 [ffff980ca7ff3328] start_this_handle at ffffffffc026a601 [jbd2]
             #5 [ffff980ca7ff33c0] jbd2__journal_start at ffffffffc026aab3 [jbd2]
             #6 [ffff980ca7ff3408] __ldiskfs_journal_start_sb at ffffffffc12f02b9 [ldiskfs]
             #7 [ffff980ca7ff3448] ldiskfs_release_dquot at ffffffffc132839c [ldiskfs]
             #8 [ffff980ca7ff3468] dqput at ffffffffac6bd16d
             #9 [ffff980ca7ff3490] __dquot_drop at ffffffffac6be865
            #10 [ffff980ca7ff34c8] dquot_drop at ffffffffac6be8d5
            #11 [ffff980ca7ff34d8] ldiskfs_clear_inode at ffffffffc132cf02 [ldiskfs]
            #12 [ffff980ca7ff34f0] ldiskfs_evict_inode at ffffffffc131601f [ldiskfs]
            #13 [ffff980ca7ff3530] evict at ffffffffac66c194
            #14 [ffff980ca7ff3558] dispose_list at ffffffffac66c29e
            #15 [ffff980ca7ff3580] prune_icache_sb at ffffffffac66d38c
            #16 [ffff980ca7ff35e8] prune_super at ffffffffac65071b
            #17 [ffff980ca7ff3618] shrink_slab at ffffffffac5d18c5
            #18 [ffff980ca7ff36b8] zone_reclaim at ffffffffac5d46c9
            #19 [ffff980ca7ff3760] get_page_from_freelist at ffffffffac5c8788
            #20 [ffff980ca7ff3878] __alloc_pages_nodemask at ffffffffac5c8ae6
            #21 [ffff980ca7ff3920] alloc_pages_current at ffffffffac618a18
            #22 [ffff980ca7ff3968] __page_cache_alloc at ffffffffac5bdb87
            #23 [ffff980ca7ff39a0] find_or_create_page at ffffffffac5bed25
            #24 [ffff980ca7ff39e0] osd_bufs_get at ffffffffc1433523 [osd_ldiskfs]
            #25 [ffff980ca7ff3a40] ofd_preprw_write at ffffffffc1582346 [ofd]
            #26 [ffff980ca7ff3ab8] ofd_preprw at ffffffffc15831ff [ofd]
            #27 [ffff980ca7ff3b60] tgt_brw_write at ffffffffc0f7be89 [ptlrpc]
            #28 [ffff980ca7ff3cd0] tgt_request_handle at ffffffffc0f7df1a [ptlrpc]
            #29 [ffff980ca7ff3d58] ptlrpc_server_handle_request at ffffffffc0f22bfb [ptlrpc]
            #30 [ffff980ca7ff3df8] ptlrpc_main at ffffffffc0f26564 [ptlrpc]
            #31 [ffff980ca7ff3ec8] kthread at ffffffffac4c5c21
            #32 [ffff980ca7ff3f50] ret_from_fork_nospec_begin at ffffffffacb94ddd
            

             

            thread #2: ofd_attr_set

            PID: 233404  TASK: ffff97f3e2fee300  CPU: 13  COMMAND: "ll_ost01_009"
             #0 [ffff984ee284ba58] __schedule at ffffffffacb86d07
             #1 [ffff984ee284bae0] schedule at ffffffffacb87229
             #2 [ffff984ee284baf0] rwsem_down_write_failed at ffffffffacb88965
             #3 [ffff984ee284bb88] call_rwsem_down_write_failed at ffffffffac797767
             #4 [ffff984ee284bbd0] down_write at ffffffffacb8655d
             #5 [ffff984ee284bbe8] osd_write_lock at ffffffffc1409c9c [osd_ldiskfs]
             #6 [ffff984ee284bc10] ofd_attr_set at ffffffffc157c053 [ofd]
             #7 [ffff984ee284bc78] ofd_setattr_hdl at ffffffffc156b95d [ofd]
             #8 [ffff984ee284bcd0] tgt_request_handle at ffffffffc0f7df1a [ptlrpc]
             #9 [ffff984ee284bd58] ptlrpc_server_handle_request at ffffffffc0f22bfb [ptlrpc]
            #10 [ffff984ee284bdf8] ptlrpc_main at ffffffffc0f26564 [ptlrpc]
            #11 [ffff984ee284bec8] kthread at ffffffffac4c5c21
            #12 [ffff984ee284bf50] ret_from_fork_nospec_begin at ffffffffacb94ddd
            

             


            bzzz Alex Zhuravlev added a comment - mnishizawa I'm done with the local testing for the patch above. Would you be able to test it at scale?

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47029
            Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c5db35bca5b42c40b4a5675a5d8b8230018d5138


            People

              Assignee: askulysh Andriy Skulysh
              Reporter: askulysh Andriy Skulysh
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: