[LU-15117] ofd_read_lock vs transaction deadlock while allocating buffers Created: 16/Oct/21  Updated: 19/Jun/23  Resolved: 10/Nov/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Andriy Skulysh Assignee: Andriy Skulysh
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File fir-io7-s1_crash_foreach_bt_20220831.log     HTML File st_vmcore    
Issue Links:
Related
is related to LU-9728 out of memory on OSS causing allocati... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
PID: 154236  TASK: ffff9ab9b2f330c0  CPU: 9   COMMAND: "ll_ost_io01_002"
 #0 [ffff9ab9b2f6af58] __schedule at ffffffff9876ab17
 #1 [ffff9ab9b2f6afe0] schedule at ffffffff9876b019
 #2 [ffff9ab9b2f6aff0] wait_transaction_locked at ffffffffc0760085 [jbd2]
 #3 [ffff9ab9b2f6b048] add_transaction_credits at ffffffffc0760368 [jbd2]
 #4 [ffff9ab9b2f6b0a8] start_this_handle at ffffffffc07605e1 [jbd2]
 #5 [ffff9ab9b2f6b140] jbd2__journal_start at ffffffffc0760a93 [jbd2]
 #6 [ffff9ab9b2f6b188] __ldiskfs_journal_start_sb at ffffffffc19c1c79 [ldiskfs]
 #7 [ffff9ab9b2f6b1c8] ldiskfs_release_dquot at ffffffffc19b92ec [ldiskfs]
 #8 [ffff9ab9b2f6b1e8] dqput at ffffffff982aeb5d
 #9 [ffff9ab9b2f6b210] __dquot_drop at ffffffff982b0215
#10 [ffff9ab9b2f6b248] dquot_drop at ffffffff982b0285
#11 [ffff9ab9b2f6b258] ldiskfs_clear_inode at ffffffffc19bdcf2 [ldiskfs]
#12 [ffff9ab9b2f6b270] ldiskfs_evict_inode at ffffffffc19dccdf [ldiskfs]
#13 [ffff9ab9b2f6b2b0] evict at ffffffff9825ee14
#14 [ffff9ab9b2f6b2d8] dispose_list at ffffffff9825ef1e
#15 [ffff9ab9b2f6b300] prune_icache_sb at ffffffff9825ff2c
#16 [ffff9ab9b2f6b368] prune_super at ffffffff98244323
#17 [ffff9ab9b2f6b3a0] shrink_slab at ffffffff981ca105
#18 [ffff9ab9b2f6b440] do_try_to_free_pages at ffffffff981cd3c2
#19 [ffff9ab9b2f6b4b8] try_to_free_pages at ffffffff981cd5dc
#20 [ffff9ab9b2f6b550] __alloc_pages_slowpath at ffffffff987601ef
#21 [ffff9ab9b2f6b640] __alloc_pages_nodemask at ffffffff981c1465
#22 [ffff9ab9b2f6b6f0] alloc_pages_current at ffffffff9820e2c8
#23 [ffff9ab9b2f6b738] new_slab at ffffffff982192d5
#24 [ffff9ab9b2f6b770] ___slab_alloc at ffffffff9821ad4c
#25 [ffff9ab9b2f6b840] __slab_alloc at ffffffff9876160c
#26 [ffff9ab9b2f6b880] kmem_cache_alloc at ffffffff9821c3eb
#27 [ffff9ab9b2f6b8c0] __radix_tree_preload at ffffffff9837b7b9
#28 [ffff9ab9b2f6b8f0] radix_tree_maybe_preload at ffffffff9837bd0e
#29 [ffff9ab9b2f6b900] __add_to_page_cache_locked at ffffffff981b734a
#30 [ffff9ab9b2f6b940] add_to_page_cache_lru at ffffffff981b74b7
#31 [ffff9ab9b2f6b970] find_or_create_page at ffffffff981b783e
#32 [ffff9ab9b2f6b9b0] osd_bufs_get at ffffffffc1a773c3 [osd_ldiskfs]
#33 [ffff9ab9b2f6ba10] ofd_preprw_write at ffffffffc144f156 [ofd]
#34 [ffff9ab9b2f6ba90] ofd_preprw at ffffffffc14500ce [ofd]
#35 [ffff9ab9b2f6bb28] tgt_brw_write at ffffffffc0ece6e9 [ptlrpc]
#36 [ffff9ab9b2f6bca0] tgt_request_handle at ffffffffc0eccd4a [ptlrpc]
#37 [ffff9ab9b2f6bd30] ptlrpc_server_handle_request at ffffffffc0e72586 [ptlrpc]
#38 [ffff9ab9b2f6bde8] ptlrpc_main at ffffffffc0e7625a [ptlrpc]
#39 [ffff9ab9b2f6bec8] kthread at ffffffff980c1f81
#40 [ffff9ab9b2f6bf50] ret_from_fork_nospec_begin at ffffffff98777c1d


 Comments   
Comment by Gerrit Updater [ 16/Oct/21 ]

"Andriy Skulysh <andriy.skulysh@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45271
Subject: LU-15117 ofd: Don't allow __GFP_FS for page allocation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6fe2d3da18134ece1b849e25d99059be8424ae64
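
The patch subject points at masking __GFP_FS for the page allocation so that direct reclaim cannot recurse into the filesystem. As background, here is a minimal sketch of the generic kernel pattern for this, memalloc_nofs_save()/memalloc_nofs_restore() (available in kernels since 4.11). This is illustrative only, with a hypothetical helper name, and is not the contents of the patch:

#include <linux/pagemap.h>
#include <linux/sched/mm.h>

/* Illustrative only, not the actual LU-15117 change: mask __GFP_FS
 * around the page-cache allocation so direct reclaim cannot recurse
 * into the filesystem (e.g. inode eviction -> dqput ->
 * jbd2__journal_start) while OFD locks are held. */
static struct page *get_page_nofs(struct address_space *mapping, pgoff_t index)
{
        unsigned int nofs_flags;
        struct page *page;

        nofs_flags = memalloc_nofs_save();      /* strip __GFP_FS below */
        page = find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
        memalloc_nofs_restore(nofs_flags);

        return page;
}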

Comment by Andreas Dilger [ 16/Oct/21 ]

The stack trace shows the thread that is blocked on start_this_handle(), but not the thread that has the transaction open. There must be a better way to address this than disabling all memory pressure on the OSS.

Also, what version was this hit on? There may have been changes to the code since then that avoid this problem.

Comment by Alex Zhuravlev [ 18/Oct/21 ]

Yes, there must be something else, as the call trace in the description doesn't show who actually holds the transaction open. The buffers are allocated long before the transaction starts.

Comment by Andrew Perepechko [ 18/Oct/21 ]

IIRC, the deadlock happens because of reversed lock ordering.

In ofd_preprw_write, the order is ofd_read_lock() -> page alloc -> reclaim -> start_handle(), while in e.g. ofd_attr_set it is start_handle() -> ofd_write_lock().
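
To make the inversion concrete, here is a minimal userspace analogue using pthreads: the rwlock stands in for the OFD object rwsem, and the mutex stands in for the open jbd2 transaction. All names are illustrative, none of this is Lustre code, and running it deadlocks after about a second:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t oo_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t journal = PTHREAD_MUTEX_INITIALIZER;

/* ofd_preprw_write analogue: read lock first, then (via reclaim) the journal */
static void *write_path(void *arg)
{
        pthread_rwlock_rdlock(&oo_sem);         /* ofd_read_lock()       */
        sleep(1);                               /* page alloc -> reclaim */
        pthread_mutex_lock(&journal);           /* jbd2__journal_start() */
        pthread_mutex_unlock(&journal);
        pthread_rwlock_unlock(&oo_sem);
        return NULL;
}

/* ofd_attr_set analogue: transaction first, then the write lock */
static void *setattr_path(void *arg)
{
        pthread_mutex_lock(&journal);           /* ofd_trans_start()     */
        sleep(1);
        pthread_rwlock_wrlock(&oo_sem);         /* ofd_write_lock()      */
        pthread_rwlock_unlock(&oo_sem);
        pthread_mutex_unlock(&journal);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, write_path, NULL);
        pthread_create(&t2, NULL, setattr_path, NULL);
        pthread_join(t1, NULL);                 /* never returns: ABBA deadlock */
        pthread_join(t2, NULL);
        puts("not reached");
        return 0;
}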

Comment by Andriy Skulysh [ 18/Oct/21 ]

The second thread is:

Stack :
__schedule
schedule
rwsem_down_write_failed
call_rwsem_down_write_failed
down_write
osd_write_lock
ofd_attr_set
ofd_setattr_hdl
tgt_request_handle
ptlrpc_server_handle_request
ptlrpc_main
kthread
ret_from_fork_nospec_begin
Progs:  153628 "ll_ost01_170"

Comment by Alex Zhuravlev [ 22/Oct/21 ]

askulysh do you still have the full list of traces? Could you please attach it to the ticket?

Comment by Andriy Skulysh [ 17/Jan/22 ]

st_vmcore

Comment by Mitsuhiro Nishizawa [ 08/Apr/22 ]

Hi bzzz
One of our customers was affected by this issue and is waiting for the fix. Do you have any progress on this?

Comment by Alex Zhuravlev [ 08/Apr/22 ]

> One of our customers was affected by this issue and is waiting for the fix. Do you have any progress on this?
The patch is in local testing at the moment...

Comment by Gerrit Updater [ 11/Apr/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47029
Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c5db35bca5b42c40b4a5675a5d8b8230018d5138

Comment by Alex Zhuravlev [ 15/Apr/22 ]

mnishizawa I'm done with the local testing of the patch above. Would you be able to test it at scale?

Comment by Stephane Thiell [ 15/Apr/22 ]

I believe we hit this same issue yesterday with 2.12.8 on Oak at Stanford; the OSS was deadlocked. Alex, do you think the patch can be backported to b2_12?

thread #1: zone_reclaim -> start_this_handle

PID: 64334  TASK: ffff98040eb26300  CPU: 10  COMMAND: "ll_ost_io00_105"
 #0 [ffff980ca7ff31d8] __schedule at ffffffffacb86d07
 #1 [ffff980ca7ff3260] schedule at ffffffffacb87229
 #2 [ffff980ca7ff3270] wait_transaction_locked at ffffffffc026a085 [jbd2]
 #3 [ffff980ca7ff32c8] add_transaction_credits at ffffffffc026a378 [jbd2]
 #4 [ffff980ca7ff3328] start_this_handle at ffffffffc026a601 [jbd2]
 #5 [ffff980ca7ff33c0] jbd2__journal_start at ffffffffc026aab3 [jbd2]
 #6 [ffff980ca7ff3408] __ldiskfs_journal_start_sb at ffffffffc12f02b9 [ldiskfs]
 #7 [ffff980ca7ff3448] ldiskfs_release_dquot at ffffffffc132839c [ldiskfs]
 #8 [ffff980ca7ff3468] dqput at ffffffffac6bd16d
 #9 [ffff980ca7ff3490] __dquot_drop at ffffffffac6be865
#10 [ffff980ca7ff34c8] dquot_drop at ffffffffac6be8d5
#11 [ffff980ca7ff34d8] ldiskfs_clear_inode at ffffffffc132cf02 [ldiskfs]
#12 [ffff980ca7ff34f0] ldiskfs_evict_inode at ffffffffc131601f [ldiskfs]
#13 [ffff980ca7ff3530] evict at ffffffffac66c194
#14 [ffff980ca7ff3558] dispose_list at ffffffffac66c29e
#15 [ffff980ca7ff3580] prune_icache_sb at ffffffffac66d38c
#16 [ffff980ca7ff35e8] prune_super at ffffffffac65071b
#17 [ffff980ca7ff3618] shrink_slab at ffffffffac5d18c5
#18 [ffff980ca7ff36b8] zone_reclaim at ffffffffac5d46c9
#19 [ffff980ca7ff3760] get_page_from_freelist at ffffffffac5c8788
#20 [ffff980ca7ff3878] __alloc_pages_nodemask at ffffffffac5c8ae6
#21 [ffff980ca7ff3920] alloc_pages_current at ffffffffac618a18
#22 [ffff980ca7ff3968] __page_cache_alloc at ffffffffac5bdb87
#23 [ffff980ca7ff39a0] find_or_create_page at ffffffffac5bed25
#24 [ffff980ca7ff39e0] osd_bufs_get at ffffffffc1433523 [osd_ldiskfs]
#25 [ffff980ca7ff3a40] ofd_preprw_write at ffffffffc1582346 [ofd]
#26 [ffff980ca7ff3ab8] ofd_preprw at ffffffffc15831ff [ofd]
#27 [ffff980ca7ff3b60] tgt_brw_write at ffffffffc0f7be89 [ptlrpc]
#28 [ffff980ca7ff3cd0] tgt_request_handle at ffffffffc0f7df1a [ptlrpc]
#29 [ffff980ca7ff3d58] ptlrpc_server_handle_request at ffffffffc0f22bfb [ptlrpc]
#30 [ffff980ca7ff3df8] ptlrpc_main at ffffffffc0f26564 [ptlrpc]
#31 [ffff980ca7ff3ec8] kthread at ffffffffac4c5c21
#32 [ffff980ca7ff3f50] ret_from_fork_nospec_begin at ffffffffacb94ddd

thread #2: ofd_attr_set

PID: 233404  TASK: ffff97f3e2fee300  CPU: 13  COMMAND: "ll_ost01_009"
 #0 [ffff984ee284ba58] __schedule at ffffffffacb86d07
 #1 [ffff984ee284bae0] schedule at ffffffffacb87229
 #2 [ffff984ee284baf0] rwsem_down_write_failed at ffffffffacb88965
 #3 [ffff984ee284bb88] call_rwsem_down_write_failed at ffffffffac797767
 #4 [ffff984ee284bbd0] down_write at ffffffffacb8655d
 #5 [ffff984ee284bbe8] osd_write_lock at ffffffffc1409c9c [osd_ldiskfs]
 #6 [ffff984ee284bc10] ofd_attr_set at ffffffffc157c053 [ofd]
 #7 [ffff984ee284bc78] ofd_setattr_hdl at ffffffffc156b95d [ofd]
 #8 [ffff984ee284bcd0] tgt_request_handle at ffffffffc0f7df1a [ptlrpc]
 #9 [ffff984ee284bd58] ptlrpc_server_handle_request at ffffffffc0f22bfb [ptlrpc]
#10 [ffff984ee284bdf8] ptlrpc_main at ffffffffc0f26564 [ptlrpc]
#11 [ffff984ee284bec8] kthread at ffffffffac4c5c21
#12 [ffff984ee284bf50] ret_from_fork_nospec_begin at ffffffffacb94ddd

Comment by Alex Zhuravlev [ 15/Apr/22 ]

sthiell you're correct.

Comment by Stephane Thiell [ 03/Jun/22 ]

Hello,

We hit this problem on another 2.12 filesystem last night. One OSS had a load of 400+; it was still up and answering quota requests, but not processing I/Os, so many jobs were hung. I see that Alex's patch is still in review for master. Just wanted to raise awareness that this is currently our most impactful Lustre issue. Thanks for your help with this!

Comment by Gerrit Updater [ 27/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47029/
Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7c4a7c59ed9c6185da326d6df6223f4818b57769

Comment by Gerrit Updater [ 08/Jul/22 ]

"Stephane Thiell <sthiell@stanford.edu>" uploaded a new patch: https://review.whamcloud.com/47925
Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 2fdf51055b76d47e24464cc93ebeabdc050dbd3a

Comment by Etienne Aujames [ 10/Aug/22 ]

Hello,

We hit this issue in production on a ClusterStor 2.12 system at the CEA after activating the OST cache.
We have isolated a job reproducing this issue:

  1. One OST thread is doing an fsync, forcing the current transaction to commit.
  2. The jbd2 thread is waiting for another thread to finish updating its buffer (t_updates == 1).
  3. The thread doing that update (after ofd_trans_start) is waiting for the write lock on the OST object (ofd_write_lock) for a punch.
  4. The thread holding the read lock hangs in "ofd_preprw_write -> osd_bufs_get -> .. -> ldiskfs_release_dquot -> wait_transaction_locked" because of memory pressure and the pending transaction commit (see the sketch after this list).
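
For reference, a minimal sketch of the direction the master fix ("ofd: don't take lock for dt_bufs_get()") takes: do the buffer allocation before taking the object lock, so any reclaim it triggers can no longer join the cycle above. The function name, body, and argument list here are simplified and hypothetical, not the actual patch:

static int preprw_write_sketch(const struct lu_env *env, struct ofd_object *fo,
                               struct niobuf_remote *rnb,
                               struct niobuf_local *lnb, int maxlnb)
{
        int rc;

        /* May allocate pages and enter direct reclaim; safe here
         * because no OFD object lock is held yet. */
        rc = dt_bufs_get(env, ofd_object_child(fo), rnb, lnb, maxlnb,
                         DT_BUFS_TYPE_WRITE);
        if (rc < 0)
                return rc;

        /* Serialize against punch/setattr only after the pages exist. */
        ofd_read_lock(env, fo);

        /* ... map the buffers, start the transaction, do the I/O ... */

        ofd_read_unlock(env, fo);
        return rc;
}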

So for now, we have disabled the writethrough_cache, but without the "LU-12071 osd-ldiskfs: bypass pagecache if requested" patch (https://review.whamcloud.com/34422) this doesn't seem to be sufficient.

I have read https://review.whamcloud.com/47925.
Can someone tell me why this issue shouldn't happen with ofd_preprw_read()? (ofd_read_lock is also taken there when dt_bufs_get is called.)

Comment by Gerrit Updater [ 12/Aug/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48209
Subject: LU-15117 ofd: no lock for dt_bufs_get() in read path
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3cd2507bf8211cc662056c48bbe5537fd0a3f5a1

Comment by Stephane Thiell [ 31/Aug/22 ]

Alex, any update on this new patch? We have been using https://review.whamcloud.com/47925 on top of 2.12.9, but last night the same kind of deadlock occurred on an OSS (attaching the output of "foreach bt" as fir-io7-s1_crash_foreach_bt_20220831.log in case you're interested). This is still our most annoying Lustre 2.12 issue... Thanks!

Comment by Alex Zhuravlev [ 01/Sep/22 ]

sthiell I think it's LU-16044; unfortunately, the LU-15117 patch increases the likelihood of hitting it. Would you mind trying https://review.whamcloud.com/#/c/48033/ along with https://review.whamcloud.com/47925?

Comment by Stephane Thiell [ 01/Sep/22 ]

Thanks Alex, that makes sense!

Comment by Gerrit Updater [ 08/Nov/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48209/
Subject: LU-15117 ofd: no lock for dt_bufs_get() in read path
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 85941b9fb9ef5c27870550469f2e088c4e690603

Comment by Gerrit Updater [ 21/Dec/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49469
Subject: LU-15117 ofd: no lock for dt_bufs_get() in read path
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 8ab8435c5bb28d9814d79bb31f46848754cfd6a8

Comment by Gerrit Updater [ 19/Jun/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51361
Subject: LU-15117 ofd: don't take lock for dt_bufs_get()
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: a567f5232073d2968970bb81b165fa1808c2bb83

Comment by Gerrit Updater [ 19/Jun/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51362
Subject: LU-15117 ofd: no lock for dt_bufs_get() in read path
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 560c007fcd53692dd1e90bd56e73e29aa28bff2b

Generated at Sat Feb 10 03:15:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.