[LU-15779] do not hold object's lock over read bulk Created: 24/Apr/22  Updated: 21/Dec/22  Resolved: 16/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.2

Type: Improvement Priority: Minor
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None


Description

Holding an object's lock over a read bulk is risky: a stuck bulk can block OUT requests (e.g. out_tx_xattr_set_exec taking the object's lock exclusively), after which all shared locks on the object are blocked and, finally, all transactions are blocked:

Call Trace:
[<0>] call_rwsem_down_write_failed+0x17/0x30
[<0>] osd_write_lock+0x5c/0xe0 [osd_ldiskfs]
[<0>] out_tx_xattr_set_exec+0xdb/0x840 [ptlrpc]
[<0>] out_tx_end+0xe1/0x5c0 [ptlrpc]
[<0>] out_handle+0x1452/0x1bc0 [ptlrpc]
[<0>] tgt_request_handle+0xaee/0x15f0 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]

Call Trace:
[<0>] wait_transaction_locked+0x85/0xd0 [jbd2]
[<0>] add_transaction_credits+0x278/0x310 [jbd2]
[<0>] start_this_handle+0x1a1/0x430 [jbd2]
[<0>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[<0>] osd_trans_start+0x20e/0x4e0 [osd_ldiskfs]
[<0>] ofd_commitrw_write+0x11dc/0x1da0 [ofd]
[<0>] ofd_commitrw+0x53f/0xf70 [ofd]
[<0>] tgt_brw_write+0xffb/0x1dc0 [ptlrpc]
[<0>] tgt_request_handle+0xaee/0x15f0 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
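
The blocking chain described above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: toy_rwsem and every function below are hypothetical names, not Lustre or kernel code, and the model assumes that a queued writer keeps new readers out, which is how the description above characterizes the problem. Instead of sleeping, a failed lock attempt is simply counted as a blocked handler.

```c
#include <stdbool.h>

/* Hypothetical toy model of a reader/writer lock in which a waiting
 * writer keeps new readers out. It only tracks state; nothing sleeps. */
struct toy_rwsem {
	int  readers;          /* active shared holders */
	bool writer;           /* active exclusive holder */
	int  waiting_writers;  /* writers queued behind active readers */
};

/* Shared lock attempt: fails if a writer holds or waits for the lock. */
static bool toy_try_read(struct toy_rwsem *s)
{
	if (s->writer || s->waiting_writers > 0)
		return false;
	s->readers++;
	return true;
}

static void toy_read_unlock(struct toy_rwsem *s)
{
	s->readers--;
}

/* Exclusive lock attempt: on failure the caller is queued as a waiter. */
static bool toy_try_write(struct toy_rwsem *s)
{
	if (s->writer || s->readers > 0) {
		s->waiting_writers++;
		return false;
	}
	s->writer = true;
	return true;
}

static void toy_write_unlock(struct toy_rwsem *s)
{
	s->writer = false;
}

/* Count the handlers that would block while a read bulk is stuck. */
int blocked_handlers(bool hold_lock_over_bulk)
{
	struct toy_rwsem s = { 0 };
	int blocked = 0;

	toy_try_read(&s);             /* read handler prepares its buffers */
	if (!hold_lock_over_bulk)
		toy_read_unlock(&s);  /* release before the bulk starts */

	/* ... the bulk transfer is stuck at this point ... */

	if (toy_try_write(&s))        /* update needing the exclusive lock */
		toy_write_unlock(&s);
	else
		blocked++;

	for (int i = 0; i < 3; i++) { /* later shared-lock users */
		if (toy_try_read(&s))
			toy_read_unlock(&s);
		else
			blocked++;
	}
	return blocked;
}
```

In this model blocked_handlers(true) returns 4 (the writer plus the three later readers), while blocked_handlers(false) returns 0, because the lock is already free when the bulk gets stuck.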


Comments
Comment by Gerrit Updater [ 24/Apr/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47126
Subject: LU-15779 ofd: don't hold read lock over bulk
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2faff9f82a263ae1b9ed3ab0ebe893aea5772ba1

Comment by Alex Zhuravlev [ 25/Apr/22 ]

I constructed a test which basically: 1) grabs buffers using dt_bufs_get(), 2) declares a 0-copy write using these buffers, 3) removes the object in a separate thread, 4) tries to commit the buffers to the filesystem. It passed on both ldiskfs and ZFS.
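
The ordering exercised by this test can be sketched with a toy model. Everything here is hypothetical: toy_object, toy_bufs and the toy_* functions are illustration names and do not use Lustre's dt API. The removal happens inline rather than in a separate thread, and returning -ENOENT from the commit step is a choice made for this model only, not a statement about what ldiskfs or ZFS return. The sketch only shows that no object lock is held between grabbing the buffers and committing them, so the commit step has to cope with the object having disappeared.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define TOY_OBJ_SIZE 16

/* Hypothetical stand-in for a storage object; not a Lustre structure. */
struct toy_object {
	bool exists;
	char data[TOY_OBJ_SIZE];
};

/* Hypothetical stand-in for buffers obtained for a 0-copy write. */
struct toy_bufs {
	struct toy_object *obj;
	char page[TOY_OBJ_SIZE];
	bool declared;
};

/* Step 1: grab buffers; no object lock is kept after this returns. */
static int toy_bufs_get(struct toy_object *obj, struct toy_bufs *b)
{
	if (!obj->exists)
		return -ENOENT;
	b->obj = obj;
	b->declared = false;
	memset(b->page, 0, sizeof(b->page));
	return 0;
}

/* Step 2: declare the write that will use these buffers. */
static void toy_declare_write(struct toy_bufs *b)
{
	b->declared = true;
}

/* Step 3: remove the object (done by another thread in the real test). */
static void toy_destroy(struct toy_object *obj)
{
	obj->exists = false;
}

/* Step 4: commit; the object state is re-checked, not assumed. */
static int toy_commit(struct toy_bufs *b)
{
	if (!b->declared)
		return -EINVAL;
	if (!b->obj->exists)
		return -ENOENT;   /* model's choice for a vanished object */
	memcpy(b->obj->data, b->page, sizeof(b->page));
	return 0;
}

/* Run the four steps, optionally removing the object in between. */
int toy_run(bool remove_in_between)
{
	struct toy_object obj = { .exists = true };
	struct toy_bufs b;
	int rc = toy_bufs_get(&obj, &b);

	if (rc)
		return rc;
	toy_declare_write(&b);
	if (remove_in_between)
		toy_destroy(&obj);
	return toy_commit(&b);
}
```

In this model toy_run(false) returns 0 and toy_run(true) returns -ENOENT without touching the removed object's data.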

Comment by Gerrit Updater [ 28/Jun/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47825
Subject: LU-15779 ofd: don't hold read lock over bulk
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: a229943d51b7c876ce7108a2d7fab9b34e85d0ff

Comment by Gerrit Updater [ 11/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47126/
Subject: LU-15779 ofd: don't hold read lock over bulk
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 98ba50819024b908453b62fd095647442929a61f

Comment by Peter Jones [ 16/Jul/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 20/Aug/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47825/
Subject: LU-15779 ofd: don't hold read lock over bulk
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 28875487ab3c94015fdd1c6b32c3ee63bdf81965

Comment by Gerrit Updater [ 21/Dec/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49468
Subject: LU-15779 ofd: don't hold read lock over bulk
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: ad08375a6a5dccec2c7b70770b35695543ff6aae
