[LU-15535] deadlock on lli->lli_lsm_sem Created: 08/Feb/22  Updated: 21/Aug/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Andriy Skulysh Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-3308 large readdir chunk size slows unlink... Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
<struct rw_semaphore 0xffff91737c998370> counter: 102 owner: ffff9174359adf00
PID: 233888  TASK: ffff9174748a5f00  CPU: 23  COMMAND: "mv"
 #1 [ffffbba8a4b0f588] schedule at ffffffff9594a448
 #2 [ffffbba8a4b0f598] rwsem_down_write_slowpath at ffffffff9513c57a
 #3 [ffffbba8a4b0f640] ll_update_default_lsm_md at ffffffffc1339fe3 [lustre]
 #4 [ffffbba8a4b0f678] ll_update_lsm_md at ffffffffc1340439 [lustre]
 #5 [ffffbba8a4b0f6e0] ll_update_inode at ffffffffc1344766 [lustre]
 #6 [ffffbba8a4b0f718] ll_iget at ffffffffc1357a47 [lustre]
 #7 [ffffbba8a4b0f738] ll_prep_inode at ffffffffc1346d5a [lustre]
 #8 [ffffbba8a4b0f800] ll_lookup_it_finish.constprop.28 at ffffffffc1358611 [lustre]
 #9 [ffffbba8a4b0f8e8] ll_lookup_it at ffffffffc1359cb9 [lustre]
#10 [ffffbba8a4b0fb58] ll_lookup_nd at ffffffffc135ca8c [lustre]
#11 [ffffbba8a4b0fbe0] __lookup_slow at ffffffff95325667
PID: 242859  TASK: ffff9174359adf00  CPU: 23  COMMAND: "rm"
 #2 [ffffbba88ba1f7f8] schedule_timeout at ffffffff9594dad3
 #3 [ffffbba88ba1f890] ldlm_completion_ast at ffffffffc10409bc [ptlrpc]
 #4 [ffffbba88ba1f920] ldlm_cli_enqueue_fini at ffffffffc103fa9d [ptlrpc]
 #5 [ffffbba88ba1f990] ldlm_cli_enqueue at ffffffffc10433ed [ptlrpc]
 #6 [ffffbba88ba1fa38] mdc_enqueue_base at ffffffffc119d24d [mdc]
 #7 [ffffbba88ba1fb30] mdc_intent_lock at ffffffffc119f1b9 [mdc]
 #8 [ffffbba88ba1fbd8] mdc_read_page at ffffffffc118c42c [mdc]
 #9 [ffffbba88ba1fcb8] lmv_read_page at ffffffffc12e22d0 [lmv]
#10 [ffffbba88ba1fd00] ll_get_dir_page at ffffffffc1313c5f [lustre]
#11 [ffffbba88ba1fd48] ll_dir_read at ffffffffc1313f38 [lustre]
        ll_prep_md_op_data()
#12 [ffffbba88ba1fe00] ll_iterate at ffffffffc13144dc [lustre]


 Comments   
Comment by Oleg Drokin [ 08/Feb/22 ]

From racer, or how did it come to be?

Comment by Patrick Farrell [ 08/Feb/22 ]

What is the actual deadlock?  What's the cycle here?

Basically, how is 242859 waiting for 233888?  Or is it something simpler, like 242859 should not be holding the semaphore at this time?

Comment by Andriy Skulysh [ 09/Feb/22 ]

242859 takes lli_lsm_sem and sends a lock enqueue.

233888 processes the lock reply and tries to acquire lli_lsm_sem to update the directory striping.
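
To make the cycle concrete, here is a minimal userspace model of it (a pthread analogue with hypothetical names and simplified lock modes, not Lustre code; the condition variable stands in for the enqueue reply awaited in ldlm_completion_ast):

/* build: cc -pthread deadlock_model.c */
#include <pthread.h>
#include <unistd.h>

static pthread_rwlock_t lli_lsm_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  reply_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   reply_cv    = PTHREAD_COND_INITIALIZER;
static int reply_ready;

/* models PID 242859 ("rm"): holds lli_lsm_sem across the enqueue */
static void *rm_thread(void *arg)
{
    pthread_rwlock_rdlock(&lli_lsm_sem);   /* takes lli_lsm_sem */
    pthread_mutex_lock(&reply_lock);
    while (!reply_ready)                   /* waits for the lock reply */
        pthread_cond_wait(&reply_cv, &reply_lock);
    pthread_mutex_unlock(&reply_lock);
    pthread_rwlock_unlock(&lli_lsm_sem);
    return NULL;
}

/* models PID 233888 ("mv"): must update the striping under the
 * semaphore before reply processing can complete */
static void *mv_thread(void *arg)
{
    usleep(100000);                        /* let "rm" win, as in the trace */
    pthread_rwlock_wrlock(&lli_lsm_sem);   /* blocks: "rm" holds it */
    pthread_mutex_lock(&reply_lock);
    reply_ready = 1;                       /* never reached */
    pthread_cond_signal(&reply_cv);
    pthread_mutex_unlock(&reply_lock);
    pthread_rwlock_unlock(&lli_lsm_sem);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, rm_thread, NULL);
    pthread_create(&b, NULL, mv_thread, NULL);
    pthread_join(a, NULL);                 /* hangs: AB-BA cycle */
    pthread_join(b, NULL);
    return 0;
}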

Comment by Lai Siyao [ 10/Feb/22 ]

IMO mdc_read_page() doesn't need to enqueue a lock when reading a directory page, because the lock doesn't guarantee anything: once the read finishes and the lock is released, the real directory content may be changed on the server side at any time.

Comment by Lai Siyao [ 12/Feb/22 ]

mdc_read_page() could revalidate the lock first, if there is one already (quite likely, because readdir will getattr first), and then continue reading the dir page. After reading the page, it revalidates the lock again: if the lock is still valid and the lock handle is unchanged, the page read can be kept in the page cache; otherwise the page read is discarded.
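
A control-flow sketch of that scheme (hypothetical stub types and helpers standing in for the mdc/ldlm calls; not the actual patch):

#include <stdbool.h>
#include <stdlib.h>

struct lock_handle { unsigned long cookie; };
struct dir_page    { char data[4096]; };

/* trivial stubs standing in for ldlm/mdc calls */
static bool lock_match(struct lock_handle *lh)
{ lh->cookie = 1; return true; }
static struct dir_page *read_page_nolock(size_t off)
{ (void)off; return malloc(sizeof(struct dir_page)); }
static void page_cache_keep(struct dir_page *pg) { (void)pg; }
static void page_discard(struct dir_page *pg)    { free(pg); }

static struct dir_page *mdc_read_page_sketch(size_t offset)
{
    struct lock_handle before = { 0 }, after = { 0 };
    struct dir_page *pg;

    /* 1. Revalidate: is a lock already held on the directory?
     *    Quite likely, since readdir does a getattr first.      */
    if (!lock_match(&before))
        return NULL;              /* no lock: caller falls back  */

    /* 2. Read the page without enqueueing a new lock, so
     *    lli_lsm_sem is never held across an enqueue.           */
    pg = read_page_nolock(offset);

    /* 3. Revalidate again: same lock, same handle?              */
    if (lock_match(&after) && after.cookie == before.cookie) {
        page_cache_keep(pg);      /* lock held throughout: cacheable */
    } else {
        page_discard(pg);         /* lock changed: page may be stale */
        pg = NULL;                /* caller must retry */
    }
    return pg;
}

int main(void)
{
    struct dir_page *pg = mdc_read_page_sketch(0);
    free(pg);
    return 0;
}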

Comment by Etienne Aujames [ 16/Feb/22 ]

Hi,

What are the symptoms/consequences of this? Do client threads hang endlessly, or does this result in a client eviction?
The client lock should have a timeout set (LDLM_FL_NO_TIMEOUT not set), so ldlm_completion_ast should exit with a timeout error -> eviction.

Comment by Gerrit Updater [ 18/Feb/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46551
Subject: LU-15535 mdc: don't enqueue lock in read_page
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c16aae1c3e3e50d76addf32eecbd3d1b22bd9052

Comment by Andreas Dilger [ 12/Mar/22 ]

Please see also LU-3308 #comment-58611 and the following one, and patch https://review.whamcloud.com/7909 "LU-3240 llite: limit readdir buffer for small directories".

There is no requirement under POSIX that multiple readdir() calls be completely coherent with files being created/unlinked in the directory, so long as the view is consistent from the time of the directory open() call, or from when rewinddir() is called. So it would also be possible to read multiple pages into a temporary cache for the file descriptor, and discard those pages when the fd is closed or on rewinddir().
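
A rough sketch of such a per-fd snapshot cache (hypothetical structure and names, not existing llite code):

#include <stdlib.h>
#include <string.h>

/* pages read for one open directory stay private to that fd */
struct dir_fd_cache {
    void  **pages;
    size_t  nr_pages;
};

/* serve readdir from the snapshot; fetch from the MDS only on a miss,
 * so the view stays coherent from open()/rewinddir() as POSIX allows */
static void *dir_cache_get_page(struct dir_fd_cache *c, size_t idx,
                                void *(*fetch)(size_t))
{
    if (idx >= c->nr_pages) {
        c->pages = realloc(c->pages, (idx + 1) * sizeof(*c->pages));
        memset(c->pages + c->nr_pages, 0,
               (idx + 1 - c->nr_pages) * sizeof(*c->pages));
        c->nr_pages = idx + 1;
    }
    if (!c->pages[idx])
        c->pages[idx] = fetch(idx);
    return c->pages[idx];
}

/* on close() or rewinddir(): drop the snapshot */
static void dir_cache_drop(struct dir_fd_cache *c)
{
    for (size_t i = 0; i < c->nr_pages; i++)
        free(c->pages[i]);
    free(c->pages);
    c->pages = NULL;
    c->nr_pages = 0;
}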

Comment by Andriy Skulysh [ 14/Mar/22 ]

The same deadlock happens with getattr:

PID: 13996  TASK: ffff98f0791c0000  CPU: 1   COMMAND: "setfattr"
 #0 [ffffa7850d8fb5a8] __schedule at ffffffff91349fac
 #1 [ffffa7850d8fb638] schedule at ffffffff9134a448
 #2 [ffffa7850d8fb648] rwsem_down_write_slowpath at ffffffff90b3c57a
 #3 [ffffa7850d8fb6f0] ll_update_default_lsm_md at ffffffffc14100a3 [lustre]
 #4 [ffffa7850d8fb728] ll_update_lsm_md at ffffffffc1416369 [lustre]
 #5 [ffffa7850d8fb790] ll_update_inode at ffffffffc141a696 [lustre]
 #6 [ffffa7850d8fb7c8] ll_iget at ffffffffc142dc57 [lustre]
 #7 [ffffa7850d8fb7e8] ll_prep_inode at ffffffffc141d01a [lustre]
 #8 [ffffa7850d8fb8b0] ll_lookup_it_finish.constprop.28 at ffffffffc142e821 [lustre]
 #9 [ffffa7850d8fb998] ll_lookup_it at ffffffffc142fec9 [lustre]
#10 [ffffa7850d8fbc08] ll_lookup_nd at ffffffffc1432c9c [lustre]
#11 [ffffa7850d8fbc90] __lookup_slow at ffffffff90d25667
PID: 13876  TASK: ffff98f09f570000  CPU: 7   COMMAND: "mkdir"
 #0 [ffffa7850d94ba70] __schedule at ffffffff91349fac
 #1 [ffffa7850d94bb00] schedule at ffffffff9134a448
 #2 [ffffa7850d94bb10] schedule_timeout at ffffffff9134dad3
 #3 [ffffa7850d94bba8] ptlrpc_set_wait at ffffffffc1131a80 [ptlrpc]
 #4 [ffffa7850d94bc28] ptlrpc_queue_wait at ffffffffc1131c71 [ptlrpc]
 #5 [ffffa7850d94bc40] mdc_getattr_common at ffffffffc12619b0 [mdc]
 #6 [ffffa7850d94bc70] mdc_getattr at ffffffffc12621ce [mdc]
 #7 [ffffa7850d94bcc0] lmv_getattr at ffffffffc13b54fe [lmv]
 #8 [ffffa7850d94bcf8] ll_dir_get_default_layout at ffffffffc13e7d32 [lustre]
 #9 [ffffa7850d94bd68] ll_dir_getstripe at ffffffffc13eb181 [lustre]
#10 [ffffa7850d94bdb8] ll_new_node at ffffffffc1434257 [lustre]
#11 [ffffa7850d94be90] ll_mkdir at ffffffffc1434faf [lustre]
#12 [ffffa7850d94beb8] vfs_mkdir at ffffffff90d27aa2

Comment by Lai Siyao [ 14/Mar/22 ]

Why does mdc_getattr() get stuck? I don't see an ldlm lock involved.

Comment by Andriy Skulysh [ 14/Mar/22 ]

For example, pid 13996 sends a lock request, then 8 conflicting lock requests are sent; pid 13876 takes the mutex and sends a getattr request, but the request can't be sent because of the max_rpcs_in_flight limit.

Comment by Andriy Skulysh [ 15/Mar/22 ]

Another failure:

PID: 25082  TASK: ffff948ed3e79680  CPU: 0   COMMAND: "ls"
 #0 [ffffa6fa81d5f9f0] __schedule at ffffffff8eb47d74
 #1 [ffffa6fa81d5fa88] schedule at ffffffff8eb481e8
 #2 [ffffa6fa81d5fa98] rwsem_down_write_slowpath at ffffffff8e33c02a
 #3 [ffffa6fa81d5fb40] ll_update_default_lsm_md at ffffffffc0f9a0a3 [lustre]
 #4 [ffffa6fa81d5fb78] ll_update_lsm_md at ffffffffc0fa0369 [lustre]
 #5 [ffffa6fa81d5fbe0] ll_update_inode at ffffffffc0fa4696 [lustre]
 #6 [ffffa6fa81d5fc18] ll_prep_inode at ffffffffc0fa6dfd [lustre]
 #7 [ffffa6fa81d5fce0] ll_revalidate_it_finish at ffffffffc0f701c6 [lustre]
 #8 [ffffa6fa81d5fd30] ll_inode_revalidate at ffffffffc0f82382 [lustre]
 #9 [ffffa6fa81d5fdb0] ll_getattr_dentry at ffffffffc0f8f0be [lustre]
#10 [ffffa6fa81d5fe60] vfs_statx_fd at ffffffff8e51e2c4
PID: 25074  TASK: ffff948ed1824380  CPU: 0   COMMAND: "ls"
 #0 [ffffa6fa8183b850] __schedule at ffffffff8eb47d74
 #1 [ffffa6fa8183b8e8] schedule at ffffffff8eb481e8
 #2 [ffffa6fa8183b8f8] schedule_timeout at ffffffff8eb4b873
 #3 [ffffa6fa8183b990] ptlrpc_set_wait at ffffffffc0cd2a80 [ptlrpc]
 #4 [ffffa6fa8183ba10] ptlrpc_queue_wait at ffffffffc0cd2c71 [ptlrpc]
 #5 [ffffa6fa8183ba28] ldlm_cli_enqueue at ffffffffc0cba3d7 [ptlrpc]
 #6 [ffffa6fa8183bab0] mdc_enqueue_base at ffffffffc0f0c3fd [mdc]
 #7 [ffffa6fa8183bba8] mdc_intent_lock at ffffffffc0f0e369 [mdc]
 #8 [ffffa6fa8183bc50] lmv_intent_lookup at ffffffffc0e17702 [lmv]
 #9 [ffffa6fa8183bcb8] lmv_intent_lock at ffffffffc0e1814c [lmv]
#10 [ffffa6fa8183bd30] ll_inode_revalidate at ffffffffc0f8235d [lustre]
#11 [ffffa6fa8183bdb0] ll_getattr_dentry at ffffffffc0f8f0be [lustre]
#12 [ffffa6fa8183be60] vfs_statx_fd at ffffffff8e51e2c4

Pid 25074 sends a lock enqueue for a lock that does not conflict with pid 25082's, but the MDS may have a conflicting lock enqueue from another client.

Comment by Gerrit Updater [ 31/Mar/23 ]

"Vitaly Fertman <vitaly.fertman@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50488
Subject: LU-15535 revert: "LU-15284 llite: access lli_lsm_md with lock in all places"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 314754aa89ff07e5067a76395ea7f42e111410ef

Comment by Gerrit Updater [ 31/Mar/23 ]

"Vitaly Fertman <vitaly.fertman@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50489
Subject: LU-15535 llite: deadlock on lli_lsm_sem
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d9b3b9ab1c3d5452ccea9265ef9dccc5ba14de94

Comment by Gerrit Updater [ 19/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50488/
Subject: LU-15535 revert: "LU-15284 llite: access lli_lsm_md with lock in all places"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: be278f82efa736035c32ca61a3bfbfd0043d2ee3

Comment by Gerrit Updater [ 19/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50489/
Subject: LU-15535 llite: deadlock on lli_lsm_sem
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3ebc8e0528e34a11ffeff1e6be347de18b248069
