[LU-16958] migrate vs regular ops deadlock Created: 12/Jul/23  Updated: 20/Nov/23  Resolved: 18/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Alex Zhuravlev Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16637 ll_truncate_inode_pages_final() ASSER... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
PID: 350193  TASK: ffff9bd65af446c0  CPU: 0   COMMAND: "getfattr"
 #0 [ffff9bd63ffb7950] __schedule at ffffffffba5a232d
    /tmp/kernel/kernel/sched/core.c: 3109
 #1 [ffff9bd63ffb79d8] schedule at ffffffffba5a2748
    /tmp/kernel/./arch/x86/include/asm/preempt.h: 84
 #2 [ffff9bd63ffb79e8] rwsem_down_write_slowpath at ffffffffba0f41a7
    /tmp/kernel/./arch/x86/include/asm/current.h: 15
 #3 [ffff9bd63ffb7a88] down_write at ffffffffba5a691a
    /tmp/kernel/./include/linux/err.h: 36
 #4 [ffff9bd63ffb7ac0] vvp_inode_ops at ffffffffc116d57f [lustre]
    /home/lustre/linux-4.18.0-305.25.1.el8_4/./arch/x86/include/asm/current.h: 15
 #5 [ffff9bd63ffb7ae0] cl_object_inode_ops at ffffffffc0454a50 [obdclass]
    /home/lustre/master-mine/lustre/obdclass/cl_object.c: 442
 #6 [ffff9bd63ffb7b18] lov_conf_set at ffffffffc0aa36c4 [lov]
    /home/lustre/master-mine/lustre/lov/lov_object.c: 1465
 #7 [ffff9bd63ffb7b88] cl_conf_set at ffffffffc04542d8 [obdclass]
    /home/lustre/master-mine/lustre/obdclass/cl_object.c: 299
 #8 [ffff9bd63ffb7bb8] ll_layout_conf at ffffffffc111d110 [lustre]
    /home/lustre/master-mine/lustre/llite/file.c: 5995
 #9 [ffff9bd63ffb7c28] ll_layout_refresh at ffffffffc111dad3 [lustre]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
#10 [ffff9bd63ffb7cf0] vvp_io_init at ffffffffc116d019 [lustre]
    /home/lustre/master-mine/lustre/llite/vvp_io.c: 1870
#11 [ffff9bd63ffb7d20] __cl_io_init at ffffffffc045e66f [obdclass]
    /home/lustre/master-mine/lustre/obdclass/cl_io.c: 134
#12 [ffff9bd63ffb7d58] cl_glimpse_size0 at ffffffffc11642ca [lustre]
    /home/lustre/master-mine/lustre/llite/glimpse.c: 204
#13 [ffff9bd63ffb7da0] ll_getattr_dentry at ffffffffc111c65d [lustre]
    /home/lustre/master-mine/lustre/llite/llite_internal.h: 1677
#14 [ffff9bd63ffb7e50] vfs_statx at ffffffffba1d4be9
    /tmp/kernel/fs/stat.c: 204

Checking the stack of the process above, the inode was found at 0xffff9bd60367d350:

crash> p *(struct ll_inode_info *)(0xffff9bd60367d350-0x150)
  lli_inode_magic = 287116773,
...
  lli_inode_lock_owner = 0xffff9bd68f51d380

now check task 0xffff9bd68f51d380:

crash> p *(struct task_struct *)0xffff9bd68f51d380|more
...
  pid = 348428,
...
PID: 348428  TASK: ffff9bd68f51d380  CPU: 1   COMMAND: "lfs"
 #0 [ffff9bd613c37968] __schedule at ffffffffba5a232d
    /tmp/kernel/kernel/sched/core.c: 3109
 #1 [ffff9bd613c379f0] schedule at ffffffffba5a2748
    /tmp/kernel/./arch/x86/include/asm/preempt.h: 84
 #2 [ffff9bd613c37a00] schedule_preempt_disabled at ffffffffba5a2a6c
    /tmp/kernel/./arch/x86/include/asm/preempt.h: 79
 #3 [ffff9bd613c37a08] __mutex_lock at ffffffffba5a3a40
    /tmp/kernel/kernel/locking/mutex.c: 1038
 #4 [ffff9bd613c37ac8] ll_layout_refresh at ffffffffc111d577 [lustre]
    /home/lustre/master-mine/lustre/llite/llite_internal.h: 1536
 #5 [ffff9bd613c37b88] vvp_io_init at ffffffffc116d019 [lustre]
    /home/lustre/master-mine/lustre/llite/vvp_io.c: 1870
 #6 [ffff9bd613c37bb8] __cl_io_init at ffffffffc045e66f [obdclass]
    /home/lustre/master-mine/lustre/obdclass/cl_io.c: 134
 #7 [ffff9bd613c37bf0] ll_ioc_data_version at ffffffffc110c665 [lustre]
    /home/lustre/master-mine/lustre/llite/file.c: 3193
 #8 [ffff9bd613c37c28] ll_migrate at ffffffffc111b244 [lustre]
    /home/lustre/master-mine/lustre/llite/file.c: 3227
 #9 [ffff9bd613c37ca8] ll_dir_ioctl at ffffffffc1105563 [lustre]
    /home/lustre/master-mine/lustre/llite/dir.c: 2277
#10 [ffff9bd613c37e88] do_vfs_ioctl at ffffffffba1e3199
    /tmp/kernel/fs/ioctl.c: 48

It seems this is a locking-order issue:
ll_migrate() takes the inode lock and then lli_layout_mutex (in ll_layout_refresh()), while other operations (like getfattr) take them in the reverse order.
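
For illustration only (this is plain pthreads, not Lustre code, and the function names are hypothetical): a minimal sketch of the same ABBA pattern, where lock_a stands in for the inode lock and lock_b for lli_layout_mutex. Run repeatedly, the two threads can end up blocked on each other exactly like ll_migrate() vs. getfattr above; the fix is to take both locks in a single agreed order on every path.

/* ABBA lock-order sketch, illustration only -- may hang by design. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* ~ inode lock */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* ~ lli_layout_mutex */

static void *migrate_path(void *arg)   /* like ll_migrate(): A then B */
{
	pthread_mutex_lock(&lock_a);
	pthread_mutex_lock(&lock_b);   /* blocks if getfattr_path holds B */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

static void *getfattr_path(void *arg)  /* like getfattr/glimpse: B then A */
{
	pthread_mutex_lock(&lock_b);
	pthread_mutex_lock(&lock_a);   /* blocks if migrate_path holds A */
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, migrate_path, NULL);
	pthread_create(&t2, NULL, getfattr_path, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("no deadlock this run; a fixed lock order removes the cycle\n");
	return 0;
}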



 Comments   
Comment by Gerrit Updater [ 12/Jul/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51641
Subject: LU-16958 llite: migrate vs regular ops deadlock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8de4a374979a37d09a057cbcdfd9914775cfc59b

Comment by Gerrit Updater [ 01/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51641/
Subject: LU-16958 llite: migrate vs regular ops deadlock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8f2c1592c3bbd0351ab3984a88a3eed7075690c8

Comment by Peter Jones [ 01/Aug/23 ]

Landed for 2.16

Comment by Zhenyu Xu [ 21/Sep/23 ]

Another deadlock found:

T1:
vvp_io_init()
  ->ll_layout_refresh() <= take lli_layout_mutex
  ->ll_layout_intent()
  ->ll_take_md_lock()  <= take the CR layout lock ref
  ->ll_layout_conf()
    ->vvp_prune()
    ->vvp_inode_ops() <= releases lli_layout_mutex
    ->vvp_inode_ops() <= tries to re-acquire lli_layout_mutex
    -> racer waits here
T2:
->ll_file_write_iter()
  ->vvp_io_init()
    ->ll_layout_refresh() <= take lli_layout_mutex
    ->ll_layout_intent() <= Request layout from MDT
    -> racer waits ...

T3: occurs in PCC-RO attach, but it can also happen in the normal case without PCC-RO.
->pcc_readonly_attach()
  ->ll_layout_intent_write()
  ->ll_intent_lock()
     -> on the MDT, it will try to obtain the EX layout lock to change the layout,
        but client T1 holds the CR layout lock and T2's lock request is in the lock
        waiting list waiting for T3 to finish, thus causing a deadlock...
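
A minimal sketch of the circular wait above, for illustration only (plain pthreads, not Lustre code; all names are hypothetical): cr_ref stands in for the CR layout lock reference T1 holds, layout_mutex for lli_layout_mutex, and mdt_queue for the MDT lock wait queue where T2's request sits behind T3's EX request. Each thread holds one resource and waits for the next, closing the T1 -> T2 -> T3 -> T1 cycle.

/* Three-way circular wait sketch, illustration only -- may hang by design. */
#include <pthread.h>

static pthread_mutex_t cr_ref       = PTHREAD_MUTEX_INITIALIZER; /* ~ CR layout lock ref */
static pthread_mutex_t layout_mutex = PTHREAD_MUTEX_INITIALIZER; /* ~ lli_layout_mutex */
static pthread_mutex_t mdt_queue    = PTHREAD_MUTEX_INITIALIZER; /* ~ MDT lock wait queue */

static void *t1(void *arg) /* holds CR ref, re-takes lli_layout_mutex */
{
	pthread_mutex_lock(&cr_ref);
	pthread_mutex_lock(&layout_mutex);   /* waits on T2 */
	pthread_mutex_unlock(&layout_mutex);
	pthread_mutex_unlock(&cr_ref);
	return NULL;
}

static void *t2(void *arg) /* holds lli_layout_mutex, waits for MDT grant */
{
	pthread_mutex_lock(&layout_mutex);
	pthread_mutex_lock(&mdt_queue);      /* queued behind T3 */
	pthread_mutex_unlock(&mdt_queue);
	pthread_mutex_unlock(&layout_mutex);
	return NULL;
}

static void *t3(void *arg) /* EX layout request, needs T1's CR ref dropped */
{
	pthread_mutex_lock(&mdt_queue);
	pthread_mutex_lock(&cr_ref);         /* waits on T1 */
	pthread_mutex_unlock(&cr_ref);
	pthread_mutex_unlock(&mdt_queue);
	return NULL;
}

int main(void)
{
	pthread_t a, b, c;

	pthread_create(&a, NULL, t1, NULL);
	pthread_create(&b, NULL, t2, NULL);
	pthread_create(&c, NULL, t3, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_join(c, NULL);
	return 0;
}
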
Comment by Zhenyu Xu [ 28/Sep/23 ]

I thought the deadlock was due to this patch, but I reverted the essential part of this patch in https://review.whamcloud.com/52388 and the racer still hangs at the server; it looks more like LU-15491.

Comment by Qian Yingjin [ 11/Oct/23 ]

Found another deadlock for parallel DIO:

T1: writer
Obtain DLM extent lock: L1=PW[0, EOF]

T2: DIO reader: 50M data, iosize=64M, max_pages_per_rpc=1024 (4M) max_rpcs_in_flight=8
ll_direct_IO_impl()
uses all available RPC slots: the number of read RPCs in flight is 9
on the server side:
->tgt_brw_read()
->tgt_brw_lock() # server side locking
-> Try to cancel the conflict locks on client: L1=PW[0, EOF]

T3: reader
take DLM lock ref on L1=PW[0, EOF]
Read-ahead pages (prepare pages);
wait for RPC slots to send the read RPCs to OST

Deadlock: T2->T3: T2 is waiting for T3 to release DLM extent lock L1;
          T3->T2: T3 is waiting for T2 to finish and free the RPC slots...

A possible solution: when all RPC slots are found to be in use by srvlock DIO and there is urgent I/O, force sending the I/O RPC to the OST?
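
For illustration only (plain pthreads, not Lustre code; names are hypothetical): the cycle above modelled with a counting semaphore for the RPC slots and a mutex for extent lock L1. The DIO reader consumes every slot and then needs L1 cancelled, while the cached reader holds L1 and waits for a slot, which is the T2 <-> T3 wait described above.

/* RPC-slot vs. DLM-lock cycle sketch, illustration only -- may hang by design. */
#include <pthread.h>
#include <semaphore.h>

#define RPC_SLOTS 8                     /* ~ max_rpcs_in_flight */

static sem_t rpc_slots;
static pthread_mutex_t extent_lock_l1 = PTHREAD_MUTEX_INITIALIZER; /* ~ L1=PW[0, EOF] */

static void *dio_reader(void *arg)      /* T2: srvlock DIO */
{
	int i;

	for (i = 0; i < RPC_SLOTS; i++)  /* consumes every RPC slot ... */
		sem_wait(&rpc_slots);
	pthread_mutex_lock(&extent_lock_l1);   /* ... then needs L1 cancelled: waits on T3 */
	pthread_mutex_unlock(&extent_lock_l1);
	for (i = 0; i < RPC_SLOTS; i++)
		sem_post(&rpc_slots);
	return NULL;
}

static void *cached_reader(void *arg)   /* T3: read-ahead under L1 */
{
	pthread_mutex_lock(&extent_lock_l1);   /* holds a reference on L1 */
	sem_wait(&rpc_slots);                  /* waits on T2 for a free slot */
	sem_post(&rpc_slots);
	pthread_mutex_unlock(&extent_lock_l1);
	return NULL;
}

int main(void)
{
	pthread_t t2, t3;

	sem_init(&rpc_slots, 0, RPC_SLOTS);
	pthread_create(&t2, NULL, dio_reader, NULL);
	pthread_create(&t3, NULL, cached_reader, NULL);
	pthread_join(t2, NULL);
	pthread_join(t3, NULL);
	return 0;
}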

Comment by Andreas Dilger [ 27/Oct/23 ]

Another patch was pushed under this ticket.

Comment by Andreas Dilger [ 27/Oct/23 ]

I thought the deadlock was due to this patch, but I reverted the essential part of this patch in https://review.whamcloud.com/52388 and the racer still hangs at the server; it looks more like LU-15491.

There could definitely be multiple different issues affecting racer testing, so that doesn't mean the above patch is not fixing a problem.

Comment by Gerrit Updater [ 18/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52388/
Subject: LU-16958 llite: migrate deadlock on not responding lock cancel
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 37646c74bf884c535149d530af840d728814792b

Comment by Peter Jones [ 18/Nov/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:31:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.