[LU-17499] migrate vs lseek deadlock Created: 02/Feb/24  Updated: 04/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

ll_migrate() takes the inode lock at the very beginning, then needs to get the file's data version from the OST, which enqueues an extent lock.
In turn, ll_file_seek() grabs an extent lock first, then wants to lock the inode in vvp_io_lseek_start().
I'd think we should take the inode lock at the beginning of ll_file_seek(), but I'm not an expert in this area.
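For illustration only, a minimal userspace sketch of the lock-ordering inversion described above (ordinary pthread mutexes stand in for the inode rwsem and the LDLM extent lock; migrate_path()/lseek_path() are made-up names, not Lustre code): one path takes inode lock then extent lock, the other takes them in the opposite order, so the two threads can block each other forever.

/* Illustrative only: pthread mutexes stand in for the inode lock and
 * the OST extent lock; this is not the real Lustre locking code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t inode_lock  = PTHREAD_MUTEX_INITIALIZER; /* ~ inode rwsem */
static pthread_mutex_t extent_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ extent lock */

/* ll_migrate()-like path: inode lock first, then the extent lock
 * needed to fetch the data version from the OST. */
static void *migrate_path(void *arg)
{
	pthread_mutex_lock(&inode_lock);
	sleep(1);                        /* widen the race window */
	pthread_mutex_lock(&extent_lock);
	printf("migrate: got both locks\n");
	pthread_mutex_unlock(&extent_lock);
	pthread_mutex_unlock(&inode_lock);
	return NULL;
}

/* ll_file_seek()-like path: extent lock first, then the inode lock
 * taken in vvp_io_lseek_start() -- the opposite order. */
static void *lseek_path(void *arg)
{
	pthread_mutex_lock(&extent_lock);
	sleep(1);
	pthread_mutex_lock(&inode_lock);
	printf("lseek: got both locks\n");
	pthread_mutex_unlock(&inode_lock);
	pthread_mutex_unlock(&extent_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, migrate_path, NULL);
	pthread_create(&t2, NULL, lseek_path, NULL);
	pthread_join(t1, NULL);          /* never returns: ABBA deadlock */
	pthread_join(t2, NULL);
	return 0;
}

The fix suggested above would amount to making both paths take the inode lock before the extent lock, restoring a single lock order.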

PID: 468416   TASK: ffff99380683c280  CPU: 3    COMMAND: "lfs"
 #0 [ffff993802edb9a0] __schedule at ffffffff916f6ab6
    /tmp/kernel/kernel/sched/core.c: 3755
 #1 [ffff993802edb9f8] schedule at ffffffff916f7050
    /tmp/kernel/kernel/sched/core.c: 4602
 #2 [ffff993802edba10] schedule_timeout at ffffffff916fd3d0
    /tmp/kernel/kernel/time/timer.c: 1836
 #3 [ffff993802edbac8] wait_for_common at ffffffff916f7b0c
    /tmp/kernel/kernel/sched/completion.c: 86
 #4 [ffff993802edbb18] mdc_io_data_version_end at ffffffffc0b3e7ba [mdc]
    /home/lustre/master-test/lustre/mdc/mdc_dev.c: 1340
 #5 [ffff993802edbb30] cl_io_end at ffffffffc04897de [obdclass]
    /home/lustre/master-test/lustre/obdclass/cl_io.c: 529
 #6 [ffff993802edbb58] lov_io_end_wrapper at ffffffffc0ad43da [lov]
    /home/lustre/master-test/lustre/lov/lov_io.c: 1092
 #7 [ffff993802edbb70] lov_io_data_version_end at ffffffffc0ad4c32 [lov]
    /home/lustre/master-test/lustre/lov/lov_io.c: 1137
 #8 [ffff993802edbb90] cl_io_end at ffffffffc04897de [obdclass]
    /home/lustre/master-test/lustre/obdclass/cl_io.c: 529
 #9 [ffff993802edbbb8] cl_io_loop at ffffffffc048ded2 [obdclass]
    /home/lustre/master-test/lustre/obdclass/cl_io.c: 765
#10 [ffff993802edbbf0] ll_ioc_data_version at ffffffffc11c73ba [lustre]
    /home/lustre/master-test/lustre/llite/file.c: 3236
#11 [ffff993802edbc28] ll_migrate at ffffffffc11d6eff [lustre]
    /home/lustre/master-test/lustre/llite/file.c: 3269
#12 [ffff993802edbca8] ll_dir_ioctl at ffffffffc11bffb1 [lustre]
    /home/lustre/master-test/lustre/llite/dir.c: 2295
#13 [ffff993802edbe88] do_vfs_ioctl at ffffffff91273479
    /tmp/kernel/fs/ioctl.c: 48

PID: 468197   TASK: ffff99398a360680  CPU: 1    COMMAND: "lfs"
 #0 [ffff99392f44fd28] __schedule at ffffffff916f6ab6
    /tmp/kernel/kernel/sched/core.c: 3755
 #1 [ffff99392f44fd80] schedule at ffffffff916f7050
    /tmp/kernel/kernel/sched/core.c: 4602
 #2 [ffff99392f44fd98] rwsem_down_write_slowpath at ffffffff9112c55a
    /tmp/kernel/./arch/x86/include/asm/current.h: 15
 #3 [ffff99392f44fe28] vvp_io_lseek_start at ffffffffc12286fd [lustre]
    /home/lustre/linux-4.18.0-477.15.1.el8_8/./arch/x86/include/asm/current.h: 15
 #4 [ffff99392f44fe50] cl_io_start at ffffffffc04896d5 [obdclass]
    /home/lustre/master-test/lustre/obdclass/cl_io.c: 507
 #5 [ffff99392f44fe78] cl_io_loop at ffffffffc048dec5 [obdclass]
    /home/lustre/master-test/lustre/obdclass/cl_io.c: 764
 #6 [ffff99392f44feb0] ll_file_seek at ffffffffc11d618d [lustre]
    /home/lustre/master-test/lustre/llite/file.c: 4729
 #7 [ffff99392f44ff08] ksys_lseek at ffffffff9125d48d
    /tmp/kernel/fs/read_write.c: 300

PID: 7933     TASK: ffff993929c68980  CPU: 3    COMMAND: "mdt00_022"
 #0 [ffff993971c03b20] __schedule at ffffffff916f6ab6
    /tmp/kernel/kernel/sched/core.c: 3755
 #1 [ffff993971c03b78] schedule at ffffffff916f7050
    /tmp/kernel/kernel/sched/core.c: 4602
 #2 [ffff993971c03b90] ldlm_completion_ast at ffffffffc07f8e7c [ptlrpc]
    /home/lustre/master-test/lustre/ptlrpc/../../lustre/ldlm/ldlm_request.c: 332
 #3 [ffff993971c03c08] ldlm_cli_enqueue_local at ffffffffc07f672a [ptlrpc]
    /home/lustre/master-test/lustre/ptlrpc/../../lustre/ldlm/ldlm_request.c: 536
 #4 [ffff993971c03c78] tgt_mdt_data_lock at ffffffffc08441a8 [ptlrpc]
    /home/lustre/master-test/libcfs/include/libcfs/libcfs_debug.h: 120
 #5 [ffff993971c03d20] mdt_data_version_get at ffffffffc0fb7cc9 [mdt]
    /home/lustre/master-test/lustre/mdt/mdt_io.c: 1649
 #6 [ffff993971c03d70] mdt_getattr at ffffffffc0f6ab40 [mdt]
    /home/lustre/master-test/libcfs/include/libcfs/libcfs_debug.h: 120
 #7 [ffff993971c03da0] tgt_request_handle at ffffffffc08495cc [ptlrpc]
    /home/lustre/master-test/lustre/include/lu_target.h: 654
 #8 [ffff993971c03e18] ptlrpc_main at ffffffffc079da91 [ptlrpc]
    /home/lustre/master-test/lustre/include/lustre_net.h: 2421
 #9 [ffff993971c03f10] kthread at ffffffff9110383e
    /tmp/kernel/kernel/kthread.c: 354


 Comments   
Comment by Alex Zhuravlev [ 02/Feb/24 ]

bobijam what's your opinion?

Comment by Andreas Dilger [ 02/Feb/24 ]

Hmm, I think I've seen this in racer as well: hung threads on the clients while migrate is active on some of them. Please have a look through the recent racer test_1 timeouts to confirm; then we may have an easy reproducer, and fixing this may also reduce our test errors.

Comment by Alex Zhuravlev [ 04/Feb/24 ]

adilger these two reports have the same symptom as in the description:
https://testing.whamcloud.com/sub_tests/f08ac5b3-d083-4b9c-87e1-6f752b3c95ef
https://testing.whamcloud.com/sub_tests/735b8dc8-c37a-417f-a215-b5233d0f7606
The others have a different root cause.
