[LU-5344] ldlm/ifind deadlock for striped directory
Created: 14/Jul/14  Updated: 19/Sep/15  Resolved: 19/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Major
Reporter: John Hammond
Assignee: Di Wang
Resolution: Fixed
Votes: 0
Labels: dne2

Issue Links:
Related: is related to LU-6831 "The ticket for tracking all DNE2 bugs" (Reopened)
Severity: 3
Rank (Obsolete): 14905

Description

To reproduce:

export MDSCOUNT=4
export MOUNT_2=y
llmount.sh

cd /mnt/lustre
while true; do lfs mkdir -c4 d0; touch d0/f{0..3}; done &

cd /mnt/lustre2
while true; do rm -rf d0; done

After about 10 iterations of the rm loop, the following tasks are stuck:

7185 touch
[<ffffffffa068376a>] ptlrpc_set_wait+0x2ea/0x830 [ptlrpc]
[<ffffffffa0683d37>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
[<ffffffffa065f13e>] ldlm_cli_enqueue+0x36e/0x860 [ptlrpc]
[<ffffffffa09105ae>] mdc_enqueue+0x2be/0x1ab0 [mdc]
[<ffffffffa0911f82>] mdc_intent_lock+0x1e2/0x52f [mdc]
[<ffffffffa08cbd2b>] lmv_intent_open+0x31b/0x9f0 [lmv]
[<ffffffffa08cc6e0>] lmv_intent_lock+0x2e0/0x1180 [lmv]
[<ffffffffa0e81faa>] ll_lookup_it+0x25a/0xad0 [lustre]
[<ffffffffa0e828ac>] ll_lookup_nd+0x8c/0x4a0 [lustre]
[<ffffffff811b0442>] __lookup_hash+0x102/0x160
[<ffffffff811b0b7a>] lookup_hash+0x3a/0x50
[<ffffffff811b5250>] do_filp_open+0x2e0/0xd30
[<ffffffff8119f809>] do_sys_open+0x69/0x140
[<ffffffff8119f920>] sys_open+0x20/0x30
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

4792 mdt01_004
[<ffffffffa06643c9>] ldlm_completion_ast+0x4c9/0x930 [ptlrpc]
[<ffffffffa0663b23>] ldlm_cli_enqueue_local+0x1f3/0x5d0 [ptlrpc]
[<ffffffffa0c9e264>] mdt_object_local_lock+0x394/0xa60 [mdt]
[<ffffffffa0c9e995>] mdt_object_lock_internal+0x65/0x360 [mdt]
[<ffffffffa0c9ed54>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa0c9ef11>] mdt_object_find_lock+0x61/0x170 [mdt]
[<ffffffffa0cc8926>] mdt_reint_open+0x5c6/0x20b0 [mdt]
[<ffffffffa0cb07a1>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0c9baf3>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
[<ffffffffa0c9bfe6>] mdt_intent_reint+0x1f6/0x520 [mdt]
[<ffffffffa0c9a6c9>] mdt_intent_policy+0x499/0xca0 [mdt]
[<ffffffffa0645422>] ldlm_lock_enqueue+0x302/0x920 [ptlrpc]
[<ffffffffa066d651>] ldlm_handle_enqueue0+0x341/0x11e0 [ptlrpc]
[<ffffffffa06ec9a2>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
[<ffffffffa06ebc35>] tgt_request_handle+0x245/0xad0 [ptlrpc]
[<ffffffffa069ed91>] ptlrpc_main+0xcf1/0x1880 [ptlrpc]
[<ffffffff8109eab6>] kthread+0x96/0xa0
[<ffffffff8100c30a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

7186 rm
[<ffffffffa068376a>] ptlrpc_set_wait+0x2ea/0x830 [ptlrpc]
[<ffffffffa0683d37>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
[<ffffffffa065f13e>] ldlm_cli_enqueue+0x36e/0x860 [ptlrpc]
[<ffffffffa09105ae>] mdc_enqueue+0x2be/0x1ab0 [mdc]
[<ffffffffa0911f82>] mdc_intent_lock+0x1e2/0x52f [mdc]
[<ffffffffa08cae7e>] lmv_revalidate_slaves+0x49e/0x1030 [lmv]
[<ffffffffa08b25ba>] lmv_update_lsm_md+0x1a/0x20 [lmv]
[<ffffffffa0e63ac0>] ll_update_inode+0x1370/0x1e90 [lustre]
[<ffffffffa0e64668>] ll_read_inode2+0x88/0x480 [lustre]
[<ffffffffa0e7e62b>] ll_iget+0x13b/0x3c0 [lustre]
[<ffffffffa0e71740>] ll_prep_inode+0x6c0/0xe80 [lustre]
[<ffffffffa0e80e91>] ll_lookup_it_finish+0x2f1/0x11b0 [lustre]
[<ffffffffa0e82007>] ll_lookup_it+0x2b7/0xad0 [lustre]
[<ffffffffa0e828ac>] ll_lookup_nd+0x8c/0x4a0 [lustre]
[<ffffffff811b29b5>] do_lookup+0x1a5/0x230
[<ffffffff811b2fc4>] __link_path_walk+0x584/0x840
[<ffffffff811b398a>] path_walk+0x6a/0xe0
[<ffffffff811b3b9b>] filename_lookup+0x6b/0xc0
[<ffffffff811b4cc7>] user_path_at+0x57/0xa0
[<ffffffff811a8790>] vfs_fstatat+0x50/0xa0
[<ffffffff811a8804>] sys_newfstatat+0x24/0x50
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

4799 mdt00_005
[<ffffffffa06643c9>] ldlm_completion_ast+0x4c9/0x930 [ptlrpc]
[<ffffffffa0663b23>] ldlm_cli_enqueue_local+0x1f3/0x5d0 [ptlrpc]
[<ffffffffa0c9e085>] mdt_object_local_lock+0x1b5/0xa60 [mdt]
[<ffffffffa0c9e995>] mdt_object_lock_internal+0x65/0x360 [mdt]
[<ffffffffa0c9ed54>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa0ca3f1c>] mdt_getattr_name_lock+0xd4c/0x1a60 [mdt]
[<ffffffffa0ca5152>] mdt_intent_getattr+0x292/0x470 [mdt]
[<ffffffffa0c9a6c9>] mdt_intent_policy+0x499/0xca0 [mdt]
[<ffffffffa0645422>] ldlm_lock_enqueue+0x302/0x920 [ptlrpc]
[<ffffffffa066d651>] ldlm_handle_enqueue0+0x341/0x11e0 [ptlrpc]
[<ffffffffa06ec9a2>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
[<ffffffffa06ebc35>] tgt_request_handle+0x245/0xad0 [ptlrpc]
[<ffffffffa069ed91>] ptlrpc_main+0xcf1/0x1880 [ptlrpc]
[<ffffffff8109eab6>] kthread+0x96/0xa0
[<ffffffff8100c30a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff


3831 ldlm_bl_00
[<ffffffff811bf08e>] inode_wait+0xe/0x20
[<ffffffff811c0c0c>] ifind+0xac/0xe0
[<ffffffff811c0c8a>] ilookup5+0x4a/0x60
[<ffffffffa0e80a5d>] ll_md_blocking_ast+0x6bd/0x800 [lustre]
[<ffffffffa063fe6f>] ldlm_cancel_callback+0x6f/0x160 [ptlrpc]
[<ffffffffa065d6aa>] ldlm_cli_cancel_local+0x8a/0x480 [ptlrpc]
[<ffffffffa0662280>] ldlm_cli_cancel+0x60/0x360 [ptlrpc]
[<ffffffffa0e80487>] ll_md_blocking_ast+0xe7/0x800 [lustre]
[<ffffffffa0666060>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[<ffffffffa0668161>] ldlm_bl_thread_main+0x281/0x400 [ptlrpc]
[<ffffffff8109eab6>] kthread+0x96/0xa0
[<ffffffff8100c30a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
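
The traces in this ticket have the shape of /proc/<pid>/stack output preceded by a "PID command" line. The ticket does not say how they were collected, so the following is only a sketch of one way to gather similar traces when retrying the reproducer. The function name dump_stacks and the PROC_ROOT override are hypothetical, and reading /proc/<pid>/stack usually requires root.

```shell
# Hypothetical helper, not part of Lustre or its test framework.
# Prints "PID COMM" followed by the kernel stack of each PID given as an
# argument, separated by blank lines.
# PROC_ROOT is an assumed override (default: /proc) so the helper can also
# be pointed at a saved copy of the per-process files.
dump_stacks() {
    proc_root=${PROC_ROOT:-/proc}
    for pid in "$@"; do
        # Skip PIDs that have exited or whose stack file is not readable.
        [ -r "$proc_root/$pid/stack" ] || continue
        # comm may be missing on some older kernels; the name is then empty.
        printf '%s %s\n' "$pid" "$(cat "$proc_root/$pid/comm" 2>/dev/null)"
        cat "$proc_root/$pid/stack"
        echo
    done
}
```

For example, dump_stacks $(pgrep -x rm) $(pgrep -x touch) would print the stacks of the client-side processes; the PIDs of the MDT service threads and the ldlm_bl threads can be passed the same way.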



u:lustre-release# xddr2line ll_md_blocking_ast+0x6bd/0x800 [lustre]
ll_md_blocking_ast
/root/lustre-release/lustre/llite/namei.c:322

        master_inode = ilookup5(inode->i_sb, hash,
                                ll_test_inode_by_fid,
                                (void *)&lli->lli_pfid);


Comments
Comment by Andreas Dilger [ 24/Aug/15 ]

Di, has this problem been fixed with other recent changes to DNE?

Comment by Di Wang [ 24/Aug/15 ]

Hmm, no, I do not think so. But this looks like a serious issue; I will work on it right away.

Comment by Gerrit Updater [ 24/Aug/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16066
Subject: LU-5344 tests: add test_90 in sanityn.sh
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 63129f70575eeaf722d0e85caed145c9a6368b29

Comment by Di Wang [ 24/Aug/15 ]

Hmm, I cannot reproduce this problem on current master anymore. I have just added a test case here.

Comment by John Hammond [ 24/Aug/15 ]

This is still reproducible:

export MDSCOUNT=2
export MOUNT_2=y
llmount.sh

cd /mnt/lustre
while true; do lfs mkdir -c4 d0; chmod go+w d0; done

cd /mnt/lustre2
while true; do rm -rf d0; done

4430 ldlm_bl_02
[<ffffffff811bffae>] inode_wait+0xe/0x20
[<ffffffff811c1b2c>] ifind+0xac/0xe0
[<ffffffff811c1baa>] ilookup5+0x4a/0x60
[<ffffffffa15269f9>] ll_md_blocking_ast+0x6d9/0x810 [lustre]
[<ffffffffa0c51bcf>] ldlm_cancel_callback+0x6f/0x160 [ptlrpc]
[<ffffffffa0c70a1a>] ldlm_cli_cancel_local+0x8a/0x480 [ptlrpc]
[<ffffffffa0c75700>] ldlm_cli_cancel+0x60/0x360 [ptlrpc]
[<ffffffffa152640d>] ll_md_blocking_ast+0xed/0x810 [lustre]
[<ffffffffa0c79bc0>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[<ffffffffa0c7aadc>] ldlm_bl_thread_main+0x48c/0x700 [ptlrpc]
[<ffffffff8109e856>] kthread+0x96/0xa0
[<ffffffff8100c30a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

29332 chmod
[<ffffffffa0c99ac3>] ptlrpc_set_wait+0x333/0x9e0 [ptlrpc]
[<ffffffffa0c9a1f4>] ptlrpc_queue_wait+0x84/0x220 [ptlrpc]
[<ffffffffa0fb588d>] mdc_reint+0x6d/0x170 [mdc]
[<ffffffffa0fb7418>] mdc_setattr+0x1b8/0x470 [mdc]
[<ffffffffa0f6fdad>] lmv_setattr+0x21d/0x5a0 [lmv]
[<ffffffffa150ec84>] ll_setattr_raw+0x304/0x13a0 [lustre]
[<ffffffffa150fd85>] ll_setattr+0x65/0xd0 [lustre]
[<ffffffff811c25c8>] notify_change+0x168/0x340
[<ffffffff811a0719>] sys_fchmodat+0x119/0x170
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

29333 rm
[<ffffffffa0c99ac3>] ptlrpc_set_wait+0x333/0x9e0 [ptlrpc]
[<ffffffffa0c9a1f4>] ptlrpc_queue_wait+0x84/0x220 [ptlrpc]
[<ffffffffa0c7249e>] ldlm_cli_enqueue+0x37e/0x870 [ptlrpc]
[<ffffffffa0fbc6e0>] mdc_enqueue+0x2a0/0x18e0 [mdc]
[<ffffffffa0fbdf15>] mdc_intent_lock+0x1f5/0x537 [mdc]
[<ffffffffa0f7bcd5>] lmv_revalidate_slaves+0x375/0xe50 [lmv]
[<ffffffffa0f62ef4>] lmv_merge_attr+0x24/0x1a0 [lmv]
[<ffffffffa150b179>] ll_update_inode+0x1679/0x1c10 [lustre]
[<ffffffffa150b77d>] ll_read_inode2+0x6d/0x420 [lustre]
[<ffffffffa15239eb>] ll_iget+0x12b/0x2e0 [lustre]
[<ffffffffa150d720>] ll_prep_inode+0x5d0/0xc70 [lustre]
[<ffffffffa1526e51>] ll_lookup_it_finish+0x321/0x1300 [lustre]
[<ffffffffa15280b9>] ll_lookup_it+0x289/0xdb0 [lustre]
[<ffffffffa1528c69>] ll_lookup_nd+0x89/0x530 [lustre]
[<ffffffff811b3385>] do_lookup+0x1a5/0x230
[<ffffffff811b3994>] __link_path_walk+0x584/0x840
[<ffffffff811b435a>] path_walk+0x6a/0xe0
[<ffffffff811b456b>] filename_lookup+0x6b/0xc0
[<ffffffff811b60b4>] do_filp_open+0x104/0xd30
[<ffffffff8119fbe9>] do_sys_open+0x69/0x140
[<ffffffff8119fcd1>] sys_openat+0x11/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff


4801 mdt01_004
[<ffffffffa0c77c79>] ldlm_completion_ast+0x609/0x9b0 [ptlrpc]
[<ffffffffa0c71c26>] ldlm_cli_enqueue_fini+0x966/0xe60 [ptlrpc]
[<ffffffffa0c724e1>] ldlm_cli_enqueue+0x3c1/0x870 [ptlrpc]
[<ffffffffa14566c1>] osp_md_object_lock+0x181/0x220 [osp]
[<ffffffffa13e97ab>] lod_object_lock+0x36b/0x830 [lod]
[<ffffffffa12e455b>] mdd_object_lock+0x3b/0xd0 [mdd]
[<ffffffffa134994e>] mdt_lock_slaves+0x2ce/0x540 [mdt]
[<ffffffffa134b4c2>] mdt_reint_setattr+0x812/0xd00 [mdt]
[<ffffffffa134190d>] mdt_reint_rec+0x5d/0x200 [mdt]
[<ffffffffa132ac63>] mdt_reint_internal+0x633/0xa50 [mdt]
[<ffffffffa132b51b>] mdt_reint+0x6b/0x120 [mdt]
[<ffffffffa0d0bae2>] tgt_request_handle+0xa62/0x1260 [ptlrpc]
[<ffffffffa0cb687a>] ptlrpc_main+0xdaa/0x18b0 [ptlrpc]
[<ffffffff8109e856>] kthread+0x96/0xa0
[<ffffffff8100c30a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

Comment by Di Wang [ 25/Aug/15 ]

John: Thanks for testing, I just updated the patch, please check. Thanks.

Comment by Gerrit Updater [ 19/Sep/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16066/
Subject: LU-5344 llite: lookup master inode by ilookup5_nowait
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d06433141bbd83e523bc611f23cb1b42935830f4

Comment by Peter Jones [ 19/Sep/15 ]

Landed for 2.8
