[LU-7016] deadlock on reading multiple files from HSM archive Created: 17/Aug/15  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Shuichi Ihara (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

master


Attachments: messages (HTML file), mult_test.sh
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Reading multiple files back from an HSM archive through the POSIX copytool deadlocks on the client.

Aug 18 07:46:58 r38 kernel: INFO: task lhsmtool_posix:16414 blocked for more than 120 seconds.
Aug 18 07:46:58 r38 kernel:      Not tainted 2.6.32-431.29.2.el6.x86_64 #1
Aug 18 07:46:58 r38 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 18 07:46:58 r38 kernel: lhsmtool_posi D 0000000000000002     0 16414  14111 0x00000000
Aug 18 07:46:58 r38 kernel: ffff880f6379bd18 0000000000000082 ffff88102609502a ffff881020055c40
Aug 18 07:46:58 r38 kernel: ffff880f6379bc88 ffffffff81227e9f ffff880f6379bd68 ffffffff81199045
Aug 18 07:46:59 r38 kernel: ffff880f67361af8 ffff880f6379bfd8 000000000000fbc8 ffff880f67361af8
Aug 18 07:46:59 r38 kernel: Call Trace:
Aug 18 07:46:59 r38 kernel: [<ffffffff81227e9f>] ? security_inode_permission+0x1f/0x30
Aug 18 07:46:59 r38 kernel: [<ffffffff81199045>] ? __link_path_walk+0x145/0x1000
Aug 18 07:46:59 r38 kernel: [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
Aug 18 07:46:59 r38 kernel: [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
Aug 18 07:46:59 r38 kernel: [<ffffffff8119ba76>] do_filp_open+0x2d6/0xd20
Aug 18 07:46:59 r38 kernel: [<ffffffffa0605343>] ? ll_file_release+0x683/0xac0 [lustre]
Aug 18 07:46:59 r38 kernel: [<ffffffff81153ea1>] ? unlink_anon_vmas+0x71/0xd0
Aug 18 07:46:59 r38 kernel: [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
Aug 18 07:46:59 r38 kernel: [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
Aug 18 07:46:59 r38 kernel: [<ffffffff81185be9>] do_sys_open+0x69/0x140
Aug 18 07:46:59 r38 kernel: [<ffffffff81185d00>] sys_open+0x20/0x30
Aug 18 07:46:59 r38 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Aug 18 07:46:59 r38 kernel: INFO: task lhsmtool_posix:16415 blocked for more than 120 seconds.
Aug 18 07:46:59 r38 kernel:      Not tainted 2.6.32-431.29.2.el6.x86_64 #1
Aug 18 07:46:59 r38 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 18 07:46:59 r38 kernel: lhsmtool_posi D 0000000000000004     0 16415  14111 0x00000000
Aug 18 07:46:59 r38 kernel: ffff880f634c1d18 0000000000000082 0000000000000000 ffff881020055c40
Aug 18 07:46:59 r38 kernel: ffff880f634c1c88 ffffffff81227e9f ffff880f634c1d68 ffffffff81199045
Aug 18 07:46:59 r38 kernel: ffff880f64e5a5f8 ffff880f634c1fd8 000000000000fbc8 ffff880f64e5a5f8
Aug 18 07:46:59 r38 kernel: Call Trace:
Aug 18 07:46:59 r38 kernel: [<ffffffff81227e9f>] ? security_inode_permission+0x1f/0x30
Aug 18 07:46:59 r38 kernel: [<ffffffff81199045>] ? __link_path_walk+0x145/0x1000
Aug 18 07:46:59 r38 kernel: [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
Aug 18 07:46:59 r38 kernel: [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
Aug 18 07:46:59 r38 kernel: [<ffffffff8119ba76>] do_filp_open+0x2d6/0xd20
Aug 18 07:46:59 r38 kernel: [<ffffffffa0605343>] ? ll_file_release+0x683/0xac0 [lustre]
Aug 18 07:46:59 r38 kernel: [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
Aug 18 07:46:59 r38 kernel: [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
Aug 18 07:46:59 r38 kernel: [<ffffffff81185be9>] do_sys_open+0x69/0x140
Aug 18 07:46:59 r38 kernel: [<ffffffff81185d00>] sys_open+0x20/0x30
Aug 18 07:46:59 r38 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Aug 18 07:46:59 r38 kernel: INFO: task md5sum:16312 blocked for more than 120 seconds.
Aug 18 07:46:59 r38 kernel:      Not tainted 2.6.32-431.29.2.el6.x86_64 #1
Aug 18 07:46:59 r38 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 18 07:46:59 r38 kernel: md5sum        D 0000000000000001     0 16312  15803 0x00000000
Aug 18 07:46:59 r38 kernel: ffff881fbd611b58 0000000000000082 0000000000000000 0000000000000286
Aug 18 07:46:59 r38 kernel: ffff881fbd611b68 ffff881020055c40 ffff881fbd611b38 ffffffffa05eb3f0
Aug 18 07:46:59 r38 kernel: ffff882026a85ab8 ffff881fbd611fd8 000000000000fbc8 ffff882026a85ab8
Aug 18 07:46:59 r38 kernel: Call Trace:
Aug 18 07:46:59 r38 kernel: [<ffffffffa05eb3f0>] ? ll_lookup_finish_locks+0x270/0x8d0 [lustre]
Aug 18 07:46:59 r38 kernel: [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
Aug 18 07:46:59 r38 kernel: [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
Aug 18 07:46:59 r38 kernel: [<ffffffff811989ab>] do_lookup+0x11b/0x230
Aug 18 07:46:59 r38 kernel: [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
Aug 18 07:46:59 r38 kernel: [<ffffffff8114a3d7>] ? handle_pte_fault+0xf7/0xb00
Aug 18 07:46:59 r38 kernel: [<ffffffff8104eeb7>] ? pte_alloc_one+0x37/0x50
Aug 18 07:46:59 r38 kernel: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
Aug 18 07:46:59 r38 kernel: [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
Aug 18 07:46:59 r38 kernel: [<ffffffff81226d56>] ? security_file_alloc+0x16/0x20
Aug 18 07:46:59 r38 kernel: [<ffffffff8119b8a4>] do_filp_open+0x104/0xd20
Aug 18 07:46:59 r38 kernel: [<ffffffff8114e820>] ? __vma_link_rb+0x30/0x40
Aug 18 07:46:59 r38 kernel: [<ffffffff8114e8cb>] ? vma_link+0x9b/0xf0
Aug 18 07:46:59 r38 kernel: [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
Aug 18 07:46:59 r38 kernel: [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
Aug 18 07:46:59 r38 kernel: [<ffffffff81185be9>] do_sys_open+0x69/0x140
Aug 18 07:46:59 r38 kernel: [<ffffffff81185d00>] sys_open+0x20/0x30
Aug 18 07:46:59 r38 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Shuichi Ihara (Inactive) [ 17/Aug/15 ]

This is the reproducer. The problem always happens when reading the files back (md5sum).
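
The attached mult_test.sh is the authoritative reproducer and is not shown here. As a rough, hypothetical sketch of the failing pattern (the mount point, file count, and file size below are assumptions for illustration): archive and release a set of files, then read them back in parallel so each read triggers an implicit restore through the copytool.

#!/bin/bash
# Hypothetical sketch only -- see the attached mult_test.sh for the real script.
# Assumed client mount point and test parameters.
MNT=/mnt/lustre/hsmtest
NFILES=8

# Create and archive a set of files.
for i in $(seq 1 $NFILES); do
    dd if=/dev/urandom of=$MNT/file.$i bs=1M count=16 2>/dev/null
    lfs hsm_archive $MNT/file.$i
done

# Wait until each file reports "archived", then release the Lustre copy.
for i in $(seq 1 $NFILES); do
    while ! lfs hsm_state $MNT/file.$i | grep -q archived; do sleep 1; done
    lfs hsm_release $MNT/file.$i
done

# Parallel reads of released files force concurrent restores; this is
# the step where the client hung in this report.
for i in $(seq 1 $NFILES); do
    md5sum $MNT/file.$i &
done
wait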

Comment by Shuichi Ihara (Inactive) [ 17/Aug/15 ]

Client's syslog attached.

Comment by Bruno Faccini (Inactive) [ 17/Aug/15 ]

Hello Ihara,
According to the syslog you provided, and the hung thread stacks dumped in it, both your reproducer script and the md5sum commands are running on the same node as the POSIX copytool. So the first question that comes to mind is: does your copytool use the same Lustre mount point as the other Lustre client-side commands?
If so, can you retry your reproducer with the POSIX copytool using a separate mount point?
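
For reference, a minimal sketch of that suggestion (the MGS NID, filesystem name, and paths below are assumptions for illustration): mount the filesystem a second time and point the copytool at that dedicated mount, leaving the original mount to the test script and md5sum.

# Assumed names: MGS at mgs@tcp, filesystem "lustre", archive id 1.
mkdir -p /mnt/lustre-ct
mount -t lustre mgs@tcp:/lustre /mnt/lustre-ct

# Run the copytool against its own, dedicated mount point only.
lhsmtool_posix --daemon --hsm-root /archive --archive 1 /mnt/lustre-ct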

Comment by Shuichi Ihara (Inactive) [ 17/Aug/15 ]

Yes, the test script and the copytool are running on the same node. Is that a problem?
Let me quickly rerun on separate nodes. Thanks for your advice.

Comment by Bruno Faccini (Inactive) [ 17/Aug/15 ]

Thanks in advance. Alternatively, you can use a separate, dedicated Lustre mount point for the copytool on the same node.

Comment by Shuichi Ihara (Inactive) [ 18/Aug/15 ]

OK, confirmed. It works if we run the copytool and the test script on separate nodes.

Comment by John Fuchs-Chesney (Inactive) [ 29/Aug/15 ]

Is any more work required on this ticket, or can we mark it as resolved?

Thanks,
~ jfc.

Comment by Bruno Faccini (Inactive) [ 31/Aug/15 ]

John, even though the workaround of a second, dedicated mount point works, I still need to run the provided reproducer and understand/fix the deadlock condition.
