[LU-7016] deadlock on reading multiple files from HSM archive Created: 17/Aug/15 Updated: 13/Oct/21 Resolved: 13/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
master |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When it reads multiple files from HSM archive using posix copytool, it causes deadlock on the client. Aug 18 07:46:58 r38 kernel: INFO: task lhsmtool_posix:16414 blocked for more than 120 seconds. Aug 18 07:46:58 r38 kernel: Not tainted 2.6.32-431.29.2.el6.x86_64 #1 Aug 18 07:46:58 r38 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 18 07:46:58 r38 kernel: lhsmtool_posi D 0000000000000002 0 16414 14111 0x00000000 Aug 18 07:46:58 r38 kernel: ffff880f6379bd18 0000000000000082 ffff88102609502a ffff881020055c40 Aug 18 07:46:58 r38 kernel: ffff880f6379bc88 ffffffff81227e9f ffff880f6379bd68 ffffffff81199045 Aug 18 07:46:59 r38 kernel: ffff880f67361af8 ffff880f6379bfd8 000000000000fbc8 ffff880f67361af8 Aug 18 07:46:59 r38 kernel: Call Trace: Aug 18 07:46:59 r38 kernel: [<ffffffff81227e9f>] ? security_inode_permission+0x1f/0x30 Aug 18 07:46:59 r38 kernel: [<ffffffff81199045>] ? __link_path_walk+0x145/0x1000 Aug 18 07:46:59 r38 kernel: [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180 Aug 18 07:46:59 r38 kernel: [<ffffffff8152a45b>] mutex_lock+0x2b/0x50 Aug 18 07:46:59 r38 kernel: [<ffffffff8119ba76>] do_filp_open+0x2d6/0xd20 Aug 18 07:46:59 r38 kernel: [<ffffffffa0605343>] ? ll_file_release+0x683/0xac0 [lustre] Aug 18 07:46:59 r38 kernel: [<ffffffff81153ea1>] ? unlink_anon_vmas+0x71/0xd0 Aug 18 07:46:59 r38 kernel: [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90 Aug 18 07:46:59 r38 kernel: [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160 Aug 18 07:46:59 r38 kernel: [<ffffffff81185be9>] do_sys_open+0x69/0x140 Aug 18 07:46:59 r38 kernel: [<ffffffff81185d00>] sys_open+0x20/0x30 Aug 18 07:46:59 r38 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b Aug 18 07:46:59 r38 kernel: INFO: task lhsmtool_posix:16415 blocked for more than 120 seconds. Aug 18 07:46:59 r38 kernel: Not tainted 2.6.32-431.29.2.el6.x86_64 #1 Aug 18 07:46:59 r38 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 18 07:46:59 r38 kernel: lhsmtool_posi D 0000000000000004 0 16415 14111 0x00000000 Aug 18 07:46:59 r38 kernel: ffff880f634c1d18 0000000000000082 0000000000000000 ffff881020055c40 Aug 18 07:46:59 r38 kernel: ffff880f634c1c88 ffffffff81227e9f ffff880f634c1d68 ffffffff81199045 Aug 18 07:46:59 r38 kernel: ffff880f64e5a5f8 ffff880f634c1fd8 000000000000fbc8 ffff880f64e5a5f8 Aug 18 07:46:59 r38 kernel: Call Trace: Aug 18 07:46:59 r38 kernel: [<ffffffff81227e9f>] ? security_inode_permission+0x1f/0x30 Aug 18 07:46:59 r38 kernel: [<ffffffff81199045>] ? __link_path_walk+0x145/0x1000 Aug 18 07:46:59 r38 kernel: [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180 Aug 18 07:46:59 r38 kernel: [<ffffffff8152a45b>] mutex_lock+0x2b/0x50 Aug 18 07:46:59 r38 kernel: [<ffffffff8119ba76>] do_filp_open+0x2d6/0xd20 Aug 18 07:46:59 r38 kernel: [<ffffffffa0605343>] ? ll_file_release+0x683/0xac0 [lustre] Aug 18 07:46:59 r38 kernel: [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90 Aug 18 07:46:59 r38 kernel: [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160 Aug 18 07:46:59 r38 kernel: [<ffffffff81185be9>] do_sys_open+0x69/0x140 Aug 18 07:46:59 r38 kernel: [<ffffffff81185d00>] sys_open+0x20/0x30 Aug 18 07:46:59 r38 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b Aug 18 07:46:59 r38 kernel: INFO: task md5sum:16312 blocked for more than 120 seconds. Aug 18 07:46:59 r38 kernel: Not tainted 2.6.32-431.29.2.el6.x86_64 #1 Aug 18 07:46:59 r38 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 18 07:46:59 r38 kernel: md5sum D 0000000000000001 0 16312 15803 0x00000000 Aug 18 07:46:59 r38 kernel: ffff881fbd611b58 0000000000000082 0000000000000000 0000000000000286 Aug 18 07:46:59 r38 kernel: ffff881fbd611b68 ffff881020055c40 ffff881fbd611b38 ffffffffa05eb3f0 Aug 18 07:46:59 r38 kernel: ffff882026a85ab8 ffff881fbd611fd8 000000000000fbc8 ffff882026a85ab8 Aug 18 07:46:59 r38 kernel: Call Trace: Aug 18 07:46:59 r38 kernel: [<ffffffffa05eb3f0>] ? ll_lookup_finish_locks+0x270/0x8d0 [lustre] Aug 18 07:46:59 r38 kernel: [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180 Aug 18 07:46:59 r38 kernel: [<ffffffff8152a45b>] mutex_lock+0x2b/0x50 Aug 18 07:46:59 r38 kernel: [<ffffffff811989ab>] do_lookup+0x11b/0x230 Aug 18 07:46:59 r38 kernel: [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000 Aug 18 07:46:59 r38 kernel: [<ffffffff8114a3d7>] ? handle_pte_fault+0xf7/0xb00 Aug 18 07:46:59 r38 kernel: [<ffffffff8104eeb7>] ? pte_alloc_one+0x37/0x50 Aug 18 07:46:59 r38 kernel: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0 Aug 18 07:46:59 r38 kernel: [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0 Aug 18 07:46:59 r38 kernel: [<ffffffff81226d56>] ? security_file_alloc+0x16/0x20 Aug 18 07:46:59 r38 kernel: [<ffffffff8119b8a4>] do_filp_open+0x104/0xd20 Aug 18 07:46:59 r38 kernel: [<ffffffff8114e820>] ? __vma_link_rb+0x30/0x40 Aug 18 07:46:59 r38 kernel: [<ffffffff8114e8cb>] ? vma_link+0x9b/0xf0 Aug 18 07:46:59 r38 kernel: [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90 Aug 18 07:46:59 r38 kernel: [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160 Aug 18 07:46:59 r38 kernel: [<ffffffff81185be9>] do_sys_open+0x69/0x140 Aug 18 07:46:59 r38 kernel: [<ffffffff81185d00>] sys_open+0x20/0x30 Aug 18 07:46:59 r38 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comments |
| Comment by Shuichi Ihara (Inactive) [ 17/Aug/15 ] |
|
this is reproduder. |
| Comment by Shuichi Ihara (Inactive) [ 17/Aug/15 ] |
|
client's syslog |
| Comment by Bruno Faccini (Inactive) [ 17/Aug/15 ] |
|
Hello Ihara, |
| Comment by Shuichi Ihara (Inactive) [ 17/Aug/15 ] |
|
Yes, test script and copytool are running on same node. is that problem? |
| Comment by Bruno Faccini (Inactive) [ 17/Aug/15 ] |
|
Thanks in advance, but you can also use a separate+specific Lustre mount-point for the copytool. |
| Comment by Shuichi Ihara (Inactive) [ 18/Aug/15 ] |
|
OK, confirmed. At least, it works if we use separate nodes for copytool and test script. |
| Comment by John Fuchs-Chesney (Inactive) [ 29/Aug/15 ] |
|
Any more work required on this ticket or can we mark it as resolved? Thanks, |
| Comment by Bruno Faccini (Inactive) [ 31/Aug/15 ] |
|
John, even if the dual/2nd specific mount-point work-around works, I still need to work with the reproducer provided and understand/fix the dead-lock condition. |