PFL known issues tracking ticket (LU-9349)

[LU-9344] sanity test_244: sendfile_grouplock test12() test hung Created: 14/Apr/17  Updated: 10/Jul/17  Resolved: 28/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Technical task Priority: Critical
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: pfl

Issue Links:
Related
is related to LU-8998 Progressive File Layout (PFL) Resolved
is related to LU-9429 parallel-scale test_parallel_grouploc... Open
is related to LU-9479 sanity test 184d 244: don't instantia... Open
is related to LU-9756 sanity test 184d fails with ‘lovea *... Open

Description

This issue was created by maloo for bobijam <bobijam.xu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/67af86be-2027-11e7-9073-5254006e85c2.

The sub-test test_244 failed with the following error:

test failed to respond and timed out

Info required for matching: sanity 244

sendfile_grouplock.c calls sendfile_copy(sourcefile, 0, destfile, 98765), and sendfile_copy() calls llapi_group_lock(fd_out, dest_gid),

which goes into lov_io_init() and does atomic_inc(&lov->lo_active_ios).

sendfile_copy() then tries to write to the file; the write needs to fetch the layout, and ll_layout_refresh() finds there is an active IO (marked by ll_get_grouplock()), so the write hangs there:

sendfile_grou S 0000000000000000     0  7394   7321 0x00000080
 ffff88000eb3f618 0000000000000082 ffff88000eb3f5e0 ffff88000eb3f5dc
 00001ce200000000 ffff88003f828400 0000005dce083b5f ffff880003436ac0
 00000000000005ff 0000000100017a1d ffff88002b57fad0 ffff88000eb3ffd8
Call Trace:
 [<ffffffffa0afa20b>] lov_layout_wait+0x11b/0x220 [lov]
 [<ffffffff810640e0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0afc11e>] lov_conf_set+0x37e/0xa30 [lov]
 [<ffffffffa040f471>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa059d888>] cl_conf_set+0x58/0x100 [obdclass]
 [<ffffffffa0fa5dd4>] ll_layout_conf+0x84/0x3f0 [lustre]
 [<ffffffffa040f471>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0fb0b9d>] ll_layout_refresh+0x96d/0x1710 [lustre]
 [<ffffffffa040f471>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0ff7d6f>] vvp_io_init+0x32f/0x450 [lustre]
 [<ffffffffa040f471>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa05a5148>] cl_io_init0+0x88/0x150 [obdclass]
 [<ffffffffa05a7caa>] cl_io_init+0x4a/0xa0 [obdclass]
 [<ffffffffa05a7dbc>] cl_io_rw_init+0xbc/0x200 [obdclass]
 [<ffffffffa0fa7213>] ll_file_io_generic+0x203/0xaf0 [lustre]
 [<ffffffffa0fa941d>] ll_file_aio_write+0x13d/0x280 [lustre]
 [<ffffffffa0fa969a>] ll_file_write+0x13a/0x270 [lustre]
 [<ffffffff81189ef8>] vfs_write+0xb8/0x1a0
 [<ffffffff811ba76d>] kernel_write+0x3d/0x50
 [<ffffffff811ba7da>] write_pipe_buf+0x5a/0x90
 [<ffffffff811b9342>] splice_from_pipe_feed+0x72/0x120
 [<ffffffff811ba780>] ? write_pipe_buf+0x0/0x90
 [<ffffffff811ba780>] ? write_pipe_buf+0x0/0x90
 [<ffffffff811b9d9e>] __splice_from_pipe+0x6e/0x80
 [<ffffffff811ba780>] ? write_pipe_buf+0x0/0x90
 [<ffffffff811b9e01>] splice_from_pipe+0x51/0x70
 [<ffffffff811b9e3d>] default_file_splice_write+0x1d/0x30
 [<ffffffff811b9fca>] do_splice_from+0xba/0xf0
 [<ffffffff811ba020>] direct_splice_actor+0x20/0x30
 [<ffffffff811ba256>] splice_direct_to_actor+0xc6/0x1c0
 [<ffffffff811ba000>] ? direct_splice_actor+0x0/0x30
 [<ffffffff811ba39d>] do_splice_direct+0x4d/0x60
 [<ffffffff8118a344>] do_sendfile+0x184/0x1e0
 [<ffffffff8118a3d4>] sys_sendfile64+0x34/0xb0
 [<ffffffff810e031e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
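
For reference, here is a minimal user-space sketch of the pattern described above, assuming llapi_group_lock()/llapi_group_unlock() from lustre/lustreapi.h; the path, buffer size, and group-lock ID are illustrative placeholders, not the exact code of sendfile_grouplock.c. It simply takes a group lock on the destination and then writes while the lock is held.

/*
 * Sketch of the deadlocking pattern (placeholder path and group-lock ID,
 * not the exact values used by sendfile_grouplock.c).
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

int main(void)
{
	const char *path = "/mnt/lustre/destfile";	/* placeholder path */
	int gid = 98765;				/* group-lock ID */
	char buf[4096];
	int rc;

	int fd = open(path, O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Taking the group lock starts an IO on every object in the current
	 * layout (lov_io_init() bumps lo_active_ios).
	 */
	rc = llapi_group_lock(fd, gid);
	if (rc < 0) {
		fprintf(stderr, "llapi_group_lock: %s\n", strerror(-rc));
		close(fd);
		return 1;
	}

	/*
	 * If this write lands in a not-yet-instantiated PFL component, the
	 * client has to refresh the layout, and ll_layout_refresh() waits for
	 * active IOs to drain -- which cannot happen while the group lock is
	 * still held, hence the hang in the stack trace above.
	 */
	memset(buf, 0, sizeof(buf));
	if (write(fd, buf, sizeof(buf)) < 0)
		perror("write");

	llapi_group_unlock(fd, gid);
	close(fd);
	return 0;
}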


Comments
Comment by Zhenyu Xu [ 14/Apr/17 ]

Jinshan,

I think sendfile_grouplock.c does not use the group lock correctly. It holds a group lock on the file while trying to write data to it.

Comment by Gerrit Updater [ 15/Apr/17 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26646
Subject: LU-9344 test: hung with test12()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 92460697205d25e2de08f4c6d05e3dc8d8bc3387

Comment by Jinshan Xiong (Inactive) [ 17/Apr/17 ]

Bobijam and I discussed this problem a little bit. A group lock needs to acquire locks on all objects in the current layout, so it has to increase active_ios in the LOV layer; that guarantees the layout won't disappear for as long as the group lock exists.

When a write extends a PFL layout while the group lock is held, it results in a deadlock, because configuring the new layout has to wait for the number of active IOs to drop to zero.

The current workaround is to instantiate all components before the group lock is taken.
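
As an illustration only (not the actual patch, which changes the test itself in https://review.whamcloud.com/26646), the workaround idea could look roughly like the sketch below: touch an offset inside each component's extent so the whole layout is instantiated before llapi_group_lock() is called, leaving no layout change to be made while the lock is held. The helper name, offsets, and error handling are hypothetical.

/*
 * Hypothetical sketch of the workaround idea: instantiate every PFL component
 * before taking the group lock.
 */
#include <fcntl.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

static int grouplock_with_full_layout(const char *path, int gid,
				      const off_t *comp_offsets, int ncomps)
{
	char byte = 0;
	int fd, rc, i;

	fd = open(path, O_WRONLY | O_CREAT, 0644);
	if (fd < 0)
		return -1;

	/*
	 * Write one byte inside each component's extent so every component is
	 * instantiated now, while no group lock pins the layout.
	 */
	for (i = 0; i < ncomps; i++) {
		if (pwrite(fd, &byte, 1, comp_offsets[i]) < 0) {
			close(fd);
			return -1;
		}
	}

	/*
	 * With the full layout already instantiated, writes under the group
	 * lock no longer trigger a layout change, so ll_layout_refresh() never
	 * has to wait for the lock holder's own IO to drain.
	 */
	rc = llapi_group_lock(fd, gid);
	if (rc < 0) {
		close(fd);
		return rc;
	}

	/* ... perform the copy/write while the group lock is held ... */

	llapi_group_unlock(fd, gid);
	close(fd);
	return 0;
}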

Comment by Andreas Dilger [ 22/Apr/17 ]

This will hurt all file migration operations, since it will instantiate all layout components on both the source and target files. That wouldn't be so bad if it only instantiated the components on the source, but it doesn't really make sense to instantiate the components when getting the group lock on a file that is opened read-only.

That said, I'm wondering if there is even a race when getting the group lock on the new objects? Since the client(s) writing to the file are already holding the group lock on the objects on the first part of the file, any other clients would be blocked from accessing the file if they are enqueuing the group locks in component order. The existing group lock holders could still group lock the newly allocated objects without dropping the locks on the existing objects (which would cause a deadlock).

Comment by Jinshan Xiong (Inactive) [ 23/Apr/17 ]

For migration, there is another option: use a Lustre file lease. But that's a really good point about acquiring a group lock when a file is opened read-only.

It seems hard to maintain the current semantics of the group lock. Can we revise them? For example, the group lock would fail if the file's layout is changed.

Comment by Gerrit Updater [ 28/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26646/
Subject: LU-9344 test: hung with sendfile_grouplock test12()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c6b5df7644c245853b5dcf82b1c93614c5357f3f

Comment by Peter Jones [ 28/Apr/17 ]

Landed for 2.10
