Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Lustre 2.4.0, Lustre 2.5.0, Lustre 2.6.0
- Environment: Chaos 5, single client
- 3
- 4056
Description
Shared file writes on a single client from many threads perform significantly worse with the CLIO code.
However, unlike issue LU-1666, the root cause here appears to be lli->lli_write_mutex, which is taken in the common write path (see ll_file_io_generic). This lock effectively serializes all writes to the file and drastically limits the rate at which we can dirty pages. It's so bad that, according to /proc/meminfo, we're unable to dirty pages in the cache faster than they are written to the server; that is, the total number of dirty pages on the client rarely exceeds a few megabytes.
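To make the effect concrete, here is a purely illustrative userspace analogue (not Lustre code; every name in it is made up): many threads write disjoint 1 MiB chunks of one shared file, but a single pthread mutex standing in for lli_write_mutex forces them through the write path one at a time, so adding threads buys essentially nothing.

/* write_mutex_demo.c -- illustrative only, not Lustre code. */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS   16
#define CHUNK      (1 << 20)    /* 1 MiB per write, like the fio job */
#define CHUNKS_PER 64

static pthread_mutex_t write_mutex = PTHREAD_MUTEX_INITIALIZER;
static int fd;

static void *writer(void *arg)
{
        intptr_t id = (intptr_t)arg;
        char *buf = malloc(CHUNK);
        int i;

        if (buf == NULL)
                return NULL;
        memset(buf, 'x', CHUNK);

        for (i = 0; i < CHUNKS_PER; i++) {
                off_t off = ((off_t)id * CHUNKS_PER + i) * CHUNK;

                /* Every thread serializes here even though the byte ranges
                 * never overlap -- the same effect the per-file write mutex
                 * has on shared-file writes from many threads on one client. */
                pthread_mutex_lock(&write_mutex);
                if (pwrite(fd, buf, CHUNK, off) != CHUNK)
                        perror("pwrite");
                pthread_mutex_unlock(&write_mutex);
        }
        free(buf);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        intptr_t i;

        fd = open("ssf-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, writer, (void *)i);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        close(fd);
        return 0;
}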
Once again, this issue can easily be reproduced with the following fio script run on a single client.
fio ssf.fio
; ssf.fio, single shared file
[global]
directory=/p/lstest/behlendo/fio/
filename=ssf
size=256g
blocksize=1m
direct=0
ioengine=sync
numjobs=128
group_reporting=1
fallocate=none
runtime=60
stonewall

[write]
rw=randwrite:256
rw_sequencer=sequential
fsync_on_close=1

[read]
rw=randread:256
rw_sequencer=sequential
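To watch the dirty-page behaviour described above while fio runs, something like the following small watcher can be left running on the client; it just polls the Dirty: line of /proc/meminfo once a second (illustrative only).

/* dirty_watch.c -- print the Dirty: line of /proc/meminfo once a second. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char line[128];

        for (;;) {
                FILE *f = fopen("/proc/meminfo", "r");

                if (f == NULL) {
                        perror("/proc/meminfo");
                        return 1;
                }
                while (fgets(line, sizeof(line), f) != NULL)
                        if (strncmp(line, "Dirty:", 6) == 0)
                                fputs(line, stdout);
                fclose(f);
                fflush(stdout);
                sleep(1);
        }
        return 0;
}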
The following is a sampling of stacks from the system showing all the IO threads contending for this mutex. In addition, we can clearly see process 33296 holding the mutex while potentially blocking briefly in osc_enter_cache(). It's still unclear to me whether simply making cl_io_loop() drastically faster would be enough to improve this, or whether we must reintroduce some range locking as was done in 1.8 (a rough sketch of that idea follows the traces below).
fio           S 0000000000000008     0 33296  33207 0x00000000
 ffff880522555548 0000000000000082 ffff880522555510 ffff88052255550c
 ffff8807ff75a8c8 ffff88083fe82000 ffff880044635fc0 00000000000007ff
 ffff8807c6633af8 ffff880522555fd8 000000000000f4e8 ffff8807c6633af8
Call Trace:
 [<ffffffffa033467e>] cfs_waitq_wait+0xe/0x10 [libcfs]
 [<ffffffffa0a3b046>] osc_enter_cache+0x676/0x9c0 [osc]
 [<ffffffffa0aa3433>] ? lov_merge_lvb_kms+0x113/0x260 [lov]
 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
 [<ffffffffa0aafc4e>] ? lov_attr_get+0x1e/0x60 [lov]
 [<ffffffffa0a41e63>] osc_queue_async_io+0xe53/0x1430 [osc]
 [<ffffffff81274a1c>] ? put_dec+0x10c/0x110
 [<ffffffff81274d0e>] ? number+0x2ee/0x320
 [<ffffffffa0a29f3e>] ? osc_page_init+0x29e/0x5b0 [osc]
 [<ffffffff81277206>] ? vsnprintf+0x2b6/0x5f0
 [<ffffffffa0a28e24>] osc_page_cache_add+0x94/0x1a0 [osc]
 [<ffffffffa06c45bf>] cl_page_cache_add+0x7f/0x1e0 [obdclass]
 [<ffffffffa0ab2edf>] lov_page_cache_add+0x8f/0x190 [lov]
 [<ffffffffa06c45bf>] cl_page_cache_add+0x7f/0x1e0 [obdclass]
 [<ffffffff8112672d>] ? __set_page_dirty_nobuffers+0xdd/0x160
 [<ffffffffa0b78c43>] vvp_io_commit_write+0x313/0x4e0 [lustre]
 [<ffffffffa06d0d8d>] cl_io_commit_write+0xbd/0x1b0 [obdclass]
 [<ffffffffa0b5465b>] ll_commit_write+0xfb/0x320 [lustre]
 [<ffffffffa0b68660>] ll_write_end+0x30/0x60 [lustre]
 [<ffffffff81111684>] generic_file_buffered_write+0x174/0x2a0
 [<ffffffff810708d7>] ? current_fs_time+0x27/0x30
 [<ffffffff81112f70>] __generic_file_aio_write+0x250/0x480
 [<ffffffffa06c14c5>] ? cl_env_info+0x15/0x20 [obdclass]
 [<ffffffff8111320f>] generic_file_aio_write+0x6f/0xe0
 [<ffffffffa0b793a0>] vvp_io_write_start+0xb0/0x1e0 [lustre]
 [<ffffffffa06ce392>] cl_io_start+0x72/0xf0 [obdclass]
 [<ffffffffa06d1ca4>] cl_io_loop+0xb4/0x160 [obdclass]
 [<ffffffffa0b33b0e>] ll_file_io_generic+0x3be/0x4f0 [lustre]
 [<ffffffffa0b33d8c>] ll_file_aio_write+0x14c/0x290 [lustre]
 [<ffffffffa0b341e3>] ll_file_write+0x173/0x270 [lustre]
 [<ffffffff81177a98>] vfs_write+0xb8/0x1a0
 [<ffffffff811784a1>] sys_write+0x51/0x90
 [<ffffffff814f21be>] ? do_device_not_available+0xe/0x10
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

fio           S 0000000000000001     0 33297  33207 0x00000000
 ffff8807d4b91d18 0000000000000086 ffff8807d4b91c98 ffffffffa0aba783
 ffff880c5cca92f0 ffff880f9a8c7428 ffffffff8100bc0e ffff8807d4b91d18
 ffff8807c6633098 ffff8807d4b91fd8 000000000000f4e8 ffff8807c6633098
Call Trace:
 [<ffffffffa0aba783>] ? lov_io_init_raid0+0x383/0x780 [lov]
 [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8104d93b>] ? mutex_spin_on_owner+0x9b/0xc0
 [<ffffffff814f04bb>] __mutex_lock_interruptible_slowpath+0x14b/0x1b0
 [<ffffffff814f0568>] mutex_lock_interruptible+0x48/0x50
 [<ffffffffa0b33aff>] ll_file_io_generic+0x3af/0x4f0 [lustre]
 [<ffffffffa0b33d8c>] ll_file_aio_write+0x14c/0x290 [lustre]
 [<ffffffffa0b341e3>] ll_file_write+0x173/0x270 [lustre]
 [<ffffffff81177a98>] vfs_write+0xb8/0x1a0
 [<ffffffff811784a1>] sys_write+0x51/0x90
 [<ffffffff814f21be>] ? do_device_not_available+0xe/0x10
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

...many threads like 33297...
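For what it's worth, the range-locking idea mentioned above could look roughly like the following userspace sketch: writers take a byte-range lock instead of one big per-file mutex, so non-overlapping writes proceed in parallel and only overlapping ranges serialize. This is purely an illustration under made-up names (range_lock_tree, range_lock_acquire, ...), not the 1.8 code and not a proposed patch.

/* range_lock_sketch.c -- hypothetical byte-range locking, illustration only. */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct range_lock {
        uint64_t start, end;               /* inclusive byte range held by one writer */
        struct range_lock *next;
};

struct range_lock_tree {
        pthread_mutex_t    mutex;          /* protects the list below */
        pthread_cond_t     cond;           /* signalled whenever a range is released */
        struct range_lock *head;           /* currently held ranges, unsorted */
};

void range_lock_tree_init(struct range_lock_tree *t)
{
        pthread_mutex_init(&t->mutex, NULL);
        pthread_cond_init(&t->cond, NULL);
        t->head = NULL;
}

static int overlaps(const struct range_lock *r, uint64_t start, uint64_t end)
{
        return r->start <= end && start <= r->end;
}

/* Block until [start, end] overlaps no held range, then record it as held.
 * Non-overlapping writers never wait on each other. */
void range_lock_acquire(struct range_lock_tree *t, struct range_lock *r,
                        uint64_t start, uint64_t end)
{
        struct range_lock *cur;

        r->start = start;
        r->end = end;

        pthread_mutex_lock(&t->mutex);
retry:
        for (cur = t->head; cur != NULL; cur = cur->next) {
                if (overlaps(cur, start, end)) {
                        /* Wait for any release, then rescan; simple but good
                         * enough for a sketch. */
                        pthread_cond_wait(&t->cond, &t->mutex);
                        goto retry;
                }
        }
        r->next = t->head;
        t->head = r;
        pthread_mutex_unlock(&t->mutex);
}

/* Drop a held range and wake anyone waiting to lock an overlapping one. */
void range_lock_release(struct range_lock_tree *t, struct range_lock *r)
{
        struct range_lock **pp;

        pthread_mutex_lock(&t->mutex);
        for (pp = &t->head; *pp != NULL; pp = &(*pp)->next) {
                if (*pp == r) {
                        *pp = r->next;
                        break;
                }
        }
        pthread_cond_broadcast(&t->cond);
        pthread_mutex_unlock(&t->mutex);
}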
Finally, it's my understanding that this lock exists solely to enforce write atomicity between threads on the client. If that is in fact the case, we may do a little experimenting with the group lock to see whether it's a reasonable short-term workaround for some of our codes.
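For reference, a minimal sketch of how one of our codes might take a Lustre group lock on the shared file. It assumes the lustre_user.h header shipped with the client packages (its install path varies by release), which defines LL_IOC_GROUP_LOCK/LL_IOC_GROUP_UNLOCK; the group id here is an arbitrary non-zero value that all cooperating writers would share. Error handling is minimal and this is only an illustration, not a recommendation.

/* group_lock_sketch.c -- illustration of taking a Lustre group lock. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <lustre/lustre_user.h>   /* LL_IOC_GROUP_LOCK / LL_IOC_GROUP_UNLOCK */

int main(int argc, char *argv[])
{
        long gid = 7;                   /* arbitrary non-zero id shared by all writers */
        char buf[1 << 20] = { 0 };
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <lustre file>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Every process passing the same gid shares one group extent lock
         * instead of taking individual extent locks on the file. */
        if (ioctl(fd, LL_IOC_GROUP_LOCK, gid) < 0) {
                perror("LL_IOC_GROUP_LOCK");
                return 1;
        }

        if (pwrite(fd, buf, sizeof(buf), 0) < 0)
                perror("pwrite");

        ioctl(fd, LL_IOC_GROUP_UNLOCK, gid);
        close(fd);
        return 0;
}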
Attachments
Issue Links
- is blocked by
  - LU-3261 Threads stuck in osc_enter_cache when testing LU-1669 (Resolved)
- is related to
  - LU-5700 async IO LBUG obj->cob_transient_pages == 0 (Resolved)
  - LU-8025 ll_direct_io code introduced in LU-6260 is based on 2.4/2.5, ignores LU-1669 (Resolved)
  - LU-5360 Test timeout sanity-hsm test_13 (Resolved)
  - LU-11085 Replace Lustre interval tree with kernel one (Resolved)