[LU-4977] Deadlock in balance_dirty_pages() Created: 29/Apr/14 Updated: 15/May/14 Resolved: 14/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jinshan Xiong (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 13790 | ||||||||
| Description |
|
I can occasionally see this issue on machines with little memory. The deadlock has the following call stack:

dd            D 0000000000000000     0  2158      1 0x00000004
 ffff88010ecc10f8 0000000000000086 ffff8801ffffffff 0000000042a8c635
 ffff88010ecc1078 ffff88009ccb68a0 0000000000047e6a ffffffffaca103a3
 ffff8800d7bd5058 ffff88010ecc1fd8 000000000000fb88 ffff8800d7bd5058
Call Trace:
 [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff81119e10>] ? sync_page+0x0/0x50
 [<ffffffff8150ed93>] io_schedule+0x73/0xc0
 [<ffffffff81119e4d>] sync_page+0x3d/0x50
 [<ffffffff8150f5fa>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff81119de7>] __lock_page+0x67/0x70
 [<ffffffff81096de0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0f60101>] vvp_page_make_ready+0x271/0x280 [lustre]
 [<ffffffffa0542999>] cl_page_make_ready+0x89/0x370 [obdclass]
 [<ffffffffa03b45a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0a323b7>] osc_extent_make_ready+0x3b7/0xe50 [osc]
 [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
 [<ffffffffa0a36af6>] osc_io_unplug0+0x1736/0x2130 [osc]
 [<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffffa03b45a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0a39681>] osc_io_unplug+0x11/0x20 [osc]
 [<ffffffffa0a3bc86>] osc_cache_writeback_range+0xdb6/0x1290 [osc]
 [<ffffffffa03b9d47>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
 [<ffffffffa03b9d47>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
 [<ffffffffa03b9362>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
 [<ffffffffa054a45d>] ? cl_io_sub_init+0x5d/0xc0 [obdclass]
 [<ffffffffa0a29fd0>] osc_io_fsync_start+0x90/0x360 [osc]
 [<ffffffffa0547640>] ? cl_io_start+0x0/0x140 [obdclass]
 [<ffffffffa05476aa>] cl_io_start+0x6a/0x140 [obdclass]
 [<ffffffffa0a8f18e>] lov_io_call+0x8e/0x130 [lov]
 [<ffffffffa0a9324c>] lov_io_start+0x10c/0x180 [lov]
 [<ffffffffa05476aa>] cl_io_start+0x6a/0x140 [obdclass]
 [<ffffffffa054aea4>] cl_io_loop+0xb4/0x1b0 [obdclass]
 [<ffffffffa0f02acb>] cl_sync_file_range+0x31b/0x500 [lustre]
 [<ffffffffa0f2fe7c>] ll_writepages+0x9c/0x220 [lustre]
 [<ffffffff8112e1b1>] do_writepages+0x21/0x40
 [<ffffffff811aca9d>] writeback_single_inode+0xdd/0x290
 [<ffffffff811aceae>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811ad00b>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff8112d60d>] balance_dirty_pages+0x23d/0x4d0
 [<ffffffffa0541768>] ? cl_page_invoid+0x68/0x160 [obdclass]
 [<ffffffff8112d904>] balance_dirty_pages_ratelimited_nr+0x64/0x70
 [<ffffffff8111a86a>] generic_file_buffered_write+0x1da/0x2e0
 [<ffffffff81075887>] ? current_fs_time+0x27/0x30
 [<ffffffff8111c210>] __generic_file_aio_write+0x260/0x490
 [<ffffffffa0a93d9c>] ? lov_lock_enqueue+0xbc/0x170 [lov]
 [<ffffffff8111c4c8>] generic_file_aio_write+0x88/0x100
 [<ffffffffa0f634a2>] vvp_io_write_start+0x102/0x3f0 [lustre]
 [<ffffffffa05476aa>] cl_io_start+0x6a/0x140 [obdclass]
 [<ffffffffa054aea4>] cl_io_loop+0xb4/0x1b0 [obdclass]
 [<ffffffffa0f00297>] ll_file_io_generic+0x407/0x8d0 [lustre]
 [<ffffffffa05406c9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa0f00fa3>] ll_file_aio_write+0x133/0x2b0 [lustre]
 [<ffffffffa0f01279>] ll_file_write+0x159/0x290 [lustre]
 [<ffffffff81181398>] vfs_write+0xb8/0x1a0
 [<ffffffff81181c91>] sys_write+0x51/0x90
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

balance_dirty_pages() runs after write_end() and tries to write back some dirty pages. However, ll_write_end() can hold the page in order to add it to the commit queue, so writeback ends up waiting in __lock_page() (reached through vvp_page_make_ready()) for a page that the writing thread itself still holds. We can fix the problem by releasing the page in ll_write_end() if the page is already dirty. Patch is coming. |
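The self-deadlock described above can be illustrated with a small user-space model. This is only a sketch and is not Lustre or kernel code: `Page`, `writeback()` and `write_path()` are hypothetical names, a non-reentrant `threading.Lock` stands in for the page lock, and a non-blocking acquire is used so that the model reports the point where the real thread would sleep forever in `__lock_page()` instead of actually hanging.

```python
import threading


class Page:
    """Toy stand-in for a page-cache page (hypothetical, not a kernel structure)."""

    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant, like the page lock
        self.dirty = False


def writeback(page):
    """Model of the writeback path, which must lock the page before flushing it.

    Returns False at the point where the real thread would block in
    __lock_page(); the non-blocking acquire is an assumption of this model.
    """
    if not page.lock.acquire(blocking=False):
        return False
    try:
        page.dirty = False  # page written out
        return True
    finally:
        page.lock.release()


def write_path(page, release_if_dirty):
    """Model of one buffered write followed by dirty-page balancing.

    release_if_dirty=False mimics the reported behaviour: the page stays held
    after write_end so it can be queued for commit.
    release_if_dirty=True mimics the proposed fix: an already-dirty page is
    released before balancing runs.
    """
    page.lock.acquire()            # writer locks the page
    was_dirty = page.dirty
    page.dirty = True
    held = True                    # write_end returns with the page still held
    if release_if_dirty and was_dirty:
        page.lock.release()        # proposed fix: drop the page early
        held = False
    progressed = writeback(page)   # balancing runs on the same thread
    if held:
        page.lock.release()
    return progressed


if __name__ == "__main__":
    page = Page()
    page.dirty = True              # the case from the description: page already dirty
    print("without fix, writeback progresses:", write_path(page, False))
    print("with fix, writeback progresses:", write_path(page, True))
```

In this model the only difference between the two runs is whether the lock is dropped before `writeback()` is called from the same thread, which is the ordering the description identifies as the problem.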
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 29/Apr/14 ] |
|
The patch is at http://review.whamcloud.com/10149 |
| Comment by Peter Jones [ 14/May/14 ] |
|
Landed for 2.6 |
| Comment by Patrick Farrell (Inactive) [ 14/May/14 ] |
|
The patch for this issue removes the change made to fix Can you explain why this change means it's safe to do: for partial page writes? |
| Comment by Patrick Farrell (Inactive) [ 15/May/14 ] |
|
With Here's a sampling... We saw a small number of this stack trace: A similar number with this stack trace: A lot with this stack trace: And the largest number were stuck here: |
| Comment by Jinshan Xiong (Inactive) [ 15/May/14 ] |
|
Hi Patrick, I don't think the problem you've seen is related to the patch here. Is this the first time you have seen this issue? Please file a separate ticket. Jinshan |
| Comment by Patrick Farrell (Inactive) [ 15/May/14 ] |
|
Ah, sorry, the setattr_raw issue is a known one. For some reason that fix wasn't on this tree. The others are, as far as I can tell, new instances. I'm going to test again with the latest master... |