[LU-4540] Test failure sanity-quota test_8: dbench hung in vvp_page_assume Created: 24/Jan/14 Updated: 03/Mar/14 Resolved: 06/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 12409 | ||||||||
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>.
This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ffb7e79e-83c2-11e3-bedf-52540035b04c.
The sub-test test_8 failed with the following error:
Info required for matching: sanity-quota 8
Client console:
04:28:33:dbench D 0000000000000001 0 29279 29278 0x00000080
04:28:33: ffff88007cfd58e8 0000000000000082 00000002ffffffff 000061b9fbe152ee
04:28:33: ffffffffffffffff ffff880037c4f280 00000000015e9fd8 ffffffffad20e0cc
04:28:33: ffff8800668bbab8 ffff88007cfd5fd8 000000000000fb88 ffff8800668bbab8
04:28:33:Call Trace:
04:28:33: [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
04:28:33: [<ffffffff81119e10>] ? sync_page+0x0/0x50
04:28:33: [<ffffffff8150e953>] io_schedule+0x73/0xc0
04:28:33: [<ffffffff81119e4d>] sync_page+0x3d/0x50
04:28:33: [<ffffffff8150f30f>] __wait_on_bit+0x5f/0x90
04:28:33: [<ffffffff8111a083>] wait_on_page_bit+0x73/0x80
04:28:33: [<ffffffff81096de0>] ? wake_bit_function+0x0/0x50
04:28:33: [<ffffffffa1602c65>] vvp_page_assume+0x35/0xa0 [lustre]
04:28:33: [<ffffffffa0ffcc88>] cl_page_invoid+0x68/0x160 [obdclass]
04:28:33: [<ffffffffa0fff1a6>] cl_page_assume+0x56/0x220 [obdclass]
04:28:33: [<ffffffffa15f0a08>] ll_write_begin+0xf8/0x740 [lustre]
04:28:33: [<ffffffff8111a7b3>] generic_file_buffered_write+0x123/0x2e0
04:28:33: [<ffffffff81075887>] ? current_fs_time+0x27/0x30
04:28:33: [<ffffffff8111c210>] __generic_file_aio_write+0x260/0x490
04:28:33: [<ffffffffa0e7fa81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
04:28:33: [<ffffffff8111c4c8>] generic_file_aio_write+0x88/0x100
04:28:33: [<ffffffffa160652b>] vvp_io_write_start+0xdb/0x3d0 [lustre]
04:28:33: [<ffffffffa1007c9a>] cl_io_start+0x6a/0x140 [obdclass]
04:28:33: [<ffffffffa100bdf4>] cl_io_loop+0xb4/0x1b0 [obdclass]
04:28:33: [<ffffffffa15a5c96>] ll_file_io_generic+0x2b6/0x710 [lustre]
04:28:33: [<ffffffffa0ffbd69>] ? cl_env_get+0x29/0x350 [obdclass]
04:28:33: [<ffffffffa15a6962>] ll_file_aio_write+0x142/0x2c0 [lustre]
04:28:33: [<ffffffffa15a6c4c>] ll_file_write+0x16c/0x2a0 [lustre]
04:28:33: [<ffffffff81181398>] vfs_write+0xb8/0x1a0
04:28:33: [<ffffffff81181d52>] sys_pwrite64+0x82/0xa0
04:28:33: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comments |
| Comment by Oleg Drokin [ 28/Jan/14 ] |
|
This is probably related to lockups I see in my testing that Jinshan thinks were introduced with a stream of recent clio changes. I have a patch in testing that we'll upload somewhere once complete. |
| Comment by Jinshan Xiong (Inactive) [ 28/Jan/14 ] |
|
patch is at: http://review.whamcloud.com/9036 |
| Comment by Jinshan Xiong (Inactive) [ 06/Feb/14 ] |
|
patch landed |
| Comment by Cory Spitz [ 06/Feb/14 ] |
|
The landed fix didn't make clear which change introduced the regression. Does someone have that handy? |
| Comment by Jinshan Xiong (Inactive) [ 06/Feb/14 ] |
|
This issue was introduced by
In the patch, the writing thread checks whether the page has the WriteBack bit set before trying to dirty it again. |
| Comment by Li Xi (Inactive) [ 24/Feb/14 ] |
|
Hi Jinshan, Do you think this problem can also happen in other functions, e.g. ll_read_ahead_pages()? ll_read_ahead_pages() also calls cl_page_assume() while holding the locks of other pages. Thank you! |
| Comment by Jinshan Xiong (Inactive) [ 24/Feb/14 ] |
|
Hi Li Xi, I don't think it can happen in ll_read_ahead_pages() because this problem is introduced by holding a page lock and waiting for another page. This is not the case for read ahead. |
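The hold-and-wait pattern described in this thread (sleeping on one page while holding the lock of another) can be illustrated with a userspace toy model. This is only a sketch under stated assumptions, not Lustre code and not the content of the landed patch: `Page`, `flusher`, and `writer` are hypothetical names, the assumption that writeback completion needs the held page's lock is made up for the demonstration, and a timeout stands in for what would be an indefinite hang in the kernel.

```python
import threading


class Page:
    """Toy stand-in for a cached page: a lock plus a writeback flag."""

    def __init__(self, index):
        self.index = index
        self.lock = threading.Lock()
        # Event is set when the page is NOT under writeback.
        self.writeback_done = threading.Event()
        self.writeback_done.set()

    def under_writeback(self):
        return not self.writeback_done.is_set()


def flusher(held, target):
    """Hypothetical completion path: assumed to need held.lock
    before it can finish writeback of target."""
    with held.lock:
        target.writeback_done.set()


def writer(held, target, drop_lock_before_wait, timeout=0.5):
    """Return True if target's writeback finished within timeout."""
    held.lock.acquire()
    # Start the flusher only after the lock is taken so the
    # interleaving is the same on every run.
    t = threading.Thread(target=flusher, args=(held, target))
    t.start()
    try:
        if drop_lock_before_wait and target.under_writeback():
            # Check the flag first and never sleep while holding the lock.
            held.lock.release()
            finished = target.writeback_done.wait(timeout)
            held.lock.acquire()
        else:
            # Hazardous order: sleep while still holding held.lock.
            finished = target.writeback_done.wait(timeout)
    finally:
        held.lock.release()
    t.join()
    return finished


if __name__ == "__main__":
    for drop in (False, True):
        a, b = Page(0), Page(1)
        b.writeback_done.clear()  # page 1 starts under writeback
        print("drop_lock_before_wait =", drop, "finished =", writer(a, b, drop))
```

In this model the wait only completes when the lock is dropped before sleeping. A path that never waits on another page while holding a page lock, as Jinshan describes for read-ahead, does not enter the hazardous branch at all.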