[LU-4540] Test failure sanity-quota test_8: dbench hung in vvp_page_assume Created: 24/Jan/14  Updated: 03/Mar/14  Resolved: 06/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: MB

Issue Links:
Duplicate
duplicates LU-4561 threads stuck waiting on page bit Closed
Severity: 3
Rank (Obsolete): 12409

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ffb7e79e-83c2-11e3-bedf-52540035b04c.

The sub-test test_8 failed with the following error:

test failed to respond and timed out

Info required for matching: sanity-quota 8

Client console:

04:28:33:dbench        D 0000000000000001     0 29279  29278 0x00000080
04:28:33: ffff88007cfd58e8 0000000000000082 00000002ffffffff 000061b9fbe152ee
04:28:33: ffffffffffffffff ffff880037c4f280 00000000015e9fd8 ffffffffad20e0cc
04:28:33: ffff8800668bbab8 ffff88007cfd5fd8 000000000000fb88 ffff8800668bbab8
04:28:33:Call Trace:
04:28:33: [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
04:28:33: [<ffffffff81119e10>] ? sync_page+0x0/0x50
04:28:33: [<ffffffff8150e953>] io_schedule+0x73/0xc0
04:28:33: [<ffffffff81119e4d>] sync_page+0x3d/0x50
04:28:33: [<ffffffff8150f30f>] __wait_on_bit+0x5f/0x90
04:28:33: [<ffffffff8111a083>] wait_on_page_bit+0x73/0x80
04:28:33: [<ffffffff81096de0>] ? wake_bit_function+0x0/0x50
04:28:33: [<ffffffffa1602c65>] vvp_page_assume+0x35/0xa0 [lustre]
04:28:33: [<ffffffffa0ffcc88>] cl_page_invoid+0x68/0x160 [obdclass]
04:28:33: [<ffffffffa0fff1a6>] cl_page_assume+0x56/0x220 [obdclass]
04:28:33: [<ffffffffa15f0a08>] ll_write_begin+0xf8/0x740 [lustre]
04:28:33: [<ffffffff8111a7b3>] generic_file_buffered_write+0x123/0x2e0
04:28:33: [<ffffffff81075887>] ? current_fs_time+0x27/0x30
04:28:33: [<ffffffff8111c210>] __generic_file_aio_write+0x260/0x490
04:28:33: [<ffffffffa0e7fa81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
04:28:33: [<ffffffff8111c4c8>] generic_file_aio_write+0x88/0x100
04:28:33: [<ffffffffa160652b>] vvp_io_write_start+0xdb/0x3d0 [lustre]
04:28:33: [<ffffffffa1007c9a>] cl_io_start+0x6a/0x140 [obdclass]
04:28:33: [<ffffffffa100bdf4>] cl_io_loop+0xb4/0x1b0 [obdclass]
04:28:33: [<ffffffffa15a5c96>] ll_file_io_generic+0x2b6/0x710 [lustre]
04:28:33: [<ffffffffa0ffbd69>] ? cl_env_get+0x29/0x350 [obdclass]
04:28:33: [<ffffffffa15a6962>] ll_file_aio_write+0x142/0x2c0 [lustre]
04:28:33: [<ffffffffa15a6c4c>] ll_file_write+0x16c/0x2a0 [lustre]
04:28:33: [<ffffffff81181398>] vfs_write+0xb8/0x1a0
04:28:33: [<ffffffff81181d52>] sys_pwrite64+0x82/0xa0
04:28:33: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Oleg Drokin [ 28/Jan/14 ]

This is probably related to lockups I see in my testing that Jinshan thinks were introduced with a stream of recent clio changes.

I have a patch in testing that we'll upload somewhere once complete.

Comment by Jinshan Xiong (Inactive) [ 28/Jan/14 ]

The patch is at: http://review.whamcloud.com/9036

Comment by Jinshan Xiong (Inactive) [ 06/Feb/14 ]

The patch has landed.

Comment by Cory Spitz [ 06/Feb/14 ]

The landed fix didn't make clear which change introduced the regression. Does someone have that handy?

Comment by Jinshan Xiong (Inactive) [ 06/Feb/14 ]

This issue was introduced by LU-3321. The writing thread holds the page lock of page A and then waits for the writeback bit of page B to clear; meanwhile, a ptlrpc thread that is trying to send out a write request has set the writeback bit of page B and is now trying to lock page A to set its writeback bit. This is a deadlock.

In the patch, the writing thread checks whether the page has the writeback bit set before trying to dirty it again.
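
The circular wait described in this comment can be modelled as a wait-for graph. The sketch below is illustrative only and is not Lustre code: the thread names, the resource labels "lock(A)" and "writeback(B)", and the helper find_deadlock() are all hypothetical, and treating a set writeback bit as a resource "held" by the ptlrpc thread is a simplifying assumption. The second scenario only shows that the cycle disappears if the writer does not hold page A's lock while it waits; it does not claim to reproduce the change in the landed patch.

```python
from typing import Dict, List, Optional, Set


def find_deadlock(holds: Dict[str, Set[str]],
                  waits: Dict[str, Optional[str]]) -> List[str]:
    """Return the threads that form a wait-for cycle, or [] if there is none."""
    # Map each resource to the thread that currently holds it.
    owner = {res: t for t, resources in holds.items() for res in resources}
    for start in holds:
        path: List[str] = []
        t: Optional[str] = start
        # Follow "t waits for a resource held by the next thread" edges.
        while t is not None and t not in path:
            path.append(t)
            t = owner.get(waits.get(t))
        if t is not None:
            # We came back to a thread already on the path: a cycle.
            return path[path.index(t):]
    return []


# What each thread is blocked on (hypothetical labels for pages A and B).
waits = {"writer": "writeback(B)", "ptlrpc": "lock(A)"}

# Scenario from the comment: the writer holds A's lock while waiting on B.
deadlocked_holds = {"writer": {"lock(A)"}, "ptlrpc": {"writeback(B)"}}

# Hypothetical contrast: the writer waits on B without holding A's lock.
safe_holds = {"writer": set(), "ptlrpc": {"writeback(B)"}}

cycle = find_deadlock(deadlocked_holds, waits)
no_cycle = find_deadlock(safe_holds, waits)
```

In the first scenario each thread waits for something the other one holds, so the helper reports both threads as a cycle; in the second, the ptlrpc thread's wait for lock(A) points at no owner, so no cycle is found.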

Comment by Li Xi (Inactive) [ 24/Feb/14 ]

Hi Jinshan,

Do you think this problem can happen in other functions, e.g. ll_read_ahead_pages? ll_read_ahead_pages also calls cl_page_assume() while holding the locks of other pages.

Thank you!

Comment by Jinshan Xiong (Inactive) [ 24/Feb/14 ]

Hi Li Xi,

I don't think it can happen in ll_read_ahead_pages() because this problem is caused by holding one page's lock while waiting for another page. This is not the case for readahead.
