[LU-4503] Panic with lu_ref checks enabled Created: 17/Jan/14  Updated: 14/Aug/14  Resolved: 14/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.7.0

Type: Bug
Priority: Critical
Reporter: Alexey Lyashkov
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: MB
Environment: rhel6, latest master with lu_ref checks enabled


Severity: 3
Rank (Obsolete): 12316

 Description   

== sanity-benchmark test dbench: dbench == 13:07:36 (1389949656)
debug=0
running as uid/gid/euid/egid 500/500/500/500, groups:
[touch] [/mnt/lustre/d0_runas_test/f13997]
debug=0
running as uid/gid/euid/egid 500/500/500/500, groups:
[bash] [rundbench] [-D] [/mnt/lustre/d0.rhel6-64.shadowland] [6] [-t] [120]
copying /usr/local/share/client.txt to /mnt/lustre/d0.rhel6-64.shadowland/client.txt

Message from syslogd@rhel6-64 at Jan 17 13:07:37 ...
kernel:LustreError: 16666:0:(lu_ref.c:265:lu_ref_del()) ASSERTION( 0 ) failed:

Message from syslogd@rhel6-64 at Jan 17 13:07:37 ...
kernel:LustreError: 16666:0:(lu_ref.c:265:lu_ref_del()) LBUG

Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
LustreError: 16666:0:(lu_ref.c:105:lu_ref_print()) lu_ref: ffff88013997cee8 2 0 cl_page_alloc:197
LustreError: 16666:0:(lu_ref.c:107:lu_ref_print())      link: cl_io ffff880097136088
LustreError: 16666:0:(lu_ref.c:107:lu_ref_print())      link: transfer ffff88013997cde0
LustreError: 16666:0:(lu_ref.c:105:lu_ref_print()) lu_ref: ffff88008b045698 2 0 cl_page_alloc:197
LustreError: 16666:0:(lu_ref.c:265:lu_ref_del()) ASSERTION( 0 ) failed: 
LustreError: 16666:0:(lu_ref.c:265:lu_ref_del()) LBUG
Pid: 16666, comm: cp

Call Trace:
 [<ffffffffa03ea8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa03eaec7>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa05a78ed>] lu_ref_del+0x23d/0x240 [obdclass]
 [<ffffffffa0f9e826>] write_commit_callback+0x86/0xb0 [lustre]
 [<ffffffffa09fb26f>] osc_io_commit_async+0xaf/0x3b0 [osc]
 [<ffffffffa0f9e7a0>] ? write_commit_callback+0x0/0xb0 [lustre]
 [<ffffffffa0f9e7a0>] ? write_commit_callback+0x0/0xb0 [lustre]
 [<ffffffffa05a1dd6>] cl_io_commit_async+0x76/0x130 [obdclass]
 [<ffffffffa0a47917>] lov_io_commit_async+0x2d7/0x500 [lov]
 [<ffffffffa0f9e7a0>] ? write_commit_callback+0x0/0xb0 [lustre]
 [<ffffffffa0f9e7a0>] ? write_commit_callback+0x0/0xb0 [lustre]
 [<ffffffffa05a1dd6>] cl_io_commit_async+0x76/0x130 [obdclass]
 [<ffffffffa0fa0417>] vvp_io_write_commit+0x267/0x8c0 [lustre]
 [<ffffffffa059797b>] ? cl_page_get+0x2b/0x100 [obdclass]
 [<ffffffffa0f8ca3c>] ll_write_end+0xbc/0x3e0 [lustre]
 [<ffffffff81129e5a>] generic_file_buffered_write+0x18a/0x300
 [<ffffffff8153481b>] ? _spin_unlock+0x2b/0x40
 [<ffffffff8112bed0>] __generic_file_aio_write+0x260/0x490
 [<ffffffff8112c173>] ? generic_file_aio_write+0x73/0x100
 [<ffffffff8112c18a>] generic_file_aio_write+0x8a/0x100
 [<ffffffffa0fa0b4b>] vvp_io_write_start+0xdb/0x3d0 [lustre]
 [<ffffffffa05a1efa>] cl_io_start+0x6a/0x140 [obdclass]
 [<ffffffffa05a6024>] cl_io_loop+0xb4/0x1b0 [obdclass]
 [<ffffffffa0f42326>] ll_file_io_generic+0x2b6/0x710 [lustre]
 [<ffffffffa05958a9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa0f42ff2>] ll_file_aio_write+0x142/0x2c0 [lustre]
 [<ffffffffa0f432dc>] ll_file_write+0x16c/0x2a0 [lustre]
 [<ffffffff81196568>] vfs_write+0xb8/0x1a0
 [<ffffffff815342e8>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff81196e61>] sys_write+0x51/0x90
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Oleg Drokin [ 20/Jan/14 ]

I am also hitting this.

Apparently the problem now stems from http://review.whamcloud.com/7893, which does not release the pages to the writeback list but aggregates them to be freed by a later async callback that could be running from a different io context.

Comment by Jinshan Xiong (Inactive) [ 20/Jan/14 ]

The io parameter passed to lu_ref_del() should be converted to the top io, and that's all.
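
To illustrate the mismatch, here is a minimal, self-contained sketch of a reference tracker that matches each release against the (scope, source) pair recorded when the reference was taken. It is not the Lustre implementation: `ref_tracker`, `ref_add`, `ref_del`, `struct io` and `io_top` are hypothetical names made up for this example, and the assumption that the callback receives a sub io rather than the top io is inferred from the comments above, not confirmed by the log.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINKS 8

/* One tracked reference: a scope label plus the address of the holder. */
struct ref_link {
    const char *scope;
    const void *source;
    int         in_use;
};

/* Hypothetical stand-in for a per-object reference tracker. */
struct ref_tracker {
    struct ref_link links[MAX_LINKS];
};

/* Hypothetical io: a sub io points at its parent, the top io has none. */
struct io {
    struct io *parent;
};

/* Record that `source` holds a reference under `scope`. */
static void ref_add(struct ref_tracker *t, const char *scope,
                    const void *source)
{
    for (int i = 0; i < MAX_LINKS; i++) {
        if (!t->links[i].in_use) {
            t->links[i].scope = scope;
            t->links[i].source = source;
            t->links[i].in_use = 1;
            return;
        }
    }
    assert(!"tracker is full");
}

/*
 * Drop the reference recorded for (scope, source).
 * Returns 0 on success and -1 when no matching link exists; a strict
 * checker would treat the -1 case as a fatal assertion instead.
 */
static int ref_del(struct ref_tracker *t, const char *scope,
                   const void *source)
{
    for (int i = 0; i < MAX_LINKS; i++) {
        struct ref_link *l = &t->links[i];

        if (l->in_use && l->source == source &&
            strcmp(l->scope, scope) == 0) {
            l->in_use = 0;
            return 0;
        }
    }
    return -1;
}

/* Walk the parent chain up to the top-level io. */
static struct io *io_top(struct io *io)
{
    while (io->parent != NULL)
        io = io->parent;
    return io;
}
```

In this model, a reference taken with the top io as its source cannot be released by passing a sub io, because the source pointers differ; passing `io_top(sub)` finds the recorded link and the release succeeds.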

Comment by Zhenyu Xu [ 23/Jan/14 ]

Patch tracking at http://review.whamcloud.com/8970

Comment by Alexey Lyashkov [ 23/Jan/14 ]

Did you run acc-sm with lu_refcheck enabled, or did you just fix that particular issue?

Comment by Zhenyu Xu [ 23/Jan/14 ]

Did you find more issues? I've run some tests with enable-lu_ref.

Comment by Alexey Lyashkov [ 23/Jan/14 ]

I'm just asking before I start my own stress testing. I hit this bug with a simple run, and I don't want to have to stop testing and looking for other clio bugs (such as a missing mutex lock) because of a panic in the next simple test. Also, as far as I know, Maloo doesn't run tests with lu_refcheck enabled, so any bugs in that area will not be found during automatic testing, and acc-sm needs to be run by hand if we need verification with lu_refcheck.

Comment by Zhenyu Xu [ 23/Jan/14 ]

OK, thank you for the explanation. I've run some tests and am still running others; I haven't finished acc-sm yet.

Comment by Alexey Lyashkov [ 23/Jan/14 ]

Thanks for testing.

Comment by Oleg Drokin [ 23/Jan/14 ]

I hit more problems with lu_ref checking enabled in my testing, but they are mostly in mdc now and unrelated to clio.

Comment by Alexey Lyashkov [ 24/Jan/14 ]

Oleg,

What do you think about enabling lu_ref checks and invariants for Maloo testing? I think it would be good for code quality.

Comment by Oleg Drokin [ 04/Feb/14 ]

Yes, I totally agree we need to have at least some portion of the runs with debug enabled.

Comment by Zhenyu Xu [ 14/Aug/14 ]

Patch landed for 2.7 (master branch).
