[LU-4503] Panic with lu_ref checks enabled Created: 17/Jan/14 Updated: 14/Aug/14 Resolved: 14/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB | ||
| Environment: |
rhel6 last master with lu_ref checks enabled |
||
| Severity: | 3 |
| Rank (Obsolete): | 12316 |
| Description |
|
== sanity-benchmark test dbench: dbench == 13:07:36 (1389949656)
Message from syslogd@rhel6-64 at Jan 17 13:07:37 ...
Message from syslogd@rhel6-64 at Jan 17 13:07:37 ...
Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
LustreError: 16666:0:(lu_ref.c:105:lu_ref_print()) lu_ref: ffff88013997cee8 2 0 cl_page_alloc:197
LustreError: 16666:0:(lu_ref.c:107:lu_ref_print()) link: cl_io ffff880097136088
LustreError: 16666:0:(lu_ref.c:107:lu_ref_print()) link: transfer ffff88013997cde0
LustreError: 16666:0:(lu_ref.c:105:lu_ref_print()) lu_ref: ffff88008b045698 2 0 cl_page_alloc:197
LustreError: 16666:0:(lu_ref.c:265:lu_ref_del()) ASSERTION( 0 ) failed:
LustreError: 16666:0:(lu_ref.c:265:lu_ref_del()) LBUG
Pid: 16666, comm: cp
Call Trace:
[<ffffffffa03ea8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa03eaec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa05a78ed>] lu_ref_del+0x23d/0x240 [obdclass]
[<ffffffffa0f9e826>] write_commit_callback+0x86/0xb0 [lustre]
[<ffffffffa09fb26f>] osc_io_commit_async+0xaf/0x3b0 [osc]
[<ffffffffa0f9e7a0>] ? write_commit_callback+0x0/0xb0 [lustre]
[<ffffffffa0f9e7a0>] ? write_commit_callback+0x0/0xb0 [lustre]
[<ffffffffa05a1dd6>] cl_io_commit_async+0x76/0x130 [obdclass]
[<ffffffffa0a47917>] lov_io_commit_async+0x2d7/0x500 [lov]
[<ffffffffa0f9e7a0>] ? write_commit_callback+0x0/0xb0 [lustre]
[<ffffffffa0f9e7a0>] ? write_commit_callback+0x0/0xb0 [lustre]
[<ffffffffa05a1dd6>] cl_io_commit_async+0x76/0x130 [obdclass]
[<ffffffffa0fa0417>] vvp_io_write_commit+0x267/0x8c0 [lustre]
[<ffffffffa059797b>] ? cl_page_get+0x2b/0x100 [obdclass]
[<ffffffffa0f8ca3c>] ll_write_end+0xbc/0x3e0 [lustre]
[<ffffffff81129e5a>] generic_file_buffered_write+0x18a/0x300
[<ffffffff8153481b>] ? _spin_unlock+0x2b/0x40
[<ffffffff8112bed0>] __generic_file_aio_write+0x260/0x490
[<ffffffff8112c173>] ? generic_file_aio_write+0x73/0x100
[<ffffffff8112c18a>] generic_file_aio_write+0x8a/0x100
[<ffffffffa0fa0b4b>] vvp_io_write_start+0xdb/0x3d0 [lustre]
[<ffffffffa05a1efa>] cl_io_start+0x6a/0x140 [obdclass]
[<ffffffffa05a6024>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa0f42326>] ll_file_io_generic+0x2b6/0x710 [lustre]
[<ffffffffa05958a9>] ? cl_env_get+0x29/0x350 [obdclass]
[<ffffffffa0f42ff2>] ll_file_aio_write+0x142/0x2c0 [lustre]
[<ffffffffa0f432dc>] ll_file_write+0x16c/0x2a0 [lustre]
[<ffffffff81196568>] vfs_write+0xb8/0x1a0
[<ffffffff815342e8>] ? lockdep_sys_exit_thunk+0x35/0x67
[<ffffffff81196e61>] sys_write+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comments |
| Comment by Oleg Drokin [ 20/Jan/14 ] |
|
I am also hitting this. Apparently the problem now stems from http://review.whamcloud.com/7893, which does not release the page to the writeback list, but aggregates pages to be freed by a later async callback that could be running from a different io context. |
| Comment by Jinshan Xiong (Inactive) [ 20/Jan/14 ] |
|
The io parameter for lu_ref_del() should be converted to the top io, and that's all. |
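
To illustrate the mismatch discussed above, here is a minimal toy sketch in plain C. It is not Lustre code: `toy_ref`, `toy_io`, `toy_io_top()`, `toy_ref_add()`, `toy_ref_del()` and `toy_demo()` are all hypothetical names invented for this example. The only assumption carried over from the log is that a tracked reference is recorded as a (scope, source) pair such as `cl_io ffff880097136088`, and that removing it requires the same pair. If the reference was taken against the top io and the callback later tries to drop it using a sub-io pointer, the lookup fails; the real check raises an LBUG at that point, while the toy returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_LINKS 8

/* Toy stand-in for one tracked reference: a (scope, source) pair. */
struct toy_link {
	const char *scope;
	const void *source;
	int used;
};

/* Toy stand-in for the per-object reference tracker. */
struct toy_ref {
	struct toy_link links[TOY_MAX_LINKS];
};

/* Hypothetical io object: a sub-io points at its parent, a top io has none. */
struct toy_io {
	struct toy_io *parent;
};

/* Walk up to the top-level io. */
static struct toy_io *toy_io_top(struct toy_io *io)
{
	while (io->parent != NULL)
		io = io->parent;
	return io;
}

/* Record a reference; returns 0 on success, -1 if the table is full. */
static int toy_ref_add(struct toy_ref *ref, const char *scope,
		       const void *source)
{
	for (int i = 0; i < TOY_MAX_LINKS; i++) {
		if (!ref->links[i].used) {
			ref->links[i].scope = scope;
			ref->links[i].source = source;
			ref->links[i].used = 1;
			return 0;
		}
	}
	return -1;
}

/*
 * Drop a reference. Both scope and source must match a recorded link.
 * Returns -1 when no such link exists (the toy equivalent of the
 * assertion seen in the log).
 */
static int toy_ref_del(struct toy_ref *ref, const char *scope,
		       const void *source)
{
	for (int i = 0; i < TOY_MAX_LINKS; i++) {
		struct toy_link *l = &ref->links[i];

		if (l->used && l->source == source &&
		    strcmp(l->scope, scope) == 0) {
			l->used = 0;
			return 0;
		}
	}
	return -1;
}

/* Returns 0 if the mismatch is detected and the top-io deletion works. */
static int toy_demo(void)
{
	struct toy_io top = { NULL };
	struct toy_io sub = { &top };
	struct toy_ref page_ref;

	memset(&page_ref, 0, sizeof(page_ref));

	/* The reference is taken against the top io. */
	if (toy_ref_add(&page_ref, "cl_io", &top) != 0)
		return 1;

	/* Deleting with the sub-io pointer does not match any link. */
	if (toy_ref_del(&page_ref, "cl_io", &sub) == 0)
		return 2;

	/* Converting to the top io first makes the deletion match. */
	return toy_ref_del(&page_ref, "cl_io", toy_io_top(&sub));
}
```

Calling `toy_demo()` from a test harness exercises both paths: the mismatched deletion is rejected, and the deletion through `toy_io_top()` succeeds. The actual fix is in the patch tracked on this ticket and may differ in detail from this sketch.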
| Comment by Zhenyu Xu [ 23/Jan/14 ] |
|
The patch is tracked at http://review.whamcloud.com/8970 |
| Comment by Alexey Lyashkov [ 23/Jan/14 ] |
|
Did you run acc-sm with lu_ref checks enabled, or did you just fix that particular issue? |
| Comment by Zhenyu Xu [ 23/Jan/14 ] |
|
Did you find more issues? I've run some tests with enable-lu_ref. |
| Comment by Alexey Lyashkov [ 23/Jan/14 ] |
|
I am just asking before I start my own stress testing. I hit this bug with a simple run, and I would not like to have to stop testing and looking for other clio bugs (like a missing mutex lock) after a panic in the next simple test. Also, as far as I know, Maloo doesn't run tests with lu_ref checks enabled, so any bugs in that area will not be found during automatic testing, and acc-sm needs to be run by hand if we need verification with lu_ref checks. |
| Comment by Zhenyu Xu [ 23/Jan/14 ] |
|
OK, thank you for the explanation. I've run some tests and am still running others; I haven't finished acc-sm yet. |
| Comment by Alexey Lyashkov [ 23/Jan/14 ] |
|
Thanks for testing. |
| Comment by Oleg Drokin [ 23/Jan/14 ] |
|
I hit more problems with lu_ref checking enabled in my testing but mostly in mdc now, unrelated to clio |
| Comment by Alexey Lyashkov [ 24/Jan/14 ] |
|
Oleg, what do you think about enabling lu_ref checks and invariants for Maloo testing? I think it would be good for code quality. |
| Comment by Oleg Drokin [ 04/Feb/14 ] |
|
Yes, I totally agree we need to have at least some portion of the runs with debug enabled. |
| Comment by Zhenyu Xu [ 14/Aug/14 ] |
|
patch landed for 2.7 (master branch) |