[LU-396] LustreError: 5663:0:(osc_io.c:322:osc_io_commit_write()) ASSERTION(to > 0) failed Created: 06/Jun/11  Updated: 05/Jun/12  Resolved: 05/Jun/12

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Major
Reporter: Prakash Surya (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Based on the RHEL6/CHAOS5 distro and 2.0.59 lustre tag


Severity: 3
Rank (Obsolete): 8535

 Description   

Failed assertion when running a 32 node, 64 client (2 per node), file per process IOR.

The relevant console output from the client node that crashed is below:

2011-06-03 15:23:28 LustreError: 5663:0:(osc_io.c:322:osc_io_commit_write()) ASSERTION(to > 0) failed
2011-06-03 15:23:28 LustreError: 5663:0:(osc_io.c:322:osc_io_commit_write()) LBUG
2011-06-03 15:23:28 Pid: 5663, comm: IOR
2011-06-03 15:23:28 Jun 3 15:23:28
2011-06-03 15:23:28 zwicky123 kernelCall Trace:
2011-06-03 15:23:28 : LustreError: 5 [<ffffffffa034d8b5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2011-06-03 15:23:28 663:0:(osc_io.c: [<ffffffffa034def5>] lbug_with_loc+0x75/0xe0 [libcfs]
2011-06-03 15:23:28 322:osc_io_commi [<ffffffffa0358886>] libcfs_assertion_failed+0x66/0x70 [libcfs]
2011-06-03 15:23:28 t_write()) ASSER [<ffffffffa06e6cb0>] osc_io_commit_write+0x190/0x1a0 [osc]
2011-06-03 15:23:28 TION(to > 0) fai [<ffffffffa07446ae>] ? lov_page_stripe+0x3e/0x150 [lov]
2011-06-03 15:23:28 led
2011-06-03 15:23:28 Jun 3 15:23 [<ffffffffa048df3f>] cl_io_commit_write+0xaf/0x1f0 [obdclass]
2011-06-03 15:23:28 :28 zwicky123 ke [<ffffffffa07475bd>] ? lov_page_subio+0xad/0x260 [lov]
2011-06-03 15:23:28 rnel: LustreErro [<ffffffffa047fa28>] ? cl_page_export+0x58/0x80 [obdclass]
2011-06-03 15:23:28 r: 5663:0:(osc_i [<ffffffffa0747817>] lov_io_commit_write+0xa7/0x1d0 [lov]
2011-06-03 15:23:28 o.c:322:osc_io_c [<ffffffffa047fd7e>] ? cl_page_at+0xe/0x10 [obdclass]
2011-06-03 15:23:28 ommit_write()) L [<ffffffffa048df3f>] cl_io_commit_write+0xaf/0x1f0 [obdclass]
2011-06-03 15:23:28 BUG
2011-06-03 15:23:28 [<ffffffffa047e909>] ? cl_env_get+0x29/0x350 [obdclass]
2011-06-03 15:23:28 [<ffffffffa07b46dd>] ll_commit_write+0xdd/0x2b0 [lustre]
2011-06-03 15:23:28 [<ffffffff814e4bfc>] ? bad_to_user+0x66/0x612
2011-06-03 15:23:28 [<ffffffffa07cd030>] ll_write_end+0x30/0x60 [lustre]
2011-06-03 15:23:28 [<ffffffff8110dc74>] generic_file_buffered_write+0x174/0x2a0
2011-06-03 15:23:28 [<ffffffff8106dd17>] ? current_fs_time+0x27/0x30
2011-06-03 15:23:28 [<ffffffff8110f560>] __generic_file_aio_write+0x250/0x480
2011-06-03 15:23:28 [<ffffffff8110f7ff>] generic_file_aio_write+0x6f/0xe0
2011-06-03 15:23:28 [<ffffffffa07dec31>] vvp_io_write_start+0xa1/0x270 [lustre]
2011-06-03 15:23:28 [<ffffffffa048b4b8>] cl_io_start+0x68/0x170 [obdclass]
2011-06-03 15:23:28 [<ffffffffa048f2f0>] cl_io_loop+0x110/0x1d0 [obdclass]
2011-06-03 15:23:28 [<ffffffffa035d422>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
2011-06-03 15:23:28 [<ffffffffa07878e3>] ll_file_io_generic+0x3b3/0x4e0 [lustre]
2011-06-03 15:23:28 [<ffffffffa035bf54>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
2011-06-03 15:23:28 [<ffffffffa047e909>] ? cl_env_get+0x29/0x350 [obdclass]
2011-06-03 15:23:28 [<ffffffffa078c95f>] ll_file_aio_write+0x13f/0x310 [lustre]
2011-06-03 15:23:28 [<ffffffffa047ea7e>] ? cl_env_get+0x19e/0x350 [obdclass]
2011-06-03 15:23:28 [<ffffffffa047cfdf>] ? cl_env_put+0x1af/0x2e0 [obdclass]
2011-06-03 15:23:28 [<ffffffffa078cca1>] ll_file_write+0x171/0x310 [lustre]
2011-06-03 15:23:28 [<ffffffff811725b8>] vfs_write+0xb8/0x1a0
2011-06-03 15:23:28 [<ffffffff81172ff1>] sys_write+0x51/0x90
2011-06-03 15:23:28 322:osc_io_commi [<ffffffffa0358886>] libcfs_assertion_failed+0x66/0x70 [libcfs]
2011-06-03 15:23:28 t_write()) ASSER [<ffffffffa06e6cb0>] osc_io_commit_write+0x190/0x1a0 [osc]
2011-06-03 15:23:28 TION(to > 0) fai [<ffffffffa07446ae>] ? lov_page_stripe+0x3e/0x150 [lov]
2011-06-03 15:23:28 led
2011-06-03 15:23:28 Jun 3 15:23 [<ffffffffa048df3f>] cl_io_commit_write+0xaf/0x1f0 [obdclass]
2011-06-03 15:23:28 :28 zwicky123 ke [<ffffffffa07475bd>] ? lov_page_subio+0xad/0x260 [lov]
2011-06-03 15:23:28 rnel: LustreErro [<ffffffffa047fa28>] ? cl_page_export+0x58/0x80 [obdclass]
2011-06-03 15:23:28 r: 5663:0:(osc_i [<ffffffffa0747817>] lov_io_commit_write+0xa7/0x1d0 [lov]
2011-06-03 15:23:28 o.c:322:osc_io_c [<ffffffffa047fd7e>] ? cl_page_at+0xe/0x10 [obdclass]
2011-06-03 15:23:28 ommit_write()) L [<ffffffffa048df3f>] cl_io_commit_write+0xaf/0x1f0 [obdclass]
2011-06-03 15:23:28 BUG
2011-06-03 15:23:28 [<ffffffffa047e909>] ? cl_env_get+0x29/0x350 [obdclass]
2011-06-03 15:23:28 [<ffffffffa07b46dd>] ll_commit_write+0xdd/0x2b0 [lustre]
2011-06-03 15:23:28 [<ffffffff814e4bfc>] ? bad_to_user+0x66/0x612
2011-06-03 15:23:28 [<ffffffffa07cd030>] ll_write_end+0x30/0x60 [lustre]
2011-06-03 15:23:28 [<ffffffff8110dc74>] generic_file_buffered_write+0x174/0x2a0
2011-06-03 15:23:28 [<ffffffff8106dd17>] ? current_fs_time+0x27/0x30
2011-06-03 15:23:28 [<ffffffff8110f560>] __generic_file_aio_write+0x250/0x480
2011-06-03 15:23:28 [<ffffffff8110f7ff>] generic_file_aio_write+0x6f/0xe0
2011-06-03 15:23:28 [<ffffffffa07dec31>] vvp_io_write_start+0xa1/0x270 [lustre]
2011-06-03 15:23:28 [<ffffffffa048b4b8>] cl_io_start+0x68/0x170 [obdclass]
2011-06-03 15:23:28 [<ffffffffa048f2f0>] cl_io_loop+0x110/0x1d0 [obdclass]
2011-06-03 15:23:28 [<ffffffffa035d422>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
2011-06-03 15:23:28 [<ffffffffa07878e3>] ll_file_io_generic+0x3b3/0x4e0 [lustre]
2011-06-03 15:23:28 [<ffffffffa035bf54>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
2011-06-03 15:23:28 [<ffffffffa047e909>] ? cl_env_get+0x29/0x350 [obdclass]
2011-06-03 15:23:28 [<ffffffffa078c95f>] ll_file_aio_write+0x13f/0x310 [lustre]
2011-06-03 15:23:28 [<ffffffffa047ea7e>] ? cl_env_get+0x19e/0x350 [obdclass]
2011-06-03 15:23:28 [<ffffffffa047cfdf>] ? cl_env_put+0x1af/0x2e0 [obdclass]
2011-06-03 15:23:28 [<ffffffffa078cca1>] ll_file_write+0x171/0x310 [lustre]
2011-06-03 15:23:28 [<ffffffff811725b8>] vfs_write+0xb8/0x1a0
2011-06-03 15:23:28 [<ffffffff81172ff1>] sys_write+0x51/0x90
2011-06-03 15:23:28 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
2011-06-03 15:23:28
2011-06-03 15:23:28 Kernel panic - not syncing: LBUG
2011-06-03 15:23:28 Pid: 5663, comm: IOR Tainted: G ---------------- T 2.6.32-131.0.15.2chaos.ch5.x86_64 #1
2011-06-03 15:23:28 Call Trace:
2011-06-03 15:23:28 Jun 3 15:23:28 [<ffffffff814da3fe>] ? panic+0x78/0x143
2011-06-03 15:23:28 zwicky123 kernel [<ffffffffa034df4b>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
2011-06-03 15:23:28 : Kernel panic - [<ffffffffa0358886>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
2011-06-03 15:23:28 not syncing: LB [<ffffffffa06e6cb0>] ? osc_io_commit_write+0x190/0x1a0 [osc]
2011-06-03 15:23:28 UG
2011-06-03 15:23:28 [<ffffffffa07446ae>] ? lov_page_stripe+0x3e/0x150 [lov]
2011-06-03 15:23:28 [<ffffffffa048df3f>] ? cl_io_commit_write+0xaf/0x1f0 [obdclass]
2011-06-03 15:23:28 [<ffffffffa07475bd>] ? lov_page_subio+0xad/0x260 [lov]
2011-06-03 15:23:28 [<ffffffffa047fa28>] ? cl_page_export+0x58/0x80 [obdclass]
2011-06-03 15:23:28 [<ffffffffa0747817>] ? lov_io_commit_write+0xa7/0x1d0 [lov]
2011-06-03 15:23:28 [<ffffffffa047fd7e>] ? cl_page_at+0xe/0x10 [obdclass]
2011-06-03 15:23:28 [<ffffffffa048df3f>] ? cl_io_commit_write+0xaf/0x1f0 [obdclass]
2011-06-03 15:23:28 [<ffffffffa047e909>] ? cl_env_get+0x29/0x350 [obdclass]
2011-06-03 15:23:28 [<ffffffffa07b46dd>] ? ll_commit_write+0xdd/0x2b0 [lustre]
2011-06-03 15:23:28 [<ffffffff814e4bfc>] ? bad_to_user+0x66/0x612
2011-06-03 15:23:28 [<ffffffffa07cd030>] ? ll_write_end+0x30/0x60 [lustre]
2011-06-03 15:23:28 [<ffffffff8110dc74>] ? generic_file_buffered_write+0x174/0x2a0
2011-06-03 15:23:28 [<ffffffff8106dd17>] ? current_fs_time+0x27/0x30
2011-06-03 15:23:28 [<ffffffff8110f560>] ? __generic_file_aio_write+0x250/0x480
2011-06-03 15:23:28 [<ffffffff8110f7ff>] ? generic_file_aio_write+0x6f/0xe0
2011-06-03 15:23:28 [<ffffffffa07dec31>] ? vvp_io_write_start+0xa1/0x270 [lustre]
2011-06-03 15:23:28 [<ffffffffa048b4b8>] ? cl_io_start+0x68/0x170 [obdclass]
2011-06-03 15:23:28 [<ffffffffa048f2f0>] ? cl_io_loop+0x110/0x1d0 [obdclass]
2011-06-03 15:23:28 [<ffffffffa035d422>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
2011-06-03 15:23:28 [<ffffffffa07878e3>] ? ll_file_io_generic+0x3b3/0x4e0 [lustre]
2011-06-03 15:23:28 [<ffffffffa035bf54>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
2011-06-03 15:23:28 [<ffffffffa047e909>] ? cl_env_get+0x29/0x350 [obdclass]
2011-06-03 15:23:28 [<ffffffffa078c95f>] ? ll_file_aio_write+0x13f/0x310 [lustre]
2011-06-03 15:23:28 [<ffffffffa047ea7e>] ? cl_env_get+0x19e/0x350 [obdclass]
2011-06-03 15:23:28 [<ffffffffa047cfdf>] ? cl_env_put+0x1af/0x2e0 [obdclass]
2011-06-03 15:23:28 [<ffffffffa078cca1>] ? ll_file_write+0x171/0x310 [lustre]
2011-06-03 15:23:28 [<ffffffff811725b8>] ? vfs_write+0xb8/0x1a0
2011-06-03 15:23:28 [<ffffffff81172ff1>] ? sys_write+0x51/0x90
2011-06-03 15:23:28 [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b



 Comments   
Comment by Oleg Drokin [ 06/Jun/11 ]

Jinshan, can you please take a look at this?

Comment by Lai Siyao [ 07/Jun/11 ]

Hi Jinshan, maybe it's associated with new aops write_begin/end, if so, please check 1.8 also. P.S. recently write_begin/end is enabled by default for RHEL5 kernels too.

Comment by Jinshan Xiong (Inactive) [ 07/Jun/11 ]

Hi Lai, Thank you for helping, I'll look into this.

Comment by Jinshan Xiong (Inactive) [ 10/Jun/11 ]

I can't figure out this issue. If you can reproduce this problem, can you please run it with debug enabled and post lustre log here? Thanks.

Comment by Prakash Surya (Inactive) [ 10/Jun/11 ]

Sure, I will try to reproduce the issue.

How do I ensure that debug in enabled? By issuing 'echo -1 > /proc/sys/lnet/debug'?

And the lustre log can be retrieved by 'lctl dk' right?

Comment by Jinshan Xiong (Inactive) [ 10/Jun/11 ]

Yes. lctl dk > log.txt, and send me the log.txt

Comment by Christopher Morrone [ 10/Jun/11 ]

He is not going to be able to log in and run lctl after the kernel panics.

I would suggest a debugging patch that prints to the console.

Comment by Christopher Morrone [ 10/Jun/11 ]

Oh, yeah, we have --enable-panic_dumplog set in our build, so if we make sure the path for the dump file is persistent storage, we may get logs that way.

Comment by Prakash Surya (Inactive) [ 10/Jun/11 ]

Where are the logs being dumped to? Can I query lustre or the system to get the path?

Comment by Jinshan Xiong (Inactive) [ 10/Jun/11 ]

echo 0 > /proc/sys/lnet/panic_on_lbug

to not let the system hang so you can dump the log.

Comment by Christopher Morrone [ 13/Jun/11 ]

/proc/sys/lnet/debug_path

Comment by Peter Jones [ 22/Jun/11 ]

LLNL agreed on the weekly 2.1 call that this should be dropped in priority until it reoccurred.

Comment by Niu Yawei (Inactive) [ 04/Jun/12 ]

Hi, Jingshan

Is the LASSERT(to > 0) in osc_io_commit_write() valid? From the kernel code, seems the 'copied' could be zero when something wrong happened during copy data into page cache, that will result in the to == 0 in osc_io_commit_write(). Did I miss anything?

Comment by Jinshan Xiong (Inactive) [ 05/Jun/12 ]

It won't come to this function if to is zero and everything else works fine. For example, in ll_commit_write(), it checks

        LASSERT(from <= to);
        if (from != to) /* handle short write case. */
                result = cl_io_commit_write(env, io, page, from, to);

I think it would be a good start to revise it LASSERTF(to > 0, "to is %d\n", to) and it will be really helpful to print out as much information as possible.

Comment by Niu Yawei (Inactive) [ 05/Jun/12 ]

This bug should have been fixed in LU-683.

Comment by Jinshan Xiong (Inactive) [ 05/Jun/12 ]

duplicate to LU-683

Generated at Sat Feb 10 01:06:36 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.