[LU-683] LBUG: (osc_io.c:336:osc_io_commit_write()) ASSERTION(to > 0) failed Created: 14/Sep/11 Updated: 05/Jun/12 Resolved: 05/Jun/12 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jian Yu | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Branch: master
MGS/MDS Node: fat-amd-1-ib
OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active)
Client Nodes: client-[1,2,4,5,12,13,15],fat-intel-4
Network Addresses: |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4902 |
| Description |
|
While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

<~snip~>
==== Checking the clients loads AFTER failover -- failure NOT OK
Client load failed on node client-2-ib, rc=1
Client load failed during failover. Exiting
2011-09-13 22:12:19 Terminating clients loads ...
Duration:               7200
Server failover period: 600 seconds
Exited after:           4991 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 1 times
ost2: 1 times
ost3: 0 times
ost4: 0 times
ost5: 1 times
ost6: 0 times
ost7: 1 times
Status: FAIL: rc=5

Client node client-2-ib hit the following LBUG:

LustreError: 32289:0:(osc_io.c:336:osc_io_commit_write()) ASSERTION(to > 0) failed
LustreError: 32289:0:(osc_io.c:336:osc_io_commit_write()) LBUG
Pid: 32289, comm: dd

Call Trace:
 [<ffffffffa03b9855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa03b9e95>] lbug_with_loc+0x75/0xe0 [libcfs]
 [<ffffffffa03c4d76>] libcfs_assertion_failed+0x66/0x70 [libcfs]
 [<ffffffffa079dcc0>] osc_io_commit_write+0x1d0/0x1e0 [osc]
 [<ffffffff8119b6cf>] ? __mark_inode_dirty+0x13f/0x160
 [<ffffffffa04e241f>] cl_io_commit_write+0xaf/0x1f0 [obdclass]
 [<ffffffffa07fe6dd>] ? lov_page_subio+0xad/0x260 [lov]
 [<ffffffffa04d37d8>] ? cl_page_export+0x58/0x80 [obdclass]
 [<ffffffffa07fe937>] lov_io_commit_write+0xa7/0x1d0 [lov]
 [<ffffffffa04e241f>] cl_io_commit_write+0xaf/0x1f0 [obdclass]
 [<ffffffffa04d26b9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa0871e0d>] ll_commit_write+0xdd/0x2b0 [lustre]
 [<ffffffff814e4d3c>] ? bad_to_user+0x66/0x612
 [<ffffffffa08894a0>] ll_write_end+0x30/0x60 [lustre]
 [<ffffffff8110dc84>] generic_file_buffered_write+0x174/0x2a0
 [<ffffffff8106dd17>] ? current_fs_time+0x27/0x30
 [<ffffffff8110f570>] __generic_file_aio_write+0x250/0x480
 [<ffffffff8110f80f>] generic_file_aio_write+0x6f/0xe0
 [<ffffffffa089b001>] vvp_io_write_start+0xa1/0x270 [lustre]
 [<ffffffffa04df318>] cl_io_start+0x68/0x170 [obdclass]
 [<ffffffffa04e37d0>] cl_io_loop+0x110/0x1c0 [obdclass]
 [<ffffffffa084291b>] ll_file_io_generic+0x44b/0x580 [lustre]
 [<ffffffffa03c8424>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
 [<ffffffffa04d26b9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa0842b8f>] ll_file_aio_write+0x13f/0x310 [lustre]
 [<ffffffffa04d282e>] ? cl_env_get+0x19e/0x350 [obdclass]
 [<ffffffffa04d0d8f>] ? cl_env_put+0x1af/0x2e0 [obdclass]
 [<ffffffffa0849301>] ll_file_write+0x171/0x310 [lustre]
 [<ffffffff8126e781>] ? __clear_user+0x21/0x70
 [<ffffffff81172748>] vfs_write+0xb8/0x1a0
 [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81173181>] sys_write+0x51/0x90
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

Kernel panic - not syncing: LBUG
Pid: 32289, comm: dd Tainted: G ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
Call Trace:
 [<ffffffff814da518>] ? panic+0x78/0x143
 [<ffffffffa03b9eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
 [<ffffffffa03c4d76>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
 [<ffffffffa079dcc0>] ? osc_io_commit_write+0x1d0/0x1e0 [osc]
 [<ffffffff8119b6cf>] ? __mark_inode_dirty+0x13f/0x160
 [<ffffffffa04e241f>] ? cl_io_commit_write+0xaf/0x1f0 [obdclass]
 [<ffffffffa07fe6dd>] ? lov_page_subio+0xad/0x260 [lov]
 [<ffffffffa04d37d8>] ? cl_page_export+0x58/0x80 [obdclass]
 [<ffffffffa07fe937>] ? lov_io_commit_write+0xa7/0x1d0 [lov]
 [<ffffffffa04e241f>] ? cl_io_commit_write+0xaf/0x1f0 [obdclass]
 [<ffffffffa04d26b9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa0871e0d>] ? ll_commit_write+0xdd/0x2b0 [lustre]
 [<ffffffff814e4d3c>] ? bad_to_user+0x66/0x612
 [<ffffffffa08894a0>] ? ll_write_end+0x30/0x60 [lustre]
 [<ffffffff8110dc84>] ? generic_file_buffered_write+0x174/0x2a0
 [<ffffffff8106dd17>] ? current_fs_time+0x27/0x30
 [<ffffffff8110f570>] ? __generic_file_aio_write+0x250/0x480
 [<ffffffff8110f80f>] ? generic_file_aio_write+0x6f/0xe0
 [<ffffffffa089b001>] ? vvp_io_write_start+0xa1/0x270 [lustre]
 [<ffffffffa04df318>] ? cl_io_start+0x68/0x170 [obdclass]
 [<ffffffffa04e37d0>] ? cl_io_loop+0x110/0x1c0 [obdclass]
 [<ffffffffa084291b>] ? ll_file_io_generic+0x44b/0x580 [lustre]
 [<ffffffffa03c8424>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
 [<ffffffffa04d26b9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa0842b8f>] ? ll_file_aio_write+0x13f/0x310 [lustre]
 [<ffffffffa04d282e>] ? cl_env_get+0x19e/0x350 [obdclass]
 [<ffffffffa04d0d8f>] ? cl_env_put+0x1af/0x2e0 [obdclass]
 [<ffffffffa0849301>] ? ll_file_write+0x171/0x310 [lustre]
 [<ffffffff8126e781>] ? __clear_user+0x21/0x70
 [<ffffffff81172748>] ? vfs_write+0xb8/0x1a0
 [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81173181>] ? sys_write+0x51/0x90
 [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b

Maloo report: https://maloo.whamcloud.com/test_sets/943d6cda-de9e-11e0-9909-52540025f9af

Please refer to the attached recovery-oss-scale.1315977145.log.tar.bz2 for more logs. |
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 15/Sep/11 ] |
|
Based on the kernel implementation, the variable `to' in osc_io_commit_write() can legitimately be zero. I'm cooking a patch. |
| Comment by Jinshan Xiong (Inactive) [ 15/Sep/11 ] |
|
The patch is at http://review.whamcloud.com/1383. We really need to fix our testing system, because no Lustre log was captured on the client side. |
| Comment by Jian Yu [ 15/Sep/11 ] |
Lustre Build: http://build.whamcloud.com/job/lustre-reviews/2204/

Running recovery-mds-scale with FLAVOR=OSS failed as follows:

<~snip~>
==== Checking the clients loads AFTER failover -- failure NOT OK
Client load failed on node client-12-ib, rc=1
Client load failed during failover. Exiting
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-2-ib
Client load failed on node client-2-ib
client client-2-ib load stdout and debug files :
/tmp/recovery-mds-scale.log_run_dd.sh-client-2-ib
/tmp/recovery-mds-scale.log_run_dd.sh-client-2-ib.debug
2011-09-15 04:08:07 Terminating clients loads ...
Duration: 7200
Server failover period: 900 seconds
Exited after: 3509 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 1 times
ost6: 1 times
ost7: 0 times
Status: FAIL: rc=1
/tmp/recovery-mds-scale.log_run_dd.sh-client-2-ib.debug:

<~snip~>
2011-09-15 04:04:43: dd succeeded
+ cd /tmp
+ rm -rf /mnt/lustre/d0.dd-client-2-ib
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-15 04:04:45: dd run finished'
+ echo '2011-09-15 04:04:45: dd run finished'
2011-09-15 04:04:45: dd run finished
+ '[' '!' -e /home/yujian/test_logs/end_run_file ']'
+ true
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-15 04:04:45: dd run starting'
+ echo '2011-09-15 04:04:45: dd run starting'
2011-09-15 04:04:45: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-client-2-ib
+ cd /mnt/lustre/d0.dd-client-2-ib
+ load_pid=31837
+ wait 31837
+ dd bs=4k count=1000000 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-2-ib/dd-file
dd: writing `/mnt/lustre/d0.dd-client-2-ib/dd-file': No space left on device
617072+0 records in
617071+0 records out
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-15 04:05:44: dd failed'
+ echo '2011-09-15 04:05:44: dd failed'
2011-09-15 04:05:44: dd failed

Syslog on client node client-2-ib showed that:

Sep 15 04:05:44 client-2 kernel: LustreError: 31837:0:(vvp_io.c:1001:vvp_io_commit_write()) Write page 617071 of inode ffff880127479cb8 failed -28

The dd error and the syslog on client-2-ib showed "No space left on device"; however:

[root@client-2-ib ~]# lfs df
UUID                  1K-blocks       Used   Available  Use%  Mounted on
lustre-MDT0000_UUID     7498624     443864     6554760    6%  /mnt/lustre[MDT:0]
lustre-OST0000_UUID    15939896    2922084    12217684   19%  /mnt/lustre[OST:0]
lustre-OST0001_UUID    15939896     478888    14653776    3%  /mnt/lustre[OST:1]
lustre-OST0002_UUID    15939896     455896    14682912    3%  /mnt/lustre[OST:2]
lustre-OST0003_UUID    15939896     451792    14680872    3%  /mnt/lustre[OST:3]
lustre-OST0004_UUID    15939896     453048    14686784    3%  /mnt/lustre[OST:4]
lustre-OST0005_UUID    15939896     452036    14687796    3%  /mnt/lustre[OST:5]
lustre-OST0006_UUID    15939896     480232    14645024    3%  /mnt/lustre[OST:6]
filesystem summary:   111579272    5693976   100254848    5%  /mnt/lustre

[root@client-2-ib ~]# lfs df -i
UUID                   Inodes   IUsed     IFree  IUse%  Mounted on
lustre-MDT0000_UUID   5000040     148   4999892     0%  /mnt/lustre[MDT:0]
lustre-OST0000_UUID    236160     107    236053     0%  /mnt/lustre[OST:0]
lustre-OST0001_UUID    236160      93    236067     0%  /mnt/lustre[OST:1]
lustre-OST0002_UUID    236160     109    236051     0%  /mnt/lustre[OST:2]
lustre-OST0003_UUID    236160      90    236070     0%  /mnt/lustre[OST:3]
lustre-OST0004_UUID    236160      97    236063     0%  /mnt/lustre[OST:4]
lustre-OST0005_UUID    236160      95    236065     0%  /mnt/lustre[OST:5]
lustre-OST0006_UUID    236160      90    236070     0%  /mnt/lustre[OST:6]
filesystem summary:   5000040     148   4999892     0%  /mnt/lustre

An OOM issue also occurred on client node client-12-ib:

dd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0
dd cpuset=/ mems_allowed=0
Pid: 31050, comm: dd Tainted: G ---------------- T 2.6.32-131.6.1.el6.x86_64 #1
Call Trace:
 [<ffffffff810c0071>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff8111020b>] ? oom_kill_process+0xcb/0x2e0
 [<ffffffff811107d0>] ? select_bad_process+0xd0/0x110
 [<ffffffff81110868>] ? __out_of_memory+0x58/0xc0
 [<ffffffff81110a69>] ? out_of_memory+0x199/0x210
 [<ffffffff811201b2>] ? __alloc_pages_nodemask+0x812/0x8b0
 [<ffffffff8115465a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff8110d667>] ? __page_cache_alloc+0x87/0x90
 [<ffffffff8110e37c>] ? grab_cache_page_write_begin+0x7c/0xc0
 [<ffffffffa088957c>] ? ll_write_begin+0x5c/0x200 [lustre]
 [<ffffffffa0882182>] ? ll_getxattr+0xf2/0x580 [lustre]
 [<ffffffff8110dc1e>] ? generic_file_buffered_write+0x10e/0x2a0
 [<ffffffff8106dd17>] ? current_fs_time+0x27/0x30
 [<ffffffff8110f570>] ? __generic_file_aio_write+0x250/0x480
 [<ffffffff8110f80f>] ? generic_file_aio_write+0x6f/0xe0
 [<ffffffffa089b051>] ? vvp_io_write_start+0xa1/0x270 [lustre]
 [<ffffffffa04df318>] ? cl_io_start+0x68/0x170 [obdclass]
 [<ffffffffa04e37d0>] ? cl_io_loop+0x110/0x1c0 [obdclass]
 [<ffffffffa084291b>] ? ll_file_io_generic+0x44b/0x580 [lustre]
 [<ffffffffa03c8424>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
 [<ffffffffa04d26b9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa0842b8f>] ? ll_file_aio_write+0x13f/0x310 [lustre]
 [<ffffffffa04d282e>] ? cl_env_get+0x19e/0x350 [obdclass]
 [<ffffffffa04d0d8f>] ? cl_env_put+0x1af/0x2e0 [obdclass]
 [<ffffffffa0849301>] ? ll_file_write+0x171/0x310 [lustre]
 [<ffffffff8126e781>] ? __clear_user+0x21/0x70
 [<ffffffff81172748>] ? vfs_write+0xb8/0x1a0
 [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81173181>] ? sys_write+0x51/0x90
 [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
CPU    2: hi:    0, btch:   1 usd:   0
CPU    3: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:  16
CPU    1: hi:  186, btch:  31 usd:   0
CPU    2: hi:  186, btch:  31 usd: 132
CPU    3: hi:  186, btch:  31 usd:  54
Node 0 Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:  53
CPU    1: hi:  186, btch:  31 usd:  30
CPU    2: hi:  186, btch:  31 usd: 178
CPU    3: hi:  186, btch:  31 usd:  59
active_anon:2493 inactive_anon:419 isolated_anon:0
active_file:6765 inactive_file:2834208 isolated_file:64
unevictable:0 dirty:0 writeback:62 unstable:0
free:30040 slab_reclaimable:4392 slab_unreclaimable:97285
mapped:2752 shmem:1 pagetables:936 bounce:0
Node 0 DMA free:15564kB min:80kB low:100kB high:120kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15156kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2991 12081 12081
Node 0 DMA32 free:53164kB min:16712kB low:20888kB high:25068kB active_anon:0kB inactive_anon:4kB active_file:88kB inactive_file:2643896kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:3063392kB mlocked:0kB dirty:0kB writeback:248kB mapped:0kB shmem:0kB slab_reclaimable:5104kB slab_unreclaimable:32240kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:189512 all_unreclaimable? no
lowmem_reserve[]: 0 0 9090 9090
Node 0 Normal free:51432kB min:50784kB low:63480kB high:76176kB active_anon:9972kB inactive_anon:1672kB active_file:26972kB inactive_file:8692936kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:9308160kB mlocked:0kB dirty:0kB writeback:0kB mapped:11008kB shmem:4kB slab_reclaimable:12464kB slab_unreclaimable:356900kB kernel_stack:1592kB pagetables:3744kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1090588 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 1*8kB 0*16kB 2*32kB 0*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15564kB
Node 0 DMA32: 583*4kB 630*8kB 548*16kB 373*32kB 228*64kB 34*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB = 53164kB
Node 0 Normal: 1169*4kB 1168*8kB 658*16kB 324*32kB 178*64kB 7*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 51300kB
389500 total pagecache pages
322 pages in swap cache
Swap cache stats: add 3069, delete 2747, find 535/672
Free swap  = 14410400kB
Total swap = 14417912kB
3145712 pages RAM
98606 pages reserved
397745 pages shared
2558959 pages non-shared
Out of memory: kill process 6405 (qmgr) score 16072 or a child
Killed process 6405 (qmgr) vsz:64288kB, anon-rss:668kB, file-rss:1884kB

Maloo report: https://maloo.whamcloud.com/test_sets/aeb95442-df8f-11e0-9909-52540025f9af

Please refer to the attached recovery-oss-scale.1316084894.log.tar.bz2 for more logs.

The "ENOSPC" issue is similar to |
| Comment by Jian Yu [ 15/Sep/11 ] |
|
The second run hit |
| Comment by Build Master (Inactive) [ 16/Sep/11 ] |
|
Integrated in Oleg Drokin : ff38d3f6ba13e8c251a2fdc070bae3cb2098a4a4
|