[LU-1929] performance-sanity subtest test_3: list_add corruption
Created: 13/Sep/12  Updated: 28/Sep/12  Resolved: 14/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Lai Siyao
Resolution: Duplicate
Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-1881 sanity test 116 soft lockup Resolved
Severity: 3
Rank (Obsolete): 6326

 Description   

This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/d0f9d278-fd9a-11e1-afe5-52540035b04c.

Info required for matching: performance-sanity 3

Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17

Console log on MDS (fat-intel-2):

Lustre: DEBUG MARKER: ===== mdsrate-create-small.sh
------------[ cut here ]------------
WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Not tainted)
Hardware name: X8DTT-H
list_add corruption. prev->next should be next (ffffc90022c8c01c), but was (null). (prev=ffff8805f43957b8).
Modules linked in: nfs fscache cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa mlx4_ib ib_mad ib_core mlx4_en mlx4_core e1000e microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
generalInitializing cgroup subsys cpuset
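
Note on reading the warning above: as far as I understand the list debugging code in lib/list_debug.c, before linking a new entry it checks that the two neighbours still point at each other, and it warns (without aborting the insert) when they do not. The sketch below is a small Python model of that check, not the kernel's C code. The names Node and checked_list_add are hypothetical and exist only for illustration. It reproduces the "prev->next should be next ... but was (null)" case by clearing a pointer by hand.

```python
class Node:
    """Hypothetical stand-in for the kernel's struct list_head."""

    def __init__(self, name):
        self.name = name
        self.prev = self  # an empty list head points at itself
        self.next = self


def checked_list_add(new, prev, next_):
    """Link `new` between `prev` and `next_`, reporting broken neighbours first."""
    problems = []
    if next_.prev is not prev:
        problems.append("list_add corruption. next->prev should be prev")
    if prev.next is not next_:
        problems.append("list_add corruption. prev->next should be next")
    # Assumption: like a kernel WARN, this only reports; the insert still runs.
    next_.prev = new
    new.next = next_
    new.prev = prev
    prev.next = new
    return problems


head = Node("head")
a = Node("a")
print(checked_list_add(a, head, head))  # healthy list, nothing to report

# Simulate the entry's memory being cleared underneath the list.
a.next = None
b = Node("b")
print(checked_list_add(b, a, head))  # reports the prev->next mismatch
```

A NULL pointer in a neighbour, as in the log, usually means the memory holding that entry was freed or overwritten by something else, which is why the warning by itself says little about the real culprit.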


 Comments   
Comment by Peter Jones [ 13/Sep/12 ]

Lai

It seems it would be best to understand this failure first.

Peter

Comment by Lai Siyao [ 13/Sep/12 ]

In my earlier testing, this test was quite likely to trigger LU-1881. This crash doesn't give much information, but it may well be the same issue. IMO it's better to retest performance-sanity after the LU-1881 fix is merged.

Comment by Jian Yu [ 13/Sep/12 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/18

parallel-scale test_compilebench: https://maloo.whamcloud.com/test_sets/6592fc34-fdaa-11e1-a1b4-52540035b04c

Console log on MDS (fat-intel-2):

Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 4 -r 4 --makej
BUG: unable to handle kernel 
------------[ cut here ]------------
WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Not tainted)
Hardware name: X8DTT-H
list_del corruption. next->prev should be ffff880335a9f000, but was (null)
Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) jbd2 nfs fscache lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa mlx4_ib ib_mad ib_core mlx4_en mlx4_core e1000e microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 104, comm: events/5 Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff812833bd>] ? list_del+0x8d/0xa0
 [<ffffffff81164008>] ? free_block+0xc8/0x170
 [<ffffffff811642e1>] ? drain_array+0xc1/0x100
 [<ffffffff811652ae>] ? cache_reap+0x8e/0x260
 [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
 [<ffffffff81165220>] ? cache_reap+0x0/0x260
 [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
 [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0

Please refer to the above report for more console logs.

Comment by Jian Yu [ 14/Sep/12 ]

This is fixed in LU-1881.

Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/19

performance-sanity test passed:
https://maloo.whamcloud.com/test_sets/2413e27e-fe44-11e1-b4cd-52540035b04c
