[LU-4644] Test failure sanity test_161a: list_del corruption Created: 18/Feb/14  Updated: 18/Jul/17  Resolved: 17/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Maloo
Assignee: Mikhail Pershin
Resolution: Cannot Reproduce
Votes: 0
Labels: None

Issue Links:
Related: is related to LU-4420 sanity test_161a: cannot create regul... (Resolved)
Severity: 3
Rank (Obsolete): 12702

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/015e73f8-9870-11e3-968c-52540035b04c.

The sub-test test_161a failed with the following error:

test failed to respond and timed out

Info required for matching: sanity 161a

MDS2 Console log: (review-dne-part-1 on master)

16:31:57:Lustre: DEBUG MARKER: == sanity test 161a: link ea sanity == 16:31:56 (1392683516)
16:31:57:------------[ cut here ]------------
16:31:57:WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
16:31:57:Hardware name: KVM
16:31:57:list_del corruption. prev->next should be ffff88005f8fdb98, but was ffff88005f8fdbe0
16:31:57:Modules linked in: osp(U) mdd(U) lod(U) lfsck(U) mdt(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
16:31:57:Pid: 10866, comm: mdt00_002 Not tainted 2.6.32-431.3.1.el6_lustre.gc762f0f.x86_64 #1
16:31:57:Call Trace:
16:31:57: [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
16:31:57: [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
16:31:57: [<ffffffff812948be>] ? list_del+0x6e/0xa0
16:31:57: [<ffffffff8109b6a1>] ? remove_wait_queue+0x31/0x50
16:31:57: [<ffffffffa05fa469>] ? lu_object_find_at+0xf9/0x360 [obdclass]
16:31:57: [<ffffffffa04a3cb2>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
16:31:57: [<ffffffffa05fa6e6>] ? lu_object_find+0x16/0x20 [obdclass]
16:31:57: [<ffffffffa0d80ed6>] ? mdt_object_find+0x56/0x170 [mdt]
16:31:57: [<ffffffffa0d8e0f3>] ? mdt_get_info+0xee3/0x1a20 [mdt]
16:31:57: [<ffffffffa089346c>] ? tgt_request_handle+0x23c/0xac0 [ptlrpc]
16:31:57: [<ffffffffa084266a>] ? ptlrpc_main+0xd1a/0x1980 [ptlrpc]
16:31:57: [<ffffffffa0841950>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
16:31:57: [<ffffffff8109af06>] ? kthread+0x96/0xa0
16:31:57: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
16:31:57: [<ffffffff8109ae70>] ? kthread+0x0/0xa0
16:31:57: [<ffffffff8100c200>] ? child_rip+0x0/0x20
16:31:57:---[ end trace dd02610dd223ce47 ]---
16:31:57:------------[ cut here ]------------
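
The warning in the log above comes from the kernel's list debugging check: before unlinking an entry, list_del() verifies that the previous node's next pointer still refers to that entry. The following is a minimal userspace sketch of that kind of check, not the kernel's implementation. The names list_node, list_init, list_add_tail_node and list_del_checked are hypothetical, and NULL is used here as a stand-in for the kernel's poison values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modelled on list_head. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Link entry in just before head, i.e. at the tail of the list. */
static void list_add_tail_node(struct list_node *entry, struct list_node *head)
{
    assert(entry != NULL && head != NULL);
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/*
 * Unlink entry only if it is still linked and both neighbours point
 * back at it. Returns false and leaves the list untouched otherwise;
 * the kernel prints a "list_del corruption" warning in that situation.
 */
static bool list_del_checked(struct list_node *entry)
{
    /* Assumption of this sketch: NULL marks an already unlinked entry. */
    if (entry->next == NULL || entry->prev == NULL)
        return false;
    if (entry->prev->next != entry || entry->next->prev != entry)
        return false;

    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = NULL;
    entry->prev = NULL;
    return true;
}
```

In this sketch, unlinking the same entry twice, or unlinking an entry whose predecessor has been re-pointed at another node, makes list_del_checked() return false instead of silently damaging the list.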


 Comments   
Comment by Alex Zhuravlev [ 18/Feb/14 ]

This is due to an extra lu_object_find() somewhere in mdt_get_info(). Instead, it should pick that object directly from info, as the other handlers do. Please check with Mike Pershin for the details.
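
The idea of taking the object from the request info rather than looking it up again can be sketched generically as follows. This is only an illustration under stated assumptions: demo_object, demo_req_info, demo_lookup, demo_info_object and demo_info_release are hypothetical names, not Lustre types or functions, and the real handler code is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted object; not a Lustre type. */
struct demo_object {
    int id;
    int refcount;
};

/* Hypothetical per-request context that remembers the resolved object. */
struct demo_req_info {
    int id;                   /* identifier carried by the request */
    struct demo_object *obj;  /* object resolved for this request, if any */
    int lookups;              /* how many table lookups were performed */
};

static struct demo_object demo_table[] = { { 1, 0 }, { 2, 0 } };

/* Look the object up in the table and take a reference on it. */
static struct demo_object *demo_lookup(struct demo_req_info *info)
{
    size_t i;

    info->lookups++;
    for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
        if (demo_table[i].id == info->id) {
            demo_table[i].refcount++;
            return &demo_table[i];
        }
    }
    return NULL;
}

/* Return the request's object, performing the lookup only on first use. */
static struct demo_object *demo_info_object(struct demo_req_info *info)
{
    if (info->obj == NULL)
        info->obj = demo_lookup(info);
    return info->obj;
}

/* Drop the single reference held by the request context. */
static void demo_info_release(struct demo_req_info *info)
{
    if (info->obj != NULL) {
        assert(info->obj->refcount > 0);
        info->obj->refcount--;
        info->obj = NULL;
    }
}
```

Every helper on the request path calls demo_info_object() instead of demo_lookup(), so the lookup runs once per request and only one reference has to be released.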

Comment by Jodi Levi (Inactive) [ 19/Feb/14 ]

Mike,
Can you have a look at this please and comment?
Thank you!

Comment by Mikhail Pershin [ 11/Apr/14 ]

The only likely source of the problem is mdt_path_current(), which may call mdt_object_find() twice for the same object. I need some time to confirm that.

Comment by Andreas Dilger [ 14/Nov/14 ]

Mike, can you please update this ticket. Sounds like it could be fixed by a relatively simple patch?

Comment by Andreas Dilger [ 25/Nov/14 ]

Mike, can you please update this ticket?

Comment by Mikhail Pershin [ 27/Nov/14 ]

It seems that the mdt_path()->mdt_path_current() call path may cause a double mdt_object_find(). I'll prepare a patch.

Comment by Andreas Dilger [ 17/Apr/17 ]

The referenced functions no longer exist, and there has been no repeat of this problem in the past 3 years.

Comment by Oscar Santos [ 17/Jul/17 ]

Andreas,

Because my understanding is that mdt_get_info, mdt_path and mdt_patch_current are still present in master, can you clarify which referenced functions no longer exist? And as of what version?

Comment by Andreas Dilger [ 18/Jul/17 ]

Oscar, I'm not sure of the context of my comment.  I may have been confused because there were typos in the function names.  In any case, this test has not failed in a long time, and this code has changed significantly since the bug was filed, so I don't think it is worthwhile to keep it open.

Comment by Alex Zhuravlev [ 18/Jul/17 ]

Multiple calls to lu_object_find() shouldn't be a problem since LU-9049 landed.
