[LU-4731] sanity-lfsck test 1b: Oops: IP: [<ffffffff812948e7>] __list_add+0x17/0xa0
Created: 07/Mar/14  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Jian Yu
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: dne
Environment:

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/39/ (2.5.1 RC1)
Distro/Arch: RHEL6.5/x86_64
MDSCOUNT=2


Severity: 3
Rank (Obsolete): 13006

 Description   

While running sanity-lfsck test 1b, the MDS hit the following Oops:

03:15:32:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-ldiskfs.lustre-MDT0000.quota_slave.enabled
03:15:32:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.mdt=ug3
03:15:32:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.ost=ug3
03:15:32:BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
03:15:32:IP: [<ffffffff812948e7>] __list_add+0x17/0xa0
03:15:32:PGD 0 
03:15:32:Oops: 0000 [#1] SMP 
03:15:32:last sysfs file: /sys/devices/system/cpu/online
03:15:32:CPU 1 
03:15:32:Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: llog_test]
03:15:32:
03:15:32:Pid: 30021, comm: ll_mgs_0001 Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 Red Hat KVM
03:15:32:RIP: 0010:[<ffffffff812948e7>]  [<ffffffff812948e7>] __list_add+0x17/0xa0
03:15:32:RSP: 0018:ffff88006cb65cd0  EFLAGS: 00010282
03:15:32:RAX: ffff88007bf135c0 RBX: ffff88006d108450 RCX: 0000000000000002
03:15:32:RDX: 0000000000000000 RSI: ffffc90001768d20 RDI: ffff88006d108450
03:15:32:RBP: ffff88006cb65cf0 R08: ffff88006d108440 R09: 0000000000000000
03:15:32:R10: 000000000000000e R11: 20736e6172742029 R12: ffffffffffffffff
03:15:32:R13: 0000000000000001 R14: 0000000000000000 R15: ffff8800605b5540
03:15:32:FS:  0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
03:15:32:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
03:15:32:CR2: 0000000000000008 CR3: 00000000379c4000 CR4: 00000000000006e0
03:15:32:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
03:15:32:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
03:15:32:Process ll_mgs_0001 (pid: 30021, threadinfo ffff88006cb64000, task ffff88006d28a080)
03:15:32:Stack:
03:15:32: 0000000000000000 000000000000001a ffffffffffffffff 0000000000000001
03:15:32:<d> ffff88006cb65d00 ffffffffa053a93b ffff88006cb65d60 ffffffffa05425af
03:15:32:<d> ffff88006cb65d30 ffff88006d108440 ffffffffffffffff ffff8800ffffffff
03:15:32:Call Trace:
03:15:32: [<ffffffffa053a93b>] lnet_res_lh_initialize+0x4b/0x50 [lnet]
03:15:32: [<ffffffffa05425af>] LNetMEAttach+0xff/0x220 [lnet]
03:15:32: [<ffffffffa0772f12>] ptlrpc_register_rqbd+0x82/0x390 [ptlrpc]
03:15:32: [<ffffffffa0780ec5>] ptlrpc_server_post_idle_rqbds+0x75/0xe0 [ptlrpc]
03:15:32: [<ffffffffa0789dd1>] ptlrpc_main+0xb11/0x1740 [ptlrpc]
03:15:32: [<ffffffffa07892c0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
03:15:32: [<ffffffff8109aee6>] kthread+0x96/0xa0
03:15:32: [<ffffffff8100c20a>] child_rip+0xa/0x20
03:15:32: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
03:15:32: [<ffffffff8100c200>] ? child_rip+0x0/0x20
03:15:32:Code: ff 48 8b 03 eb 92 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 fb 4c 89 6d f8 <4c> 8b 42 08 49 89 f5 49 89 d4 49 39 f0 75 27 4d 8b 45 00 4d 39 
03:15:32:RIP  [<ffffffff812948e7>] __list_add+0x17/0xa0
03:15:32: RSP <ffff88006cb65cd0>
03:15:32:CR2: 0000000000000008

Maloo report: https://maloo.whamcloud.com/test_sets/fef44202-a551-11e3-a61d-52540035b04c



 Comments   
Comment by Jian Yu [ 07/Mar/14 ]

This is a regression on the Lustre b2_5 branch; the failure did not occur before.

Comment by Jian Yu [ 07/Mar/14 ]

Here is a patch that attempts to reproduce the failure: http://review.whamcloud.com/9548

Comment by Andreas Dilger [ 13/Mar/14 ]

It looks like this is not directly related to LFSCK, since the crash is in LNET. However, it might be indirectly caused by LFSCK if LFSCK has a use-after-free or other bad memory access. It definitely looks like a corrupt list pointer (the faulting address is 0x0000000000000008), which would be caused by an incorrect memory access.

Let's see if this is hit again.

Comment by Andreas Dilger [ 09/Jan/20 ]

Closing old bug.
