[LU-6439] sanity test_120g: panic on client Created: 07/Apr/15  Updated: 26/Mar/18  Resolved: 06/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-3270 ptlrpcd strnlen crash trying to log a... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/96df30ce-dccc-11e4-a6e6-5254006e85c2.

From evidence in the console log, it looks like the client node panicked and took a crash dump:

17:25:50:Saving to remote location onyx-4.onyx.hpdd.intel.com:/export/scratch/dumps
17:25:50:Saving vmcore-dmesg.txt
17:25:50:Saved vmcore-dmesg.txt

This seems like a one-off and possibly a TEI issue. Other test runs of patches that depend on the one in this test run completed without any problems.

The sub-test test_120g failed with the following error:

test failed to respond and timed out

Info required for matching: sanity 120g



 Comments   
Comment by Bob Glossman (Inactive) [ 07/Apr/15 ]

Found the saved crash dump. The vmcore-dmesg.txt shows the following:

<0>BUG: soft lockup - CPU#0 stuck for 67s! [ptlrpcd_0:2379]
<4>Modules linked in: ext2 lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
<4>CPU 0
<4>Modules linked in: ext2 lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U)
<0>BUG: soft lockup - CPU#1 stuck for 67s! [ll_sa_28008:28009]
<4>Modules linked in: ext2 lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
<4>CPU 1
<4>Modules linked in: ext2 lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
<4>
<4>Pid: 28009, comm: ll_sa_28008 Not tainted 2.6.32-431.29.2.el6.x86_64 #1 Red Hat KVM
<4>RIP: 0010:[<ffffffff8152b84e>] [<ffffffff8152b84e>] _spin_lock+0x1e/0x30
<4>RSP: 0018:ffff88006f26fd30 EFLAGS: 00000206
<4>RAX: 0000000000000001 RBX: ffff88006f26fd30 RCX: 0000000000000003
<4>RDX: 0000000000000000 RSI: 000000001082ebea RDI: ffff88006a1e2440
<4>RBP: ffffffff8100bb8e R08: 0000000031353433 R09: 0000000000000000
<4>R10: ffff880067eb68c0 R11: 0000000000000080 R12: 0000000000000000
<4>R13: 0000000000000eef R14: 0000000200001b71 R15: 0000000000000000
<4>FS: 0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000000001bb80b8 CR3: 000000007d793000 CR4: 00000000000006e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ll_sa_28008 (pid: 28009, threadinfo ffff88006f26e000, task ffff88007abec080)
<4>Stack:
<4> ffff88006f26fdc0 ffffffffa0b6a275 0000000000000005 0000000000000080
<4><d> 0000000000001a77 ffff88006cb14080 ffff88006a1e2178 ffff88005fc241c8
<4><d> 0000000000000000 0000000000000000 ffff88007a38ab00 ffff88006b218200
<4>Call Trace:
<4> [<ffffffffa0b6a275>] ? ll_statahead_one+0x295/0xdc0 [lustre]
<4> [<ffffffffa0b6b11b>] ? ll_statahead_thread+0x37b/0xfb0 [lustre]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0b6ada0>] ? ll_statahead_thread+0x0/0xfb0 [lustre]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>Code: 00 00 00 01 74 05 e8 92 3a d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
<4>Call Trace:
<4> [<ffffffffa0b6a275>] ? ll_statahead_one+0x295/0xdc0 [lustre]
<4> [<ffffffffa0b6b11b>] ? ll_statahead_thread+0x37b/0xfb0 [lustre]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0b6ada0>] ? ll_statahead_thread+0x0/0xfb0 [lustre]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<0>Kernel panic - not syncing: softlockup: hung tasks
<4>Pid: 28009, comm: ll_sa_28008 Not tainted 2.6.32-431.29.2.el6.x86_64 #1
<4>Call Trace:
<4> <IRQ> [<ffffffff8152873c>] ? panic+0xa7/0x16f
<4> [<ffffffff810e6200>] ? watchdog_timer_fn+0x0/0x1e0
<4> [<ffffffff810e63ca>] ? watchdog_timer_fn+0x1ca/0x1e0
<4> [<ffffffff8109f6be>] ? __run_hrtimer+0x8e/0x1a0
<4> [<ffffffff810a6a9f>] ? ktime_get_update_offsets+0x4f/0xd0
<4> [<ffffffff8109fa26>] ? hrtimer_interrupt+0xe6/0x260
<4> [<ffffffff81031f1d>] ? local_apic_timer_interrupt+0x3d/0x70
<4> [<ffffffff815325e5>] ? smp_apic_timer_interrupt+0x45/0x60
<4> [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
<4> <EOI> [<ffffffff8152b84e>] ? _spin_lock+0x1e/0x30
<4> [<ffffffffa0b6a275>] ? ll_statahead_one+0x295/0xdc0 [lustre]
<4> [<ffffffffa0b6b11b>] ? ll_statahead_thread+0x37b/0xfb0 [lustre]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0b6ada0>] ? ll_statahead_thread+0x0/0xfb0 [lustre]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

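So it looks like a statahead issue: the ll_sa_28008 thread is stuck in _spin_lock, with ll_statahead_one at the top of its call trace.

Is this a dup of LU-4410?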
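
For triage of similar dumps, the soft-lockup entries and the Lustre frames in a vmcore-dmesg.txt like the one above can be pulled out with a short script. This is a minimal sketch, assuming Python 3 and the RHEL 6 style dmesg format shown in this comment. The names soft_lockups and module_frames and both regular expressions are illustrative only; they are not part of Lustre or of any crash tooling. SAMPLE is a handful of lines copied from the log above.

```python
import re

# Matches lines such as:
#   <0>BUG: soft lockup - CPU#1 stuck for 67s! [ll_sa_28008:28009]
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]:]+):(?P<pid>\d+)\]"
)

# Matches stack frames such as:
#   <4> [<ffffffffa0b6a275>] ? ll_statahead_one+0x295/0xdc0 [lustre]
# The trailing "[module]" is optional (core kernel symbols have none).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?:\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)

# A few lines copied from the vmcore-dmesg.txt excerpt in this comment.
SAMPLE = """\
<0>BUG: soft lockup - CPU#0 stuck for 67s! [ptlrpcd_0:2379]
<0>BUG: soft lockup - CPU#1 stuck for 67s! [ll_sa_28008:28009]
<4>RIP: 0010:[<ffffffff8152b84e>] [<ffffffff8152b84e>] _spin_lock+0x1e/0x30
<4> [<ffffffffa0b6a275>] ? ll_statahead_one+0x295/0xdc0 [lustre]
<4> [<ffffffffa0b6b11b>] ? ll_statahead_thread+0x37b/0xfb0 [lustre]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0b6ada0>] ? ll_statahead_thread+0x0/0xfb0 [lustre]
"""


def soft_lockups(text):
    """Return one dict per soft-lockup report found in the dmesg text."""
    return [
        {
            "cpu": int(m["cpu"]),
            "secs": int(m["secs"]),
            "comm": m["comm"],
            "pid": int(m["pid"]),
        }
        for m in SOFT_LOCKUP_RE.finditer(text)
    ]


def module_frames(text, module):
    """Return the symbols from `module`, unique, in order of first appearance."""
    seen = []
    for m in FRAME_RE.finditer(text):
        if m["module"] == module and m["sym"] not in seen:
            seen.append(m["sym"])
    return seen


if __name__ == "__main__":
    for hit in soft_lockups(SAMPLE):
        print(hit)
    print(module_frames(SAMPLE, "lustre"))
```

To run it against a real dump, read the file contents into a string and pass that in place of SAMPLE. Frames marked with "?" in the trace are kept, since on this kernel every frame in the excerpt carries that marker.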

Comment by Lai Siyao [ 08/Apr/15 ]

This looks to be a dup of LU-3270. There is a backport patch for 2.5, http://review.whamcloud.com/#/c/12901/, which should be able to fix this.

Comment by Peter Jones [ 06/Feb/18 ]

Landed for 2.11

Generated at Sat Feb 10 02:00:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.