[LU-8156] sanity test_116a: test failed to respond and timed out Created: 17/May/16 Updated: 05/Aug/20 Resolved: 05/Aug/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | zfs | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/16faf1ee-1c5f-11e6-a909-5254006e85c2.

The sub-test test_116a failed with the following error:

test failed to respond and timed out

Please provide additional information about the failure here.

Info required for matching: sanity 116a |
| Comments |
| Comment by Nathaniel Clark [ 29/Jul/16 ] |
|
Just had a very similar issue happen on review-zfs-part-1. Crash on OSS:

14:30:48:[ 3507.809179] Lustre: DEBUG MARKER: == sanity test 116a: stripe QOS: free space balance ================================================== 14:28:28 (1469654908)
14:30:48:[ 3644.052006] BUG: soft lockup - CPU#0 stuck for 22s! [ll_ost_io00_009:2658]
14:30:48:[ 3644.052006] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd auth_rpcgss grace spl(OE) zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm serio_raw ata_piix i2c_core virtio_pci 8139cp virtio_ring mii virtio floppy libata
14:30:48:[ 3644.052006] CPU: 0 PID: 2658 Comm: ll_ost_io00_009 Tainted: P OE ------------ 3.10.0-327.22.2.el7_lustre.x86_64 #1
14:30:48:[ 3644.052006] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
14:30:48:[ 3644.052006] task: ffff88004b906780 ti: ffff880046388000 task.ti: ffff880046388000
14:30:48:[ 3644.052006] RIP: 0010:[<ffffffff81302159>] [<ffffffff81302159>] __write_lock_failed+0x9/0x20
14:30:48:[ 3644.052006] RSP: 0018:ffff88004638b8e8 EFLAGS: 00000297
14:30:48:[ 3644.052006] RAX: 0000000000000315 RBX: 0000000000000000 RCX: ffff8800469886c0
14:30:48:[ 3644.052006] RDX: ffff88004e868600 RSI: ffff88004e868600 RDI: ffff8800531e40bc
14:30:48:[ 3644.052006] RBP: ffff88004638b8e8 R08: 0000000000017560 R09: ffff88007fc17560
14:30:48:[ 3644.052006] R10: ffffea0000ddf380 R11: ffffffffa021075a R12: ffffffffa0210ccd
14:30:48:[ 3644.052006] R13: ffff88004638b888 R14: 0000000000000040 R15: 00000000ffffffff
14:30:48:[ 3644.052006] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
14:30:48:[ 3644.052006] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
14:30:48:[ 3644.052006] CR2: 00007fdc66ed5000 CR3: 000000000194a000 CR4: 00000000000006f0
14:30:48:[ 3644.052006] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
14:30:48:[ 3644.052006] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
14:30:48:[ 3644.052006] Stack:
14:30:48:[ 3644.052006]  ffff88004638b8f8 ffffffff8163da17 ffff88004638b928 ffffffffa0ed3a43
14:30:48:[ 3644.052006]  ffff88004e868600 ffff8800531e4000 ffff88004e868600 0000000000000000
14:30:48:[ 3644.052006]  ffff88004638b948 ffffffffa0ede0ab ffff88004638b948 ffffffffa100459d
14:30:48:[ 3644.052006] Call Trace:
14:30:48:[ 3644.052006]  [<ffffffff8163da17>] _raw_write_lock+0x17/0x20
14:30:48:[ 3644.052006]  [<ffffffffa0ed3a43>] osd_object_sa_dirty_add+0x43/0x70 [osd_zfs]
14:30:48:[ 3644.052006]  [<ffffffffa0ede0ab>] __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs]
14:30:48:[ 3644.052006]  [<ffffffffa0ede345>] __osd_sa_xattr_set+0x185/0x400 [osd_zfs]
14:30:48:[ 3644.052006]  [<ffffffffa0edec7b>] osd_xattr_set+0x27b/0x510 [osd_zfs]
14:30:48:[ 3644.052006]  [<ffffffffa0ff1f35>] ofd_commitrw_write+0x1315/0x1ad0 [ofd]
14:30:48:[ 3644.052006]  [<ffffffffa0ff526d>] ofd_commitrw+0x51d/0xa40 [ofd]
14:30:48:[ 3644.052006]  [<ffffffffa0d49451>] obd_commitrw+0x2ec/0x32f [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffffa0d23861>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffff812fe46b>] ? string.isra.7+0x3b/0xf0
14:30:48:[ 3644.052006]  [<ffffffffa0c7a4a0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffffa0d1fb15>] tgt_request_handle+0x915/0x1320 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffffa0ccbccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffffa0cc9888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
14:30:48:[ 3644.052006]  [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
14:30:48:[ 3644.052006]  [<ffffffffa0ccfd80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffffa0ccf2e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
14:30:48:[ 3644.052006]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
14:30:48:[ 3644.052006]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
14:30:48:[ 3644.052006]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
14:30:48:[ 3644.052006] Code: 66 90 48 89 01 31 c0 66 66 90 c3 b8 f2 ff ff ff 66 66 90 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 f0 ff 07 f3 90 <83> 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00
14:30:48:[ 3644.052006] Kernel panic - not syncing: softlockup: hung tasks |
| Comment by Kit Westneat [ 11/Aug/16 ] |
|
Another soft lockup, but with a slightly different trace during review-zfs-part-1:

19:01:47:[ 2720.066003] Call Trace:
19:01:47:[ 2720.066003]  [<ffffffff8163d984>] _raw_read_lock+0x14/0x20
19:01:47:[ 2720.066003]  [<ffffffffa0ec3fa0>] osd_attr_get+0x90/0x2d0 [osd_zfs]
19:01:47:[ 2720.066003]  [<ffffffffa0fde73e>] ofd_attr_handle_ugid+0x1ce/0x3b0 [ofd]
19:01:47:[ 2720.066003]  [<ffffffffa0fe1e4e>] ofd_commitrw_write+0x22e/0x1ad0 [ofd]
19:01:47:[ 2720.066003]  [<ffffffffa0fe837e>] ? ofd_fmd_find_nolock+0xbe/0xe0 [ofd]
19:01:47:[ 2720.066003]  [<ffffffffa0fe626d>] ofd_commitrw+0x51d/0xa40 [ofd]
19:01:47:[ 2720.066003]  [<ffffffffa0d3a7a1>] obd_commitrw+0x2ec/0x32f [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffffa0d14861>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffff812fe46b>] ? string.isra.7+0x3b/0xf0
19:01:47:[ 2720.066003]  [<ffffffffa0c6b4a0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffffa0d10b15>] tgt_request_handle+0x915/0x1320 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffffa0cbcccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffffa0cba888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
19:01:47:[ 2720.066003]  [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
19:01:47:[ 2720.066003]  [<ffffffffa0cc0d80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
19:01:47:[ 2720.066003]  [<ffffffffa0cc02e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
19:01:47:[ 2720.066003]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
19:01:47:[ 2720.066003]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
19:01:47:[ 2720.066003]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 |
| Comment by Jian Yu [ 12/Aug/16 ] |
|
On the master branch, sanityn test 77f hit a similar crash in a review-zfs-part-1 test session. Console log on OSS:

Lustre: DEBUG MARKER: == sanityn test 77f: check TBF JobID nrs policy ====================================================== 17:02:56 (1470934976)
Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_policies=tbf\ jobid
Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_policies=tbf\ jobid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version
Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_tbf_rule=start\ runas\ jobid={iozone.500\ dd.500\ tiotest.500}\ rate=1000
Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_tbf_rule=start\ iozone_runas\ jobid={iozone.500}\ rate=100
Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_tbf_rule=start\ dd_runas\ jobid={dd.500}\ rate=50
BUG: soft lockup - CPU#0 stuck for 23s! [ll_ost_io00_009:6085]
Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd auth_rpcgss grace spl(OE) zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper virtio_blk ttm 8139too ata_piix drm serio_raw virtio_pci virtio_ring virtio 8139cp mii i2c_core libata floppy
CPU: 0 PID: 6085 Comm: ll_ost_io00_009 Tainted: P OE ------------ 3.10.0-327.22.2.el7_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff880037d79700 ti: ffff88003df58000 task.ti: ffff88003df58000
RIP: 0010:[<ffffffff8130217a>] [<ffffffff8130217a>] __read_lock_failed+0xa/0x20
RSP: 0018:ffff88003df5b978 EFLAGS: 00000297
RAX: ffff88000a593690 RBX: ffff88000a5936e0 RCX: ffff88003df5bfd8
RDX: 0000000000000000 RSI: ffff88000a594818 RDI: ffff88000a5948d0
RBP: ffff88003df5b978 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000010000 R11: 0000000000000004 R12: ffffffffa0a9695e
R13: ffff88003df5b948 R14: ffffffffa094fca9 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ff30dbb8000 CR3: 000000007afca000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff88003df5b988 ffffffff8163d984 ffff88003df5b9c8 ffffffffa0ecdfa0
00000000758d352b ffff88000d294080 00000000758d352b ffff88000d294080
ffff880044459400 ffff880044459430 ffff88003df5ba00 ffffffffa0fe873e
Call Trace:
[<ffffffff8163d984>] _raw_read_lock+0x14/0x20
[<ffffffffa0ecdfa0>] osd_attr_get+0x90/0x2d0 [osd_zfs]
[<ffffffffa0fe873e>] ofd_attr_handle_ugid+0x1ce/0x3b0 [ofd]
[<ffffffffa0febe4e>] ofd_commitrw_write+0x22e/0x1ad0 [ofd]
[<ffffffffa0ff237e>] ? ofd_fmd_find_nolock+0xbe/0xe0 [ofd]
[<ffffffffa0ff026d>] ofd_commitrw+0x51d/0xa40 [ofd]
[<ffffffffa0d447a1>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[<ffffffffa0d1e861>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
[<ffffffff812fe46b>] ? string.isra.7+0x3b/0xf0
[<ffffffffa0c754a0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[<ffffffffa0d1ab15>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[<ffffffffa0cc6ccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[<ffffffffa0cc4888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[<ffffffffa0ccad80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
[<ffffffffa0cca2e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[<ffffffff810a5aef>] kthread+0xcf/0xe0
[<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[<ffffffff816469d8>] ret_from_fork+0x58/0x90
[<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Code: 90 55 48 89 e5 3e ff 07 f3 90 83 3f 01 75 f9 3e ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 3e 48 ff 07 f3 90 <48> 83 3f 01 78 f8 3e 48 ff 0f 78 ee 5d c3 90 90 90 90 90 90 90
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 6085 Comm: ll_ost_io00_009 Tainted: P OEL ------------ 3.10.0-327.22.2.el7_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
ffffffff818731dc 00000000758d352b ffff88007fc03e18 ffffffff81636324
ffff88007fc03e98 ffffffff8162fb9a 0000000000000008 ffff88007fc03ea8
ffff88007fc03e48 00000000758d352b ffff88007fc03e67 0000000000000000
Call Trace:
<IRQ> [<ffffffff81636324>] dump_stack+0x19/0x1b
[<ffffffff8162fb9a>] panic+0xd8/0x1e7
[<ffffffff8111b7e6>] watchdog_timer_fn+0x1b6/0x1c0
[<ffffffff8111b630>] ? watchdog_enable+0xc0/0xc0
[<ffffffff810a9d32>] __hrtimer_run_queues+0xd2/0x260
[<ffffffff810aa2d0>] hrtimer_interrupt+0xb0/0x1e0
[<ffffffff8164839c>] ? call_softirq+0x1c/0x30
[<ffffffff81049567>] local_apic_timer_interrupt+0x37/0x60
[<ffffffff8164900f>] smp_apic_timer_interrupt+0x3f/0x60
[<ffffffff816476dd>] apic_timer_interrupt+0x6d/0x80
<EOI> [<ffffffff8130217a>] ? __read_lock_failed+0xa/0x20
[<ffffffff8163d984>] _raw_read_lock+0x14/0x20
[<ffffffffa0ecdfa0>] osd_attr_get+0x90/0x2d0 [osd_zfs]
[<ffffffffa0fe873e>] ofd_attr_handle_ugid+0x1ce/0x3b0 [ofd]
[<ffffffffa0febe4e>] ofd_commitrw_write+0x22e/0x1ad0 [ofd]
[<ffffffffa0ff237e>] ? ofd_fmd_find_nolock+0xbe/0xe0 [ofd]
[<ffffffffa0ff026d>] ofd_commitrw+0x51d/0xa40 [ofd]
[<ffffffffa0d447a1>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[<ffffffffa0d1e861>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
[<ffffffff812fe46b>] ? string.isra.7+0x3b/0xf0
[<ffffffffa0c754a0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[<ffffffffa0d1ab15>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[<ffffffffa0cc6ccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[<ffffffffa0cc4888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[<ffffffffa0ccad80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
[<ffffffffa0cca2e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[<ffffffff810a5aef>] kthread+0xcf/0xe0
[<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[<ffffffff816469d8>] ret_from_fork+0x58/0x90
[<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
The kdump vmcore and vmcore-dmesg.txt were saved in: trevis-2.trevis.hpdd.intel.com:/export/scratch/dumps/trevis-6vm3.trevis.hpdd.intel.com/10.9.4.57-2016-08-11-17:03:31/

Maloo report: https://testing.hpdd.intel.com/test_sets/c5407ab2-6002-11e6-b2e2-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 09/Sep/17 ] |
|
More on master: |
| Comment by Andreas Dilger [ 05/Aug/20 ] |
|
Closing old ticket that has not been seen in a long time. |