[LU-8156] sanity test_116a: test failed to respond and timed out Created: 17/May/16  Updated: 05/Aug/20  Resolved: 05/Aug/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: zfs

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/16faf1ee-1c5f-11e6-a909-5254006e85c2.

The sub-test test_116a failed with the following error:

test failed to respond and timed out

Please provide additional information about the failure here.

Info required for matching: sanity 116a
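
For local reproduction, a minimal sketch of rerunning only this subtest with the in-tree test framework is below; the install path and the auster invocation are assumptions and need to be adapted to the actual test setup:

# Assumes the Lustre test RPMs are installed under /usr/lib64/lustre/tests
# and a test filesystem is already configured (e.g. via cfg/local.sh).
cd /usr/lib64/lustre/tests
ONLY=116a sh sanity.sh
# Alternatively, through the auster wrapper (invocation assumed):
# ./auster -v sanity --only 116a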



 Comments   
Comment by Nathaniel Clark [ 29/Jul/16 ]

Just had a very similar issue happen on review-zfs-part-1
https://testing.hpdd.intel.com/test_sets/2098b3a8-546d-11e6-905c-5254006e85c2

Crash on OSS:

14:30:48:[ 3507.809179] Lustre: DEBUG MARKER: == sanity test 116a: stripe QOS: free space balance ================================================== 14:28:28 (1469654908)
14:30:48:[ 3644.052006] BUG: soft lockup - CPU#0 stuck for 22s! [ll_ost_io00_009:2658]
14:30:48:[ 3644.052006] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd auth_rpcgss grace spl(OE) zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm serio_raw ata_piix i2c_core virtio_pci 8139cp virtio_ring mii virtio floppy libata
14:30:48:[ 3644.052006] CPU: 0 PID: 2658 Comm: ll_ost_io00_009 Tainted: P           OE  ------------   3.10.0-327.22.2.el7_lustre.x86_64 #1
14:30:48:[ 3644.052006] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
14:30:48:[ 3644.052006] task: ffff88004b906780 ti: ffff880046388000 task.ti: ffff880046388000
14:30:48:[ 3644.052006] RIP: 0010:[<ffffffff81302159>]  [<ffffffff81302159>] __write_lock_failed+0x9/0x20
14:30:48:[ 3644.052006] RSP: 0018:ffff88004638b8e8  EFLAGS: 00000297
14:30:48:[ 3644.052006] RAX: 0000000000000315 RBX: 0000000000000000 RCX: ffff8800469886c0
14:30:48:[ 3644.052006] RDX: ffff88004e868600 RSI: ffff88004e868600 RDI: ffff8800531e40bc
14:30:48:[ 3644.052006] RBP: ffff88004638b8e8 R08: 0000000000017560 R09: ffff88007fc17560
14:30:48:[ 3644.052006] R10: ffffea0000ddf380 R11: ffffffffa021075a R12: ffffffffa0210ccd
14:30:48:[ 3644.052006] R13: ffff88004638b888 R14: 0000000000000040 R15: 00000000ffffffff
14:30:48:[ 3644.052006] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
14:30:48:[ 3644.052006] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
14:30:48:[ 3644.052006] CR2: 00007fdc66ed5000 CR3: 000000000194a000 CR4: 00000000000006f0
14:30:48:[ 3644.052006] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
14:30:48:[ 3644.052006] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
14:30:48:[ 3644.052006] Stack:
14:30:48:[ 3644.052006]  ffff88004638b8f8 ffffffff8163da17 ffff88004638b928 ffffffffa0ed3a43
14:30:48:[ 3644.052006]  ffff88004e868600 ffff8800531e4000 ffff88004e868600 0000000000000000
14:30:48:[ 3644.052006]  ffff88004638b948 ffffffffa0ede0ab ffff88004638b948 ffffffffa100459d
14:30:48:[ 3644.052006] Call Trace:
14:30:48:[ 3644.052006]  [<ffffffff8163da17>] _raw_write_lock+0x17/0x20
14:30:48:[ 3644.052006]  [<ffffffffa0ed3a43>] osd_object_sa_dirty_add+0x43/0x70 [osd_zfs]
14:30:48:[ 3644.052006]  [<ffffffffa0ede0ab>] __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs]
14:30:48:[ 3644.052006]  [<ffffffffa0ede345>] __osd_sa_xattr_set+0x185/0x400 [osd_zfs]
14:30:48:[ 3644.052006]  [<ffffffffa0edec7b>] osd_xattr_set+0x27b/0x510 [osd_zfs]
14:30:48:[ 3644.052006]  [<ffffffffa0ff1f35>] ofd_commitrw_write+0x1315/0x1ad0 [ofd]
14:30:48:[ 3644.052006]  [<ffffffffa0ff526d>] ofd_commitrw+0x51d/0xa40 [ofd]
14:30:48:[ 3644.052006]  [<ffffffffa0d49451>] obd_commitrw+0x2ec/0x32f [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffffa0d23861>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffff812fe46b>] ? string.isra.7+0x3b/0xf0
14:30:48:[ 3644.052006]  [<ffffffffa0c7a4a0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffffa0d1fb15>] tgt_request_handle+0x915/0x1320 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffffa0ccbccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffffa0cc9888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
14:30:48:[ 3644.052006]  [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
14:30:48:[ 3644.052006]  [<ffffffffa0ccfd80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffffa0ccf2e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
14:30:48:[ 3644.052006]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
14:30:48:[ 3644.052006]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
14:30:48:[ 3644.052006]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
14:30:48:[ 3644.052006]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
14:30:48:[ 3644.052006] Code: 66 90 48 89 01 31 c0 66 66 90 c3 b8 f2 ff ff ff 66 66 90 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 f0 ff 07 f3 90 <83> 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00 
14:30:48:[ 3644.052006] Kernel panic - not syncing: softlockup: hung tasks
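The trace above shows the ll_ost_io thread spinning in __write_lock_failed while trying to take a write lock from osd_object_sa_dirty_add(), presumably because another thread is holding (or has leaked) the corresponding rwlock long enough for the 22-second soft lockup watchdog to fire. As a rough userspace analogue only (flock(1) shared/exclusive locks, not the kernel rwlock in osd-zfs), the pattern looks like this:

# A shared ("read") holder that never releases the lock starves every
# exclusive ("write") acquirer -- the same shape as the spin above, except
# that flock sleeps while the kernel writer spins on the lock word.
LOCK=/tmp/lu8156-demo.lock
touch "$LOCK"
flock -s "$LOCK" -c 'sleep 600' &      # holder: shared lock, held "forever"
sleep 1
flock -x -w 5 "$LOCK" -c 'echo got write lock' \
    || echo "write lock still unavailable after 5s"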
Comment by Kit Westneat [ 11/Aug/16 ]

Another soft lockup, with a slightly different trace, during review-zfs-part-1:
https://testing.hpdd.intel.com/test_sets/3045bb60-5f4f-11e6-906c-5254006e85c2
https://testing.hpdd.intel.com/test_logs/442dfc64-5f4f-11e6-906c-5254006e85c2/show_text

19:01:47:[ 2720.066003] Call Trace:
19:01:47:[ 2720.066003]  [<ffffffff8163d984>] _raw_read_lock+0x14/0x20
19:01:47:[ 2720.066003]  [<ffffffffa0ec3fa0>] osd_attr_get+0x90/0x2d0 [osd_zfs]
19:01:47:[ 2720.066003]  [<ffffffffa0fde73e>] ofd_attr_handle_ugid+0x1ce/0x3b0 [ofd]
19:01:47:[ 2720.066003]  [<ffffffffa0fe1e4e>] ofd_commitrw_write+0x22e/0x1ad0 [ofd]
19:01:47:[ 2720.066003]  [<ffffffffa0fe837e>] ? ofd_fmd_find_nolock+0xbe/0xe0 [ofd]
19:01:47:[ 2720.066003]  [<ffffffffa0fe626d>] ofd_commitrw+0x51d/0xa40 [ofd]
19:01:47:[ 2720.066003]  [<ffffffffa0d3a7a1>] obd_commitrw+0x2ec/0x32f [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffffa0d14861>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffff812fe46b>] ? string.isra.7+0x3b/0xf0
19:01:47:[ 2720.066003]  [<ffffffffa0c6b4a0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffffa0d10b15>] tgt_request_handle+0x915/0x1320 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffffa0cbcccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffffa0cba888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
19:01:47:[ 2720.066003]  [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
19:01:47:[ 2720.066003]  [<ffffffffa0cc0d80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
19:01:47:[ 2720.066003]  [<ffffffffa0cc02e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
19:01:47:[ 2720.066003]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
19:01:47:[ 2720.066003]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
19:01:47:[ 2720.066003]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
19:01:47:[ 2720.066003]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Comment by Jian Yu [ 12/Aug/16 ]

On the master branch, sanityn test 77f hit a similar crash in a review-zfs-part-1 test session. Console log on OSS:

Lustre: DEBUG MARKER: == sanityn test 77f: check TBF JobID nrs policy ====================================================== 17:02:56 (1470934976)
Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_policies=tbf\ jobid
Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_policies=tbf\ jobid
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version
Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_tbf_rule=start\ runas\ jobid={iozone.500\ dd.500\ tiotest.500}\ rate=1000
Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_tbf_rule=start\ iozone_runas\ jobid={iozone.500}\ rate=100
Lustre: DEBUG MARKER: lctl set_param ost.OSS.ost_io.nrs_tbf_rule=start\ dd_runas\ jobid={dd.500}\ rate=50
BUG: soft lockup - CPU#0 stuck for 23s! [ll_ost_io00_009:6085]
Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd auth_rpcgss grace spl(OE) zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper virtio_blk ttm 8139too ata_piix drm serio_raw virtio_pci virtio_ring virtio 8139cp mii i2c_core libata floppy
CPU: 0 PID: 6085 Comm: ll_ost_io00_009 Tainted: P           OE  ------------   3.10.0-327.22.2.el7_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff880037d79700 ti: ffff88003df58000 task.ti: ffff88003df58000
RIP: 0010:[<ffffffff8130217a>]  [<ffffffff8130217a>] __read_lock_failed+0xa/0x20
RSP: 0018:ffff88003df5b978  EFLAGS: 00000297
RAX: ffff88000a593690 RBX: ffff88000a5936e0 RCX: ffff88003df5bfd8
RDX: 0000000000000000 RSI: ffff88000a594818 RDI: ffff88000a5948d0
RBP: ffff88003df5b978 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000010000 R11: 0000000000000004 R12: ffffffffa0a9695e
R13: ffff88003df5b948 R14: ffffffffa094fca9 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ff30dbb8000 CR3: 000000007afca000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff88003df5b988 ffffffff8163d984 ffff88003df5b9c8 ffffffffa0ecdfa0
 00000000758d352b ffff88000d294080 00000000758d352b ffff88000d294080
 ffff880044459400 ffff880044459430 ffff88003df5ba00 ffffffffa0fe873e
Call Trace:
 [<ffffffff8163d984>] _raw_read_lock+0x14/0x20
 [<ffffffffa0ecdfa0>] osd_attr_get+0x90/0x2d0 [osd_zfs]
 [<ffffffffa0fe873e>] ofd_attr_handle_ugid+0x1ce/0x3b0 [ofd]
 [<ffffffffa0febe4e>] ofd_commitrw_write+0x22e/0x1ad0 [ofd]
 [<ffffffffa0ff237e>] ? ofd_fmd_find_nolock+0xbe/0xe0 [ofd]
 [<ffffffffa0ff026d>] ofd_commitrw+0x51d/0xa40 [ofd]
 [<ffffffffa0d447a1>] obd_commitrw+0x2ec/0x32f [ptlrpc]
 [<ffffffffa0d1e861>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
 [<ffffffff812fe46b>] ? string.isra.7+0x3b/0xf0 
 [<ffffffffa0c754a0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
 [<ffffffffa0d1ab15>] tgt_request_handle+0x915/0x1320 [ptlrpc]
 [<ffffffffa0cc6ccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
 [<ffffffffa0cc4888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
 [<ffffffffa0ccad80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
 [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
 [<ffffffffa0cca2e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
 [<ffffffff810a5aef>] kthread+0xcf/0xe0
 [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff816469d8>] ret_from_fork+0x58/0x90 
 [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Code: 90 55 48 89 e5 3e ff 07 f3 90 83 3f 01 75 f9 3e ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 3e 48 ff 07 f3 90 <48> 83 3f 01 78 f8 3e 48 ff 0f 78 ee 5d c3 90 90 90 90 90 90 90 
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 6085 Comm: ll_ost_io00_009 Tainted: P           OEL ------------   3.10.0-327.22.2.el7_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
 ffffffff818731dc 00000000758d352b ffff88007fc03e18 ffffffff81636324
 ffff88007fc03e98 ffffffff8162fb9a 0000000000000008 ffff88007fc03ea8
 ffff88007fc03e48 00000000758d352b ffff88007fc03e67 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff81636324>] dump_stack+0x19/0x1b
 [<ffffffff8162fb9a>] panic+0xd8/0x1e7
 [<ffffffff8111b7e6>] watchdog_timer_fn+0x1b6/0x1c0
 [<ffffffff8111b630>] ? watchdog_enable+0xc0/0xc0
 [<ffffffff810a9d32>] __hrtimer_run_queues+0xd2/0x260
 [<ffffffff810aa2d0>] hrtimer_interrupt+0xb0/0x1e0
 [<ffffffff8164839c>] ? call_softirq+0x1c/0x30
 [<ffffffff81049567>] local_apic_timer_interrupt+0x37/0x60
 [<ffffffff8164900f>] smp_apic_timer_interrupt+0x3f/0x60
 [<ffffffff816476dd>] apic_timer_interrupt+0x6d/0x80
 <EOI>  [<ffffffff8130217a>] ? __read_lock_failed+0xa/0x20
 [<ffffffff8163d984>] _raw_read_lock+0x14/0x20
 [<ffffffffa0ecdfa0>] osd_attr_get+0x90/0x2d0 [osd_zfs]
 [<ffffffffa0fe873e>] ofd_attr_handle_ugid+0x1ce/0x3b0 [ofd]
 [<ffffffffa0febe4e>] ofd_commitrw_write+0x22e/0x1ad0 [ofd]
 [<ffffffffa0ff237e>] ? ofd_fmd_find_nolock+0xbe/0xe0 [ofd]
 [<ffffffffa0ff026d>] ofd_commitrw+0x51d/0xa40 [ofd]
 [<ffffffffa0d447a1>] obd_commitrw+0x2ec/0x32f [ptlrpc]
 [<ffffffffa0d1e861>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
 [<ffffffff812fe46b>] ? string.isra.7+0x3b/0xf0
 [<ffffffffa0c754a0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
 [<ffffffffa0d1ab15>] tgt_request_handle+0x915/0x1320 [ptlrpc]
 [<ffffffffa0cc6ccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
 [<ffffffffa0cc4888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
 [<ffffffffa0ccad80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
 [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
 [<ffffffffa0cca2e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
 [<ffffffff810a5aef>] kthread+0xcf/0xe0
 [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff816469d8>] ret_from_fork+0x58/0x90
 [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140

The kdump vmcore and vmcore-dmesg.txt were saved in:

trevis-2.trevis.hpdd.intel.com:/export/scratch/dumps/trevis-6vm3.trevis.hpdd.intel.com/10.9.4.57-2016-08-11-17:03:31/

Maloo report: https://testing.hpdd.intel.com/test_sets/c5407ab2-6002-11e6-b2e2-5254006e85c2
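
A hedged sketch of inspecting that dump with the crash(8) utility; the debuginfo path is an assumption and must match the 3.10.0-327.22.2.el7_lustre.x86_64 kernel the OSS was running:

crash /usr/lib/debug/lib/modules/3.10.0-327.22.2.el7_lustre.x86_64/vmlinux vmcore
# Inside crash:
#   log                    # replay the console ring buffer (soft lockup messages)
#   bt -a                  # backtraces for every CPU, including the spinning ll_ost_io thread
#   ps | grep ll_ost_io    # locate the stuck OST I/O service threads
#   mod -S                 # load Lustre module symbols, if module debuginfo is installed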

Comment by Bob Glossman (Inactive) [ 09/Sep/17 ]

More occurrences on master:
https://testing.hpdd.intel.com/test_sets/62ccafb6-9514-11e7-b89e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/bcc0ed1e-954d-11e7-b75f-5254006e85c2
https://testing.hpdd.intel.com/test_sets/2fb57e3e-97ab-11e7-b775-5254006e85c2

Comment by Andreas Dilger [ 05/Aug/20 ]

Closing this old ticket, as the issue has not been seen in a long time.
