[LU-10935] sanityn test 16a hangs with ‘NMI watchdog: BUG: soft lockup’ - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.0
Labels:
- DNE
- zfs
Environment:
DNE/ZFS

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

sanityn test_16a hangs. The last thing seen in the client test_log is

== sanityn test 16a: 500 iterations of dual-mount fsx ================================================ 03:55:24 (1524196524)
CMD: trevis-4vm9 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize

Core file can be found in /scratch/dumps/trevis-4vm8.trevis.hpdd.intel.com/10.9.4.36-2018-04-20-03:58:14

In the OST console, we can see the following stack trace

[10146.583339] Lustre: DEBUG MARKER: == sanityn test 16a: 500 iterations of dual-mount fsx ================================================ 03:55:24 (1524196524)
[10252.159310] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [ll_ost_io00_003:23498]
[10252.159310] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core dm_mod zfs(POE) zunicode(POE) zavl(POE) icp(POE) iosf_mbi crc32_pclmul ghash_clmulni_intel zcommon(POE) znvpair(POE) ppdev spl(OE) aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix virtio_blk cirrus drm_kms_helper libata syscopyarea sysfillrect 8139too sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common drm crc32c_intel serio_raw virtio_pci virtio_ring virtio 8139cp i2c_core mii floppy
[10252.159310] CPU: 1 PID: 23498 Comm: ll_ost_io00_003 Tainted: P           OE  ------------   3.10.0-693.21.1.el7_lustre.x86_64 #1
[10252.159310] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[10252.159310] task: ffff880059e56eb0 ti: ffff880059538000 task.ti: ffff880059538000
[10252.159310] RIP: 0010:[<ffffffff81337f93>]  [<ffffffff81337f93>] memset+0x33/0xb0
[10252.159310] RSP: 0018:ffff88005953b820  EFLAGS: 00010212
[10252.159310] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000fbf
[10252.159310] RDX: 0000000000100000 RSI: 0000000000000000 RDI: ffffc900140ec000
[10252.159310] RBP: ffff88005953b8c0 R08: ffff880036aab690 R09: 0000000000000000
[10252.159310] R10: ffffc9001402b000 R11: ffffea000116bc00 R12: ffffffffc04a7d3b
[10252.159310] R13: ffff88005953b7b0 R14: ffff880059a5b000 R15: ffffffffc056e667
[10252.159310] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[10252.159310] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10252.159310] CR2: 00007fd2559ae000 CR3: 0000000079462000 CR4: 00000000000606e0
[10252.159310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10252.159310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10252.159310] Call Trace:
[10252.159310]  [<ffffffffc04b2cc7>] ? dbuf_read+0x397/0x9e0 [zfs]
[10252.159310]  [<ffffffffc0575128>] ? zio_done+0x748/0xd20 [zfs]
[10252.159310]  [<ffffffffc056eb8c>] ? zio_destroy+0x7c/0x80 [zfs]
[10252.159310]  [<ffffffffc04b66d9>] dmu_buf_will_dirty+0x119/0x130 [zfs]
[10252.159310]  [<ffffffffc04bcf15>] dmu_write_impl+0x45/0xd0 [zfs]
[10252.159310]  [<ffffffffc04beb57>] dmu_write.part.7+0xa7/0x110 [zfs]
[10252.159310]  [<ffffffffc04bed36>] dmu_assign_arcbuf+0x156/0x1a0 [zfs]
[10252.159310]  [<ffffffffc10cfdcd>] osd_write_commit+0x46d/0xa00 [osd_zfs]
[10252.159310]  [<ffffffffc120b29a>] ofd_commitrw_write+0xf9a/0x1d00 [ofd]
[10252.159310]  [<ffffffffc120f112>] ofd_commitrw+0x4b2/0xa10 [ofd]
[10252.159310]  [<ffffffffc0c36c39>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[10252.159310]  [<ffffffffc0ef8430>] tgt_brw_write+0x1180/0x1d70 [ptlrpc]
[10252.159310] INFO: rcu_sched detected stalls on CPUs/tasks:
[10252.159310]  [<ffffffffc0e4b940>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[10252.159310]  [<ffffffffc0ef994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[10252.159310]  [<ffffffffc0e9da53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[10252.159310]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
[10252.159310]  [<ffffffffc0ea1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
[10252.159310]  [<ffffffffc0ea0770>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[10252.159310]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[10252.159310]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10252.159310]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10252.159310]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10252.159310] Code: b8 01 01 01 01 01 01 01 01 48 0f af c1 41 89 f9 41 83 e1 07 75 70 48 89 d1 48 c1 e9 06 74 39 66 0f 1f 84 00 00 00 00 00 48 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 
…

[10274.555013] } (detected by 0, t=60299 jiffies, g=942723, c=942722, q=34)
[10274.555013] Task dump for CPU 1:
[10274.555013] ll_ost_io00_003 R  running task        0 23498      2 0x00000088
[10274.555013] Call Trace:
[10274.555013]  [<ffffffffc04bed36>] ? dmu_assign_arcbuf+0x156/0x1a0 [zfs]
[10274.555013]  [<ffffffffc10cfdcd>] ? osd_write_commit+0x46d/0xa00 [osd_zfs]
[10274.555013]  [<ffffffffc120b29a>] ? ofd_commitrw_write+0xf9a/0x1d00 [ofd]
[10274.555013]  [<ffffffffc120f112>] ? ofd_commitrw+0x4b2/0xa10 [ofd]
[10274.555013]  [<ffffffffc0c36c39>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[10274.555013]  [<ffffffffc0ef8430>] ? tgt_brw_write+0x1180/0x1d70 [ptlrpc]
[10274.555013]  [<ffffffffc0e4b940>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[10274.555013]  [<ffffffffc0ef994a>] ? tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[10274.555013]  [<ffffffffc0e9da53>] ? ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[10274.555013]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
[10274.555013]  [<ffffffffc0ea1202>] ? ptlrpc_main+0xa92/0x1f40 [ptlrpc]
[10274.555013]  [<ffffffffc0ea0770>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[10274.555013]  [<ffffffff810b4031>] ? kthread+0xd1/0xe0
[10274.555013]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10274.555013]  [<ffffffff816c0577>] ? ret_from_fork+0x77/0xb0
[10274.555013]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10304.489700] WARNING: MMP writes to pool 'lustre-ost3' have not succeeded in over 54s; suspending pool

We’ve only seen this failure once. The failure seen is for review-dne-zfs. Logs are at
https://testing.hpdd.intel.com/test_sets/614c9d28-444f-11e8-95c0-52540065bddc

sanityn test 16a hangs with ‘NMI watchdog: BUG: soft lockup’

Details

Description

Attachments

Activity

People

Dates