[LU-10943] obdfilter-survey test_1c: NMI Watchdog | Created: 23/Apr/18 | Updated: 25/Apr/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a570f22c-46fe-11e8-b45c-52540065bddc

test_1c failed with the following error:

test_1c returned 1

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Nathaniel Clark [ 23/Apr/18 ] |
|
NMI watchdog and reboot on non-primary host:

[ 388.293114] Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch ================================ 12:58:06 (1524488286)
[ 992.631002] sched: RT throttling activated
[ 1177.833369] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [mandb:19085]
[ 1177.833369] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix virtio_blk drm 8139too libata crct10dif_pclmul crct10dif_common 8139cp virtio_pci crc32c_intel i2c_core virtio_ring mii serio_raw virtio floppy
[ 1177.833369] CPU: 0 PID: 19085 Comm: mandb Tainted: G OE ------------ 3.10.0-693.21.1.el7.x86_64 #1
[ 1177.833369] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 1177.833369] task: ffff880063f01fa0 ti: ffff880064514000 task.ti: ffff880064514000
[ 1177.833369] RIP: 0010:[<ffffffff81093fd7>] [<ffffffff81093fd7>] __do_softirq+0x97/0x280
[ 1177.833369] RSP: 0000:ffff88007fc03ec0 EFLAGS: 00000202
[ 1177.833369] RAX: ffff880064517fd8 RBX: ffff88007fc03e80 RCX: 0000000000000000
[ 1177.833369] RDX: 00000001000d0a14 RSI: 0000000000000000 RDI: ffff880063f01fa0
[ 1177.833369] RBP: ffff88007fc03f20 R08: 0000000000000000 R09: 0000000000004000
[ 1177.833369] R10: ffffffff81fea640 R11: 0000000000007ffe R12: ffff88007fc03e38
[ 1177.833369] R13: ffffffff816c1732 R14: ffff88007fc03f20 R15: ffff880064517fd8
[ 1177.833369] FS: 00007f3bac50c740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 1177.833369] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1177.833369] CR2: 00007f3bac50f310 CR3: 0000000079ff8000 CR4: 00000000000606f0
[ 1177.833369] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1177.833369] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1177.833369] Call Trace:
[ 1177.833369] <IRQ>
[ 1177.833369] [<ffffffff816c3afc>] call_softirq+0x1c/0x30
[ 1177.833369] [<ffffffff8102d435>] do_softirq+0x65/0xa0
[ 1177.833369] [<ffffffff810943b5>] irq_exit+0x105/0x110
[ 1177.833369] [<ffffffff816c4d96>] do_IRQ+0x56/0xf0
[ 1177.833369] [<ffffffff816b7362>] common_interrupt+0x162/0x162
[ 1177.833369] <EOI>
[ 1177.833369] [<ffffffff813363b9>] ? copy_page+0x49/0xe0
[ 1212.815422] Lustre: 12540:0:(client.c:2099:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1524489052/real 0] req@ffff8800786d2d00 x1598541594112512/t0(0) o400->lustre-MDT0000-mdc-ffff88007bae7000@10.9.6.112@tcp:12/10 lens 224/224 e 0 to 1 dl 1524489060 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1212.815458] Lustre: lustre-MDT0000-mdc-ffff88007bae7000: Connection to lustre-MDT0000 (at 10.9.6.112@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1212.815518] LustreError: 166-1: MGC10.9.6.112@tcp: Connection to MGS (at 10.9.6.112@tcp) was lost; in progress operations using this service will fail
[ 1212.815685] [<ffffffff811b26fe>] ? wp_page_copy.isra.58+0xee/0x470
[ 1212.815685] [<ffffffff811b448b>] do_wp_page+0xeb/0x5c0
[ 1212.815685] [<ffffffff811b1f16>] ? do_read_fault.isra.44+0xe6/0x130
[ 1212.815685] [<ffffffff811b67fc>] handle_mm_fault+0x70c/0xfa0
[ 1212.815685] [<ffffffff816bb504>] __do_page_fault+0x154/0x450
[ 1212.815685] [<ffffffff816bb835>] do_page_fault+0x35/0x90
[ 1212.815685] [<ffffffff816b7ab6>] ? error_swapgs+0xa7/0xbd
[ 1212.815685] [<ffffffff816b7768>] page_fault+0x28/0x30
[ 1212.815685] Code: 65 8b 0d 71 a0 f7 7e c7 45 a4 0a 00 00 00 89 4d d0 48 89 45 c0 48 89 45 c8 0f 1f 00 65 c7 05 ad 40 f8 7e 00 00 00 00 fb 66 66 90 <66> 66 90 49 c7 c4 c0 b0 9f 81 eb 0e 0f 1f 44 00 00 49 83 c4 08
[ 1212.815685] Kernel panic - not syncing: softlockup: hung tasks
[ 1219.215471] CPU: 0 PID: 19085 Comm: mandb Tainted: G OEL ------------ 3.10.0-693.21.1.el7.x86_64 #1
[ 1219.215471] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 1219.215471] Call Trace:
[ 1219.215471] <IRQ> [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[ 1219.215471] [<ffffffff816a8634>] panic+0xe8/0x21f
[ 1219.215471] [<ffffffff8102d7cf>] ? show_regs+0x5f/0x210
[ 1219.215471] [<ffffffff811334e1>] watchdog_timer_fn+0x231/0x240
[ 1219.215471] [<ffffffff811332b0>] ? watchdog+0x40/0x40
[ 1219.215471] [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
[ 1219.215471] [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
[ 1219.215471] [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
[ 1219.215471] [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
[ 1219.215471] [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
[ 1219.215471] [<ffffffff81093fd7>] ? __do_softirq+0x97/0x280
[ 1219.215471] [<ffffffff816c3afc>] call_softirq+0x1c/0x30
[ 1219.215471] [<ffffffff8102d435>] do_softirq+0x65/0xa0
[ 1219.215471] [<ffffffff810943b5>] irq_exit+0x105/0x110
[ 1219.215471] [<ffffffff816c4d96>] do_IRQ+0x56/0xf0
[ 1219.215471] [<ffffffff816b7362>] common_interrupt+0x162/0x162
[ 1219.215471] <EOI> [<ffffffff813363b9>] ? copy_page+0x49/0xe0
[ 1219.215471] [<ffffffff811b26fe>] ? wp_page_copy.isra.58+0xee/0x470
[ 1219.215471] [<ffffffff811b448b>] do_wp_page+0xeb/0x5c0
[ 1219.215471] [<ffffffff811b1f16>] ? do_read_fault.isra.44+0xe6/0x130
[ 1219.215471] [<ffffffff811b67fc>] handle_mm_fault+0x70c/0xfa0
[ 1219.215471] [<ffffffff816bb504>] __do_page_fault+0x154/0x450
[ 1219.215471] [<ffffffff816bb835>] do_page_fault+0x35/0x90
[ 1219.215471] [<ffffffff816b7ab6>] ? error_swapgs+0xa7/0xbd
[ 1219.215471] [<ffffffff816b7768>] page_fault+0x28/0x30 |
| Comment by Sarah Liu [ 24/Apr/18 ] |
|
+1 on master tag-2.11.51 RHEL7.4 DNE ldiskfs https://testing.hpdd.intel.com/test_sets/40c8f270-47a3-11e8-960d-52540065bddc |