[LU-10943] obdfilter-survey test_1c: NMI Watchdog
Created: 23/Apr/18
Updated: 25/Apr/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Severity: 3

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a570f22c-46fe-11e8-b45c-52540065bddc

test_1c failed with the following error:

test_1c returned 1

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
obdfilter-survey test_1c - test_1c returned 1



 Comments   
Comment by Nathaniel Clark [ 23/Apr/18 ]

NMI watchdog soft lockup on a non-primary host, followed by a kernel panic and reboot:

[  388.293114] Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch ================================ 12:58:06 (1524488286)
[  992.631002] sched: RT throttling activated
[ 1177.833369] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [mandb:19085]
[ 1177.833369] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix virtio_blk drm 8139too libata crct10dif_pclmul crct10dif_common 8139cp virtio_pci crc32c_intel i2c_core virtio_ring mii serio_raw virtio floppy
[ 1177.833369] CPU: 0 PID: 19085 Comm: mandb Tainted: G           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
[ 1177.833369] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 1177.833369] task: ffff880063f01fa0 ti: ffff880064514000 task.ti: ffff880064514000
[ 1177.833369] RIP: 0010:[<ffffffff81093fd7>]  [<ffffffff81093fd7>] __do_softirq+0x97/0x280
[ 1177.833369] RSP: 0000:ffff88007fc03ec0  EFLAGS: 00000202
[ 1177.833369] RAX: ffff880064517fd8 RBX: ffff88007fc03e80 RCX: 0000000000000000
[ 1177.833369] RDX: 00000001000d0a14 RSI: 0000000000000000 RDI: ffff880063f01fa0
[ 1177.833369] RBP: ffff88007fc03f20 R08: 0000000000000000 R09: 0000000000004000
[ 1177.833369] R10: ffffffff81fea640 R11: 0000000000007ffe R12: ffff88007fc03e38
[ 1177.833369] R13: ffffffff816c1732 R14: ffff88007fc03f20 R15: ffff880064517fd8
[ 1177.833369] FS:  00007f3bac50c740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 1177.833369] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1177.833369] CR2: 00007f3bac50f310 CR3: 0000000079ff8000 CR4: 00000000000606f0
[ 1177.833369] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1177.833369] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1177.833369] Call Trace:
[ 1177.833369]  <IRQ> 
[ 1177.833369]  [<ffffffff816c3afc>] call_softirq+0x1c/0x30
[ 1177.833369]  [<ffffffff8102d435>] do_softirq+0x65/0xa0
[ 1177.833369]  [<ffffffff810943b5>] irq_exit+0x105/0x110
[ 1177.833369]  [<ffffffff816c4d96>] do_IRQ+0x56/0xf0
[ 1177.833369]  [<ffffffff816b7362>] common_interrupt+0x162/0x162
[ 1177.833369]  <EOI> 
[ 1177.833369]  [<ffffffff813363b9>] ? copy_page+0x49/0xe0
[ 1212.815422] Lustre: 12540:0:(client.c:2099:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1524489052/real 0]  req@ffff8800786d2d00 x1598541594112512/t0(0) o400->lustre-MDT0000-mdc-ffff88007bae7000@10.9.6.112@tcp:12/10 lens 224/224 e 0 to 1 dl 1524489060 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1212.815458] Lustre: lustre-MDT0000-mdc-ffff88007bae7000: Connection to lustre-MDT0000 (at 10.9.6.112@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1212.815518] LustreError: 166-1: MGC10.9.6.112@tcp: Connection to MGS (at 10.9.6.112@tcp) was lost; in progress operations using this service will fail
[ 1212.815685]  [<ffffffff811b26fe>] ? wp_page_copy.isra.58+0xee/0x470
[ 1212.815685]  [<ffffffff811b448b>] do_wp_page+0xeb/0x5c0
[ 1212.815685]  [<ffffffff811b1f16>] ? do_read_fault.isra.44+0xe6/0x130
[ 1212.815685]  [<ffffffff811b67fc>] handle_mm_fault+0x70c/0xfa0
[ 1212.815685]  [<ffffffff816bb504>] __do_page_fault+0x154/0x450
[ 1212.815685]  [<ffffffff816bb835>] do_page_fault+0x35/0x90
[ 1212.815685]  [<ffffffff816b7ab6>] ? error_swapgs+0xa7/0xbd
[ 1212.815685]  [<ffffffff816b7768>] page_fault+0x28/0x30
[ 1212.815685] Code: 65 8b 0d 71 a0 f7 7e c7 45 a4 0a 00 00 00 89 4d d0 48 89 45 c0 48 89 45 c8 0f 1f 00 65 c7 05 ad 40 f8 7e 00 00 00 00 fb 66 66 90 <66> 66 90 49 c7 c4 c0 b0 9f 81 eb 0e 0f 1f 44 00 00 49 83 c4 08 
[ 1212.815685] Kernel panic - not syncing: softlockup: hung tasks
[ 1219.215471] CPU: 0 PID: 19085 Comm: mandb Tainted: G           OEL ------------   3.10.0-693.21.1.el7.x86_64 #1
[ 1219.215471] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 1219.215471] Call Trace:
[ 1219.215471]  <IRQ>  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[ 1219.215471]  [<ffffffff816a8634>] panic+0xe8/0x21f
[ 1219.215471]  [<ffffffff8102d7cf>] ? show_regs+0x5f/0x210
[ 1219.215471]  [<ffffffff811334e1>] watchdog_timer_fn+0x231/0x240
[ 1219.215471]  [<ffffffff811332b0>] ? watchdog+0x40/0x40
[ 1219.215471]  [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
[ 1219.215471]  [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
[ 1219.215471]  [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
[ 1219.215471]  [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
[ 1219.215471]  [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
[ 1219.215471]  [<ffffffff81093fd7>] ? __do_softirq+0x97/0x280
[ 1219.215471]  [<ffffffff816c3afc>] call_softirq+0x1c/0x30
[ 1219.215471]  [<ffffffff8102d435>] do_softirq+0x65/0xa0
[ 1219.215471]  [<ffffffff810943b5>] irq_exit+0x105/0x110
[ 1219.215471]  [<ffffffff816c4d96>] do_IRQ+0x56/0xf0
[ 1219.215471]  [<ffffffff816b7362>] common_interrupt+0x162/0x162
[ 1219.215471]  <EOI>  [<ffffffff813363b9>] ? copy_page+0x49/0xe0
[ 1219.215471]  [<ffffffff811b26fe>] ? wp_page_copy.isra.58+0xee/0x470
[ 1219.215471]  [<ffffffff811b448b>] do_wp_page+0xeb/0x5c0
[ 1219.215471]  [<ffffffff811b1f16>] ? do_read_fault.isra.44+0xe6/0x130
[ 1219.215471]  [<ffffffff811b67fc>] handle_mm_fault+0x70c/0xfa0
[ 1219.215471]  [<ffffffff816bb504>] __do_page_fault+0x154/0x450
[ 1219.215471]  [<ffffffff816bb835>] do_page_fault+0x35/0x90
[ 1219.215471]  [<ffffffff816b7ab6>] ? error_swapgs+0xa7/0xbd
[ 1219.215471]  [<ffffffff816b7768>] page_fault+0x28/0x30
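
When checking other runs for the same signature, the soft-lockup line in a console log like the one above can be picked out mechanically. The following is a minimal sketch only; it is not part of Maloo or the Lustre test framework, the function name `find_soft_lockups` is hypothetical, and the pattern assumes the exact message format seen in this log.

```python
import re

# Matches the soft-lockup message format seen in the console log above.
# Assumption: other kernels may word this message differently.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup report found in the given log lines."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            events.append({
                "timestamp": float(m.group("ts")),  # seconds since boot
                "cpu": int(m.group("cpu")),
                "stuck_seconds": int(m.group("secs")),
                "comm": m.group("comm"),            # task name
                "pid": int(m.group("pid")),
            })
    return events


if __name__ == "__main__":
    sample = [
        "[  992.631002] sched: RT throttling activated",
        "[ 1177.833369] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [mandb:19085]",
    ]
    for event in find_soft_lockups(sample):
        print(event)
```

Run against the log in this comment, the sketch reports a single event: CPU 0 stuck for 23 s in task mandb (PID 19085) at 1177.833369 s after boot.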
Comment by Sarah Liu [ 24/Apr/18 ]

+1: also seen on master, tag-2.11.51, RHEL7.4, DNE, ldiskfs:

https://testing.hpdd.intel.com/test_sets/40c8f270-47a3-11e8-960d-52540065bddc
