[LU-10840] obdfilter-survey test 1a hangs with soft lockup on a client - ll_ping Created: 22/Mar/18 Updated: 22/Mar/18
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
obdfilter-survey test_1a hangs. For the failure at https://testing.hpdd.intel.com/test_sets/fc8fb130-2c62-11e8-b74b-52540065bddc, the last thing seen in the client test_log is:

== obdfilter-survey test 1a: Object Storage Targets survey =========================================== 15:01:10 (1521558070)
Unable to detect ip address for host: ''
+ NETTYPE=tcp thrlo=8 nobjhi=1 thrhi=16 size=1024 case=disk rslt_loc=/tmp targets="10.9.4.174:lustre-OST0000 10.9.4.174:lustre-OST0001 10.9.4.174:lustre-OST0002 10.9.4.174:lustre-OST0003 10.9.4.174:lustre-OST0004 10.9.4.174:lustre-OST0005 10.9.4.174:lustre-OST0006 10.9.4.174:lustre-OST0007" /usr/bin/obdfilter-survey
Warning: Permanently added '10.9.4.174' (ECDSA) to the list of known hosts.
Tue Mar 20 15:01:22 UTC 2018 Obdfilter-survey for case=disk from trevis-15vm1.trevis.hpdd.intel.com
ost 8 sz 8388608K rsz 1024K obj 8 thr 64 write 11.65 [ 0.00, 56.99] rewrite
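For reference, the hang can be attempted by hand with the same survey parameters as the failing run; this is only a sketch, and the OSS NID 10.9.4.174 and the lustre-OST0000..OST0007 target names are specific to the failing trevis cluster and would need to be replaced for any other setup:

# Sketch of re-running the same survey step by hand (parameters copied from
# the test_log above; substitute your own OSS NID and OST names).
NETTYPE=tcp thrlo=8 nobjhi=1 thrhi=16 size=1024 case=disk rslt_loc=/tmp \
  targets="10.9.4.174:lustre-OST0000 10.9.4.174:lustre-OST0001 10.9.4.174:lustre-OST0002 10.9.4.174:lustre-OST0003 10.9.4.174:lustre-OST0004 10.9.4.174:lustre-OST0005 10.9.4.174:lustre-OST0006 10.9.4.174:lustre-OST0007" \
  /usr/bin/obdfilter-survey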
On the client console, we see:

[16324.875232] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 15:01:10 (1521558070)
[17853.559782] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 60s! [ll_ping:14185]
[17855.065078] sched: RT throttling activated
[17925.031347] LNetError: 14173:0:(socklnd_cb.c:1732:ksocknal_recv_hello()) Error -104 reading HELLO from 10.9.4.174
[17934.795274] LNetError: 11b-b: Connection to 10.9.4.175@tcp at host 10.9.4.175 on port 7988 was reset: is it running a compatible version of Lustre and is 10.9.4.175@tcp one of its NIDs?
[17941.175915] Lustre: 14183:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1521559599/real 1521559599] req@ffff88006510c300 x1595459204013520/t0(0) o400->lustre-MDT0000-mdc-ffff88004b976000@10.9.4.175@tcp:12/10 lens 224/224 e 0 to 1 dl 1521559606 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[17941.209818] Lustre: lustre-MDT0000-mdc-ffff88004b976000: Connection to lustre-MDT0000 (at 10.9.4.175@tcp) was lost; in progress operations using this service will wait for recovery to complete
[17855.189113] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon i2c_piix4 parport_pc nfsd parport nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ata_piix libata 8139too serio_raw virtio_pci 8139cp virtio_ring virtio mii i2c_core floppy [last unloaded: lnet_selftest]
[17855.189113] CPU: 1 PID: 14185 Comm: ll_ping Tainted: G OE ------------ 3.10.0-693.21.1.el7.x86_64 #1
[17855.189113] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[17855.189113] task: ffff8800365d1fa0 ti: ffff88007b330000 task.ti: ffff88007b330000
[17855.189113] RIP: 0010:[<ffffffff81337c63>] [<ffffffff81337c63>] memset+0x33/0xb0
[17855.189113] RSP: 0018:ffff88007b333d10 EFLAGS: 00010202
[17855.189113] RAX: 0000000000000000 RBX: ffffea0001a35b00 RCX: 0000000000000007
[17855.189113] RDX: 0000000000000300 RSI: 0000000000000000 RDI: ffff880068d6d000
[17855.189113] RBP: ffff88007b333d48 R08: 0000000000000000 R09: 0000000000000000
[17855.189113] R10: ffff880068d6cf00 R11: 0000000000000000 R12: 0000000180150015
[17855.189113] R13: ffff880068d6cf00 R14: 0000000000000000 R15: 0000000180150015
[17855.189113] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[17855.189113] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17855.189113] CR2: 00007fb76d57e000 CR3: 000000007acfa000 CR4: 00000000000606e0
[17855.189113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17855.189113] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17855.189113] Call Trace:
[17855.189113] [<ffffffff811e4092>] ? kmem_cache_alloc+0x1c2/0x1f0
[17959.187506] LustreError: 166-1: MGC10.9.4.175@tcp: Connection to MGS (at 10.9.4.175@tcp) was lost; in progress operations using this service will fail
[17855.189113] [<ffffffffc09ff7f7>] ? ptlrpc_request_cache_alloc+0x27/0x110 [ptlrpc]
[17855.189113] [<ffffffffc09ff7f7>] ptlrpc_request_cache_alloc+0x27/0x110 [ptlrpc]
[17855.189113] [<ffffffffc09ff905>] ptlrpc_request_alloc_internal+0x25/0x440 [ptlrpc]
[17855.189113] [<ffffffffc0a0839e>] ptlrpc_request_alloc_pack+0x1e/0x60 [ptlrpc]
[17855.189113] [<ffffffffc0a253e0>] ptlrpc_prep_ping+0x20/0x40 [ptlrpc]
[17855.189113] [<ffffffffc0a258a8>] ptlrpc_pinger_main+0x378/0xb40 [ptlrpc]
[17855.189113] [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
[17855.189113] [<ffffffffc0a25530>] ? ptlrpc_obd_ping+0x130/0x130 [ptlrpc]
[17855.189113] [<ffffffff810b4031>] kthread+0xd1/0xe0
[17855.189113] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[17855.189113] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[17855.189113] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[17855.189113] Code: b8 01 01 01 01 01 01 01 01 48 0f af c1 41 89 f9 41 83 e1 07 75 70 48 89 d1 48 c1 e9 06 74 39 66 0f 1f 84 00 00 00 00 00 48 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89
[17855.189113] Kernel panic - not syncing: softlockup: hung tasks
[17855.189113] CPU: 1 PID: 14185 Comm: ll_ping Tainted: G OEL ------------ 3.10.0-693.21.1.el7.x86_64 #1
[17855.189113] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[17855.189113] Call Trace:
[17855.189113] <IRQ> [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[17855.189113] [<ffffffff816a8634>] panic+0xe8/0x21f
[17855.189113] [<ffffffff8102d7cf>] ? show_regs+0x5f/0x210
[17855.189113] [<ffffffff811334e1>] watchdog_timer_fn+0x231/0x240
[17855.189113] [<ffffffff811332b0>] ? watchdog+0x40/0x40
[17855.189113] [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
[17855.189113] [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
[17855.189113] [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
[17855.189113] [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
[17855.189113] [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
[17855.189113] <EOI> [<ffffffff81337c63>] ? memset+0x33/0xb0
[17855.189113] [<ffffffff811e4092>] ? kmem_cache_alloc+0x1c2/0x1f0
[17855.189113] [<ffffffffc09ff7f7>] ? ptlrpc_request_cache_alloc+0x27/0x110 [ptlrpc]
[17855.189113] [<ffffffffc09ff7f7>] ptlrpc_request_cache_alloc+0x27/0x110 [ptlrpc]
[17855.189113] [<ffffffffc09ff905>] ptlrpc_request_alloc_internal+0x25/0x440 [ptlrpc]
[17855.189113] [<ffffffffc0a0839e>] ptlrpc_request_alloc_pack+0x1e/0x60 [ptlrpc]
[17855.189113] [<ffffffffc0a253e0>] ptlrpc_prep_ping+0x20/0x40 [ptlrpc]
[17855.189113] [<ffffffffc0a258a8>] ptlrpc_pinger_main+0x378/0xb40 [ptlrpc]
[17855.189113] [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
[17855.189113] [<ffffffffc0a25530>] ? ptlrpc_obd_ping+0x130/0x130 [ptlrpc]
[17855.189113] [<ffffffff810b4031>] kthread+0xd1/0xe0
[17855.189113] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[17855.189113] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[17855.189113] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40

I've found only one instance of this hang so far.