[LU-10840] obdfilter-survey test 1a hangs with soft lockup on a client - ll_ping Created: 22/Mar/18  Updated: 22/Mar/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

obdfilter-survey test_1a hangs. For the failure at https://testing.hpdd.intel.com/test_sets/fc8fb130-2c62-11e8-b74b-52540065bddc, the last thing seen in the client test_log is

== obdfilter-survey test 1a: Object Storage Targets survey =========================================== 15:01:10 (1521558070)
Unable to detect ip address for host: ''
+ NETTYPE=tcp thrlo=8 nobjhi=1 thrhi=16 size=1024 case=disk rslt_loc=/tmp targets="10.9.4.174:lustre-OST0000 10.9.4.174:lustre-OST0001 10.9.4.174:lustre-OST0002 10.9.4.174:lustre-OST0003 10.9.4.174:lustre-OST0004 10.9.4.174:lustre-OST0005 10.9.4.174:lustre-OST0006 10.9.4.174:lustre-OST0007" /usr/bin/obdfilter-survey
Warning: Permanently added '10.9.4.174' (ECDSA) to the list of known hosts.
Tue Mar 20 15:01:22 UTC 2018 Obdfilter-survey for case=disk from trevis-15vm1.trevis.hpdd.intel.com
ost  8 sz  8388608K rsz 1024K obj    8 thr   64 write   11.65 [   0.00,   56.99] rewrite

 

On the client console we see

[16324.875232] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 15:01:10 (1521558070)
[17853.559782] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 60s! [ll_ping:14185]
[17855.065078] sched: RT throttling activated
[17925.031347] LNetError: 14173:0:(socklnd_cb.c:1732:ksocknal_recv_hello()) Error -104 reading HELLO from 10.9.4.174
[17934.795274] LNetError: 11b-b: Connection to 10.9.4.175@tcp at host 10.9.4.175 on port 7988 was reset: is it running a compatible version of Lustre and is 10.9.4.175@tcp one of its NIDs?
[17941.175915] Lustre: 14183:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1521559599/real 1521559599]  req@ffff88006510c300 x1595459204013520/t0(0) o400->lustre-MDT0000-mdc-ffff88004b976000@10.9.4.175@tcp:12/10 lens 224/224 e 0 to 1 dl 1521559606 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[17941.209818] Lustre: lustre-MDT0000-mdc-ffff88004b976000: Connection to lustre-MDT0000 (at 10.9.4.175@tcp) was lost; in progress operations using this service will wait for recovery to complete
[17855.189113] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon i2c_piix4 parport_pc nfsd parport nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ata_piix libata 8139too serio_raw virtio_pci 8139cp virtio_ring virtio mii i2c_core floppy [last unloaded: lnet_selftest]
[17855.189113] CPU: 1 PID: 14185 Comm: ll_ping Tainted: G           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
[17855.189113] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[17855.189113] task: ffff8800365d1fa0 ti: ffff88007b330000 task.ti: ffff88007b330000
[17855.189113] RIP: 0010:[<ffffffff81337c63>]  [<ffffffff81337c63>] memset+0x33/0xb0
[17855.189113] RSP: 0018:ffff88007b333d10  EFLAGS: 00010202
[17855.189113] RAX: 0000000000000000 RBX: ffffea0001a35b00 RCX: 0000000000000007
[17855.189113] RDX: 0000000000000300 RSI: 0000000000000000 RDI: ffff880068d6d000
[17855.189113] RBP: ffff88007b333d48 R08: 0000000000000000 R09: 0000000000000000
[17855.189113] R10: ffff880068d6cf00 R11: 0000000000000000 R12: 0000000180150015
[17855.189113] R13: ffff880068d6cf00 R14: 0000000000000000 R15: 0000000180150015
[17855.189113] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[17855.189113] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17855.189113] CR2: 00007fb76d57e000 CR3: 000000007acfa000 CR4: 00000000000606e0
[17855.189113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17855.189113] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17855.189113] Call Trace:
[17855.189113]  [<ffffffff811e4092>] ? kmem_cache_alloc+0x1c2/0x1f0
[17959.187506] LustreError: 166-1: MGC10.9.4.175@tcp: Connection to MGS (at 10.9.4.175@tcp) was lost; in progress operations using this service will fail
[17855.189113]  [<ffffffffc09ff7f7>] ? ptlrpc_request_cache_alloc+0x27/0x110 [ptlrpc]
[17855.189113]  [<ffffffffc09ff7f7>] ptlrpc_request_cache_alloc+0x27/0x110 [ptlrpc]
[17855.189113]  [<ffffffffc09ff905>] ptlrpc_request_alloc_internal+0x25/0x440 [ptlrpc]
[17855.189113]  [<ffffffffc0a0839e>] ptlrpc_request_alloc_pack+0x1e/0x60 [ptlrpc]
[17855.189113]  [<ffffffffc0a253e0>] ptlrpc_prep_ping+0x20/0x40 [ptlrpc]
[17855.189113]  [<ffffffffc0a258a8>] ptlrpc_pinger_main+0x378/0xb40 [ptlrpc]
[17855.189113]  [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
[17855.189113]  [<ffffffffc0a25530>] ? ptlrpc_obd_ping+0x130/0x130 [ptlrpc]
[17855.189113]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[17855.189113]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[17855.189113]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[17855.189113]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[17855.189113] Code: b8 01 01 01 01 01 01 01 01 48 0f af c1 41 89 f9 41 83 e1 07 75 70 48 89 d1 48 c1 e9 06 74 39 66 0f 1f 84 00 00 00 00 00 48 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89
[17855.189113] Kernel panic - not syncing: softlockup: hung tasks
[17855.189113] CPU: 1 PID: 14185 Comm: ll_ping Tainted: G           OEL ------------   3.10.0-693.21.1.el7.x86_64 #1
[17855.189113] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[17855.189113] Call Trace:
[17855.189113]  <IRQ>  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[17855.189113]  [<ffffffff816a8634>] panic+0xe8/0x21f
[17855.189113]  [<ffffffff8102d7cf>] ? show_regs+0x5f/0x210
[17855.189113]  [<ffffffff811334e1>] watchdog_timer_fn+0x231/0x240
[17855.189113]  [<ffffffff811332b0>] ? watchdog+0x40/0x40
[17855.189113]  [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
[17855.189113]  [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
[17855.189113]  [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
[17855.189113]  [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
[17855.189113]  [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
[17855.189113]  <EOI>  [<ffffffff81337c63>] ? memset+0x33/0xb0
[17855.189113]  [<ffffffff811e4092>] ? kmem_cache_alloc+0x1c2/0x1f0
[17855.189113]  [<ffffffffc09ff7f7>] ? ptlrpc_request_cache_alloc+0x27/0x110 [ptlrpc]
[17855.189113]  [<ffffffffc09ff7f7>] ptlrpc_request_cache_alloc+0x27/0x110 [ptlrpc]
[17855.189113]  [<ffffffffc09ff905>] ptlrpc_request_alloc_internal+0x25/0x440 [ptlrpc]
[17855.189113]  [<ffffffffc0a0839e>] ptlrpc_request_alloc_pack+0x1e/0x60 [ptlrpc]
[17855.189113]  [<ffffffffc0a253e0>] ptlrpc_prep_ping+0x20/0x40 [ptlrpc]
[17855.189113]  [<ffffffffc0a258a8>] ptlrpc_pinger_main+0x378/0xb40 [ptlrpc]
[17855.189113]  [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
[17855.189113]  [<ffffffffc0a25530>] ? ptlrpc_obd_ping+0x130/0x130 [ptlrpc]
[17855.189113]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[17855.189113]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[17855.189113]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[17855.189113]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40

I've found only one instance of this hang so far.


Generated at Sat Feb 10 02:38:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.