Details
-
Bug
-
Resolution: Cannot Reproduce
-
Minor
-
None
-
Lustre 2.10.1, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4
-
None
-
3
-
9223372036854775807
Description
obdfilter-survey test_1a, test_1c and test_2a fail with what looks like network issues. Looking at the client test_log from a recent failures, https://testing.hpdd.intel.com/test_sets/4f1abec0-50f5-11e8-abc3-52540065bddc , we see
Resetting fail_loc on all nodes...CMD: trevis-22vm1.trevis.hpdd.intel.com,trevis-22vm2,trevis-22vm3,trevis-22vm4,trevis-22vm5 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null pdsh@trevis-22vm1: trevis-22vm2: mcmd: connect failed: No route to host done. 02:07:26 (1525572446) waiting for trevis-22vm1.trevis.hpdd.intel.com network 5 secs ... 02:07:26 (1525572446) network interface is UP CMD: trevis-22vm1.trevis.hpdd.intel.com rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc 02:07:26 (1525572446) waiting for trevis-22vm2 network 5 secs ... Network not available!
For a test session that passes, we see that we wait for the network interface to come up and it eventually does;
01:05:51 (1525309551) waiting for trevis-16vm1.trevis.hpdd.intel.com network 5 secs ... 01:05:51 (1525309551) network interface is UP CMD: trevis-16vm1.trevis.hpdd.intel.com rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc 01:05:51 (1525309551) waiting for trevis-16vm2 network 5 secs ... 01:05:51 (1525309551) network interface is UP
On the console of trevis-22vm2, we see a soft lockup
[52813.979019] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 01:50:44 \(1525571444\) [52814.176674] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 01:50:44 (1525571444) [53232.624965] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [sssd_be:596] [53232.624965] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd nfsd joydev pcspkr virtio_balloon i2c_piix4 parport_pc parport nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk ata_piix drm 8139too libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring virtio 8139cp mii i2c_core floppy [last unloaded: lnet_selftest] [53232.624965] CPU: 0 PID: 596 Comm: sssd_be Tainted: G OE ------------ 3.10.0-693.21.1.el7.x86_64 #1 [53232.624965] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [53232.624965] task: ffff880036758fd0 ti: ffff88007ab78000 task.ti: ffff88007ab78000 [53232.624965] RIP: 0010:[<ffffffff812200c8>] [<ffffffff812200c8>] __d_lookup+0x68/0x160 [53232.624965] RSP: 0018:ffff88007ab7bd38 EFLAGS: 00010286 [53232.624965] RAX: ffffc9000011f818 RBX: 0000000a0000ffff RCX: 0000000000000012 [53232.624965] RDX: ffffc90000000000 RSI: ffff88007ab7bdd0 RDI: ffff88007cc04240 [53232.624965] RBP: ffff88007ab7bd78 R08: 0000000000000004 R09: ffffffff81279cc0 [53232.624965] R10: 0000000000000000 R11: ffff88007ab7bcce R12: ffff88007ab7bd40 [53232.624965] R13: ffff88007ab7be60 R14: ffff88007ab7be53 R15: ffffffff81332562 [53232.624965] FS: 00007f04cc1d8880(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 [53232.624965] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [53232.624965] CR2: 0000560dc7999880 CR3: 000000007abb2000 CR4: 00000000000606f0 [53232.624965] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [53232.624965] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [53232.624965] Call Trace: [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff812201ea>] d_lookup+0x2a/0x50 [53232.624965] [<ffffffff8127a43f>] proc_fill_cache+0x6f/0x180 [53232.624965] [<ffffffff81279cc0>] ? proc_pid_make_inode+0xf0/0xf0 [53232.624965] [<ffffffff8127b26e>] proc_pid_readdir+0x14e/0x1f0 [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff812761bf>] proc_root_readdir+0x3f/0x50 [53232.624965] [<ffffffff8121a216>] vfs_readdir+0xb6/0xe0 [53232.624965] [<ffffffff8121a635>] SyS_getdents+0x95/0x120 [53232.624965] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21 [53232.624965] [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146 [53232.624965] Code: 89 c2 d3 ea 01 d0 23 05 4b 50 91 00 48 8b 15 38 50 91 00 48 8d 04 c2 48 8b 18 48 83 e3 fe 75 0b eb 39 90 48 8b 1b 48 85 db 74 30 <44> 39 73 18 75 f2 4c 8d 7b 50 4c 89 ff e8 56 67 49 00 4c 39 63 [53232.624965] Kernel panic - not syncing: softlockup: hung tasks [53232.624965] CPU: 0 PID: 596 Comm: sssd_be Tainted: G OEL ------------ 3.10.0-693.21.1.el7.x86_64 #1 [53232.624965] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [53232.624965] Call Trace: [53232.624965] <IRQ> [<ffffffff816ae7c8>] dump_stack+0x19/0x1b [53232.624965] [<ffffffff816a8634>] panic+0xe8/0x21f [53232.624965] [<ffffffff8102d7cf>] ? show_regs+0x5f/0x210 [53232.624965] [<ffffffff811334e1>] watchdog_timer_fn+0x231/0x240 [53232.624965] [<ffffffff811332b0>] ? watchdog+0x40/0x40 [53232.624965] [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260 [53232.624965] [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0 [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60 [53232.624965] [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60 [53232.624965] [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170 [53232.624965] <EOI> [<ffffffff81279cc0>] ? proc_pid_make_inode+0xf0/0xf0 [53232.624965] [<ffffffff812200c8>] ? __d_lookup+0x68/0x160 [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff812201ea>] d_lookup+0x2a/0x50 [53232.624965] [<ffffffff8127a43f>] proc_fill_cache+0x6f/0x180 [53232.624965] [<ffffffff81279cc0>] ? proc_pid_make_inode+0xf0/0xf0 [53232.624965] [<ffffffff8127b26e>] proc_pid_readdir+0x14e/0x1f0 [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0 [53232.624965] [<ffffffff812761bf>] proc_root_readdir+0x3f/0x50 [53232.624965] [<ffffffff8121a216>] vfs_readdir+0xb6/0xe0 [53232.624965] [<ffffffff8121a635>] SyS_getdents+0x95/0x120 [53232.624965] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21 [53232.624965] [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
On the console for the OSS, we see
[52707.292360] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 01:50:44 (1525571444) [52707.499316] Lustre: DEBUG MARKER: lctl dl | grep obdfilter [52707.854211] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d @ [52708.863961] Lustre: Echo OBD driver; http://www.lustre.org/ [53152.432160] LNetError: 17668:0:(socklnd.c:1681:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.5.10@tcp[1], ip 10.9.5.10:1023, with error, wanted: 152, left: 152, last alive is 28 secs ago [53152.437669] LustreError: 17668:0:(events.c:304:request_in_callback()) event type 2, status -5, service ost [53152.440488] LustreError: 25177:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check [53152.445357] LustreError: 25177:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.9.5.10@tcp x1599622583354080 [53168.454267] Lustre: lustre-OST0006: haven't heard from client 0c1d0b38-84c1-ca60-829d-9aaff06c9a4a (at 10.9.5.10@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff88006b639800, cur 1525571906 expire 1525571876 last 1525571859 [53168.460481] Lustre: Skipped 7 previous similar messages [53702.588305] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
More failed tests and logs can be found at:
https://testing.hpdd.intel.com/test_sets/7cb76a18-4281-11e8-960d-52540065bddc
https://testing.hpdd.intel.com/test_sets/aa7e5422-425c-11e8-b45c-52540065bddc
https://testing.hpdd.intel.com/test_sets/b0f496da-2cdc-11e8-9e0e-52540065bddc
https://testing.hpdd.intel.com/test_sets/26cfa8ec-20b0-11e8-a4b1-52540065bddc
https://testing.hpdd.intel.com/test_sets/f685a540-0329-11e8-a7cd-52540065bddc
Attachments
Issue Links
- mentioned in
-
Page Loading...