Lustre / LU-11005

obdfilter-survey test 1a, 1c and 2a fail with soft lockup/message "Network not available!"


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Affects Versions: Lustre 2.10.1, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4
    • Severity: 3

    Description

      obdfilter-survey test_1a, test_1c and test_2a fail with what looks like network issues. Looking at the client test_log from a recent failure, https://testing.hpdd.intel.com/test_sets/4f1abec0-50f5-11e8-abc3-52540065bddc , we see:

      Resetting fail_loc on all nodes...CMD: trevis-22vm1.trevis.hpdd.intel.com,trevis-22vm2,trevis-22vm3,trevis-22vm4,trevis-22vm5 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      pdsh@trevis-22vm1: trevis-22vm2: mcmd: connect failed: No route to host
      done.
      02:07:26 (1525572446) waiting for trevis-22vm1.trevis.hpdd.intel.com network 5 secs ...
      02:07:26 (1525572446) network interface is UP
      CMD: trevis-22vm1.trevis.hpdd.intel.com rc=0;
      			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
      				echo \$(hostname -s): \$val;
      				rc=\$val;
      			fi;
      			exit \$rc
      02:07:26 (1525572446) waiting for trevis-22vm2 network 5 secs ...
      Network not available!
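
The "Network not available!" line is printed when the framework's bounded network wait (the "waiting for trevis-22vm2 network 5 secs" step above) runs out of retries before the node answers. As a rough illustration of that pattern (a hedged sketch only, not Lustre's actual test-framework code; the function name and probe commands here are invented for the example):

```shell
# Sketch of a bounded network-wait loop like the one producing the
# messages above (illustrative; not the real test-framework code).
# $1 = host label for the message, $2 = max seconds, rest = probe command.
wait_for_network() {
    host=$1; max=$2; shift 2
    printf 'waiting for %s network %s secs ...\n' "$host" "$max"
    i=0
    while [ "$i" -lt "$max" ]; do
        # The real framework probes the remote node (e.g. via pdsh);
        # here the caller supplies an arbitrary command as the probe.
        if "$@" >/dev/null 2>&1; then
            echo "network interface is UP"
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    echo "Network not available!"
    return 1
}

# 'true' stands in for a probe that answers at once (the passing run);
# 'false' for one that never answers (the failing run).
wait_for_network trevis-22vm1 5 true
wait_for_network trevis-22vm2 2 false || echo "giving up, as in the failed run"
```

In the failing run above, the probe never succeeds because trevis-22vm2 is stuck in the soft lockup shown below, so the wait expires and the test aborts.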
      

      For a test session that passes, we wait for the network interface to come up and it eventually does:

      01:05:51 (1525309551) waiting for trevis-16vm1.trevis.hpdd.intel.com network 5 secs ...
      01:05:51 (1525309551) network interface is UP
      CMD: trevis-16vm1.trevis.hpdd.intel.com rc=0;
      			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
      				echo \$(hostname -s): \$val;
      				rc=\$val;
      			fi;
      			exit \$rc
      01:05:51 (1525309551) waiting for trevis-16vm2 network 5 secs ...
      01:05:51 (1525309551) network interface is UP
      

      On the console of trevis-22vm2, we see a soft lockup:

      [52813.979019] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 01:50:44 \(1525571444\)
      [52814.176674] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 01:50:44 (1525571444)
      [53232.624965] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [sssd_be:596]
      [53232.624965] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd nfsd joydev pcspkr virtio_balloon i2c_piix4 parport_pc parport nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk ata_piix drm 8139too libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring virtio 8139cp mii i2c_core floppy [last unloaded: lnet_selftest]
      [53232.624965] CPU: 0 PID: 596 Comm: sssd_be Tainted: G           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
      [53232.624965] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [53232.624965] task: ffff880036758fd0 ti: ffff88007ab78000 task.ti: ffff88007ab78000
      [53232.624965] RIP: 0010:[<ffffffff812200c8>]  [<ffffffff812200c8>] __d_lookup+0x68/0x160
      [53232.624965] RSP: 0018:ffff88007ab7bd38  EFLAGS: 00010286
      [53232.624965] RAX: ffffc9000011f818 RBX: 0000000a0000ffff RCX: 0000000000000012
      [53232.624965] RDX: ffffc90000000000 RSI: ffff88007ab7bdd0 RDI: ffff88007cc04240
      [53232.624965] RBP: ffff88007ab7bd78 R08: 0000000000000004 R09: ffffffff81279cc0
      [53232.624965] R10: 0000000000000000 R11: ffff88007ab7bcce R12: ffff88007ab7bd40
      [53232.624965] R13: ffff88007ab7be60 R14: ffff88007ab7be53 R15: ffffffff81332562
      [53232.624965] FS:  00007f04cc1d8880(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      [53232.624965] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [53232.624965] CR2: 0000560dc7999880 CR3: 000000007abb2000 CR4: 00000000000606f0
      [53232.624965] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [53232.624965] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [53232.624965] Call Trace:
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff812201ea>] d_lookup+0x2a/0x50
      [53232.624965]  [<ffffffff8127a43f>] proc_fill_cache+0x6f/0x180
      [53232.624965]  [<ffffffff81279cc0>] ? proc_pid_make_inode+0xf0/0xf0
      [53232.624965]  [<ffffffff8127b26e>] proc_pid_readdir+0x14e/0x1f0
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff812761bf>] proc_root_readdir+0x3f/0x50
      [53232.624965]  [<ffffffff8121a216>] vfs_readdir+0xb6/0xe0
      [53232.624965]  [<ffffffff8121a635>] SyS_getdents+0x95/0x120
      [53232.624965]  [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
      [53232.624965]  [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
      [53232.624965] Code: 89 c2 d3 ea 01 d0 23 05 4b 50 91 00 48 8b 15 38 50 91 00 48 8d 04 c2 48 8b 18 48 83 e3 fe 75 0b eb 39 90 48 8b 1b 48 85 db 74 30 <44> 39 73 18 75 f2 4c 8d 7b 50 4c 89 ff e8 56 67 49 00 4c 39 63 
      [53232.624965] Kernel panic - not syncing: softlockup: hung tasks
      [53232.624965] CPU: 0 PID: 596 Comm: sssd_be Tainted: G           OEL ------------   3.10.0-693.21.1.el7.x86_64 #1
      [53232.624965] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [53232.624965] Call Trace:
      [53232.624965]  <IRQ>  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
      [53232.624965]  [<ffffffff816a8634>] panic+0xe8/0x21f
      [53232.624965]  [<ffffffff8102d7cf>] ? show_regs+0x5f/0x210
      [53232.624965]  [<ffffffff811334e1>] watchdog_timer_fn+0x231/0x240
      [53232.624965]  [<ffffffff811332b0>] ? watchdog+0x40/0x40
      [53232.624965]  [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
      [53232.624965]  [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
      [53232.624965]  [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
      [53232.624965]  [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
      [53232.624965]  <EOI>  [<ffffffff81279cc0>] ? proc_pid_make_inode+0xf0/0xf0
      [53232.624965]  [<ffffffff812200c8>] ? __d_lookup+0x68/0x160
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff812201ea>] d_lookup+0x2a/0x50
      [53232.624965]  [<ffffffff8127a43f>] proc_fill_cache+0x6f/0x180
      [53232.624965]  [<ffffffff81279cc0>] ? proc_pid_make_inode+0xf0/0xf0
      [53232.624965]  [<ffffffff8127b26e>] proc_pid_readdir+0x14e/0x1f0
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff8121a320>] ? fillonedir+0xe0/0xe0
      [53232.624965]  [<ffffffff812761bf>] proc_root_readdir+0x3f/0x50
      [53232.624965]  [<ffffffff8121a216>] vfs_readdir+0xb6/0xe0
      [53232.624965]  [<ffffffff8121a635>] SyS_getdents+0x95/0x120
      [53232.624965]  [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
      [53232.624965]  [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
      

      On the console of the OSS, we see:

      [52707.292360] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 01:50:44 (1525571444)
      [52707.499316] Lustre: DEBUG MARKER: lctl dl | grep obdfilter
      [52707.854211] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d @
      [52708.863961] Lustre: Echo OBD driver; http://www.lustre.org/
      [53152.432160] LNetError: 17668:0:(socklnd.c:1681:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.5.10@tcp[1], ip 10.9.5.10:1023, with error, wanted: 152, left: 152, last alive is 28 secs ago
      [53152.437669] LustreError: 17668:0:(events.c:304:request_in_callback()) event type 2, status -5, service ost
      [53152.440488] LustreError: 25177:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
      [53152.445357] LustreError: 25177:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.9.5.10@tcp x1599622583354080
      [53168.454267] Lustre: lustre-OST0006: haven't heard from client 0c1d0b38-84c1-ca60-829d-9aaff06c9a4a (at 10.9.5.10@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff88006b639800, cur 1525571906 expire 1525571876 last 1525571859
      [53168.460481] Lustre: Skipped 7 previous similar messages
      [53702.588305] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      

      More failed tests and logs can be found at:
      https://testing.hpdd.intel.com/test_sets/7cb76a18-4281-11e8-960d-52540065bddc
      https://testing.hpdd.intel.com/test_sets/aa7e5422-425c-11e8-b45c-52540065bddc
      https://testing.hpdd.intel.com/test_sets/b0f496da-2cdc-11e8-9e0e-52540065bddc
      https://testing.hpdd.intel.com/test_sets/26cfa8ec-20b0-11e8-a4b1-52540065bddc
      https://testing.hpdd.intel.com/test_sets/f685a540-0329-11e8-a7cd-52540065bddc

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez, inactive)
