  1. Lustre
  2. LU-13067

lnet router crashes with Thread overran stack, or stack corrupted

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.12.3
    • 3

    Description

      The Lustre router crashed with the following in vmcore-dmesg.txt:

      [  963.601219] LNet: 29007:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 172.19.1.165@o2ib100: 50 seconds
      [  963.611771] LNetError: 29007:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.19.2.26@o2ib100 added to recovery queue. Health = 900
      [  963.623984] LNetError: 29007:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 172.19.1.165@o2ib100 added to recovery queue. Health = 900
      [  963.637202] LNetError: 29007:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages
      [  963.648165] BUG: unable to handle kernel paging request at 00000000c10cb305
      [  963.655155] IP: [<00000000c10cb305>] 0xc10cb305
      [  963.659715] PGD 0
      [  963.661750] Thread overran stack, or stack corrupted
      [  963.666716] Oops: 0010 [#1] SMP
      [  963.669994] Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack mlx5_core(OE) amd64_edac_mod nf_conntrack edac_mce_amd joydev kvm_amd libcrc32c mlx_compat(OE) kvm mlxfw(OE) ses devlink enclosure iptable_filter irqbypass sg pcspkr ipmi_si ipmi_devintf ipmi_msghandler i2c_designware_platform pcc_cpufreq pinctrl_amd i2c_designware_core i2c_piix4 k10temp acpi_cpufreq sch_fq_codel binfmt_misc msr_safe(OE) ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache overlay(T) ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic be2iscsi bnx2i cnic uio cxgb4i
      [  963.741681]  cxgb4 cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs dm_multipath ast drm_kms_helper crct10dif_pclmul syscopyarea crct10dif_common sysfillrect crc32_pclmul sysimgblt 8021q crc32c_intel fb_sys_fops ghash_clmulni_intel garp ttm mrp aesni_intel stp lrw llc gf128mul glue_helper igb mpt3sas ablk_helper dca raid_class cryptd drm ptp scsi_transport_sas ccp pps_core drm_panel_orientation_quirks i2c_algo_bit nfit libnvdimm sunrpc dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
      [  963.788993] CPU: 5 PID: 29007 Comm: kiblnd_connd Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.7.1.1chaos.ch6.x86_64 #1
      [  963.801495] Hardware name: Penguin Computing Altus XE2112/MZ91-FS0-ZB, BIOS F08a 12/19/2018
      [  963.809833] task: ffff9df47705c1c0 ti: ffff9df47bf80000 task.ti: ffff9df47bf80000
      [  963.817304] RIP: 0010:[<00000000c10cb305>]  [<00000000c10cb305>] 0xc10cb305
      [  963.824280] RSP: 0018:ffff9df47bf80020  EFLAGS: 00010246
      [  963.829585] RAX: 0000000000000000 RBX: ffff9e147019c280 RCX: ffffffffc11a1430
      [  963.836716] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9e147019c280
      [  963.843842] RBP: ffff9df47bf80030 R08: 000000000000ffff R09: 000000000000ffff
      [  963.850972] R10: 0000000000000280 R11: ffff9df47bf8006e R12: 0000000000000000
      [  963.858096] R13: ffff9e147019c280 R14: ffff9de479d3b840 R15: ffff9de46fe4a200
      [  963.865223] FS:  00007fffddfd0700(0000) GS:ffff9ddc7ef40000(0000) knlGS:0000000000000000
      [  963.873307] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  963.879052] CR2: 00000000c10cb305 CR3: 0000003fd2410000 CR4: 00000000003407e0
      [  963.886176] Call Trace:
      [  963.888636]  [<ffffffffc10d15d5>] libcfs_debug_vmsg2+0xe5/0xbb0 [libcfs]
      [  963.895336]  [<ffffffff977a8e03>] ? number.isra.2+0x323/0x360
      [  963.901074]  [<ffffffff977a8f7b>] ? string.isra.7+0x3b/0xf0
      [  963.906652]  [<ffffffffc10d20f7>] libcfs_debug_msg+0x57/0x80 [libcfs]
      [  963.913106]  [<ffffffffc114e9df>] lnet_post_send_locked+0x40f/0xa40 [lnet]
      [  963.919987]  [<ffffffffc1150ca8>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
      [  963.927558]  [<ffffffffc1144511>] lnet_health_check+0x6a1/0x8b0 [lnet]
      [  963.934084]  [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
      [  963.940262]  [<ffffffffc10d20f7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [  963.946876]  [<ffffffffc114e9fa>] lnet_post_send_locked+0x42a/0xa40 [lnet]
      [  963.953749]  [<ffffffffc1150ca8>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
      [  963.961321]  [<ffffffffc1144511>] lnet_health_check+0x6a1/0x8b0 [lnet]
      [  963.967855]  [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
      [  963.974033]  [<ffffffffc10d20f7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [  963.980648]  [<ffffffffc114e9fa>] lnet_post_send_locked+0x42a/0xa40 [lnet]
      [  963.987520]  [<ffffffffc1150ca8>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
      [  963.995085]  [<ffffffffc1144511>] lnet_health_check+0x6a1/0x8b0 [lnet]
      [  964.001611]  [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
      ... <more of the same cycle>...
      [  965.398450]  [<ffffffffc1012d42>] ? kiblnd_pool_free_node+0x82/0x180 [ko2iblnd]
      [  965.405761]  [<ffffffffc101c79d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
      [  965.412372]  [<ffffffffc101cabb>] kiblnd_txlist_done+0x4b/0x60 [ko2iblnd]
      [  965.419159]  [<ffffffffc1021dd3>] kiblnd_check_conns+0x573/0x8c0 [ko2iblnd]
      [  965.426129]  [<ffffffffc1026eeb>] kiblnd_connd+0x83b/0xa00 [ko2iblnd]
      [  965.432567]  [<ffffffff97bac120>] ? __schedule+0x430/0xa00
      [  965.438053]  [<ffffffff974e1890>] ? wake_up_state+0x20/0x20
      [  965.443624]  [<ffffffffc10266b0>] ? kiblnd_cm_callback+0x23b0/0x23b0 [ko2iblnd]
      [  965.450928]  [<ffffffff974cb451>] kthread+0xd1/0xe0
      [  965.455806]  [<ffffffff974cb380>] ? insert_kthread_work+0x40/0x40
      [  965.461899]  [<ffffffff97bb9f64>] ret_from_fork_nospec_begin+0xe/0x21
      [  965.468337]  [<ffffffff974cb380>] ? insert_kthread_work+0x40/0x40
      [  965.474429] Code:  Bad RIP value.
      [  965.477784] RIP  [<00000000c10cb305>] 0xc10cb305
      [  965.482429]  RSP <ffff9df47bf80020>
      [  965.485922] CR2: 00000000c10cb305
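
      The repeating frames in the trace above suggest that lnet_finalize, lnet_health_check, lnet_return_tx_credits_locked and lnet_post_send_locked call each other in a loop until the kernel stack is exhausted. The following is a minimal sketch, not part of the original report, of one way to spot such a cycle automatically in a saved vmcore-dmesg.txt. The helper names (reliable_frames, find_repeating_cycle) are hypothetical, and the frame format is assumed to match the lines shown above.

```python
import re
import sys

# Matches frames such as
#   [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
# Group 1 captures the "?" marker the kernel prints for unreliable frames.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(dmesg_text):
    """Return function names of frames not marked '?', innermost first."""
    names = []
    for line in dmesg_text.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group(1) is None:
            names.append(match.group(2))
    return names


def find_repeating_cycle(names, min_repeats=2):
    """Return (start_index, cycle) for the first block of frames that
    repeats back to back at least min_repeats times, or None."""
    total = len(names)
    for start in range(total):
        max_len = (total - start) // min_repeats
        for length in range(1, max_len + 1):
            cycle = names[start:start + length]
            if all(
                names[start + k * length:start + (k + 1) * length] == cycle
                for k in range(1, min_repeats)
            ):
                return start, cycle
    return None


if __name__ == "__main__":
    # Usage (assumed file name): python find_cycle.py vmcore-dmesg.txt
    with open(sys.argv[1]) as handle:
        result = find_repeating_cycle(reliable_frames(handle.read()))
    if result is None:
        print("no repeating call cycle found")
    else:
        start, cycle = result
        print("cycle starts at frame %d: %s" % (start, " <- ".join(cycle)))
```

      Frames are listed innermost first, so each name in the printed cycle is called by the one after it. Frames marked "?" are skipped because the kernel reports them as unreliable leftovers on the stack.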
      

          People

            ashehata Amir Shehata (Inactive)
            ofaaland Olaf Faaland
