Details
-
Bug
-
Resolution: Duplicate
-
Minor
-
None
-
Lustre 2.12.3
-
lustre-2.12.3_2.chaos-1.4mofed.ch6.x86_64
The clients the routers connect to run the same Lustre 2.12 version.
The servers and other routers they connect to run lustre-2.10.8_5.chaos-1.ch6.x86_64.
RHEL 7.7 derivative
linux 3.10.0-1062.7.1.1chaos.ch6.x86_64
mlx5_ib: Mellanox Connect-IB Infiniband driver v4.7-1.0.0
See https://github.com/LLNL/lustre/ for these patch stacks.
One file system was undergoing an OS update when the crash occurred, so its servers were likely going up or down at the time.
-
3
Description
A Lustre router crashed with the following in vmcore-dmesg.txt:
[ 963.601219] LNet: 29007:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 172.19.1.165@o2ib100: 50 seconds
[ 963.611771] LNetError: 29007:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.19.2.26@o2ib100 added to recovery queue. Health = 900
[ 963.623984] LNetError: 29007:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 172.19.1.165@o2ib100 added to recovery queue. Health = 900
[ 963.637202] LNetError: 29007:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages
[ 963.648165] BUG: unable to handle kernel paging request at 00000000c10cb305
[ 963.655155] IP: [<00000000c10cb305>] 0xc10cb305
[ 963.659715] PGD 0
[ 963.661750] Thread overran stack, or stack corrupted
[ 963.666716] Oops: 0010 [#1] SMP
[ 963.669994] Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack mlx5_core(OE) amd64_edac_mod nf_conntrack edac_mce_amd joydev kvm_amd libcrc32c mlx_compat(OE) kvm mlxfw(OE) ses devlink enclosure iptable_filter irqbypass sg pcspkr ipmi_si ipmi_devintf ipmi_msghandler i2c_designware_platform pcc_cpufreq pinctrl_amd i2c_designware_core i2c_piix4 k10temp acpi_cpufreq sch_fq_codel binfmt_misc msr_safe(OE) ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache overlay(T) ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic be2iscsi bnx2i cnic uio cxgb4i
[ 963.741681] cxgb4 cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs dm_multipath ast drm_kms_helper crct10dif_pclmul syscopyarea crct10dif_common sysfillrect crc32_pclmul sysimgblt 8021q crc32c_intel fb_sys_fops ghash_clmulni_intel garp ttm mrp aesni_intel stp lrw llc gf128mul glue_helper igb mpt3sas ablk_helper dca raid_class cryptd drm ptp scsi_transport_sas ccp pps_core drm_panel_orientation_quirks i2c_algo_bit nfit libnvdimm sunrpc dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[ 963.788993] CPU: 5 PID: 29007 Comm: kiblnd_connd Kdump: loaded Tainted: G OE ------------ T 3.10.0-1062.7.1.1chaos.ch6.x86_64 #1
[ 963.801495] Hardware name: Penguin Computing Altus XE2112/MZ91-FS0-ZB, BIOS F08a 12/19/2018
[ 963.809833] task: ffff9df47705c1c0 ti: ffff9df47bf80000 task.ti: ffff9df47bf80000
[ 963.817304] RIP: 0010:[<00000000c10cb305>] [<00000000c10cb305>] 0xc10cb305
[ 963.824280] RSP: 0018:ffff9df47bf80020 EFLAGS: 00010246
[ 963.829585] RAX: 0000000000000000 RBX: ffff9e147019c280 RCX: ffffffffc11a1430
[ 963.836716] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9e147019c280
[ 963.843842] RBP: ffff9df47bf80030 R08: 000000000000ffff R09: 000000000000ffff
[ 963.850972] R10: 0000000000000280 R11: ffff9df47bf8006e R12: 0000000000000000
[ 963.858096] R13: ffff9e147019c280 R14: ffff9de479d3b840 R15: ffff9de46fe4a200
[ 963.865223] FS: 00007fffddfd0700(0000) GS:ffff9ddc7ef40000(0000) knlGS:0000000000000000
[ 963.873307] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 963.879052] CR2: 00000000c10cb305 CR3: 0000003fd2410000 CR4: 00000000003407e0
[ 963.886176] Call Trace:
[ 963.888636] [<ffffffffc10d15d5>] libcfs_debug_vmsg2+0xe5/0xbb0 [libcfs]
[ 963.895336] [<ffffffff977a8e03>] ? number.isra.2+0x323/0x360
[ 963.901074] [<ffffffff977a8f7b>] ? string.isra.7+0x3b/0xf0
[ 963.906652] [<ffffffffc10d20f7>] libcfs_debug_msg+0x57/0x80 [libcfs]
[ 963.913106] [<ffffffffc114e9df>] lnet_post_send_locked+0x40f/0xa40 [lnet]
[ 963.919987] [<ffffffffc1150ca8>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
[ 963.927558] [<ffffffffc1144511>] lnet_health_check+0x6a1/0x8b0 [lnet]
[ 963.934084] [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
[ 963.940262] [<ffffffffc10d20f7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 963.946876] [<ffffffffc114e9fa>] lnet_post_send_locked+0x42a/0xa40 [lnet]
[ 963.953749] [<ffffffffc1150ca8>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
[ 963.961321] [<ffffffffc1144511>] lnet_health_check+0x6a1/0x8b0 [lnet]
[ 963.967855] [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
[ 963.974033] [<ffffffffc10d20f7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 963.980648] [<ffffffffc114e9fa>] lnet_post_send_locked+0x42a/0xa40 [lnet]
[ 963.987520] [<ffffffffc1150ca8>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
[ 963.995085] [<ffffffffc1144511>] lnet_health_check+0x6a1/0x8b0 [lnet]
[ 964.001611] [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
... <more of the same cycle>...
[ 965.398450] [<ffffffffc1012d42>] ? kiblnd_pool_free_node+0x82/0x180 [ko2iblnd]
[ 965.405761] [<ffffffffc101c79d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[ 965.412372] [<ffffffffc101cabb>] kiblnd_txlist_done+0x4b/0x60 [ko2iblnd]
[ 965.419159] [<ffffffffc1021dd3>] kiblnd_check_conns+0x573/0x8c0 [ko2iblnd]
[ 965.426129] [<ffffffffc1026eeb>] kiblnd_connd+0x83b/0xa00 [ko2iblnd]
[ 965.432567] [<ffffffff97bac120>] ? __schedule+0x430/0xa00
[ 965.438053] [<ffffffff974e1890>] ? wake_up_state+0x20/0x20
[ 965.443624] [<ffffffffc10266b0>] ? kiblnd_cm_callback+0x23b0/0x23b0 [ko2iblnd]
[ 965.450928] [<ffffffff974cb451>] kthread+0xd1/0xe0
[ 965.455806] [<ffffffff974cb380>] ? insert_kthread_work+0x40/0x40
[ 965.461899] [<ffffffff97bb9f64>] ret_from_fork_nospec_begin+0xe/0x21
[ 965.468337] [<ffffffff974cb380>] ? insert_kthread_work+0x40/0x40
[ 965.474429] Code: Bad RIP value.
[ 965.477784] RIP [<00000000c10cb305>] 0xc10cb305
[ 965.482429] RSP <ffff9df47bf80020>
[ 965.485922] CR2: 00000000c10cb305
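The call trace above repeats the sequence lnet_post_send_locked, lnet_return_tx_credits_locked, lnet_health_check, lnet_finalize many times, which together with "Thread overran stack, or stack corrupted" suggests recursion that exhausted the kernel stack. A rough sketch of how such a cycle could be picked out of a saved vmcore-dmesg.txt automatically is shown below. This is a triage aid only, not part of Lustre or of the crash tooling; the names FRAME_RE, reliable_frames, and find_cycle are hypothetical, and the frame pattern is an assumption based on the 3.10-era trace format seen here.

```python
import re

# Assumed frame format, taken from the trace in this ticket, e.g.
#   [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
# A "? " before the symbol marks a frame the unwinder is unsure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(dmesg_text):
    """Return symbol names of frames not marked '?', in order of appearance."""
    return [
        m.group(2)
        for m in FRAME_RE.finditer(dmesg_text)
        if m.group(1) is None
    ]


def find_cycle(frames, min_repeats=2):
    """Find the earliest, shortest run of frames that repeats back to back.

    Returns (start_index, cycle, repeats), or None if nothing repeats
    at least min_repeats times.
    """
    n = len(frames)
    for start in range(n):
        max_period = (n - start) // min_repeats
        for period in range(1, max_period + 1):
            cycle = frames[start:start + period]
            repeats = 1
            pos = start + period
            # Count consecutive copies of the candidate cycle.
            while frames[pos:pos + period] == cycle:
                repeats += 1
                pos += period
            if repeats >= min_repeats:
                return start, cycle, repeats
    return None
```

To use it, read the file contents into a string, pass that to reliable_frames, and pass the resulting list to find_cycle. Frames marked "?" are dropped before matching because they do not appear consistently in every iteration of the loop (the first iteration here has none between lnet_finalize and lnet_post_send_locked). Note that an elided region such as "... <more of the same cycle>..." simply contributes no frames, so the repeat count reflects only what is present in the text.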