[LU-13067] lnet router crashes with Thread overran stack, or stack corrupted
Created: 12/Dec/19  Updated: 30/Jan/20  Resolved: 30/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Olaf Faaland
Assignee: Amir Shehata (Inactive)
Resolution: Duplicate
Votes: 0
Labels: llnl
Environment:

lustre-2.12.3_2.chaos-1.4mofed.ch6.x86_64
clients they connect to run the same lustre 2.12 version
servers and other routers they connect to run lustre-2.10.8_5.chaos-1.ch6.x86_64
RHEL 7.7 derivative
linux 3.10.0-1062.7.1.1chaos.ch6.x86_64
mlx5_ib: Mellanox Connect-IB Infiniband driver v4.7-1.0.0
See https://github.com/LLNL/lustre/ for these patch stacks.
One file system was undergoing an OS update at the time, so the servers were likely going up or down.


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Lustre router crashed with the following in vmcore-dmesg.txt:

[  963.601219] LNet: 29007:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 172.19.1.165@o2ib100: 50 seconds
[  963.611771] LNetError: 29007:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 172.19.2.26@o2ib100 added to recovery queue. Health = 900
[  963.623984] LNetError: 29007:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 172.19.1.165@o2ib100 added to recovery queue. Health = 900
[  963.637202] LNetError: 29007:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages
[  963.648165] BUG: unable to handle kernel paging request at 00000000c10cb305
[  963.655155] IP: [<00000000c10cb305>] 0xc10cb305
[  963.659715] PGD 0
[  963.661750] Thread overran stack, or stack corrupted
[  963.666716] Oops: 0010 [#1] SMP
[  963.669994] Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack mlx5_core(OE) amd64_edac_mod nf_conntrack edac_mce_amd joydev kvm_amd libcrc32c mlx_compat(OE) kvm mlxfw(OE) ses devlink enclosure iptable_filter irqbypass sg pcspkr ipmi_si ipmi_devintf ipmi_msghandler i2c_designware_platform pcc_cpufreq pinctrl_amd i2c_designware_core i2c_piix4 k10temp acpi_cpufreq sch_fq_codel binfmt_misc msr_safe(OE) ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache overlay(T) ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic be2iscsi bnx2i cnic uio cxgb4i
[  963.741681]  cxgb4 cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs dm_multipath ast drm_kms_helper crct10dif_pclmul syscopyarea crct10dif_common sysfillrect crc32_pclmul sysimgblt 8021q crc32c_intel fb_sys_fops ghash_clmulni_intel garp ttm mrp aesni_intel stp lrw llc gf128mul glue_helper igb mpt3sas ablk_helper dca raid_class cryptd drm ptp scsi_transport_sas ccp pps_core drm_panel_orientation_quirks i2c_algo_bit nfit libnvdimm sunrpc dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[  963.788993] CPU: 5 PID: 29007 Comm: kiblnd_connd Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.7.1.1chaos.ch6.x86_64 #1
[  963.801495] Hardware name: Penguin Computing Altus XE2112/MZ91-FS0-ZB, BIOS F08a 12/19/2018
[  963.809833] task: ffff9df47705c1c0 ti: ffff9df47bf80000 task.ti: ffff9df47bf80000
[  963.817304] RIP: 0010:[<00000000c10cb305>]  [<00000000c10cb305>] 0xc10cb305
[  963.824280] RSP: 0018:ffff9df47bf80020  EFLAGS: 00010246
[  963.829585] RAX: 0000000000000000 RBX: ffff9e147019c280 RCX: ffffffffc11a1430
[  963.836716] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9e147019c280
[  963.843842] RBP: ffff9df47bf80030 R08: 000000000000ffff R09: 000000000000ffff
[  963.850972] R10: 0000000000000280 R11: ffff9df47bf8006e R12: 0000000000000000
[  963.858096] R13: ffff9e147019c280 R14: ffff9de479d3b840 R15: ffff9de46fe4a200
[  963.865223] FS:  00007fffddfd0700(0000) GS:ffff9ddc7ef40000(0000) knlGS:0000000000000000
[  963.873307] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  963.879052] CR2: 00000000c10cb305 CR3: 0000003fd2410000 CR4: 00000000003407e0
[  963.886176] Call Trace:
[  963.888636]  [<ffffffffc10d15d5>] libcfs_debug_vmsg2+0xe5/0xbb0 [libcfs]
[  963.895336]  [<ffffffff977a8e03>] ? number.isra.2+0x323/0x360
[  963.901074]  [<ffffffff977a8f7b>] ? string.isra.7+0x3b/0xf0
[  963.906652]  [<ffffffffc10d20f7>] libcfs_debug_msg+0x57/0x80 [libcfs]
[  963.913106]  [<ffffffffc114e9df>] lnet_post_send_locked+0x40f/0xa40 [lnet]
[  963.919987]  [<ffffffffc1150ca8>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
[  963.927558]  [<ffffffffc1144511>] lnet_health_check+0x6a1/0x8b0 [lnet]
[  963.934084]  [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
[  963.940262]  [<ffffffffc10d20f7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[  963.946876]  [<ffffffffc114e9fa>] lnet_post_send_locked+0x42a/0xa40 [lnet]
[  963.953749]  [<ffffffffc1150ca8>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
[  963.961321]  [<ffffffffc1144511>] lnet_health_check+0x6a1/0x8b0 [lnet]
[  963.967855]  [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
[  963.974033]  [<ffffffffc10d20f7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[  963.980648]  [<ffffffffc114e9fa>] lnet_post_send_locked+0x42a/0xa40 [lnet]
[  963.987520]  [<ffffffffc1150ca8>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
[  963.995085]  [<ffffffffc1144511>] lnet_health_check+0x6a1/0x8b0 [lnet]
[  964.001611]  [<ffffffffc114488f>] lnet_finalize+0x16f/0x9a0 [lnet]
... <more of the same cycle>...
[  965.398450]  [<ffffffffc1012d42>] ? kiblnd_pool_free_node+0x82/0x180 [ko2iblnd]
[  965.405761]  [<ffffffffc101c79d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[  965.412372]  [<ffffffffc101cabb>] kiblnd_txlist_done+0x4b/0x60 [ko2iblnd]
[  965.419159]  [<ffffffffc1021dd3>] kiblnd_check_conns+0x573/0x8c0 [ko2iblnd]
[  965.426129]  [<ffffffffc1026eeb>] kiblnd_connd+0x83b/0xa00 [ko2iblnd]
[  965.432567]  [<ffffffff97bac120>] ? __schedule+0x430/0xa00
[  965.438053]  [<ffffffff974e1890>] ? wake_up_state+0x20/0x20
[  965.443624]  [<ffffffffc10266b0>] ? kiblnd_cm_callback+0x23b0/0x23b0 [ko2iblnd]
[  965.450928]  [<ffffffff974cb451>] kthread+0xd1/0xe0
[  965.455806]  [<ffffffff974cb380>] ? insert_kthread_work+0x40/0x40
[  965.461899]  [<ffffffff97bb9f64>] ret_from_fork_nospec_begin+0xe/0x21
[  965.468337]  [<ffffffff974cb380>] ? insert_kthread_work+0x40/0x40
[  965.474429] Code:  Bad RIP value.
[  965.477784] RIP  [<00000000c10cb305>] 0xc10cb305
[  965.482429]  RSP <ffff9df47bf80020>
[  965.485922] CR2: 00000000c10cb305
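
The "Thread overran stack" message can be cross-checked against the register dump with a little arithmetic. The sketch below takes the ti and RSP values from the oops and computes how far the stack pointer was from the base of the stack area. It assumes 16 KiB kernel stacks (THREAD_SIZE) and that thread_info sits at the lowest address of the stack area, as is usual for x86_64 kernels of this generation; neither is stated in the ticket. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the oops above. */
#define TASK_TI  0xffff9df47bf80000ULL  /* "ti:" field, base of the stack area */
#define OOPS_RSP 0xffff9df47bf80020ULL  /* "RSP:" field at the time of the fault */

/* Assumption: 16 KiB kernel stacks on this kernel (not stated in the ticket). */
#define ASSUMED_THREAD_SIZE (16ULL * 1024ULL)

/* Bytes between the stack pointer and the lowest address of the stack area. */
static uint64_t stack_headroom(uint64_t ti_base, uint64_t rsp)
{
    assert(rsp >= ti_base);
    return rsp - ti_base;
}

/* Bytes consumed so far, measured down from the assumed top of the stack. */
static uint64_t stack_used(uint64_t ti_base, uint64_t rsp, uint64_t thread_size)
{
    assert(rsp <= ti_base + thread_size);
    return (ti_base + thread_size) - rsp;
}
```

Calling stack_headroom(TASK_TI, OOPS_RSP) gives 32 bytes, and stack_used(TASK_TI, OOPS_RSP, ASSUMED_THREAD_SIZE) gives 16352 of 16384 bytes. Under these assumptions the stack pointer had already descended into the region where thread_info lives, which is consistent with the overrun message and the bad RIP.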


 Comments   
Comment by Olaf Faaland [ 12/Dec/19 ]

Our local bug ID: TOSS4698

Comment by Olaf Faaland [ 12/Dec/19 ]

I have the core dumps, so I can obtain information from them.

Reading symbols from /usr/lib/debug/usr/lib/modules/3.10.0-1062.4.1.1chaos.ch6.x86_64/extra/lustre/net/lnet.ko.debug...done.
(gdb) l *(lnet_finalize+0x16f)
0x128bf is in lnet_finalize (/usr/src/debug/lustre-2.12.3_2.chaos/lnet/lnet/lib-msg.c:914).
909                      * if the message send is success, timed out or failed in the
910                      * health check for any reason then we'll just finalize the
911                      * message. Otherwise just return since the message has been
912                      * put on the resend queue.
913                      */
914                     if (!lnet_health_check(msg))
915                             /* Message is queued for resend */
916                             return;
917             }
918
(gdb) l *(lnet_health_check+0x6a1)
0x12541 is in lnet_health_check (/usr/src/debug/lustre-2.12.3_2.chaos/lnet/lnet/lib-msg.c:750).
745              */
746             msg->msg_target.nid = msg->msg_hdr.dest_nid;
747             lnet_msg_decommit_tx(msg, -EAGAIN);
748             msg->msg_sending = 0;
749             msg->msg_receiving = 0;
750             msg->msg_target_is_router = 0;
751
752             CDEBUG(D_NET, "%s->%s:%s:%s - queuing for resend\n",
753                    libcfs_nid2str(msg->msg_hdr.src_nid),
754                    libcfs_nid2str(msg->msg_hdr.dest_nid),
(gdb) l *(lnet_return_tx_credits_locked+0x238)
0x1ecd8 is in lnet_return_tx_credits_locked (/usr/src/debug/lustre-2.12.3_2.chaos/lnet/lnet/lib-move.c:1212).
1207                            if (msg2_cpt != msg->msg_tx_cpt) {
1208                                    lnet_net_unlock(msg->msg_tx_cpt);
1209                                    lnet_net_lock(msg2_cpt);
1210                            }
1211                            (void) lnet_post_send_locked(msg2, 1);
1212                            if (msg2_cpt != msg->msg_tx_cpt) {
1213                                    lnet_net_unlock(msg2_cpt);
1214                                    lnet_net_lock(msg->msg_tx_cpt);
1215                            }
1216                    } else {
(gdb) l *(lnet_post_send_locked+0x40f)
0x1ca0f is in lnet_post_send_locked (/usr/src/debug/lustre-2.12.3_2.chaos/lnet/lnet/lib-move.c:963).
958                                             LNET_STATS_TYPE_DROP);
959
960                     CNETERR("Dropping message for %s: peer not alive\n",
961                             libcfs_id2str(msg->msg_target));
962                     msg->msg_health_status = LNET_MSG_STATUS_REMOTE_DROPPED;
963                     if (do_send)
964                             lnet_finalize(msg, -EHOSTUNREACH);
965
966                     lnet_net_lock(cpt);
967                     return -EHOSTUNREACH;
(gdb) quit
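
The four listings above line up with the repeating frames in the call trace: lnet_finalize calls lnet_health_check, which reaches lnet_return_tx_credits_locked, which posts the next queued message through lnet_post_send_locked, which calls lnet_finalize again when the peer is not alive. The following is a minimal userspace model of that cycle, not LNet code. All model_* names and structs are hypothetical, and the assumption that each queued message adds one level of nesting is a simplification made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued message; not the real struct lnet_msg. */
struct model_msg {
    struct model_msg *next;      /* next message waiting for a tx credit */
};

struct model_peer {
    int alive;                   /* 0 models "peer not alive" */
    struct model_msg *queue;     /* messages waiting for a credit */
    unsigned depth;              /* current nesting of model_finalize() */
    unsigned max_depth;          /* deepest nesting observed */
};

static void model_finalize(struct model_peer *peer, struct model_msg *msg);

/* Models lnet_post_send_locked(): a dead peer means drop and finalize now. */
static int model_post_send(struct model_peer *peer, struct model_msg *msg)
{
    if (!peer->alive) {
        model_finalize(peer, msg);
        return -1;
    }
    return 0;
}

/* Models lnet_return_tx_credits_locked(): a freed credit posts the next message. */
static void model_return_credit(struct model_peer *peer)
{
    struct model_msg *next = peer->queue;

    if (next != NULL) {
        peer->queue = next->next;
        (void)model_post_send(peer, next);
    }
}

/* Models lnet_finalize() plus lnet_health_check(): finalizing returns the credit. */
static void model_finalize(struct model_peer *peer, struct model_msg *msg)
{
    (void)msg;
    peer->depth++;
    if (peer->depth > peer->max_depth)
        peer->max_depth = peer->depth;
    model_return_credit(peer);
    peer->depth--;
}

/* Queue n messages behind one failed send to a dead peer; return peak nesting. */
static unsigned model_max_depth(struct model_msg *msgs, size_t n)
{
    struct model_peer peer = { 0, NULL, 0, 0 };
    struct model_msg first = { NULL };
    size_t i;

    for (i = 0; i < n; i++)
        msgs[i].next = (i + 1 < n) ? &msgs[i + 1] : NULL;
    peer.queue = (n > 0) ? &msgs[0] : NULL;

    model_finalize(&peer, &first);
    return peer.max_depth;
}
```

In this model, n queued messages produce n + 1 nested finalize calls, so stack use grows linearly with the backlog for the unreachable peer. That matches the shape of the trace, where the same four frames repeat until the stack is exhausted, but it says nothing about what the LU-12907 fix actually changes.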
Comment by Peter Jones [ 12/Dec/19 ]

Amir

Could you please investigate?

Thanks

Peter

Comment by Amir Shehata (Inactive) [ 12/Dec/19 ]

Can you please share the patch list you have? This looks like a crash which has already been fixed by these two patches:

3df41bb8515d5012d7e2f19b2d7019e3e1b64a71 LU-12568 lnet: Defer rspt cleanup when MD queued for unlink
c095fbda55ca632cff2696550f22a13a19ee4514 LU-12441 lnet: Detach rspt when md_threshold is infinite

Do you have them?

Looks similar to this: LU-12907

Comment by Olaf Faaland [ 12/Dec/19 ]

Hi Amir,
No, we do not have those two patches. See https://github.com/LLNL/lustre/ for our patch stacks. Thanks.

Comment by Olaf Faaland [ 30/Jan/20 ]

We are updating our 2.12 machines to 2.12.4-RC1 in the next week or two and will reopen the ticket if necessary.

Comment by Olaf Faaland [ 30/Jan/20 ]

Amir believes this is a duplicate of LU-12907.
