[LU-13483] Apparently infinite recursion in lnet_finalize() Created: 23/Apr/20  Updated: 19/May/20  Resolved: 19/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Olaf Faaland Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre 2.12.4 + patches
lustre 2.10.8 + patches
RHEL 7.8 + patches


Attachments: Text File vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-12402 LNet Health: lnet_finalize() recursion Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Router node crashes with apparently infinite recursion in LNet; an illustrative sketch of the recursive cycle follows the trace below.

[15037.327128] Thread overran stack, or stack corrupted
[15037.332674] Oops: 0000 [#1] SMP
[15037.336294] Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) mlx4_ib mlx4_en rpcrdma ib_iser iTCO_wdt iTCO_vendor_support sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass pcspkr zfs(POE) ib_qib rdmavt lpc_ich zunicode(POE) zavl(POE) icp(POE) joydev zcommon(POE) znvpair(POE) spl(OE) mlx4_core devlink sg i2c_i801 ioatdma ipmi_si ipmi_devintf ipmi_msghandler ib_ipoib rdma_ucm ib_uverbs ib_umad acpi_cpufreq iw_cxgb4 rdma_cm iw_cm ib_cm iw_cxgb3 ib_core sch_fq_codel binfmt_misc msr_safe(OE) ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache overlay(T) ext4 mbcache jbd2 dm_service_time sd_mod crc_t10dif crct10dif_generic be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs 8021q mgag200 garp mrp
[15037.415946]  stp drm_kms_helper llc syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul crct10dif_common ttm crc32_pclmul crc32c_intel ghash_clmulni_intel ahci drm mpt2sas igb isci libahci aesni_intel lrw gf128mul libsas glue_helper dca ablk_helper raid_class ptp cryptd libata dm_multipath scsi_transport_sas drm_panel_orientation_quirks pps_core wmi i2c_algo_bit sunrpc dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[15037.461455] CPU: 3 PID: 21567 Comm: kiblnd_sd_00_00 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1
[15037.475718] Hardware name: cray cray-2628-lr/S2600GZ, BIOS SE5C600.86B.02.06.0002.101320150901 10/13/2015
[15037.486395] task: ffff9fd2b7775230 ti: ffff9fd27c1e8000 task.ti: ffff9fd27c1e8000
[15037.494744] RIP: 0010:[<ffffffffa17acd4d>]  [<ffffffffa17acd4d>] strnlen+0xd/0x40
[15037.503106] RSP: 0018:ffff9fd27c1e7e80  EFLAGS: 00010086
[15037.509032] RAX: ffffffffa1e86261 RBX: ffffffffa2402fd6 RCX: fffffffffffffffe
[15037.516994] RDX: 00000000c0ab26ee RSI: ffffffffffffffff RDI: 00000000c0ab26ee
[15037.524957] RBP: ffff9fd27c1e7e80 R08: 000000000000ffff R09: 000000000000ffff
[15037.532917] R10: 0000000000000000 R11: ffff9fd27c1e7e46 R12: 00000000c0ab26ee
[15037.540880] R13: ffffffffa24033a0 R14: 00000000ffffffff R15: 0000000000000000
[15037.548842] FS:  0000000000000000(0000) GS:ffff9fd2be8c0000(0000) knlGS:0000000000000000
[15037.557871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15037.564282] CR2: 00000000c0ab26ee CR3: 0000001ffbd04000 CR4: 00000000001607e0
[15037.572244] Call Trace:
[15037.574990]  [<ffffffffc08e759a>] ? cfs_print_to_console+0x7a/0x1c0 [libcfs]
[15037.582862]  [<ffffffffc08eda74>] ? libcfs_debug_vmsg2+0x574/0xbb0 [libcfs]
[15037.590635]  [<ffffffffc08ee107>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[15037.598037]  [<ffffffffc0a7e226>] ? lnet_finalize+0x976/0x9f0 [lnet]
...
[15039.259572]  [<ffffffffc0a7d9e9>] ? lnet_finalize+0x139/0x9f0 [lnet]
[15039.266666]  [<ffffffffc08ee107>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[15039.274052]  [<ffffffffc0a87b8a>] ? lnet_post_send_locked+0x42a/0xa40 [lnet]
[15039.281922]  [<ffffffffc0a89e38>] ? lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
[15039.290583]  [<ffffffffc0a7c88c>] ? lnet_msg_decommit+0xec/0x700 [lnet]
[15039.297960]  [<ffffffffc0a7dc3f>] ? lnet_finalize+0x38f/0x9f0 [lnet]
[15039.305055]  [<ffffffffc09bc75d>] ? kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[15039.312634]  [<ffffffffc09c7d19>] ? kiblnd_scheduler+0x8c9/0x1160 [ko2iblnd]
[15039.320502]  [<ffffffffa142d59e>] ? __switch_to+0xce/0x5a0
[15039.326626]  [<ffffffffa14e29b0>] ? wake_up_state+0x20/0x20
[15039.332835]  [<ffffffffa14b46bc>] ? mod_timer+0x11c/0x260
[15039.338859]  [<ffffffffc09c7450>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
[15039.346338]  [<ffffffffa14cca01>] ? kthread+0xd1/0xe0
[15039.351974]  [<ffffffffa14cc930>] ? insert_kthread_work+0x40/0x40
[15039.358776]  [<ffffffffa1bbff77>] ? ret_from_fork_nospec_begin+0x21/0x21
[15039.366253]  [<ffffffffa14cc930>] ? insert_kthread_work+0x40/0x40
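
The cycle visible in the trace is kiblnd_tx_done() -> lnet_finalize() -> lnet_msg_decommit() -> lnet_return_tx_credits_locked() -> lnet_post_send_locked() -> lnet_finalize(). A minimal, purely illustrative C sketch of that kind of re-entrant completion path is below; none of the names in it are real LNet symbols, it only models the shape of the recursion:

/* Illustrative only: models a completion path that re-enters itself.
 * None of these names are real LNet symbols. */
#include <stdio.h>

#define QUEUED_MSGS 10000   /* recursion depth tracks the queue length */

static int queued = QUEUED_MSGS;

static void finalize(int msg);

/* stands in for a send that fails immediately, e.g. toward a dead route */
static void post_send(int msg)
{
        finalize(msg);              /* the failed send is finalized in place */
}

/* finalizing one message returns a credit, which posts the next message */
static void return_tx_credits(void)
{
        if (queued > 0)
                post_send(queued--);
}

static void finalize(int msg)
{
        return_tx_credits();        /* decommit -> return credits -> post next */
}

int main(void)
{
        /* one completion drags every queued message onto the same call chain;
         * on a 16 KiB kernel stack a few hundred nested frames already overrun it */
        finalize(0);
        printf("drained %d queued messages on one call chain\n", QUEUED_MSGS);
        return 0;
}

The depth of the chain scales with the number of queued messages, which matches the "Thread overran stack" report above on a busy router.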


 Comments   
Comment by Olaf Faaland [ 23/Apr/20 ]

Possibly the same as https://jira.whamcloud.com/browse/LU-12402

Comment by Peter Jones [ 23/Apr/20 ]

Amir

Could you please advise?

Thanks

Peter

Comment by Olaf Faaland [ 23/Apr/20 ]

This is the 2.12 tag running on the router node that crashed. Clients on this cluster (catalyst) and clients on some other clusters also run this tag:
https://github.com/LLNL/lustre/releases/tag/2.12.4_4.chaos

Our other file systems and client systems run these tags (each cluster is entirely at one tag):
https://github.com/LLNL/lustre/releases/tag/2.12.4_2.chaos
https://github.com/LLNL/lustre/releases/tag/2.10.8_7.chaos
https://github.com/LLNL/lustre/releases/tag/2.10.8_5.chaos

Comment by Olaf Faaland [ 24/Apr/20 ]

I see https://jira.whamcloud.com/browse/LU-13067 has the same stack. Was LU-13067 misdiagnosed?

In any case, it looks to me like b2_12 has the same code path as LU-12402 and needs https://review.whamcloud.com/35431
Please take a look and see if you agree.

Comment by Olaf Faaland [ 24/Apr/20 ]

In case you agree, I backported the patch from LU-12402 along with two patches it depends on and pushed them for testing.

I forgot to remove the old Change-Id and make the other metadata changes in the commit messages. I abandoned those changes intending to abort the testing, but now I wonder whether I've done other damage to the Gerrit history. If so, sorry.

I'm fixing that in the stack and pushing again...

Below are the original gerrit change URLs:
remote: https://review.whamcloud.com/38353 LU-11300 lnet: simplify lnet_handle_local_failure()
remote: https://review.whamcloud.com/38354 LU-11477 lnet: handle health for incoming messages
remote: https://review.whamcloud.com/38355 LU-12402 lnet: handle recursion in resend

Comment by Olaf Faaland [ 24/Apr/20 ]

Second try at my patch stack:
remote: https://review.whamcloud.com/38356 LU-11300 lnet: simplify lnet_handle_local_failure()
remote: https://review.whamcloud.com/38357 LU-11477 lnet: handle health for incoming messages
remote: https://review.whamcloud.com/38358 LU-12402 lnet: handle recursion in resend

Disclaimer: I don't know this code well.

Comment by Olaf Faaland [ 24/Apr/20 ]

My backport missed at least one dependency, looking into it.

Comment by Olaf Faaland [ 24/Apr/20 ]

Hi Amir,
Do you have any news or need anything from me?

Comment by Olaf Faaland [ 24/Apr/20 ]

> My backport missed at least one dependency, looking into it.

This is more complex than I'd hoped; I'll need you to backport it if that's the correct fix.

Comment by Amir Shehata (Inactive) [ 25/Apr/20 ]

https://review.whamcloud.com/38367

There is no need to backport the other patches.

Can you try this out and let me know if it resolves your issue?

If it resolves your issue, can you please abandon the other patches and use this one?

thanks
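
For context, the usual way to break this kind of re-entrancy is to defer nested completions onto a queue and let only the outermost caller drain it iteratively, so stack depth stays constant no matter how many messages complete back-to-back. The sketch below shows only that generic pattern; it is not a description of what https://review.whamcloud.com/38367 actually does, and none of the names are real LNet symbols:

/* Illustrative only: nested completions are queued; the outermost caller
 * drains them in a loop instead of recursing. */
#include <stdio.h>

#define QUEUED_MSGS 10000

static int pending[QUEUED_MSGS + 1];
static int npending;
static int finalizing;              /* non-zero while the outermost drain runs */

static void finalize_one(int msg);

static void finalize(int msg)
{
        pending[npending++] = msg;  /* always defer the work... */
        if (finalizing)
                return;             /* ...nested callers just queue and return */

        finalizing = 1;
        while (npending > 0)        /* only the outermost caller drains */
                finalize_one(pending[--npending]);
        finalizing = 0;
}

/* stands in for a send that fails immediately and completes on the spot */
static void post_send(int msg)
{
        finalize(msg);
}

/* finalizing one message returns a credit, which posts the next message */
static void finalize_one(int msg)
{
        static int queued = QUEUED_MSGS;

        if (queued > 0)
                post_send(queued--);
}

int main(void)
{
        finalize(0);                /* drains the whole queue at constant depth */
        printf("drained %d queued messages without growing the stack\n",
               QUEUED_MSGS);
        return 0;
}

In the sketch the serialization is a single flag; real code would need per-CPT accounting and locking, which is exactly the kind of detail the actual patch has to get right.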

Comment by Olaf Faaland [ 28/Apr/20 ]

I'm unable to reproduce this issue on our test system, I believe because it's simply too small. I'm abandoning my other patches. Can we get a review of this patch? Thanks.

Comment by Peter Jones [ 02/May/20 ]

The patch has landed for 2.12.5, so presumably it is at a point where you could add it to your distribution and test at greater scale to confirm whether the fix works.

Comment by Olaf Faaland [ 02/May/20 ]

Yes, this patch is in our distribution, and the current schedule is to start deploying it on Tuesday to the production systems where we see this issue (and to bigger ones after that).

Comment by Peter Jones [ 19/May/20 ]

As per LLNL, ok to close.

Comment by Olaf Faaland [ 19/May/20 ]

This patch has been deployed to all our 2.12 systems. So far we've not seen this issue recur, nor any new LNet-related pathologies.
