[LU-13483] Apparently infinite recursion in lnet_finalize() Created: 23/Apr/20 Updated: 19/May/20 Resolved: 19/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | lustre 2.12.4 + patches |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Router node crashes with apparently infinite recursion in lnet.

[15037.327128] Thread overran stack, or stack corrupted
[15037.332674] Oops: 0000 [#1] SMP
[15037.336294] Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) mlx4_ib mlx4_en rpcrdma ib_iser iTCO_wdt iTCO_vendor_support sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass pcspkr zfs(POE) ib_qib rdmavt lpc_ich zunicode(POE) zavl(POE) icp(POE) joydev zcommon(POE) znvpair(POE) spl(OE) mlx4_core devlink sg i2c_i801 ioatdma ipmi_si ipmi_devintf ipmi_msghandler ib_ipoib rdma_ucm ib_uverbs ib_umad acpi_cpufreq iw_cxgb4 rdma_cm iw_cm ib_cm iw_cxgb3 ib_core sch_fq_codel binfmt_misc msr_safe(OE) ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache overlay(T) ext4 mbcache jbd2 dm_service_time sd_mod crc_t10dif crct10dif_generic be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs 8021q mgag200 garp mrp
[15037.415946] stp drm_kms_helper llc syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul crct10dif_common ttm crc32_pclmul crc32c_intel ghash_clmulni_intel ahci drm mpt2sas igb isci libahci aesni_intel lrw gf128mul libsas glue_helper dca ablk_helper raid_class ptp cryptd libata dm_multipath scsi_transport_sas drm_panel_orientation_quirks pps_core wmi i2c_algo_bit sunrpc dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[15037.461455] CPU: 3 PID: 21567 Comm: kiblnd_sd_00_00 Kdump: loaded Tainted: P OE ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1
[15037.475718] Hardware name: cray cray-2628-lr/S2600GZ, BIOS SE5C600.86B.02.06.0002.101320150901 10/13/2015
[15037.486395] task: ffff9fd2b7775230 ti: ffff9fd27c1e8000 task.ti: ffff9fd27c1e8000
[15037.494744] RIP: 0010:[<ffffffffa17acd4d>] [<ffffffffa17acd4d>] strnlen+0xd/0x40
[15037.503106] RSP: 0018:ffff9fd27c1e7e80 EFLAGS: 00010086
[15037.509032] RAX: ffffffffa1e86261 RBX: ffffffffa2402fd6 RCX: fffffffffffffffe
[15037.516994] RDX: 00000000c0ab26ee RSI: ffffffffffffffff RDI: 00000000c0ab26ee
[15037.524957] RBP: ffff9fd27c1e7e80 R08: 000000000000ffff R09: 000000000000ffff
[15037.532917] R10: 0000000000000000 R11: ffff9fd27c1e7e46 R12: 00000000c0ab26ee
[15037.540880] R13: ffffffffa24033a0 R14: 00000000ffffffff R15: 0000000000000000
[15037.548842] FS: 0000000000000000(0000) GS:ffff9fd2be8c0000(0000) knlGS:0000000000000000
[15037.557871] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15037.564282] CR2: 00000000c0ab26ee CR3: 0000001ffbd04000 CR4: 00000000001607e0
[15037.572244] Call Trace:
[15037.574990] [<ffffffffc08e759a>] ? cfs_print_to_console+0x7a/0x1c0 [libcfs]
[15037.582862] [<ffffffffc08eda74>] ? libcfs_debug_vmsg2+0x574/0xbb0 [libcfs]
[15037.590635] [<ffffffffc08ee107>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[15037.598037] [<ffffffffc0a7e226>] ? lnet_finalize+0x976/0x9f0 [lnet]
...
[15039.259572] [<ffffffffc0a7d9e9>] ? lnet_finalize+0x139/0x9f0 [lnet]
[15039.266666] [<ffffffffc08ee107>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[15039.274052] [<ffffffffc0a87b8a>] ? lnet_post_send_locked+0x42a/0xa40 [lnet]
[15039.281922] [<ffffffffc0a89e38>] ? lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
[15039.290583] [<ffffffffc0a7c88c>] ? lnet_msg_decommit+0xec/0x700 [lnet]
[15039.297960] [<ffffffffc0a7dc3f>] ? lnet_finalize+0x38f/0x9f0 [lnet]
[15039.305055] [<ffffffffc09bc75d>] ? kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[15039.312634] [<ffffffffc09c7d19>] ? kiblnd_scheduler+0x8c9/0x1160 [ko2iblnd]
[15039.320502] [<ffffffffa142d59e>] ? __switch_to+0xce/0x5a0
[15039.326626] [<ffffffffa14e29b0>] ? wake_up_state+0x20/0x20
[15039.332835] [<ffffffffa14b46bc>] ? mod_timer+0x11c/0x260
[15039.338859] [<ffffffffc09c7450>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
[15039.346338] [<ffffffffa14cca01>] ? kthread+0xd1/0xe0
[15039.351974] [<ffffffffa14cc930>] ? insert_kthread_work+0x40/0x40
[15039.358776] [<ffffffffa1bbff77>] ? ret_from_fork_nospec_begin+0x21/0x21
[15039.366253] [<ffffffffa14cc930>] ? insert_kthread_work+0x40/0x40 |
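The call trace suggests that finalizing one message re-enters the send path: lnet_finalize() -> lnet_msg_decommit() -> lnet_return_tx_credits_locked() -> lnet_post_send_locked() -> lnet_finalize() again, all on the same kernel stack, until the thread overruns its stack. The following user-space C sketch is illustrative only; the function names, the failure model, and the message count are assumptions chosen to show the unbounded-recursion pattern, not actual LNet code.

/*
 * Illustrative sketch only (not LNet source).  It models the pattern in
 * the trace above: finalizing a message returns a tx credit, the credit
 * is immediately used to post the next queued message, that send fails
 * (for example, the peer is down), and the failed message is finalized
 * synchronously -- so the stack keeps growing with the queue length.
 */
#include <stdio.h>

#define QUEUED_MSGS 10000       /* assumed number of messages waiting on a tx credit */

static int queued = QUEUED_MSGS;
static int depth;
static int max_depth;

static void finalize_msg(void);

/* Stand-in for the send path: assume every send fails immediately,
 * so the message is finalized on the spot. */
static void post_send(void)
{
        if (queued <= 0)
                return;
        queued--;
        finalize_msg();
}

/* Stand-in for finalization: decommitting the message returns its
 * tx credit, which posts the next queued message on the same stack. */
static void finalize_msg(void)
{
        depth++;
        if (depth > max_depth)
                max_depth = depth;
        post_send();
        depth--;
}

int main(void)
{
        finalize_msg();
        /* The recursion depth equals the whole queue length; a kernel
         * thread with a small fixed stack would overrun long before that. */
        printf("max recursion depth: %d\n", max_depth);
        return 0;
}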
| Comments |
| Comment by Olaf Faaland [ 23/Apr/20 ] |
|
Possibly same as https://jira.whamcloud.com/browse/LU-12402 |
| Comment by Peter Jones [ 23/Apr/20 ] |
|
Amir, could you please advise? Thanks, Peter |
| Comment by Olaf Faaland [ 23/Apr/20 ] |
|
The router node that crashed runs our 2.12 tag; clients on this cluster (catalyst) and clients on some other clusters also run this tag:
Our other file systems and client systems run these tags (entire clusters are at one tag): |
| Comment by Olaf Faaland [ 24/Apr/20 ] |
|
I see https://jira.whamcloud.com/browse/LU-13067 has the same stack. In any case, it looks to me like b2_12 has the same code path. |
| Comment by Olaf Faaland [ 24/Apr/20 ] |
|
In case you agree, I backported the patch. I forgot to remove the old Change-Id and make the other metadata changes in the commit message. I abandoned those changes, thinking of aborting the testing, but now wonder if I've done other damage to the gerrit historical information. If so, sorry; I'm fixing that in the stack and pushing again. Below are the original gerrit change URLs: |
| Comment by Olaf Faaland [ 24/Apr/20 ] |
|
Second try at my patch stack:
Disclaimer: I don't know this code well. |
| Comment by Olaf Faaland [ 24/Apr/20 ] |
|
My backport missed at least one dependency, looking into it. |
| Comment by Olaf Faaland [ 24/Apr/20 ] |
|
Hi Amir, |
| Comment by Olaf Faaland [ 24/Apr/20 ] |
|
> My backport missed at least one dependency, looking into it.
This is more complex than I'd hoped; I'll need you to backport it if that's the correct fix. |
| Comment by Amir Shehata (Inactive) [ 25/Apr/20 ] |
|
https://review.whamcloud.com/38367
There is no need to backport other patches. Can you try this out and let me know if it resolves your issue? If it does, please abandon the other patches and use this one. Thanks. |
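A minimal sketch of the general technique for bounding this kind of recursion (it does not reproduce the contents of change 38367, and all names below are hypothetical): completions that arrive while a finalizer is already running are queued rather than processed synchronously, and only the outermost caller drains the queue in a loop, so the stack depth stays constant.

/*
 * Hypothetical sketch, not the code from change 38367: the recursion is
 * broken by queuing messages that complete while a finalizer is already
 * running, and letting that outermost finalizer drain the queue in a
 * loop instead of recursing.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PENDING 128

struct msg { int id; };

static struct msg *pending[MAX_PENDING];
static int npending;
static bool finalizing;          /* set while a caller is draining */

/* Credit return / next send would happen here; if that fails and calls
 * finalize_msg() again, the nested call only queues and returns. */
static void complete_one(struct msg *m)
{
        printf("completed msg %d\n", m->id);
}

static void finalize_msg(struct msg *m)
{
        if (npending < MAX_PENDING)
                pending[npending++] = m;

        if (finalizing)          /* nested call: queue only, no recursion */
                return;

        finalizing = true;
        while (npending > 0)
                complete_one(pending[--npending]);
        finalizing = false;
}

int main(void)
{
        struct msg a = { 1 }, b = { 2 };

        finalize_msg(&a);
        finalize_msg(&b);
        return 0;
}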
| Comment by Olaf Faaland [ 28/Apr/20 ] |
|
I'm unable to reproduce this issue on our test system, I believe because it's simply too small. I'm abandoning my other patches. Can we get review on this patch? Thanks |
| Comment by Peter Jones [ 02/May/20 ] |
|
The patch has landed for 2.12.5, so presumably it is at a point where you can add it to your distribution and test at greater scale to confirm whether the fix works. |
| Comment by Olaf Faaland [ 02/May/20 ] |
|
Yes, this patch is in our distribution, and the current schedule is to start deploying on Tuesday to the production systems where we see this issue (and to bigger ones after that). |
| Comment by Peter Jones [ 19/May/20 ] |
|
As per LLNL, ok to close. |
| Comment by Olaf Faaland [ 19/May/20 ] |
|
This patch has been deployed to all our 2.12 systems. So far we have not seen this issue recur, nor any new LNet-related pathologies. |