
LU-13483: Apparently infinite recursion in lnet_finalize()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.12.4
    • Environment: lustre 2.12.4 + patches
      lustre 2.10.8 + patches
      RHEL 7.8 + patches

    Description

      Router node crashes with apparently infinite recursion in lnet.

      [15037.327128] Thread overran stack, or stack corrupted
      [15037.332674] Oops: 0000 [#1] SMP
      [15037.336294] Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) mlx4_ib mlx4_en rpcrdma ib_iser iTCO_wdt iTCO_vendor_support sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass pcspkr zfs(POE) ib_qib rdmavt lpc_ich zunicode(POE) zavl(POE) icp(POE) joydev zcommon(POE) znvpair(POE) spl(OE) mlx4_core devlink sg i2c_i801 ioatdma ipmi_si ipmi_devintf ipmi_msghandler ib_ipoib rdma_ucm ib_uverbs ib_umad acpi_cpufreq iw_cxgb4 rdma_cm iw_cm ib_cm iw_cxgb3 ib_core sch_fq_codel binfmt_misc msr_safe(OE) ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache overlay(T) ext4 mbcache jbd2 dm_service_time sd_mod crc_t10dif crct10dif_generic be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs 8021q mgag200 garp mrp
      [15037.415946]  stp drm_kms_helper llc syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul crct10dif_common ttm crc32_pclmul crc32c_intel ghash_clmulni_intel ahci drm mpt2sas igb isci libahci aesni_intel lrw gf128mul libsas glue_helper dca ablk_helper raid_class ptp cryptd libata dm_multipath scsi_transport_sas drm_panel_orientation_quirks pps_core wmi i2c_algo_bit sunrpc dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
      [15037.461455] CPU: 3 PID: 21567 Comm: kiblnd_sd_00_00 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1
      [15037.475718] Hardware name: cray cray-2628-lr/S2600GZ, BIOS SE5C600.86B.02.06.0002.101320150901 10/13/2015
      [15037.486395] task: ffff9fd2b7775230 ti: ffff9fd27c1e8000 task.ti: ffff9fd27c1e8000
      [15037.494744] RIP: 0010:[<ffffffffa17acd4d>]  [<ffffffffa17acd4d>] strnlen+0xd/0x40
      [15037.503106] RSP: 0018:ffff9fd27c1e7e80  EFLAGS: 00010086
      [15037.509032] RAX: ffffffffa1e86261 RBX: ffffffffa2402fd6 RCX: fffffffffffffffe
      [15037.516994] RDX: 00000000c0ab26ee RSI: ffffffffffffffff RDI: 00000000c0ab26ee
      [15037.524957] RBP: ffff9fd27c1e7e80 R08: 000000000000ffff R09: 000000000000ffff
      [15037.532917] R10: 0000000000000000 R11: ffff9fd27c1e7e46 R12: 00000000c0ab26ee
      [15037.540880] R13: ffffffffa24033a0 R14: 00000000ffffffff R15: 0000000000000000
      [15037.548842] FS:  0000000000000000(0000) GS:ffff9fd2be8c0000(0000) knlGS:0000000000000000
      [15037.557871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [15037.564282] CR2: 00000000c0ab26ee CR3: 0000001ffbd04000 CR4: 00000000001607e0
      [15037.572244] Call Trace:
      [15037.574990]  [<ffffffffc08e759a>] ? cfs_print_to_console+0x7a/0x1c0 [libcfs]
      [15037.582862]  [<ffffffffc08eda74>] ? libcfs_debug_vmsg2+0x574/0xbb0 [libcfs]
      [15037.590635]  [<ffffffffc08ee107>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [15037.598037]  [<ffffffffc0a7e226>] ? lnet_finalize+0x976/0x9f0 [lnet]
      ...
      [15039.259572]  [<ffffffffc0a7d9e9>] ? lnet_finalize+0x139/0x9f0 [lnet]
      [15039.266666]  [<ffffffffc08ee107>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [15039.274052]  [<ffffffffc0a87b8a>] ? lnet_post_send_locked+0x42a/0xa40 [lnet]
      [15039.281922]  [<ffffffffc0a89e38>] ? lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
      [15039.290583]  [<ffffffffc0a7c88c>] ? lnet_msg_decommit+0xec/0x700 [lnet]
      [15039.297960]  [<ffffffffc0a7dc3f>] ? lnet_finalize+0x38f/0x9f0 [lnet]
      [15039.305055]  [<ffffffffc09bc75d>] ? kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
      [15039.312634]  [<ffffffffc09c7d19>] ? kiblnd_scheduler+0x8c9/0x1160 [ko2iblnd]
      [15039.320502]  [<ffffffffa142d59e>] ? __switch_to+0xce/0x5a0
      [15039.326626]  [<ffffffffa14e29b0>] ? wake_up_state+0x20/0x20
      [15039.332835]  [<ffffffffa14b46bc>] ? mod_timer+0x11c/0x260
      [15039.338859]  [<ffffffffc09c7450>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
      [15039.346338]  [<ffffffffa14cca01>] ? kthread+0xd1/0xe0
      [15039.351974]  [<ffffffffa14cc930>] ? insert_kthread_work+0x40/0x40
      [15039.358776]  [<ffffffffa1bbff77>] ? ret_from_fork_nospec_begin+0x21/0x21
      [15039.366253]  [<ffffffffa14cc930>] ? insert_kthread_work+0x40/0x40
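
      The repeating frames above form a cycle: lnet_finalize() calls
      lnet_msg_decommit(), which calls lnet_return_tx_credits_locked(), which
      calls lnet_post_send_locked(), which re-enters lnet_finalize(), so each
      queued message that completes while credits are being returned adds
      another set of stack frames until the kernel stack is overrun. A
      minimal, standalone sketch of that re-entry pattern (simplified
      stand-in functions, not the actual LNet code; compile with
      "cc -o cycle cycle.c"):

      #include <stdio.h>

      #define NQUEUED 1000            /* pretend this many messages are queued */

      static int depth, max_depth, queued = NQUEUED;

      static void finalize(void);     /* forward declaration: the cycle re-enters here */

      /* Sending the next queued message fails immediately, so it is finalized
       * right away -- from inside the finalize path of the previous message. */
      static void post_send(void)
      {
              if (queued-- > 0)
                      finalize();     /* re-entry: recursion depth grows with the queue */
      }

      /* Returning the finalized message's transmit credit makes a credit
       * available, which triggers an attempt to send the next queued message. */
      static void return_tx_credits(void)
      {
              post_send();
      }

      static void decommit(void)
      {
              return_tx_credits();
      }

      static void finalize(void)
      {
              if (++depth > max_depth)
                      max_depth = depth;
              decommit();
              depth--;
      }

      int main(void)
      {
              finalize();
              /* A real kernel stack (8-16 KiB) is overrun long before a large
               * queue drains, producing an oops like the one above. */
              printf("finalize() nested %d levels deep for %d queued messages\n",
                     max_depth, NQUEUED);
              return 0;
      }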
      


          Activity


            ofaaland Olaf Faaland added a comment -

            This patch has been deployed to all our 2.12 systems. So far we have not seen this issue recur, nor any new LNet-related pathologies.
            pjones Peter Jones added a comment -

            As per LLNL, OK to close.

            ofaaland Olaf Faaland added a comment -

            Yes, this patch is in our distribution. The current schedule is to start deploying to the production systems where we see this issue on Tuesday, and to the larger systems after that.

            pjones Peter Jones added a comment -

            The patch landed for 2.12.5, so presumably it is at a point where you can add it to your distribution and test at greater scale whether the fix works.


            ofaaland Olaf Faaland added a comment -

            I'm unable to reproduce this issue on our test system, I believe because it's simply too small. I'm abandoning my other patches. Can we get review on this patch? Thanks.

            ashehata Amir Shehata (Inactive) added a comment - edited

            https://review.whamcloud.com/38367

            There is no need to backport other patches.

            Can you try this out and let me know if it resolves your issue?

            If it resolves your issue, please abandon the other patches and use that one.

            Thanks.

            ofaaland Olaf Faaland added a comment -

            > My backport missed at least one dependency, looking into it.

            This is more complex than I'd hoped; I'll need you to backport it if that's the correct fix.

            ofaaland Olaf Faaland added a comment -

            Hi Amir,
            Do you have any news or need anything from me?

            ofaaland Olaf Faaland added a comment -

            My backport missed at least one dependency, looking into it.

            ofaaland Olaf Faaland added a comment -

            Second try at my patch stack:
            remote: https://review.whamcloud.com/38356 LU-11300 lnet: simplify lnet_handle_local_failure()
            remote: https://review.whamcloud.com/38357 LU-11477 lnet: handle health for incoming messages
            remote: https://review.whamcloud.com/38358 LU-12402 lnet: handle recursion in resend

            Disclaimer: I don't know this code well.

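            The third patch in this stack, "LU-12402 lnet: handle recursion in resend", targets exactly the cycle visible in the stack trace above. A minimal sketch of the general "queue and drain instead of recurse" pattern that such a fix typically uses (hypothetical names and structure, not the actual LNet implementation):

            #include <stdio.h>
            #include <stdbool.h>

            #define QSIZE 1000

            static int queue[QSIZE];            /* messages awaiting finalization */
            static int qhead, qtail;
            static bool finalizing;             /* a drain loop is already running */

            static void handle_completion(int msg);

            /* Nested completions are only appended to the queue; the outermost
             * caller drains it in a loop, so stack depth stays constant no
             * matter how many messages complete back to back. */
            static void finalize(int msg)
            {
                    queue[qtail++ % QSIZE] = msg;

                    if (finalizing)
                            return;             /* the caller above will drain it */

                    finalizing = true;
                    while (qhead != qtail)
                            handle_completion(queue[qhead++ % QSIZE]);
                    finalizing = false;
            }

            /* Completing one message frees a credit and immediately "fails" the
             * next send, which calls finalize() again -- but that call now just
             * queues the message and returns. */
            static void handle_completion(int msg)
            {
                    if (msg + 1 < QSIZE)
                            finalize(msg + 1);
            }

            int main(void)
            {
                    finalize(0);
                    printf("drained %d completions with bounded stack depth\n", qhead);
                    return 0;
            }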

            People

              Assignee:
              ashehata Amir Shehata (Inactive)
              Reporter:
              ofaaland Olaf Faaland
              Votes:
              0
              Watchers:
              3
