LU-18137: ksocklnd orphaned TCP sockets are never cleaned up

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.16.0
    • Affects Versions: Lustre 2.12.0, Lustre 2.15.0
    • Environment: Amazon Linux 2, both 4.14.* and 5.10.* kernels as of June 2024.
    • 3

    Description

      Amazon Linux 2 (AL2) is a RHEL-based Linux distribution supported by AWS. In June 2024, a change from the Linux 6.* kernel was backported to AL2 kernel versions 4.14.* and 5.10.*, and it ships on the current AL2 AMIs. The commit disables timers on socket release if the socket counter `sk_net_refcnt` is not incremented. Lustre does not increment this counter when it creates sockets in ksocklnd. The upstream kernel commit is: (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=151c9c724d05d5b0dd8acd3e11cb69ef1f2dbada).

      The effect of this kernel change on Lustre (I've tested 2.12 and 2.15, but presumably 2.10 is affected as well) shows up in spin-up, spin-down workloads where clients mount and unmount frequently, a common pattern in cloud deployments of Lustre. In these workloads a large number of TCP connections can become orphaned on MDS and OSS servers, stranded in the `FIN_WAIT_1` state. Over time this can lead to hundreds of thousands or millions of such connections. They consume memory, and a client that reuses the IP address of an orphaned connection can fail to mount when a firewall blocks challenge ACK packets, because the new TCP connection cannot be established (https://www.networkdefenseblog.com/post/wireshark-tcp-challenge-ack).
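
      One way to watch for this symptom on an MDS or OSS is to count sockets in `FIN_WAIT_1` from `/proc/net/tcp`. The following is a minimal userspace sketch and is not part of the proposed patch. The names `struct tcp_entry`, `parse_tcp_line`, and `count_fin_wait1` are hypothetical. It assumes the usual `/proc/net/tcp` layout, in which the `st` column holds the TCP state in hex and `04` means `FIN_WAIT1`.

      ```c
      #include <stdio.h>

      /* Hex state code for FIN_WAIT1 in the "st" column of /proc/net/tcp. */
      #define TCP_STATE_FIN_WAIT1 0x04

      /* Hypothetical helper type: the two fields this sketch cares about. */
      struct tcp_entry {
          unsigned int local_port;
          unsigned int state;
      };

      /*
       * Parse one line in /proc/net/tcp format.
       * Returns 0 and fills *out on success, or -1 if the line is not a
       * socket entry (for example, the column header).
       */
      int parse_tcp_line(const char *line, struct tcp_entry *out)
      {
          unsigned int slot, local_port, remote_port, state;
          char local_addr[33], remote_addr[33]; /* up to 32 hex digits for tcp6 */

          int n = sscanf(line, " %u: %32[0-9A-Fa-f]:%x %32[0-9A-Fa-f]:%x %x",
                         &slot, local_addr, &local_port,
                         remote_addr, &remote_port, &state);
          if (n != 6)
              return -1;

          out->local_port = local_port;
          out->state = state;
          return 0;
      }

      /*
       * Count FIN_WAIT1 sockets read from fp.
       * If local_port is non-zero, only sockets bound to that local port
       * are counted; pass 0 to count every FIN_WAIT1 socket.
       */
      long count_fin_wait1(FILE *fp, unsigned int local_port)
      {
          char line[512];
          struct tcp_entry e;
          long count = 0;

          while (fgets(line, sizeof line, fp) != NULL) {
              if (parse_tcp_line(line, &e) != 0)
                  continue; /* header or malformed line */
              if (e.state != TCP_STATE_FIN_WAIT1)
                  continue;
              if (local_port != 0 && e.local_port != local_port)
                  continue;
              count++;
          }
          return count;
      }
      ```

      A caller would open `/proc/net/tcp` with `fopen` and pass the stream together with the LNet listening port, for example `count_fin_wait1(fp, 988)` on a server that accepts connections on port 988. A count that keeps growing across client mount and unmount cycles would be consistent with the leak described here.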

      Using bpf, we can see the code path in Lustre that calls `sock_release` when a client disappears while TCP connections are still ESTABLISHED on the server side:

      ```
      10.0.71.247:988 X> 10.0.93.116:1021 FIN_WAIT1
      Call stack:
      inet_csk_clear_xmit_timers_sync+0x0
      inet_release+0x4c
      __sock_release+0xbc
      sock_release+0x24
      ksocknal_terminate_conn+0x260
      ksocknal_reaper+0x13c
      kthread+0x138
      ret_from_fork+0x10
      ```

      Once the timers are cleared, the orphaned TCP connections are never reaped. On a build without the recent upstream kernel change, we typically see the following call stack instead, in which the retransmit timer fires and destroys the TCP socket for each connection to a client:

      ```
      18:27:07 0 4 10.0.66.107:0 X> 10.0.93.116:1021 CLOSE
      Call stack:
      tcp_v4_destroy_sock+0x0
      tcp_done+0xc4
      tcp_write_err+0x118
      tcp_retransmit_timer+0x208
      tcp_write_timer_handler+0x108
      tcp_write_timer+0x50
      call_timer_fn+0x38
      expire_timers+0xe0
      run_timer_softirq+0xc0
      __do_softirq+0x134
      irq_exit+0xd4
      __handle_domain_irq+0x6c
      gic_handle_irq+0x8c
      el1_irq+0xe8
      arch_cpu_idle+0x30
      do_idle+0x128
      cpu_startup_entry+0x28
      secondary_start_kernel+0x100
      ```

      I am working on a patch to `ksocknal_lib_setup_sock` that, depending on the kernel detected at compile time, will set `sk_net_refcnt` and register the socket with the network namespace (netns):

      ```
      ksocknal_lib_setup_sock(...)
      ...
      /* mark the socket as holding a reference on its netns */
      sk->sk_net_refcnt = 1;
      /* take that reference */
      get_net(net);
      ```

      I am opening this JIRA to track that work and solicit feedback on the proposed patch.

            People

              Assignee: ropermar Mark Roper
              Reporter: ropermar Mark Roper