Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.12.0, Lustre 2.15.0
- Amazon Linux 2, both 4.14.* and 5.10.* kernels as of June 2024.
- 3
- 9223372036854775807
Description
Amazon Linux 2 (AL2) is a RHEL-based Linux distribution supported by AWS. In June 2024, a change from Linux kernel 6.* was backported to AL2 kernel versions 4.14.* and 5.10.* and has been released on the current AL2 AMIs. The commit disables timers on socket release if the socket's `sk_net_refcnt` has not been incremented. Lustre does not increment it when creating sockets in ksocklnd. The upstream kernel commit is: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=151c9c724d05d5b0dd8acd3e11cb69ef1f2dbada
The effect of this kernel change on Lustre (I've tested 2.12 and 2.15, but presumably this impacts 2.10 as well) shows up in spin-up, spin-down workloads where clients mount and unmount frequently, a common pattern in cloud deployments of Lustre. A large number of TCP connections can become orphaned on MDS and OSS servers, stranded in the `FIN_WAIT_1` state. Over time this can grow to hundreds of thousands or millions of such connections, which consume memory. It also causes mounts to fail for clients that re-use the IP address of an orphaned connection when a firewall blocks challenge ACK packets, because the new TCP connection cannot be established (https://www.networkdefenseblog.com/post/wireshark-tcp-challenge-ack).
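To gauge how many connections are stranded on a server, one option is to count `FIN_WAIT_1` entries in `/proc/net/tcp`. The following is a minimal user-space sketch, not part of the proposed patch. It assumes the usual `/proc/net/tcp` line layout (slot, local address, remote address, then a hex state column in which `04` means `FIN_WAIT1`). The function names `tcp_line_state` and `count_fin_wait1` are hypothetical.

```c
#include <stdio.h>

/* State code for FIN_WAIT1 as printed (in hex) in the "st" column. */
enum { TCP_STATE_FIN_WAIT1 = 0x04 };

/*
 * Parse one line of /proc/net/tcp and return its state code,
 * or -1 if the line is not a socket entry (for example, the header).
 */
int tcp_line_state(const char *line)
{
    unsigned int slot, lport, rport, state;
    char laddr[33], raddr[33];   /* wide enough for tcp6 addresses too */

    if (sscanf(line, " %u: %32[0-9A-Fa-f]:%x %32[0-9A-Fa-f]:%x %x",
               &slot, laddr, &lport, raddr, &rport, &state) != 6)
        return -1;
    return (int)state;
}

/* Count sockets in FIN_WAIT1 in a /proc/net/tcp-formatted stream. */
long count_fin_wait1(FILE *fp)
{
    char line[512];
    long count = 0;

    while (fgets(line, sizeof(line), fp))
        if (tcp_line_state(line) == TCP_STATE_FIN_WAIT1)
            count++;
    return count;
}
```

On an MDS or OSS, this could be used by opening `/proc/net/tcp` with `fopen` and passing the stream to `count_fin_wait1`. A count that keeps growing across client mount and unmount cycles would be consistent with the orphaning described above.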
Using BPF tracing, we can see the code path in Lustre that calls `sock_release` when a client disappears while TCP connections are still ESTABLISHED on the server side:
```
10.0.71.247:988 X> 10.0.93.116:1021 FIN_WAIT1
Call stack:
inet_csk_clear_xmit_timers_sync+0x0
inet_release+0x4c
__sock_release+0xbc
sock_release+0x24
ksocknal_terminate_conn+0x260
ksocknal_reaper+0x13c
kthread+0x138
ret_from_fork+0x10
```
Once the timers are cleared for these TCP connections, the connections are never reaped. On a build without the recent upstream Linux change, we typically see the following call stack instead, in which the retransmit timer fires and destroys the TCP socket for each connection to a client:
```
18:27:07 0 4 10.0.66.107:0 X> 10.0.93.116:1021 CLOSE
Call stack:
tcp_v4_destroy_sock+0x0
tcp_done+0xc4
tcp_write_err+0x118
tcp_retransmit_timer+0x208
tcp_write_timer_handler+0x108
tcp_write_timer+0x50
call_timer_fn+0x38
expire_timers+0xe0
run_timer_softirq+0xc0
__do_softirq+0x134
irq_exit+0xd4
__handle_domain_irq+0x6c
gic_handle_irq+0x8c
el1_irq+0xe8
arch_cpu_idle+0x30
do_idle+0x128
cpu_startup_entry+0x28
secondary_start_kernel+0x100
```
I am working on a patch to `ksocknal_lib_setup_sock` that will, depending on the kernel detected at compile time, increment `sk_net_refcnt` and take a reference on the socket's network namespace (netns):
```
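ksocknal_lib_setup_sock(...)
...
/* mark the socket as holding a netns reference */
sk->sk_net_refcnt = 1;
/* take the matching reference on the namespace */
get_net(net);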
```
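The two statements in the snippet above need to go together. As far as I understand the kernel's socket teardown, the namespace reference is dropped on destruction only when `sk_net_refcnt` is set, so setting the flag without calling `get_net` would leave the reference count unbalanced. The following is a small user-space model of that pairing, not kernel code. Every `model_*` type and function is a hypothetical stand-in and does not correspond to a real kernel API.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel's struct net and struct sock. */
struct model_net  { int refs; };
struct model_sock { struct model_net *net; bool net_refcnt; };

static void model_get_net(struct model_net *net) { net->refs++; }
static void model_put_net(struct model_net *net) { net->refs--; }

/* Kernel-created sockets start without holding a namespace reference. */
void model_sock_create_kern(struct model_sock *sk, struct model_net *net)
{
    sk->net = net;
    sk->net_refcnt = false;
}

/* The proposed setup step: set the flag and take the matching reference. */
void model_setup_sock(struct model_sock *sk)
{
    sk->net_refcnt = true;
    model_get_net(sk->net);
}

/* On destruction, the reference is dropped only if the flag is set. */
void model_sock_destruct(struct model_sock *sk)
{
    if (sk->net_refcnt)
        model_put_net(sk->net);
}
```

In this model, a socket that goes through create, setup, and destruct leaves the namespace count where it started, while a socket whose flag is set without the matching `model_get_net` call would drop a reference it never took.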
I am opening this JIRA to track that work and solicit feedback on the proposed patch.
Issue Links
- is related to
  - LU-18383 ksocklnd: Avoid TCP socket orphans in racy LNet hello (Open)