[LU-17323] fork() leaks ERESTARTNOINTR (errno 513) to user application

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Versions: Lustre 2.11.0, Lustre 2.12.5, Lustre 2.12.6, Lustre 2.12.9
    • Environment: RHEL6, RHEL7/CentOS7 (various kernels)

    Description

      When using file locks on a Lustre mount with the 'flock' mount option, fork()
      can leak ERESTARTNOINTR to a user application.  The fork() system call checks
      whether a signal is pending and, if so, cleans up everything it did and returns
      ERESTARTNOINTR.  The kernel then transparently restarts the fork() from scratch;
      the user application is never supposed to see the ERESTARTNOINTR errno.

      The fork() cleanup code calls exit_files(), which calls Lustre code.  I'm not
      positive what the problem is at a low level.  It may be that the Lustre code
      clears the TIF_SIGPENDING flag, which prevents the kernel from restarting the
      fork() and leaks the ERESTARTNOINTR errno to the user application.
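
      For reference, ERESTARTNOINTR is defined as 513 in the kernel's
      include/linux/errno.h and is one of the restart codes that are meant to stay
      kernel-internal.  Below is a rough paraphrase of the pending-signal check in
      copy_process() on these older kernels; it is illustrative only, not an exact
      excerpt, and the cleanup label name is made up:

      /*
       * Rough paraphrase (NOT an exact excerpt) of the pending-signal check in
       * copy_process() on RHEL6/7-era kernels.
       */
      recalc_sigpending();
      if (signal_pending(current)) {
              retval = -ERESTARTNOINTR;    /* ask the syscall layer to restart fork() */
              goto bad_fork_cleanup;       /* hypothetical label: unwind everything,
                                            * including the copied files */
      }

      On the normal path, the signal code on return to userspace sees TIF_SIGPENDING
      and converts the -ERESTARTNOINTR return into a restarted fork().  If something
      on the cleanup path clears that flag (and another thread consumes the signal in
      the meantime), the restart would be skipped and 513 would reach the application,
      which would match the behaviour described above.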

      It seems there have to be multiple threads involved.  My reproducer has two
      threads. Thread 1 does fork() calls in an infinite loop, spawning children
      that exit after a random number of seconds.  Thread 2 sleeps for a random
      number of seconds in an infinite loop.  There is a SIGCHLD handler set up and
      both threads can handle SIGCHLD signals.  The fork() gets interrupted by
      pending SIGCHLD signals from exiting children.  I think thread 2 has to handle
      the SIGCHLD signal for the problem to happen.  If thread 2 has SIGCHLD signals
      blocked, the problem never happens.
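
      The actual repro.c is attached to the ticket.  For readers without the
      attachment, a minimal sketch of the structure described above could look like
      the following (names, timings, and the fork pacing are illustrative and may
      differ from the attached reproducer):

      /* Minimal sketch of the reproducer structure (NOT the attached repro.c). */
      #include <errno.h>
      #include <fcntl.h>
      #include <pthread.h>
      #include <signal.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/file.h>
      #include <sys/wait.h>
      #include <unistd.h>

      static void sigchld_handler(int sig)
      {
              (void)sig;
              /* Reap exited children; their exits are what generate SIGCHLD. */
              while (waitpid(-1, NULL, WNOHANG) > 0)
                      ;
      }

      /* Thread 2: sleeps in a loop and must be allowed to take SIGCHLD. */
      static void *sleeper(void *arg)
      {
              (void)arg;
              for (;;)
                      sleep((rand() % 3) + 1);
              return NULL;
      }

      int main(int argc, char **argv)
      {
              pthread_t tid;
              int fd;

              if (argc < 2) {
                      fprintf(stderr, "usage: %s <file on lustre mount>\n", argv[0]);
                      return 1;
              }

              signal(SIGCHLD, sigchld_handler);

              fd = open(argv[1], O_RDONLY);
              if (fd < 0 || flock(fd, LOCK_SH) < 0) {
                      /* BSD-style read lock shown; the attached reproducer also
                       * supports POSIX locks or no lock at all. */
                      perror(argv[1]);
                      return 1;
              }

              pthread_create(&tid, NULL, sleeper, NULL);

              /* Thread 1: fork children in a loop; each child exits after a short
               * random delay, so SIGCHLD keeps arriving while fork() is running. */
              for (;;) {
                      pid_t pid = fork();

                      if (pid < 0) {
                              printf("Fork returned -1, errno = %d, exiting...\n", errno);
                              return 1;       /* errno 513 is the leaked ERESTARTNOINTR */
                      }
                      if (pid == 0) {
                              sleep((rand() % 3) + 1);
                              _exit(0);
                      }
                      usleep(10000);  /* slow the fork rate a little; see maxprocs note */
              }
      }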

      The problem doesn't reproduce with the 'localflock' mount option, so we
      believe 'localflock' is safe from this issue.

      We've seen this on RHEL6 and RHEL7/CentOS7 kernels with Lustre 2.11.0, 2.12.5,
      and 2.12.6.  Lustre 2.12.0 does not reproduce the issue.

      Steps to reproduce:

      1) The Lustre mount must be using the 'flock' mount option.
      2) gcc -o repro ./repro.c -lpthread
      3) Run the reproducer:

      The problem usually reproduces within 5-60 seconds.  The reproducer runs
      indefinitely or until the issue occurs; enter Ctrl-C to quit.

      > touch /lustre_mnt/testfile.txt
      > ./repro /lustre_mnt/testfile.txt
      Fork returned -1, errno = 513, exiting...

      Use POSIX style read lock
      > ./repro /lustre_mnt/testfile.txt posix
      Fork returned -1, errno = 513, exiting...

      Use BSD style read lock
      > ./repro /lustre_mnt/testfile.txt flock
      Fork returned -1, errno = 513, exiting...

      Don't lock at all (this won't reproduce and will run indefinitely)
      > ./repro /lustre_mnt/testfile.txt none

      NOTE: be aware the reproducer can exhaust your maxprocs limit

    Attachments

    Issue Links

    Activity

            spitzcor Cory Spitz added a comment - - edited

            mikedoo4, FYI ^^^.
            It seems that https://review.whamcloud.com/c/fs/lustre-release/+/55534 should not be applied to Lustre master. Can you verify that you can't reproduce the problem on a modern kernel? [EDIT: sorry, I forgot that you said earlier that you "have not been able to reproduce the issue on CentOS7"]. If so, what do you think about resolving this ticket?

            panda Andrew Perepechko added a comment - - edited

            I pushed a fix for the -513 errno to https://review.whamcloud.com/c/fs/lustre-release/+/55534 (it's based on top of 2.12.6). I can reproduce the bug without it and cannot reproduce the bug when it is applied. The kernel I used is 3.10.0-957.27.2.el7.x86_64.

            It seems that fork()/copy_process() was rewritten so that the kernel checks for pending signals first, and no cleanup is needed if it finds one. This change got into early RHEL8 kernels and later kernels, so we probably don't need to fix the Lustre versions where we still use recalc_sigpending() or clear_*thread_flag(*TIF_SIGPENDING).

            While the signal manipulations in l_wait_event() do not look safe, it looks more like a kernel issue because recalc_sigpending() is an exported function without any obvious restrictions for file closing handlers mentioned in the code or Documentation.

            mikedoo4 Mike D added a comment -

            Linked LU-17405 which tracks the side issue uncovered with gcc when testing the Lustre 2.15 client with 2.12.x servers.

            mikedoo4 Mike D added a comment -

            I'll note that csh vs bash behavior is different.

            • csh case (a.out not found until the ls is run)

            $ gcc hello.c
            $ ./a.out
            ./a.out: Command not found
            $ ./a.out
            ./a.out: Command not found
            $ ./a.out
            ./a.out: Command not found
            $ ls a.out
            a.out
            $ ./a.out
            hello world

            • bash case (a.out works after the bad ELF interpreter error)

            $ gcc hello.c
            $ ./a.out
            bash: ./a.out: /lib64/ld-linux-x86-64.so.2: bad ELF interpreter: No such file or directory
            $ ./a.out
            hello world

            The main thing I've noticed in the client debug log:

            running a.out
            file.c:2012:ll_file_read_iter() file a.out:[0x2000XXXXX:0x9:0x0], ppos: 0, count: 80
            file.c:2012:ll_file_read_iter() file a.out:[0x2000XXXXX:0x9:0x0], ppos: 6456, count: 1984
            file.c:2012:ll_file_read_iter() file a.out:[0x2000XXXXX:0x9:0x0], ppos: 6186, count: 268
            file.c:2012:ll_file_read_iter() file a.out:[0x2000XXXXX:0x9:0x0], ppos: 64, count: 504
            file.c:2012:ll_file_read_iter() file a.out:[0x2000XXXXX:0x9:0x0], ppos: 568, count: 200

            running a.out again
            file.c:2012:ll_file_read_iter() file a.out:[0x2000XXXXX:0x9:0x0], ppos: 0, count: 128
            file.c:2012:ll_file_read_iter() file a.out:[0x2000XXXXX:0x9:0x0], ppos: 64, count: 504
            file.c:2012:ll_file_read_iter() file a.out:[0x2000XXXXX:0x9:0x0], ppos: 568, count: 28

            I may be able to get a full log if needed but I don't know when I'll have time to get back to this.  I'll look at entering a new ticket with some of this info.  Thanks for the help so far.

            paf0186 Patrick Farrell added a comment - - edited

            Also, since it's easy to reproduce, but doesn't happen in our usual test environments, could you grab client debug logs and attach them to the new LU?

            Something like this would do the trick:

            DEBUGMB=`lctl get_param -n debug_mb`
            lctl clear
            lctl set_param *debug=-1 debug_mb=10000
            lctl mark "running gcc"
            gcc hello.c
            lctl mark "running a.out"
            ./a.out
            lctl mark "running ls"
            /bin/ls a.out
            lctl mark "running a.out again"
            ./a.out
            # Write out logs
            lctl dk > /tmp/log
            # Set debug to minimum, this leaves on error, warning, etc
            lctl set_param debug=0
            lctl set_param debug_mb=$DEBUGMB 

            (Will probably want to compress that before attaching it)

            adilger Andreas Dilger added a comment -

            Mike, please file your gcc issue in a separate Jira ticket, or it will be lost here. There should be proper interop between 2.15 clients and 2.12 servers.
            mikedoo4 Mike D added a comment -

            I tried the latest Lustre 2.15 client and have not been able to reproduce the issue on CentOS7.  However, I did notice a problem (I haven't investigated it much yet):

            > gcc hello.c
            > ./a.out
            ./a.out: Command not found.
            > /bin/ls a.out
            a.out
            > ./a.out
            hello world

            The file isn't there until I do the ls.  This is reproducible every time.

            Is it recommended to use a Lustre 2.15 client with 2.12.x servers?

            mikedoo4 Mike D added a comment -

            I plan to try Lustre 2.15 client (assuming that will connect to the 2.12.x server) but it will probably be several weeks before I can try it and report back.  I don't know if the problem occurs with RHEL8/9 as I don't have an easy way to test that.


            paf0186 Patrick Farrell added a comment -

            Interesting that the signal clearing is in those macros.  Those are the ones neilb ported into Lustre to replace our hand-rolled stuff, so maybe the issue isn't fixed.  Or perhaps Neil knows - should've tagged him earlier.

            adilger Andreas Dilger added a comment -

            Mike, thank you for the detailed analysis (including a reproducer!).

            Since RHEL6/7 are basically EOL at this point, this issue would only be of interest if the problem persists in RHEL8/9; we've run the full lifetime of EL6/7 without hitting this problem in actual production usage (or at least nothing has been reported to us up to this point).

            I don't see ERESTARTNOINTR used or returned anywhere in the Lustre code, so this error code is definitely coming from the kernel fork() handling. There is indeed code in libcfs/include/libcfs/linux/linux-wait.h that clears TIF_SIGPENDING in the RPC completion wait routines (__wait_event_idle() or __wait_event_lifo()), which are conditionally used depending on the kernel version in use. I suspect these routines are clones of similar code from newer kernels just for compatibility use with older kernels, so there may be some variations.
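
            To make the suspected mechanism concrete, here is a simplified sketch of
            an "idle"-style wait helper of the kind being described; it is paraphrased
            for illustration and is not the actual linux-wait.h code:

            /*
             * Simplified illustration (NOT the actual linux-wait.h code): the wait is
             * meant to be non-interruptible, so a pending signal is hidden by clearing
             * TIF_SIGPENDING for the duration and recalculated again afterwards.
             */
            #define compat_wait_event_idle(wq, condition)                           \
            do {                                                                    \
                    bool __hid_signal = false;                                      \
                                                                                    \
                    if (signal_pending(current)) {                                  \
                            clear_thread_flag(TIF_SIGPENDING);                      \
                            __hid_signal = true;                                    \
                    }                                                               \
                    wait_event(wq, condition);                                      \
                    if (__hid_signal)                                               \
                            recalc_sigpending(); /* may not re-raise the flag */    \
            } while (0)

            If a helper like this runs from fork()'s cleanup path (exit_files() into
            the Lustre file-release code) while the other thread consumes the pending
            SIGCHLD, recalc_sigpending() finds nothing to re-raise, the syscall restart
            never happens, and -513 escapes to the application.  That would be
            consistent with the reporter's observation that thread 2 must be able to
            handle the SIGCHLD for the problem to occur.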

            If the problem still persists with newer kernels and Lustre releases then it would be useful to continue investigation and add the repro.c test case into our regression test suite.


            People

              Assignee: wc-triage WC Triage
              Reporter: mikedoo4 Mike D
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated: