Lustre / LU-5566

Lustre waiting in TASK_INTERRUPTIBLE with all signals blocked results in unkillable processes


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 15523

    Description

      Lustre does much of its waiting on Linux in a TASK_INTERRUPTIBLE state with all signals blocked.

      This was changed briefly to TASK_UNINTERRUPTIBLE as part of bugzilla 16842 (https://projectlava.xyratex.com/show_bug.cgi?id=16842), then changed back because it led to extremely high reported load averages on Lustre servers:
      https://projectlava.xyratex.com/show_bug.cgi?id=16842#c28
      Reported load averages of that magnitude would be expected to cause problems of their own.

      The inflated load averages come from the way the kernel calculates load average: tasks sleeping in TASK_UNINTERRUPTIBLE are counted as if they were runnable (see the definition of task_contributes_to_load()).
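
      For reference, this is roughly how that check looks in kernels of this era; the exact definition varies by version:

          /*
           * Sketch of task_contributes_to_load() from include/linux/sched.h
           * (approximate; details differ between kernel versions).  A sleeping
           * task is counted toward the load average only if it is in
           * TASK_UNINTERRUPTIBLE and is not frozen.
           */
          #define task_contributes_to_load(task) \
                  ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
                   (task->flags & PF_FROZEN) == 0)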

      This method of waiting in TASK_INTERRUPTIBLE with all signals blocked (including SIGKILL) causes a problem with the delivery of signals queued on shared_pending, which can result in unkillable processes.

      The situation is this:
      Lustre is waiting as described, in TASK_INTERRUPTIBLE with all signals blocked. A SIGKILL is sent to the process group of a user process that is in a system call into Lustre. The signal is queued on the process's shared_pending set. This would normally wake a process sleeping in TASK_INTERRUPTIBLE, but it is not handled because Lustre is waiting with all signals blocked.
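
      A minimal sketch of the waiting pattern being described (not the actual Lustre code, which goes through its own wrappers such as cfs_block_allsigs() and l_wait_event()):

          #include <linux/sched.h>
          #include <linux/signal.h>
          #include <linux/wait.h>

          /*
           * Block every signal (SIGKILL included), then sleep in
           * TASK_INTERRUPTIBLE.  A signal queued while we sleep is left
           * pending and does not wake the task, because only unblocked
           * signals make signal_pending() true.
           */
          static int wait_with_all_signals_blocked(wait_queue_head_t *waitq,
                                                   int *condition)
          {
                  sigset_t all, old;
                  int rc;

                  sigfillset(&all);                       /* every signal, SIGKILL too */
                  sigprocmask(SIG_BLOCK, &all, &old);     /* in-kernel sigprocmask() */

                  rc = wait_event_interruptible(*waitq, *condition);

                  sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the previous mask */
                  return rc;
          }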

      Separately, a SIGSTOP is sent directly to the process. This is commonly used as part of our debugging/tracing software, and normally SIGSTOP and SIGKILL arriving at the same process is not a problem:

      For a process waiting in TASK_INTERRUPTIBLE (without SIGKILL blocked), SIGKILL will cause that task to exit (whether it arrives before SIGSTOP or after - the effect is the same).

      For a task waiting in TASK_UNINTERRUPTIBLE, the task finishes waiting; then, on return to userspace, the signals (SIGKILL in the shared pending set and SIGSTOP in the per-process pending set) are handled correctly and the process exits.

      But somehow, waiting in TASK_INTERRUPTIBLE with all signals blocked confuses things, and the result is stopped processes that do not exit. Sending another SIGKILL works, but any other process would have exited by this point.
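
      To make the trigger concrete, the sequence can be reproduced from userspace along these lines (pid is hypothetical and must identify a process currently blocked inside a Lustre system call):

          #include <signal.h>
          #include <sys/types.h>
          #include <unistd.h>

          /* Send the two signals in the order described above. */
          static void reproduce(pid_t pid)
          {
                  killpg(getpgid(pid), SIGKILL);  /* SIGKILL to the process group -> shared_pending */
                  kill(pid, SIGSTOP);             /* SIGSTOP directly to the process */

                  /* Expected: the process exits.  Observed: it is left stopped
                   * and does not exit until a further SIGKILL is sent. */
          }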

      It's very clear why Lustre does its waiting in TASK_INTERRUPTIBLE (Oleg reported load averages of >3000 on a healthy server in the bugzilla bug linked above), and since a complex system like Lustre is not good at being interrupted arbitrarily, it's understandable why it waits with all signals blocked.
      At the same time, it's clearly wrong to wait in TASK_INTERRUPTIBLE while blocking all signals. Whatever kernel behavior is leaving these processes stuck is not the central problem.

      The cleanest solution I can see is to add a new wait state to the Linux kernel, one that allows a process to wait uninterruptibly without contributing to load. I'm thinking of proposing a patch to this effect on linux-fsdevel or LKML to start that conversation, and I wanted to get input from anyone at Intel who'd like to give it before I do.
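
      For illustration only, a rough sketch of what such a state could look like (the names and bit value here are invented; mainline later added something along these lines as TASK_NOLOAD / TASK_IDLE):

          /* Hypothetical new state bit, OR'd into the task state. */
          #define TASK_NOLOADAVG  0x2000

          /* Load accounting would then skip tasks sleeping with the bit set. */
          #define task_contributes_to_load(task) \
                  ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
                   (task->state & TASK_NOLOADAVG) == 0 && \
                   (task->flags & PF_FROZEN) == 0)

          /* A waiter sleeps uninterruptibly without inflating the load
           * average; the caller still has to arrange a wake_up() elsewhere. */
          static inline void sleep_noload(void)
          {
                  set_current_state(TASK_UNINTERRUPTIBLE | TASK_NOLOADAVG);
                  schedule();
          }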

      I can provide further details of the non-exiting processes, including dumps, if needed, but I think the description above should be sufficient.

      Attachments

        Activity

          People

            Assignee: wc-triage WC Triage
            Reporter: paf Patrick Farrell (Inactive)
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated: