[LU-5566] Lustre waiting in TASK_INTERRUPTIBLE with all signals blocked results in unkillable processes Created: 29/Aug/14 Updated: 05/Sep/14 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 15523 |
| Description |
|
Lustre does much of its waiting on Linux in a TASK_INTERRUPTIBLE state with all signals blocked. This was changed briefly to TASK_UNINTERRUPTIBLE as part of bugzilla 16842 (https://projectlava.xyratex.com/show_bug.cgi?id=16842), then changed back because it led to extremely high reported load averages on Lustre servers: the kernel calculates load average by counting tasks in TASK_UNINTERRUPTIBLE (see the definition of task_contributes_to_load).

This method of waiting in TASK_INTERRUPTIBLE with all signals blocked (including SIGKILL) causes a problem with delivery of signals queued in shared_pending, which can result in unkillable processes. The situation is this: a SIGKILL is sent to the process while it is waiting with all signals blocked, so it sits undelivered in the shared_pending queue. Separately, a SIGSTOP is sent directly to the process. This is commonly used as part of our debugging/tracing software, and normally SIGSTOP and SIGKILL arriving at the same process is not a problem: for a process waiting in TASK_INTERRUPTIBLE (without SIGKILL blocked), SIGKILL will cause that task to exit (whether it arrives before SIGSTOP or after, the effect is the same). For a task waiting in TASK_UNINTERRUPTIBLE, the task finishes waiting, then on return to userspace the signals (SIGKILL in the shared mask and SIGSTOP in the per-process mask) are handled correctly and the process exits. But somehow, waiting in TASK_INTERRUPTIBLE with all signals blocked confuses things, and the result is stopped processes that do not exit. Sending another SIGKILL works, but any other process would have exited by this point.

It's very clear why Lustre does its waiting in TASK_INTERRUPTIBLE (Oleg reported load averages of >3000 on a healthy server in the bugzilla bug linked above), and since a complex system like Lustre is not good at being interrupted arbitrarily, it's understandable why it waits with all signals blocked.

The cleanest solution I can see is to add a new wait state to the Linux kernel, one that allows a process to wait uninterruptibly but not contribute to load. I'm thinking of proposing a patch to this effect (to start the conversation) on linux-fsdevel or LKML, and I wanted to get input from anyone at Intel who'd like to give it before starting that conversation.

I can provide further details of the non-exiting processes, etc., if needed, including dumps, but I think the description above should be sufficient. |
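For concreteness, here is a minimal sketch of the waiting pattern described above. It is not the actual l_wait_event() code; the function and variable names (lustre_style_wait, done) are made up for illustration, and the condition check is left unlocked for brevity.

```c
/*
 * Sketch of the current pattern: block every signal (SIGKILL included),
 * then sleep in TASK_INTERRUPTIBLE.  Because the task is not in
 * TASK_UNINTERRUPTIBLE, it does not count toward the load average
 * (see task_contributes_to_load() in include/linux/sched.h).
 */
#include <linux/sched.h>
#include <linux/signal.h>
#include <linux/wait.h>

static void lustre_style_wait(wait_queue_head_t *wq, int *done)
{
	sigset_t blocked, old;
	DEFINE_WAIT(wait);

	/* Block all signals, SIGKILL included, for the duration of the wait. */
	sigfillset(&blocked);
	sigprocmask(SIG_BLOCK, &blocked, &old);

	for (;;) {
		prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE);
		if (*done)
			break;
		schedule();
	}
	finish_wait(wq, &wait);

	/* Restore the caller's signal mask; blocked signals are now delivered. */
	sigprocmask(SIG_SETMASK, &old, NULL);
}
```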
| Comments |
| Comment by Patrick Farrell (Inactive) [ 02/Sep/14 ] |
|
I've got a more specific suggestion, from Cray's Paul Cassella. He found this old lkml message about a proposed change for almost exactly this problem, suggested by Linus: https://lkml.org/lkml/2002/8/1/186
So it seems like this would fill Lustre's need nicely, and if done correctly, would presumably meet with approval upstream. |
| Comment by Patrick Farrell (Inactive) [ 03/Sep/14 ] |
|
Here's a proposed note to LKML about this change. In Lustre, this would mean re-writing l_wait_event so that instead of blocking all signals and waiting in TASK_INTERRUPTIBLE, it would wait in a new scheduler state, one with the signal-ignoring properties of TASK_UNINTERRUPTIBLE but without contributing to load average.

https://lkml.org/lkml/2002/8/1/186

I've been looking at this, because the Lustre file system client makes heavy use of waiting in TASK_INTERRUPTIBLE with all signals (including SIGKILL) blocked. This is to manage load average, as Lustre regularly has very long waits, and sleeping in TASK_UNINTERRUPTIBLE would count every waiter toward the load average. But the choice to block all signals while waiting in TASK_INTERRUPTIBLE causes problems with signal delivery and can leave stopped processes that never exit.

Linus' suggestion from 2002 fits this use case perfectly, but it was never implemented. It nearly made it in 2007, with this from Matthew Wilcox:

https://lkml.org/lkml/2007/8/29/219

But Matthew dropped the split of TASK_LOADAVG off from TASK_UNINTERRUPTIBLE in:

https://lkml.org/lkml/2007/9/1/232

So: I'd like to propose reviving that split, and would welcome feedback before sending a patch. |
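Roughly, a rewritten wait could look like the sketch below. The TASK_NOLOAD flag and lustre_wait_event_noload() are hypothetical here; they only illustrate the proposed new state (uninterruptible semantics, no load-average contribution) and would require scheduler support to actually behave that way.

```c
/*
 * Sketch of the proposed wait, assuming a hypothetical TASK_NOLOAD flag
 * that keeps TASK_UNINTERRUPTIBLE semantics (signals are simply not
 * delivered while sleeping) but is excluded from the load average.
 */
#include <linux/sched.h>
#include <linux/wait.h>

#define TASK_NOLOAD	0x0400	/* hypothetical: uninterruptible, not counted in loadavg */

static void lustre_wait_event_noload(wait_queue_head_t *wq, int *done)
{
	DEFINE_WAIT(wait);

	for (;;) {
		/* No signal blocking needed: signals cannot interrupt this state. */
		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE | TASK_NOLOAD);
		if (*done)
			break;
		schedule();
	}
	finish_wait(wq, &wait);
}
```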
| Comment by Andreas Dilger [ 05/Sep/14 ] |
|
Patrick, before we stir the hornet's nest upstream, it probably makes sense to see if we can change the Lustre code to better match the upstream kernel before we ask the kernel to change to match Lustre. There have been several improvements in the upstream kernel since l_wait_event() was first written that might be useful to us today. Also, it may be useful to have a different implementation for waiting on the client and on the server, since clients waiting on RPCs should contribute to the load average just like they would if they were waiting on the disk. |
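One example of the kind of upstream facility added since l_wait_event() was first written is TASK_KILLABLE / wait_event_killable(), which came out of the 2007 Matthew Wilcox work linked above. The sketch below shows its basic use; the waitq and reply_arrived names are made up for illustration, and whether it fits a given Lustre wait site is exactly the sort of client/server question raised in this comment.

```c
/*
 * Sleeping with wait_event_killable(): no signal blocking is needed, the
 * sleeper can only be woken early by a fatal signal (e.g. SIGKILL), and
 * because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE the wait still
 * counts toward the load average -- arguably the right behaviour for a
 * client waiting on an RPC, just as if it were waiting on a disk.
 */
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/wait.h>

static int wait_for_rpc_reply(wait_queue_head_t *waitq, int *reply_arrived)
{
	int rc;

	rc = wait_event_killable(*waitq, *reply_arrived);
	if (rc == -ERESTARTSYS)
		return rc;	/* a fatal signal interrupted the wait */

	return 0;
}
```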