Lustre / LU-5566

Lustre waiting in TASK_INTERRUPTIBLE with all signals blocked results in unkillable processes


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 15523

    Description

      Lustre does much of its waiting on Linux in a TASK_INTERRUPTIBLE state with all signals blocked.

      This was changed briefly to TASK_UNINTERRUPTIBLE as part of bugzilla 16842 (https://projectlava.xyratex.com/show_bug.cgi?id=16842), then changed back because it led to extremely high reported load averages on Lustre servers:
      https://projectlava.xyratex.com/show_bug.cgi?id=16842#c28
      Reported load averages of that magnitude would be expected to cause problems of their own.

      The inflated load averages come from the way the kernel calculates load average: tasks sleeping in TASK_UNINTERRUPTIBLE are counted as if they were runnable (see the definition of task_contributes_to_load()).
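
      For reference, this is roughly how that check looks in kernels of this era; the exact definition varies by version:

          /*
           * Sketch of task_contributes_to_load() from include/linux/sched.h
           * (approximate; details differ between kernel versions).  A sleeping
           * task is counted toward the load average only if it is in
           * TASK_UNINTERRUPTIBLE and is not frozen.
           */
          #define task_contributes_to_load(task) \
                  ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
                   (task->flags & PF_FROZEN) == 0)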

      This method of waiting in TASK_INTERRUPTIBLE with all signals blocked (including SIGKILL) causes a problem with the delivery of signals queued on shared_pending, which can result in unkillable processes.

      The situation is this:
      Lustre is waiting as described, in TASK_INTERRUPTIBLE with all signals blocked. A SIGKILL is sent to the process group of a user process that is in a system call into Lustre. The signal is queued on the process's shared_pending set. This would normally wake a process sleeping in TASK_INTERRUPTIBLE, but it is not handled because Lustre is waiting with all signals blocked.
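
      A minimal sketch of the waiting pattern being described (not the actual Lustre code, which goes through its own wrappers such as cfs_block_allsigs() and l_wait_event()):

          #include <linux/sched.h>
          #include <linux/signal.h>
          #include <linux/wait.h>

          /*
           * Block every signal (SIGKILL included), then sleep in
           * TASK_INTERRUPTIBLE.  A signal queued while we sleep is left
           * pending and does not wake the task, because only unblocked
           * signals make signal_pending() true.
           */
          static int wait_with_all_signals_blocked(wait_queue_head_t *waitq,
                                                   int *condition)
          {
                  sigset_t all, old;
                  int rc;

                  sigfillset(&all);                       /* every signal, SIGKILL too */
                  sigprocmask(SIG_BLOCK, &all, &old);     /* in-kernel sigprocmask() */

                  rc = wait_event_interruptible(*waitq, *condition);

                  sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the previous mask */
                  return rc;
          }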

      Separately, a SIGSTOP is sent directly to the process. This is commonly used as part of our debugging/tracing software, and normally SIGSTOP and SIGKILL arriving at the same process is not a problem:

      For a process waiting in TASK_INTERRUPTIBLE (without SIGKILL blocked), SIGKILL will cause that task to exit (whether it arrives before SIGSTOP or after - the effect is the same).

      For a task waiting in TASK_UNINTERRUPTIBLE, the task finishes waiting; then, on return to userspace, the signals (SIGKILL in the shared pending set and SIGSTOP in the per-process pending set) are handled correctly and the process exits.

      But somehow, waiting in TASK_INTERRUPTIBLE with all signals blocked confuses things, and the result is stopped processes that do not exit. Sending another SIGKILL works, but any other process would have exited by this point.
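
      To make the trigger concrete, the sequence can be reproduced from userspace along these lines (pid is hypothetical and must identify a process currently blocked inside a Lustre system call):

          #include <signal.h>
          #include <sys/types.h>
          #include <unistd.h>

          /* Send the two signals in the order described above. */
          static void reproduce(pid_t pid)
          {
                  killpg(getpgid(pid), SIGKILL);  /* SIGKILL to the process group -> shared_pending */
                  kill(pid, SIGSTOP);             /* SIGSTOP directly to the process */

                  /* Expected: the process exits.  Observed: it is left stopped
                   * and does not exit until a further SIGKILL is sent. */
          }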

      It's very clear why Lustre does its waiting in TASK_INTERRUPTIBLE (Oleg reported load averages of >3000 on a healthy server in the bugzilla bug linked above), and since a complex system like Lustre is not good at being interrupted arbitrarily, it's understandable why it waits with all signals blocked.
      At the same time, it's clearly wrong to wait in TASK_INTERRUPTIBLE while blocking all signals. Whatever kernel behavior is leaving these processes stuck is not the central problem.

      The cleanest solution I can see is to add a new wait state to the Linux kernel, one that allows a process to wait uninterruptibly without contributing to load. I'm thinking of proposing a patch to this effect on linux-fsdevel or LKML to start that conversation, and I wanted to get input from anyone at Intel who'd like to give it before I do.
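
      For illustration only, a rough sketch of what such a state could look like (the names and bit value here are invented; mainline later added something along these lines as TASK_NOLOAD / TASK_IDLE):

          /* Hypothetical new state bit, OR'd into the task state. */
          #define TASK_NOLOADAVG  0x2000

          /* Load accounting would then skip tasks sleeping with the bit set. */
          #define task_contributes_to_load(task) \
                  ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
                   (task->state & TASK_NOLOADAVG) == 0 && \
                   (task->flags & PF_FROZEN) == 0)

          /* A waiter sleeps uninterruptibly without inflating the load
           * average; the caller still has to arrange a wake_up() elsewhere. */
          static inline void sleep_noload(void)
          {
                  set_current_state(TASK_UNINTERRUPTIBLE | TASK_NOLOADAVG);
                  schedule();
          }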

      I can provide further details of the non-exiting processes, including dumps, if needed, but I think the description above should be sufficient.

      Attachments

        Activity

          People

            Assignee: wc-triage WC Triage
            Reporter: paf Patrick Farrell (Inactive)
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated: