
kernel warning 'do not call blocking ops when !TASK_RUNNING ' in ptlrpcd

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Upstream, Lustre 2.15.0
    • Affects Version/s: Lustre 2.12.0
    • Environment: ARM client running TOSS kernel 4.14.0-115.6.1.1chaos.ch6a.aarch64
    • Severity: 3

    Description

      kernel: do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffff000002a473d0>] ptlrpcd+0x410/0x5c0 [ptlrpc]
      kernel: ------------[ cut here ]------------
      kernel: WARNING: CPU: 35 PID: 14128 at kernel/sched/core.c:5988 __might_sleep+0x84/0x8c 


          Activity

            [LU-12362] kernel warning 'do not call blocking ops when !TASK_RUNNING ' in ptlrpcd

            pfarrell Patrick Farrell (Inactive) added a comment -

            Thanks for weighing in, Neil.

            I think either is fine; my glance through the code suggested that the 'condition' being checked is kind of complicated.  Perhaps there is some simplification I missed, but it seemed like there were various ways/reasons it would decide it had nothing to do.

            neilb Neil Brown added a comment -

            That code is rather convoluted...

            The possible problem here is that the wait loop could spin rather than waiting.  If the 'condition' ever comes close to sleeping, it will reset the task state to RUNNING and the subsequent schedule() call will do nothing, causing l_wait_event() to loop around and call the condition() function again.

            If the condition() does this every time through, then the wait loop becomes a busy-wait.  If it only does it occasionally, then there is no real problem.

            I can see two approaches to "fixing" it.

            One is to change every place that calls wake_up on ->set_waitq to also set a flag.  Then ptlrpcd() waits for that flag to be set, then calls ptlrpcd_check().

            The other is to factor out the checks in ptlrpcd_check() from the work.  Then do the work before calling l_wait_event(), then just do the checks inside l_wait_event().

            I couldn't suggest which is better without first following through all the code and working out what it is really doing.
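            A very rough sketch of the first option, with an invented flag name (set_work_pending) and glossing over the finer memory-ordering details, might be:

                /* Sketch only: every caller that currently does
                 * wake_up(&set->set_waitq) would go through a helper that also
                 * records that there is work for ptlrpcd to do. */
                static void ptlrpcd_queue_work(struct ptlrpc_request_set *set)
                {
                        WRITE_ONCE(set->set_work_pending, 1);   /* hypothetical flag */
                        smp_mb();       /* order the flag against the waitqueue check */
                        wake_up(&set->set_waitq);
                }

                /* ...and the wait in ptlrpcd() would only test that flag,
                 * calling ptlrpcd_check() afterwards, back in TASK_RUNNING: */
                l_wait_event(set->set_waitq,
                             READ_ONCE(set->set_work_pending) != 0, &lwi);
                WRITE_ONCE(set->set_work_pending, 0);
                ptlrpcd_check(&env, pc);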

             


            pfarrell Patrick Farrell (Inactive) added a comment -

            So the problem is we're calling ptlrpcd_check() as the condition, and it sends RPCs, etc., rather than just checking a condition.

            BTW, in case anyone else was going to check, Neil didn't already fix this for us ( :) ):
            http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-February/006500.html


            pfarrell Patrick Farrell (Inactive) added a comment -

            Yes, that's the main thing, and it's a shame that it's not in RHEL 7.

            Oh, this is interesting - actually, wait_event() has the same behavior.  It calls condition() after setting task_state to whatever has been requested.  So, actually, I think the problem might be with the nature of our condition().  So in fact we'd still have this problem even with Neil's patch.  My bad.

            In fact, now that I look at it, our condition here is pretty weird.  It's using condition() to do the main work of the loop.

            Having sampled about 20-30 of the uses of l_wait_event in Lustre, this is the only one I can find where condition() is anything more than trivial.  Most of them are a single read of some variable, others are trivial state checking functions.

            So I'd say we're pretty clearly misusing condition() here, and we need to move the work of this function out of the condition().
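            For comparison, a typical l_wait_event() user elsewhere looks something like this (names illustrative), i.e. a pure, side-effect-free predicate:

                /* typical: the condition just reads state, no side effects */
                l_wait_event(thread->t_ctl_waitq, thread_is_stopped(thread), &lwi);

                /* versus the ptlrpcd case, where the "condition" does the real work */
                l_wait_event(set->set_waitq, ptlrpcd_check(&env, pc), &lwi);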


            adilger Andreas Dilger added a comment -

            I think it's too early for us to be able to give up RHEL7... One option would be to map wait_event_idle_*() to an equivalent l_wait_event() implementation so that we can migrate the code over to kernel style without dropping support. I suspect that would likely need several patches to change from setting the "style" of wait in LWI to being arguments to differently-named functions.
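            As an illustration only (untested, and whether a zeroed l_wait_info really gives TASK_IDLE-equivalent semantics would need checking), such a shim could be as thin as:

                /* hypothetical compat shim for kernels without wait_event_idle() */
                #ifndef wait_event_idle
                #define wait_event_idle(wq, condition)                  \
                do {                                                    \
                        struct l_wait_info lwi = { 0 };                 \
                                                                        \
                        l_wait_event(wq, condition, &lwi);              \
                } while (0)
                #endif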


            simmonsja James A Simmons added a comment -

            To my understanding, the key to the wait_event_idle*() variants that landed upstream in 4.17 is that they use the TASK_IDLE state, which was introduced with kernel commit 80ed87c8a9ca0cad7ca66cf3bbdfb17559a66dcf.

            Looking at the RHEL7 kernel code I don't see that handling.  It was introduced in kernel version 4.1-rc4, so it's pretty much supported everywhere but RHEL7.
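            If I remember that commit correctly, the relevant part is just a new state bit plus a composite state, roughly:

                /* from commit 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and
                 * TASK_IDLE"), approximately: an uninterruptible sleep that is
                 * excluded from the load-average accounting */
                #define TASK_NOLOAD             1024
                #define TASK_IDLE               (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)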


            pfarrell Patrick Farrell (Inactive) added a comment -

            Right, so he submitted the new variants upstream and also stuck them into his Lustre tree separately while waiting for them to land (partly because he wasn't sure he could get them landed without Lustre to use them).

            I'm pretty sure the only thing we need is the new task state for that wait-without-load behavior - which I believe is in RHEL7, but I will have to double check.  (Neil didn't add that.)


            adilger Andreas Dilger added a comment -

            One of the reasons l_wait_event() was written is that the upstream kernel didn't have the proper primitives for a service thread to wait without increasing the load.  Also, l_wait_event() will wake one thread exclusively to process an incoming request, rather than waking all of them each time.

            I recall that he added a wait_event_*() variant to handle the first case, but we need to check what kernel that went into, and whether it was backported to RHEL7.  Without that, the MDS and OSS would always show a load of 500 due to sleeping service threads, and clients would always have a load of +10 or so due to LDLM and other client threads running/sleeping.
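            For what it's worth, the exclusive-wake half has had an upstream equivalent for a long time (names below are illustrative only):

                /* waiters added by the _exclusive variants are woken one at a
                 * time by wake_up(); an idle/noload-flavoured equivalent is
                 * what would actually be wanted here to keep the load down */
                rc = wait_event_interruptible_exclusive(svc_waitq,
                                                        request_pending(svc));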

            pfarrell Patrick Farrell (Inactive) added a comment -

            The final version of Neil's patches is here:
            http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-February/thread.html#6510

            First one in the series is here:
            http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-February/006479.html

            pfarrell Patrick Farrell (Inactive) added a comment -

            So to summarize, this is long-standing Lustre behavior which is not actually a problem, but it trips a sanity check in newer kernels because what we do here is technically incorrect.  There's no functional bug here, in that nothing is going to go wrong, but we obviously don't want Lustre tripping sanity checks and printing errors.  The easy fix is to switch ourselves to using the kernel wait infrastructure, which Neil has done the heavy lifting for already.

            So, ruth.klundt@gmail.com, nothing for you to worry about operationally, but this is something we should address.

            When I get a little time, I'll look at how much work it will be to bring Neil's patches in.  There will be some drift from what he applied them to, and he also didn't do server code, so this might be a fair bit of work.  It's a good thing to do - it's much nicer not to open code basic functionality like this, now that the kernel has what we need - but it's going to take a bit.


            pfarrell Patrick Farrell (Inactive) added a comment - edited

            OK, I think I've figured this out.  This is a check which is not present in older kernels.

            The messages here contain clues:
            "kernel: do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffff000002a473d0>] ptlrpcd+0x410/0x5c0 [ptlrpc]"

            In particular, it points out where the state is set - ptlrpcd+0x410.  I don't have the modules in use here, but there's only one call in ptlrpcd() that should do that: l_wait_event.

                            l_wait_event(set->set_waitq, ptlrpcd_check(&env, pc), &lwi);
            

            And we see that ptlrpcd_check() is the condition() for this wait_event, and that's the function in the stack trace for the warning.

            Looking at l_wait_event, we indeed call condition() with TASK_INTERRUPTIBLE set.  The obvious solution is to simply change the task state before calling condition().

            The problem is that the subtle race condition described in the code means we can't:

                /* To guarantee that the condition check will be done */       \
                /* after setting the thread state as TASK_INTERRUPTIBLE. */    \
                /* Otherwise, out-of-order execution may cause some race. */   \
                /* Consider the following real execution order: */             \
                                                                                \
                /* 1. Thread1 checks condition on CPU1, gets false. */         \
                /* 2. Thread2 sets condition on CPU2. */                       \
                /* 3. Thread2 calls wake_up() on CPU2 to wake the threads */   \
                /*    with state TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE. */ \
                /*    But the Thread1's state is TASK_RUNNING at that time. */ \
                /* 4. Thread1 sets its state as TASK_INTERRUPTIBLE on CPU1, */ \
                /*    then schedule. */

            (from __l_wait_event definition in lustre_lib.h)

            So we can't switch the state before calling condition().
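            For reference, the canonical open-coded kernel wait loop that this ordering requirement comes from looks roughly like this (generic pattern, not Lustre code); setting the state first means a wake_up() arriving after step 1 just turns the later schedule() into a no-op instead of a lost wakeup:

                for (;;) {
                        set_current_state(TASK_INTERRUPTIBLE); /* 1. announce intent to sleep */
                        if (condition)                         /* 2. then re-check            */
                                break;
                        schedule();                            /* 3. sleep; a wake_up() after
                                                                *    step 1 makes this return
                                                                *    immediately              */
                }
                __set_current_state(TASK_RUNNING);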

            I thought about this for a few minutes and there's no obvious way to fix it in the code...  However, we've got another option entirely.

            Recently(-ish), Neil Brown removed l_wait_event and replaced it with the kernel wait_event() macros.

            Presumably, these macros do not have this problem with how they call their condition, since they're in the upstream kernel.  We could try to figure out what's different and incorporate it in our code, but I think it's much saner to just bring Neil's patches (which also fix a few other small things) in to our tree.
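            Just to sketch where that would presumably take the ptlrpcd loop (helper names here are invented; the real condition would be whatever lightweight check falls out of splitting ptlrpcd_check()):

                /* sketch only: heavy work outside the wait, trivial condition inside */
                do {
                        ptlrpcd_check(&env, pc);        /* runs in TASK_RUNNING */
                        wait_event_idle_timeout(set->set_waitq,
                                                ptlrpcd_has_work(pc),   /* hypothetical */
                                                cfs_time_seconds(timeout));
                } while (!ptlrpcd_should_stop(pc));     /* hypothetical exit check */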

            I'm going to add another message here with a summary.


            People

              Assignee: simmonsja James A Simmons
              Reporter: ruth.klundt@gmail.com Ruth Klundt (Inactive)
              Votes: 0
              Watchers: 11
