
kernel warning 'do not call blocking ops when !TASK_RUNNING ' in ptlrpcd

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Upstream, Lustre 2.15.0
    • Affects Version/s: Lustre 2.12.0
    • Environment: ARM client running TOSS kernel 4.14.0-115.6.1.1chaos.ch6a.aarch64
    • Severity: 3

    Description

      kernel: do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffff000002a473d0>] ptlrpcd+0x410/0x5c0 [ptlrpc]
      kernel: ------------[ cut here ]------------
      kernel: WARNING: CPU: 35 PID: 14128 at kernel/sched/core.c:5988 __might_sleep+0x84/0x8c 


          Activity

            [LU-12362] kernel warning 'do not call blocking ops when !TASK_RUNNING ' in ptlrpcd

            pfarrell Patrick Farrell (Inactive) added a comment -

            Thanks for weighing in, Neil.

            I think either is fine; my glance through the code suggested that the 'condition' being checked is kind of complicated.  Perhaps there is some simplification I missed, but it seemed like there were various ways/reasons it would decide it had nothing to do.

            neilb Neil Brown added a comment -

            That code is rather convoluted...

            The possible problem here is that the wait loop could spin rather than waiting.  If the 'condition' ever comes close to sleeping, it will reset the task state to RUNNING and the subsequent schedule() call will do nothing, causing l_wait_event() to loop around and call the condition() function again.

            If the condition() does this every time through, then the wait loop becomes a busy-wait.  If it only does it occasionally, then there is no real problem.

            I can see two approaches to "fixing" it.

            One is to change every place that calls wake_up on ->set_waitq to also set a flag.  Then ptlrpcd() waits for that flag to be set, then calls ptlrpcd_check().

            The other is to factor out the checks in ptlrpcd_check() from the work.  Then do the work before calling l_wait_event(), then just do the checks inside l_wait_event().

            I couldn't suggest which is better without first following through all the code and working out what it is really doing.
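            A very rough sketch of the first option, with an invented flag name (set_work_pending) and glossing over the finer memory-ordering details, might be:

                /* Sketch only: every caller that currently does
                 * wake_up(&set->set_waitq) would go through a helper that also
                 * records that there is work for ptlrpcd to do. */
                static void ptlrpcd_queue_work(struct ptlrpc_request_set *set)
                {
                        WRITE_ONCE(set->set_work_pending, 1);   /* hypothetical flag */
                        smp_mb();       /* order the flag against the waitqueue check */
                        wake_up(&set->set_waitq);
                }

                /* ...and the wait in ptlrpcd() would only test that flag,
                 * calling ptlrpcd_check() afterwards, back in TASK_RUNNING: */
                l_wait_event(set->set_waitq,
                             READ_ONCE(set->set_work_pending) != 0, &lwi);
                WRITE_ONCE(set->set_work_pending, 0);
                ptlrpcd_check(&env, pc);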

             


            pfarrell Patrick Farrell (Inactive) added a comment -

            So the problem is we're calling ptlrpcd_check() as the condition, and it sends RPCs, etc., rather than just checking a condition.

            BTW, in case anyone else was going to check, Neil didn't already fix this for us ( :) ):
            http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-February/006500.html


            pfarrell Patrick Farrell (Inactive) added a comment -

            Yes, that's the main thing, and it's a shame that it's not in RHEL 7.

            Oh, this is interesting - actually, wait_event() has the same behavior.  It calls condition() after setting task_state to whatever has been requested.  So, actually, I think the problem might be with the nature of our condition().  So in fact we'd still have this problem even with Neil's patch.  My bad.

            In fact, now that I look at it, our condition here is pretty weird.  It's using condition() to do the main work of the loop.

            Having sampled about 20-30 of the uses of l_wait_event in Lustre, this is the only one I can find where condition() is anything more than trivial.  Most of them are a single read of some variable, others are trivial state checking functions.

            So I'd say we're pretty clearly misusing condition() here, and we need to move the work of this function out of the condition().
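            For comparison, a typical l_wait_event() user elsewhere looks something like this (names illustrative), i.e. a pure, side-effect-free predicate:

                /* typical: the condition just reads state, no side effects */
                l_wait_event(thread->t_ctl_waitq, thread_is_stopped(thread), &lwi);

                /* versus the ptlrpcd case, where the "condition" does the real work */
                l_wait_event(set->set_waitq, ptlrpcd_check(&env, pc), &lwi);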


            adilger Andreas Dilger added a comment -

            I think it's too early for us to be able to give up RHEL7... One option would be to map wait_event_idle_*() to an equivalent l_wait_event() implementation so that we can migrate the code over to kernel style without dropping support. I suspect that would likely need several patches to change from setting the "style" of wait in LWI to being arguments to differently-named functions.
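            As an illustration only (untested, and whether a zeroed l_wait_info really gives TASK_IDLE-equivalent semantics would need checking), such a shim could be as thin as:

                /* hypothetical compat shim for kernels without wait_event_idle() */
                #ifndef wait_event_idle
                #define wait_event_idle(wq, condition)                  \
                do {                                                    \
                        struct l_wait_info lwi = { 0 };                 \
                                                                        \
                        l_wait_event(wq, condition, &lwi);              \
                } while (0)
                #endif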


            simmonsja James A Simmons added a comment -

            To my understanding, the key to the wait_event_idle*() variants that landed upstream in 4.17 is that they use the TASK_IDLE state, which was introduced with kernel commit 80ed87c8a9ca0cad7ca66cf3bbdfb17559a66dcf.

            Looking at the RHEL7 kernel code I don't see that handling.  It was introduced in kernel version 4.1-rc4, so it's pretty much supported everywhere but RHEL7.
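            If I remember that commit correctly, the relevant part is just a new state bit plus a composite state, roughly:

                /* from commit 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and
                 * TASK_IDLE"), approximately: an uninterruptible sleep that is
                 * excluded from the load-average accounting */
                #define TASK_NOLOAD             1024
                #define TASK_IDLE               (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)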


            pfarrell Patrick Farrell (Inactive) added a comment -

            Right, so he submitted the new variants upstream and also stuck them into his Lustre tree separately while waiting for them to land (partly because he wasn't sure he could get them landed without Lustre to use them).

            I'm pretty sure the only thing we need is the new task state for that wait-without-load behavior - which I believe is in RHEL7, but I will have to double check.  (Neil didn't add that.)


            adilger Andreas Dilger added a comment -

            One of the reasons l_wait_event() was written is that the upstream kernel didn't have the proper primitives for a service thread to wait without increasing the load.  Also, l_wait_event() will wake one thread exclusively to process an incoming request, rather than waking all of them each time.

            I recall that he added a wait_event_*() variant to handle the first case, but we need to check what kernel that went into, and whether it was backported to RHEL7.  Without that, the MDS and OSS would always show a load of 500 due to sleeping service threads, and clients would always have a load of +10 or so due to LDLM and other client threads running/sleeping.
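            For what it's worth, the exclusive-wake half has had an upstream equivalent for a long time (names below are illustrative only):

                /* waiters added by the _exclusive variants are woken one at a
                 * time by wake_up(); an idle/noload-flavoured equivalent is
                 * what would actually be wanted here to keep the load down */
                rc = wait_event_interruptible_exclusive(svc_waitq,
                                                        request_pending(svc));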

            pfarrell Patrick Farrell (Inactive) added a comment -

            The final version of Neil's patches is here:
            http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-February/thread.html#6510

            First one in the series is here:
            http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-February/006479.html

            pfarrell Patrick Farrell (Inactive) added a comment -

            So to summarize, this is long-standing Lustre behavior which is not actually a problem, but it trips a sanity check in newer kernels because what we do here is technically incorrect.  There's no functional bug here, in that nothing is going to go wrong, but we obviously don't want Lustre tripping sanity checks and printing errors.  The easy fix is to switch ourselves to using the kernel wait infrastructure, which Neil has done the heavy lifting for already.

            So, ruth.klundt@gmail.com, nothing for you to worry about operationally, but this is something we should address.

            When I get a little time, I'll look at how much work it will be to bring Neil's patches in.  There will be some drift from what he applied them to, and he also didn't do server code, so this might be a fair bit of work.  It's a good thing to do - it's much nicer not to open code basic functionality like this, now that the kernel has what we need - but it's going to take a bit.


            pfarrell Patrick Farrell (Inactive) added a comment - edited

            OK, I think I've figured this out.  This is a check which is not present in older kernels.

            The messages here contain clues:
            "kernel: do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffff000002a473d0>] ptlrpcd+0x410/0x5c0 [ptlrpc]"

            In particular, it points out where the state is set - ptlrpcd+0x410.  I don't have the modules in use here, but there's only one call in ptlrpcd() that should do that: l_wait_event.

                            l_wait_event(set->set_waitq, ptlrpcd_check(&env, pc), &lwi);
            

            And we see that ptlrpcd_check() is the condition() for this wait_event, and that's the function in the stack trace for the warning.

            Looking at l_wait_event, we indeed call condition() with TASK_INTERRUPTIBLE set.  The obvious solution is to simply change the task state before calling condition().

            The problem is that the subtle race condition described in the code means we can't:

                /* To guarantee that the condition check will be done */       \
                /* after setting the thread state as TASK_INTERRUPTIBLE. */    \
                /* Otherwise, out-of-order execution may cause some race. */   \
                /* Consider the following real execution order: */             \
                                                                                \
                /* 1. Thread1 checks condition on CPU1, gets false. */         \
                /* 2. Thread2 sets condition on CPU2. */                       \
                /* 3. Thread2 calls wake_up() on CPU2 to wake the threads */   \
                /*    with state TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE. */ \
                /*    But the Thread1's state is TASK_RUNNING at that time. */ \
                /* 4. Thread1 sets its state as TASK_INTERRUPTIBLE on CPU1, */ \
                /*    then schedule. */

            (from __l_wait_event definition in lustre_lib.h)

            So we can't switch the state before calling condition().
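            For reference, the canonical open-coded kernel wait loop that this ordering requirement comes from looks roughly like this (generic pattern, not Lustre code); setting the state first means a wake_up() arriving after step 1 just turns the later schedule() into a no-op instead of a lost wakeup:

                for (;;) {
                        set_current_state(TASK_INTERRUPTIBLE); /* 1. announce intent to sleep */
                        if (condition)                         /* 2. then re-check            */
                                break;
                        schedule();                            /* 3. sleep; a wake_up() after
                                                                *    step 1 makes this return
                                                                *    immediately              */
                }
                __set_current_state(TASK_RUNNING);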

            I thought about this for a few minutes and there's no obvious way to fix it in the code...  However, we've got another option entirely.

            Recently(-ish), Neil Brown removed l_wait_event and replaced it with the kernel wait_event() macros.

            Presumably, these macros do not have this problem with how they call their condition, since they're in the upstream kernel.  We could try to figure out what's different and incorporate it in our code, but I think it's much saner to just bring Neil's patches (which also fix a few other small things) in to our tree.
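            Just to sketch where that would presumably take the ptlrpcd loop (helper names here are invented; the real condition would be whatever lightweight check falls out of splitting ptlrpcd_check()):

                /* sketch only: heavy work outside the wait, trivial condition inside */
                do {
                        ptlrpcd_check(&env, pc);        /* runs in TASK_RUNNING */
                        wait_event_idle_timeout(set->set_waitq,
                                                ptlrpcd_has_work(pc),   /* hypothetical */
                                                cfs_time_seconds(timeout));
                } while (!ptlrpcd_should_stop(pc));     /* hypothetical exit check */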

            I'm going to add another message here with a summary.


            People

              Assignee: simmonsja James A Simmons
              Reporter: ruth.klundt@gmail.com Ruth Klundt (Inactive)
              Votes: 0
              Watchers: 11
