[LU-25] Blocking network request in ldlm shrinker Created: 16/Dec/10 Updated: 19/Jul/11 Resolved: 25/Apr/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 5075 |
| Description |
|
We have now verified at least twice that the ldlm shrinker sometimes gets stuck waiting for a network transaction to complete. We are of the opinion that this is an architectural problem in lustre: a shrinker should NEVER block on network transactions. In particular, the ldlm_cli_pool_shrink() shrinker is the one that has been giving us trouble in 1.8 for as long as a year. It was only a couple of months ago that it was finally brought to my attention, and the problem is intermittent enough that it has been difficult to catch as it happens.

Now both Brian and I have caught the problem with crash, and we have some other sysrq console backtraces that the sysadmins have gotten us. The problem always looks the same, at least from get_page_from_freelist() down to ptlrpc_queue_wait(); the stuck thread is always sitting in ptlrpc_queue_wait(). It is not clear why we are not getting the reply to the cancel request in a reasonable time, and it appears that this thread can be stuck in ptlrpc_queue_wait() for at least an hour. We are also not yet sure why it should block for so long. But hopefully we can agree that a blocking operation like that should NOT happen in the shrinker.

You'll notice that the kernel entry point did not even involve lustre here. We have seen other entry points, such as page faults, that also needed to call get_page_from_freelist() and then got stuck in lustre's ldlm client pool shrinker. Brian Behlendorf voiced concerns about this independently in comment #31 of Oracle bug 23598.

We were hoping that Oleg would take a look at this. Eric might also be interested, since this appears to be an architectural flaw. Our immediate need is to fix this particular shrinker, but we should probably make it policy to never perform blocking network transactions in shrinkers. Brian suggested that a quick work-around is to just disable this shrinker, since shrinkers are "best effort". As he suggested in comment #31, the longer term fix may be to make the shrink asynchronous, handling the blocking network transactions in another thread. |
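For illustration, here is a minimal sketch of the shape of a 2.6-era shrinker callback and why blocking in it is harmful; the exact prototype varies by kernel version and Lustre hides it behind compat wrappers, and example_count_freeable()/example_queue_async_cancels() are hypothetical helpers standing in for the real bookkeeping and deferral code.

{code}
/*
 * A shrinker callback is invoked from inside the page allocator on behalf of
 * whatever thread happens to be allocating memory -- a page fault, a TCP
 * send, anything -- so blocking here on a network round trip stalls that
 * unrelated thread.
 */
static int example_shrink(int nr_to_scan, unsigned int gfp_mask)
{
        if (nr_to_scan == 0)
                return example_count_freeable();   /* report only, never block */

        /* Free what can be freed locally; anything that needs an RPC
         * (e.g. lock cancels) is queued to a dedicated thread instead of
         * being waited for here. */
        example_queue_async_cancels(nr_to_scan);
        return example_count_freeable();
}
{code}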
| Comments |
| Comment by Robert Read (Inactive) [ 16/Dec/10 ] |
|
Assigning to Oleg for analysis. |
| Comment by Christopher Morrone [ 05/Jan/11 ] |
|
The LDLM_ASYNC fix suggested in the Oracle bug seems reasonable. Should I go with that for now? |
| Comment by Dan Ferber (Inactive) [ 09/Jan/11 ] |
|
See BZ 23598 for comments as we work with this bug. |
| Comment by Oleg Drokin [ 10/Jan/11 ] |
|
Ok, I just went through the code, and we only really have three shrinkers registered in 1.8. Only ldlm_pools_cli_shrinker does blocking RPCs, and this could be averted by a single replacement of LDLM_SYNC with LDLM_ASYNC in ldlm_cli_pool_shrink(). In 2.x, in addition to the three shrinkers we had in 1.8, another shrinker is added: lu_cache_shrink. I looked through its possible paths (loo_object_delete and loo_object_free) and none of them seems to call any RPCs. |
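Sketched below is roughly what that one-line change looks like; the surrounding function is paraphrased from memory of the 1.8 ldlm_pool code, not quoted verbatim from any particular release.

{code}
static int ldlm_cli_pool_shrink(struct ldlm_pool *pl,
                                int nr, unsigned int gfp_mask)
{
        struct ldlm_namespace *ns = ldlm_pl2ns(pl);

        /* Old behaviour: LDLM_SYNC made the shrinker issue the cancel RPCs
         * itself and wait for the replies:
         *     return ldlm_cancel_lru(ns, nr, LDLM_SYNC, LDLM_CANCEL_SHRINK);
         * New behaviour: hand the cancels off and return immediately. */
        return ldlm_cancel_lru(ns, nr, LDLM_ASYNC, LDLM_CANCEL_SHRINK);
}
{code}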
| Comment by Dan Ferber (Inactive) [ 10/Jan/11 ] |
|
Assigned to Niu, to implement the patch. |
| Comment by Niu Yawei (Inactive) [ 10/Jan/11 ] |
|
"But hopefully we can agree that a blocking operation like that should NOT happen in the shrinker" I don't see why a shrinker can't do blocking operation, I think it's totally depends on the gfp masks passed from caller, maybe we should just take care of the gfp mask correctly in shrinker. Did I miss anything? I suspect this is a deadlock caused by ldlm_cli_pool_shrink() doing blocking operation without checking gfd mask, canceling lock could trigger io and other network RPCs, which is deadlock prone if the caller is already in an io/networking context. I'll go through the code to see if deadlock will happen in such case. Of course, it'll be great if there are full stack traces on both client and server. |
| Comment by Alex Zhuravlev [ 11/Jan/11 ] |
|
>> I don't see why a shrinker can't do blocking operation, I think it's totally depends on the gfp masks passed from caller, maybe we should just take care of the gfp mask correctly in shrinker. Did I miss anything?

In some cases (tcp/ip internals or vmalloc, iirc) gfp is not used properly. |
| Comment by Andreas Dilger [ 11/Jan/11 ] |
|
What is the exact Lustre version? I recall a fix in 1.8 that added a check for the GFP mask before doing anything in the dlm shrinkers. My recollection is that LLNL is based on 1.8.2, so maybe this fix is not included in your release? |
| Comment by Christopher Morrone [ 11/Jan/11 ] |
|
We are seeing the issues with 1.8.4, but I am certain that 1.8.5+ will have the same problem. Brian Behlendorf and others have noted the same issue with the 2.X code during kdmu branch development as well. We did indeed hit at least two GFP mask issues in the past year, but I do not think that this is one of them.

I am not sure that you can claim that the kernel is using the "wrong" GFP mask. To make that claim, you would basically be saying that the kernel should almost never be allowed to use GFP_KERNEL. I do not think that __GFP_IO and __GFP_FS are a license to block indefinitely, or even for 15 minutes. Certainly no one in user space expects to write into a TCP socket and have it block for half an hour before sending because lustre needed to block on an RPC before a page could be allocated.

FYI, you can see the one-liner fix that I put in our tree here: https://github.com/chaos/lustre/tree/1.8.4-llnl It is going through two weeks of testing on hype, along with testing of the latest chaos 4.4-1 point release. After that, it will begin rolling out onto all of the lustre clients. |
| Comment by Niu Yawei (Inactive) [ 11/Jan/11 ] |
|
Given the high latency and instability of the network, I tend to agree with Christopher that sending RPCs in a shrinker is unwise. The patch changing LDLM_SYNC to LDLM_ASYNC looks ok to me. The side effect of the patch is that ldlm_cli_pool_shrink() will no longer reclaim as much as the kernel expects, and the possibility of memory allocation failure increases; however, I don't see any better solution so far. I think we can land this patch on b1_8/master if no one objects. Andreas, Oleg, any comments? |
| Comment by Robert Read (Inactive) [ 12/Jan/11 ] |
|
At the very least we should wait for testing results from hyperion before this is landed. |
| Comment by Christopher Morrone [ 25/Feb/11 ] |
|
On our smaller hype cluster (96 nodes, 1536 processes), testing has gone even better than expected. I suspect that this fix was a big part of that. I think this should definitely go into both 1.8 and master. |
| Comment by Christopher Morrone [ 28/Feb/11 ] |
|
This patch needs to be landed on master too. I pushed the patch to gerrit: |
| Comment by Build Master (Inactive) [ 28/Feb/11 ] |
|
Integrated in reviews-patchless-centos5 #72 (See http://build.whamcloud.com/job/reviews-patchless-centos5/72/) Christopher J. Morrone : http://git.whamcloud.com/gitweb/?p=lustre.git;a=commit&a=commit&h=960557722df2194526792f581e7dadc84247e408
|
| Comment by Build Master (Inactive) [ 28/Feb/11 ] |
|
Integrated in reviews-ubuntu #176 (See http://build.whamcloud.com/job/reviews-ubuntu/176/) Christopher J. Morrone : http://git.whamcloud.com/gitweb/?p=lustre.git;a=commit&a=commit&h=960557722df2194526792f581e7dadc84247e408
|
| Comment by Build Master (Inactive) [ 28/Feb/11 ] |
|
Integrated in reviews-centos5 #339 (See http://build.whamcloud.com/job/reviews-centos5/339/) Christopher J. Morrone : http://git.whamcloud.com/gitweb/?p=lustre.git;a=commit&a=commit&h=960557722df2194526792f581e7dadc84247e408
|
| Comment by Eric Barton (Inactive) [ 01/Mar/11 ] |
|
I would like to understand the mechanics of the hangs when LRU shrinking was doing sync lock cancels. I suspect having PF_MEMALLOC set in the thread sending requests could have allocated memory too aggressively so that replies from the server were getting dropped - for example in lnet_parse() where it calls lnet_msg_alloc(). Is it worthwhile trying to re-create this bug with better logging of allocation failures? |
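Roughly (paraphrased, not a verbatim quote of lnet_parse()), the failure mode being described would look like this:

{code}
/* If the message descriptor cannot be allocated under memory pressure, the
 * incoming message -- possibly the very reply the shrinker thread is waiting
 * for -- is dropped. */
msg = lnet_msg_alloc();
if (msg == NULL) {
        CERROR("Dropping incoming message: out of memory\n");
        goto drop;
}
{code}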
| Comment by Johann Lombardi (Inactive) [ 01/Mar/11 ] |
|
Christopher, could you please let us know if you have the patch from bug 21776 applied (the one that set PF_MEMALLOC on the outgoing path)? |
| Comment by Christopher Morrone [ 01/Mar/11 ] |
|
No, we do not have the patch from 21776. |
| Comment by Oleg Drokin [ 01/Mar/11 ] |
|
I wonder if we can have some failure case logs from the client before the patch? The question is why the ptlrpc_queue_wait sits there for hours. There must be some clues in the logs. Unable to allocate memory? Retrying sending in a loop? Something else? |
| Comment by Christopher Morrone [ 02/Mar/11 ] |
|
Yes, it bothers me too that we don't know why the rpc was hanging. But sadly no, we don't have any good logs that explain it. I am also not aware of a reliable reproducer, outside of actual production usage. And to make matters worse, it was usually on the secure network that this happened. I was never able to get detailed enough logs to explain why it was stuck. I was usually alerted to the problem so long after it happened that it took quite a while to even definitively determine that things were getting stuck in the shrinker. |
| Comment by Christopher Morrone [ 17/Mar/11 ] |
|
I think this issue should be on the 2.1-blocker list until landed. |
| Comment by Niu Yawei (Inactive) [ 06/Apr/11 ] |
|
Hi Chris, the stack trace in this ticket is from 1.8, and the 2.0 code is different: in 2.0, ldlm_cli_pool_shrink() should always pass all the cancel work to another thread and then wait for the work to finish, so you should never see ldlm_cli_pool_shrink() calling ptlrpc_queue_wait() directly. I checked bz23598, and it looks like a zfs-related issue. Did you ever see such a problem on a 2.0 system (not the kdmu branch)? Thanks |
| Comment by Peter Jones [ 11/Apr/11 ] |
|
Niu, I think that LLNL are carrying an equivalent patch in their 2.1 version to avoid ever hitting this issue, so they probably do not have such a stack trace. Peter |
| Comment by Build Master (Inactive) [ 13/Apr/11 ] |
|
Integrated in Oleg Drokin : 3dcb5fb7b79949d780bc82b1439622989516e34f
|
| Comment by Peter Jones [ 25/Apr/11 ] |
|
Landed for 2.1 |
| Comment by Cory Spitz [ 19/Jul/11 ] |
|
I see that the use of LDLM_ASYNC wasn't included in 1.8.6-wc1. Since it was in LLNL's chaos 1.8.4, should it be landed on b1_8 now? |