[LU-25] Blocking network request in ldlm shrinker Created: 16/Dec/10  Updated: 19/Jul/11  Resolved: 25/Apr/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 5075

 Description   

We have now verified at least twice that the ldlm shrinker is sometimes getting stuck waiting for a network transaction to complete. We are of the opinion that this is an architectural problem in lustre. A shrinker should NEVER block on network transactions.

In particular, the ldlm_cli_pool_shrink() shrinker is the one that has been giving us trouble in 1.8 for as long as a year. It was only a couple of months ago that it was finally brought to my attention, and the problem is intermittent enough that it has been difficult to catch it as it happens. Now both Brian and I have caught the problem with crash, and we have some other sysrq console backtraces that the sysadmins have gotten us. The problem always looks like the following, at least from get_page_from_freelist() to ptlrpc_queue_wait():

ptlrpc_queue_wait
lustre_msg_set_opc
ptlrpc_at_set_req_timeout
lustre_msg_buf
ldlm_cli_cancel_req
ldlm_cli_cancel_list
ldlm_cancel_passed_policy
ldlm_cancel_lru
ldlm_cli_pool_shrink
ldlm_cli_pool_shrink
ldlm_pool_shrink
ldlm_namespace_move_locked
ldlm_pools_shrink
ldlm_pools_cli_shrink
shrink_slab
get_page_from_freelist
__alloc_pages
alloc_pages_current
tcp_sendmsg
inet_sendmsg
do_sock_write
sock_aio_write
do_sync_write
sys_getsockname
vfs_write
sys_write

It is not clear why we are not getting the reply to the cancel request in a reasonable time. And it appears that this thread can be stuck in ptlrpc_queue_wait() for at least an hour. We are also not yet sure why that should block for so long.

But hopefully we can agree that a blocking operation like that should NOT happen in the shrinker. You'll notice that the kernel entry point did not even involve lustre here. We have seen other entry points, such as page faults, that also needed to call get_page_from_freelist() and then got stuck in lustre's ldlm client pool shrinker.

Brian Behlendorf voiced concerns about this independently in comment #31 of Oracle bug 23598.

We were hoping that Oleg would take a look at this. Eric might also be interested, since this appears to be an architectural flaw.

Our immediate need is to fix this particular shrinker, but we should probably make it policy to never perform blocking network transactions in shrinkers.

Brian suggested that a quick work-around is to just disable this shrinker, since shrinkers are "best effort". As he suggested in comment #31, the longer term fix may be to make the shrink asynchronous, handling the blocking network transactions in another thread.
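
To make the pattern concrete, here is a minimal, self-contained userspace sketch (not Lustre code; all names are hypothetical) of the design Brian suggests: the shrink path only queues cancel work and returns immediately, and a dedicated thread is the only place allowed to block on the network. As far as I understand, in the actual Lustre code the existing ldlm blocking-callback threads would fill the role of the worker.

/*
 * Hypothetical sketch of an asynchronous shrinker.  Compile with -lpthread.
 * The shrinker never blocks on RPCs; it only enqueues work and returns.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct cancel_work {
        int nr_locks;                   /* how many LRU locks to cancel */
        struct cancel_work *next;
};

static struct cancel_work *work_queue;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;

/* Shrinker path: never waits on the network, just enqueues and returns. */
static int pool_shrink_async(int nr_locks)
{
        struct cancel_work *w = malloc(sizeof(*w));

        if (w == NULL)
                return -1;
        w->nr_locks = nr_locks;

        pthread_mutex_lock(&queue_lock);
        w->next = work_queue;
        work_queue = w;
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
        return 0;                       /* "best effort": caller not blocked */
}

/* Dedicated cancel thread: the only place allowed to wait on RPC replies. */
static void *cancel_thread(void *arg)
{
        for (;;) {
                struct cancel_work *w;

                pthread_mutex_lock(&queue_lock);
                while (work_queue == NULL)
                        pthread_cond_wait(&queue_cond, &queue_lock);
                w = work_queue;
                work_queue = w->next;
                pthread_mutex_unlock(&queue_lock);

                /* Stand-in for the blocking cancel RPC (ptlrpc_queue_wait). */
                printf("cancelling %d locks (may block on the network)\n",
                       w->nr_locks);
                sleep(1);
                free(w);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, cancel_thread, NULL);
        pool_shrink_async(128);         /* returns immediately */
        pool_shrink_async(256);
        sleep(3);                       /* let the worker drain the queue */
        return 0;
}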



 Comments   
Comment by Robert Read (Inactive) [ 16/Dec/10 ]

Assigning to Oleg for analysis.

Comment by Christopher Morrone [ 05/Jan/11 ]

The LDLM_ASYNC fix suggested in the Oracle bug seems reasonable. Should I go with that for now?

Comment by Dan Ferber (Inactive) [ 09/Jan/11 ]

See BZ 23598 for comments as we work with this bug.

Comment by Oleg Drokin [ 10/Jan/11 ]

Ok, I just went through the code, and we only really have 3 shrinkers registered in 1.8:
ldlm_pools_cli_shrinker, ldlm_pools_srv_shrinker, llap_shrink_cache.

Only ldlm_pools_cli_shrinker does blocking RPCs, and this could be averted by a single replacement of LDLM_SYNC with LDLM_ASYNC in ldlm_cli_pool_shrink().
The other two don't cause any RPCs.

In 2.x, in addition to the three shrinkers we had in 1.8, another one is added: lu_cache_shrink. I looked through the possible paths (loo_object_delete and loo_object_free) and none of them seem to call any RPCs.
So you have probably hit the last remaining blocking piece in the shrinkers.
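
For reference, a hedged sketch of the one-line change described above, in lustre/ldlm/ldlm_pool.c. The exact arguments to ldlm_cancel_lru() vary by branch, so treat this as an illustration of where the sync/async flag flips rather than the literal patch:

/* before: the shrinker itself waits for the cancel RPCs to complete */
ldlm_cancel_lru(ldlm_pl2ns(pl), nr, LDLM_SYNC, LDLM_CANCEL_SHRINK);

/* after: the cancels are handed off and the shrinker returns promptly */
ldlm_cancel_lru(ldlm_pl2ns(pl), nr, LDLM_ASYNC, LDLM_CANCEL_SHRINK);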

Comment by Dan Ferber (Inactive) [ 10/Jan/11 ]

Assigned to Niu, to implement the patch.

Comment by Niu Yawei (Inactive) [ 10/Jan/11 ]

"But hopefully we can agree that a blocking operation like that should NOT happen in the shrinker"

I don't see why a shrinker can't do blocking operations; I think it totally depends on the gfp mask passed from the caller, and maybe we should just handle the gfp mask correctly in the shrinker. Did I miss anything?

I suspect this is a deadlock caused by ldlm_cli_pool_shrink() doing blocking operations without checking the gfp mask: canceling a lock could trigger I/O and other network RPCs, which is deadlock-prone if the caller is already in an I/O or networking context.

I'll go through the code to see if a deadlock can happen in such a case. Of course, it would be great if we had full stack traces from both the client and the server.
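
For illustration, here is a hedged sketch of the gfp-mask check mentioned above, written in the shape of an old-style shrinker callback. The function name and return convention are hypothetical, not the actual Lustre code:

#include <linux/gfp.h>
#include <linux/mm.h>

/* Old-style shrinker callback: asked to scan nr_to_scan objects under gfp_mask. */
static int example_cache_shrink(int nr_to_scan, gfp_t gfp_mask)
{
        /*
         * If the allocation context cannot re-enter the filesystem,
         * refuse to do any work that might recurse into FS or network paths.
         */
        if (nr_to_scan != 0 && !(gfp_mask & __GFP_FS))
                return -1;      /* tell the VM this cache cannot shrink right now */

        /* ... free only objects that can be dropped without blocking ... */
        return 0;               /* remaining cache size (0 here for brevity) */
}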

Comment by Alex Zhuravlev [ 11/Jan/11 ]

>> I don't see why a shrinker can't do blocking operations; I think it totally depends on the gfp mask passed from the caller, and maybe we should just handle the gfp mask correctly in the shrinker. Did I miss anything?

In some cases (TCP/IP internals or vmalloc, IIRC) the gfp mask is not used properly.

Comment by Andreas Dilger [ 11/Jan/11 ]

What is the exact Lustre version? I recall a fix in 1.8 that added a check of the GFP mask before doing anything in the DLM shrinkers. My recollection is that LLNL is based on 1.8.2, so maybe this fix is not included in your release?

Comment by Christopher Morrone [ 11/Jan/11 ]

We are seeing the issues with 1.8.4, but I am certain that 1.8.5+ will have the same problem. Brian Behlendorf and others have noted the same issue with the 2.X code during kdmu branch development as well.

We did indeed hit at least two GFP mask issues in the past year, but I do not think that this is one of them.

I am not sure that you can claim that the kernel is using the "wrong" GFP mask. To make that claim, you would basically be saying that the kernel should almost never be allowed to use GFP_KERNEL.

I do not think that __GFP_IO and __GFP_FS are a license to block indefinitely. Or even for 15 minutes. Certainly no one in user space expects to write into a TCP socket and have it block for half of an hour before sending because lustre needed to block on an RPC before a page could be allocated.

FYI, you can see the one-liner fix that I put in our tree here:

https://github.com/chaos/lustre/tree/1.8.4-llnl

It is going through two weeks of testing on hype, along with testing of the latest chaos 4.4-1 point release. After that, it will begin rolling out onto all of the lustre clients.

Comment by Niu Yawei (Inactive) [ 11/Jan/11 ]

Given the high latency and instability of the network, I tend to agree with Christopher that sending RPCs in a shrinker is unwise.

The patch of changing LDLM_SYNC to LDLM_ASYNC looks OK to me. The side effect of the patch is that ldlm_cli_pool_shrink() will not work as the kernel expects, and the possibility of memory allocation failure is increased; however, I don't see any better solution so far.

I think we can land this patch on b1_8/master if no one objects. Andreas, Oleg, any comments?

Comment by Robert Read (Inactive) [ 12/Jan/11 ]

At the very least we should wait for testing results from hyperion before this is landed.

Comment by Christopher Morrone [ 25/Feb/11 ]

On our smaller hype cluster (96 nodes, 1536 processes), testing has gone even better than expected. I suspect that this fix was a big part of that. I think this should definitely go into both 1.8 and master.

Comment by Christopher Morrone [ 28/Feb/11 ]

This patch needs to be landed on master too. I pushed the patch to gerrit:

http://review.whamcloud.com/277

Comment by Build Master (Inactive) [ 28/Feb/11 ]

Integrated in reviews-patchless-centos5 #72 (See http://build.whamcloud.com/job/reviews-patchless-centos5/72/)
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Christopher J. Morrone : http://git.whamcloud.com/gitweb/?p=lustre.git;a=commit&a=commit&h=960557722df2194526792f581e7dadc84247e408
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Build Master (Inactive) [ 28/Feb/11 ]

Integrated in reviews-ubuntu #176 (See http://build.whamcloud.com/job/reviews-ubuntu/176/)
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Christopher J. Morrone : http://git.whamcloud.com/gitweb/?p=lustre.git;a=commit&a=commit&h=960557722df2194526792f581e7dadc84247e408
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Build Master (Inactive) [ 28/Feb/11 ]

Integrated in reviews-centos5 #339 (See http://build.whamcloud.com/job/reviews-centos5/339/)
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Christopher J. Morrone : http://git.whamcloud.com/gitweb/?p=lustre.git;a=commit&a=commit&h=960557722df2194526792f581e7dadc84247e408
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Eric Barton (Inactive) [ 01/Mar/11 ]

I would like to understand the mechanics of the hangs when LRU shrinking was doing sync lock cancels. I suspect having PF_MEMALLOC set in the thread sending requests could have allocated memory too aggressively so that replies from the server were getting dropped - for example in lnet_parse() where it calls lnet_msg_alloc(). Is it worthwhile trying to re-create this bug with better logging of allocation failures?

Comment by Johann Lombardi (Inactive) [ 01/Mar/11 ]

Christopher, could you please let us know if you have the patch from bug 21776 applied (the one that set PF_MEMALLOC on the outgoing path)?

Comment by Christopher Morrone [ 01/Mar/11 ]

No, we do not have the patch from 21776.

Comment by Oleg Drokin [ 01/Mar/11 ]

I wonder if we can get some failure-case logs from the client from before the patch? The question is why ptlrpc_queue_wait() sits there for hours. There must be some clues in the logs. Unable to allocate memory? Retrying sending in a loop? Something else?
Otherwise, as Eric properly indicates, we don't have a full understanding of why it hangs, even though we are able to treat the symptom.

Comment by Christopher Morrone [ 02/Mar/11 ]

Yes, it bothers me too that we don't know why the rpc was hanging.

But sadly no, we don't have any good logs that explain it. I am also not aware of a reliable reproducer, outside of actual production usage. And to make matters worse, it was usually on the secure network that this happened.

I was never able to get detailed enough logs to explain why it was stuck. I was usually alerted to the problem so long after it happened that it took quite a while to even definitively determine that things were getting stuck in the shrinker.

Comment by Christopher Morrone [ 17/Mar/11 ]

I think this issue should be on the 2.1-blocker list until landed.

Comment by Niu Yawei (Inactive) [ 06/Apr/11 ]

Hi, Chris

The stack trace in this ticket is from 1.8, and the 2.0 code is different from 1.8: in 2.0, ldlm_cli_pool_shrink() should always pass all the cancel work to another thread and then wait for the work to be done, so it's impossible to see ldlm_cli_pool_shrink() calling ptlrpc_queue_wait() directly.

I checked bz23598, and it looks like a ZFS-related issue. Did you ever see such a problem on a 2.0 system (not the kdmu branch)?

Thanks
Niu

Comment by Peter Jones [ 11/Apr/11 ]

Niu

I think that LLNL are carrying an equivalent patch in their 2.1 version to avoid ever hitting this issue, so they probably do not have such a stack trace.

Peter

Comment by Build Master (Inactive) [ 13/Apr/11 ]

Integrated in lustre-master » client,el5-x86_64 #24
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Oleg Drokin : 3dcb5fb7b79949d780bc82b1439622989516e34f
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Build Master (Inactive) [ 13/Apr/11 ]

Integrated in lustre-master-centos5 #191
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Oleg Drokin : 3dcb5fb7b79949d780bc82b1439622989516e34f
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Build Master (Inactive) [ 13/Apr/11 ]

Integrated in lustre-master » client,el5-i686 #24
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Oleg Drokin : 3dcb5fb7b79949d780bc82b1439622989516e34f
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Build Master (Inactive) [ 13/Apr/11 ]

Integrated in lustre-master » server,el6-x86_64 #24
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Oleg Drokin : 3dcb5fb7b79949d780bc82b1439622989516e34f
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Build Master (Inactive) [ 13/Apr/11 ]

Integrated in lustre-master » client,ubuntu-x86_64 #24
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Oleg Drokin : 3dcb5fb7b79949d780bc82b1439622989516e34f
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Build Master (Inactive) [ 13/Apr/11 ]

Integrated in lustre-master » client,el6-i686 #24
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Oleg Drokin : 3dcb5fb7b79949d780bc82b1439622989516e34f
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Build Master (Inactive) [ 13/Apr/11 ]

Integrated in lustre-master » server,el5-x86_64 #24
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Oleg Drokin : 3dcb5fb7b79949d780bc82b1439622989516e34f
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Build Master (Inactive) [ 13/Apr/11 ]

Integrated in lustre-master » client,el6-x86_64 #24
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Oleg Drokin : 3dcb5fb7b79949d780bc82b1439622989516e34f
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Build Master (Inactive) [ 13/Apr/11 ]

Integrated in lustre-master » server,el5-i686 #24
LU-25: Use LDLM_ASYNC with ldlm_cancel_lru to avoid blocking.

Oleg Drokin : 3dcb5fb7b79949d780bc82b1439622989516e34f
Files :

  • lustre/ldlm/ldlm_pool.c
Comment by Peter Jones [ 25/Apr/11 ]

Landed for 2.1

Comment by Cory Spitz [ 19/Jul/11 ]

I see that the use of LDLM_ASYNC wasn't included in 1.8.6-wc1. Since it was in LLNL's chaos 1.8.4, should it be landed on b1_8 now?