Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.1.0
Affects Version/s: Lustre 1.8.6
Labels: None

Severity: 3
Rank (Obsolete): 5075

Description

We have now verified at least twice that the ldlm shrinker sometimes gets stuck waiting for a network transaction to complete. We are of the opinion that this is an architectural problem in Lustre. A shrinker should NEVER block on network transactions.

In particular, the ldlm_cli_pool_shrink() shrinker is the one that has been giving us trouble in 1.8 for as long as a year. It was only a couple of months ago that it was finally brought to my attention, and the problem is intermittent enough that it has been difficult to catch as it happens. Now both Brian and I have caught the problem with crash, and we have some other SysRq console backtraces that the sysadmins have gotten us. The problem always looks like the following, at least from get_page_from_freelist() to ptlrpc_queue_wait():

ptlrpc_queue_wait
lustre_msg_set_opc
ptlrpc_at_set_req_timeout
lustre_msg_buf
ldlm_cli_cancel_req
ldlm_cli_cancel_list
ldlm_cancel_passed_policy
ldlm_cancel_lru
ldlm_cli_pool_shrink
ldlm_cli_pool_shrink
ldlm_pool_shrink
ldlm_namespace_move_locked
ldlm_pools_shrink
ldlm_pools_cli_shrink
shrink_slab
get_page_from_freelist
__alloc_pages
alloc_pages_current
tcp_sendmsg
inet_sendmsg
do_sock_write
sock_aio_write
do_sync_write
sys_getsockname
vfs_write
sys_write

It is not clear why we are not getting the reply to the cancel request in a reasonable time. And it appears that this thread can be stuck in ptlrpc_queue_wait() for at least an hour. We are also not yet sure why that should block for so long.

But hopefully we can agree that a blocking operation like that should NOT happen in the shrinker. You'll notice that the kernel entry point did not even involve Lustre here. We have seen other entry points, such as page faults, that also needed to call get_page_from_freelist() and then got stuck in Lustre's ldlm client pool shrinker.

Brian Behlendorf voiced concerns about this independently in comment #31 of Oracle bug 23598.

We were hoping that Oleg would take a look at this. Eric might also be interested, since this appears to be an architectural flaw.

Our immediate need is to fix this particular shrinker, but we should probably make it policy to never perform blocking network transactions in shrinkers.

Brian suggested that a quick work-around is to just disable this shrinker, since shrinkers are "best effort". As he suggested in comment #31, the longer term fix may be to make the shrink asynchronous, handling the blocking network transactions in another thread.
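To make the asynchronous idea concrete, here is a minimal user-space sketch using pthreads. It is not Lustre or kernel code: struct cancel_queue, async_shrink(), cancel_worker(), stop_worker() and blocking_cancel_rpc() are all hypothetical names invented for illustration, and the sleep merely stands in for the blocking cancel RPC seen in the backtrace. The point it tries to show is that the shrink path only records how much work was requested and returns, while a separate thread is the only place that waits.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical state shared between the shrink path and the worker. */
struct cancel_queue {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    int  pending;    /* cancels requested but not yet processed */
    int  cancelled;  /* cancels the worker has completed */
    bool stop;       /* set when the worker should drain and exit */
};

static struct cancel_queue cq = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .wake = PTHREAD_COND_INITIALIZER,
};

/* Stand-in for the blocking cancel RPC; here it just sleeps for 1 ms. */
static void blocking_cancel_rpc(int nr)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000L };
    (void)nr;
    nanosleep(&ts, NULL);
}

/* Worker thread: the only place allowed to block on the "network". */
static void *cancel_worker(void *arg)
{
    struct cancel_queue *q = arg;

    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->pending == 0 && !q->stop)
            pthread_cond_wait(&q->wake, &q->lock);
        if (q->pending == 0 && q->stop)
            break;                       /* drained and asked to stop */
        int nr = q->pending;
        q->pending = 0;
        pthread_mutex_unlock(&q->lock);  /* never block while holding the lock */
        blocking_cancel_rpc(nr);
        pthread_mutex_lock(&q->lock);
        q->cancelled += nr;
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Shrink path: queue the request, wake the worker, return immediately. */
static int async_shrink(struct cancel_queue *q, int nr_to_scan)
{
    pthread_mutex_lock(&q->lock);
    q->pending += nr_to_scan;
    pthread_cond_signal(&q->wake);
    pthread_mutex_unlock(&q->lock);
    return nr_to_scan;                   /* number queued, not number freed */
}

/* Ask the worker to drain and exit; returns the total cancels completed. */
static int stop_worker(struct cancel_queue *q, pthread_t tid)
{
    pthread_mutex_lock(&q->lock);
    q->stop = true;
    pthread_cond_signal(&q->wake);
    pthread_mutex_unlock(&q->lock);
    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    return q->cancelled;
}
```

A caller would start cancel_worker() once with pthread_create() and then call async_shrink() from the memory-pressure path. The trade-off is that the shrink call can only report what it queued, not what was actually freed, which fits the "best effort" nature of shrinkers noted above.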

Trackbacks

Lustre 1.8.x known issues tracker: While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA

People

Assignee: Niu Yawei (Inactive)

Reporter: Christopher Morrone (Inactive)

Votes: 0

Watchers: 14

Dates

Created: 16/Dec/10 11:29 AM

Updated: 19/Jul/11 7:43 PM

Resolved: 25/Apr/11 7:16 AM