
[LU-4194] kfree fails kernel paging request under ldlm_lock_put()

Details


    Description

      We have a lustre client dedicated to the task of running robinhood. This node is running lustre 2.4.0-18chaos and talking to our 55PB zfs-osd filesystem running a similar tag of lustre.

      The lustre client node that runs robinhood is hitting a "BUG: unable to handle kernel paging request at <pointer>" as a result of attempting to kfree() a bad pointer in lustre's ldlm_lock_put() function.

      There are two stacks that we have seen lead up to this, both ending in the same failed kfree.

      The first is from ldlm_bl_* threads:

      cfs_free
      ldlm_lock_put
      ldlm_cli_cancel_list
      ldlm_bl_thread_main
      

      The second was under a robinhood process:

      cfs_free
      ldlm_lock_put
      ldlm_cli_cancel_list
      ldlm_prep_elc_req
      ldlm_prep_enqueue_req
      mdc_intent_getattr_pack
      mdc_enqueue
      mdc_intent_lock
      lmv_intent_lookup
      lmv_intent_lock
      ll_lookup_it
      ll_lookup_nd
      do_lookup
      __link_path_walk
      path_walk
      do_path_lookup
      user_path_at
      vfs_fstatat
      vfs_lstat
      sys_newlstat
      system_call_fastpath
      

      Note that it was the second call to ldlm_cli_cancel_list() in ldlm_prep_elc_req() that we were under.

      The problem is pretty reproducible on our system. The client node usually crashes in less than an hour. But I am not aware of how to reproduce this elsewhere. We have other server clusters with robinhood instances on clients that are not crashing.


        Activity


          It looks like we just hit this bug with one of our 2.4 clients.

          simmonsja James A Simmons added a comment -

          Landed for 2.6

          pjones Peter Jones added a comment - - edited

          Great job!! BTW, the fact it was an address from the VMALLOC range becomes obvious with the kmem sub-command translation.

          bfaccini Bruno Faccini (Inactive) added a comment -

          Robinhood has been running for two hours on the client with patch http://review.whamcloud.com/8298 without a problem. That looks like our problem.

          I will be away at SC'13 all of next week. If you could take over working on a finalized version of http://review.whamcloud.com/8298 while I am gone, that would be great.

          morrone Christopher Morrone (Inactive) added a comment -

          Brian Behlendorf provided me with a key insight:

          #define VMALLOC_START    _AC(0xffffc90000000000, UL)
          #define VMALLOC_END      _AC(0xffffe8ffffffffff, UL)
          

          So 0xffffc90046218000, the address that we are attempting to kfree, is in the vmalloc address range.

          And sure enough, it looks like ldlm_lock_put() is using OBD_FREE(), which is kfree-only. However, there are a couple of functions, at least mdc_finish_enqueue() and ll_layout_fetch(), that are allocating buffers using OBD_ALLOC_LARGE() and pointing to them from the ldlm_lock's l_lvb_data pointer.
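
          To make the mismatch concrete, here is a minimal sketch using plain kernel primitives (not the actual Lustre macros; lvb_len and LARGE_ALLOC_THRESHOLD are hypothetical names, and the vmalloc() fallback is simply the behaviour described above):

          void *lvb_data;                              /* stands in for lock->l_lvb_data  */
          size_t lvb_len;                              /* e.g. a large LVB/layout buffer  */

          /* Allocation side, as OBD_ALLOC_LARGE() is described to behave: */
          if (lvb_len > LARGE_ALLOC_THRESHOLD)
                  lvb_data = vmalloc(lvb_len);         /* lands in VMALLOC_START..VMALLOC_END */
          else
                  lvb_data = kmalloc(lvb_len, GFP_NOFS);

          /* A free that mirrors the allocation: */
          if (is_vmalloc_addr(lvb_data))
                  vfree(lvb_data);
          else
                  kfree(lvb_data);

          /* What ldlm_lock_put()'s OBD_FREE() effectively does instead: */
          kfree(lvb_data);                             /* faults when lvb_data came from vmalloc() */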

          I am testing the following patch in which I attempt to use the alloc and free functions a bit more consistently.

          http://review.whamcloud.com/8298

          I am not entirely sure that we want to use OBD_ALLOC_LARGE all the time in ldlm_lock_create(). I don't know how often the buffers will be large enough to switch to vmalloc. The fix could certainly be implemented other ways. But this should suffice for testing.

          This code really needs some cleanup work.

          morrone Christopher Morrone (Inactive) added a comment -

          Again, can you tell me what the "kmem 0xffffc90046218000" sub-command returns? Does it return a page-table entry address of 0xffffeae380f57540?

          No, it does not. But I don't understand what I am seeing yet. I'll transcribe it here and maybe you can help me:

          crash> kmem 0xffffc90046218000
          VM_STRUCT                  ADDRESS  RANGE                          SIZE
          ffff885e3df9ed80      ffffc90046218000 - ffffc9004621e000         24576
          
              PAGE              PHYSICAL       MAPPING          INDEX CNT  FLAGS
          ffffea0149d59378    5e3d059000             0              0   1  c0000000000000
          
          morrone Christopher Morrone (Inactive) added a comment - - edited

          And the patch attempting to set l_lvb_type coherently is at http://review.whamcloud.com/#/c/8270/.

          bfaccini Bruno Faccini (Inactive) added a comment -

          Hello Chris, thanks again for this low-level VM debugging !

          I am not 100% sure, but according to your last inputs I think there is something contradictory here: how can the crash tool read the poisoned memory, meaning it was able to do the virt2phys translation, when the corresponding page-table entry address resolution (to find the corresponding/owning kmem-cache), which uses the same virt2phys translation to compute the page-frame number and find the offset in the page table, fails?

          Again, can you tell me what the "kmem 0xffffc90046218000" sub-command returns? Does it return a page-table entry address of 0xffffeae380f57540?

          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          The backtrace below kfree(), where we hit the page fault failure, looks roughly like this:

          constant_test_bit
          test_bit
          virt_to_page
          virt_to_head_page
          virt_to_cache
          kfree
          

          At least crash claims that we are on the pointer dereference under constant_test_bit(), and looking forward and backwards a bit from kfree+0x173 seems to place us under virt_to_page().
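
          For context, the SLAB kfree() path those frames belong to looks roughly like this (an approximate sketch of the 2.6.32-era mm/slab.c, not verbatim kernel source):

          void kfree(const void *objp)
          {
                  struct kmem_cache *c;

                  if (unlikely(ZERO_OR_NULL_PTR(objp)))
                          return;

                  /* virt_to_cache() -> virt_to_head_page() -> virt_to_page():
                   * for a linear-map (kmalloc) address this is pure arithmetic,
                   *     page = vmemmap + ((addr - PAGE_OFFSET) >> PAGE_SHIFT)
                   * but for a vmalloc() address the same arithmetic produces a
                   * wild struct page pointer.  The first flags test on that
                   * struct page (test_bit()/constant_test_bit()) dereferences
                   * it, and that is where the paging request fails. */
                  c = virt_to_cache(objp);

                  __cache_free(c, (void *)objp);
          }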

          I started to try to calculate virt_to_page by hand, but that way lies madness. I am just going to assume that:

          virt_to_page(0xffffc90046218000) = 0xffffeae380f57540
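
          As a sanity check on that assumption: plugging in the usual x86_64 constants for this kernel (PAGE_OFFSET = 0xffff880000000000, vmemmap = 0xffffea0000000000, sizeof(struct page) = 0x38; assumed values, not read from the dump), the arithmetic kfree() would do for a supposedly linear-map address lands exactly on the faulting address:

              bogus "physical" address = 0xffffc90046218000 - 0xffff880000000000 = 0x410046218000
              pfn                      = 0x410046218000 >> 12                    = 0x410046218
              struct page pointer      = 0xffffea0000000000 + 0x410046218 * 0x38 = 0xffffeae380f57540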

          All that doesn't really change anything. But I feel somewhat enlightened.

          morrone Christopher Morrone (Inactive) added a comment -

          I'm glad that you have a theory, because I'm lost.

          In one of the recent tests, the kernel complains about a page fault at

          ffffeae380f57540

          The lustre log extracted from that crash dump tells me, from its last message, that the thread that hit the page fault Oops was doing the final ldlm_lock_put() on the ldlm_lock at this address:

          ffff885e3d09ce00

          The lock->l_lvb_data pointer is:

          ffffc90046218000

          When I use crash's "rd" command to dump the contents of the l_lvb_data buffer, it is indeed poisoned with OBD_FREE's repeated bytes of 0x5a. The pointer to the buffer has not, on the other hand, been poisoned yet.

          Which all probably just confirms that we are in kfree() under cfs_free() under ldlm_lock_put() at the time of the Oops.

          But where does the page fault address of ffffeae380f57540 come from? I find it in both the RAX and CR2 registers under cfs_free in crash's backtrace output. RAX is a general-purpose register, and CR2 is the Page Fault Linear Address register, which makes sense because we Oops under the failed page fault.

          I just wish that I understood where kfree() was getting that particular address.

          morrone Christopher Morrone (Inactive) added a comment -

          Thanks for the answers/confirmations.

          My earlier comment that "the fact that there is LVB data with a type of LVB_T_NONE comes from layout_fetch()" is wrong and must be inverted: it can NOT come from layout_fetch()! I now think it more likely comes from an unexpected/unrequested Layout being embedded in the LVB reply from the server, and this can happen at least in ldlm_handle_cp_callback().

          I am currently testing a fix for that.

          bfaccini Bruno Faccini (Inactive) added a comment -

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 8
