Lustre / LU-4194

kfree fails kernel paging request under ldlm_lock_put()


    Description

      We have a lustre client dedicated to the task of running robinhood. This node is running lustre 2.4.0-18chaos and talking to our 55PB zfs-osd filesystem running a similar tag of lustre.

      The lustre client node that runs robinhood is hitting a "BUG: unable to handle kernel paging request at <pointer>" as a result of attempting to kfree() a bad pointer in lustre's ldlm_lock_put() function.

      There are two stacks that we have seen lead up to this, both ending in the same failed kfree.

      The first is from ldlm_bl_* threads:

      cfs_free
      ldlm_lock_put
      ldlm_cli_cancel_list
      ldlm_bl_thread_main
      

      The second was under a robinhood process:

      cfs_free
      ldlm_lock_put
      ldlm_cli_cancel_list
      ldlm_prep_elc_req
      ldlm_prep_enqueue_req
      mdc_intent_getattr_pack
      mdc_enqueue
      mdc_intent_lock
      lmv_intent_lookup
      lmv_intent_lock
      ll_lookup_it
      ll_lookup_nd
      do_lookup
      __link_path_walk
      path_walk
      do_path_lookup
      user_path_at
      vfs_fstatat
      vfs_lstat
      sys_newlstat
      system_call_fastpath
      

      Note that it was the second call to ldlm_cli_cancel_list() in ldlm_prep_elc_req() that we were under.

      The problem is pretty reproducible on our system. The client node usually crashes in less than an hour. But I am not aware of how to reproduce this elsewhere. We have other server clusters with robinhood instances on clients that are not crashing.
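
      To make the failure mode concrete, here is a minimal kernel-module sketch of this bug class (illustrative only, not code from the Lustre tree): passing a vmalloc()ed buffer to kfree() makes the slab code treat the address as if it came from the direct map, and the bogus struct page lookup faults with exactly this kind of "unable to handle kernel paging request" oops.

      #include <linux/module.h>
      #include <linux/mm.h>
      #include <linux/slab.h>
      #include <linux/vmalloc.h>

      static int __init mismatch_init(void)
      {
              /* 24576 bytes matches the size of the buffer in this report */
              void *buf = vmalloc(24576);

              if (!buf)
                      return -ENOMEM;

              pr_info("buf=%p is_vmalloc_addr=%d\n", buf, is_vmalloc_addr(buf));

              /* BUG: this buffer must be released with vfree(buf). Calling
               * kfree() on a vmalloc address typically oopses inside the
               * page lookup that kfree() performs internally. */
              kfree(buf);
              return 0;
      }
      module_init(mismatch_init);

      MODULE_LICENSE("GPL");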


        Activity

          yujian Jian Yu added a comment -

          Patches http://review.whamcloud.com/8270 and http://review.whamcloud.com/8298 were cherry-picked to Lustre b2_5 branch.


          simmonsja James A Simmons added a comment -

          It looks like we just hit this bug with one of our 2.4 clients.
          pjones Peter Jones added a comment - - edited

          Landed for 2.6


          bfaccini Bruno Faccini (Inactive) added a comment -

          Great job!! BTW, the fact that it was an address from the VMALLOC range becomes obvious with the kmem sub-command translation.

          morrone Christopher Morrone (Inactive) added a comment -

          Robinhood has been running for two hours on the client with patch http://review.whamcloud.com/8298 without a problem. It looks like that was our problem.

          I will be away at SC'13 all of next week. If you could take over working on a finalized version of http://review.whamcloud.com/8298 while I am gone, that would be great.

          morrone Christopher Morrone (Inactive) added a comment -

          Brian Behlendorf provided me with a key insight:

          #define VMALLOC_START    _AC(0xffffc90000000000, UL)
          #define VMALLOC_END      _AC(0xffffe8ffffffffff, UL)
          

          So 0xffffc90046218000, the address that we are attempting to kfree, is in the vmalloc address range.

          And sure enough, it looks like ldlm_lock_put() is using OBD_FREE(), which is kfree-only. However, there are a couple of functions, at least mdc_finish_enqueue() and ll_layout_fetch(), that are allocating buffers using OBD_ALLOC_LARGE() and pointing to them from the ldlm_lock's l_lvb_data pointer.

          I am testing the following patch in which I attempt to use the alloc and free functions a bit more consistently.

          http://review.whamcloud.com/8298

          I am not entirely sure that we want to use OBD_ALLOC_LARGE all the time in ldlm_lock_create(). I don't know how often the buffers will be large enough to switch to vmalloc. The fix could certainly be implemented other ways. But this should suffice for testing.

          This code really needs some cleanup work.
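
          For reference, here is a sketch of the *_LARGE convention in play; the 4-page cutoff and the GFP flag are assumptions for illustration, not necessarily the exact OBD_ALLOC_LARGE definition:

          #include <linux/mm.h>
          #include <linux/slab.h>
          #include <linux/vmalloc.h>

          /* Assumed threshold: fall back to vmalloc() for allocations
           * large enough to make kmalloc() unreliable. */
          #define ALLOC_LARGE_CUTOFF      (4 * PAGE_SIZE)

          static inline void *alloc_large(size_t size)
          {
                  if (size > ALLOC_LARGE_CUTOFF)
                          return vmalloc(size);
                  return kmalloc(size, GFP_NOFS);
          }

          static inline void free_large(void *ptr)
          {
                  /* is_vmalloc_addr() tests the [VMALLOC_START, VMALLOC_END)
                   * range, so the free side never guesses wrong. */
                  if (is_vmalloc_addr(ptr))
                          vfree(ptr);
                  else
                          kfree(ptr);
          }

          The crash here is the asymmetric case: l_lvb_data gets allocated through the large variant (so possibly vmalloc) but released through the plain kfree() path in ldlm_lock_put(). Note that newer kernels provide kvfree() for exactly this address-based dispatch.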


          morrone Christopher Morrone (Inactive) added a comment - - edited

          Again, can you tell me what the "kmem 0xffffc90046218000" sub-command returns? Does it return a page-table entry address of 0xffffeae380f57540?

          No, it does not. But I don't understand what I am seeing yet. I'll transcribe it here and maybe you can help me:

          crash> kmem 0xffffc90046218000
          VM_STRUCT                  ADDRESS  RANGE                          SIZE
          ffff885e3df9ed80      ffffc90046218000 - ffffc9004621e000          24576
          
              PAGE              PHYSICAL       MAPPING          INDEX CNT  FLAGS
          ffffea0149d59378    5e3d059000             0              0   1  c0000000000000
          

          bfaccini Bruno Faccini (Inactive) added a comment -

          And the patch that attempts to set l_lvb_type coherently is at http://review.whamcloud.com/#/c/8270/.
          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Hello Chris, thanks again for this low-level VM debugging!

          I am not 100% sure, but according to your last inputs I think there is something contradictory here: how can the crash tool read the poisoned memory (meaning it was able to do the virt2phys translation), when the page-table entry address resolution (to find the corresponding/owning kmem-cache), which uses the same virt2phys translation to compute the page-frame number and the offset in the page table, fails?

          Again, can you tell me what the "kmem 0xffffc90046218000" sub-command returns? Does it return a page-table entry address of 0xffffeae380f57540?


          morrone Christopher Morrone (Inactive) added a comment -

          The backtrace below kfree(), where we hit the page fault, looks a little like this:

          constant_test_bit
          test_bit
          virt_to_page
          virt_to_head_page
          virt_to_cache
          kfree
          

          At least crash claims that we are on the pointer dereference under constant_test_bit(), and looking forward and backwards a bit from kfree+0x173 seems to place us under virt_to_page().

          I started to try to calculate virt_to_page by hand, but that way lies madness. I am just going to assume that:

          virt_to_page(0xffffc90046218000) = 0xffffeae380f57540

          All that doesn't really change anything. But I feel somewhat enlightened.
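
          That assumption actually checks out arithmetically. A small userspace sketch of the sparsemem-vmemmap virt_to_page() calculation (the PAGE_OFFSET and vmemmap base values, and sizeof(struct page) == 56, are assumptions about this x86_64 kernel generation) reproduces the address exactly:

          #include <stdint.h>
          #include <stdio.h>

          int main(void)
          {
                  const uint64_t PAGE_OFFSET = 0xffff880000000000ULL; /* direct-map base */
                  const uint64_t VMEMMAP     = 0xffffea0000000000ULL; /* struct page array */
                  const uint64_t v           = 0xffffc90046218000ULL; /* the bad pointer */

                  /* virt_to_page(v) = vmemmap + ((v - PAGE_OFFSET) >> PAGE_SHIFT).
                   * The subtraction only makes sense for direct-map addresses;
                   * for a vmalloc address it yields a garbage pfn, and the
                   * resulting struct page pointer can land in an unpopulated
                   * part of the vmemmap region, faulting when kfree() reads
                   * page->flags (the constant_test_bit() above). */
                  uint64_t pfn  = (v - PAGE_OFFSET) >> 12;
                  uint64_t page = VMEMMAP + pfn * 56;    /* sizeof(struct page) */

                  printf("virt_to_page(%#llx) = %#llx\n",
                         (unsigned long long)v, (unsigned long long)page);
                  /* Prints virt_to_page(0xffffc90046218000) = 0xffffeae380f57540,
                   * the same address asked about above in this thread. */
                  return 0;
          }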


          People

            bfaccini Bruno Faccini (Inactive)
            morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 8