Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.4.1
-
Lustre [2.4.0-18chaos|https://github.com/chaos/lustre/tree/2.4.0-18chaos] on the client. Similar on the servers, using ZFS OSDs.
-
3
-
11361
Description
We have a Lustre client dedicated to running robinhood. This node runs Lustre 2.4.0-18chaos and talks to our 55PB zfs-osd filesystem, which runs a similar tag of Lustre.
The Lustre client node that runs robinhood is hitting a "BUG: unable to handle kernel paging request at <pointer>" as a result of attempting to kfree() a bad pointer in Lustre's ldlm_lock_put() function.
There are two stacks that we have seen lead up to this, both ending in the same failed kfree.
The first is from ldlm_bl_* threads:
cfs_free
ldlm_lock_put
ldlm_cli_cancel_list
ldlm_bl_thread_main
The second was under a robinhood process:
cfs_free
ldlm_lock_put
ldlm_cli_cancel_list
ldlm_prep_elc_req
ldlm_prep_enqueue_req
mdc_intent_getattr_pack
mdc_enqueue
mdc_intent_lock
lmv_intent_lookup
lmv_intent_lock
ll_lookup_it
ll_lookup_nd
do_lookup
__link_path_walk
path_walk
do_path_lookup
user_path_at
vfs_fstatat
vfs_lstat
sys_newlstat
system_call_fastpath
Note that in this stack the crash occurred under the second call to ldlm_cli_cancel_list() in ldlm_prep_elc_req().
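Both stacks fault in a free reached from ldlm_lock_put(), which is the reference-dropping path for a lock. As a rough illustration of why a bad pointer only shows up at that point, here is a minimal userspace sketch of a reference-counted "put" that frees an attached buffer on the final reference. Every name in it (demo_lock, demo_lock_new, demo_lock_get, demo_lock_put, demo_frees) is hypothetical, it is single-threaded, and it is not the Lustre code and does not identify the root cause of this bug.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lock object; NOT the real struct ldlm_lock. */
struct demo_lock {
    int    refcount;     /* number of current holders */
    void  *payload;      /* buffer owned by the lock, released on the final put */
    size_t payload_len;  /* size of payload in bytes */
};

/* Counts payload frees so the behaviour can be observed from outside. */
int demo_frees;

/* Allocate a lock holding one reference and an owned payload buffer. */
struct demo_lock *demo_lock_new(size_t payload_len)
{
    struct demo_lock *lock = calloc(1, sizeof(*lock));

    if (lock == NULL)
        return NULL;
    lock->payload = malloc(payload_len);
    if (lock->payload == NULL) {
        free(lock);
        return NULL;
    }
    lock->payload_len = payload_len;
    lock->refcount = 1;
    return lock;
}

/* Take an additional reference. */
void demo_lock_get(struct demo_lock *lock)
{
    assert(lock->refcount > 0);
    lock->refcount++;
}

/*
 * Drop one reference. Returns 1 if this call released the object, else 0.
 * Nothing validates lock->payload before the final free: if that field had
 * been overwritten or left stale earlier, the fault would surface here, in
 * whichever caller happened to drop the last reference.
 */
int demo_lock_put(struct demo_lock *lock)
{
    assert(lock->refcount > 0);
    if (--lock->refcount > 0)
        return 0;

    free(lock->payload);
    demo_frees++;
    free(lock);
    return 1;
}
```

In a scheme like this, the caller that drops the last reference is not necessarily the one that damaged the pointer, which would be consistent with seeing the same failed free from two unrelated call paths.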
The problem is readily reproducible on our system: the client node usually crashes in less than an hour. However, I am not aware of how to reproduce it elsewhere. We have other server clusters with robinhood instances on clients that are not crashing.
Patches http://review.whamcloud.com/8270 and http://review.whamcloud.com/8298 were cherry-picked to the Lustre b2_5 branch.