Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.4.1
-
Lustre [2.4.0-18chaos|https://github.com/chaos/lustre/tree/2.4.0-18chaos] on the client. Similar on the servers, using ZFS OSDs.
-
3
-
11361
Description
We have a Lustre client dedicated to running robinhood. This node runs Lustre 2.4.0-18chaos and talks to our 55PB zfs-osd filesystem, whose servers run a similar Lustre tag.
The client node that runs robinhood is hitting a "BUG: unable to handle kernel paging request at <pointer>" as a result of attempting to kfree() a bad pointer in Lustre's ldlm_lock_put() function.
There are two stacks that we have seen lead up to this, both ending in the same failed kfree.
The first is from ldlm_bl_* threads:
cfs_free
ldlm_lock_put
ldlm_cli_cancel_list
ldlm_bl_thread_main
The second was under a robinhood process:
cfs_free
ldlm_lock_put
ldlm_cli_cancel_list
ldlm_prep_elc_req
ldlm_prep_enqueue_req
mdc_intent_getattr_pack
mdc_enqueue
mdc_intent_lock
lmv_intent_lookup
lmv_intent_lock
ll_lookup_it
ll_lookup_nd
do_lookup
__link_path_walk
path_walk
do_path_lookup
user_path_at
vfs_fstatat
vfs_lstat
sys_newlstat
system_call_fastpath
Note that the crash occurred under the second call to ldlm_cli_cancel_list() in ldlm_prep_elc_req().
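For readers less familiar with the pattern both stacks end in, here is a minimal userspace sketch of a reference-counted "put" that frees an owned buffer when the last reference is dropped. This is not Lustre's code: struct demo_lock, demo_lock_new(), demo_lock_get(), demo_lock_put() and demo_frees are hypothetical names invented for illustration, and free() stands in for kfree(). It only shows why a pointer corrupted earlier would tend to fault at the final put, regardless of which caller happens to drop the last reference.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a lock object; NOT Lustre's struct ldlm_lock. */
struct demo_lock {
    int   refcount;   /* number of current holders */
    void *payload;    /* heap buffer owned by the lock */
};

static int demo_frees; /* counts objects fully freed, for demonstration only */

/* Allocate a lock with one reference and a zeroed payload buffer. */
static struct demo_lock *demo_lock_new(size_t payload_len)
{
    struct demo_lock *lk = calloc(1, sizeof(*lk));

    if (lk == NULL)
        return NULL;
    lk->payload = malloc(payload_len);
    if (lk->payload == NULL) {
        free(lk);
        return NULL;
    }
    memset(lk->payload, 0, payload_len);
    lk->refcount = 1;
    return lk;
}

/* Take an additional reference. */
static void demo_lock_get(struct demo_lock *lk)
{
    lk->refcount++;
}

/* Drop one reference; returns 1 if this call freed the object, else 0. */
static int demo_lock_put(struct demo_lock *lk)
{
    assert(lk->refcount > 0);
    if (--lk->refcount > 0)
        return 0;
    /*
     * Last reference. If lk->payload had been overwritten with garbage
     * earlier, this free() is where the fault would surface, far from
     * the code that actually corrupted the pointer.
     */
    free(lk->payload);
    free(lk);
    demo_frees++;
    return 1;
}
```

In this sketch, any code path that drops the final reference runs the same free, which is consistent with seeing one failure site reached from two different callers.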
The problem is quite reproducible on our system: the client node usually crashes in less than an hour. However, I do not know how to reproduce it elsewhere. We have other server clusters with robinhood instances on clients that are not crashing.
It looks like we just hit this bug with one of our 2.4 clients.