Details
-
Bug
-
Resolution: Duplicate
-
Critical
-
None
-
Lustre 2.1.1
-
RHEL 6.2
Our branch: https://github.com/chaos/lustre/commits/2.1.0-llnl
-
3
-
6467
Description
We have multiple Lustre 2.1 OSS nodes crashing repeatedly during recovery. This is on our classified Lustre cluster, which was updated from 1.8 to 2.1 on Tuesday. The summary describes only one observed symptom. We have also seen these assertions appearing together:
ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed
ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed
We don't have backtraces for the assertions because STONITH kicked in before the crash dump completed.
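For readers unfamiliar with the second assertion, it reads as a check that an export is only torn down once its reference count has reached zero. The following is a minimal user-space sketch of that invariant, not Lustre code: the `demo_export` type and the `demo_export_*` functions are hypothetical names invented for illustration, and C11 atomics stand in for `cfs_atomic_t`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an export; NOT the Lustre obd_export layout. */
struct demo_export {
    atomic_int refcount;
    bool       destroyed;
};

static void demo_export_init(struct demo_export *exp)
{
    atomic_init(&exp->refcount, 1);   /* the creator holds the first reference */
    exp->destroyed = false;
}

static void demo_export_get(struct demo_export *exp)
{
    atomic_fetch_add(&exp->refcount, 1);
}

/* Analogue of the failing check: destroy is only legal at refcount 0. */
static void demo_export_destroy(struct demo_export *exp)
{
    assert(atomic_load(&exp->refcount) == 0);
    exp->destroyed = true;
}

/* Drop one reference. Returns true if this call dropped the last
 * reference and therefore destroyed the object. */
static bool demo_export_put(struct demo_export *exp)
{
    if (atomic_fetch_sub(&exp->refcount, 1) == 1) {
        demo_export_destroy(exp);
        return true;
    }
    return false;
}
```

In this model the assertion can only fire if some path reaches the destroy step without going through the final put, or if a reference is taken on an object that is already queued for destruction. Whether either of those is what happens here is not known.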
Other OSS nodes are crashing in kernel string handling functions with stacks like
machine_kexec
crash_kexec
oops_end
die
do_general_protection
general_protection
(exception RIP: strlen+9)
strlen
string
vsnprintf
libcfs_debug_vmsg2
_debug_req
target_send_reply_msg
target_send_reply
ost_handle
ptlrpc_main
So it appears we are passing a bad string pointer to a %s conversion in a debug message.
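As a rough illustration of this failure class, a `%s` conversion has to walk its argument to find the end of the string, so any invalid pointer faults inside the formatter rather than at the call site. The sketch below is plain user-space C, not the libcfs debug path. `demo_req`, `peer_name`, and the helper names are hypothetical. It shows two cheap guards: a NULL substitute and a bounded precision.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_NAME_MAX 16   /* hypothetical upper bound on the printed name */

/* Hypothetical request carrying a string that a debug message prints. */
struct demo_req {
    const char *peer_name;   /* may be NULL in this sketch */
};

/* Substitute a placeholder so %s never receives NULL. */
static const char *demo_safe_str(const char *s)
{
    return s ? s : "<null>";
}

/* Format a debug line; "%.*s" reads at most DEMO_NAME_MAX bytes, so an
 * unterminated name cannot make the formatter run off the buffer. */
static int demo_format_req(char *buf, size_t len, const struct demo_req *req)
{
    if (req == NULL)
        return snprintf(buf, len, "req@<null>");
    return snprintf(buf, len, "req peer=%.*s",
                    DEMO_NAME_MAX, demo_safe_str(req->peer_name));
}
```

These guards only cover NULL and unterminated strings. A dangling or uninitialized pointer, which a general protection fault in strlen would be consistent with, cannot be detected at format time and has to be fixed where the object's lifetime is managed.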
Another stack trace:
BUG: unable to handle kernel NULL ptr dereference at 000...38
IP: [<fffffffa0a8706>] filter_export_stats_init+0x1f1/0x500 [obdfilter]
machine_kexec
crash_kexec
oops_end
no_context
__bad_area_nosemaphore
bad_area_nosemaphore
__do_page_fault
do_page_fault
page_fault
filter_reconnect
target_handle_connect
ost_handle
ptlrpc_main
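One way to read a NULL pointer dereference at a small non-zero address: if the elided fault address is just an offset ending in 0x38, it would correspond to accessing a structure member at that offset through a NULL base pointer. The sketch below only demonstrates the arithmetic. `demo_stats_owner` and its `stats` member are hypothetical and do not reflect any real obdfilter structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure whose member of interest sits at offset 0x38. */
struct demo_stats_owner {
    uint64_t pad[7];   /* 7 * 8 = 56 = 0x38 bytes of leading fields */
    void    *stats;    /* hypothetical member at offset 0x38 */
};

/* Address the CPU would touch when reading a member through a base
 * pointer: base + offset. With base == 0 this is the offset itself. */
static uintptr_t demo_fault_address(uintptr_t base, size_t member_offset)
{
    return base + member_offset;
}
```

With debug symbols for the real module, the same reasoning runs in reverse: look for a structure dereferenced in filter_export_stats_init that has a member at the faulting offset, and that identifies which pointer was NULL.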
We have multiple symptoms here that may or may not be due to the same bug. We may need to open separate issues to track each root cause. Note that our branch contains LU-874 patches that touched the ptlrpc queue management code, so we should be on the lookout for any races introduced there. Also note that we can't send debug data from this system.