Details
Bug
Resolution: Duplicate
Critical
None
Lustre 2.1.1
RHEL 6.2
Our branch: https://github.com/chaos/lustre/commits/2.1.0-llnl
3
6467
Description
We have multiple Lustre 2.1 OSS nodes crashing repeatedly during recovery. This is on our classified Lustre cluster which was updated from 1.8 to 2.1 on Tuesday. The summary is one observed symptom. We have also seen these assertions appearing together.
ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed
ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed
We don't have backtraces for the assertions because STONITH kicked in before the crash dump completed.
Other OSS nodes are crashing in kernel string handling functions with stacks like
machine_kexec
crash_kexec
oops_end
die
do_general_protection
general_protection
(exception RIP: strlen+9)
strlen
string
vsnprintf
libcfs_debug_vmsg2
_debug_req
target_send_reply_msg
target_send_reply
ost_handle
ptlrpc_main
So it appears we are passing a bad value in a debug message.
Another stack trace:
BUG: unable to handle kernel NULL ptr dereference at 000...38
IP: [<fffffffa0a8706>] filter_export_stats_init+0x1f1/0x500 [obdfilter]
machine_kexec
crash_kexec
oops_end
no_context
__bad_area_nosemaphore
bad_area_nosemaphore
__do_page_fault
do_page_fault
page_fault
filter_reconnect
target_handle_connect
ost_handle
ptlrpc_main
We have multiple symptoms here that may or may not be due to the same bug. We may need to open a separate issue to track the root cause. Note that our branch contains LU-874 patches that touched the ptlrpc queue management code, so we should be on the lookout for any races introduced there. Also note that we can't send debug data from this system.
Ignore the comment about being in recovery. So far I don't think the logs show that.
I've looked at a few nodes, and it looks like there is some kind of client timeout/eviction and then reconnection storm going on before the assertions.
Generally there are tens of thousands of "haven't heard from client X in 226 seconds. I think it's dead, and I am evicting it" messages. A couple of minutes later, clients begin reconnecting in droves. There is a mix of OST "connection from" and OST "Not available for connect" messages.
The "haven't heard from client" and the client connect messages are interleaved in the logs, and each is often repeated 30,000+ times (Lustre squashes the repeats into "previous similar messages" lines).
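Because the repeats are squashed, counting raw log lines understates how large the storm is. A rough Python sketch of a tally that adds the squashed counts back in might look like the following. The category substrings come from the messages quoted above; the `Skipped N previous similar message` wording, the sample log lines, and the rule that a skipped count belongs to the most recent matching message are assumptions that should be checked against the real console logs.

```python
import re
from collections import Counter

# Substrings taken from the messages quoted in this report.
CATEGORIES = {
    "evicted": "haven't heard from client",
    "connected": "connection from",
    "refused": "Not available for connect",
}

# Assumed wording of the rate-limit summary line; adjust to the real logs.
SKIPPED = re.compile(r"Skipped (\d+) previous similar message")


def tally(lines):
    """Count messages per category, including squashed repeats.

    Assumption: a "Skipped N ..." line refers to the most recent
    categorized message directly preceding it.
    """
    counts = Counter()
    last = None
    for line in lines:
        skipped = SKIPPED.search(line)
        if skipped:
            if last is not None:
                counts[last] += int(skipped.group(1))
            continue
        for name, needle in CATEGORIES.items():
            if needle in line:
                counts[name] += 1
                last = name
                break
        else:
            # Unrelated line: do not attribute later skips to an old category.
            last = None
    return counts


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = [
        "Lustre: OST0001: haven't heard from client X in 226 seconds.",
        "Lustre: Skipped 30000 previous similar messages",
        "Lustre: OST0001: connection from X",
        "LustreError: OST0001: Not available for connect from X",
    ]
    for name, count in sorted(tally(sample).items()):
        print(name, count)
```

Running the same tally per OSS node and per time window would make it easier to see whether the eviction burst always precedes the reconnect burst by a couple of minutes.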
And then we hit one of the two assertions listed at the beginning of this bug.
Note that for several of the OSS nodes I have looked at so far, the clients all seem to be from one particular cluster, which is running 1.8. (The servers are all running 2.1.0-24chaos.)