Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- None
- Affects Version: Lustre 2.1.0
- None
- 3
- 8047
Description
The MDS on a classified production Lustre 2.1 cluster got stuck today. The symptoms were high load (800+) but very little CPU usage.
Almost all of the Lustre threads were stuck in jbd2_journal_start, while the jbd2/sda thread was stuck in
jbd2_journal_commit_transaction. There was zero I/O going to disk.
One thread stands out as a suspect: it is not in jbd2_journal_start but appears to be handling an unlink. Perhaps it got stuck waiting on the read-write semaphore in llog_cat_current_log while holding an open jbd2 transaction. Its stack trace looks like this:
COMMAND: "mdt_152"
schedule
rwsem_down_failed_common
rwsem_down_read_failed
call_rwsem_down_read_failed
llog_cat_current_log.clone.0
llog_cat_add_rec
llog_obd_origin_add
llog_add
lov_llog_origin_add
llog_add
mds_llog_origin_add
llog_add
mds_llog_add_unlink
mds_log_op_unlink
mdd_unlink_log
mdd_object_kill
mdd_finish_unlink
mdd_unlink
cml_unlink
mdt_reint_unlink
mdt_reint_rec
mdt_reint_internal
mdt_reint
mdt_handle_common
mdt_regular_handle
ptlrpc_main
kernel_thread
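To make the suspected cycle concrete, below is a minimal userspace sketch. It is not Lustre code: pthreads stand in for jbd2 and the llog semaphore, and the names unlink_thread, commit_thread, and transaction_open are invented for illustration. One thread opens a "transaction" and then blocks on a read lock that is never granted; the commit side cannot retire the transaction until the handle closes, so it waits forever, and every later "jbd2_journal_start" caller would pile up behind it.

/*
 * Userspace sketch of the suspected wait cycle (illustrative only).
 * Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t journal_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  journal_cv   = PTHREAD_COND_INITIALIZER;
static int transaction_open;                 /* stands in for an open jbd2 handle */
static pthread_rwlock_t llog_sem = PTHREAD_RWLOCK_INITIALIZER;

/* The mdt_152 analogue: opens a transaction, then blocks on the rwsem. */
static void *unlink_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&journal_lock);
    transaction_open = 1;                    /* "jbd2_journal_start" succeeded */
    pthread_mutex_unlock(&journal_lock);

    pthread_rwlock_rdlock(&llog_sem);        /* blocks: a writer never releases it */
    /* ... llog record would be added here ... */
    pthread_rwlock_unlock(&llog_sem);

    pthread_mutex_lock(&journal_lock);
    transaction_open = 0;                    /* "jbd2_journal_stop" */
    pthread_cond_signal(&journal_cv);
    pthread_mutex_unlock(&journal_lock);
    return NULL;
}

/* The jbd2 commit analogue: cannot finish while any handle stays open. */
static void *commit_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&journal_lock);
    while (transaction_open)                 /* commit waiting for handles to drop */
        pthread_cond_wait(&journal_cv, &journal_lock);
    pthread_mutex_unlock(&journal_lock);
    puts("commit finished");                 /* never reached in this sketch */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    /* Some other party holds the semaphore for writing and never drops it,
     * so the unlink thread never closes its handle and the commit thread
     * never finishes; new transaction starters would queue up behind them. */
    pthread_rwlock_wrlock(&llog_sem);

    pthread_create(&t1, NULL, unlink_thread, NULL);
    sleep(1);
    pthread_create(&t2, NULL, commit_thread, NULL);

    pthread_join(t2, NULL);                  /* never returns: deadlock */
    return 0;
}

If this is what happened on the MDS, the open question is which thread was holding the llog semaphore for writing and why it never released it.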