Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.1.0
Labels: None
Severity: 3
Rank: 8047
Description
The MDS on a classified production Lustre 2.1 cluster got stuck today. The symptoms were high load (800+) but very little CPU usage.
Almost all of the Lustre threads are stuck in jbd2_journal_start(), while the jbd2/sda thread is stuck in jbd2_journal_commit_transaction(). There is zero I/O going to disk.
One thread stands out as a suspect: it is not in jbd2_journal_start(), but appears to be handling an unlink. Perhaps it got stuck waiting on a semaphore while holding an open jbd2 transaction (a sketch of the suspected cycle follows the trace). Its stack trace looks like this:
COMMAND: "mdt_152"
schedule
rwsem_down_failed_common
rwsem_down_read_failed
call_rwsem_down_read_failed
llog_cat_current_log.clone.0
llog_cat_add_rec
llog_obd_origin_add
llog_add
lov_llog_origin_add
llog_add
mds_llog_origin_add
llog_add
mds_llog_add_unlink
mds_log_op_unlink
mdd_unlink_log
mdd_object_kill
mdd_finish_unlink
mdd_unlink
cml_unlink
mdt_reint_unlink
mdt_reint_rec
mdt_reint_internal
mdt_reint
mdt_handle_common
mdt_regular_handle
ptlrpc_main
kernel_thread
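For what it is worth, the suspected cycle can be mimicked in user space with plain pthreads. The sketch below is only an analogue of the hypothesis above, not Lustre or jbd2 code: fake_journal_start()/fake_journal_stop() and the commit thread are a toy stand-in for jbd2 handle accounting (t_updates/T_LOCKED), and llog_sem is a pthread rwlock standing in for the llog catalog semaphore that mdt_152 seems to be waiting on. Thread A opens a handle and then blocks on llog_sem, thread B holds llog_sem and blocks in fake_journal_start() because the commit is pending, and the commit waits for A's handle to close, so nothing is ever flushed, which matches the "zero I/O, everything stuck in jbd2_journal_start" picture.

/*
 * Hypothetical user-space analogue of the suspected cycle; none of this
 * is real Lustre or jbd2 code.  fake_journal_start()/fake_journal_stop()
 * and the commit thread mimic jbd2 handle accounting, and llog_sem is a
 * pthread rwlock standing in for the llog catalog semaphore.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t j_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  j_cond = PTHREAD_COND_INITIALIZER;
static int  open_handles  = 0;          /* analogue of t_updates */
static bool commit_locked = false;      /* analogue of T_LOCKED  */

static pthread_rwlock_t llog_sem = PTHREAD_RWLOCK_INITIALIZER;

static void fake_journal_start(const char *who)
{
    pthread_mutex_lock(&j_lock);
    while (commit_locked)               /* new handles wait for the commit */
        pthread_cond_wait(&j_cond, &j_lock);
    open_handles++;
    printf("%s: handle opened\n", who);
    pthread_mutex_unlock(&j_lock);
}

static void fake_journal_stop(const char *who)
{
    pthread_mutex_lock(&j_lock);
    open_handles--;
    printf("%s: handle closed\n", who);
    pthread_cond_broadcast(&j_cond);
    pthread_mutex_unlock(&j_lock);
}

static void *commit_thread(void *arg)
{
    (void)arg;
    sleep(1);                           /* let A and B get going first */
    pthread_mutex_lock(&j_lock);
    commit_locked = true;               /* "T_LOCKED": no new handles */
    while (open_handles > 0)            /* wait for A to close its handle */
        pthread_cond_wait(&j_cond, &j_lock);
    printf("commit: flushing to disk\n");   /* never reached */
    commit_locked = false;
    pthread_cond_broadcast(&j_cond);
    pthread_mutex_unlock(&j_lock);
    return NULL;
}

static void *thread_a(void *arg)        /* the mdt_152 analogue */
{
    (void)arg;
    fake_journal_start("A");            /* holds an open handle...      */
    sleep(2);                           /* ...while B grabs llog_sem    */
    pthread_rwlock_rdlock(&llog_sem);   /* blocks: B holds it for write */
    pthread_rwlock_unlock(&llog_sem);
    fake_journal_stop("A");             /* never reached */
    return NULL;
}

static void *thread_b(void *arg)        /* another MDT service thread */
{
    (void)arg;
    pthread_rwlock_wrlock(&llog_sem);   /* holds the llog semaphore       */
    sleep(2);                           /* commit becomes pending meanwhile */
    fake_journal_start("B");            /* blocks: commit is pending */
    fake_journal_stop("B");             /* never reached */
    pthread_rwlock_unlock(&llog_sem);
    return NULL;
}

int main(void)
{
    pthread_t a, b, c;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_create(&c, NULL, commit_thread, NULL);
    sleep(5);
    printf("deadlocked: A holds a handle, B holds llog_sem, commit is stuck\n");
    return 0;                           /* exiting main tears the stuck threads down */
}

Built with cc -pthread, it prints "A: handle opened" and then, after five seconds, reports the stuck state and exits; "commit: flushing to disk" is never printed.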
One of the Bull customers (TGCC) has hit the same deadlock twice during the past six months: one thread is stuck in jbd2_journal_commit_transaction() and many other threads are stuck in jbd2_journal_start().
They are running Lustre 2.1.6, which contains http://review.whamcloud.com/4743 from LU-1648. I attach two files containing the dmesg output and the crash backtraces of all threads.
Could you reopen this ticket, as it was closed with "Cannot Reproduce"?