Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.1.0
Labels: None
Severity: 3
Rank: 8047
Description
The MDS on a classified production Lustre 2.1 cluster got stuck today. The symptoms were high load (800+) but very little CPU usage.
Almost all of the Lustre threads are stuck in jbd2_journal_start(), while the jbd2/sda thread is stuck in jbd2_journal_commit_transaction(). There is zero I/O going to disk.
One thread stands out as a suspect: it is not in jbd2_journal_start(), but appears to be handling an unlink. Perhaps it got stuck waiting on a semaphore while holding an open jbd2 transaction (a sketch of the suspected cycle follows the trace). Its stack trace looks like this:
COMMAND: "mdt_152"
schedule
rwsem_down_failed_common
rwsem_down_read_failed
call_rwsem_down_read_failed
llog_cat_current_log.clone.0
llog_cat_add_rec
llog_obd_origin_add
llog_add
lov_llog_origin_add
llog_add
mds_llog_origin_add
llog_add
mds_llog_add_unlink
mds_log_op_unlink
mdd_unlink_log
mdd_object_kill
mdd_finish_unlink
mdd_unlink
cml_unlink
mdt_reint_unlink
mdt_reint_rec
mdt_reint_internal
mdt_reint
mdt_handle_common
mdt_regular_handle
ptlrpc_main
kernel_thread
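For what it is worth, the suspected cycle can be mimicked in user space with plain pthreads. The sketch below is only an analogue of the hypothesis above, not Lustre or jbd2 code: fake_journal_start()/fake_journal_stop() and the commit thread are a toy stand-in for jbd2 handle accounting (t_updates/T_LOCKED), and llog_sem is a pthread rwlock standing in for the llog catalog semaphore that mdt_152 seems to be waiting on. Thread A opens a handle and then blocks on llog_sem, thread B holds llog_sem and blocks in fake_journal_start() because the commit is pending, and the commit waits for A's handle to close, so nothing is ever flushed, which matches the "zero I/O, everything stuck in jbd2_journal_start" picture.

/*
 * Hypothetical user-space analogue of the suspected cycle; none of this
 * is real Lustre or jbd2 code.  fake_journal_start()/fake_journal_stop()
 * and the commit thread mimic jbd2 handle accounting, and llog_sem is a
 * pthread rwlock standing in for the llog catalog semaphore.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t j_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  j_cond = PTHREAD_COND_INITIALIZER;
static int  open_handles  = 0;          /* analogue of t_updates */
static bool commit_locked = false;      /* analogue of T_LOCKED  */

static pthread_rwlock_t llog_sem = PTHREAD_RWLOCK_INITIALIZER;

static void fake_journal_start(const char *who)
{
    pthread_mutex_lock(&j_lock);
    while (commit_locked)               /* new handles wait for the commit */
        pthread_cond_wait(&j_cond, &j_lock);
    open_handles++;
    printf("%s: handle opened\n", who);
    pthread_mutex_unlock(&j_lock);
}

static void fake_journal_stop(const char *who)
{
    pthread_mutex_lock(&j_lock);
    open_handles--;
    printf("%s: handle closed\n", who);
    pthread_cond_broadcast(&j_cond);
    pthread_mutex_unlock(&j_lock);
}

static void *commit_thread(void *arg)
{
    (void)arg;
    sleep(1);                           /* let A and B get going first */
    pthread_mutex_lock(&j_lock);
    commit_locked = true;               /* "T_LOCKED": no new handles */
    while (open_handles > 0)            /* wait for A to close its handle */
        pthread_cond_wait(&j_cond, &j_lock);
    printf("commit: flushing to disk\n");   /* never reached */
    commit_locked = false;
    pthread_cond_broadcast(&j_cond);
    pthread_mutex_unlock(&j_lock);
    return NULL;
}

static void *thread_a(void *arg)        /* the mdt_152 analogue */
{
    (void)arg;
    fake_journal_start("A");            /* holds an open handle...      */
    sleep(2);                           /* ...while B grabs llog_sem    */
    pthread_rwlock_rdlock(&llog_sem);   /* blocks: B holds it for write */
    pthread_rwlock_unlock(&llog_sem);
    fake_journal_stop("A");             /* never reached */
    return NULL;
}

static void *thread_b(void *arg)        /* another MDT service thread */
{
    (void)arg;
    pthread_rwlock_wrlock(&llog_sem);   /* holds the llog semaphore       */
    sleep(2);                           /* commit becomes pending meanwhile */
    fake_journal_start("B");            /* blocks: commit is pending */
    fake_journal_stop("B");             /* never reached */
    pthread_rwlock_unlock(&llog_sem);
    return NULL;
}

int main(void)
{
    pthread_t a, b, c;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_create(&c, NULL, commit_thread, NULL);
    sleep(5);
    printf("deadlocked: A holds a handle, B holds llog_sem, commit is stuck\n");
    return 0;                           /* exiting main tears the stuck threads down */
}

Built with cc -pthread, it prints "A: handle opened" and then, after five seconds, reports the stuck state and exits; "commit: flushing to disk" is never printed.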
One of the Bull customers (TGCC) has hit the same deadlock twice during the past six months: one thread is stuck in jbd2_journal_commit_transaction() and many other threads are stuck in jbd2_journal_start().
They are running Lustre 2.1.6, which contains http://review.whamcloud.com/4743 from LU-1648. I attach two files containing the dmesg output and the crash backtraces of all threads.
Could you reopen this ticket, as it was closed with "Cannot Reproduce"?