Lustre / LU-1276

MDS threads all stuck in jbd2_journal_start

Details


    Description

      The MDS on a classified production Lustre 2.1 cluster got stuck today. The symptoms were high load (800+) but very little CPU usage.

      Almost all of the Lustre threads were stuck in jbd2_journal_start, while the jbd2/sda thread was stuck in
      jbd2_journal_commit_transaction. There was zero I/O going to disk.

      One thread stands out as a suspect: it is not in jbd2_journal_start but appears to be handling an unlink. Perhaps it got stuck waiting on a semaphore while holding an open jbd2 transaction. Its stack trace looks like this (a rough sketch of the suspected cycle follows the trace):

      COMMAND: "mdt_152"
      schedule
      rwsem_down_failed_common
      rwsem_down_read_failed
      call_rwsem_down_read_failed
      llog_cat_current_log.clone.0
      llog_cat_add_rec
      llog_obd_origin_add
      llog_add
      lov_llog_origin_add
      llog_add
      mds_llog_origin_add
      llog_add
      mds_llog_add_unlink
      mds_log_op_unlink
      mdd_unlink_log
      mdd_object_kill
      mdd_finish_unlink
      mdd_unlink
      cml_unlink
      mdt_reint_unlink
      mdt_reint_rec
      mdt_reint_internal
      mdt_reint
      mdt_handle_common
      mdt_regular_handle
      ptlrpc_main
      kernel_thread
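
      If that theory is correct, the dependency cycle would be: the unlink thread opens a jbd2 handle and then sleeps on the llog catalog rwsem, the jbd2 commit thread waits for that handle to be stopped before it can commit, and every other MDS thread parks in jbd2_journal_start waiting for the commit to finish. Below is a minimal userspace sketch of that suspected cycle using pthreads. It is purely illustrative: the journal_start/journal_stop/llog_sem names and the sleep-based ordering are made up for the sketch, not Lustre or jbd2 code, and running it simply hangs, which is the point.

      /* Hypothetical illustration of the suspected cycle, not Lustre code:
       *   - unlink_thread opens a "handle", then blocks on a rwsem
       *   - commit_thread waits for all handles to close before committing
       *   - other_mdt_thread waits for the commit before starting a handle
       */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
      static int open_handles;            /* analogue of the transaction's t_updates */
      static int commit_in_progress;      /* analogue of the journal being locked    */
      static pthread_rwlock_t llog_sem = PTHREAD_RWLOCK_INITIALIZER;  /* the rwsem   */

      static void journal_start(void)     /* stands in for jbd2_journal_start()      */
      {
              pthread_mutex_lock(&lock);
              while (commit_in_progress)  /* new handles wait for the commit         */
                      pthread_cond_wait(&cond, &lock);
              open_handles++;
              pthread_mutex_unlock(&lock);
      }

      static void journal_stop(void)      /* stands in for jbd2_journal_stop()       */
      {
              pthread_mutex_lock(&lock);
              open_handles--;
              pthread_cond_broadcast(&cond);
              pthread_mutex_unlock(&lock);
      }

      static void *unlink_thread(void *arg)   /* the suspect "mdt_152"               */
      {
              journal_start();                     /* open a handle ...              */
              pthread_rwlock_rdlock(&llog_sem);    /* ... then sleep on the rwsem    */
              pthread_rwlock_unlock(&llog_sem);
              journal_stop();
              return NULL;
      }

      static void *commit_thread(void *arg)   /* the jbd2/sda (kjournald2) analogue  */
      {
              pthread_mutex_lock(&lock);
              commit_in_progress = 1;
              while (open_handles > 0)    /* waits forever for mdt_152's handle      */
                      pthread_cond_wait(&cond, &lock);
              commit_in_progress = 0;
              pthread_cond_broadcast(&cond);
              pthread_mutex_unlock(&lock);
              return NULL;
      }

      static void *other_mdt_thread(void *arg) /* the hundreds of stuck mdt_* threads */
      {
              journal_start();            /* parks here, like start_this_handle()    */
              journal_stop();
              return NULL;
      }

      int main(void)
      {
              pthread_t t[4];
              int i;

              pthread_rwlock_wrlock(&llog_sem);  /* whoever owns the rwsem never lets go */
              pthread_create(&t[0], NULL, unlink_thread, NULL);
              sleep(1);                          /* let it open its handle first         */
              pthread_create(&t[1], NULL, commit_thread, NULL);
              sleep(1);
              pthread_create(&t[2], NULL, other_mdt_thread, NULL);
              pthread_create(&t[3], NULL, other_mdt_thread, NULL);
              for (i = 0; i < 4; i++)
                      pthread_join(t[i], NULL);  /* never returns: the "MDS" hangs       */
              printf("no deadlock\n");
              return 0;
      }

      In the real system the owner of the rwsem is presumably itself blocked somewhere behind the journal, but the net effect is the same: no handle ever closes, the commit never completes, and jbd2_journal_start never returns for anyone else.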

      Attachments

        1. bt-all.merged.txt
          230 kB
          Patrick Valentin
        2. dmesg.txt
          125 kB
          Patrick Valentin

        Activity

          jgmitter Joseph Gmitter (Inactive) made changes -
          Link Original: This issue is related to DDN-283 [ DDN-283 ]
          jgmitter Joseph Gmitter (Inactive) made changes -
          Link New: This issue is duplicated by DDN-283 [ DDN-283 ]
          jgmitter Joseph Gmitter (Inactive) made changes -
          Link New: This issue is related to DDN-283 [ DDN-283 ]
          patrick.valentin Patrick Valentin (Inactive) made changes -
          Attachment New: dmesg.txt [ 14189 ]
          Attachment New: bt-all.merged.txt [ 14190 ]

          patrick.valentin Patrick Valentin (Inactive) added a comment - edited

          One of the Bull customers (TGCC) has hit the same deadlock twice during the past six months: one thread is stuck in jbd2_journal_commit_transaction() and many other threads are stuck in jbd2_journal_start().

          PID: 29225 TASK: ffff88107c3bb040 CPU: 15 COMMAND: "jbd2/dm-2-8"
           #0 [ffff88107a343c60] schedule at ffffffff81485765
           #1 [ffff88107a343d28] jbd2_journal_commit_transaction at ffffffffa006a94f [jbd2]
           #2 [ffff88107a343e68] kjournald2 at ffffffffa0070c08 [jbd2]
           #3 [ffff88107a343ee8] kthread at ffffffff8107b5f6
           #4 [ffff88107a343f48] kernel_thread at ffffffff8100412a
          

          and most of the threads:

          PID: 15585 TASK: ffff88063062a790 CPU: 0 COMMAND: "mdt_503"
          PID: 15586 TASK: ffff88063062a040 CPU: 23 COMMAND: "mdt_504"
          PID: 15587 TASK: ffff88020f3ad7d0 CPU: 30 COMMAND: "mdt_505"
          PID: 29286 TASK: ffff88087505e790 CPU: 25 COMMAND: "mdt_01"
          ...
          #0 [ffff881949c078f0] schedule at ffffffff81485765
           #1 [ffff881949c079b8] start_this_handle at ffffffffa006908a [jbd2]
           #2 [ffff881949c07a78] jbd2_journal_start at ffffffffa0069500 [jbd2]
           #3 [ffff881949c07ac8] ldiskfs_journal_start_sb at ffffffffa0451ca8 [ldiskfs]
           #4 [ffff881949c07ad8] osd_trans_start at ffffffffa0d4a324 [osd_ldiskfs]
           #5 [ffff881949c07b18] mdd_trans_start at ffffffffa0c4c4e3 [mdd]
           #6 [ffff881949c07b38] mdd_unlink at ffffffffa0c401eb [mdd]
           #7 [ffff881949c07bf8] cml_unlink at ffffffffa0d82e07 [cmm]
           #8 [ffff881949c07c38] mdt_reint_unlink at ffffffffa0cba0f4 [mdt]
           #9 [ffff881949c07cb8] mdt_reint_rec at ffffffffa0cb7cb1 [mdt]
          #10 [ffff881949c07cd8] mdt_reint_internal at ffffffffa0caeed4 [mdt]
          #11 [ffff881949c07d28] mdt_reint at ffffffffa0caf2b4 [mdt]
          #12 [ffff881949c07d48] mdt_handle_common at ffffffffa0ca3762 [mdt]
          #13 [ffff881949c07d98] mdt_regular_handle at ffffffffa0ca4655 [mdt]
          #14 [ffff881949c07da8] ptlrpc_main at ffffffffa071f4f6 [ptlrpc]
          #15 [ffff881949c07f48] kernel_thread at ffffffff8100412a
          

          They are running Lustre 2.1.6, which contains http://review.whamcloud.com/4743 from LU-1648.

          I have attached two files containing the dmesg output and the crash backtraces of all threads.
          Could you reopen this ticket, as it was closed with "Cannot Reproduce"?

          pjones Peter Jones made changes -
          Resolution New: Cannot Reproduce [ 5 ]
          Status Original: Reopened [ 4 ] New: Resolved [ 5 ]
          pjones Peter Jones added a comment -

          ok thanks Chris


          morrone Christopher Morrone (Inactive) added a comment -

          It looks like the LU-1648 fix landed before 2.1.4 in change 4743. I think we can close this until we see it again.

          morrone Christopher Morrone (Inactive) added a comment -

          It sounds like the patch from LU-1648 is needed on b2_1.
          green Oleg Drokin added a comment -

          I guess the other candidate for this issue is LU-1648; can you add a patch from it as well, please?


          People

            Assignee: green Oleg Drokin
            Reporter: nedbass Ned Bass (Inactive)
            Votes: 0
            Watchers: 5
