Details
Bug
Resolution: Cannot Reproduce
Minor
None
Lustre 2.1.6
None
3
13194
Description
This seems to be a duplicate of LU-1276, which was closed with "Cannot Reproduce", and in which I initially added the following note.
One of the Bull customers (TGCC) has hit the same deadlock as described in LU-1276 twice during the past six months: one thread is stuck in jbd2_journal_commit_transaction() and many other threads are stuck in jbd2_journal_start().
PID: 29225  TASK: ffff88107c3bb040  CPU: 15  COMMAND: "jbd2/dm-2-8"
 #0 [ffff88107a343c60] schedule at ffffffff81485765
 #1 [ffff88107a343d28] jbd2_journal_commit_transaction at ffffffffa006a94f [jbd2]
 #2 [ffff88107a343e68] kjournald2 at ffffffffa0070c08 [jbd2]
 #3 [ffff88107a343ee8] kthread at ffffffff8107b5f6
 #4 [ffff88107a343f48] kernel_thread at ffffffff8100412a
and most of the threads:
PID: 15585  TASK: ffff88063062a790  CPU: 0   COMMAND: "mdt_503"
PID: 15586  TASK: ffff88063062a040  CPU: 23  COMMAND: "mdt_504"
PID: 15587  TASK: ffff88020f3ad7d0  CPU: 30  COMMAND: "mdt_505"
PID: 29286  TASK: ffff88087505e790  CPU: 25  COMMAND: "mdt_01"
...
 #0 [ffff881949c078f0] schedule at ffffffff81485765
 #1 [ffff881949c079b8] start_this_handle at ffffffffa006908a [jbd2]
 #2 [ffff881949c07a78] jbd2_journal_start at ffffffffa0069500 [jbd2]
 #3 [ffff881949c07ac8] ldiskfs_journal_start_sb at ffffffffa0451ca8 [ldiskfs]
 #4 [ffff881949c07ad8] osd_trans_start at ffffffffa0d4a324 [osd_ldiskfs]
 #5 [ffff881949c07b18] mdd_trans_start at ffffffffa0c4c4e3 [mdd]
 #6 [ffff881949c07b38] mdd_unlink at ffffffffa0c401eb [mdd]
 #7 [ffff881949c07bf8] cml_unlink at ffffffffa0d82e07 [cmm]
 #8 [ffff881949c07c38] mdt_reint_unlink at ffffffffa0cba0f4 [mdt]
 #9 [ffff881949c07cb8] mdt_reint_rec at ffffffffa0cb7cb1 [mdt]
#10 [ffff881949c07cd8] mdt_reint_internal at ffffffffa0caeed4 [mdt]
#11 [ffff881949c07d28] mdt_reint at ffffffffa0caf2b4 [mdt]
#12 [ffff881949c07d48] mdt_handle_common at ffffffffa0ca3762 [mdt]
#13 [ffff881949c07d98] mdt_regular_handle at ffffffffa0ca4655 [mdt]
#14 [ffff881949c07da8] ptlrpc_main at ffffffffa071f4f6 [ptlrpc]
#15 [ffff881949c07f48] kernel_thread at ffffffff8100412a
They are running Lustre 2.1.6, which contains http://review.whamcloud.com/4743 from LU-1648.
I attach two files containing the dmesg output and the crash backtraces of all threads.