[LU-4794] MDS threads all stuck in jbd2_journal_start Created: 20/Mar/14 Updated: 14/Dec/21 Resolved: 14/Dec/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Valentin (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 13194 | ||||||||||||||||
| Description |
|
This seems to be a duplicate of
One of the Bull customers (TGCC) had the same deadlock as described in
PID: 29225 TASK: ffff88107c3bb040 CPU: 15 COMMAND: "jbd2/dm-2-8"
 #0 [ffff88107a343c60] schedule at ffffffff81485765
 #1 [ffff88107a343d28] jbd2_journal_commit_transaction at ffffffffa006a94f [jbd2]
 #2 [ffff88107a343e68] kjournald2 at ffffffffa0070c08 [jbd2]
 #3 [ffff88107a343ee8] kthread at ffffffff8107b5f6
 #4 [ffff88107a343f48] kernel_thread at ffffffff8100412a
and most of the threads:
PID: 15585 TASK: ffff88063062a790 CPU: 0 COMMAND: "mdt_503"
PID: 15586 TASK: ffff88063062a040 CPU: 23 COMMAND: "mdt_504"
PID: 15587 TASK: ffff88020f3ad7d0 CPU: 30 COMMAND: "mdt_505"
PID: 29286 TASK: ffff88087505e790 CPU: 25 COMMAND: "mdt_01"
...
 #0 [ffff881949c078f0] schedule at ffffffff81485765
 #1 [ffff881949c079b8] start_this_handle at ffffffffa006908a [jbd2]
 #2 [ffff881949c07a78] jbd2_journal_start at ffffffffa0069500 [jbd2]
 #3 [ffff881949c07ac8] ldiskfs_journal_start_sb at ffffffffa0451ca8 [ldiskfs]
 #4 [ffff881949c07ad8] osd_trans_start at ffffffffa0d4a324 [osd_ldiskfs]
 #5 [ffff881949c07b18] mdd_trans_start at ffffffffa0c4c4e3 [mdd]
 #6 [ffff881949c07b38] mdd_unlink at ffffffffa0c401eb [mdd]
 #7 [ffff881949c07bf8] cml_unlink at ffffffffa0d82e07 [cmm]
 #8 [ffff881949c07c38] mdt_reint_unlink at ffffffffa0cba0f4 [mdt]
 #9 [ffff881949c07cb8] mdt_reint_rec at ffffffffa0cb7cb1 [mdt]
#10 [ffff881949c07cd8] mdt_reint_internal at ffffffffa0caeed4 [mdt]
#11 [ffff881949c07d28] mdt_reint at ffffffffa0caf2b4 [mdt]
#12 [ffff881949c07d48] mdt_handle_common at ffffffffa0ca3762 [mdt]
#13 [ffff881949c07d98] mdt_regular_handle at ffffffffa0ca4655 [mdt]
#14 [ffff881949c07da8] ptlrpc_main at ffffffffa071f4f6 [ptlrpc]
#15 [ffff881949c07f48] kernel_thread at ffffffff8100412a
They are running Lustre 2.1.6, which contains http://review.whamcloud.com/4743 from
I attach two files containing the dmesg and the crash backtrace of all threads. |
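To see how many threads are parked in each function, the attached backtrace file can be summarized with a short script. The following is only a sketch: it assumes input in the `PID: ... COMMAND: "..."` / `#N [addr] func at addr [module]` layout shown above, with one stack per PID header, and the function names `parse_backtraces` and `blocking_frames` are made up for this illustration. The sample text reuses frames quoted in this ticket.

```python
import re
from collections import Counter

# A 'PID:' header line as printed in the backtraces quoted in this ticket.
HEADER = re.compile(
    r'^PID:\s*(\d+)\s+TASK:\s*\S+\s+CPU:\s*\d+\s+COMMAND:\s*"([^"]*)"')
# A frame line: '#N [stack address] function at text address [module]'.
FRAME = re.compile(r'^\s*#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+')


def parse_backtraces(text):
    """Return a list of (pid, command, [function names, innermost first])."""
    threads = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            threads.append((int(header.group(1)), header.group(2), []))
            continue
        frame = FRAME.match(line)
        if frame and threads:
            # Frames belong to the most recent PID header.
            threads[-1][2].append(frame.group(2))
    return threads


def blocking_frames(threads, skip=("schedule",)):
    """Count threads by their innermost frame that is not in `skip`."""
    counts = Counter()
    for _pid, _command, frames in threads:
        blocked_in = next((f for f in frames if f not in skip), None)
        if blocked_in is not None:
            counts[blocked_in] += 1
    return counts


SAMPLE = '''PID: 29225 TASK: ffff88107c3bb040 CPU: 15 COMMAND: "jbd2/dm-2-8"
 #0 [ffff88107a343c60] schedule at ffffffff81485765
 #1 [ffff88107a343d28] jbd2_journal_commit_transaction at ffffffffa006a94f [jbd2]
PID: 29286 TASK: ffff88087505e790 CPU: 25 COMMAND: "mdt_01"
 #0 [ffff881949c078f0] schedule at ffffffff81485765
 #1 [ffff881949c079b8] start_this_handle at ffffffffa006908a [jbd2]
 #2 [ffff881949c07a78] jbd2_journal_start at ffffffffa0069500 [jbd2]
'''

if __name__ == "__main__":
    for func, count in blocking_frames(parse_backtraces(SAMPLE)).most_common():
        print(count, func)
```

Threads whose header is not followed by any frame lines (as in the abbreviated PID list above) are simply not counted.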
| Comments |
| Comment by Peter Jones [ 20/Mar/14 ] |
|
Bobijam, could you please advise? Thanks, Peter |
| Comment by Zhenyu Xu [ 22/Apr/14 ] |
|
Relates to
Thread 27469 stack trace from bt-all.merged.txt:
PID: 27469 TASK: ffff88199d25a080 CPU: 12 COMMAND: "ldlm_cn_88"
 #0 [ffff881c4f78b490] schedule at ffffffff81485765
 #1 [ffff881c4f78b558] start_this_handle at ffffffffa006908a [jbd2]
 #2 [ffff881c4f78b618] jbd2_journal_restart at ffffffffa00693d1 [jbd2]
 #3 [ffff881c4f78b668] ldiskfs_truncate_restart_trans at ffffffffa042791a [ldiskfs]
 #4 [ffff881c4f78b698] ldiskfs_clear_blocks at ffffffffa042cc3d [ldiskfs]
 #5 [ffff881c4f78b6f8] ldiskfs_free_data at ffffffffa042ce24 [ldiskfs]
 #6 [ffff881c4f78b758] ldiskfs_free_branches at ffffffffa042d063 [ldiskfs]
 #7 [ffff881c4f78b7b8] ldiskfs_free_branches at ffffffffa042cf56 [ldiskfs]
 #8 [ffff881c4f78b818] ldiskfs_truncate at ffffffffa042d659 [ldiskfs]
 #9 [ffff881c4f78b938] ldiskfs_delete_inode at ffffffffa042e9d0 [ldiskfs]
#10 [ffff881c4f78b958] generic_delete_inode at ffffffff8117f0de
#11 [ffff881c4f78b988] generic_drop_inode at ffffffff8117f235
#12 [ffff881c4f78b9a8] iput at ffffffff8117df52
#13 [ffff881c4f78b9c8] mds_obd_destroy at ffffffffa0bf717d [mds]
#14 [ffff881c4f78bb08] llog_lvfs_destroy at ffffffffa05705cd [obdclass]
#15 [ffff881c4f78bbd8] llog_cancel_rec at ffffffffa0566424 [obdclass]
#16 [ffff881c4f78bc08] llog_cat_cancel_records at ffffffffa056a3a1 [obdclass]
#17 [ffff881c4f78bc68] llog_origin_handle_cancel at ffffffffa072923b [ptlrpc]
#18 [ffff881c4f78bd68] ldlm_cancel_handler at ffffffffa06ee8ff [ptlrpc]
#19 [ffff881c4f78bda8] ptlrpc_main at ffffffffa071f4f6 [ptlrpc]
#20 [ffff881c4f78bf48] kernel_thread at ffffffff8100412a
The transaction does not have enough credits, so the thread restarts the transaction while holding log_handle::lgh_lock. |
| Comment by Zhenyu Xu [ 22/Apr/14 ] |
|
The competing deadlock thread is 15514:
PID: 15514 TASK: ffff88199fe40850 CPU: 16 COMMAND: "mdt_432"
 #0 [ffff881cc6f1b548] schedule at ffffffff81485765
 #1 [ffff881cc6f1b610] rwsem_down_failed_common at ffffffff81487d65
 #2 [ffff881cc6f1b670] rwsem_down_read_failed at ffffffff81487f16
 #3 [ffff881cc6f1b6b0] call_rwsem_down_read_failed at ffffffff81262b24
 #4 [ffff881cc6f1b718] llog_cat_current_log.clone.0 at ffffffffa056ada5 [obdclass]
 #5 [ffff881cc6f1b7b8] llog_cat_add_rec at ffffffffa056baca [obdclass]
 #6 [ffff881cc6f1b808] llog_obd_origin_add at ffffffffa0571627 [obdclass]
 #7 [ffff881cc6f1b838] llog_add at ffffffffa0571801 [obdclass]
 #8 [ffff881cc6f1b888] lov_llog_origin_add at ffffffffa09f70fc [lov]
 #9 [ffff881cc6f1b908] llog_add at ffffffffa0571801 [obdclass]
#10 [ffff881cc6f1b958] mds_llog_origin_add at ffffffffa0bf8d53 [mds]
#11 [ffff881cc6f1b9a8] llog_add at ffffffffa0571801 [obdclass]
#12 [ffff881cc6f1b9f8] mds_llog_add_unlink at ffffffffa0bf93ca [mds]
#13 [ffff881cc6f1ba48] mds_log_op_unlink at ffffffffa0bf9a08 [mds]
#14 [ffff881cc6f1baa8] mdd_unlink_log at ffffffffa0c2df31 [mdd]
#15 [ffff881cc6f1bac8] mdd_object_kill at ffffffffa0c2526b [mdd]
#16 [ffff881cc6f1baf8] mdd_finish_unlink at ffffffffa0c3b13e [mdd]
#17 [ffff881cc6f1bb38] mdd_unlink at ffffffffa0c40696 [mdd]
#18 [ffff881cc6f1bbf8] cml_unlink at ffffffffa0d82e07 [cmm]
#19 [ffff881cc6f1bc38] mdt_reint_unlink at ffffffffa0cba0f4 [mdt]
#20 [ffff881cc6f1bcb8] mdt_reint_rec at ffffffffa0cb7cb1 [mdt]
#21 [ffff881cc6f1bcd8] mdt_reint_internal at ffffffffa0caeed4 [mdt]
#22 [ffff881cc6f1bd28] mdt_reint at ffffffffa0caf2b4 [mdt]
#23 [ffff881cc6f1bd48] mdt_handle_common at ffffffffa0ca3762 [mdt]
#24 [ffff881cc6f1bd98] mdt_regular_handle at ffffffffa0ca4655 [mdt]
#25 [ffff881cc6f1bda8] ptlrpc_main at ffffffffa071f4f6 [ptlrpc]
#26 [ffff881cc6f1bf48] kernel_thread at ffffffff8100412a |
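The traces quoted in this ticket look consistent with a circular wait, which can be checked mechanically with a wait-for graph. The sketch below is illustrative only. The edges are an interpretation of the stacks, not something the comments state outright: thread 27469 is said to hold log_handle::lgh_lock while blocked in jbd2_journal_restart, thread 15514 is blocked on a read semaphore in llog_cat_current_log, and it is assumed here that the jbd2 commit thread is waiting for open handles to complete and that thread 15514 holds one of them from its earlier mdd_trans_start. The helper `find_cycle` and the node labels are hypothetical names chosen for this example.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    `waits_for` maps each waiter to the list of parties it is blocked on.
    """
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge: the path from the first occurrence closes a cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for blocker in waits_for.get(node, ()):
            cycle = visit(blocker)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Assumed edges, inferred from the stack traces in this ticket.
WAITS_FOR = {
    # Holds lgh_lock, blocked restarting its journal handle.
    "ldlm_cn_88 (PID 27469)": ["jbd2/dm-2-8 (PID 29225)"],
    # Commit thread, assumed to wait for open handles to finish.
    "jbd2/dm-2-8 (PID 29225)": ["mdt_432 (PID 15514)"],
    # Assumed to hold an open handle, blocked on the llog semaphore.
    "mdt_432 (PID 15514)": ["ldlm_cn_88 (PID 27469)"],
    # One of the many threads stuck in jbd2_journal_start (not in the cycle).
    "mdt_503 (PID 15585)": ["jbd2/dm-2-8 (PID 29225)"],
}

if __name__ == "__main__":
    print(" -> ".join(find_cycle(WAITS_FOR)))
```

Under these assumptions the three threads form the cycle, while the remaining mdt threads only queue up behind the commit thread, which would match the symptom in the title.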
| Comment by Zhenyu Xu [ 24/Apr/14 ] |
|
Patch tracked at http://review.whamcloud.com/10076 (b2_1 needs it, b2_4 does not). |