Details
- Bug
- Resolution: Fixed
- Minor
- None
- None
- 3
- 4726
Description
Hi,
The following assertion failure with Lustre 2.0.0 was reported by the on-site support at the CEA customer site:
fs/jbd2/transaction.c: jbd2_journal_dirty_metadata(), line 1030
J_ASSERT_JH(jh, handle->h_buffer_credits > 0);
This issue has been hit several times on restart of an MDS. In this particular case the problem is not extremely critical, since after dump and restart the service continues.

------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1030!
invalid opcode: 0000 [#1] SMP

PID: 24472  TASK: ffff8808556011c0  CPU: 22  COMMAND: "tgt_recov"
 #0 [ffff88083370a9d0] machine_kexec at ffffffff8102e77b
 #1 [ffff88083370aa30] crash_kexec at ffffffff810a6cd8
 #2 [ffff88083370ab00] oops_end at ffffffff8146aad0
 #3 [ffff88083370ab30] die at ffffffff8101021b
 #4 [ffff88083370ab60] do_trap at ffffffff8146a3a4
 #5 [ffff88083370abc0] do_invalid_op at ffffffff8100dda5
 #6 [ffff88083370ac60] invalid_op at ffffffff8100cf3b
    [exception RIP: jbd2_journal_dirty_metadata+269]
    RIP: ffffffffa00518ed  RSP: ffff88083370ad10  RFLAGS: 00010246
    RAX: ffff881831c8b8c0  RBX: ffff881834107468  RCX: ffff8808512adc90
    RDX: 0000000000000000  RSI: ffff8808512adc90  RDI: 0000000000000000
    RBP: ffff88083370ad30   R8: 2010000000000000   R9: f790d737baaf2402
    R10: 0000000000000001  R11: 0000000000000040  R12: ffff8818343606d8
    R13: ffff8808512adc90  R14: ffff880859b81800  R15: 0000000000002000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88083370ad38] __ldiskfs_handle_dirty_metadata at ffffffffa04bb3fb [ldiskfs]
 #8 [ffff88083370ad78] fsfilt_ldiskfs_write_handle at ffffffffa09bede7 [fsfilt_ldiskfs]
 #9 [ffff88083370ae28] fsfilt_ldiskfs_write_record at ffffffffa09bf0fe [fsfilt_ldiskfs]
#10 [ffff88083370aea8] llog_lvfs_write_blob at ffffffffa05a018c [obdclass]
#11 [ffff88083370af58] llog_lvfs_write_rec at ffffffffa05a1732 [obdclass]
#12 [ffff88083370b038] llog_cat_current_log.clone.0 at ffffffffa059e14f [obdclass]
#13 [ffff88083370b118] llog_cat_add_rec at ffffffffa059e86a [obdclass]
#14 [ffff88083370b198] llog_obd_origin_add at ffffffffa05a51a6 [obdclass]
#15 [ffff88083370b1f8] llog_add at ffffffffa05a5381 [obdclass]
#16 [ffff88083370b268] lov_llog_origin_add at ffffffffa089a0cc [lov]
#17 [ffff88083370b318] llog_add at ffffffffa05a5381 [obdclass]
#18 [ffff88083370b388] mds_llog_origin_add at ffffffffa09d46f9 [mds]
#19 [ffff88083370b408] llog_add at ffffffffa05a5381 [obdclass]
#20 [ffff88083370b478] mds_llog_add_unlink at ffffffffa09d4de4 [mds]
#21 [ffff88083370b4f8] mds_log_op_orphan at ffffffffa09d5229 [mds]
#22 [ffff88083370b578] mds_lov_update_objids at ffffffffa09de7ef [mds]
#23 [ffff88083370b638] mdd_lov_objid_update at ffffffffa09f5cb2 [mdd]
#24 [ffff88083370b648] mdd_create_data at ffffffffa0a02c91 [mdd]
#25 [ffff88083370b6e8] cml_create_data at ffffffffa0acf036 [cmm]
#26 [ffff88083370b768] mdt_finish_open at ffffffffa0a6c885 [mdt]
#27 [ffff88083370b838] mdt_reint_open at ffffffffa0a6d119 [mdt]
#28 [ffff88083370b958] mdt_reint_rec at ffffffffa0a5764f [mdt]
#29 [ffff88083370b9a8] mdt_reint_internal at ffffffffa0a4ea04 [mdt]
#30 [ffff88083370ba38] mdt_intent_reint at ffffffffa0a4f085 [mdt]
#31 [ffff88083370bab8] mdt_intent_policy at ffffffffa0a48270 [mdt]
#32 [ffff88083370bb28] ldlm_lock_enqueue at ffffffffa068ea9d [ptlrpc]
#33 [ffff88083370bbc8] ldlm_handle_enqueue0 at ffffffffa06b64d1 [ptlrpc]
#34 [ffff88083370bc68] mdt_enqueue at ffffffffa0a47dea [mdt]
#35 [ffff88083370bc98] mdt_handle_common at ffffffffa0a439f5 [mdt]
#36 [ffff88083370bd18] mdt_recovery_handle at ffffffffa0a44a68 [mdt]
#37 [ffff88083370bd68] handle_recovery_req at ffffffffa0699512 [ptlrpc]
#38 [ffff88083370bde8] target_recovery_thread at ffffffffa0699b36 [ptlrpc]
#39 [ffff88083370bf48] kernel_thread at ffffffff8100d1aa

Something similar is sometimes hit just after the MDS ends recovery, during orphan cleanup. In that case the MDS crashes repeatedly after each Lustre restart and, as a workaround, we had to mount the volume in ldiskfs mode and remove the PENDING subdirectory.

Is the block reservation done in fsfilt_ldiskfs_write_record for the jbd2 transaction too small?
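To make the question about the reservation concrete, here is a minimal user-space sketch of the accounting that the failed assertion guards, as far as I understand it: a handle is opened with a fixed number of buffer credits, the first dirtying of each metadata buffer consumes one credit, and dirtying a new buffer with no credits left is where J_ASSERT_JH(jh, handle->h_buffer_credits > 0) would fire. All toy_* names are hypothetical and only model the idea; this is not the jbd2, ldiskfs or fsfilt code, and the credit counts are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for a journal handle and a journalled buffer.
 * Hypothetical types, NOT the real jbd2 structures. */
struct toy_handle {
    int buffer_credits;   /* credits still available to this handle */
};

struct toy_buffer {
    bool modified;        /* already dirtied within this handle? */
};

/* Open a handle with a reservation of nblocks credits. */
static void toy_start(struct toy_handle *h, int nblocks)
{
    assert(h != NULL);
    h->buffer_credits = nblocks;
}

/* Dirty one metadata buffer under the handle.
 * Returns false at the point where the real code would hit the
 * h_buffer_credits > 0 assertion; returns true otherwise. */
static bool toy_dirty_metadata(struct toy_handle *h, struct toy_buffer *b)
{
    assert(h != NULL && b != NULL);
    if (b->modified)
        return true;      /* re-dirtying costs no extra credit */
    if (h->buffer_credits <= 0)
        return false;     /* reservation was too small */
    b->modified = true;
    h->buffer_credits--;
    return true;
}

/* Model a record write that touches nbuffers distinct buffers under a
 * reservation of `reserved` credits. Returns the index of the first
 * buffer that could not be dirtied, or -1 if the reservation sufficed. */
static int toy_write_record(int reserved, int nbuffers)
{
    struct toy_handle h;
    int i;

    toy_start(&h, reserved);
    for (i = 0; i < nbuffers; i++) {
        struct toy_buffer b = { false };  /* fresh, not yet dirtied */

        if (!toy_dirty_metadata(&h, &b))
            return i;
    }
    return -1;
}
```

In this model, toy_write_record(2, 3) fails on the third buffer while toy_write_record(3, 3) succeeds, which is the pattern one would expect if a code path such as the llog write above touches more blocks than the caller reserved.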
Issue Links
- is duplicated by LU-1045 kernel BUG at fs/jbd2/transaction.c:1033! (Resolved)
Trackbacks
- Changelog 2.1: Changes from version 2.1.0 to version 2.1.1. Server support for kernels: 2.6.18-274.12.1.el5 (RHEL5), 2.6.32-220.el6 (RHEL6). Client support for unpatched kernels: 2.6.18-274.12.1.el5 (RHEL5), 2.6.32-220.el6 (RHEL6), 2.6.32.360....
- Changelog 2.2: version 2.2.0. Support for networks: o2iblnd OFED 1.5.4. Server support for kernels: 2.6.32-220.4.2.el6 (RHEL6). Client support for unpatched kernels: 2.6.18-274.18.1.el5 (RHEL5), 2.6.32-220.4.2.el6 (RHEL6), 2.6.32.360....