
Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.2.0, Lustre 2.1.1
Affects Version/s: None
Labels: None

Severity: 3
Rank (Obsolete): 4726

Hi,

The following assertion failure with Lustre 2.0.0 was reported by the on-site support team at the CEA customer site:

fs/jbd2/transaction.c: jbd2_journal_dirty_metadata(), line 1030

J_ASSERT_JH(jh, handle->h_buffer_credits > 0);

This issue has been hit several times on restart of an MDS. On this particular one, the problem is not extremely critical,
since after dump+restart the service continues.



------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1030!
invalid opcode: 0000 [#1] SMP 

PID: 24472  TASK: ffff8808556011c0  CPU: 22  COMMAND: "tgt_recov"
 #0 [ffff88083370a9d0] machine_kexec at ffffffff8102e77b
 #1 [ffff88083370aa30] crash_kexec at ffffffff810a6cd8
 #2 [ffff88083370ab00] oops_end at ffffffff8146aad0
 #3 [ffff88083370ab30] die at ffffffff8101021b
 #4 [ffff88083370ab60] do_trap at ffffffff8146a3a4
 #5 [ffff88083370abc0] do_invalid_op at ffffffff8100dda5
 #6 [ffff88083370ac60] invalid_op at ffffffff8100cf3b
    [exception RIP: jbd2_journal_dirty_metadata+269]
    RIP: ffffffffa00518ed  RSP: ffff88083370ad10  RFLAGS: 00010246
    RAX: ffff881831c8b8c0  RBX: ffff881834107468  RCX: ffff8808512adc90
    RDX: 0000000000000000  RSI: ffff8808512adc90  RDI: 0000000000000000
    RBP: ffff88083370ad30   R8: 2010000000000000   R9: f790d737baaf2402
    R10: 0000000000000001  R11: 0000000000000040  R12: ffff8818343606d8
    R13: ffff8808512adc90  R14: ffff880859b81800  R15: 0000000000002000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88083370ad38] __ldiskfs_handle_dirty_metadata at ffffffffa04bb3fb [ldiskfs]
 #8 [ffff88083370ad78] fsfilt_ldiskfs_write_handle at ffffffffa09bede7 [fsfilt_ldiskfs]
 #9 [ffff88083370ae28] fsfilt_ldiskfs_write_record at ffffffffa09bf0fe [fsfilt_ldiskfs]
#10 [ffff88083370aea8] llog_lvfs_write_blob at ffffffffa05a018c [obdclass]
#11 [ffff88083370af58] llog_lvfs_write_rec at ffffffffa05a1732 [obdclass]
#12 [ffff88083370b038] llog_cat_current_log.clone.0 at ffffffffa059e14f [obdclass]
#13 [ffff88083370b118] llog_cat_add_rec at ffffffffa059e86a [obdclass]
#14 [ffff88083370b198] llog_obd_origin_add at ffffffffa05a51a6 [obdclass]
#15 [ffff88083370b1f8] llog_add at ffffffffa05a5381 [obdclass]
#16 [ffff88083370b268] lov_llog_origin_add at ffffffffa089a0cc [lov]
#17 [ffff88083370b318] llog_add at ffffffffa05a5381 [obdclass]
#18 [ffff88083370b388] mds_llog_origin_add at ffffffffa09d46f9 [mds]
#19 [ffff88083370b408] llog_add at ffffffffa05a5381 [obdclass]
#20 [ffff88083370b478] mds_llog_add_unlink at ffffffffa09d4de4 [mds]
#21 [ffff88083370b4f8] mds_log_op_orphan at ffffffffa09d5229 [mds]
#22 [ffff88083370b578] mds_lov_update_objids at ffffffffa09de7ef [mds]
#23 [ffff88083370b638] mdd_lov_objid_update at ffffffffa09f5cb2 [mdd]
#24 [ffff88083370b648] mdd_create_data at ffffffffa0a02c91 [mdd]
#25 [ffff88083370b6e8] cml_create_data at ffffffffa0acf036 [cmm]
#26 [ffff88083370b768] mdt_finish_open at ffffffffa0a6c885 [mdt]
#27 [ffff88083370b838] mdt_reint_open at ffffffffa0a6d119 [mdt]
#28 [ffff88083370b958] mdt_reint_rec at ffffffffa0a5764f [mdt]
#29 [ffff88083370b9a8] mdt_reint_internal at ffffffffa0a4ea04 [mdt]
#30 [ffff88083370ba38] mdt_intent_reint at ffffffffa0a4f085 [mdt]
#31 [ffff88083370bab8] mdt_intent_policy at ffffffffa0a48270 [mdt]
#32 [ffff88083370bb28] ldlm_lock_enqueue at ffffffffa068ea9d [ptlrpc]
#33 [ffff88083370bbc8] ldlm_handle_enqueue0 at ffffffffa06b64d1 [ptlrpc]
#34 [ffff88083370bc68] mdt_enqueue at ffffffffa0a47dea [mdt]
#35 [ffff88083370bc98] mdt_handle_common at ffffffffa0a439f5 [mdt]
#36 [ffff88083370bd18] mdt_recovery_handle at ffffffffa0a44a68 [mdt]
#37 [ffff88083370bd68] handle_recovery_req at ffffffffa0699512 [ptlrpc]
#38 [ffff88083370bde8] target_recovery_thread at ffffffffa0699b36 [ptlrpc]
#39 [ffff88083370bf48] kernel_thread at ffffffff8100d1aa


Something similar is sometimes hit just after the MDS ends recovery, during orphan cleanup. In that case the MDS crashes
repeatedly after each Lustre restart and, as a workaround, we had to mount the volume in ldiskfs mode and remove the
PENDING subdirectory.

Is the block reservation done in fsfilt_ldiskfs_write_record for the jbd2 transaction too small?
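
To make the question concrete, here is a minimal toy model of the credit accounting that the assertion guards. It is only a sketch, not the jbd2 or Lustre code: struct toy_handle and the toy_* helpers are hypothetical names invented for illustration, and the model assumes that each newly dirtied metadata buffer consumes exactly one credit from the handle.

```c
#include <assert.h>

/* Toy stand-in for a journal handle: it only tracks remaining credits.
 * The field plays the role of handle->h_buffer_credits (assumption). */
struct toy_handle {
    int buffer_credits;
};

/* Hypothetical helper: start a handle with a fixed credit reservation. */
static struct toy_handle toy_handle_start(int nblocks)
{
    assert(nblocks >= 0);
    struct toy_handle h = { .buffer_credits = nblocks };
    return h;
}

/* Hypothetical helper: account for one newly dirtied metadata buffer.
 * Returns 0 on success, or -1 at the point where the real code would
 * hit J_ASSERT_JH(jh, handle->h_buffer_credits > 0). */
static int toy_dirty_metadata(struct toy_handle *h)
{
    if (h->buffer_credits <= 0)
        return -1;
    h->buffer_credits--;
    return 0;
}

/* Dirty nbuffers buffers under one handle. Returns the index of the
 * first buffer that could not be covered, or -1 if all of them fit. */
static int toy_write_record(struct toy_handle *h, int nbuffers)
{
    for (int i = 0; i < nbuffers; i++)
        if (toy_dirty_metadata(h) != 0)
            return i;
    return -1;
}
```

In this model, a handle started with 2 credits that ends up dirtying 4 buffers fails on the third buffer, which is the kind of under-reservation the question above is asking about.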

trace_chown_1_to_10
5.95 MB
16/Dec/11 11:45 AM

is duplicated by

LU-1045 kernel BUG at fs/jbd2/transaction.c:1033!

Resolved
