Details
-
Bug
-
Resolution: Won't Do
-
Major
-
None
-
Lustre 2.1.4
-
Lustre 2.1.4 on RHEL 6.4
-
3
-
7472
Description
A couple OSS nodes went down with very similar stack traces. One of the stack is below:
2013-03-28 11:01:49 ------------[ cut here ]------------ 2013-03-28 11:01:49 kernel BUG at fs/jbd2/transaction.c:293! 2013-03-28 11:01:49 invalid opcode: 0000 [#1] SMP 2013-03-28 11:01:49 last sysfs file: /sys/devices/system/edac/pci/pci_parity_count 2013-03-28 11:01:49 CPU 11 2013-03-28 11:01:49 Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) libcfs(U) sd_mod crc_t10dif ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm_intel kvm sg sr_mod cdrom mpt2sas scsi_transport_sas raid_class serio_raw i2c_i801 i2c_core ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand] 2013-03-28 11:01:50 2013-03-28 11:01:50 Pid: 6814, comm: ll_ost_io_72 Not tainted 2.6.32-358.5chaos.ch5.1.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH 2013-03-28 11:01:50 RIP: 0010:[<ffffffffa0c2b5ac>] [<ffffffffa0c2b5ac>] jbd2_journal_start+0x10c/0x110 [jbd2] 2013-03-28 11:01:50 RSP: 0018:ffff880107687190 EFLAGS: 00010287 2013-03-28 11:01:50 RAX: ffff8800add1f580 RBX: ffff880341bc2318 RCX: 0000000000000000 2013-03-28 11:01:50 RDX: 0000000000000000 RSI: 0000000000000012 RDI: ffff88010ed29000 2013-03-28 11:01:50 RBP: ffff8801076871d0 R08: ffff88039ec46b38 R09: fea40d252cae9407 2013-03-28 11:01:50 R10: 0000000000000000 R11: 0000000000000001 R12: ffff880341bc2318 2013-03-28 11:01:50 R13: ffff88010ed29000 R14: ffff8801237ae080 R15: 0000000000000012 2013-03-28 11:01:50 FS: 00007ffff7fdb700(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000 2013-03-28 11:01:50 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2013-03-28 11:01:50 CR2: 000000000047f670 CR3: 0000000001a85000 CR4: 00000000000007e0 2013-03-28 11:01:50 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2013-03-28 11:01:50 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2013-03-28 11:01:50 Process ll_ost_io_72 (pid: 6814, threadinfo ffff880107686000, task ffff8801237ae080) 2013-03-28 11:01:50 Stack: 2013-03-28 11:01:50 ffff8801076871a0 ffffffff810e684e ffff8801076871c0 ffff880129716c00 2013-03-28 11:01:50 <d> 0000000000000012 0000000000000000 ffffffff81fcb680 ffff88039ec46b38 2013-03-28 11:01:50 <d> ffff8801076871f0 ffffffffa0c83506 ffff88039ec46b28 ffff88039ec46c60 2013-03-28 11:01:50 Call Trace: 2013-03-28 11:01:50 [<ffffffff810e684e>] ? call_rcu+0xe/0x10 2013-03-28 11:01:50 [<ffffffffa0c83506>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs] 2013-03-28 11:01:50 [<ffffffffa0c83854>] ldiskfs_dquot_drop+0x34/0x80 [ldiskfs] 2013-03-28 11:01:50 [<ffffffff811e00a2>] vfs_dq_drop+0x52/0x60 2013-03-28 11:01:50 [<ffffffff8119cac3>] clear_inode+0x93/0x140 2013-03-28 11:01:50 [<ffffffff8119cbb0>] dispose_list+0x40/0x120 2013-03-28 11:01:50 [<ffffffff8119cf04>] shrink_icache_memory+0x274/0x2e0 2013-03-28 11:01:50 [<ffffffff8113196a>] shrink_slab+0x12a/0x1a0 2013-03-28 11:01:50 [<ffffffff81134439>] zone_reclaim+0x279/0x400 2013-03-28 11:01:50 [<ffffffff8112a08c>] get_page_from_freelist+0x69c/0x830 2013-03-28 11:01:50 [<ffffffff8112b623>] ? __alloc_pages_nodemask+0x113/0x8d0 2013-03-28 11:01:50 [<ffffffffa0c53456>] ? ldiskfs_ext_get_blocks+0x296/0x14b0 [ldiskfs] 2013-03-28 11:01:50 [<ffffffff8112b623>] __alloc_pages_nodemask+0x113/0x8d0 2013-03-28 11:01:50 [<ffffffff81166f78>] ? ____cache_alloc_node+0x108/0x160 2013-03-28 11:01:50 [<ffffffff81166572>] kmem_getpages+0x62/0x170 2013-03-28 11:01:50 [<ffffffff81166bdf>] cache_grow+0x2cf/0x320 2013-03-28 11:01:50 [<ffffffff81166e32>] cache_alloc_refill+0x202/0x240 2013-03-28 11:01:50 [<ffffffff81167ecf>] kmem_cache_alloc+0x15f/0x190 2013-03-28 11:01:50 [<ffffffff8109e6b6>] prepare_creds+0x26/0xb0 2013-03-28 11:01:50 [<ffffffffa052a865>] cfs_cap_raise+0x15/0x40 [libcfs] 2013-03-28 11:01:50 [<ffffffffa0d580aa>] filter_commitrw_write+0x11ca/0x2d28 [obdfilter] 2013-03-28 11:01:50 [<ffffffff8150db00>] ? thread_return+0x4e/0x76e 2013-03-28 11:01:50 [<ffffffff810812fc>] ? lock_timer_base+0x3c/0x70 2013-03-28 11:01:50 [<ffffffff81081e12>] ? del_timer_sync+0x22/0x30 2013-03-28 11:01:50 [<ffffffff8150e8fa>] ? schedule_timeout+0x19a/0x2e0 2013-03-28 11:01:50 [<ffffffff81081410>] ? process_timeout+0x0/0x10 2013-03-28 11:01:50 [<ffffffffa0d4b332>] filter_commitrw+0x272/0x290 [obdfilter] 2013-03-28 11:01:50 [<ffffffffa07c69a9>] ? target_bulk_io+0x619/0x940 [ptlrpc] 2013-03-28 11:01:50 [<ffffffffa0ce0bdd>] obd_commitrw+0x11d/0x3c0 [ost] 2013-03-28 11:01:50 [<ffffffffa0ce8994>] ost_brw_write+0xcc4/0x1600 [ost] 2013-03-28 11:01:50 [<ffffffffa07bb000>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] 2013-03-28 11:01:50 [<ffffffffa0cee0a7>] ost_handle+0x2b77/0x4270 [ost] 2013-03-28 11:01:50 [<ffffffffa07f9e1c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] 2013-03-28 11:01:50 [<ffffffffa080141b>] ? ptlrpc_update_export_timer+0x4b/0x470 [ptlrpc] 2013-03-28 11:01:50 [<ffffffffa080900e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc] 2013-03-28 11:01:50 [<ffffffffa08083c0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc] 2013-03-28 11:01:50 [<ffffffff8100c0ca>] child_rip+0xa/0x20 2013-03-28 11:01:50 [<ffffffffa08083c0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc] 2013-03-28 11:01:50 [<ffffffffa08083c0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc] 2013-03-28 11:01:50 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 2013-03-28 11:01:50 Code: e8 6a cf 53 e0 49 c7 86 38 08 00 00 00 00 00 00 8b 45 c8 48 63 d8 e9 62 ff ff ff 0f 1f 40 00 48 c7 c3 f4 ff ff ff e9 52 ff ff ff <0f> 0b eb fe 55 48 89 e5 0f 1f 44 00 00 be 01 00 00 00 e8 dd fe 2013-03-28 11:01:50 RIP [<ffffffffa0c2b5ac>] jbd2_journal_start+0x10c/0x110 [jbd2] 2013-03-28 11:01:50 RSP <ffff880107687190>
From what I can tell, the root of the problem is current->journal_info->h_transaction->t_journal does not match LDISKFS_SB(sb)->s_journal. Causing us to trigger this assertion in jbd2_journal_start:
292 if (handle) { 293 J_ASSERT(handle->h_transaction->t_journal == journal); 294 handle->h_ref++; 295 return handle; 296 }
Attachments
Issue Links
- is related to
-
LU-4529 kernel BUG at fs/jbd2/transaction.c:293! ?DUPLICATE OF LU-3071?
- Resolved