Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3071

kernel BUG at fs/jbd2/transaction.c:293 (jbd2_journal_start)

    XMLWordPrintable

Details

    • Bug
    • Resolution: Won't Do
    • Major
    • None
    • Lustre 2.1.4
    • Lustre 2.1.4 on RHEL 6.4
    • 3
    • 7472

    Description

      A couple OSS nodes went down with very similar stack traces. One of the stack is below:

      2013-03-28 11:01:49 ------------[ cut here ]------------
      2013-03-28 11:01:49 kernel BUG at fs/jbd2/transaction.c:293!
      2013-03-28 11:01:49 invalid opcode: 0000 [#1] SMP 
      2013-03-28 11:01:49 last sysfs file: /sys/devices/system/edac/pci/pci_parity_count
      2013-03-28 11:01:49 CPU 11 
      2013-03-28 11:01:49 Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) libcfs(U) sd_mod crc_t10dif ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm_intel kvm sg sr_mod cdrom mpt2sas scsi_transport_sas raid_class serio_raw i2c_i801 i2c_core ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand]
      2013-03-28 11:01:50 
      2013-03-28 11:01:50 Pid: 6814, comm: ll_ost_io_72 Not tainted 2.6.32-358.5chaos.ch5.1.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
      2013-03-28 11:01:50 RIP: 0010:[<ffffffffa0c2b5ac>]  [<ffffffffa0c2b5ac>] jbd2_journal_start+0x10c/0x110 [jbd2]
      2013-03-28 11:01:50 RSP: 0018:ffff880107687190  EFLAGS: 00010287
      2013-03-28 11:01:50 RAX: ffff8800add1f580 RBX: ffff880341bc2318 RCX: 0000000000000000
      2013-03-28 11:01:50 RDX: 0000000000000000 RSI: 0000000000000012 RDI: ffff88010ed29000
      2013-03-28 11:01:50 RBP: ffff8801076871d0 R08: ffff88039ec46b38 R09: fea40d252cae9407
      2013-03-28 11:01:50 R10: 0000000000000000 R11: 0000000000000001 R12: ffff880341bc2318
      2013-03-28 11:01:50 R13: ffff88010ed29000 R14: ffff8801237ae080 R15: 0000000000000012
      2013-03-28 11:01:50 FS:  00007ffff7fdb700(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
      2013-03-28 11:01:50 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      2013-03-28 11:01:50 CR2: 000000000047f670 CR3: 0000000001a85000 CR4: 00000000000007e0
      2013-03-28 11:01:50 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2013-03-28 11:01:50 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2013-03-28 11:01:50 Process ll_ost_io_72 (pid: 6814, threadinfo ffff880107686000, task ffff8801237ae080)
      2013-03-28 11:01:50 Stack:
      2013-03-28 11:01:50  ffff8801076871a0 ffffffff810e684e ffff8801076871c0 ffff880129716c00
      2013-03-28 11:01:50 <d> 0000000000000012 0000000000000000 ffffffff81fcb680 ffff88039ec46b38
      2013-03-28 11:01:50 <d> ffff8801076871f0 ffffffffa0c83506 ffff88039ec46b28 ffff88039ec46c60
      2013-03-28 11:01:50 Call Trace:
      2013-03-28 11:01:50  [<ffffffff810e684e>] ? call_rcu+0xe/0x10
      2013-03-28 11:01:50  [<ffffffffa0c83506>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
      2013-03-28 11:01:50  [<ffffffffa0c83854>] ldiskfs_dquot_drop+0x34/0x80 [ldiskfs]
      2013-03-28 11:01:50  [<ffffffff811e00a2>] vfs_dq_drop+0x52/0x60
      2013-03-28 11:01:50  [<ffffffff8119cac3>] clear_inode+0x93/0x140
      2013-03-28 11:01:50  [<ffffffff8119cbb0>] dispose_list+0x40/0x120
      2013-03-28 11:01:50  [<ffffffff8119cf04>] shrink_icache_memory+0x274/0x2e0
      2013-03-28 11:01:50  [<ffffffff8113196a>] shrink_slab+0x12a/0x1a0
      2013-03-28 11:01:50  [<ffffffff81134439>] zone_reclaim+0x279/0x400
      2013-03-28 11:01:50  [<ffffffff8112a08c>] get_page_from_freelist+0x69c/0x830
      2013-03-28 11:01:50  [<ffffffff8112b623>] ? __alloc_pages_nodemask+0x113/0x8d0
      2013-03-28 11:01:50  [<ffffffffa0c53456>] ? ldiskfs_ext_get_blocks+0x296/0x14b0 [ldiskfs]
      2013-03-28 11:01:50  [<ffffffff8112b623>] __alloc_pages_nodemask+0x113/0x8d0
      2013-03-28 11:01:50  [<ffffffff81166f78>] ? ____cache_alloc_node+0x108/0x160
      2013-03-28 11:01:50  [<ffffffff81166572>] kmem_getpages+0x62/0x170
      2013-03-28 11:01:50  [<ffffffff81166bdf>] cache_grow+0x2cf/0x320
      2013-03-28 11:01:50  [<ffffffff81166e32>] cache_alloc_refill+0x202/0x240
      2013-03-28 11:01:50  [<ffffffff81167ecf>] kmem_cache_alloc+0x15f/0x190
      2013-03-28 11:01:50  [<ffffffff8109e6b6>] prepare_creds+0x26/0xb0
      2013-03-28 11:01:50  [<ffffffffa052a865>] cfs_cap_raise+0x15/0x40 [libcfs]
      2013-03-28 11:01:50  [<ffffffffa0d580aa>] filter_commitrw_write+0x11ca/0x2d28 [obdfilter]
      2013-03-28 11:01:50  [<ffffffff8150db00>] ? thread_return+0x4e/0x76e
      2013-03-28 11:01:50  [<ffffffff810812fc>] ? lock_timer_base+0x3c/0x70
      2013-03-28 11:01:50  [<ffffffff81081e12>] ? del_timer_sync+0x22/0x30
      2013-03-28 11:01:50  [<ffffffff8150e8fa>] ? schedule_timeout+0x19a/0x2e0
      2013-03-28 11:01:50  [<ffffffff81081410>] ? process_timeout+0x0/0x10
      2013-03-28 11:01:50  [<ffffffffa0d4b332>] filter_commitrw+0x272/0x290 [obdfilter]
      2013-03-28 11:01:50  [<ffffffffa07c69a9>] ? target_bulk_io+0x619/0x940 [ptlrpc]
      2013-03-28 11:01:50  [<ffffffffa0ce0bdd>] obd_commitrw+0x11d/0x3c0 [ost]
      2013-03-28 11:01:50  [<ffffffffa0ce8994>] ost_brw_write+0xcc4/0x1600 [ost]
      2013-03-28 11:01:50  [<ffffffffa07bb000>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
      2013-03-28 11:01:50  [<ffffffffa0cee0a7>] ost_handle+0x2b77/0x4270 [ost]
      2013-03-28 11:01:50  [<ffffffffa07f9e1c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
      2013-03-28 11:01:50  [<ffffffffa080141b>] ? ptlrpc_update_export_timer+0x4b/0x470 [ptlrpc]
      2013-03-28 11:01:50  [<ffffffffa080900e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
      2013-03-28 11:01:50  [<ffffffffa08083c0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      2013-03-28 11:01:50  [<ffffffff8100c0ca>] child_rip+0xa/0x20
      2013-03-28 11:01:50  [<ffffffffa08083c0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      2013-03-28 11:01:50  [<ffffffffa08083c0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      2013-03-28 11:01:50  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2013-03-28 11:01:50 Code: e8 6a cf 53 e0 49 c7 86 38 08 00 00 00 00 00 00 8b 45 c8 48 63 d8 e9 62 ff ff ff 0f 1f 40 00 48 c7 c3 f4 ff ff ff e9 52 ff ff ff <0f> 0b eb fe 55 48 89 e5 0f 1f 44 00 00 be 01 00 00 00 e8 dd fe 
      2013-03-28 11:01:50 RIP  [<ffffffffa0c2b5ac>] jbd2_journal_start+0x10c/0x110 [jbd2]
      2013-03-28 11:01:50  RSP <ffff880107687190>
      

      From what I can tell, the root of the problem is current->journal_info->h_transaction->t_journal does not match LDISKFS_SB(sb)->s_journal. Causing us to trigger this assertion in jbd2_journal_start:

       292         if (handle) {                                                           
       293                 J_ASSERT(handle->h_transaction->t_journal == journal);          
       294                 handle->h_ref++;                                                
       295                 return handle;                                                  
       296         }
      

      Attachments

        Issue Links

          Activity

            People

              niu Niu Yawei (Inactive)
              prakash Prakash Surya (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: