kernel BUG at fs/jbd2/transaction.c:1033


    Description

      The MDT crashed with:

      <4>------------[ cut here ]------------
      <2>kernel BUG at fs/jbd2/transaction.c:1033!
      [1]kdb> sr 8
      SysRq : Changing Loglevel
      Loglevel set to 8
      [1]kdb> sr p
      SysRq : Show Regs
      CPU 1
      Modules linked in: osp(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) jbd2 mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) dm_round_robin scsi_dh_rdac lpfc(U) scsi_transport_fc scsi_tgt nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc bonding 8021q garp stp llc ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) dm_multipath tcp_bic power_meter dcdbas microcode iTCO_wdt iTCO_vendor_support shpchp mlx4_core(U) memtrack(U) ses enclosure sg tg3 hwmon ext3 jbd sd_mod crc_t10dif wmi megaraid_sas dm_mirror dm_region_hash dm_log dm_mod gru [last unloaded: scsi_wait_scan]

      Pid: 13917, comm: mdt_rdpg02_017 Not tainted 2.6.32-358.23.2.el6.20140115.x86_64.lustre241 #1 Dell Inc. PowerEdge R720/0VWT90
      RIP: 0010:[<ffffffffa0bd88ad>]  [<ffffffffa0bd88ad>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
      RSP: 0018:ffff880f537198a0  EFLAGS: 00010246
      RAX: ffff880f88da9cc0 RBX: ffff880eb8352d08 RCX: ffff880bf382b610
      RDX: 0000000000000000 RSI: ffff880bf382b610 RDI: 0000000000000000
      RBP: ffff880f537198c0 R08: 2010000000000000 R09: f3ee8046d0a58402
      R10: 0000000000000001 R11: ffff880863dd6e10 R12: ffff880f4897f518
      R13: ffff880bf382b610 R14: ffff881007dcc800 R15: 0000000000000008
      FS:  00007fffedaf3700(0000) GS:ffff88084c400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 000000000061c9b8 CR3: 0000000001a25000 CR4: 00000000000407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process mdt_rdpg02_017 (pid: 13917, threadinfo ffff880f53718000, task ffff880f5370aae0)
      Stack:
       ffff880eb8352d08 ffffffffa0ca92d0 ffff880bf382b610 0000000000000000
      <d> ffff880f53719900 ffffffffa0c680bb ffff880f537198f0 ffffffff810962ff
      <d> ffff8810213f3350 ffff880eb8352d08 0000000000000018 ffff880bf382b610
      Call Trace:
       [<ffffffffa0c680bb>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
       [<ffffffff810962ff>] ? wake_up_bit+0x2f/0x40
       [<ffffffffa0c9ea55>] ldiskfs_quota_write+0x165/0x210 [ldiskfs]
       [<ffffffff811e2221>] v2_write_file_info+0xa1/0xe0
       [<ffffffff811de328>] dquot_acquire+0x138/0x140
       [<ffffffffa0c9d5f6>] ldiskfs_acquire_dquot+0x66/0xb0 [ldiskfs]
       [<ffffffff811e029c>] dqget+0x2ac/0x390
       [<ffffffff811e0848>] dquot_initialize+0x98/0x240
       [<ffffffffa0c9d812>] ldiskfs_dquot_initialize+0x62/0xc0 [ldiskfs]
       [<ffffffffa0cf8d6f>] osd_attr_set+0x12f/0x540 [osd_ldiskfs]
       [<ffffffffa0eb15cb>] lod_attr_set+0x12b/0x450 [lod]
       [<ffffffffa0b6d411>] mdd_attr_set_internal+0x151/0x230 [mdd]
       [<ffffffffa0b706ea>] mdd_attr_set+0x107a/0x1390 [mdd]
       [<ffffffffa06fd011>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc]
       [<ffffffffa0e0e182>] mdt_mfd_close+0x502/0x6e0 [mdt]
       [<ffffffffa0e0f73a>] mdt_close+0x67a/0xab0 [mdt]
       [<ffffffffa0de7ad7>] mdt_handle_common+0x647/0x16d0 [mdt]
       [<ffffffffa0e21635>] mds_readpage_handle+0x15/0x20 [mdt]
       [<ffffffffa070d3d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa04175de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa0428d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0704739>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffff81055813>] ? __wake_up+0x53/0x70
       [<ffffffffa070e76e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
       [<ffffffffa070dca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffffa070dca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffffa070dca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Code: c6 9c 03 00 00 4c 89 f7 e8 11 97 96 e0 48 8b 33 ba 01 00 00 00 4c 89 e7 e8 11 ec ff ff 4c 89 f0 66 ff 00 66 66 90 e9 73 ff ff ff <0f> 0b eb fe 0f 0b eb fe 0f 0b 66 
      Call Trace:
       [<ffffffffa0c680bb>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
       [<ffffffff810962ff>] ? wake_up_bit+0x2f/0x40
       [<ffffffffa0c9ea55>] ldiskfs_quota_write+0x165/0x210 [ldiskfs]
       [<ffffffff811e2221>] v2_write_file_info+0xa1/0xe0
       [<ffffffff811de328>] dquot_acquire+0x138/0x140
       [<ffffffffa0c9d5f6>] ldiskfs_acquire_dquot+0x66/0xb0 [ldiskfs]
       [<ffffffff811e029c>] dqget+0x2ac/0x390
       [<ffffffff811e0848>] dquot_initialize+0x98/0x240
       [<ffffffffa0c9d812>] ldiskfs_dquot_initialize+0x62/0xc0 [ldiskfs]
       [<ffffffffa0cf8d6f>] osd_attr_set+0x12f/0x540 [osd_ldiskfs]
       [<ffffffffa0eb15cb>] lod_attr_set+0x12b/0x450 [lod]
       [<ffffffffa0b6d411>] mdd_attr_set_internal+0x151/0x230 [mdd]
       [<ffffffffa0b706ea>] mdd_attr_set+0x107a/0x1390 [mdd]
       [<ffffffffa06fd011>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc]
       [<ffffffffa0e0e182>] mdt_mfd_close+0x502/0x6e0 [mdt]
       [<ffffffffa0e0f73a>] mdt_close+0x67a/0xab0 [mdt]
       [<ffffffffa0de7ad7>] mdt_handle_common+0x647/0x16d0 [mdt]
       [<ffffffffa0e21635>] mds_readpage_handle+0x15/0x20 [mdt]
       [<ffffffffa070d3d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa04175de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa0428d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0704739>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffff81055813>] ? __wake_up+0x53/0x70
       [<ffffffffa070e76e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
       [<ffffffffa070dca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffffa070dca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffffa070dca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
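
For readers who want to compare this trace against other reports, the call chain can be pulled out of the console text mechanically. The following is only a rough sketch: the function name `reliable_frames`, the regular expression, and the "kernel" label for frames that carry no module tag are illustrative choices, not part of any Lustre or kernel tooling. It drops frames marked with `?`, which the kernel prints for addresses it found on the stack but could not confirm as part of the call chain, and it tolerates literal `^M` markers that raw serial console captures may carry.

```python
import re

# One stack frame: [<address>] optional "? " marker, function+0xOFF/0xSIZE, optional [module].
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(console_text):
    """Return (function, module) pairs for frames not marked with '?'."""
    frames = []
    for line in console_text.splitlines():
        # Drop carriage-return residue, whether literal "^M" text or a real "\r".
        line = line.replace("^M", "").rstrip("\r")
        match = FRAME_RE.search(line)
        if match and not match.group("unreliable"):
            # Frames without a [module] tag belong to the core kernel image.
            frames.append((match.group("func"), match.group("module") or "kernel"))
    return frames


sample = """\
 [<ffffffffa0c680bb>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]^M
 [<ffffffff810962ff>] ? wake_up_bit+0x2f/0x40^M
 [<ffffffffa0c9ea55>] ldiskfs_quota_write+0x165/0x210 [ldiskfs]^M
 [<ffffffff811e2221>] v2_write_file_info+0xa1/0xe0^M
"""

for func, module in reliable_frames(sample):
    print(f"{module:10s} {func}")
```

Applied to the full trace above, this kind of filter leaves the path from `ptlrpc_main` through `mdt_close`, `osd_attr_set` and the quota code down to `__ldiskfs_handle_dirty_metadata`.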
      

      After recovery, it crashed again at the same place.

      AFTER RECOVERY

      Lustre: nbp7-MDT0000: recovery is timed out, evict stale exports
      Lustre: nbp7-MDT0000: disconnecting 30 stale clients
      LustreError: 5667:0:(mdt_lvb.c:157:mdt_lvbo_fill()) nbp7-MDT0000: expected 56 actual 0.
      Lustre: nbp7-MDT0000: Recovery over after 5:02, of 11832 clients 11802 recovered and 30 were evicted.
      ------------[ cut here ]------------
      kernel BUG at fs/jbd2/transaction.c:1033!
      

      Rebooted and ran fsck.

      Ran recovery; it crashed again at the same place.

      Rebooted and mounted with abort recovery; no crash so far.

          Activity

            [LU-5040] kernel BUG at fs/jbd2/transaction.c:1033
            jaylan Jay Lan (Inactive) added a comment - - edited

            I saw that you backported the LU-5777 patch to b2_5. My cherry-pick went cleanly.
            I will wait for your patch to clear autotest before I build it. Thanks, Zhenyu!


            jaylan Jay Lan (Inactive) added a comment -

            Hi Zhenyu,

            We do not have the LU-5777 patch in our 2.4.3 repo. Hmm, it is not in our 2.5.3 either.

            bobijam Zhenyu Xu added a comment -

            Hi Jay,

            Does your repository have the LU-5777 patch? That issue will also cause a credits deficiency, so we'd backport it too.

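
The "credits deficiency" mentioned in the comment above can be pictured with a small toy model. This is a sketch under stated assumptions, not the jbd2 code: `ToyHandle`, `OutOfCredits` and the block names are hypothetical, and the model only captures the general idea that a journal handle is opened with a fixed number of buffer credits, that dirtying a metadata buffer for the first time consumes one, and that running out is treated as a fatal accounting error. The scenario shown, where a quota-file write was not budgeted for, is an illustration inspired by the `ldiskfs_quota_write` frame in the trace, not a confirmed root cause.

```python
class OutOfCredits(Exception):
    """Stand-in for the fatal assertion; the real kernel hits a BUG instead."""


class ToyHandle:
    """Hypothetical model of a journal handle with a fixed credit budget."""

    def __init__(self, credits):
        self.credits = credits      # metadata buffers this handle may still dirty
        self.dirtied = set()        # buffers already counted against the budget

    def dirty_metadata(self, block):
        if block in self.dirtied:
            return                  # re-dirtying the same buffer costs nothing
        if self.credits <= 0:
            raise OutOfCredits(f"no credits left to dirty {block!r}")
        self.credits -= 1
        self.dirtied.add(block)


# The caller budgets one credit for an attribute update ...
handle = ToyHandle(credits=1)
handle.dirty_metadata("inode block")

# ... but the same operation also touches a block that was not budgeted for.
try:
    handle.dirty_metadata("quota file block")
except OutOfCredits as err:
    print("accounting error:", err)
```

In this picture the fix is on the caller's side: the operation has to reserve enough credits up front for every buffer it may dirty, including those reached indirectly.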
            jaylan Jay Lan (Inactive) added a comment - - edited

            We hit this bug again in production. It hit an OSS. The backtrace looks exactly the same as that in Mahmoud's comment on 04/Aug/14 10:43 AM.

            We have patch set #5 from http://review.whamcloud.com/11097 in our git repo.
            The only difference between #5 and #6 was an extra empty line.
            The only difference between #6 and #7 was in the commit message.

            pjones Peter Jones added a comment -

            Landed for 2.5.4 and 2.7


            jaylan Jay Lan (Inactive) added a comment -

            Thank you, Zhenyu, for the update. I will pick up the new patch set.

            bobijam Zhenyu Xu added a comment -

            The patch has been updated based on the review results.


            jaylan Jay Lan (Inactive) added a comment -

            That is fine, Zhenyu.

            Peter mentioned that we used to have too much information in JIRA, and thus Intel no longer logs Gerrit messages to JIRA.

            We do not need messages about Jenkins, Autotest or Maloo. A simple "Patch Set # uploaded" message to JIRA for every new patch set is sufficient, and I do not consider it noisy. I think it can be implemented in your system.

            bobijam Zhenyu Xu added a comment -

            Sorry about that. I forgot to update here and only updated Gerrit.


            mhanafi Mahmoud Hanafi added a comment -

            I think the LU should be updated when the provided patch is changed or updated.

            People

              bobijam Zhenyu Xu
              mhanafi Mahmoud Hanafi