Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14453

LDISKFS-fs error (device xxx) in ldiskfs_do_update_inode:4840: Journal has aborted

Details

    • Bug
    • Resolution: Not a Bug
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This affects 2.14.0-RC3. Hit this a couple of times under I/O stress testing (

      [349195.535733] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure
      [349195.536702] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure
      [349195.563895] Aborting journal on device md131.
      [349195.564407] Aborting journal on device md131.
      [349195.564433] LDISKFS-fs error (device dm-0) in ldiskfs_orphan_add:3367: Journal has aborted
      [349195.564435] LDISKFS-fs error (device dm-0) in ldiskfs_orphan_del:3447: Journal has aborted
      [349195.564439] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
      [349195.564442] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
      [349195.564445] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
      [349195.564483] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
      [349195.564485] Kernel panic - not syncing: LDISKFS-fs (device dm-0): panic forced after error
      
      [349195.564486] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
      [349195.564489] CPU: 18 PID: 25430 Comm: ll_ost00_576 Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1
      [349195.564490] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
      [349195.564491] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
      [349195.564492] Call Trace:
      [349195.564500]  [<ffffffffb3763e41>] dump_stack+0x19/0x1b
      [349195.564519]  [<ffffffffb375d550>] panic+0xe8/0x21f
      [349195.564538]  [<ffffffffc1f1e156>] ldiskfs_handle_error.part.191+0xa6/0xb0 [ldiskfs]
      [349195.564550]  [<ffffffffc1f1e7db>] __ldiskfs_std_error+0x7b/0x100 [ldiskfs]
      [349195.564562]  [<ffffffffc1f3708a>] ldiskfs_free_blocks+0xa1a/0xbb0 [ldiskfs]
      [349195.564570]  [<ffffffffc1eedef9>] ldiskfs_ext_remove_space+0x8a9/0x1150 [ldiskfs]
      [349195.564576]  [<ffffffffc1f06525>] ? ldiskfs_do_update_inode+0x525/0x810 [ldiskfs]
      [349195.564582]  [<ffffffffc1ef0670>] ldiskfs_ext_truncate+0xb0/0xe0 [ldiskfs]
      [349195.564588]  [<ffffffffc1f0976a>] ldiskfs_truncate+0x3da/0x430 [ldiskfs]
      [349195.564594]  [<ffffffffc1f0a4ea>] ldiskfs_evict_inode+0x58a/0x630 [ldiskfs]
      [349195.564596]  [<ffffffffb325eac4>] evict+0xb4/0x180
      [349195.564597]  [<ffffffffb325f3cc>] iput+0xfc/0x190
      [349195.564606]  [<ffffffffc1f7f137>] osd_object_delete+0x1e7/0x360 [osd_ldiskfs]
      [349195.564636]  [<ffffffffc1576ff8>] lu_object_free.isra.27+0xb8/0x1c0 [obdclass]
      [349195.564653]  [<ffffffffc157a955>] lu_object_put+0xa5/0x430 [obdclass]
      [349195.564661]  [<ffffffffc1dea40e>] ofd_destroy_by_fid+0x20e/0x500 [ofd]
      [349195.564697]  [<ffffffffc183a820>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
      [349195.564720]  [<ffffffffc1836b80>] ? ldlm_expired_completion_wait+0x2a0/0x2a0 [ptlrpc]
      [349195.564725]  [<ffffffffc1ddc307>] ofd_destroy_hdl+0x267/0x9f0 [ofd]
      [349195.564762]  [<ffffffffc18d472a>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
      [349195.564816]  [<ffffffffc18ad241>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [349195.564843]  [<ffffffffc0e1702e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [349195.564898]  [<ffffffffc18767a6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [349195.564902]  [<ffffffffb30cec64>] ? __wake_up+0x44/0x50
      [349195.564939]  [<ffffffffc187b2fc>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
      [349195.564974]  [<ffffffffc187a7c0>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
      [349195.564976]  [<ffffffffb30c1c31>] kthread+0xd1/0xe0
      [349195.564977]  [<ffffffffb30c1b60>] ? insert_kthread_work+0x40/0x40
      [349195.564979]  [<ffffffffb3776c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [349195.564980]  [<ffffffffb30c1b60>] ? insert_kthread_work+0x40/0x40
      

      And an earlier hit on RC2:

      [59746.814376] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932e1294c9c0] [phy 16627861504] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042458]
      [59746.814380] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff9330c3183ea0] [phy 16627863552] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67344087]
      [59746.814383] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932b18244888] [phy 16627865600] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042515]
      [59746.814404] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff93206a2f1478] [phy 16627867648] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042531]
      [59746.814409] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932e12848c98] [phy 16627869696] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042485]
      [59746.814416] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932de984eaf8] [phy 16627871744] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042535]
      [59746.814469] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff9330575f4618] [phy 16627873792] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67344057]
      [59746.814475] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932fff6d21a0] [phy 16627875840] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042505]
      [59746.814536] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932de984f8f0] [phy 16627877888] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042513]
      [59746.814539] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff933057701270] [phy 16627879936] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67344093]
      [59870.004253] LDISKFS-fs warning: 39 callbacks suppressed
      [59870.011429] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff9329432eac98] [phy 16628195328] [logic 8192] [len 8192] [free 8192] [error 1] [inode 67042426]
      [59876.053222] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932fff61bbc8] [phy 16628289536] [logic 4096] [len 4096] [free 4096] [error 1] [inode 67344041]
      [59876.153987] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff93295856d6e8] [phy 16628322304] [logic 4096] [len 4096] [free 4096] [error 1] [inode 67042443]
      [59886.946880] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure
      [59887.691671] Aborting journal on device md129.
      [59887.890886] Kernel panic - not syncing: LDISKFS-fs (device dm-0): panic forced after error
      
      [59887.903674] CPU: 2 PID: 15660 Comm: ll_ost00_357 Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1
      [59887.919307] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
      [59887.928496] Call Trace:
      [59887.932396]  [<ffffffff97b63e41>] dump_stack+0x19/0x1b
      [59887.938877]  [<ffffffff97b5d550>] panic+0xe8/0x21f
      [59887.944972]  [<ffffffffc1cec156>] ldiskfs_handle_error.part.191+0xa6/0xb0 [ldiskfs]
      [59887.953962]  [<ffffffffc1cec7db>] __ldiskfs_std_error+0x7b/0x100 [ldiskfs]
      [59887.962318]  [<ffffffffc1d0508a>] ldiskfs_free_blocks+0xa1a/0xbb0 [ldiskfs]
      [59887.970693]  [<ffffffff9761b591>] ? __slab_free+0x81/0x2f0
      [59887.977580]  [<ffffffffc1cbbef9>] ldiskfs_ext_remove_space+0x8a9/0x1150 [ldiskfs]
      [59887.986704]  [<ffffffffc1cfa749>] ? ldiskfs_es_free_extent+0x99/0x130 [ldiskfs]
      [59887.995452]  [<ffffffffc1cbe670>] ldiskfs_ext_truncate+0xb0/0xe0 [ldiskfs]
      [59888.003741]  [<ffffffffc1cd776a>] ldiskfs_truncate+0x3da/0x430 [ldiskfs]
      [59888.011910]  [<ffffffffc1cd84ea>] ldiskfs_evict_inode+0x58a/0x630 [ldiskfs]
      [59888.020159]  [<ffffffff9765eac4>] evict+0xb4/0x180
      [59888.026205]  [<ffffffff9765f3cc>] iput+0xfc/0x190
      [59888.032015]  [<ffffffffc1d4d137>] osd_object_delete+0x1e7/0x360 [osd_ldiskfs]
      [59888.040336]  [<ffffffffc1377ff8>] lu_object_free.isra.27+0xb8/0x1c0 [obdclass]
      [59888.048763]  [<ffffffffc137b955>] lu_object_put+0xa5/0x430 [obdclass]
      [59888.056395]  [<ffffffffc1bb840e>] ofd_destroy_by_fid+0x20e/0x500 [ofd]
      [59888.064161]  [<ffffffffc163b820>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
      [59888.072993]  [<ffffffffc1637b80>] ? ldlm_expired_completion_wait+0x2a0/0x2a0 [ptlrpc]
      [59888.081995]  [<ffffffffc1baa307>] ofd_destroy_hdl+0x267/0x9f0 [ofd]
      [59888.089407]  [<ffffffffc16d572a>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
      [59888.097512]  [<ffffffffc16ae241>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [59888.106229]  [<ffffffffc0fac02e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [59888.114405]  [<ffffffffc16777a6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [59888.123233]  [<ffffffff974cec64>] ? __wake_up+0x44/0x50
      [59888.129602]  [<ffffffffc167c2fc>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
      [59888.133296] LDISKFS-fs error (device dm-0) in ldiskfs_delete_entry:2838: Journal has aborted
      [59888.146602]  [<ffffffffc167b7c0>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
      [59888.155066]  [<ffffffff974c1c31>] kthread+0xd1/0xe0
      [59888.161030]  [<ffffffff974c1b60>] ? insert_kthread_work+0x40/0x40
      [59888.168206]  [<ffffffff97b76c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [59888.175648]  [<ffffffff974c1b60>] ? insert_kthread_work+0x40/0x40
      

      Attachments

        Issue Links

          Activity

            [LU-14453] LDISKFS-fs error (device xxx) in ldiskfs_do_update_inode:4840: Journal has aborted
            pjones Peter Jones added a comment -

            Thanks for the analysis Andrew. It seems like this issue does not affect the community releases then

            pjones Peter Jones added a comment - Thanks for the analysis Andrew. It seems like this issue does not affect the community releases then
            panda Andrew Perepechko added a comment - - edited

            adilger, only RHEL 7.6 seems to be affected.

            panda Andrew Perepechko added a comment - - edited adilger , only RHEL 7.6 seems to be affected.

            Thanks for the detailed explanation.

            Is that patch still in use for EL7.9 or is it only affecting EL7.6/7.7?

            adilger Andreas Dilger added a comment - Thanks for the detailed explanation. Is that patch still in use for EL7.9 or is it only affecting EL7.6/7.7?
            panda Andrew Perepechko added a comment - - edited

            This bug happens because of the original buggy version of LU-12988 prefetch code which was split to a separate RHEL 7.6 series by commit fc87b01 (LU-12477) and never fixed by the updated patches in LU-12988.

            Author: James Simmons <jsimmons@infradead.org>
            Date:   Thu Mar 5 20:40:47 2020 -0500
            
                LU-12477 ldiskfs: remove obsolete ext4 patches
            
            ldiskfs/kernel_patches/patches/rhel7.6# git blame ext4-mballoc-prefetch.patch
            
            ...
            
            fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  27) @@ -524,7 +532,7 @@ ext4_read_block_bitmap(struct super_b
            fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  28)      struct buffer_head *bh;
            fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  29)      int err;
            fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  30)  
            fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  31) -    bh = ext4_read_block_bitmap_nowait(sb, block_group);
            fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  32) +    bh = ext4_read_block_bitmap_nowait(sb, block_group, 1);
            fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  33)      if (!bh)
            fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  34)              return NULL;
            fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  35)      err = ext4_wait_block_bitmap(sb, block_group, bh);
            
            panda Andrew Perepechko added a comment - - edited This bug happens because of the original buggy version of LU-12988 prefetch code which was split to a separate RHEL 7.6 series by commit fc87b01 ( LU-12477 ) and never fixed by the updated patches in LU-12988 . Author: James Simmons <jsimmons@infradead.org> Date: Thu Mar 5 20:40:47 2020 -0500 LU-12477 ldiskfs: remove obsolete ext4 patches ldiskfs/kernel_patches/patches/rhel7.6# git blame ext4-mballoc-prefetch.patch ... fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 27) @@ -524,7 +532,7 @@ ext4_read_block_bitmap(struct super_b fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 28) struct buffer_head *bh; fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 29) int err; fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 30) fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 31) - bh = ext4_read_block_bitmap_nowait(sb, block_group); fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 32) + bh = ext4_read_block_bitmap_nowait(sb, block_group, 1); fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 33) if (!bh) fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 34) return NULL; fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 35) err = ext4_wait_block_bitmap(sb, block_group, bh);

            can you please attach full system log?

            bzzz Alex Zhuravlev added a comment - can you please attach full system log?
            spitzcor Cory Spitz added a comment - - edited

            Right, we agree it looks like IO errors below Lustre. This storage HW runs 2.12.x OK and we're seeing this (and more related ldiskfs issues) on multiple systems without any indication of lower level IO errors. We're continuing our investigations.

            spitzcor Cory Spitz added a comment - - edited Right, we agree it looks like IO errors below Lustre. This storage HW runs 2.12.x OK and we're seeing this (and more related ldiskfs issues) on multiple systems without any indication of lower level IO errors. We're continuing our investigations.

            The second failure also starts with IO errors, so this looks like it is a storage problem.

            adilger Andreas Dilger added a comment - The second failure also starts with IO errors, so this looks like it is a storage problem.

            something went wrong with low-level IO:

            [349195.535733] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure
            
            bzzz Alex Zhuravlev added a comment - something went wrong with low-level IO: [349195.535733] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure

            People

              wc-triage WC Triage
              stancheff Shaun Tancheff
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: