[LU-14453] LDISKFS-fs error (device xxx) in ldiskfs_do_update_inode:4840: Journal has aborted Created: 19/Feb/21  Updated: 27/Feb/21  Resolved: 27/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Shaun Tancheff Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12477 Remove obsolete config checks Resolved
is related to LU-12988 improve mount time on huge ldiskfs fi... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This affects 2.14.0-RC3. Hit this a couple of times under I/O stress testing (

[349195.535733] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure
[349195.536702] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure
[349195.563895] Aborting journal on device md131.
[349195.564407] Aborting journal on device md131.
[349195.564433] LDISKFS-fs error (device dm-0) in ldiskfs_orphan_add:3367: Journal has aborted
[349195.564435] LDISKFS-fs error (device dm-0) in ldiskfs_orphan_del:3447: Journal has aborted
[349195.564439] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
[349195.564442] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
[349195.564445] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
[349195.564483] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
[349195.564485] Kernel panic - not syncing: LDISKFS-fs (device dm-0): panic forced after error

[349195.564486] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
[349195.564489] CPU: 18 PID: 25430 Comm: ll_ost00_576 Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1
[349195.564490] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
[349195.564491] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted
[349195.564492] Call Trace:
[349195.564500]  [<ffffffffb3763e41>] dump_stack+0x19/0x1b
[349195.564519]  [<ffffffffb375d550>] panic+0xe8/0x21f
[349195.564538]  [<ffffffffc1f1e156>] ldiskfs_handle_error.part.191+0xa6/0xb0 [ldiskfs]
[349195.564550]  [<ffffffffc1f1e7db>] __ldiskfs_std_error+0x7b/0x100 [ldiskfs]
[349195.564562]  [<ffffffffc1f3708a>] ldiskfs_free_blocks+0xa1a/0xbb0 [ldiskfs]
[349195.564570]  [<ffffffffc1eedef9>] ldiskfs_ext_remove_space+0x8a9/0x1150 [ldiskfs]
[349195.564576]  [<ffffffffc1f06525>] ? ldiskfs_do_update_inode+0x525/0x810 [ldiskfs]
[349195.564582]  [<ffffffffc1ef0670>] ldiskfs_ext_truncate+0xb0/0xe0 [ldiskfs]
[349195.564588]  [<ffffffffc1f0976a>] ldiskfs_truncate+0x3da/0x430 [ldiskfs]
[349195.564594]  [<ffffffffc1f0a4ea>] ldiskfs_evict_inode+0x58a/0x630 [ldiskfs]
[349195.564596]  [<ffffffffb325eac4>] evict+0xb4/0x180
[349195.564597]  [<ffffffffb325f3cc>] iput+0xfc/0x190
[349195.564606]  [<ffffffffc1f7f137>] osd_object_delete+0x1e7/0x360 [osd_ldiskfs]
[349195.564636]  [<ffffffffc1576ff8>] lu_object_free.isra.27+0xb8/0x1c0 [obdclass]
[349195.564653]  [<ffffffffc157a955>] lu_object_put+0xa5/0x430 [obdclass]
[349195.564661]  [<ffffffffc1dea40e>] ofd_destroy_by_fid+0x20e/0x500 [ofd]
[349195.564697]  [<ffffffffc183a820>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[349195.564720]  [<ffffffffc1836b80>] ? ldlm_expired_completion_wait+0x2a0/0x2a0 [ptlrpc]
[349195.564725]  [<ffffffffc1ddc307>] ofd_destroy_hdl+0x267/0x9f0 [ofd]
[349195.564762]  [<ffffffffc18d472a>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
[349195.564816]  [<ffffffffc18ad241>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[349195.564843]  [<ffffffffc0e1702e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[349195.564898]  [<ffffffffc18767a6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[349195.564902]  [<ffffffffb30cec64>] ? __wake_up+0x44/0x50
[349195.564939]  [<ffffffffc187b2fc>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
[349195.564974]  [<ffffffffc187a7c0>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[349195.564976]  [<ffffffffb30c1c31>] kthread+0xd1/0xe0
[349195.564977]  [<ffffffffb30c1b60>] ? insert_kthread_work+0x40/0x40
[349195.564979]  [<ffffffffb3776c1d>] ret_from_fork_nospec_begin+0x7/0x21
[349195.564980]  [<ffffffffb30c1b60>] ? insert_kthread_work+0x40/0x40

And an earlier hit on RC2:

[59746.814376] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932e1294c9c0] [phy 16627861504] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042458]
[59746.814380] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff9330c3183ea0] [phy 16627863552] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67344087]
[59746.814383] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932b18244888] [phy 16627865600] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042515]
[59746.814404] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff93206a2f1478] [phy 16627867648] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042531]
[59746.814409] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932e12848c98] [phy 16627869696] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042485]
[59746.814416] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932de984eaf8] [phy 16627871744] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042535]
[59746.814469] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff9330575f4618] [phy 16627873792] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67344057]
[59746.814475] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932fff6d21a0] [phy 16627875840] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042505]
[59746.814536] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932de984f8f0] [phy 16627877888] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042513]
[59746.814539] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff933057701270] [phy 16627879936] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67344093]
[59870.004253] LDISKFS-fs warning: 39 callbacks suppressed
[59870.011429] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff9329432eac98] [phy 16628195328] [logic 8192] [len 8192] [free 8192] [error 1] [inode 67042426]
[59876.053222] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932fff61bbc8] [phy 16628289536] [logic 4096] [len 4096] [free 4096] [error 1] [inode 67344041]
[59876.153987] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff93295856d6e8] [phy 16628322304] [logic 4096] [len 4096] [free 4096] [error 1] [inode 67042443]
[59886.946880] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure
[59887.691671] Aborting journal on device md129.
[59887.890886] Kernel panic - not syncing: LDISKFS-fs (device dm-0): panic forced after error

[59887.903674] CPU: 2 PID: 15660 Comm: ll_ost00_357 Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1
[59887.919307] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
[59887.928496] Call Trace:
[59887.932396]  [<ffffffff97b63e41>] dump_stack+0x19/0x1b
[59887.938877]  [<ffffffff97b5d550>] panic+0xe8/0x21f
[59887.944972]  [<ffffffffc1cec156>] ldiskfs_handle_error.part.191+0xa6/0xb0 [ldiskfs]
[59887.953962]  [<ffffffffc1cec7db>] __ldiskfs_std_error+0x7b/0x100 [ldiskfs]
[59887.962318]  [<ffffffffc1d0508a>] ldiskfs_free_blocks+0xa1a/0xbb0 [ldiskfs]
[59887.970693]  [<ffffffff9761b591>] ? __slab_free+0x81/0x2f0
[59887.977580]  [<ffffffffc1cbbef9>] ldiskfs_ext_remove_space+0x8a9/0x1150 [ldiskfs]
[59887.986704]  [<ffffffffc1cfa749>] ? ldiskfs_es_free_extent+0x99/0x130 [ldiskfs]
[59887.995452]  [<ffffffffc1cbe670>] ldiskfs_ext_truncate+0xb0/0xe0 [ldiskfs]
[59888.003741]  [<ffffffffc1cd776a>] ldiskfs_truncate+0x3da/0x430 [ldiskfs]
[59888.011910]  [<ffffffffc1cd84ea>] ldiskfs_evict_inode+0x58a/0x630 [ldiskfs]
[59888.020159]  [<ffffffff9765eac4>] evict+0xb4/0x180
[59888.026205]  [<ffffffff9765f3cc>] iput+0xfc/0x190
[59888.032015]  [<ffffffffc1d4d137>] osd_object_delete+0x1e7/0x360 [osd_ldiskfs]
[59888.040336]  [<ffffffffc1377ff8>] lu_object_free.isra.27+0xb8/0x1c0 [obdclass]
[59888.048763]  [<ffffffffc137b955>] lu_object_put+0xa5/0x430 [obdclass]
[59888.056395]  [<ffffffffc1bb840e>] ofd_destroy_by_fid+0x20e/0x500 [ofd]
[59888.064161]  [<ffffffffc163b820>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[59888.072993]  [<ffffffffc1637b80>] ? ldlm_expired_completion_wait+0x2a0/0x2a0 [ptlrpc]
[59888.081995]  [<ffffffffc1baa307>] ofd_destroy_hdl+0x267/0x9f0 [ofd]
[59888.089407]  [<ffffffffc16d572a>] tgt_request_handle+0x7ea/0x1750 [ptlrpc]
[59888.097512]  [<ffffffffc16ae241>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[59888.106229]  [<ffffffffc0fac02e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[59888.114405]  [<ffffffffc16777a6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[59888.123233]  [<ffffffff974cec64>] ? __wake_up+0x44/0x50
[59888.129602]  [<ffffffffc167c2fc>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
[59888.133296] LDISKFS-fs error (device dm-0) in ldiskfs_delete_entry:2838: Journal has aborted
[59888.146602]  [<ffffffffc167b7c0>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[59888.155066]  [<ffffffff974c1c31>] kthread+0xd1/0xe0
[59888.161030]  [<ffffffff974c1b60>] ? insert_kthread_work+0x40/0x40
[59888.168206]  [<ffffffff97b76c1d>] ret_from_fork_nospec_begin+0x7/0x21
[59888.175648]  [<ffffffff974c1b60>] ? insert_kthread_work+0x40/0x40


 Comments   
Comment by Alex Zhuravlev [ 19/Feb/21 ]

something went wrong with low-level IO:

[349195.535733] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure
Comment by Andreas Dilger [ 19/Feb/21 ]

The second failure also starts with IO errors, so this looks like it is a storage problem.

Comment by Cory Spitz [ 19/Feb/21 ]

Right, we agree it looks like IO errors below Lustre. This storage HW runs 2.12.x OK and we're seeing this (and more related ldiskfs issues) on multiple systems without any indication of lower level IO errors. We're continuing our investigations.

Comment by Alex Zhuravlev [ 20/Feb/21 ]

can you please attach full system log?

Comment by Andrew Perepechko [ 27/Feb/21 ]

This bug happens because of the original buggy version of LU-12988 prefetch code which was split to a separate RHEL 7.6 series by commit fc87b01 (LU-12477) and never fixed by the updated patches in LU-12988.

Author: James Simmons <jsimmons@infradead.org>
Date:   Thu Mar 5 20:40:47 2020 -0500

    LU-12477 ldiskfs: remove obsolete ext4 patches
ldiskfs/kernel_patches/patches/rhel7.6# git blame ext4-mballoc-prefetch.patch

...

fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  27) @@ -524,7 +532,7 @@ ext4_read_block_bitmap(struct super_b
fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  28)      struct buffer_head *bh;
fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  29)      int err;
fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  30)  
fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  31) -    bh = ext4_read_block_bitmap_nowait(sb, block_group);
fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  32) +    bh = ext4_read_block_bitmap_nowait(sb, block_group, 1);
fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  33)      if (!bh)
fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  34)              return NULL;
fc87b01f96e (James Simmons  2020-03-05 20:40:47 -0500  35)      err = ext4_wait_block_bitmap(sb, block_group, bh);
Comment by Andreas Dilger [ 27/Feb/21 ]

Thanks for the detailed explanation.

Is that patch still in use for EL7.9 or is it only affecting EL7.6/7.7?

Comment by Andrew Perepechko [ 27/Feb/21 ]

adilger, only RHEL 7.6 seems to be affected.

Comment by Peter Jones [ 27/Feb/21 ]

Thanks for the analysis Andrew. It seems like this issue does not affect the community releases then

Generated at Sat Feb 10 03:09:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.