[LU-14453] LDISKFS-fs error (device xxx) in ldiskfs_do_update_inode:4840: Journal has aborted Created: 19/Feb/21 Updated: 27/Feb/21 Resolved: 27/Feb/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Shaun Tancheff | Assignee: | WC Triage |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
This affects 2.14.0-RC3. Hit this a couple of times under I/O stress testing ( [349195.535733] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure [349195.536702] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure [349195.563895] Aborting journal on device md131. [349195.564407] Aborting journal on device md131. [349195.564433] LDISKFS-fs error (device dm-0) in ldiskfs_orphan_add:3367: Journal has aborted [349195.564435] LDISKFS-fs error (device dm-0) in ldiskfs_orphan_del:3447: Journal has aborted [349195.564439] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted [349195.564442] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted [349195.564445] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted [349195.564483] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted [349195.564485] Kernel panic - not syncing: LDISKFS-fs (device dm-0): panic forced after error [349195.564486] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted [349195.564489] CPU: 18 PID: 25430 Comm: ll_ost00_576 Kdump: loaded Tainted: P OE ------------ 3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1 [349195.564490] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018 [349195.564491] LDISKFS-fs error (device dm-0) in ldiskfs_do_update_inode:4840: Journal has aborted [349195.564492] Call Trace: [349195.564500] [<ffffffffb3763e41>] dump_stack+0x19/0x1b [349195.564519] [<ffffffffb375d550>] panic+0xe8/0x21f [349195.564538] [<ffffffffc1f1e156>] ldiskfs_handle_error.part.191+0xa6/0xb0 [ldiskfs] [349195.564550] [<ffffffffc1f1e7db>] __ldiskfs_std_error+0x7b/0x100 [ldiskfs] [349195.564562] [<ffffffffc1f3708a>] ldiskfs_free_blocks+0xa1a/0xbb0 [ldiskfs] [349195.564570] [<ffffffffc1eedef9>] ldiskfs_ext_remove_space+0x8a9/0x1150 [ldiskfs] [349195.564576] [<ffffffffc1f06525>] ? ldiskfs_do_update_inode+0x525/0x810 [ldiskfs] [349195.564582] [<ffffffffc1ef0670>] ldiskfs_ext_truncate+0xb0/0xe0 [ldiskfs] [349195.564588] [<ffffffffc1f0976a>] ldiskfs_truncate+0x3da/0x430 [ldiskfs] [349195.564594] [<ffffffffc1f0a4ea>] ldiskfs_evict_inode+0x58a/0x630 [ldiskfs] [349195.564596] [<ffffffffb325eac4>] evict+0xb4/0x180 [349195.564597] [<ffffffffb325f3cc>] iput+0xfc/0x190 [349195.564606] [<ffffffffc1f7f137>] osd_object_delete+0x1e7/0x360 [osd_ldiskfs] [349195.564636] [<ffffffffc1576ff8>] lu_object_free.isra.27+0xb8/0x1c0 [obdclass] [349195.564653] [<ffffffffc157a955>] lu_object_put+0xa5/0x430 [obdclass] [349195.564661] [<ffffffffc1dea40e>] ofd_destroy_by_fid+0x20e/0x500 [ofd] [349195.564697] [<ffffffffc183a820>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc] [349195.564720] [<ffffffffc1836b80>] ? ldlm_expired_completion_wait+0x2a0/0x2a0 [ptlrpc] [349195.564725] [<ffffffffc1ddc307>] ofd_destroy_hdl+0x267/0x9f0 [ofd] [349195.564762] [<ffffffffc18d472a>] tgt_request_handle+0x7ea/0x1750 [ptlrpc] [349195.564816] [<ffffffffc18ad241>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [349195.564843] [<ffffffffc0e1702e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [349195.564898] [<ffffffffc18767a6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc] [349195.564902] [<ffffffffb30cec64>] ? __wake_up+0x44/0x50 [349195.564939] [<ffffffffc187b2fc>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc] [349195.564974] [<ffffffffc187a7c0>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc] [349195.564976] [<ffffffffb30c1c31>] kthread+0xd1/0xe0 [349195.564977] [<ffffffffb30c1b60>] ? insert_kthread_work+0x40/0x40 [349195.564979] [<ffffffffb3776c1d>] ret_from_fork_nospec_begin+0x7/0x21 [349195.564980] [<ffffffffb30c1b60>] ? insert_kthread_work+0x40/0x40 And an earlier hit on RC2: [59746.814376] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932e1294c9c0] [phy 16627861504] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042458] [59746.814380] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff9330c3183ea0] [phy 16627863552] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67344087] [59746.814383] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932b18244888] [phy 16627865600] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042515] [59746.814404] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff93206a2f1478] [phy 16627867648] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042531] [59746.814409] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932e12848c98] [phy 16627869696] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042485] [59746.814416] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932de984eaf8] [phy 16627871744] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042535] [59746.814469] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff9330575f4618] [phy 16627873792] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67344057] [59746.814475] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932fff6d21a0] [phy 16627875840] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042505] [59746.814536] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932de984f8f0] [phy 16627877888] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67042513] [59746.814539] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff933057701270] [phy 16627879936] [logic 2048] [len 2048] [free 2048] [error 1] [inode 67344093] [59870.004253] LDISKFS-fs warning: 39 callbacks suppressed [59870.011429] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff9329432eac98] [phy 16628195328] [logic 8192] [len 8192] [free 8192] [error 1] [inode 67042426] [59876.053222] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff932fff61bbc8] [phy 16628289536] [logic 4096] [len 4096] [free 4096] [error 1] [inode 67344041] [59876.153987] LDISKFS-fs warning (device dm-0): ldiskfs_mb_new_blocks:5077: Updating bitmap error: [err -5] [pa ffff93295856d6e8] [phy 16628322304] [logic 4096] [len 4096] [free 4096] [error 1] [inode 67042443] [59886.946880] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure [59887.691671] Aborting journal on device md129. [59887.890886] Kernel panic - not syncing: LDISKFS-fs (device dm-0): panic forced after error [59887.903674] CPU: 2 PID: 15660 Comm: ll_ost00_357 Kdump: loaded Tainted: P OE ------------ 3.10.0-957.1.3957.1.3.x3.4.37.x86_64 #1 [59887.919307] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018 [59887.928496] Call Trace: [59887.932396] [<ffffffff97b63e41>] dump_stack+0x19/0x1b [59887.938877] [<ffffffff97b5d550>] panic+0xe8/0x21f [59887.944972] [<ffffffffc1cec156>] ldiskfs_handle_error.part.191+0xa6/0xb0 [ldiskfs] [59887.953962] [<ffffffffc1cec7db>] __ldiskfs_std_error+0x7b/0x100 [ldiskfs] [59887.962318] [<ffffffffc1d0508a>] ldiskfs_free_blocks+0xa1a/0xbb0 [ldiskfs] [59887.970693] [<ffffffff9761b591>] ? __slab_free+0x81/0x2f0 [59887.977580] [<ffffffffc1cbbef9>] ldiskfs_ext_remove_space+0x8a9/0x1150 [ldiskfs] [59887.986704] [<ffffffffc1cfa749>] ? ldiskfs_es_free_extent+0x99/0x130 [ldiskfs] [59887.995452] [<ffffffffc1cbe670>] ldiskfs_ext_truncate+0xb0/0xe0 [ldiskfs] [59888.003741] [<ffffffffc1cd776a>] ldiskfs_truncate+0x3da/0x430 [ldiskfs] [59888.011910] [<ffffffffc1cd84ea>] ldiskfs_evict_inode+0x58a/0x630 [ldiskfs] [59888.020159] [<ffffffff9765eac4>] evict+0xb4/0x180 [59888.026205] [<ffffffff9765f3cc>] iput+0xfc/0x190 [59888.032015] [<ffffffffc1d4d137>] osd_object_delete+0x1e7/0x360 [osd_ldiskfs] [59888.040336] [<ffffffffc1377ff8>] lu_object_free.isra.27+0xb8/0x1c0 [obdclass] [59888.048763] [<ffffffffc137b955>] lu_object_put+0xa5/0x430 [obdclass] [59888.056395] [<ffffffffc1bb840e>] ofd_destroy_by_fid+0x20e/0x500 [ofd] [59888.064161] [<ffffffffc163b820>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc] [59888.072993] [<ffffffffc1637b80>] ? ldlm_expired_completion_wait+0x2a0/0x2a0 [ptlrpc] [59888.081995] [<ffffffffc1baa307>] ofd_destroy_hdl+0x267/0x9f0 [ofd] [59888.089407] [<ffffffffc16d572a>] tgt_request_handle+0x7ea/0x1750 [ptlrpc] [59888.097512] [<ffffffffc16ae241>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [59888.106229] [<ffffffffc0fac02e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [59888.114405] [<ffffffffc16777a6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc] [59888.123233] [<ffffffff974cec64>] ? __wake_up+0x44/0x50 [59888.129602] [<ffffffffc167c2fc>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc] [59888.133296] LDISKFS-fs error (device dm-0) in ldiskfs_delete_entry:2838: Journal has aborted [59888.146602] [<ffffffffc167b7c0>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc] [59888.155066] [<ffffffff974c1c31>] kthread+0xd1/0xe0 [59888.161030] [<ffffffff974c1b60>] ? insert_kthread_work+0x40/0x40 [59888.168206] [<ffffffff97b76c1d>] ret_from_fork_nospec_begin+0x7/0x21 [59888.175648] [<ffffffff974c1b60>] ? insert_kthread_work+0x40/0x40 |
| Comments |
| Comment by Alex Zhuravlev [ 19/Feb/21 ] |
|
something went wrong with low-level IO: [349195.535733] LDISKFS-fs error (device dm-0) in ldiskfs_free_blocks:5437: IO failure |
| Comment by Andreas Dilger [ 19/Feb/21 ] |
|
The second failure also starts with IO errors, so this looks like it is a storage problem. |
| Comment by Cory Spitz [ 19/Feb/21 ] |
|
Right, we agree it looks like IO errors below Lustre. This storage HW runs 2.12.x OK and we're seeing this (and more related ldiskfs issues) on multiple systems without any indication of lower level IO errors. We're continuing our investigations. |
| Comment by Alex Zhuravlev [ 20/Feb/21 ] |
|
can you please attach full system log? |
| Comment by Andrew Perepechko [ 27/Feb/21 ] |
|
This bug happens because of the original buggy version of
Author: James Simmons <jsimmons@infradead.org>
Date: Thu Mar 5 20:40:47 2020 -0500
LU-12477 ldiskfs: remove obsolete ext4 patches
ldiskfs/kernel_patches/patches/rhel7.6# git blame ext4-mballoc-prefetch.patch ... fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 27) @@ -524,7 +532,7 @@ ext4_read_block_bitmap(struct super_b fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 28) struct buffer_head *bh; fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 29) int err; fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 30) fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 31) - bh = ext4_read_block_bitmap_nowait(sb, block_group); fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 32) + bh = ext4_read_block_bitmap_nowait(sb, block_group, 1); fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 33) if (!bh) fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 34) return NULL; fc87b01f96e (James Simmons 2020-03-05 20:40:47 -0500 35) err = ext4_wait_block_bitmap(sb, block_group, bh); |
| Comment by Andreas Dilger [ 27/Feb/21 ] |
|
Thanks for the detailed explanation. Is that patch still in use for EL7.9 or is it only affecting EL7.6/7.7? |
| Comment by Andrew Perepechko [ 27/Feb/21 ] |
|
adilger, only RHEL 7.6 seems to be affected. |
| Comment by Peter Jones [ 27/Feb/21 ] |
|
Thanks for the analysis Andrew. It seems like this issue does not affect the community releases then |