Details
Bug
Resolution: Unresolved
Major
Server: RHEL8
Description
[458781.070693] LDISKFS-fs error (device md65): ldiskfs_xattr_inode_iget:407: comm lfsck: EA inode 2047917093 does not have LDISKFS_EA_INODE_FL flag
[458781.136989] Aborting journal on device md65-8.
[458781.142323] LDISKFS-fs error (device md65) in ldiskfs_evict_inode:251: Journal has aborted
[458781.153243] LDISKFS-fs error (device md65): ldiskfs_journal_check_start:61: Detected aborted journal
[458781.155099] LustreError: 98016:0:(osd_handler.c:1783:osd_trans_commit_cb()) transaction @0x000000002c9fd616 commit error: 2
[458781.158848] LDISKFS-fs error (device md65): ldiskfs_journal_check_start:61: Detected aborted journal
[458781.170295] LDISKFS-fs error (device md65): ldiskfs_journal_check_start:61: Detected aborted journal
[458781.170297] LDISKFS-fs error (device md65): ldiskfs_journal_check_start:61: Detected aborted journal
[458781.175978] LDISKFS-fs error (device md65): ldiskfs_journal_check_start:61: Detected aborted journal
[458781.182078] LDISKFS-fs error (device md65): ldiskfs_journal_check_start:61: Detected aborted journal
[458781.199967] Kernel panic - not syncing: LDISKFS-fs (device md65): panic forced after error
[458781.199972] LDISKFS-fs (md65): Remounting filesystem read-only
[458781.199979] LDISKFS-fs (md65): Remounting filesystem read-only
[458781.200005] LDISKFS-fs (md65): Remounting filesystem read-only
[458781.200549] LDISKFS-fs error (device md65): ldiskfs_journal_check_start:61: Detected aborted journal
[458781.200552] LDISKFS-fs error (device md65): ldiskfs_journal_check_start:61: Detected aborted journal
[458781.200840] LDISKFS-fs error (device md65): ldiskfs_journal_check_start:61: Detected aborted journal
[458781.201096] LDISKFS-fs (md65): Remounting filesystem read-only
[458781.260424] LDISKFS-fs error (device md65): ldiskfs_journal_check_start:61: Detected aborted journal
[458781.262419] CPU: 4 PID: 2861532 Comm: lfsck Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.10.2.x6.0.24.x86_64 #1
[458781.262421] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
[458781.333307] Call Trace:
[458781.336774] dump_stack+0x5c/0x80
[458781.341219] panic+0xe7/0x2a9
[458781.345208] ? wake_up_q+0x54/0x80
[458781.349955] ldiskfs_handle_error.cold.139+0x13/0x13 [ldiskfs]
[458781.356863] __ldiskfs_error+0x8b/0x100 [ldiskfs]
[458781.362710] ? ldiskfs_htree_fill_tree+0xa0/0x2d0 [ldiskfs]
[458781.369344] ldiskfs_xattr_inode_iget+0xf4/0x170 [ldiskfs]
[458781.375883] ldiskfs_xattr_inode_get+0x4c/0x1e0 [ldiskfs]
[458781.382279] ? xattr_find_entry+0x95/0x110 [ldiskfs]
[458781.388253] ldiskfs_xattr_ibody_get+0x15f/0x180 [ldiskfs]
[458781.394742] ldiskfs_xattr_get+0x85/0x2d0 [ldiskfs]
[458781.400634] __vfs_getxattr+0x53/0x70
[458781.405326] osd_xattr_get+0x167/0x650 [osd_ldiskfs]
[458781.411326] lfsck_layout_get_lovea.part.77+0x6c/0x260 [lfsck]
[458781.418171] lfsck_layout_master_exec_oit+0x1b5/0xc90 [lfsck]
[458781.424910] lfsck_master_oit_engine+0xc52/0x1360 [lfsck]
[458781.432113] lfsck_master_engine+0x50e/0xcd0 [lfsck]
[458781.438056] ? finish_wait+0x80/0x80
[458781.442568] ? lfsck_master_oit_engine+0x1360/0x1360 [lfsck]
[458781.449177] kthread+0x116/0x130
[458781.453342] ? kthread_flush_work_fn+0x10/0x10
[458781.458686] ret_from_fork+0x1f/0x40
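For triage, the offending EA inode numbers can be pulled out of a console log so they can be checked individually. The following is only a sketch: the regular expression is derived from the single ldiskfs_xattr_inode_iget error line quoted above, and the names EA_FLAG_ERROR and find_bad_ea_inodes are made up for illustration.

```python
import re

# Pattern derived only from the error line quoted in this ticket; other
# ldiskfs versions may word the message differently (assumption).
EA_FLAG_ERROR = re.compile(
    r"LDISKFS-fs error \(device (?P<device>[^)]+)\): "
    r"ldiskfs_xattr_inode_iget:\d+: comm (?P<comm>[^:]+): "
    r"EA inode (?P<inode>\d+) does not have LDISKFS_EA_INODE_FL flag"
)


def find_bad_ea_inodes(log_text):
    """Return sorted, de-duplicated (device, inode) pairs found in log_text."""
    hits = {
        (match.group("device"), int(match.group("inode")))
        for match in EA_FLAG_ERROR.finditer(log_text)
    }
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "[458781.070693] LDISKFS-fs error (device md65): "
        "ldiskfs_xattr_inode_iget:407: comm lfsck: EA inode 2047917093 "
        "does not have LDISKFS_EA_INODE_FL flag"
    )
    for device, inode in find_bad_ea_inodes(sample):
        print(f"{device}: EA inode {inode}")
```

Each reported inode could then be examined offline on the unmounted device, for example with the stat command in debugfs.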
And many backtraces:
[456491.541627] ret_from_fork+0x1f/0x40
[456491.547490] CPU: 1 PID: 2861532 Comm: lfsck Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.10.2.x6.0.24.x86_64 #1
[456491.561264] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
[456491.569958] Call Trace:
[456491.573363] dump_stack+0x5c/0x80
[456491.577599] lfsck_trans_create.part.58+0x63/0x70 [lfsck]
[456491.583966] lfsck_namespace_trace_update+0xa3b/0xa50 [lfsck]
[456491.590650] lfsck_namespace_exec_oit+0x4b3/0x990 [lfsck]
[456491.597048] ? down_write+0xe/0x40
[456491.601438] lfsck_master_oit_engine+0xc52/0x1360 [lfsck]
[456491.607787] lfsck_master_engine+0x50e/0xcd0 [lfsck]
[456491.613699] ? finish_wait+0x80/0x80
[456491.618187] ? lfsck_master_oit_engine+0x1360/0x1360 [lfsck]
[456491.624716] kthread+0x116/0x130
[456491.628964] ? kthread_flush_work_fn+0x10/0x10
[456491.634325] ret_from_fork+0x1f/0x40
[456494.228001] CPU: 18 PID: 2861532 Comm: lfsck Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.10.2.x6.0.24.x86_64 #1
[456494.241276] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
[456494.249695] Call Trace:
[456494.252853] dump_stack+0x5c/0x80
[456494.256885] lfsck_trans_create.part.58+0x63/0x70 [lfsck]
[456494.262955] lfsck_namespace_trace_update+0xa3b/0xa50 [lfsck]
[456494.269296] lfsck_namespace_exec_oit+0x4b3/0x990 [lfsck]
[456494.275275] ? down_write+0xe/0x40
[456494.279264] lfsck_master_oit_engine+0xc52/0x1360 [lfsck]
[456494.285258] lfsck_master_engine+0x50e/0xcd0 [lfsck]
[456494.290924] ? finish_wait+0x80/0x80
[456494.295116] ? lfsck_master_oit_engine+0x1360/0x1360 [lfsck]
[456494.301388] kthread+0x116/0x130
[456494.305199] ? kthread_flush_work_fn+0x10/0x10
[456494.310227] ret_from_fork+0x1f/0x40
[456494.314569] CPU: 8 PID: 2861532 Comm: lfsck Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.10.2.x6.0.24.x86_64 #1
[456494.338328] Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
crash> bt -l
PID: 2861532 TASK: ffff9c083c05af80 CPU: 4 COMMAND: "lfsck"
 #0 [ffffbd866a4cf8f0] machine_kexec at ffffffff9dc6156e
    /usr/src/debug/kernel-4.18.0-305.10.2.x6.0.24/linux-4.18.0-305.10.2.x6.0.24.x86_64/arch/x86/kernel/machine_kexec_64.c: 389
 #1 [ffffbd866a4cf948] __crash_kexec at ffffffff9dd8f94d
    /usr/src/debug/kernel-4.18.0-305.10.2.x6.0.24/linux-4.18.0-305.10.2.x6.0.24.x86_64/kernel/kexec_core.c: 957
 #2 [ffffbd866a4cfa10] panic at ffffffff9dce0dc7
    /usr/src/debug/kernel-4.18.0-305.10.2.x6.0.24/linux-4.18.0-305.10.2.x6.0.24.x86_64/./arch/x86/include/asm/smp.h: 72
 #3 [ffffbd866a4cfaa0] __ldiskfs_error at ffffffffc1a9252b [ldiskfs]
    /home/centos/rpmbuild/BUILD/lustre-2.14.55_81_gc26b347/ldiskfs/inode.c: 4523
 #4 [ffffbd866a4cfb48] ldiskfs_xattr_inode_iget at ffffffffc1a5cf14 [ldiskfs]
    /home/centos/rpmbuild/BUILD/lustre-2.14.55_81_gc26b347/ldiskfs/trace/events/ldiskfs.h: 2666
 #5 [ffffbd866a4cfb80] ldiskfs_xattr_inode_get at ffffffffc1a5fd9c [ldiskfs]
    /home/centos/rpmbuild/BUILD/lustre-2.14.55_81_gc26b347/ldiskfs/trace/events/ldiskfs.h: 1775
 #6 [ffffbd866a4cfbe0] ldiskfs_xattr_ibody_get at ffffffffc1a601ef [ldiskfs]
    /home/centos/rpmbuild/BUILD/lustre-2.14.55_81_gc26b347/ldiskfs/ldiskfs.h: 1572
 #7 [ffffbd866a4cfc48] ldiskfs_xattr_get at ffffffffc1a60295 [ldiskfs]
    /usr/src/kernels/4.18.0-305.10.2.x6.0.24.x86_64/include/linux/quotaops.h: 19
 #8 [ffffbd866a4cfca0] __vfs_getxattr at ffffffff9df43223
    /usr/src/debug/kernel-4.18.0-305.10.2.x6.0.24/linux-4.18.0-305.10.2.x6.0.24.x86_64/fs/xattr.c: 374
 #9 [ffffbd866a4cfcd0] osd_xattr_get at ffffffffc1b28c07 [osd_ldiskfs]
    /home/centos/rpmbuild/BUILD/lustre-2.14.55_81_gc26b347/lustre/include/lustre_compat.h: 540
#10 [ffffbd866a4cfd18] lfsck_layout_get_lovea at ffffffffc158bd5c [lfsck]
    /home/centos/rpmbuild/BUILD/lustre-2.14.55_81_gc26b347/lustre/include/dt_object.h: 2875
#11 [ffffbd866a4cfd50] lfsck_layout_master_exec_oit at ffffffffc1597025 [lfsck]
    /home/centos/rpmbuild/BUILD/lustre-2.14.55_81_gc26b347/lustre/lfsck/lfsck_layout.c: 5711
#12 [ffffbd866a4cfe08] lfsck_master_oit_engine at ffffffffc1560de2 [lfsck]
    /home/centos/rpmbuild/BUILD/lustre-2.14.55_81_gc26b347/lustre/lfsck/lfsck_engine.c: 531
#13 [ffffbd866a4cfe78] lfsck_master_engine at ffffffffc15619fe [lfsck]
    /home/centos/rpmbuild/BUILD/lustre-2.14.55_81_gc26b347/lustre/lfsck/lfsck_engine.c: 1083
#14 [ffffbd866a4cff10] kthread at ffffffff9dd043a6
    /usr/src/debug/kernel-4.18.0-305.10.2.x6.0.24/linux-4.18.0-305.10.2.x6.0.24.x86_64/kernel/kthread.c: 319
#15 [ffffbd866a4cff50] ret_from_fork at ffffffff9e60023f
    /usr/src/debug/kernel-4.18.0-305.10.2.x6.0.24/linux-4.18.0-305.10.2.x6.0.24.x86_64/arch/x86/entry/entry_64.S: 319
With (read-only) lfsck enabled, this crash persisted after rebooting, running e2fsck, and re-syncing the RAID.
lfsck was eventually cleared by running lctl lfsck_stop on the MDT nodes as early as possible during the mount (and/or failback), repeating until no more lfsck activity was observed.
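The workaround amounts to a retry loop around lctl lfsck_stop. A minimal sketch of that loop follows, under stated assumptions: the MDT device name (testfs-MDT0000 below) is a placeholder, the function names are hypothetical, and the caller supplies is_active, a check for remaining lfsck activity (one possibility is reading the lfsck status reported under the mdd parameters with lctl get_param). The only Lustre command issued is lctl lfsck_stop -M.

```python
import subprocess
import time


def lfsck_stop_cmd(mdt_device):
    """Build the lctl command that stops LFSCK on one MDT device."""
    return ["lctl", "lfsck_stop", "-M", mdt_device]


def stop_lfsck_repeatedly(mdt_device, is_active, run=subprocess.run,
                          attempts=30, delay=1.0, sleep=time.sleep):
    """Issue lfsck_stop until is_active() reports no LFSCK activity.

    is_active is a caller-supplied callable (hypothetical helper) returning
    True while LFSCK is still running. Returns the number of stop commands
    issued, or raises RuntimeError if LFSCK is still active afterwards.
    """
    for issued in range(1, attempts + 1):
        # check=False: a failing stop (e.g. nothing running yet) is not fatal.
        run(lfsck_stop_cmd(mdt_device), check=False)
        if not is_active():
            return issued
        sleep(delay)
    raise RuntimeError(
        f"LFSCK still active on {mdt_device} after {attempts} attempts"
    )
```

Injecting run, sleep, and is_active keeps the loop testable without a live MDS; on a real node the defaults would invoke lctl directly and need to start right after the MDT mounts.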
Cory mentioned that this may be fallout from LU-15404, where the large xattr failed to be unlinked because of transaction credits. If so, this problem may go away once that issue is fixed (i.e. it may no longer leave a large xattr inode in the filesystem without LDISKFS_EA_INODE_FL set).
It probably makes sense to change this case from ext4_error() to ext4_warning_inode() or similar, and return -EIO when accessing that large xattr, so that it doesn't cause the filesystem to be remounted read-only. That would be a lot more robust, and would only affect the one inode's xattr.