
[LU-15404] kernel panic and filesystem corruption in setxattr due to journal transaction restart

Details


    Description

      During recent testing, we have found a repeatable kernel panic and ldiskfs corruption.

      The kernel panics with the following stack trace:

      [41021.608188] LNet: 18792:0:(o2iblnd_cb.c:3372:kiblnd_check_conns()) Timed out tx for 10.12.0.1@o2ib4000: 0 seconds
      [41023.501466] ------------[ cut here ]------------
      [41023.506369] kernel BUG at fs/jbd2/transaction.c:1476!
      [41023.511714] invalid opcode: 0000 [#1] SMP NOPTI
      [41023.516549] CPU: 18 PID: 304955 Comm: mdt03_009 Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.10.2.x6.0.24.x86_64 #1
      [41023.537576] RIP: 0010:jbd2_journal_dirty_metadata+0x1fb/0x250 [jbd2]
      [41023.544198] Code: f3 90 48 8b 03 a9 00 00 80 00 75 f4 e9 b3 fe ff ff 44 8b 45 0c 41 83 f8 01 74 8c e9 ec 94 00 00 4c 39 65 30 0f 84 68 fe ff ff <0f> 0b 4d 8b 4a 70 4c 8d 73 02 4d 39 cc 0f 84 33 ff ff ff e9 53 95
      [41023.563722] RSP: 0018:ffffb106eae8f5d8 EFLAGS: 00010207
      [41023.569412] RAX: 000000000062c029 RBX: ffff997e385c99c0 RCX: 000000000000000
      [41023.576940] RDX: ffff998c322432a0 RSI: ffff997e385c99c0 RDI: ffff998c322432a0
      [41023.584556] RBP: ffff997f643bc960 R08: ffff997e385c99c0 R09: 0000000000000000
      [41023.592073] R10: ffff998c2d496800 R11: 0000000000000100 R12: ffff998b65522300
      [41023.599709] R13: 0000000000000000 R14: ffffffffc1a64350 R15: 0000000000001514
      [41023.607327] FS: 0000000000000000(0000) GS:ffff998e7f080000(0000) knlGS:0000000000000000
      [41023.615913] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [41023.622237] CR2: 00007ff9d42640dc CR3: 0000000f00210004 CR4: 00000000003706e0
      [41023.629848] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [41023.637566] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      [41023.645130] Call Trace:
      [41023.648010] __ldiskfs_handle_dirty_metadata+0x51/0x190 [ldiskfs]
      [41023.654687] ldiskfs_do_update_inode+0x49a/0x7f0 [ldiskfs]
      [41023.660734] ldiskfs_mark_iloc_dirty+0x32/0x80 [ldiskfs]
      [41023.666635] ldiskfs_xattr_set_handle+0x381/0x580 [ldiskfs]
      [41023.672840] ldiskfs_xattr_set+0xd0/0x160 [ldiskfs]
      [41023.678168] __vfs_setxattr+0x66/0x80
      [41023.682392] osd_xattr_set+0x709/0x10a0 [osd_ldiskfs]
      [41023.688059] ? lod_gen_component_ea+0x2c2/0x9e0 [lod]
      [41023.693704] lod_sub_xattr_set+0x248/0x4d0 [lod]
      [41023.698900] lod_generate_and_set_lovea+0x262/0x310 [lod]
      [41023.704879] lod_striped_create+0x433/0x590 [lod]
      [41023.710083] lod_layout_change+0x192/0x270 [lod]
      [41023.715333] mdd_layout_change+0x13f7/0x1980 [mdd]
      [41023.720809] mdt_layout_change+0x31c/0x4b0 [mdt]
      [41023.726082] mdt_intent_layout+0x6c8/0x990 [mdt]
      [41023.731241] ? mdt_intent_getxattr+0x320/0x320 [mdt]
      [41023.736903] mdt_intent_opc+0x12c/0xbf0 [mdt]
      [41023.742067] mdt_intent_policy+0x207/0x3a0 [mdt]
      [41023.747281] ldlm_lock_enqueue+0x4e4/0xa80 [ptlrpc]
      [41023.752854] ldlm_handle_enqueue0+0x634/0x1760 [ptlrpc]
      [41023.758771] tgt_enqueue+0xa4/0x210 [ptlrpc]
      [41023.763752] tgt_request_handle+0xc93/0x1a00 [ptlrpc]
      [41023.769516] ? ptlrpc_nrs_req_get_nolock0+0xfb/0x1f0 [ptlrpc]
      [41023.775901] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
      [41023.782512] ptlrpc_main+0xc06/0x1550 [ptlrpc]
      [41023.787683] ? ptlrpc_wait_event+0x500/0x500 [ptlrpc]
      [41023.793276] kthread+0x116/0x130
      [41023.797041] ? kthread_flush_work_fn+0x10/0x10
      [41023.802049] ret_from_fork+0x1f/0x40
      

      The corresponding line of code maps to:

                      J_ASSERT_JH(jh, jh->b_transaction == transaction ||
                                      jh->b_next_transaction == transaction);
      

      More precisely, jh is associated with a transaction that is actively committing and already in its disk-writing phase (i.e. t_updates has already dropped to zero).

      After a bit of tracing, we found that the transaction restarts when a large EA is replaced with another large EA, which in RHEL8-based ldiskfs code causes a new EA inode to be allocated and the old one to be freed. The truncate part of the old EA inode release sometimes fails to extend the current transaction and has to restart it:

      mdt03_024-198115 [012] 45670.650452: kernel_stack:         <stack trace>
      => trace_event_raw_event_jbd2_handle_start_class (ffffffffc0c7e60c)
      => jbd2__journal_restart (ffffffffc0c75b5c)
      => ldiskfs_datasem_ensure_credits (ffffffffc1ac3431)
      => ldiskfs_ext_rm_leaf (ffffffffc1ac44e8)
      => ldiskfs_ext_remove_space (ffffffffc1ac8240)
      => ldiskfs_ext_truncate (ffffffffc1ac953a)
      => ldiskfs_truncate (ffffffffc1adbdcb)
      => ldiskfs_evict_inode (ffffffffc1adcc71)
      => evict (ffffffff84f37202)
      => ldiskfs_xattr_set_entry (ffffffffc1abcf1e)
      => ldiskfs_xattr_ibody_set (ffffffffc1abd5be)
      => ldiskfs_xattr_set_handle (ffffffffc1abf9e4)
      => ldiskfs_xattr_set (ffffffffc1abfd70)
      => __vfs_setxattr (ffffffff84f431b6)
      => osd_xattr_set (ffffffffc1b7891d)
      => lod_sub_xattr_set (ffffffffc17da152)
      => lod_generate_and_set_lovea (ffffffffc17c7d8c)
      => lod_striped_create (ffffffffc17c81d0)
      => lod_layout_change (ffffffffc17c839b)
      => mdd_layout_change (ffffffffc1850f7d)
      => mdt_layout_change (ffffffffc18aeaf1)
      => mdt_intent_layout (ffffffffc18b5e30)
      => mdt_intent_opc (ffffffffc18ac778)
      => mdt_intent_policy (ffffffffc18b3ba6)
      => ldlm_lock_enqueue (ffffffffc138ffff)
      => ldlm_handle_enqueue0 (ffffffffc13b811f)
      => tgt_enqueue (ffffffffc1441b14)
      => tgt_request_handle (ffffffffc14465cd)
      => ptlrpc_server_handle_request (ffffffffc13ecaea)
      => ptlrpc_main (ffffffffc13f132a)
      => kthread (ffffffff84d043a6)
      => ret_from_fork (ffffffff8560023f)
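
      The restart in ldiskfs_datasem_ensure_credits() follows the usual extend-or-restart pattern; below is a minimal sketch in terms of the jbd2 primitives seen in the trace (illustrative only: the real helper also manages i_data_sem and its credit accounting, and the two-argument jbd2_journal_extend() shown here matches the 4.18-based kernel in this report, while newer kernels take an extra revoke-records argument):

      #include <linux/jbd2.h>

      static int sketch_ensure_credits(handle_t *handle, int needed)
      {
              int err;

              /* try to extend the running handle first; this keeps the whole
               * operation inside a single transaction */
              err = jbd2_journal_extend(handle, needed);
              if (err <= 0)
                      return err;     /* 0: extended, <0: hard error */

              /* the running transaction is already too large to extend:
               * detach from it and bind the handle to a new transaction.
               * This pushes the old transaction toward commit and is the
               * step that breaks the atomicity of the enclosing setxattr. */
              return jbd2__journal_restart(handle, needed, GFP_NOFS);
      }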
      

      One problematic part here is that a transaction restart forces the current transaction to commit, so the incomplete transaction will likely commit before the kernel panics, which causes ldiskfs corruption after remount. The reason for the kernel panic is that we restart this transaction somewhere between ldiskfs_get_write_access() and ldiskfs_mark_dirty_metadata(), so the inode bh stays attached to the old transaction:

      int
      ldiskfs_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
                            const char *name, const void *value, size_t value_len,
                            int flags)
      ...
              /* joins the inode buffer (is.iloc.bh) to the running transaction
               * via ldiskfs_journal_get_write_access() */
              error = ldiskfs_reserve_inode_write(handle, inode, &is.iloc);
      ...
                      /* replacing one large EA with another frees the old EA
                       * inode; its truncate may restart the handle, pushing
                       * the original transaction to commit (see trace above) */
                      error = ldiskfs_xattr_ibody_set(handle, inode, &i, &is);
      ...
                      /* dirties is.iloc.bh, which is still attached to the old,
                       * now-committing transaction -> J_ASSERT_JH() fires */
                      error = ldiskfs_mark_iloc_dirty(handle, inode, &is.iloc);
      ...
      }
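
      Expressed in terms of the raw jbd2 API, the failing sequence looks roughly like this. This is a hedged sketch with an illustrative function name and hard-coded credit counts, not actual ldiskfs code; most return values are ignored to keep it short:

      #include <linux/jbd2.h>
      #include <linux/buffer_head.h>
      #include <linux/err.h>

      static int sketch_restart_vs_dirty(journal_t *journal,
                                         struct buffer_head *inode_bh)
      {
              handle_t *handle = jbd2_journal_start(journal, 8);

              if (IS_ERR(handle))
                      return PTR_ERR(handle);

              /* the journal head of inode_bh is now attached to the running
               * transaction T1 */
              jbd2_journal_get_write_access(handle, inode_bh);

              /* the old EA inode's truncate runs out of credits and restarts
               * the handle: T1 is pushed to commit and the handle is re-bound
               * to a new transaction T2 */
              jbd2__journal_restart(handle, 8, GFP_NOFS);

              /* inode_bh was never re-joined to T2, so b_transaction still
               * points at the committing T1 and b_next_transaction != T2:
               * the J_ASSERT_JH() quoted above fires here */
              jbd2_journal_dirty_metadata(handle, inode_bh);

              return jbd2_journal_stop(handle);
      }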
      

      We don't have a fix yet and haven't decided how to fix this. For example, moving the final iput for the old EA inode out of the transaction may be problematic with the osd-ldiskfs/ldiskfs layering.

      The bug seems to be almost entirely inherited from upstream. However, the credit calculation may differ between ext4 and osd-ldiskfs, so the bug may not necessarily reproduce with ext4 alone.

      Attachments

        Issue Links

          Activity

            [LU-15404] kernel panic and filesystem corruption in setxattr due to journal transaction restart

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50586/
            Subject: LU-15404 ldiskfs: use per-filesystem workqueues to avoid deadlocks
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 9ab613631b5833ba7e7a578fdb9819ebc593ab3c

            gerrit Gerrit Updater added a comment

            "Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50586
            Subject: LU-15404 ldiskfs: use per-filesystem workqueues to avoid deadlocks
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: f8b66bbe831c3f5fcf06b3ad22a5449fa004ff74

            gerrit Gerrit Updater added a comment

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50354/
            Subject: LU-15404 ldiskfs: use per-filesystem workqueues to avoid deadlocks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 616fa9b581798e1b66e4d36113c29531ad7e41a0

            gerrit Gerrit Updater added a comment

            adilger sorry, I was busy with other tickets.
            panda thanks! I will test your patch quickly and report back.

            bzzz Alex Zhuravlev added a comment

            "Andrew Perepechko <andrew.perepechko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50354
            Subject: LU-15404 ldiskfs: use per-filesystem workqueues to avoid deadlocks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ec3a8efce2a3328a89c9e0ef7eb0a3719f31290b

            gerrit Gerrit Updater added a comment

            Alex, do you know what was holding the filesystem reference for delayed_mntput in the kworker thread?

            I'm wondering if we need to put the flush_scheduled_work() call in some "pre cleanup" superblock method for newer kernels, rather than having it in ldiskfs_put_super()?

            The most recent version of this patch that Artem pushed to Linux-ext4 is at:
            https://patchwork.ozlabs.org/project/linux-ext4/patch/20220711145735.53676-1-artem.blagodarenko@gmail.com/

            That version uses a dedicated per-filesystem workqueue rather than the global workqueue. This potentially avoids long waits or deadlocks during unmount when multiple filesystems are adding inodes to a single queue and then all have to wait for the entire queue to empty, as pointed out here:
            https://lore.kernel.org/all/385ce718-f965-4005-56b6-34922c4533b8@I-love.SAKURA.ne.jp/

            adilger Andreas Dilger added a comment
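
            For reference, the general shape of the per-filesystem workqueue approach described above is sketched below. Names are illustrative and this is not the actual patch; see the linux-ext4 posting and the Gerrit changes referenced in this ticket for the real code:

            #include <linux/fs.h>
            #include <linux/workqueue.h>

            struct sketch_sb_info {
                    struct workqueue_struct *s_misc_wq;   /* private per-mount queue */
            };

            static int sketch_init_workqueue(struct super_block *sb,
                                             struct sketch_sb_info *sbi)
            {
                    /* allocated at mount time, so unmount only has to wait for
                     * this filesystem's own deferred work */
                    sbi->s_misc_wq = alloc_workqueue("sketch-misc-%s", 0, 0, sb->s_id);
                    return sbi->s_misc_wq ? 0 : -ENOMEM;
            }

            static void sketch_put_super(struct sketch_sb_info *sbi)
            {
                    /* flushing a private queue avoids the deadlock shown in the
                     * kworker backtrace later in this thread, where
                     * flush_scheduled_work(), i.e. flush_workqueue(system_wq),
                     * is called from a kworker that is itself running
                     * delayed_mntput() off the global queue */
                    flush_workqueue(sbi->s_misc_wq);
                    destroy_workqueue(sbi->s_misc_wq);
            }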

            attaching full trace for the case, please have a look.

            bzzz Alex Zhuravlev added a comment
            static void ldiskfs_put_super(struct super_block *sb)
            {
                    struct ldiskfs_sb_info *sbi = LDISKFS_SB(sb);
                    struct ldiskfs_super_block *es = sbi->s_es;
                    struct buffer_head **group_desc;
                    struct flex_groups **flex_groups;
                    int aborted = 0;
                    int i, err;
            
                    flush_scheduled_work();
            
            ...
            
            static inline void flush_scheduled_work(void)
            {
                    flush_workqueue(system_wq);
            .....
            
            void flush_workqueue(struct workqueue_struct *wq)
            {
            ...
                    wait_for_completion(&this_flusher.done);
            

            I tend to think that calling flush_workqueue() from the context of a kworker is not good.

            bzzz Alex Zhuravlev added a comment

            I don't think the umount thread waits for itself. However, during upstreaming this patch was reworked to use dedicated ext4 workqueues to avoid deadlocks on the global workqueue. I believe Artem was the last one who looked into upstreaming this patch and may have its latest version.

            panda Andrew Perepechko added a comment

            Alex, it looks like the workqueue is flushed prior to unmount, but possibly iputs of internal inodes during unmount are delayed again? We probably need a check to avoid adding new inodes to the queue once unmount has started?

            adilger Andreas Dilger added a comment
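
            A hedged sketch of the guard suggested above; the s_umount_started flag and all names here are hypothetical, not taken from the actual patch:

            #include <linux/fs.h>
            #include <linux/slab.h>
            #include <linux/workqueue.h>

            struct sketch_sb_info {
                    struct workqueue_struct *s_misc_wq;
                    bool                     s_umount_started;  /* hypothetical flag,
                                                                 * set before the
                                                                 * final flush */
            };

            struct sketch_iput_work {
                    struct work_struct  work;
                    struct inode       *inode;
            };

            /* called wherever the filesystem would normally defer the final iput */
            static void sketch_defer_iput(struct sketch_sb_info *sbi,
                                          struct sketch_iput_work *iw)
            {
                    if (READ_ONCE(sbi->s_umount_started)) {
                            /* unmount has already flushed the queue; do not add
                             * new work behind its back, drop the reference inline */
                            iput(iw->inode);
                            kfree(iw);
                            return;
                    }
                    /* the work handler (not shown) does iput() + kfree() */
                    queue_work(sbi->s_misc_wq, &iw->work);
            }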

            I tend to think this patch causes a specific issue at umount:

            PID: 18     TASK: ffff8f2f07dc44c0  CPU: 1   COMMAND: "kworker/1:0"
             #0 [ffff8f2f07dcbb30] __schedule at ffffffff965a232d
                /tmp/kernel/kernel/sched/core.c: 3109
             #1 [ffff8f2f07dcbbb8] schedule at ffffffff965a2748
                /tmp/kernel/./arch/x86/include/asm/preempt.h: 84
             #2 [ffff8f2f07dcbbc8] schedule_timeout at ffffffff965a7559
                /tmp/kernel/kernel/time/timer.c: 1840
             #3 [ffff8f2f07dcbc98] wait_for_common at ffffffff965a3081
                /tmp/kernel/kernel/sched/completion.c: 86
             #4 [ffff8f2f07dcbce8] flush_workqueue at ffffffff960cdf9f
                /tmp/kernel/kernel/workqueue.c: 2828
             #5 [ffff8f2f07dcbdb8] ldiskfs_put_super at ffffffffc0b32a7e [ldiskfs]
                /home/lustre/master-mine/ldiskfs/super.c: 1028
             #6 [ffff8f2f07dcbdf0] generic_shutdown_super at ffffffff961d17cf
                /tmp/kernel/./include/linux/compiler.h: 276
             #7 [ffff8f2f07dcbe08] kill_block_super at ffffffff961d1a9c
                /tmp/kernel/fs/super.c: 1443
             #8 [ffff8f2f07dcbe20] deactivate_locked_super at ffffffff961d1e74
                /tmp/kernel/fs/super.c: 340
             #9 [ffff8f2f07dcbe38] cleanup_mnt at ffffffff961f0fc6
                /tmp/kernel/fs/namespace.c: 115
            #10 [ffff8f2f07dcbe48] delayed_mntput at ffffffff961f1021
                /tmp/kernel/fs/namespace.c: 1156
            #11 [ffff8f2f07dcbe58] process_one_work at ffffffff960cf8cf
                /tmp/kernel/kernel/workqueue.c: 2266
            #12 [ffff8f2f07dcbed0] worker_thread at ffffffff960cfae5
                /tmp/kernel/./include/linux/compiler.h: 276
            #13 [ffff8f2f07dcbf10] kthread at ffffffff960d5199
                /tmp/kernel/kernel/kthread.c: 340
            #14 [ffff8f2f07dcbf50] ret_from_fork at ffffffff9660019f
                /tmp/kernel/arch/x86/entry/entry_64.S: 325
            

            i.e. in a worker thread (processing workqueues), ldiskfs_put_super() calls flush_workqueue() and ends up waiting for itself?

            bzzz Alex Zhuravlev added a comment

            People

              Assignee: panda Andrew Perepechko
              Reporter: panda Andrew Perepechko
              Votes: 0
              Watchers: 13
