[LU-15404] kernel panic and filesystem corruption in setxattr due to journal transaction restart Created: 01/Jan/22  Updated: 24/Nov/23  Resolved: 07/Feb/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0, Lustre 2.15.0, Lustre 2.15.3

Type: Bug Priority: Blocker
Reporter: Andrew Perepechko Assignee: Andrew Perepechko
Resolution: Fixed Votes: 0
Labels: LTS12

Attachments: Text File bt-all.txt    
Issue Links:
Duplicate
Related
is related to LU-15238 lfsck crashes MDT LDISKFS-fs error (d... Open
is related to LU-17312 interop conf-sanity test_53b: Asserti... Open
is related to LU-16032 Truncate for large objects can lead ... Resolved
is related to LU-15333 lfsck reports "XATTR trusted.fid: the... Resolved
is related to LU-16973 Busy device after successful umount Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During recent testing, we have found a repeatable kernel panic and ldiskfs corruption.

The kernel panics with the following stack trace:

[41021.608188] LNet: 18792:0:(o2iblnd_cb.c:3372:kiblnd_check_conns()) Timed out tx for 10.12.0.1@o2ib4000: 0 seconds
[41023.501466] ------------[ cut here ]------------
[41023.506369] kernel BUG at fs/jbd2/transaction.c:1476!
[41023.511714] invalid opcode: 0000 [#1] SMP NOPTI
[41023.516549] CPU: 18 PID: 304955 Comm: mdt03_009 Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.10.2.x6.0.24.x86_64 #1
[41023.537576] RIP: 0010:jbd2_journal_dirty_metadata+0x1fb/0x250 [jbd2]
[41023.544198] Code: f3 90 48 8b 03 a9 00 00 80 00 75 f4 e9 b3 fe ff ff 44 8b 45 0c 41 83 f8 01 74 8c e9 ec 94 00 00 4c 39 65 30 0f 84 68 fe ff ff <0f> 0b 4d 8b 4a 70 4c 8d 73 02 4d 39 cc 0f 84 33 ff ff ff e9 53 95
[41023.563722] RSP: 0018:ffffb106eae8f5d8 EFLAGS: 00010207
[41023.569412] RAX: 000000000062c029 RBX: ffff997e385c99c0 RCX: 000000000000000
[41023.576940] RDX: ffff998c322432a0 RSI: ffff997e385c99c0 RDI: ffff998c322432a0
[41023.584556] RBP: ffff997f643bc960 R08: ffff997e385c99c0 R09: 0000000000000000
[41023.592073] R10: ffff998c2d496800 R11: 0000000000000100 R12: ffff998b65522300
[41023.599709] R13: 0000000000000000 R14: ffffffffc1a64350 R15: 0000000000001514
[41023.607327] FS: 0000000000000000(0000) GS:ffff998e7f080000(0000) knlGS:0000000000000000
[41023.615913] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41023.622237] CR2: 00007ff9d42640dc CR3: 0000000f00210004 CR4: 00000000003706e0
[41023.629848] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[41023.637566] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

[41023.645130] Call Trace:
[41023.648010] __ldiskfs_handle_dirty_metadata+0x51/0x190 [ldiskfs]
[41023.654687] ldiskfs_do_update_inode+0x49a/0x7f0 [ldiskfs]
[41023.660734] ldiskfs_mark_iloc_dirty+0x32/0x80 [ldiskfs]
[41023.666635] ldiskfs_xattr_set_handle+0x381/0x580 [ldiskfs]
[41023.672840] ldiskfs_xattr_set+0xd0/0x160 [ldiskfs]
[41023.678168] __vfs_setxattr+0x66/0x80
[41023.682392] osd_xattr_set+0x709/0x10a0 [osd_ldiskfs]
[41023.688059] ? lod_gen_component_ea+0x2c2/0x9e0 [lod]
[41023.693704] lod_sub_xattr_set+0x248/0x4d0 [lod]
[41023.698900] lod_generate_and_set_lovea+0x262/0x310 [lod]
[41023.704879] lod_striped_create+0x433/0x590 [lod]
[41023.710083] lod_layout_change+0x192/0x270 [lod]
[41023.715333] mdd_layout_change+0x13f7/0x1980 [mdd]
[41023.720809] mdt_layout_change+0x31c/0x4b0 [mdt]
[41023.726082] mdt_intent_layout+0x6c8/0x990 [mdt]
[41023.731241] ? mdt_intent_getxattr+0x320/0x320 [mdt]
[41023.736903] mdt_intent_opc+0x12c/0xbf0 [mdt]
[41023.742067] mdt_intent_policy+0x207/0x3a0 [mdt]
[41023.747281] ldlm_lock_enqueue+0x4e4/0xa80 [ptlrpc]
[41023.752854] ldlm_handle_enqueue0+0x634/0x1760 [ptlrpc]
[41023.758771] tgt_enqueue+0xa4/0x210 [ptlrpc]
[41023.763752] tgt_request_handle+0xc93/0x1a00 [ptlrpc]
[41023.769516] ? ptlrpc_nrs_req_get_nolock0+0xfb/0x1f0 [ptlrpc]
[41023.775901] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[41023.782512] ptlrpc_main+0xc06/0x1550 [ptlrpc]
[41023.787683] ? ptlrpc_wait_event+0x500/0x500 [ptlrpc]
[41023.793276] kthread+0x116/0x130
[41023.797041] ? kthread_flush_work_fn+0x10/0x10
[41023.802049] ret_from_fork+0x1f/0x40

The corresponding line of code would map to:

                J_ASSERT_JH(jh, jh->b_transaction == transaction ||
                                jh->b_next_transaction == transaction);

More precisely, jh is associated with an actively committing transaction in its disk writing phase (i.e. t_updates already dropped to zero).

After a bit of tracing, we've found that the transaction restarts when a large EA is changed to another large EA, which in RHEL8-based ldiskfs code causes a new EA inode to be allocated and the old inode to be freed. The truncate part of the old inode release sometimes fails to extend the current transaction and has to restart it:

mdt03_024-198115 [012] 45670.650452: kernel_stack:         <stack trace>
=> trace_event_raw_event_jbd2_handle_start_class (ffffffffc0c7e60c)
=> jbd2__journal_restart (ffffffffc0c75b5c)
=> ldiskfs_datasem_ensure_credits (ffffffffc1ac3431)
=> ldiskfs_ext_rm_leaf (ffffffffc1ac44e8)
=> ldiskfs_ext_remove_space (ffffffffc1ac8240)
=> ldiskfs_ext_truncate (ffffffffc1ac953a)
=> ldiskfs_truncate (ffffffffc1adbdcb)
=> ldiskfs_evict_inode (ffffffffc1adcc71)
=> evict (ffffffff84f37202)
=> ldiskfs_xattr_set_entry (ffffffffc1abcf1e)
=> ldiskfs_xattr_ibody_set (ffffffffc1abd5be)
=> ldiskfs_xattr_set_handle (ffffffffc1abf9e4)
=> ldiskfs_xattr_set (ffffffffc1abfd70)
=> __vfs_setxattr (ffffffff84f431b6)
=> osd_xattr_set (ffffffffc1b7891d)
=> lod_sub_xattr_set (ffffffffc17da152)
=> lod_generate_and_set_lovea (ffffffffc17c7d8c)
=> lod_striped_create (ffffffffc17c81d0)
=> lod_layout_change (ffffffffc17c839b)
=> mdd_layout_change (ffffffffc1850f7d)
=> mdt_layout_change (ffffffffc18aeaf1)
=> mdt_intent_layout (ffffffffc18b5e30)
=> mdt_intent_opc (ffffffffc18ac778)
=> mdt_intent_policy (ffffffffc18b3ba6)
=> ldlm_lock_enqueue (ffffffffc138ffff)
=> ldlm_handle_enqueue0 (ffffffffc13b811f)
=> tgt_enqueue (ffffffffc1441b14)
=> tgt_request_handle (ffffffffc14465cd)
=> ptlrpc_server_handle_request (ffffffffc13ecaea)
=> ptlrpc_main (ffffffffc13f132a)
=> kthread (ffffffff84d043a6)
=> ret_from_fork (ffffffff8560023f)

One problematic part here is that a transaction restart forces the current transaction to commit, so the incomplete transaction will likely commit before the kernel panics. This will cause ldiskfs corruption after remount. The reason for the kernel panic is that we restart this transaction somewhere between ldiskfs_get_write_access() and ldiskfs_mark_dirty_metadata(), so the inode bh stays in the old transaction:

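ldiskfs_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
                      const char *name, const void *value, size_t value_len,
                      int flags)
{
...
        /* takes journal write access on the inode buffer */
        error = ldiskfs_reserve_inode_write(handle, inode, &is.iloc);
...
                /* may evict the old EA inode and restart the transaction */
                error = ldiskfs_xattr_ibody_set(handle, inode, &i, &is);
...
                /* the inode buffer still belongs to the old transaction */
                error = ldiskfs_mark_iloc_dirty(handle, inode, &is.iloc);
...
}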
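
The sequence above can be illustrated with a small userspace model. This is only a sketch: the toy_* types and functions are hypothetical stand-ins, not jbd2 or ldiskfs code, and the model ignores the case where jbd2 sets b_next_transaction itself. It only shows why a restart between taking write access and dirtying the buffer trips the condition checked by J_ASSERT_JH.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for jbd2 structures; only the fields the check reads. */
struct toy_transaction { int tid; };

struct toy_journal_head {
	struct toy_transaction *b_transaction;      /* transaction owning the buffer */
	struct toy_transaction *b_next_transaction; /* next transaction, if any */
};

struct toy_handle {
	struct toy_transaction *h_transaction;      /* transaction the handle runs in */
};

/* Models get_write_access: attach the buffer to the handle's transaction. */
static void toy_get_write_access(struct toy_handle *h,
				 struct toy_journal_head *jh)
{
	jh->b_transaction = h->h_transaction;
	jh->b_next_transaction = NULL;
}

/* Models a restart: the handle moves to a new transaction, while buffers
 * that were already attached stay with the old one. */
static void toy_restart(struct toy_handle *h, struct toy_transaction *next)
{
	h->h_transaction = next;
}

/* Models the condition asserted in jbd2_journal_dirty_metadata(). */
static bool toy_dirty_metadata_ok(const struct toy_handle *h,
				  const struct toy_journal_head *jh)
{
	return jh->b_transaction == h->h_transaction ||
	       jh->b_next_transaction == h->h_transaction;
}

/* Returns true if the write-access/dirty pair is still consistent. */
static bool toy_setxattr(bool restart_in_between)
{
	struct toy_transaction t1 = { 1 }, t2 = { 2 };
	struct toy_handle h = { &t1 };
	struct toy_journal_head inode_bh = { NULL, NULL };

	toy_get_write_access(&h, &inode_bh);
	if (restart_in_between)
		toy_restart(&h, &t2); /* stands in for the truncate path */
	return toy_dirty_metadata_ok(&h, &inode_bh);
}
```

In this model toy_setxattr(false) passes the check, while toy_setxattr(true) fails it, which corresponds to the BUG at fs/jbd2/transaction.c:1476.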

We don't have a fix yet and haven't decided how to fix this. For example, moving the final iput for the old EA inode out of the transaction may be problematic given the osd-ldiskfs/ldiskfs layering.

The bug seems to be almost completely coming from upstream. However, credits calculation may be different in ext4 and osd-ldiskfs and the bug may not necessarily reproduce with ext4 alone.



 Comments   
Comment by Andreas Dilger [ 02/Jan/22 ]

There is a mechanism to truncate/free the xattr inode outside of the main transaction to avoid similar problems. That normally is used during inode unlink, but could also be used here.

Comment by Andrew Perepechko [ 11/Jan/22 ]

The crash/corruption can be reproduced with RHEL8/ext4, no Lustre involved:

dd if=/dev/zero of=/tmp/ldiskfs bs=1M count=100
mkfs.ext4 -O ea_inode /tmp/ldiskfs -J size=16 -I 512

mkdir -p /tmp/ldiskfs_m
mount -t ext4 /tmp/ldiskfs /tmp/ldiskfs_m -o loop,commit=600,no_mbcache
touch /tmp/ldiskfs_m/file{1..1024}

V=$(for i in `seq 60000`; do echo -n x ; done)
V1="1$V"
V2="2$V"

while true; do
        setfattr -n user.xattr -v $V /tmp/ldiskfs_m/file{1..1024}
        setfattr -n user.xattr -v $V1 /tmp/ldiskfs_m/file{1..1024} &
        setfattr -n user.xattr -v $V2 /tmp/ldiskfs_m/file{1024..1} &
        wait
done

umount /tmp/ldiskfs_m

[  583.890993] loop0: detected capacity change from 0 to 104857600
[  583.960218] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: commit=600,no_mbcache
[  596.812091] WARNING: CPU: 0 PID: 1799 at fs/ext4/ext4_jbd2.c:286 __ext4_handle_dirty_metadata+0xfe/0x190 [ext4]
[  596.813571] Modules linked in: loop netconsole rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set kmodlve(O) ext4 mbcache jbd2 nf_tables nfnetlink vmwgfx ttm drm_kms_helper intel_rapl_msr intel_rapl_common syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_piix4 pcspkr video sunrpc xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg ata_generic ahci libahci ata_piix libata e1000 crc32c_intel serio_raw
[  596.816151] CPU: 0 PID: 1799 Comm: setfattr Kdump: loaded Tainted: G           O     ---------r-  - 4.18.0-348.lve.el8.x86_64 #1
[  596.816952] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  596.817381] RIP: 0010:__ext4_handle_dirty_metadata+0xfe/0x190 [ext4]
[  596.817820] Code: c0 4c 89 ef 41 bc fb ff ff ff e8 4d ae 04 00 eb 92 3e 80 0b 01 4d 85 ed 75 a3 48 89 df 45 31 e4 e8 b7 c4 f5 ed e9 79 ff ff ff <0f> 0b 48 c7 c2 80 b2 86 c0 45 89 e0 48 89 e9 44 89 fe 4c 89 f7 e8
[  596.819164] RSP: 0018:ffffa57681fbfaa0 EFLAGS: 00010286
[  596.819619] RAX: ffff8c75c5ace000 RBX: ffff8c75def68f70 RCX: 0000000000000000
[  596.820096] RDX: ffff8c75df895a80 RSI: ffff8c75def68f70 RDI: ffff8c75df895a80
[  596.820569] RBP: ffff8c75df895a80 R08: ffff8c75def68f70 R09: 000000000000017c
[  596.821066] R10: ffff8c75c5ace000 R11: 0000000000000000 R12: 00000000ffffff8b
[  596.821547] R13: 0000000000000000 R14: ffffffffc086c080 R15: 0000000000001513
[  596.822040] FS:  00007fc9aa28b740(0000) GS:ffff8c75fec00000(0000) knlGS:0000000000000000
[  596.822514] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  596.822992] CR2: 000056391a5fc0a8 CR3: 000000001dfa8000 CR4: 00000000000506f0
[  596.823468] Call Trace:
[  596.823966]  ext4_do_update_inode+0x495/0x7e0 [ext4]
[  596.824450]  ext4_mark_iloc_dirty+0x32/0x80 [ext4]
[  596.824940]  ext4_xattr_set_handle+0x3b4/0x5b0 [ext4]
[  596.825427]  ext4_xattr_set+0xd0/0x160 [ext4]
[  596.825917]  __vfs_setxattr+0x66/0x80
[  596.826404]  __vfs_setxattr_noperm+0x67/0x1a0
[  596.826889]  vfs_setxattr+0x8f/0x160
[  596.827385]  setxattr+0x11f/0x180
[  596.827884]  ? filename_lookup.part.57+0xe0/0x170
[  596.828368]  ? 0xffffffffc29a90c8
[  596.828846]  path_setxattr+0xbe/0xe0
[  596.829300]  __x64_sys_setxattr+0x27/0x30
[  596.829762]  do_syscall_64+0x5b/0x1a0
[  596.830198]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  596.830604] RIP: 0033:0x7fc9a9b93bee
[  596.831000] Code: 48 8b 0d 9d 42 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 bc 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6a 42 2c 00 f7 d8 64 89 01 48
[  596.832186] RSP: 002b:00007ffda6d36628 EFLAGS: 00000246 ORIG_RAX: 00000000000000bc
[  596.832573] RAX: ffffffffffffffda RBX: 000055c783ce7460 RCX: 00007fc9a9b93bee
[  596.832960] RDX: 000055c783ce7460 RSI: 00007ffda6d3a7ee RDI: 00007ffda6d49b33
[  596.833332] RBP: 00007ffda6d49b33 R08: 0000000000000000 R09: 0000000000000003
[  596.833748] R10: 000000000000ea61 R11: 0000000000000246 R12: 00007ffda6d3a7ee
[  596.834121] R13: 00007ffda6d36750 R14: 0000000000000000 R15: 0000000000000000
[  596.834496] ---[ end trace d7885920de5cb84f ]---
[  596.834922] EXT4-fs: ext4_do_update_inode:5395: aborting transaction: Corrupt filesystem in __ext4_handle_dirty_metadata
[  596.835728] EXT4: jbd2_journal_dirty_metadata failed: handle type 10 started at line 2481, credits 9/6, errcode -117
[  596.835731] EXT4-fs error (device loop0) in ext4_do_update_inode:5411: Corrupt filesystem
[  596.988994] EXT4-fs error (device loop0) in ext4_xattr_set:2489: Corrupt filesystem
[  601.332819] ------------[ cut here ]------------
[  601.333997] kernel BUG at fs/jbd2/transaction.c:1476!
[  601.335146] invalid opcode: 0000 [#1] SMP NOPTI
[  601.336161] CPU: 0 PID: 1800 Comm: setfattr Kdump: loaded Tainted: G        W  O     ---------r-  - 4.18.0-348.lve.el8.x86_64 #1
[  601.336980] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  601.337391] RIP: 0010:jbd2_journal_dirty_metadata+0x1fb/0x250 [jbd2]
[  601.337814] Code: f3 90 48 8b 03 a9 00 00 80 00 75 f4 e9 b3 fe ff ff 44 8b 45 0c 41 83 f8 01 74 8c e9 ec 94 00 00 4c 39 65 30 0f 84 68 fe ff ff <0f> 0b 4d 8b 4a 70 4c 8d 73 02 4d 39 cc 0f 84 33 ff ff ff e9 53 95
[  601.339102] RSP: 0018:ffffa57681fafa70 EFLAGS: 00010207
[  601.339546] RAX: 000000000062c029 RBX: ffff8c75df0913a8 RCX: 0000000000000000
[  601.340014] RDX: ffff8c75df895508 RSI: ffff8c75df0913a8 RDI: ffff8c75df895508
[  601.340480] RBP: ffff8c75df33c8e8 R08: ffff8c75df0913a8 R09: 000000000000017c
[  601.340952] R10: ffff8c75c5ace000 R11: 0000000000000000 R12: ffff8c75df896400
[  601.341417] R13: 0000000000000000 R14: ffffffffc086c080 R15: 0000000000001513
[  601.341894] FS:  00007fbb98090740(0000) GS:ffff8c75fec00000(0000) knlGS:0000000000000000
[  601.342379] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  601.342899] CR2: 000055cb052b80a8 CR3: 000000001bfcc000 CR4: 00000000000506f0
[  601.343408] Call Trace:
[  601.343942]  __ext4_handle_dirty_metadata+0x51/0x190 [ext4]
[  601.344425]  ext4_do_update_inode+0x495/0x7e0 [ext4]
[  601.344916]  ext4_mark_iloc_dirty+0x32/0x80 [ext4]
[  601.345389]  ext4_xattr_set_handle+0x3b4/0x5b0 [ext4]
[  601.345875]  ext4_xattr_set+0xd0/0x160 [ext4]
[  601.346364]  __vfs_setxattr+0x66/0x80
[  601.346842]  __vfs_setxattr_noperm+0x67/0x1a0
[  601.347295]  vfs_setxattr+0x8f/0x160
[  601.347742]  setxattr+0x11f/0x180
[  601.348186]  ? filename_lookup.part.57+0xe0/0x170
[  601.348616]  ? 0xffffffffc29a90c8
[  601.349030]  path_setxattr+0xbe/0xe0
[  601.349423]  __x64_sys_setxattr+0x27/0x30
[  601.349825]  do_syscall_64+0x5b/0x1a0
[  601.350200]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  601.350577] RIP: 0033:0x7fbb97998bee

adilger, thank you for the advice. Yes, we considered this solution, though it may be a bit ugly as we need to pass the inode through the whole ldiskfs/osd callstack. We cannot iput inside ldiskfs since the transaction handle is released in osd. Other solutions, such as calculating the correct number of credits, can be even more complicated and error-prone, though.

Comment by Andreas Dilger [ 20/Jan/22 ]

In discussion with Cory on the LWG call, one straight-forward option is to increase the transaction credits by EXT4_XATTR_SIZE_MAX=64KB so that there will be enough credits to also delete the old xattr inode.

We don't want to increase this for every xattr, since that would bloat the credits for creating files, but it is probably OK to do this for explicit setxattr RPCs, and only if the incoming xattr is large. It is common to increase the size of xattrs, but very unlikely to decrease the size of xattrs, so this would be enough.
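
A rough sketch of that rule, for discussion only. The helper name, the inline_limit parameter and the "one credit per block of a maximum-size xattr" estimate are all assumptions made for illustration; this is not the osd-ldiskfs credit calculation.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_XATTR_SIZE_MAX (64 * 1024) /* 64KB, the limit quoted above */

/*
 * Hypothetical helper: extra journal credits (counted in blocks) reserved
 * for freeing an old xattr inode. Nothing is added unless the request is
 * an explicit setxattr and the incoming value is large.
 */
static unsigned int toy_xattr_extra_credits(bool explicit_setxattr,
					    size_t value_len,
					    size_t inline_limit,
					    unsigned int block_size)
{
	if (!explicit_setxattr || value_len <= inline_limit || block_size == 0)
		return 0;

	/* Assumed worst case: one credit per block of a maximum-size xattr. */
	return (TOY_XATTR_SIZE_MAX + block_size - 1) / block_size;
}
```

With a 4096-byte block size this adds 16 credits for a large explicit setxattr and none for file creation or small xattrs.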

The other alternative is to push the xattr inode to the orphan list (which will not take any extra credits), and then have a separate work queue to unlink the inode in the background. If the inode is in the orphan list it will be handled at mount time in case of a crash. This could be done entirely inside ldiskfs, so no need to push it up to osd-ldiskfs.
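
The deferred-release idea can be sketched in userspace as a simple list that is filled inside the transaction and drained by a background worker. Everything here (toy_inode, toy_defer_release, toy_drain) is a hypothetical model; the orphan list and the real work queue are not represented.

```c
#include <stddef.h>

struct toy_inode {
	int ino;
	int released;            /* set once the final release has run */
	struct toy_inode *next;  /* link in the deferred list */
};

struct toy_deferred_list {
	struct toy_inode *head;
};

/* Called inside the transaction: only records the inode, no truncate. */
static void toy_defer_release(struct toy_deferred_list *l,
			      struct toy_inode *inode)
{
	inode->next = l->head;
	l->head = inode;
}

/*
 * Called by the background worker, outside any transaction handle.
 * Returns the number of inodes released.
 */
static int toy_drain(struct toy_deferred_list *l)
{
	int n = 0;

	while (l->head) {
		struct toy_inode *inode = l->head;

		l->head = inode->next;
		inode->next = NULL;
		inode->released = 1; /* stands in for truncate + final iput */
		n++;
	}
	return n;
}
```

The point of the split is that the expensive part (the truncate) never runs under the setxattr handle, so it cannot force that handle to restart.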

Comment by Peter Jones [ 27/Jan/22 ]

spitzcor when do you expect a patch to be pushed for this issue?

Comment by Cory Spitz [ 27/Jan/22 ]

Soon; should be this week. Panda has a quick fix a la the proposed credit++ approach, and a proposed real fix with a delayed iput for the old EA inode moved out of the transaction. I think he should push up both and we can keep our options open for 2.15.0.

Comment by Gerrit Updater [ 28/Jan/22 ]

"Andrew Perepechko <andrew.perepechko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46358
Subject: LU-15404 ldiskfs: truncate during setxattr leads to kernel panic
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1a6856c8b9d038ac4a6ec0b3857bd11a5f46d583

Comment by James A Simmons [ 28/Jan/22 ]

Is this a RHEL8 only problem?

Comment by Andreas Dilger [ 28/Jan/22 ]

I believe yes, it only affects RHEL8 and later kernels that have the upstream xattr_inode feature, not RHEL7 or earlier that have the CFS/WC version of that feature.

Comment by Andreas Dilger [ 30/Jan/22 ]

Andrew, could you please also submit your patch to the linux-ext4 mailing list so that it can hopefully be accepted there, and maybe also into maintenance kernels?

Comment by Andrew Perepechko [ 31/Jan/22 ]

Andreas, I'm afraid that the LKML guys will prefer to pass the inode to ext4_xattr_set() and iput there after transaction handle release, which won't work very well for us since our transaction starts and ends elsewhere and nice integration with the existing osd truncate list adds even more complexity since we operate with osd objects. Anyway, I'll ask Artem, who has reported this bug to LKML, to send the patch so we can at least push the discussion about possible solutions further.

Comment by Gerrit Updater [ 07/Feb/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46358/
Subject: LU-15404 ldiskfs: truncate during setxattr leads to kernel panic
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e239a14001b62d96c186ae2c9f58402f73e63dcc

Comment by Andrew Perepechko [ 07/Feb/22 ]

artem_blagodarenko, were you able to send the patch upstream? Thank you.

Comment by Peter Jones [ 07/Feb/22 ]

Landed for 2.15

Comment by Gerrit Updater [ 08/Feb/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/46480
Subject: LU-15404 ldiskfs: port truncate fix to Ubuntu 20 HWE
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b57d7fccd92a9867f00e6641d81661cfbb245c48

Comment by Artem Blagodarenko (Inactive) [ 25/Feb/22 ]

>artem_blagodarenko, were you able to send the patch upstream? Thank you.

panda  https://marc.info/?l=linux-ext4&m=164578706102904&w=2

Comment by Artem Blagodarenko (Inactive) [ 30/Mar/22 ]

 
The patch has been applied to ext4:

Applied, thanks!
[1/1] ext4: truncate during setxattr leads to kernel panic commit: c7cded845fc192cc35a1ca37c0cd957ee35abdf8

Comment by Andrew Perepechko [ 30/Mar/22 ]

>The patch has been applied to the ext4

It seems a deadlock is possible when ext4 unmount races with some p9 fs operation.

Comment by Gerrit Updater [ 30/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46480/
Subject: LU-15404 ldiskfs: port truncate fix to Ubuntu 20 HWE
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 968a050f94c21aa48de9a5d9c034a4216e18aa46

Comment by Alex Zhuravlev [ 14/Mar/23 ]

I tend to think this patch causes a specific issue at umount:

PID: 18     TASK: ffff8f2f07dc44c0  CPU: 1   COMMAND: "kworker/1:0"
 #0 [ffff8f2f07dcbb30] __schedule at ffffffff965a232d
    /tmp/kernel/kernel/sched/core.c: 3109
 #1 [ffff8f2f07dcbbb8] schedule at ffffffff965a2748
    /tmp/kernel/./arch/x86/include/asm/preempt.h: 84
 #2 [ffff8f2f07dcbbc8] schedule_timeout at ffffffff965a7559
    /tmp/kernel/kernel/time/timer.c: 1840
 #3 [ffff8f2f07dcbc98] wait_for_common at ffffffff965a3081
    /tmp/kernel/kernel/sched/completion.c: 86
 #4 [ffff8f2f07dcbce8] flush_workqueue at ffffffff960cdf9f
    /tmp/kernel/kernel/workqueue.c: 2828
 #5 [ffff8f2f07dcbdb8] ldiskfs_put_super at ffffffffc0b32a7e [ldiskfs]
    /home/lustre/master-mine/ldiskfs/super.c: 1028
 #6 [ffff8f2f07dcbdf0] generic_shutdown_super at ffffffff961d17cf
    /tmp/kernel/./include/linux/compiler.h: 276
 #7 [ffff8f2f07dcbe08] kill_block_super at ffffffff961d1a9c
    /tmp/kernel/fs/super.c: 1443
 #8 [ffff8f2f07dcbe20] deactivate_locked_super at ffffffff961d1e74
    /tmp/kernel/fs/super.c: 340
 #9 [ffff8f2f07dcbe38] cleanup_mnt at ffffffff961f0fc6
    /tmp/kernel/fs/namespace.c: 115
#10 [ffff8f2f07dcbe48] delayed_mntput at ffffffff961f1021
    /tmp/kernel/fs/namespace.c: 1156
#11 [ffff8f2f07dcbe58] process_one_work at ffffffff960cf8cf
    /tmp/kernel/kernel/workqueue.c: 2266
#12 [ffff8f2f07dcbed0] worker_thread at ffffffff960cfae5
    /tmp/kernel/./include/linux/compiler.h: 276
#13 [ffff8f2f07dcbf10] kthread at ffffffff960d5199
    /tmp/kernel/kernel/kthread.c: 340
#14 [ffff8f2f07dcbf50] ret_from_fork at ffffffff9660019f
    /tmp/kernel/arch/x86/entry/entry_64.S: 325

i.e. in the worker thread (processing workqueues), ldiskfs_put_super() calls flush_workqueue() and ends up waiting for itself to finish?

Comment by Andreas Dilger [ 14/Mar/23 ]

Alex, it looks like the workqueue is flushed prior to unmount, but possibly the iput of internal inodes during unmount is delayed again? We probably need a check to avoid adding new inodes to the queue once unmount has started?

Comment by Andrew Perepechko [ 15/Mar/23 ]

I don't think the umount thread waits for itself. However, this patch was changed to use specific ext4 workqueues to avoid deadlocks on global workqueues when upstreaming. I believe Artem was the last one who looked into upstreaming this patch and may have its latest version.

Comment by Alex Zhuravlev [ 15/Mar/23 ]

static void ldiskfs_put_super(struct super_block *sb)
{
        struct ldiskfs_sb_info *sbi = LDISKFS_SB(sb);
        struct ldiskfs_super_block *es = sbi->s_es;
        struct buffer_head **group_desc;
        struct flex_groups **flex_groups;
        int aborted = 0;
        int i, err;

        flush_scheduled_work();

...

static inline void flush_scheduled_work(void)
{
        flush_workqueue(system_wq);
.....

void flush_workqueue(struct workqueue_struct *wq)
{
...
        wait_for_completion(&this_flusher.done);

I tend to think that calling flush_workqueue() from the context of kworker is not good.

Comment by Alex Zhuravlev [ 15/Mar/23 ]

attaching full trace for the case, please have a look.

Comment by Andreas Dilger [ 16/Mar/23 ]

Alex, do you know what was holding the filesystem reference for delayed_mntput in the kworker thread?

I'm wondering if we need to put the flush_scheduled_work() call in some "pre cleanup" superblock method for newer kernels, rather than having it in ldiskfs_put_super()?

The most recent version of this patch that Artem pushed to Linux-ext4 is at:
https://patchwork.ozlabs.org/project/linux-ext4/patch/20220711145735.53676-1-artem.blagodarenko@gmail.com/

This used a dedicated per-filesystem work queue rather than the global work queue. That potentially avoids long waits or deadlocks during unmount when multiple filesystems are adding inodes to a single queue and then they all need to wait for the entire queue to empty, as pointed out here:
https://lore.kernel.org/all/385ce718-f965-4005-56b6-34922c4533b8@I-love.SAKURA.ne.jp/
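
To make the difference between the two queue choices concrete, here is a deliberately simplified model of the suspected hang. It assumes a flush waits for every in-flight item on the target queue, including the item that issued the flush. toy_wq, toy_ctx and toy_flush_can_complete are hypothetical names, not the kernel workqueue API.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_wq {
	const char *name;
};

/*
 * The queue whose work item the calling thread is currently executing,
 * or NULL when the caller is not a worker thread.
 */
struct toy_ctx {
	const struct toy_wq *running_on;
};

/*
 * Assumed rule: a flush waits for every in-flight item on wq. If the
 * caller is itself an item on wq, that wait includes the caller, so the
 * flush can never complete.
 */
static bool toy_flush_can_complete(const struct toy_ctx *ctx,
				   const struct toy_wq *wq)
{
	return ctx->running_on != wq;
}
```

Under this model, a kworker running delayed_mntput on the global queue cannot finish a flush of that same global queue, but it can flush a dedicated per-filesystem queue.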

Comment by Gerrit Updater [ 21/Mar/23 ]

"Andrew Perepechko <andrew.perepechko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50354
Subject: LU-15404 ldiskfs: use per-filesystem workqueues to avoid deadlocks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ec3a8efce2a3328a89c9e0ef7eb0a3719f31290b

Comment by Alex Zhuravlev [ 21/Mar/23 ]

adilger sorry, was busy with other tickets.
panda thanks! will test your patch quickly and report back.

Comment by Gerrit Updater [ 04/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50354/
Subject: LU-15404 ldiskfs: use per-filesystem workqueues to avoid deadlocks
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 616fa9b581798e1b66e4d36113c29531ad7e41a0

Comment by Gerrit Updater [ 10/Apr/23 ]

"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50586
Subject: LU-15404 ldiskfs: use per-filesystem workqueues to avoid deadlocks
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: f8b66bbe831c3f5fcf06b3ad22a5449fa004ff74

Comment by Gerrit Updater [ 29/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50586/
Subject: LU-15404 ldiskfs: use per-filesystem workqueues to avoid deadlocks
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 9ab613631b5833ba7e7a578fdb9819ebc593ab3c

Comment by Gerrit Updater [ 15/Jun/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51335
Subject: LU-15404 ldiskfs: fix truncate during setxattr for el7.9
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9595ef28e16eeb110844c952e6a58079ded40500

Comment by Gerrit Updater [ 28/Jun/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51335/
Subject: LU-15404 ldiskfs: fix truncate during setxattr for el7.9
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 471ce3d95651ca06209a76973cae3bbdb5b6aa2f

Generated at Sat Feb 10 03:18:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.