[LU-15726] Introduce / use min journal credit for ldiskfs Created: 07/Apr/22 Updated: 21/Jan/24 Resolved: 21/Jan/24 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shaun Tancheff | Assignee: | Shaun Tancheff |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
lustre/ldiskfs consumes more journal credits than ext4. Try to play nice with jbd2 and increase the requested journal credits as needed. |
| Comments |
| Comment by Gerrit Updater [ 07/Apr/22 ] |
|
"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47009 |
| Comment by Shaun Tancheff [ 08/Apr/22 ] |
|
Not an improvement |
| Comment by James A Simmons [ 05/May/22 ] |
|
Starting with 5.10 kernels, the way xattr credits are handled has changed, so ext4-xattr-disable-credits-check.patch is no longer enough to work around this issue. We need a real solution to this problem, so I'm reopening this ticket. |
| Comment by Xinliang Liu [ 31/May/23 ] |
|
I tried the latest b2_15 branch (2.15.3-RC1) on a 5.10 kernel. Without patch https://review.whamcloud.com/47009 it can't mount a client (all-in-one) or start the MDS (multi-node). So branch b2_15 must be missing one or more patches from branch master that fix this credit-related issue. Which ones? Here is the warning log from the kernel:
[ 8189.170458] ------------[ cut here ]------------
[ 8189.170504] WARNING: CPU: 0 PID: 115468 at /tmp/rpmbuild-lustre-openeuler-AL963B8M/BUILD/lustre-2.15.3_RC1_5_g4aaae55_dirty/ldiskfs/ext4_jbd2.c:336 __ldiskfs_handle_dirty_metadata+0x18c/0x2e0 [ldiskfs]
[ 8189.170506] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) osd_ldiskfs(OE) lquota(OE) loop ldiskfs(OE) lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod crc32_generic rfkill sunrpc virtio_balloon vfat fat sch_fq_codel fuse ext4 mbcache jbd2 virtio_gpu virtio_net virtio_dma_buf net_failover virtio_blk failover ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_pci virtio_pci_modern_dev virtio_mmio virtio_rng virtio virtio_ring aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher [last unloaded: libcfs]
[ 8189.170583] CPU: 0 PID: 115468 Comm: mdt00_001 Kdump: loaded Tainted: G W OE 5.10.0-152.0.0.78.oe2203sp2.aarch64 #1
[ 8189.170585] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 8189.170588] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 8189.170610] pc : __ldiskfs_handle_dirty_metadata+0x18c/0x2e0 [ldiskfs]
[ 8189.170631] lr : __ldiskfs_handle_dirty_metadata+0x9c/0x2e0 [ldiskfs]
[ 8189.170632] sp : ffff8000109e36a0
[ 8189.170634] x29: ffff8000109e36c0 x28: 0000000000000000
[ 8189.170638] x27: 0000000000000002 x26: 0000000000000001
[ 8189.170641] x25: 0000000000000001 x24: ffff0bf31c309af8
[ 8189.170645] x23: 0000000000000372 x22: ffffde4408547508
[ 8189.170648] x21: 00000000ffffffe4 x20: ffff0bf318c45a10
[ 8189.170651] x19: ffff0bf31c1ff1a0 x18: 0000000000000020
[ 8189.170654] x17: 0000000000000000 x16: ffffde4435035f10
[ 8189.170658] x15: ffffffffffffffff x14: 0000000000000000
[ 8189.170661] x13: 0000000000191000 x12: 0000000000000000
[ 8189.170665] x11: 0000000000000000 x10: 00000000ffffffff
[ 8189.170668] x9 : ffffde44084c254c x8 : ffff0bf34f812000
[ 8189.170671] x7 : 0000000000000000 x6 : 0000000000000000
[ 8189.170674] x5 : 61c8864680b583eb x4 : 0000000000116011
[ 8189.170678] x3 : ffff0bf31f327800 x2 : 0000000000000001
[ 8189.170681] x1 : 00000000007be000 x0 : 0000000000000030
[ 8189.170685] Call trace:
[ 8189.170706] __ldiskfs_handle_dirty_metadata+0x18c/0x2e0 [ldiskfs]
[ 8189.170727] ldiskfs_getblk+0x150/0x210 [ldiskfs]
[ 8189.170748] ldiskfs_bread+0x1c/0xd4 [ldiskfs]
[ 8189.170765] osd_ldiskfs_write_record+0x4a4/0x8fc [osd_ldiskfs]
[ 8189.170779] osd_write+0x104/0x6e4 [osd_ldiskfs]
[ 8189.170842] dt_record_write+0x38/0xf0 [obdclass]
[ 8189.170943] tgt_client_data_write+0x12c/0x180 [ptlrpc]
[ 8189.171012] tgt_client_data_update+0x4fc/0x86c [ptlrpc]
[ 8189.171079] tgt_client_new+0x610/0xcb0 [ptlrpc]
[ 8189.171117] mdt_obd_connect+0x5b0/0x940 [mdt]
[ 8189.171370] target_handle_connect+0x10e4/0x3b00 [ptlrpc]
[ 8189.171465] tgt_request_handle+0x174/0xd9c [ptlrpc]
[ 8189.171545] ptlrpc_server_handle_request.isra.0+0x3d4/0x11fc [ptlrpc]
[ 8189.171613] ptlrpc_main+0xdb0/0x1670 [ptlrpc]
[ 8189.171620] kthread+0x108/0x13c
[ 8189.171624] ret_from_fork+0x10/0x18
[ 8189.171626] ---[ end trace ce1929bc2ec68092 ]---
[ 8189.171631] LDISKFS-fs: ldiskfs_getblk:882: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
[ 8189.174222] LDISKFS-fs error (device dm-0): ldiskfs_getblk:882: inode #91: block 31655: comm mdt00_001: journal_dirty_metadata failed: handle type 0 started at line 1982, credits 7/0, errcode -28
[ 8189.178341] Aborting journal on device dm-0-8.
[ 8189.179762] LDISKFS-fs (dm-0): Remounting filesystem read-only
[ 8189.181278] LustreError: 115468:0:(osd_io.c:2123:osd_ldiskfs_write_record()) lustre-MDT0000: error reading offset 8192 (block 2, size 128, offs 8192), credits 7/1: rc = -28
[ 8189.184623] LDISKFS-fs error (device dm-0) in osd_trans_stop:2092: error 28
[ 8189.184635] LustreError: 115449:0:(osd_handler.c:1790:osd_trans_commit_cb()) transaction @0x00000000ce92c156 commit error: 2 |
| Comment by Xinliang Liu [ 27/Jul/23 ] |
|
After a long bisect on branch master, I found that branch b2_15 with commit ef90a02d12 runs on kernel 5.10 without crashing, but I don't know why. Does this issue still exist on 5.10+ kernels for non-root-owned files? Does anyone have any ideas on this? @Alex Zhuravlev Anyway, I cherry-picked it to branch b2_15 and noted that it is related to this issue: https://review.whamcloud.com/c/fs/lustre-release/+/51776 |
| Comment by James A Simmons [ 20/Jan/24 ] |
|
I noticed the same thing on Ubuntu 5.15 kernels. Patch 51776 fixes this issue. Shaun, can you close this ticket? |
| Comment by Shaun Tancheff [ 21/Jan/24 ] |
|
Resolved with |