[LU-13765] ldiskfs_mb_mark_diskspace_used:3472: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata Created: 09/Jul/20 Updated: 28/Apr/22 Resolved: 29/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Shuichi Ihara | Assignee: | Arshad Hussain |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | master (commit f3c95bb) |
| Issue Links: |
|
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The OSS showed the following traces and remounted its OSTs read-only when a client called fallocate().

Jul 8 23:44:21 es400nv-vm1 kernel: ------------[ cut here ]------------
Jul 8 23:44:21 es400nv-vm1 kernel: WARNING: CPU: 0 PID: 6870 at /tmp/rpmbuild-lustre-root-DUtheNY7/BUILD/lustre-2.13.54_60_gf3c95bb/ldiskfs/ext4_jbd2.c:266 __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) ksocklnd(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mlx4_core(OE) sunrpc iTCO_wdt ppdev iTCO_vendor_support nfit libnvdimm iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr i2c_i801 joydev parport_pc lpc_ich sg parport i6300esb ip_tables ext4 mbcache jbd2 sr_mod sd_mod cdrom crc_t10dif crct10dif_generic bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_net
Jul 8 23:44:21 es400nv-vm1 kernel: virtio_blk virtio_scsi(OE) ahci drm libahci mlx5_core(OE) libata crct10dif_pclmul crct10dif_common mlxfw(OE) ptp crc32c_intel iavf pps_core virtio_pci devlink serio_raw virtio_ring mlx_compat(OE) virtio drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
Jul 8 23:44:21 es400nv-vm1 kernel: CPU: 0 PID: 6870 Comm: ll_ost00_000 Kdump: loaded Tainted: G OE ------------ T 3.10.0-1062.1.1.el7_lustre.ddn3.x86_64 #1
Jul 8 23:44:21 es400nv-vm1 kernel: Hardware name: DDN SFA400NVXE, BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
Jul 8 23:44:21 es400nv-vm1 kernel: Call Trace:
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb57792c2>] dump_stack+0x19/0x1b
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb509a878>] __warn+0xd8/0x100
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb509a9bd>] warn_slowpath_null+0x1d/0x20
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc107c5b2>] __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc108b88b>] ldiskfs_mb_mark_diskspace_used+0x2bb/0x510 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc108fb10>] ldiskfs_mb_new_blocks+0x350/0xb20 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc10932e5>] ? __read_extent_tree_block+0x55/0x1e0 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb522454b>] ? __kmalloc+0x1eb/0x230
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc1094d44>] ? ldiskfs_ext_find_extent+0x134/0x340 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc10987b6>] ldiskfs_ext_map_blocks+0x4a6/0xf60 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb5287fff>] ? bdev_evict_inode+0x8f/0xc0
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc105c53c>] ldiskfs_map_blocks+0x12c/0x6a0 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc109382c>] ? ldiskfs_alloc_file_blocks.isra.36+0xbc/0x310 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc109384f>] ldiskfs_alloc_file_blocks.isra.36+0xdf/0x310 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc109ac8d>] ldiskfs_fallocate+0x85d/0x8f0 [ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb52b74da>] ? __dquot_initialize+0x3a/0x240
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc02fba93>] ? jbd2__journal_start+0xf3/0x1f0 [jbd2]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc1179ac3>] osd_fallocate+0x243/0x530 [osd_ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc115beb5>] ? osd_trans_start+0x235/0x4e0 [osd_ldiskfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc1562077>] ofd_object_fallocate+0x5a7/0x810 [ofd]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc154d196>] ofd_fallocate_hdl+0x236/0x960 [ofd]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc0bf35af>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc0c5ecda>] tgt_request_handle+0x96a/0x1640 [ptlrpc]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc0c36111>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc07c902e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc0c002c6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb50d20a0>] ? task_rq_unlock+0x20/0x20
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb50d2643>] ? __wake_up+0x13/0x20
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc0c047f4>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb50d3efe>] ? finish_task_switch+0x4e/0x1c0
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffc0c03c40>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb50c50d1>] kthread+0xd1/0xe0
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb50c5000>] ? insert_kthread_work+0x40/0x40
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb578cd1d>] ret_from_fork_nospec_begin+0x7/0x21
Jul 8 23:44:21 es400nv-vm1 kernel: [<ffffffffb50c5000>] ? insert_kthread_work+0x40/0x40
Jul 8 23:44:21 es400nv-vm1 kernel: ---[ end trace a9da39c17d13d034 ]---
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs: ldiskfs_mb_mark_diskspace_used:3472: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS: jbd2_journal_dirty_metadata failed: handle type 0 started at line 1914, credits 41/0, errcode -28
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs warning (device sdf): ldiskfs_mb_new_blocks:5099: Updating bitmap error: [err -28] [pa ffff8aed20b08068] [phy 1572864] [logic 1277952] [len 32768] [free 32768] [error 1] [inode 115]
Jul 8 23:44:21 es400nv-vm1 kernel: Quota error (device sdf): qtree_write_dquot: dquota write failed
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs error (device sdf) in ldiskfs_write_dquot:5561: error 28
Jul 8 23:44:21 es400nv-vm1 kernel: Aborting journal on device sdf-8.
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs (sdf): Remounting filesystem read-only
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs error (device sdf) in ldiskfs_reserve_inode_write:5332: Journal has aborted
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs error (device sdf) in ldiskfs_alloc_file_blocks:4773: error 28
Jul 8 23:44:21 es400nv-vm1 kernel: Lustre: 6870:0:(osd_internal.h:1237:osd_trans_exec_op()) ai400-OST0000: opcode 7: before 0 < left 8, rollback = 7
Jul 8 23:44:21 es400nv-vm1 kernel: Lustre: 6870:0:(osd_handler.c:1809:osd_trans_dump_creds()) create: 0/0/0, destroy: 0/0/0
Jul 8 23:44:21 es400nv-vm1 kernel: Lustre: 6870:0:(osd_handler.c:1816:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 2/15/0
Jul 8 23:44:21 es400nv-vm1 kernel: Lustre: 6870:0:(osd_handler.c:1826:osd_trans_dump_creds()) write: 1/8/0, punch: 0/0/0, quota 3/3/0
Jul 8 23:44:21 es400nv-vm1 kernel: Lustre: 6870:0:(osd_handler.c:1833:osd_trans_dump_creds()) insert: 0/0/0, delete: 0/0/0
Jul 8 23:44:21 es400nv-vm1 kernel: Lustre: 6870:0:(osd_handler.c:1840:osd_trans_dump_creds()) ref_add: 0/0/0, ref_del: 0/0/0
Jul 8 23:44:21 es400nv-vm1 kernel: LustreError: 6870:0:(osd_io.c:1785:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
Jul 8 23:44:21 es400nv-vm1 kernel: LustreError: 6870:0:(osd_handler.c:2017:osd_trans_stop()) ai400-OST0000: failed in transaction hook: rc = -30
Jul 8 23:44:21 es400nv-vm1 kernel: LustreError: 6864:0:(osd_handler.c:1723:osd_trans_commit_cb()) transaction @0xffff8af8613a8300 commit error: 2
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs error (device sdf) in osd_trans_stop:2024: error 28
Jul 8 23:44:21 es400nv-vm1 kernel: LustreError: 6870:0:(osd_handler.c:2027:osd_trans_stop()) ai400-OST0000: failed to stop transaction: rc = -28
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs: ldiskfs_mb_mark_diskspace_used:3472: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS: jbd2_journal_dirty_metadata failed: handle type 0 started at line 1914, credits 41/0, errcode -28
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs warning (device sdg): ldiskfs_mb_new_blocks:5099: Updating bitmap error: [err -28] [pa ffff8aee29932f70] [phy 4751360] [logic 1212416] [len 32768] [free 32768] [error 1] [inode 115]
Jul 8 23:44:21 es400nv-vm1 kernel: Quota error (device sdg): qtree_write_dquot: dquota write failed
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs error (device sdg) in ldiskfs_write_dquot:5561: error 28
Jul 8 23:44:21 es400nv-vm1 kernel: Aborting journal on device sdg-8.
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs (sdg): Remounting filesystem read-only
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs error (device sdg) in ldiskfs_reserve_inode_write:5332: Journal has aborted
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs error (device sdg) in ldiskfs_alloc_file_blocks:4773: error 28
Jul 8 23:44:21 es400nv-vm1 kernel: LDISKFS-fs error (device sdg) in osd_trans_stop:2024: error 28
Jul 8 23:44:24 es400nv-vm1 kernel: LDISKFS-fs warning (device sdf): kmmpd:186: kmmpd being stopped since filesystem has been remounted as readonly.
Jul 8 23:44:24 es400nv-vm1 kernel: LDISKFS-fs warning (device sdg): kmmpd:186: kmmpd being stopped since filesystem has been remounted as readonly.
Jul 8 23:44:25 es400nv-vm1 kernel: LustreError: 6874:0:(ofd_dev.c:1815:ofd_destroy_hdl()) ai400-OST0000: error destroying object [0x100000000:0x12:0x0]: -30
Jul 8 23:44:25 es400nv-vm1 kernel: LustreError: 6874:0:(ofd_dev.c:1815:ofd_destroy_hdl()) Skipped 31 previous similar messages
Jul 8 23:44:46 es400nv-vm1 kernel: LustreError: 6870:0:(tgt_lastrcvd.c:1160:tgt_client_del()) ai400-OST0000: failed to update server data, skip client 9a3ea305-a278-48e5-87a1-1f3e457a205c zeroing, rc -30

Here is a reproducer: enabling fallocate in fio triggers the problem.
# fio --name=write --directory=/ai400/fio --rw=write --ioengine=libaio --bs=128K --numjobs=32 --filesize=200G --filename=test --time_based --runtime=100 --fallocate=1 |
| Comments |
| Comment by Andreas Dilger [ 09/Jul/20 ] |
|
It seems that if fallocate() is called with a large enough range, the ext4_fallocate() code will restart the transaction internally. That makes sense, since the number of blocks needed for the operation may be very large. I don't think there is any atomicity requirement for fallocate() of the blocks themselves, so long as the size (if needed) is only changed at the end when the blocks are actually allocated. It seems like the right thing to do is extend the transaction after it runs low on credits, as is already done for truncate. |
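For reference, a simplified paraphrase (not the exact kernel source) of the per-chunk allocation loop in the ext4/ldiskfs fallocate path that the comment above refers to: upstream, each iteration starts and stops its own journal handle sized for one chunk, which is the internal "restart" behaviour. Function and flag names follow the upstream ext4 ones; the helper name ending in _sketch is illustrative.

/*
 * Sketch of ext4's fallocate allocation loop (ldiskfs is the same code
 * with the ext4_ -> ldiskfs_ prefix).  Assumes the fs/ext4 headers.
 */
#include "ext4.h"
#include "ext4_jbd2.h"

static int alloc_file_blocks_sketch(struct inode *inode, ext4_lblk_t lblk,
				    ext4_lblk_t len, int flags)
{
	/* Credits are estimated for one chunk, not the whole range. */
	unsigned int credits = ext4_chunk_trans_blocks(inode, len);
	struct ext4_map_blocks map = { .m_lblk = lblk, .m_len = len };
	handle_t *handle;
	int ret = 0;

	while (ret >= 0 && map.m_len) {
		/*
		 * A fresh handle per chunk -- unless the caller (here: the
		 * Lustre OSD) already holds one on this task, in which case
		 * JBD2 hands that same handle back and its credits are used.
		 */
		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
		if (IS_ERR(handle))
			return PTR_ERR(handle);

		/* flags are the caller's EXT4_GET_BLOCKS_* allocation flags */
		ret = ext4_map_blocks(handle, inode, &map, flags);
		if (ret > 0) {
			map.m_lblk += ret;
			map.m_len -= ret;
		}
		ext4_journal_stop(handle);
	}
	return ret < 0 ? ret : 0;
}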
| Comment by Andreas Dilger [ 09/Jul/20 ] |
|
Arshad, would you be able to look into this? Shuichi, you could work around this for now by temporarily changing the client and/or OFD to return -EOPNOTSUPP for all fallocate calls. |
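Purely for illustration, a minimal sketch of what that temporary OFD-side workaround could look like, assuming the ofd_fallocate_hdl() handler seen in the stack trace takes a struct tgt_session_info like the other OST handlers (treat the exact signature as an assumption, not the real code):

/* Hypothetical stop-gap in lustre/ofd/ofd_dev.c, not the real fix. */
static int ofd_fallocate_hdl(struct tgt_session_info *tsi)
{
	/* Refuse all fallocate requests until the credit issue is fixed. */
	return -EOPNOTSUPP;
}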
| Comment by Arshad Hussain [ 09/Jul/20 ] |
|
Hi Andreas, - Yes, I will look into this. |
| Comment by Gerrit Updater [ 12/Jul/20 ] |
|
Arshad Hussain (arshad.super@gmail.com) uploaded a new patch: https://review.whamcloud.com/39342 |
| Comment by Wang Shilong (Inactive) [ 13/Jul/20 ] |
|
I think the problem is that ext4_fallocate() starts and stops its own journal handle, but Lustre has already started a journal handle, so we have a nested journal handle here and the JBD2 logic just returns the previously started handle in this case. Since the allocation may take several loop iterations to finish, the journal credits may not be enough. It also looks pretty hard to predict how many credits a fallocate operation might use: on a fresh filesystem we could expect something like 128MB per extent, but in the worst case it could be 4KB blocks per extent... And if we always blindly reserve a large number of credits for fallocate, that could really hurt performance. osd_punch has a similar problem, but it is a bit different: there we can just add the inode to the orphan list and move the work out of the main transaction. I haven't got a better idea for this; maybe bzzz has some better ideas for how to solve it? |
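To make the nesting described above concrete, here is a paraphrased sketch (not the exact JBD2 source) of what happens when ldiskfs_fallocate() tries to start a handle while the transaction started by osd_trans_start() is still running on the same task:

#include <linux/err.h>
#include <linux/jbd2.h>

/*
 * Paraphrased JBD2 nested-handle behaviour: if the task already owns a
 * handle (tracked in current->journal_info), a nested journal_start just
 * takes a reference on it and returns it, so no new credits are reserved.
 * The inner ldiskfs allocation loop therefore consumes the credits that
 * osd_trans_start() reserved for the whole Lustre transaction, and
 * eventually jbd2_journal_dirty_metadata() fails with -28 (-ENOSPC).
 */
static handle_t *nested_journal_start_sketch(journal_t *journal, int nblocks)
{
	handle_t *handle = journal_current_handle();

	if (handle) {
		/* Nested call: reuse the outer handle and its credits. */
		handle->h_ref++;
		return handle;
	}

	/*
	 * Otherwise JBD2 allocates a fresh handle and reserves nblocks
	 * credits against the running transaction (details omitted).
	 */
	return ERR_PTR(-ENOSYS); /* placeholder for the real start path */
}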
| Comment by Andreas Dilger [ 13/Jul/20 ] |
|
Shilong, I agree we can't predict the total number of credits needed in advance. However, we can extend the journal handle after the ->fallocate() method is called if there are not enough credits left at that point. This is the same as with ->truncate(). The patch https://review.whamcloud.com/39342 is not quite correct, but is going in the right direction. |
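A minimal sketch of the "extend or restart" pattern being discussed, using the two-argument jbd2_journal_extend()/jbd2_journal_restart() calls as found in the el7-era kernels this report is from; the helper name and where it is called from are assumptions for illustration, not the actual patch under review:

#include <linux/jbd2.h>

/*
 * Sketch only: after an allocation step has consumed credits, check what
 * is left on the running handle and either extend or restart it before
 * dirtying more metadata.  Restarting is acceptable because fallocate of
 * the blocks themselves has no atomicity requirement (see the comments
 * above); truncate uses the same trick.
 */
static int extend_or_restart_sketch(handle_t *handle, int needed)
{
	int err;

	if (handle->h_buffer_credits >= needed)
		return 0;			/* enough credits left */

	/* Try to add credits to the current handle first. */
	err = jbd2_journal_extend(handle, needed - handle->h_buffer_credits);
	if (err == 0)
		return 0;			/* extended in place */
	if (err < 0)
		return err;

	/*
	 * Could not extend (transaction already too full or committing):
	 * commit what has been done so far and start a fresh handle with
	 * the needed credits.
	 */
	return jbd2_journal_restart(handle, needed);
}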
| Comment by Arshad Hussain [ 19/Jul/20 ] |
|
Hi Wang Shilong,
> And it looks pretty hard to predict how many credits we might use for a fallocate operation; for example, on a fresh filesystem we could expect 128MB per extent, but in the worst case it could be 4KB blocks per extent...
Yes, I can see that now. I misunderstood Andreas's direction/hints on this. I am setting up a local test bed for this now. |
| Comment by Alex Zhuravlev [ 24/Aug/20 ] |
How is this possible? I think Andreas's idea to restart the transaction is the correct approach - this is what truncate does. Plus, Lustre doesn't require the whole thing to be atomic. |
| Comment by Arshad Hussain [ 25/Aug/20 ] |
|
Hi Andreas, Thanks Arshad |
| Comment by Shuichi Ihara [ 11/Sep/20 ] |
|
Any updates or progress on patch https://review.whamcloud.com/39342 ? fallocate is enabled in fio by default, so it is very annoying that all OSTs get remounted read-only when a client calls fallocate(). |
| Comment by Arshad Hussain [ 11/Sep/20 ] |
|
Hi Shuichi - Apologies for the delay. Now, at least in my local testing, it withstands 300MB with fallocate. Unfortunately I do not have a larger OST to test with. I should be pushing the patch to Andreas/Wang Shilong/Alex for review in the next 30 minutes or so. Will try to get this fix in very soon.
# fio --name=write --directory=/mnt/lustre/fio --rw=write --ioengine=libaio --bs=128K --numjobs=1 --filesize=300M --filename=test2 --fallocate=1
write: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 process
write: Laying out IO file (1 file / 300MiB)
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=24.2MiB/s][r=0,w=193 IOPS][eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=24383: Wed Sep 9 15:01:59 2020
write: IOPS=248, BW=31.1MiB/s (32.6MB/s)(300MiB/9651msec)
slat (usec): min=529, max=67978, avg=4008.99, stdev=3956.89
clat (nsec): min=2114, max=66329, avg=4229.04, stdev=1932.25
lat (usec): min=531, max=67989, avg=4015.17, stdev=3957.63
clat percentiles (nsec):
| 1.00th=[ 2352], 5.00th=[ 2960], 10.00th=[ 3216], 20.00th=[ 3504],
| 30.00th=[ 3664], 40.00th=[ 3824], 50.00th=[ 3984], 60.00th=[ 4128],
| 70.00th=[ 4384], 80.00th=[ 4704], 90.00th=[ 5408], 95.00th=[ 6176],
| 99.00th=[ 7520], 99.50th=[ 8640], 99.90th=[32640], 99.95th=[33536],
| 99.99th=[66048]
bw ( KiB/s): min=19968, max=63744, per=100.00%, avg=31847.79, stdev=10501.34, samples=19
iops : min= 156, max= 498, avg=248.74, stdev=82.05, samples=19
lat (usec) : 4=51.71%, 10=47.96%, 20=0.12%, 50=0.17%, 100=0.04%
cpu : usr=0.09%, sys=14.57%, ctx=2274, majf=1, minf=31
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2400,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=31.1MiB/s (32.6MB/s), 31.1MiB/s-31.1MiB/s (32.6MB/s-32.6MB/s), io=300MiB (315MB), run=9651-9651msec
#
Test 2:
# fio --name=write --directory=/mnt/lustre/fio --rw=write --ioengine=libaio --bs=128K --numjobs=3 --filesize=100M --filename=test2 --fallocate=1
write: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 3 processes
<SNIP>
Run status group 0 (all jobs):
WRITE: bw=17.8MiB/s (18.7MB/s), 9113KiB/s-9637KiB/s (9331kB/s-9868kB/s), io=200MiB (210MB), run=10626-11237msec
#
|
| Comment by Shuichi Ihara [ 12/Sep/20 ] |
|
Thanks Arshad! let me test your updated patch. |
| Comment by Qian Yingjin [ 13/Sep/20 ] |
|
Hi Ihara, I have updated the patch, tested it locally, and it succeeded. Please try it.
Regards, Qian |
| Comment by Gerrit Updater [ 29/Oct/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39342/ |
| Comment by Peter Jones [ 29/Oct/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 01/Nov/20 ] |
|
Arshad Hussain (arshad.super@gmail.com) uploaded a new patch: https://review.whamcloud.com/40509 |
| Comment by Gerrit Updater [ 07/Nov/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40509/ |