[LU-13973] 4K random write performance impacts on large sparse files Created: 20/Sep/20  Updated: 29/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Shuichi Ihara Assignee: Qian Yingjin
Resolution: Unresolved Votes: 0
Labels: None
Environment:

master


Attachments: 128g-4krandomwrite.svg, 1g-4krandomwrite.svg
Issue Links:
Related
is related to LU-13765 ldiskfs_mb_mark_diskspace_used:3472: ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Here is the tested workload.

4k random write, FPP (file per process)

[randwrite]
ioengine=libaio
rw=randwrite
blocksize=4k
iodepth=4
direct=1
size=${SIZE}
runtime=60
numjobs=16
group_reporting
directory=/ai400x/out
create_serialize=0
filename_format=f.$jobnum.$filenum

In this test case, 2 clients each run 16 fio processes, and each fio process does 4k random writes to a different file.
However, when the file size is large (128GB in this case), there is a huge performance impact. Here are the two test results.

1GB file

# SIZE=1g /work/ihara/fio.git/fio --client=hostfile randomwrite.fio

write: IOPS=16.8k, BW=65.5MiB/s (68.7MB/s)(3930MiB/60004msec); 0 zone resets
 

128GB file

# SIZE=128g /work/ihara/fio.git/fio --client=hostfile randomwrite.fio

write: IOPS=2894, BW=11.3MiB/s (11.9MB/s)(679MiB/60039msec)
 

I observed both cases and collected CPU profiles on the OSS. In the 128GB file case there was heavy spinlock contention in ldiskfs_mb_new_block() and ldiskfs_mb_normalized_request(), accounting for 89% (14085/15823 samples) of the total ost_io_xx() time, versus 20% (1895/9296 samples) in the 1GB file case. Please see the attached flame graphs.
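
For reference, a CPU profile like this can be captured on the OSS with perf plus the FlameGraph scripts (a minimal sketch under assumed paths and sampling settings, not the exact commands used for the attached graphs):

# Sample all CPUs with call stacks for 60 seconds while the fio job runs,
# then fold the stacks into an SVG flame graph.
# stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph repository;
# their location here is a placeholder.
perf record -F 99 -a -g -- sleep 60
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > oss-4krandomwrite.svg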



Comments
Comment by Qian Yingjin [ 20/Sep/20 ]

Hi Ihara,

Could you please first preallocate all the space via fallocate?
I.e., run fio with fallocate enabled,
or use the 'fallocate -l ' command to preallocate all the needed space,
and then do the fio testing?
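
For example (a sketch assuming the job file and paths from the description; the actual file names depend on filename_format):

# Option 1: let fio issue fallocate() itself by adding this line to the [randwrite] job:
#   fallocate=native
# Option 2: preallocate each target file manually before the run, e.g.
fallocate -l 128G /ai400x/out/f.0.0   # repeat for every f.<jobnum>.<filenum>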

Thanks,
Qian

Comment by Shuichi Ihara [ 21/Sep/20 ]

Yingjin, I also thought fallocate might help and tried fallocate with fio (note: fio uses fallocate if the filesystem supports it) after applying patch https://review.whamcloud.com/#/c/39342/, but the problem was the same and fallocate didn't help either. BTW, overwriting the files did help, e.g. creating the 128GB files and allocating all blocks first, then doing the random writes on them.
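
A sketch of that overwrite workaround (assumed command line; it writes each of the 16 files per client sequentially so all blocks are allocated before the random-write job reuses them via the same directory and filename_format):

# Prefill pass: sequential 1M writes over the full 128GB of every file
fio --name=prefill --ioengine=libaio --rw=write --blocksize=1m --direct=1 \
    --size=128g --numjobs=16 --directory=/ai400x/out \
    --filename_format='f.$jobnum.$filenum' --create_serialize=0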

 

Comment by Qian Yingjin [ 21/Sep/20 ]

Hi Ihara,

I may have found the reason; it appears to be a problem with fallocate for direct I/O (not for buffered I/O).

Will make a revised patch soon.

Regards,
Qian

Comment by Qian Yingjin [ 21/Sep/20 ]

Please try the updated fallocate patch:
https://review.whamcloud.com/39342 LU-13765 osd-ldiskfs: Extend credit correctly for fallocate

It just modifies one line:

diff --git a/lustre/osd-ldiskfs/osd_io.c b/lustre/osd-ldiskfs/osd_io.c
index 462a462cc9..689471e8a3 100644
--- a/lustre/osd-ldiskfs/osd_io.c
+++ b/lustre/osd-ldiskfs/osd_io.c
@@ -2009,7 +2009,7 @@ static int osd_fallocate(const struct lu_env *env, struct dt_object *dt,
                        break;
 
                rc = ldiskfs_map_blocks(handle, inode, &map,
-                                       LDISKFS_GET_BLOCKS_CREATE_UNWRIT_EXT);
+                                       LDISKFS_GET_BLOCKS_CREATE);
                if (rc <= 0) {
                        CDEBUG(D_INODE, "inode #%lu: block %u: len %u: "
                               "ldiskfs_map_blocks returned %d\n",

Regards,
Qian

Comment by Qian Yingjin [ 21/Sep/20 ]

Btw, could you please measure the fallocate performance with/without the updated patches?

i.e.

time fallocate -l 128G test1
time fallocate -l 256G test2
I just want to know whether it will affect the time fallocate takes.

thanks,
Qian

Comment by Shuichi Ihara [ 21/Sep/20 ]

In fact, it seems that fallocate is not working properly with either patch (patchset 6 or patchset 7)...

patchset 6

[root@ec01 ~]# time  fallocate -l 128g /ai400x/test1

real	0m0.004s
user	0m0.001s
sys	0m0.000s
[root@ec01 ~]# ls -l /ai400x/test1 
-rw-r--r-- 1 root root 0 Sep 21 14:47 /ai400x/test1

patchset 7

[root@ec01 ~]# time  fallocate -l 128g /ai400x/test1

real	0m0.003s
user	0m0.001s
sys	0m0.000s
[root@ec01 ~]# ls -l /ai400x/test1 
-rw-r--r-- 1 root root 0 Sep 21 15:06 /ai400x/test1
Comment by Qian Yingjin [ 21/Sep/20 ]

Just fixed the problem:
With LDISKFS_GET_BLOCKS_CREATE:

[root@qvm1 tests]# time fallocate -l 5G /mnt/lustre/test

real	0m0.220s
user	0m0.002s
sys	0m0.003s
[root@qvm1 tests]# stat /mnt/lustre/test
  File: /mnt/lustre/test
  Size: 5368709120	Blocks: 10485768   IO Block: 4194304 regular file
[root@qvm1 tests]# time fallocate -l 1G /mnt/lustre/test

real	0m0.175s
user	0m0.002s
sys	0m0.003s

With LDISKFS_GET_BLOCKS_CREATE_UNWRIT_EXT:

[root@qvm1 tests]# time fallocate -l 5G /mnt/lustre/test

real	0m0.268s
user	0m0.002s
sys	0m0.005s
[root@qvm1 tests]# stat /mnt/lustre/test
  File: /mnt/lustre/test
  Size: 5368709120	Blocks: 10485768   IO Block: 4194304 regular file
Device: 2c54f966h/743766374d	Inode: 144115205272502273  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:unlabeled_t:s0
Access: 2020-09-21 16:36:57.000000000 +0800
Modify: 2020-09-21 16:36:57.000000000 +0800
Change: 2020-09-21 16:36:57.000000000 +0800
 Birth: -

Please try the updated patch again.

BTW, could you please also try a large allocation using the following EXT4 allocation flag:

[root@qvm1 lustre-release]# git diff
diff --git a/lustre/osd-ldiskfs/osd_io.c b/lustre/osd-ldiskfs/osd_io.c
index 7897fd4082..233ea54c6f 100644
--- a/lustre/osd-ldiskfs/osd_io.c
+++ b/lustre/osd-ldiskfs/osd_io.c
@@ -1983,7 +1983,7 @@ static int osd_fallocate(const struct lu_env *env, struct dt_object *dt,
        boff = start >> inode->i_blkbits;
        blen = (ALIGN(end, 1 << inode->i_blkbits) >> inode->i_blkbits) - boff;
 
-       flags = LDISKFS_GET_BLOCKS_CREATE;
+       flags = LDISKFS_GET_BLOCKS_CREATE_UNWRIT_EXT;
        if (mode & FALLOC_FL_KEEP_SIZE)
                flags |= LDISKFS_GET_BLOCKS_KEEP_SIZE;

and measure the allocation time and re-run the fio performance test you did before?

Thanks,
Qian

Comment by Shuichi Ihara [ 21/Sep/20 ]

Hi Yingjin,
Yup, I've confirmed that the latest patch (patchset 8 of https://review.whamcloud.com/39342) solved the problem.
I then went back to the original problem of LU-13973 and re-tested; it is also solved. fallocate now works well with O_DIRECT.

# cat hostlist
ec01
ec02
# SIZE=1g /work/ihara/fio.git/fio --client=hostlist randomwrite.fio
  write: IOPS=37.4k, BW=146Mi (153M)(8761MiB/60004msec); 0 zone resets

# SIZE=128g /work/ihara/fio.git/fio --client=hostlist randomwrite.fio
  write: IOPS=38.1k, BW=149Mi (156M)(8921MiB/60007msec); 0 zone resets