
4K random write performance impacts on large sparse files

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.14.0
    • None
    • master
    • 3
    • 9223372036854775807

    Description

      Here is the tested workload:

      4k random write, FPP (file per process)

      [randwrite]
      ioengine=libaio
      rw=randwrite
      blocksize=4k
      iodepth=4
      direct=1
      size=${SIZE}
      runtime=60
      numjobs=16
      group_reporting
      directory=/ai400x/out
      create_serialize=0
      filename_format=f.$jobnum.$filenum
      

      In this test case, 2 clients each run 16 fio processes, and each fio process does 4k random writes to a different file.
      However, if the file size is large (128GB in this case), performance drops sharply. Here are the two test results.

      1GB file

      # SIZE=1g /work/ihara/fio.git/fio --client=hostfile randomwrite.fio
      
      write: IOPS=16.8k, BW=65.5MiB/s (68.7MB/s)(3930MiB/60004msec); 0 zone resets
       

      128GB file

      # SIZE=128g /work/ihara/fio.git/fio --client=hostfile randomwrite.fio
      
      write: IOPS=2894, BW=11.3MiB/s (11.9MB/s)(679MiB/60039msec)
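
As a quick consistency check on the two results above, the reported IOPS and bandwidth can be re-derived from the totals fio printed (MiB written and runtime). This is a minimal sketch: only the numbers come from the fio output, it assumes every I/O is exactly 4 KiB as set by blocksize=4k, and the helper name is made up for illustration.

```python
# Re-derive bandwidth and IOPS from the totals in the fio summary lines.
# Assumption: every write is exactly 4 KiB (blocksize=4k in the job file).
BLOCK_SIZE = 4096
MIB = 1024 * 1024


def rates(total_mib, runtime_ms, block_size=BLOCK_SIZE):
    """Return (bandwidth in MiB/s, IOPS) for a finished run."""
    seconds = runtime_ms / 1000.0
    bw_mib_s = total_mib / seconds
    iops = bw_mib_s * MIB / block_size
    return bw_mib_s, iops


bw_1g, iops_1g = rates(3930, 60004)      # 1GB files: 3930MiB in 60004msec
bw_128g, iops_128g = rates(679, 60039)   # 128GB files: 679MiB in 60039msec
slowdown = iops_1g / iops_128g

print(f"1GB:   {bw_1g:.1f} MiB/s, {iops_1g:.0f} IOPS")
print(f"128GB: {bw_128g:.1f} MiB/s, {iops_128g:.0f} IOPS")
print(f"slowdown: {slowdown:.1f}x")
```

Both runs agree with the printed IOPS and BW figures to within rounding, and the 128GB case comes out roughly 5.8 times slower than the 1GB case.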
       

      I observed both cases and collected CPU profiles on the OSS. In the 128GB file case there was heavy spinlock contention in ldiskfs_mb_new_block() and ldiskfs_mb_normalized_request(), which accounted for 89% (14085/15823 samples) of the total ost_io_xx() time, against 20% (1895/9296 samples) in the 1GB file case. Please see the attached flame graph.
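
The percentages quoted above follow directly from the sample counts. A small sketch of that arithmetic (the function name is illustrative; the counts are the ones given in the description):

```python
def share(samples, total):
    """Percentage of profile samples spent in the functions of interest."""
    return 100.0 * samples / total


share_128g = share(14085, 15823)  # 128GB file case
share_1g = share(1895, 9296)      # 1GB file case

print(f"128GB: {share_128g:.1f}% of ost_io samples")
print(f"1GB:   {share_1g:.1f}% of ost_io samples")
```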

          Activity

            [LU-13973] 4K random write performance impacts on large sparse files

            sihara Shuichi Ihara added a comment -

            Hi Yingjin,
            Yes, I've also confirmed that the latest patch (patchset 8 of https://review.whamcloud.com/39342) solves the problem.
            I went back to the original LU-13973 problem and re-tested; that is solved as well. fallocate now works well with O_DIRECT.

            # cat hostlist
            ec01
            ec02
            # SIZE=1g /work/ihara/fio.git/fio --client=hostlist randomwrite.fio
              write: IOPS=37.4k, BW=146Mi (153M)(8761MiB/60004msec); 0 zone resets
            
            # SIZE=128g /work/ihara/fio.git/fio --client=hostlist randomwrite.fio
              write: IOPS=38.1k, BW=149Mi (156M)(8921MiB/60007msec); 0 zone resets
            qian_wc Qian Yingjin added a comment -

            Just fixed the problem:
            With LDISKFS_GET_BLOCKS_CREATE:

            [root@qvm1 tests]# time fallocate -l 5G /mnt/lustre/test
            
            real	0m0.220s
            user	0m0.002s
            sys	0m0.003s
            [root@qvm1 tests]# stat /mnt/lustre/test
              File: /mnt/lustre/test
              Size: 5368709120	Blocks: 10485768   IO Block: 4194304 regular file
            [root@qvm1 tests]# time fallocate -l 1G /mnt/lustre/test
            
            real	0m0.175s
            user	0m0.002s
            sys	0m0.003s
            

            With LDISKFS_GET_BLOCKS_CREATE_UNWRIT_EXT:

            [root@qvm1 tests]# time fallocate -l 5G /mnt/lustre/test
            
            real	0m0.268s
            user	0m0.002s
            sys	0m0.005s
            [root@qvm1 tests]# stat /mnt/lustre/test
              File: /mnt/lustre/test
              Size: 5368709120	Blocks: 10485768   IO Block: 4194304 regular file
            Device: 2c54f966h/743766374d	Inode: 144115205272502273  Links: 1
            Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
            Context: unconfined_u:object_r:unlabeled_t:s0
            Access: 2020-09-21 16:36:57.000000000 +0800
            Modify: 2020-09-21 16:36:57.000000000 +0800
            Change: 2020-09-21 16:36:57.000000000 +0800
             Birth: -
            
            

            Please try the updated patch again.

            BTW, could you please also try a large allocation using the EXT4 allocation flags:

            [root@qvm1 lustre-release]# git diff
            diff --git a/lustre/osd-ldiskfs/osd_io.c b/lustre/osd-ldiskfs/osd_io.c
            index 7897fd4082..233ea54c6f 100644
            --- a/lustre/osd-ldiskfs/osd_io.c
            +++ b/lustre/osd-ldiskfs/osd_io.c
            @@ -1983,7 +1983,7 @@ static int osd_fallocate(const struct lu_env *env, struct dt_object *dt,
                    boff = start >> inode->i_blkbits;
                    blen = (ALIGN(end, 1 << inode->i_blkbits) >> inode->i_blkbits) - boff;
             
            -       flags = LDISKFS_GET_BLOCKS_CREATE;
            +       flags = LDISKFS_GET_BLOCKS_CREATE_UNWRIT_EXT;
                    if (mode & FALLOC_FL_KEEP_SIZE)
                            flags |= LDISKFS_GET_BLOCKS_KEEP_SIZE;
            
            

            and then measure the allocation time and the fio performance again?

            Thanks,
            Qian
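
The stat output in the comment above can be checked against the requested size. The sketch below assumes that stat(1) reports Blocks in 512-byte units, which is its usual convention; the variable names are illustrative.

```python
# Values copied from the stat output for /mnt/lustre/test.
SIZE_BYTES = 5368709120
BLOCKS = 10485768

# Assumption: stat(1) counts Blocks in 512-byte units.
STAT_BLOCK_UNIT = 512

requested = 5 * 1024**3                 # fallocate -l 5G
allocated = BLOCKS * STAT_BLOCK_UNIT    # bytes actually accounted to the file
extra = allocated - SIZE_BYTES

print(f"size matches request: {SIZE_BYTES == requested}")
print(f"allocated bytes:      {allocated}")
print(f"beyond file size:     {extra} bytes")
```

Under that assumption the file is fully allocated, with 4096 bytes accounted beyond the 5 GiB of data. The ticket does not say what that extra 4 KiB is.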


            sihara Shuichi Ihara added a comment -

            In fact, it seems that fallocate does not work properly with either patch (patchset 6 or patchset 7): the file size is still 0 after fallocate returns.

            patchset 6

            [root@ec01 ~]# time  fallocate -l 128g /ai400x/test1
            
            real	0m0.004s
            user	0m0.001s
            sys	0m0.000s
            [root@ec01 ~]# ls -l /ai400x/test1 
            -rw-r--r-- 1 root root 0 Sep 21 14:47 /ai400x/test1
            

            patchset 7

            [root@ec01 ~]# time  fallocate -l 128g /ai400x/test1
            
            real	0m0.003s
            user	0m0.001s
            sys	0m0.000s
            [root@ec01 ~]# ls -l /ai400x/test1 
            -rw-r--r-- 1 root root 0 Sep 21 15:06 /ai400x/test1
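
The symptom above (a zero-length file after fallocate returns) can also be detected programmatically. A minimal sketch, assuming a POSIX system where Python's os.stat and os.posix_fallocate are available; check_preallocated is a hypothetical helper, and the temporary file stands in for /ai400x/test1.

```python
import os
import tempfile


def check_preallocated(path, expected_size):
    """Report the apparent size and allocated bytes of a file.

    st_blocks is counted in 512-byte units.
    """
    st = os.stat(path)
    return {
        "size": st.st_size,
        "allocated": st.st_blocks * 512,
        "size_ok": st.st_size >= expected_size,
    }


# Demo on a small temporary file (1 MiB instead of 128 GiB).
WANTED = 1 << 20
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test1")
    open(path, "wb").close()                    # empty file, like the failing case
    before = check_preallocated(path, WANTED)

    fd = os.open(path, os.O_WRONLY)
    try:
        os.posix_fallocate(fd, 0, WANTED)       # preallocate the whole range
    finally:
        os.close(fd)
    after = check_preallocated(path, WANTED)

print("before:", before)
print("after: ", after)
```

A result with size_ok set to False after a preallocation call corresponds to the `ls -l` output showing 0 bytes in the transcripts above.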
            qian_wc Qian Yingjin added a comment - - edited

            Btw, could you please measure the fallocate performance with and without the updated patches?

            i.e.

            time fallocate -l 128G test1
            time fallocate -l 256G test2

            I just want to know whether they affect the time fallocate takes.

            thanks,
            Qian

            qian_wc Qian Yingjin added a comment -

            Please try the updated fallocate patch:
            https://review.whamcloud.com/39342 LU-13765 osd-ldiskfs: Extend credit correctly for fallocate

            It just modifies one line:

            diff --git a/lustre/osd-ldiskfs/osd_io.c b/lustre/osd-ldiskfs/osd_io.c
            index 462a462cc9..689471e8a3 100644
            --- a/lustre/osd-ldiskfs/osd_io.c
            +++ b/lustre/osd-ldiskfs/osd_io.c
            @@ -2009,7 +2009,7 @@ static int osd_fallocate(const struct lu_env *env, struct dt_object *dt,
                                    break;
             
                            rc = ldiskfs_map_blocks(handle, inode, &map,
            -                                       LDISKFS_GET_BLOCKS_CREATE_UNWRIT_EXT);
            +                                       LDISKFS_GET_BLOCKS_CREATE);
                            if (rc <= 0) {
                                    CDEBUG(D_INODE, "inode #%lu: block %u: len %u: "
                                           "ldiskfs_map_blocks returned %d\n",
            
            

            Regards,
            Qian

            qian_wc Qian Yingjin added a comment -

            Hi Ihara,

            I may have found the reason: it appears to be a problem with fallocate for direct IO (not for buffered IO).

            I will make a revised patch soon.

            Regards,
            Qian

            sihara Shuichi Ihara added a comment - - edited

            Yingjin, I also thought fallocate might help, so I tried fio with fallocate (note: fio uses fallocate if the filesystem supports it) after applying patch https://review.whamcloud.com/#/c/39342/, but the problem was the same and fallocate didn't help either. Btw, overwriting files did help, e.g. creating the 128GB files and allocating all blocks first, then doing random writes on them.
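
The workaround mentioned here (allocate every block first, then run the random writes as overwrites) could be done with a sequential fill pass. The ticket does not say how the 128GB files were actually created, so the following is only one possible sketch; prefill is a hypothetical helper and the demo uses a tiny temporary file.

```python
import os
import tempfile


def prefill(path, size, chunk=1 << 20):
    """Write `size` bytes of zeros sequentially so the blocks get allocated.

    Assumption: the backing filesystem allocates real blocks for zero-filled
    writes (no compression or zero detection).
    """
    buf = bytes(chunk)
    remaining = size
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(buf[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())    # push the data out before the random-write run
    return os.stat(path).st_size


# Demo: 3 MiB plus a partial chunk instead of 128 GiB.
TARGET = 3 * (1 << 20) + 123
with tempfile.TemporaryDirectory() as tmpdir:
    filled_size = prefill(os.path.join(tmpdir, "f.0.0"), TARGET)

print("filled bytes:", filled_size)
```

After such a pass, a 4k random write job only overwrites blocks that already exist, which matches the case reported here as helping.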

             

            qian_wc Qian Yingjin added a comment -

            Hi Ihara,

            Could you please first preallocate all the space via fallocate?
            i.e.
            run fio with fallocate,
            or use the command 'fallocate -l ' to preallocate all the needed space,
            and then do the fio testing?

            Thanks,
            Qian
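
For the second option, preallocating each file before the fio run, a sketch along these lines may help. It uses os.posix_fallocate from the Python standard library instead of the fallocate(1) command; preallocate is a hypothetical helper, and the file names and 1 MiB size in the demo are placeholders, not the real fio targets.

```python
import os
import tempfile


def preallocate(paths, size):
    """Create each file if needed and reserve `size` bytes for it."""
    for path in paths:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.posix_fallocate(fd, 0, size)   # offset 0, length `size`
        finally:
            os.close(fd)
    # Return the resulting sizes so the caller can confirm the allocation.
    return {path: os.stat(path).st_size for path in paths}


# Demo: two placeholder files of 1 MiB each in a temporary directory.
SIZE = 1 << 20
with tempfile.TemporaryDirectory() as tmpdir:
    names = [os.path.join(tmpdir, f"f.{job}.0") for job in range(2)]
    sizes = preallocate(names, SIZE)

print(sorted(sizes.values()))
```

With glibc, posix_fallocate may fall back to writing data on filesystems without native fallocate support, so its timing is not always comparable to that of fallocate(1).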


            People

              qian_wc Qian Yingjin
              sihara Shuichi Ihara