[LU-10355] Lower write throughput after put many files into lustre Created: 08/Dec/17  Updated: 06/Sep/18  Resolved: 06/Sep/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: sebg-crd-pm (Inactive) Assignee: Peter Jones
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre 2.10.1 + OSD_ZFS



 Description   

Hi ,

The total Lustre write throughput became lower after I transferred many small files into Lustre (about 1,000,000 files, 1 MB per file). The write throughput stays low even after removing these small files. I have also tested with obdfilter-survey, and it shows the same lower throughput. Do you have any suggestions for getting Lustre back to a "clean" state, i.e. restoring the original throughput? Thanks.
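
For reference, a minimal sketch of how obdfilter-survey can be run on an OSS to measure raw OST write throughput; the object/thread counts and the per-OST size (in MB) below are only examples and would need to be adjusted to the actual system:

    # Run on the OSS; exercises the local OSTs directly, bypassing clients and the network
    nobjlo=1 nobjhi=2 thrlo=1 thrhi=16 size=16384 case=disk obdfilter-survey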



 Comments   
Comment by Andreas Dilger [ 08/Dec/17 ]

Hi, could you please give us some more details about your test, and how big the performance problem is? Is this a newly-formatted filesystem, or is this an existing filesystem that you are running the test on? Are you creating/removing/creating files in the same directory, or a new directory each time? What are the actual speeds (creates/sec) for the first and second test? Are you using HDD or SSD storage for the MDT and OST?
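
If it helps to answer the creates-per-second question, mdtest reports per-phase create/stat/remove rates; a minimal sketch, assuming 8 clients, a host list file, and a hypothetical mount point /mnt/lustre:

    # 8 MPI tasks, 1000 files each, unique working directory per task
    mpirun -np 8 --hostfile clients.txt \
        mdtest -n 1000 -u -d /mnt/lustre/mdtest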

In general, a disk-based filesystem will get slower over time, because the free space becomes fragmented, and because there is a significant performance difference between the inner and outer tracks of the disk (the inner tracks are roughly 50% slower; see this StackExchange question for example: https://unix.stackexchange.com/questions/293176/hdds-outer-track-vs-inner-track-performance-benchmarks/409832).
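
On the fragmentation side, the pool's own counters can be checked directly; a sketch, assuming a pool named ostpool:

    # The FRAG column in the default output is ZFS's free-space fragmentation estimate
    zpool list -v ostpool
    zpool get fragmentation,capacity ostpool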

Note that ZFS doesn't immediately delete the files from disk after they are removed. There are "internal snapshots" (at least 4) that are kept in the filesystem in case of a crash, which may take tens of seconds to actually be removed if the filesystem is not being modified. This may perturb your test results if you are doing create/remove/create in quick succession.
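
To confirm whether a large removal has actually been reclaimed before re-running a benchmark, the pool's pending-free counter can be watched; a sketch with an assumed pool name:

    # "freeing" reports space still being reclaimed asynchronously after deletes;
    # wait for it to reach 0 before starting the next test run
    zpool get freeing ostpool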

Secondly, if you create 1M files in a directory, the size of the directory itself needs to grow to store the filenames, but these blocks are not freed when the files are deleted. File access in a large directory is slower than in a small directory, since the filenames are hashed and distributed around the whole directory, and blocks may need to be read from disk. However, the performance of the directory should not continue to get worse after the first few cycles if you continually create/remove files in the same directory.
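
One way to observe this is to compare the directory's own allocation before and after its entries are removed; a sketch with a hypothetical directory:

    # The directory's size/blocks typically stay large after its ~1M entries are removed;
    # run before and again after deleting the files (e.g. "find /mnt/lustre/bigdir -type f -delete")
    stat -c 'size=%s blocks=%b' /mnt/lustre/bigdir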

Comment by sebg-crd-pm (Inactive) [ 12/Dec/17 ]

Lustre Configuration

MDS x2: 10 SSDs in a mirror, per MDS

OSS x2: HDD raidz2 (9+2) x 3, per OSS
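
For context, the OSS pool layout described above corresponds to a vdev arrangement like the following (pool and device names are placeholders; the OST itself would then be formatted on a dataset in this pool with mkfs.lustre --backfstype=zfs):

    # Three raidz2 vdevs of 11 disks each (9 data + 2 parity) per OSS pool
    zpool create oss1pool \
        raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk \
        raidz2 sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv \
        raidz2 sdw sdx sdy sdz sdaa sdab sdac sdad sdae sdaf sdag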

 

1. Test Case 1: with many small files already on the filesystem (about 1M files in separate directories, 500 files per directory)

8-client IOR 512 GB test, write throughput: 3 GB/s

2. Test Case 2: after re-creating the zpools and reformatting everything

8-client IOR 512 GB test, write throughput: 4 GB/s

3. Test Case 3: newly created zpool + copy of many small files (about 1M files in the same directory)

fio test on one OST's local ZFS: 1.3 GB/s (newly created) => 1 GB/s (after copying the 1M files)

 

I will also try to test on one ZFS pool with 1M files in separate directories.

It looks like the OSS ZFS becomes slower after storing many files.
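
A comparable IOR invocation for the client tests above would be something like the following; the host list, mount point, and transfer size are assumptions:

    # 8 clients, one task per client, 512 GiB written per task, file-per-process, 1 MiB transfers
    mpirun -np 8 --hostfile clients.txt \
        ior -w -F -t 1m -b 512g -o /mnt/lustre/ior_test/file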

 

Comment by Andreas Dilger [ 12/Dec/17 ]

How long after the small-file creation was the fio test run? If it was run right after the small-file create+write, then the large-file writes may be waiting for the previous small files to be flushed to disk.
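
One way to check this is to watch the pool before starting the large-file run; a sketch with an assumed pool name:

    # Report per-vdev bandwidth every 5 seconds; sustained write bandwidth before the
    # large-file run starts indicates dirty data from the small-file phase is still draining
    zpool iostat -v ostpool 5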

We are working on the “Data on MDS” feature to separate small files onto the MDS, but it is not released yet.

Comment by Andreas Dilger [ 12/Dec/17 ]

Does the test write 512 GB in total, or per client? How big is the zpool in total?

Have you tried this test on a local ZFS filesystem, without Lustre? It may be that this is the behavior of ZFS.

Comment by sebg-crd-pm (Inactive) [ 14/Dec/17 ]

Hi 

Test Cases 1/2 use 8 Lustre clients, each client writing 512 GB (Lustre with 6 OSTs, raidz2 9+2).

Test Case 3 is on a local ZFS filesystem (one zpool, local test).

 

We look forward to the "Data on MDS" feature for small-file access when it is available.

I have tested some other performance cases:

 

Test Case 4: local ZFS pool, capacity 47 TB (raidz2 9+2), 1M files (130 KB per file)

The throughput is the same as without these small files (1.3 GB/s).

=> So many very small files do not impact throughput.

 

Test Case 5: local ZFS pool, capacity 47 TB (raidz2 9+2), 5000 files (1 GB per file)

The throughput is only about half (655 MB/s) compared with a newly created zpool (1.3 GB/s).

=> It looks like the throughput has been impacted by the inner/outer tracks.

But this case only uses 10% of the capacity, so I would expect the zpool throughput to drop only 5~10%, just as a single disk would. It seems there must be some other reason that brings the zpool throughput down to 50%.
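
A sketch of a comparable fio sequential-write run for the local tests above (the directory, file size, and job count are examples, and the directory must already exist):

    # Sequential 1 MiB buffered writes to the local ZFS dataset
    fio --name=seqwrite --directory=/ostpool/fio --rw=write \
        --bs=1M --size=16G --numjobs=8 --ioengine=psync --group_reporting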

      

Do you have any suggestion? Thanks.

Comment by Joseph Gmitter (Inactive) [ 14/Dec/17 ]

" We look forward to this feature "“Data on MDS” for small files access if it is available."

  • Data on MDT will be available beginning with the Lustre 2.11.0 release projected for the end of Q1.
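
Once running 2.11, a Data-on-MDT layout can be set per directory so that small files are stored entirely on the MDT; a minimal sketch (the directory path and component sizes are illustrative):

    # Store the first 64 KiB of each new file in this directory on the MDT,
    # with any remainder striped across one OST
    lfs setstripe -E 64K -L mdt -E -1 -c 1 /mnt/lustre/smallfiles
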
Comment by Peter Jones [ 15/Dec/17 ]

sebg-crd-pm are you able to test the pre-release 2.11 code including the data on MDT feature?

Comment by sebg-crd-pm (Inactive) [ 19/Dec/17 ]

I am focusing on throughput performance and other issues, so I may test Data on MDT later.

Can I download pre-release 2.11 code now?

Comment by Peter Jones [ 19/Dec/17 ]

Yes, the latest 2.11 pre-release build can always be accessed via https://build.hpdd.intel.com/job/lustre-master/

However, when are you targeting entering production?

Comment by Peter Jones [ 21/Mar/18 ]

2.11 RC1 is now in testing

Comment by Peter Jones [ 06/Sep/18 ]

2.11 has been GA for some time
