[LU-10355] Lower write throughput after putting many files into Lustre Created: 08/Dec/17 Updated: 06/Sep/18 Resolved: 06/Sep/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Minor |
| Reporter: | sebg-crd-pm (Inactive) | Assignee: | Peter Jones |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | Lustre 2.10.1 + OSD_ZFS | ||
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi, the total Lustre write throughput became lower after I transferred many small files to Lustre (about 1,000,000 files, 1 MB per file). The write throughput stays just as low even after removing these small files. I have also run obdfilter-survey, and it shows the same reduced throughput. Do you have any suggestion for getting the Lustre filesystem back to a clean state? Thanks. |
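For reference, a minimal obdfilter-survey run against a single OST might look like the sketch below; the target name lustre-OST0000, object counts, and thread counts are placeholders, not the reporter's actual parameters.

```bash
# Run on the OSS node; drives the OST backend directly, bypassing Lustre clients.
# size is MB written per object; nobj*/thr* sweep the object and thread counts.
targets="lustre-OST0000" size=1024 nobjlo=1 nobjhi=2 thrlo=1 thrhi=32 obdfilter-survey
```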
| Comments |
| Comment by Andreas Dilger [ 08/Dec/17 ] |
|
Hi, could you please give us some more details about your test, and how big the performance problem is? Is this a newly formatted filesystem, or an existing filesystem that you are running the test on? Are you creating/removing/creating files in the same directory, or in a new directory each time? What are the actual speeds (creates/sec) for the first and second test? Are you using HDD or SSD storage for the MDT and OST?

In general, a disk-based filesystem will get slower over time because the free space becomes fragmented, and there is also a significant performance difference between the inner and outer tracks of a disk (the inner tracks are about 50% slower; see this StackExchange question for example: https://unix.stackexchange.com/questions/293176/hdds-outer-track-vs-inner-track-performance-benchmarks/409832).

Note that ZFS does not immediately delete files from disk after they are removed. There are "internal snapshots" (at least 4) that are kept in the filesystem in case of a crash, and it may take tens of seconds for the space to actually be freed if the filesystem is not being modified. This may perturb your test results if you are doing create/remove/create in quick succession.

Secondly, if you create 1M files in a directory, the directory itself needs to grow to store the filenames, and those blocks are not freed when the files are deleted. File access in a large directory is slower than in a small directory, since the filenames are hashed and distributed around the whole directory, and blocks may need to be read from disk. However, the performance of the directory should not continue to get worse after the first few cycles if you continually create/remove files in the same directory. |
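A quick way to check whether deferred frees or pool fragmentation are a factor before re-running the benchmark is sketched below (the pool name ostpool is a placeholder):

```bash
# Space still waiting to be released by ZFS's asynchronous destroy;
# it should drop toward 0 once the pool has been idle for a while.
zpool get freeing ostpool

# Metaslab fragmentation and overall capacity usage of the pool.
zpool get fragmentation,capacity ostpool
```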
| Comment by sebg-crd-pm (Inactive) [ 12/Dec/17 ] |
|
Lustre configuration:
MDS x2: SSD x10 mirror per MDS
OSS x2: HDD raidz2 (9+2) x3 per OSS

1. Test Case 1: with many small files (about 1M files in separate directories, 500 files per directory), 8-client IOR 512G test, write throughput: 3GB/s.
2. Test Case 2: after re-creating the zpool and reformatting everything, 8-client IOR 512G test, write throughput: 4GB/s.
3. Test Case 3: newly created zpool + copy of many small files (about 1M files in the same directory), fio test on one OST's local ZFS: 1.3GB/s (newly created) => 1GB/s (after copying 1M files).

I will also try to test a ZFS pool with 1M files in separate directories. It looks like the OSS ZFS became slower after storing many files.
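For comparison, an IOR run roughly matching Test Cases 1 and 2 could be launched as sketched below; the hostfile, MPI launcher, transfer size, and mount point are assumptions rather than the exact parameters used here.

```bash
# 8 client processes (one per node), file-per-process, write-only,
# 1 MiB transfers, 512 GiB written per client into the Lustre mount.
mpirun -np 8 --hostfile clients.txt \
    ior -w -F -t 1m -b 512g -o /mnt/lustre/ior_testdir/ior_file
```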
|
| Comment by Andreas Dilger [ 12/Dec/17 ] |
|
How long after the small-file create was the fio test run? If it was right after the small-file create+write, the large-file writes may have been waiting for the previous small files to be flushed to disk. We are working on the "Data on MDS" feature to put small files onto the MDS, but it is not released yet. |
| Comment by Andreas Dilger [ 12/Dec/17 ] |
|
Does the test write 512GB in total, or per client? How big is the zpool in total? Have you tried this test on a local ZFS filesystem, without Lustre? It may be that this is simply the behavior of ZFS. |
| Comment by sebg-crd-pm (Inactive) [ 14/Dec/17 ] |
|
Hi, Test Cases 1/2 use 8 Lustre clients, each client writing 512GB (Lustre with 6 OSTs, raidz 9+2). Test Case 3 is on a local test ZFS pool (one zpool, tested locally).

We look forward to the "Data on MDS" feature for small-file access when it is available. I have tested some other performance cases:

Test Case 4: local ZFS pool, capacity 47TB (raidz 9+2), 1M files (130KB per file). The throughput is the same as without these small files (1.3GB/s). => So many very small files do not impact throughput.

Test Case 5: local ZFS pool, capacity 47TB (raidz 9+2), 5000 files (1GB per file). The throughput (655MB/s) is only about half that of a newly created zpool (1.3GB/s). => It looks like the throughput has been impacted by the inner/outer tracks. But this case only uses 10% of the capacity, so I expected the zpool throughput to drop only 5~10%, like a single disk's throughput would. It seems there is some other reason the zpool throughput dropped by 50%.
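A sequential-write fio job in the spirit of Test Cases 3-5 is sketched below; the dataset path, file size, and job count are placeholders.

```bash
# Streaming 1 MiB sequential writes into the local ZFS dataset.
# direct=0 because buffered I/O is the usual path on ZFS.
fio --name=seqwrite --directory=/ostpool/fio --rw=write --bs=1m \
    --size=100g --numjobs=4 --ioengine=libaio --direct=0 --group_reporting
```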
Do you have any suggestion? Thanks.
|
| Comment by Joseph Gmitter (Inactive) [ 14/Dec/17 ] |
|
" We look forward to this feature "“Data on MDS” for small files access if it is available."
|
| Comment by Peter Jones [ 15/Dec/17 ] |
|
sebg-crd-pm are you able to test the pre-release 2.11 code including the data on MDT feature? |
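For reference, once running 2.11 code, Data-on-MDT is enabled per directory (or per file) with a composite layout; a minimal sketch, where the directory path and the 1 MiB DoM segment size are placeholders:

```bash
# The first 1 MiB of each new file in this directory is stored on the MDT;
# anything beyond that is striped over a single OST.
lfs setstripe -E 1M -L mdt -E -1 -c 1 /mnt/lustre/small_files
```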
| Comment by sebg-crd-pm (Inactive) [ 19/Dec/17 ] |
|
I am focusing on throughput performance and other issues, so I may test data on MDT later. Can I download pre-release 2.11 code now? |
| Comment by Peter Jones [ 19/Dec/17 ] |
|
Yes, the latest 2.11 pre-release build can always be accessed via https://build.hpdd.intel.com/job/lustre-master/. However, when are you targeting entering production? |
| Comment by Peter Jones [ 21/Mar/18 ] |
|
2.11 RC1 is now in testing |
| Comment by Peter Jones [ 06/Sep/18 ] |
|
2.11 has been GA for some time |