[LU-7293] DNE2 performance analysis Created: 13/Oct/15 Updated: 12/Jan/19 Resolved: 12/Jan/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne2, dne3 | ||
| Environment: | DNE2 system with up to 16 MDS servers. Uses up to 400 client nodes spread across 20 physical nodes. All the results are based on mdtest 1.9.3 runs. |
| Attachments: | |
| Issue Links: | |
| Epic/Theme: | Performance |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I did a detailed study of the client scaling behavior for 10k and 100k files per directory using 1, 2, 4, and 8 MDS servers, each having one MDT. I also attempted to collect data for 16 MDS servers, but the results were so bad I didn't bother to finish collecting them, since it would take several months to finish the 16-node case. |
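For reference, a representative invocation for this kind of run might look like the sketch below. The rank count, items per task, and iteration count are placeholders, not the actual values; the real settings are in the attached job script.
# Sketch only: placeholder values, not taken from the attached mdtest-scale.pbs script.
# Ranks are spread across the client nodes and operate inside a pre-created
# DNE2 striped directory (here the 8-MDS test directory from the script).
mpirun -np 400 ./mdtest -n 1000 -i 5 \
    -d /lustre/sultan/stf008/scratch/$USER/dne2_8_mds_md_test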
| Comments |
| Comment by Di Wang [ 13/Oct/15 ] |
|
James, could you please tell me more about your test? What commands did you use? Does "MDS striping 8" mean the directory stripe_count = 8? Just curious: you did not disable quota in your test, right? Thanks |
| Comment by James A Simmons [ 13/Oct/15 ] |
|
I attached the job script I used. You just need to replace aprun with mpirun with the correct node count and thread count. As for quotas, I haven't touched that setting. In the past it was off by default. Is this not the case anymore? Yes, the MDS striping is the lfs setdirstripe -c value I used. |
| Comment by Di Wang [ 14/Oct/15 ] |
|
I think quota has been enabled by default since 2.4. To disable quota you have to run "tune2fs -O ^quota" after reformatting. But since there is only 1 MDT per MDS, quota is probably irrelevant. Thanks, I will check the script. |
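For reference, a minimal sketch of checking and clearing the ldiskfs quota feature on an MDT. The device path is hypothetical, and the target must be unmounted when tune2fs is run.
# Sketch, assuming a hypothetical unmounted MDT block device /dev/mdt0.
tune2fs -l /dev/mdt0 | grep -i features    # check whether the quota feature is set
tune2fs -O ^quota /dev/mdt0                # clear the quota feature flag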
| Comment by Di Wang [ 14/Oct/15 ] |
|
In mdtest-scale.pbs:
#!/bin/bash
#PBS -l nodes=20
#PBS -l walltime=24:00:00
#PBS -N results-mdtest-scale
#PBS -j oe
MOUNT=sultan
OSTCOUNT=$(lctl get_param -n lov.$MOUNT-clilov*.numobd)
ITER=5
PBS_JOBID="dne2_8_mds"
BINDIR=/lustre/$MOUNT/stf008/scratch/$USER
OUTDIR=$BINDIR/${PBS_JOBID}_md_test
[ -e $OUTDIR ] || {
mkdir -p $OUTDIR
lfs setstripe -c $OSTCOUNT $OUTDIR
}
cd $BINDIR
It seems this script is for testing OST stripes? you probably post the wrong script? thanks |
| Comment by James A Simmons [ 14/Oct/15 ] |
|
The script always sets up the test directories so all created files are striped across all the OSTs. I do this with all my tests so that when I move to our large-stripe test setup I can verify that 1008-OST-stripe files work as well. It also ensures that my failover testing will always cover active servers on the OSS side. |
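For context, a small sketch of setting and verifying the wide-striping default on the output directory; $OUTDIR is taken from the script above, and -c -1 is shown as the equivalent of -c $OSTCOUNT.
# Sketch: set a default file layout on the test directory and verify it.
lfs setstripe -c -1 "$OUTDIR"    # -1 = stripe new files over every available OST
lfs getstripe -d "$OUTDIR"       # show only the directory's default layout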
| Comment by Di Wang [ 14/Oct/15 ] |
|
Oh, I want to know how the directories are striped among the MDTs. Did you set default dir stripes on BINDIR (i.e. /lustre/$MOUNT/stf008/scratch/$USER)? If you did, then that probably explains why directory creation is much slower with multiple MDTs, because all of the directories created here are striped directories. Anyway, could you please tell me how BINDIR/OUTDIR are striped here? Thanks. |
| Comment by James A Simmons [ 14/Oct/15 ] |
|
I precreated each striped/split directory before each run of the mdtest-scale.pbs script. Each directory was called dne2_"stripe_count"_mds. I set remote_dir=1 and remote_dir_gid=-1 so that a normal user (myself) could create the striped directory:
lfs setdirstripe -c 4 --index=4 /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds
Then I ran the test. I used the -D option so the directories created by mdtest would be striped the same as dne2_X_mds. I also used the index to avoid filling up my MDS disk with inodes. |
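A minimal sketch of the pre-creation sequence described above, assuming the remote_dir settings refer to the usual mdt.*.enable_remote_dir tunables on the MDS nodes, and reusing the same stripe count and index as the example command.
# On the MDS nodes (assumption: these are the tunables referred to above):
lctl set_param mdt.*.enable_remote_dir=1        # allow remote/striped directory creation
lctl set_param mdt.*.enable_remote_dir_gid=-1   # allow any user/group to create them

# On a client, pre-create the striped test directory:
lfs setdirstripe -c 4 --index=4 /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds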
| Comment by Di Wang [ 19/Oct/15 ] |
|
James: Could you please re-run the test with -F, so it only does the file test? I want to see if it shows linear performance improvement with only file operations. I suspect there is some interference between cross-MDT and single-MDT operations. Also, is there any reason you use "--index=4 -D" for the default striped EA? That means all of the dir create requests will be sent to MDT4, which might not be what you want. I would suggest removing --index=4 for the default striped EA, i.e. |
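The original comment ends here; presumably the suggested command looks something like the sketch below (an assumption, reusing the stripe count and path from the earlier example), setting the default striped EA without pinning a starting MDT index.
# Assumed form of the suggestion: default striped EA with no fixed starting MDT.
lfs setdirstripe -D -c 4 /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds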
| Comment by Andreas Dilger [ 20/Oct/15 ] |
|
James, it also doesn't make sense to create huge numbers of regular files striped across all OSTs. Applications should either create large numbers of files with one stripe per file (i.e. file per process) or create small numbers of widely striped files (i.e. shared single file). Doing both at the same time is IMHO not testing what happens on most systems. Having large numbers of widely-striped files both stresses OST object creation rates and slows down the MDS, because it needs to store an extra xattr for each file. In your case with 1008 OSTs, this actually creates two inodes per file on the MDT in order to store the large xattr (about 24KB/file). |
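As a rough back-of-the-envelope check (assuming roughly 24 bytes per OST entry in the layout xattr), 1008 stripes works out to about 24KB of layout per file, which matches the figure above; a single-stripe default on the mdtest output directory (reusing $OUTDIR from the script) avoids that, e.g.:
# Sketch: file-per-process style runs keep the layout xattr tiny.
# Sizing assumption: ~24 bytes per OST entry, so 1008 stripes ~= 24KB of xattr
# per file versus ~24 bytes for a single stripe.
lfs setstripe -c 1 "$OUTDIR"     # default of one stripe per newly created file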
| Comment by James A Simmons [ 21/Oct/15 ] |
|
This is a test on a small system with only 56 OSTs. I can try the default stripe of 4, but I don't expect much of a difference. Our users tend to do one of two things: use the default setting or do an lfs setstripe -c -1. Also, the goal here was to see how scaling behaved. Di Wang, our test system is undergoing an upgrade. It will be a few days before it is finished. |
| Comment by James A Simmons [ 29/Oct/15 ] |
|
We are in the process of installing perf on our test systems to analyze what is going on. I should have something next week. |
| Comment by James A Simmons [ 03/Nov/15 ] |
|
I have some good news and some bad news from my testing with perf installed. The good news is that I'm seeing much better performance so far with a large MDS stripe count. I will start collecting new data soon and post it here. The bad news is that when creating one million plus files I'm seeing constant client evictions and reconnects due to timeouts from the OSS. I will open a separate ticket for that. |
| Comment by James A Simmons [ 04/Nov/15 ] |
|
Sorry, but I was wrong about the performance fixes. My script had an error in it where the default stripe was 1 for mdtest. I did do some profiling and I'm not seeing anything hogging cycles. What I did see with slabtop on my client is this:
Active / Total Objects (% used) : 549437 / 559744 (98.2%)
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
372512 372485  99%    1.12K  53216        7    425728K lustre_inode_cache |
| Comment by Di Wang [ 04/Nov/15 ] |
|
"Sorry but I was wrong about performance fixes." So you mean "seeing much better performance" is not correct? Will you redo the test? Actually, I hope you can use OProfile to profile the MDS, so I can see which function or lock is being hit most and know where the bottleneck is for this load. Thanks. |
| Comment by James A Simmons [ 09/Nov/15 ] |
|
Here is the perf data I gathered on one of the MDSs being used:
Samples: 136K of event 'cycles', Event count (approx.): 45672465691
and slabtop gives:
Active / Total Objects (% used) : 8211619 / 11280489 (72.8%)
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
From this the MDS doesn't look to have too heavy a burden. It's not CPU pegged, and there is no memory exhaustion. |
| Comment by James A Simmons [ 09/Nov/15 ] |
|
On the client side I see with perf and slabtop:
+ 30.58% swapper [kernel.kallsyms] [k] __schedule
Active / Total Objects (% used) : 6238442 / 6248913 (99.8%)
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
Here you can see the client nodes are pegged, with swapper running constantly due to the memory pressure on the client. The lustre_inode_cache is huge. |
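As a side note, a small sketch of how the client-side cache pressure can be released between runs using standard interfaces; shown here only as a diagnostic aid, not something done in the original tests.
# Sketch: release client-side caches between mdtest runs (run as root on a client).
lctl set_param ldlm.namespaces.*.lru_size=clear   # drop cached LDLM locks
echo 3 > /proc/sys/vm/drop_caches                 # drop pagecache, dentries and inodes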
| Comment by Di Wang [ 09/Nov/15 ] |
|
James: did this only happen when you do directory creation, or for both file and directory creation? How many threads per client? Thanks. Btw: did you rerun the test with "-F" and remove --index for the default striped EA as commented on Oct 18th? Thanks |
| Comment by James A Simmons [ 21/Dec/15 ] |
|
I did both file-only operations and runs with directory operations. I also tracked down the issue with directory operations. What is happening in that case is that the lustre inode cache consumes all the memory on the client, causing various timeouts, client evictions, and reconnects. This only happens when many directory operations are performed. When only doing file operations the memory pressure issues go away. My latest testing has all been without --index. |
| Comment by Di Wang [ 21/Dec/15 ] |
|
Hmm, according to the slab information from Nov 15th, it seems "lustre_inode_cache" is much larger than "inode_cache", which means the client has more ll_inode_info than inodes; maybe ll_inode_info is leaked somewhere. Do you still have that client? Could you please get the lru_size for me?
lctl get_param ldlm.*.*MDT*.lru_size |
| Comment by James A Simmons [ 21/Dec/15 ] |
|
Oh this doesn't look right.
ldlm.namespaces.sultan-MDT0000-mdc-ffff8803f3d12c00.lru_size=29 |
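For completeness, a quick sketch of how the cached-lock count can be compared with the cached Lustre inode count on the client, to see whether inodes are held far beyond what the locks pin; this is a diagnostic suggestion, not a step from the original thread.
# Sketch: compare MDC lock counts against cached Lustre inodes on the client.
lctl get_param ldlm.namespaces.*mdc*.lock_count   # locks currently held per MDC namespace
grep lustre_inode_cache /proc/slabinfo            # ll_inode_info objects still cached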
| Comment by James A Simmons [ 19/Sep/16 ] |
|
Here is the final report of our results for our DNE2 performance analysis: http://info.ornl.gov/sites/publications/Files/Pub59510.pdf Enjoy the read. If people want it linked to the wiki, we can do that. |
| Comment by Jian Yu [ 20/Sep/16 ] |
|
Hi James, Thank you very much for the report! |
| Comment by Peter Jones [ 12/Jan/19 ] |
|
closing ancient ticket |