[LU-4257] parallel dds are slower than serial dds Created: 15/Nov/13 Updated: 16/Jun/20 Resolved: 03/Jun/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||||||||||
| Rank (Obsolete): | 11618 | ||||||||||||||||||||||||||||||||
| Description |
|
Sanger has an interesting test in which they read from the same file from 20 processes. They first run in parallel and then run serially (after flushing cache). Their expected result is that the serial and parallel runs should take about the same amount of time. What they see however is that parallel reads are about 50% slower than serial reads:

client1# cat readfile.sh
#!/bin/sh
dd if=/lustre/scratch110/sanger/jb23/test/delete bs=4M of=/dev/null

client1# for i in `seq -w 1 20`
do
    (time $LOC/readfile.sh ) > $LOC/results/${i}_out 2>&1 &
done
So the question is: Is there some lock contention going on with multiple readers? Or is the Lustre client sending multiple RPCs for the same data, even though there is already an outstanding request? They have tried this on 1.8.x clients as well as 2.5.0. Thanks. |
| Comments |
| Comment by Peter Jones [ 15/Nov/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Bobijam Could you please advise on this one? Thanks Peter | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 15/Nov/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
How large are the files being tested? One stripe or multiple stripes per file? It surprises me that Lustre is slower than NFS, but I guess that isn't directly relevant for the problem itself. It would be useful to collect +rpctrace +dlmtrace +vfstrace +reada debug logs on the client for the parallel and serial runs. That would allow us to see if there are duplicate RPCs being sent. This would be most useful with a 2.5.0 client. It may be that Jinshan's client-side single-threaded performance patches would help improve the overall performance, but it isn't totally clear if they would affect the parallel vs. serial behaviour. | ||||||||||||||||||||||||||||||||||||||||||||||||
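For reference, a minimal sketch of capturing those debug flags around a parallel run (the mount point, test file and output paths are placeholders; the same lctl commands appear in the reproducer later in this ticket):

#!/bin/bash
# Sketch: enable the requested debug flags, bracket a parallel read with
# markers, and dump the binary debug log to text for upload.
lctl set_param debug="+rpctrace +dlmtrace +vfstrace +reada"
lctl clear
lctl debug_daemon start /tmp/lu4257_debug 1024
lctl mark "parallel read start"
for i in `seq -w 1 20`; do
    dd if=/mnt/lustre/testfile bs=4M of=/dev/null &   # placeholder file
done
wait
lctl mark "parallel read done"
lctl debug_daemon stop
lctl df /tmp/lu4257_debug > /tmp/lu4257_debug.txt     # decode to plain text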
| Comment by Kit Westneat (Inactive) [ 15/Nov/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
They are 2GB, single-striped files. The client nodes have 4GB of RAM and are mounting with the localflock option. I'll work on getting the debug logs. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 15/Nov/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Reading a 2GB file in 151s (=13.5MB/s) in the serial case, or even 216s (=9.5MB/s) in the parallel case, is really terrible performance. It seems like something else is broken in this filesystem that is causing everything to run slowly? Even my 8-year-old home system runs this fast. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 18/Nov/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Note: when I filtered on +rpctrace +dlmtrace +vfstrace +reada I got no events, so I have given the full logs. Sorry. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 18/Nov/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
This is using my desktop, which is a Lustre 2.5.1 client.

root@deskpro21498:/tmp/results# grep copied /tmp/results/ser | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 20/Nov/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Can I add any information? Can the problem be reproduced elsewhere? | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kit Westneat (Inactive) [ 21/Nov/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi James,

Can you confirm how large the original file was? Also can you double check your debug settings? When I run it on a Lustre 2.1.5 VM I get a ton of debug data.

sysctl lnet.debug="+rpctrace +dlmtrace +vfstrace +reada"

Here's the script I used:

#!/bin/bash
lctl clear
lctl debug_daemon start /scratch/debug_file 2048
echo 3 > /proc/sys/vm/drop_caches
lctl mark parallel read start
for x in 1 2 3 4 5; do
    dd if=/lustre/pfs/client/testfile bs=4M of=/dev/null &
done
wait
lctl mark parallel read done
echo 3 > /proc/sys/vm/drop_caches
lctl mark serial read start
for x in 1 2 3 4 5; do
    dd if=/lustre/pfs/client/testfile bs=4M of=/dev/null
done
lctl mark serial read done
lctl debug_daemon stop
lctl df /scratch/debug_file > /scratch/debug_file.out

I was able to reproduce the slow down you mentioned. I'll attach the debug logs that I got.

parallel:
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.40159 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.40117 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.39914 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.39789 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.39852 s, 59.6 MB/s

serial:
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.57006 s, 57.4 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.0536567 s, 4.9 GB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.046017 s, 5.7 GB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.0454 s, 5.8 GB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.0453329 s, 5.8 GB/s | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 22/Nov/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
In the most recent results with the 2.5.0 client (we don't have a 2.5.1 yet?) the parallel reads take only 28.5s, while the serial reads take 39s in total, so the parallel reads are faster in aggregate than serial reads, though the single serial read is faster (19.4s) than any of the parallel reads. These results also are more in line with NFS (19.9s for the first serial read, 1.3s for each later read).

I updated the script a bit and ran it on my 2.4.0 x86_64 2-core 4-thread client, 2.4.1 server:

#!/bin/sh
FILE=/myth/tmp/100M    # previously created
OST=myth-OST0000
lctl set_param ldlm.namespaces.$OST*.lru_size=clear
lctl set_param llite.*.stats=0 osc.$OST*.rpc_stats=0
for i in $(seq -w 1 20); do
    (time dd if=$FILE bs=4M of=/dev/null) > /tmp/par_${i}.out 2>&1 &
done
wait
lctl get_param llite.*.stats osc.$OST*.rpc_stats | tee /tmp/par_stats.out
lctl set_param ldlm.namespaces.$OST*.lru_size=clear
lctl set_param llite.*.stats=0 osc.$OST*.rpc_stats=0
for i in $(seq -w 1 20); do
    (time dd if=$FILE bs=4M of=/dev/null) > /tmp/ser_${i}.out 2>&1
done
lctl get_param llite.*.stats osc.$OST*.rpc_stats | tee /tmp/par_stats.out

This produced the following from rpc_stats, which showed that there are not multiple RPCs being sent:

SERIAL
                        read                    write
pages per rpc     rpcs   %  cum %   |    rpcs   %  cum %
64:                  2   1      1   |       0   0      0
256:               100  98    100   |       0   0      0

PARALLEL
                        read                    write
pages per rpc     rpcs   %  cum %   |    rpcs   %  cum %
256:               100 100    100   |       0   0      0

The time profile is as I would expect, shorter single time for serial but longer total time (no parallelism), and a shorter total time for parallel (runtime happens in parallel):

SERIAL    PARALLEL | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 26/Nov/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Just as a note, I think I can see the problem; however, I am busy upgrading a Lustre filesystem and I want to ensure that my communication is clear. Apologies for the delay. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Bonfield [ 03/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I have a basic C program that does a while(read()) style loop until EOF on an ~800MB file with varying buffer sizes. I tested this on Lustre, tmpfs and NFS using a mix of block sizes and 1, 4 and 32 concurrent copies. For example, on Lustre with 4 copies:

jkb@sf-7-1-02[work/benchmarks] fn=/lustre/scratch110/srpipe/references/Human/default/all/bowtie/human_g1k_v37.fasta.1.ebwt
for i in `seq 8 30`; do
    bs=$(perl -e "print 2 ** $i")
    echo -n "$bs"
    (for x in `seq 1 4`; do ./linear2 $fn & done) | grep Took | tail -1
done

I then graphed these. Thin lines are 1 process running, the thicker ones are 4, and the thickest are 32 (the labels lustre, lustre4 and lustre32 correspond to the crosses, circles and squares on the data points).
Naturally, for any read there will be a per-system-call overhead, so obviously tiny buffers for read calls are inefficient. This is evidently clear for all file systems. There is also clearly a per-byte overhead: even reading from a file in cached memory, it takes time to copy from the kernel-space buffer cache to the user-space buffer. This is a per-byte delay and will scale linearly with the data size. These two timings combine to give the net speed of the while(read()) loops.

Indeed we see this for NFS. Below a 1k block size there is a decrease in speed as the block size decreases, indicating that the overhead of making more system calls is the major bottleneck. At 10k and above the time is more or less constant, so the per-system-call overhead has become insignificant. In between 1k and 10k we see the gradual bend in the graph where real time is more evenly spread between memory copying and per-system-call costs. The Lustre single-threaded curve demonstrates the same bend between 10k and 100k, implying the per-system-call cost is higher.

Now, if we have 4 processes running we would expect the per-system-call overhead to be much the same on all processes, and much the same delay as we get with 1 or 32 processes running (more or less). This is visible in the NFS benchmark, as the 32-process version still shows a linear decrease in performance for blocks below 1k (as for the single-process NFS stats) and the bend is "more or less" in the same place. However, Lustre shows a marked difference. As we add more and more simultaneous processes, the trade-off point between per-system-call and per-byte costs moves from ~10k, through ~100k for 4 processes, up to ~1MB for 32 processes.

Why is this? Basically it indicates that the CPU overhead of each and every system call goes up as other processes are doing more system calls. The net effect is that even on moderately sized buffers (e.g. 10k) Lustre becomes a couple of orders of magnitude slower than NFS. To my eyes that implies there is some contention between independent system calls, possibly a thread lock.

I appreciate that coherency is HARD, so I wonder if there is some way we can provide hints to the Lustre client to make things easier. E.g. is it possible to set the immutable flag on a file to avoid this lock? Does opening in read-only mode help? Does having the file itself be read-only help?

James | ||||||||||||||||||||||||||||||||||||||||||||||||
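For anyone without the linear2 program, a rough shell approximation of the same block-size sweep can be done with dd (a sketch only; the test file path and reader count are placeholders, and dd's own overhead means the absolute numbers will not match the C read() loop):

#!/bin/bash
# Sweep read block sizes from 256 bytes to 1 GiB and report the dd throughput
# for several concurrent readers of the same file.
fn=/lustre/scratch110/some/test/file   # placeholder path
nproc=4                                # number of concurrent readers
for i in `seq 8 30`; do
    bs=$((2 ** i))
    echo "bs=$bs"
    for x in `seq 1 $nproc`; do
        dd if=$fn of=/dev/null bs=$bs 2>&1 | grep copied &
    done
    wait
done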
| Comment by Kit Westneat (Inactive) [ 03/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi James, Can you confirm what version of Lustre the clients and servers are running? Are you able to get the debug logs that Andreas requested while running these tests? I had to use the debug_daemon function in order to capture all the logs. Thanks, | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 03/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I have done my tests on Lustre 1.8.9 clients and 2.5.0 clients; however, James above would have only used Lustre 1.8.9. I am rather deep in a set of Lustre server upgrades right now, so I will not be able to look at this until Tuesday unless it is critical before then. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 03/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
James, you are correct that there is per-syscall overhead in Lustre, and contention will indeed increase with multiple threads. Lustre has fully coherent locking between nodes, while NFS has no locking at all. That means that if one node is writing to a file while another node is reading it, or if two nodes are writing on non-block-aligned boundaries, NFS does not guarantee anything about the state of your data at the end. NFS only ensures that local data is flushed when the file is closed, and that another client can see this data when the file is opened afterward.

That said, I agree the client-side locking overhead is higher than it should be, and we are working to improve this for the 2.6 release. If you are able to test prototype code on a client node, it would be possible to see whether the current performance patch series shows an improvement on your system. Most of the patches are already in the lustre-master branch in Git. The last patch in this series that I'm aware of is:

There are pre-built RPMs available at (click on the blue circle for your arch/distro combo): | ||||||||||||||||||||||||||||||||||||||||||||||||
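One way to get a feel for the client-side DLM locking described above is to watch the client lock counters across a run. This is only a sketch (the mount point and test file are placeholders; the parameter names are the usual client ones and may vary slightly between versions):

# Drop cached locks and zero the llite stats, run the parallel read,
# then see how many DLM locks the client is holding afterwards.
lctl set_param ldlm.namespaces.*.lru_size=clear
lctl set_param llite.*.stats=0
for i in `seq -w 1 20`; do
    dd if=/mnt/lustre/testfile bs=4M of=/dev/null &
done
wait
lctl get_param ldlm.namespaces.*.lock_count
lctl get_param llite.*.stats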
| Comment by James Beal [ 12/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
This is a set of test results from a production node that has been taken out of one of our clusters. Note that the Lustre client is mounted with localflock.

Our expectation is that a number of processes all reading the same file should be quicker than a number of processes all reading different files, since the file is in cache and we would not expect additional IO; however, there appears to be significant contention which we do not understand. I note that the block size we read the file with makes a very significant difference. The problem was discovered with a program that read the file one character at a time. I will repeat the runs with a Lustre 2.1 client pulled down from lustre-reviews.

A little table which summarises the results.
Lustre 1.8.9 client, Lustre 2.4.1 server (32 files, each 512MB in size, reading with dd, 4MB block size), mounted using localflock
Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 4843.25
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 8469.83
Lustre 1.8.9 client, Lustre 2.4.1 server (32 files, each 512MB in size, reading with dd, 4KB block size), mounted using localflock
root@sf-7-1-02:/root# ./test.sh
Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 7202.32
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 131779
tmpfs (32 files, each 512MB in size, reading with dd, 4MB block size)
Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 8018.69
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 8160.59
tmpfs (32 files, each 512MB in size, reading with dd, 4KB block size)
Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 6141.82
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 6130.82
The scripts are as follows:
cat readfile.sh
#!/bin/sh
TYPE=$1
LOC=$2
PAR=$3
MULT=$4
if [ ${TYPE} = "SEQ" ] ; then
for i in `seq -w 1 ${PAR}`
do
echo ${LOC}/${i}
for j in `seq 1 ${MULT}`
do
dd if=${LOC}/${i} of=/dev/null bs=4k
done
done
fi
if [ ${TYPE} = "PAR" ] ; then
echo $LOC
for i in `seq -w 1 ${PAR}`
do
echo ${LOC}
for j in `seq 1 ${MULT}`
do
dd if=${LOC} of=/dev/null bs=4k
done
done
fi
And the test script
cat test.sh
#!/bin/sh
LOC=/mnt/tmp/4k
#Where to store the results
RES=`echo $LOC| sed -e 's#/#-#g' -e 's/^-//' `
# How many processes to run at one time
PAR=32
BASE=`dirname $0`
# How many times to read the file ( to multiple the caching effect )
MULT=4
mkdir -p ${RES}
echo -n Creating source files
for i in `seq -w 1 ${PAR}`
do
dd if=/dev/zero of=${LOC}/${i} bs=4M count=128 > ${RES}/${i}_create 2>&1
echo -n .
done
echo
echo Dropping cache
echo 3 > /proc/sys/vm/drop_caches
echo "All threads reading the same file"
for i in `seq -w 1 ${PAR}`
do
(time $BASE/readfile.sh SEQ ${LOC} ${PAR} ${MULT}) > ${RES}/${i}_parallel_same_file 2>&1 &
echo -n
done
wait
echo Dropping cache
echo 3 > /proc/sys/vm/drop_caches
echo "Reading the same file in a single thread "
for i in `seq -w 1 ${PAR} `
do
(time ${BASE}/readfile.sh PAR ${LOC}/${i} ${PAR} ${MULT} ) > ${RES}/${i}_parallel_different_file 2>&1 &
done
wait
echo "Time taken using read pattern"
echo "Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo -n "Pattern takes "
grep -h real ${RES}/*_parallel_different_file | sed -e 's/real *[^ ]//' -e 's/s$//' -e 's/m/*60+/' | bc | awk '{ sum+=$1} END {print sum}'
echo "Time taken using read pattern"
echo "Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H"
echo "File_A File_A File_A File_A File_A File_A File_A File_A"
echo "File_B File_B File_B File_B File_B File_B File_B File_B"
echo "File_C File_C File_C File_C File_C File_C File_C File_C"
echo "File_D File_D File_D File_D File_D File_D File_D File_D"
echo "File_E File_E File_E File_E File_E File_E File_E File_E"
echo "File_F File_F File_F File_F File_F File_F File_F File_F"
echo "File_G File_G File_G File_G File_G File_G File_G File_G"
echo "File_H File_H File_H File_H File_H File_H File_H File_H"
echo -n "Pattern takes "
grep -h real ${RES}/*_parallel_same_file | sed -e 's/real *[^ ]//' -e 's/s$//' -e 's/m/*60+/' | bc | awk '{ sum+=$1} END {print sum}'
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kit Westneat (Inactive) [ 12/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi James, Can you also try with a master branch client? If you have time, a 2.5.x client would also be interesting I think. I'll try to test with different client versions as well. Thanks, | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 12/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
This is from an old system that has only 16Gig of RAM, and only 8 cores. The size of each file was increased to 1G but the amount of parallelism reduced to 8.

[   67.821266] Lustre: Lustre: Build Version: jenkins-gb641cc0-PRISTINE-2.6.32-41-server
[   68.182300] Lustre: Added LNI 172.17.96.27@tcp [8/256/0/180]
[   68.182364] Lustre: Accept secure, port 988
[   68.414016] Lustre: Lustre OSC module (ffffffffa08556e0).
[   68.538800] Lustre: Lustre LOV module (ffffffffa08ea0e0).
[   68.701672] Lustre: Lustre client module (ffffffffa09d6ca0).
[  174.074540] Lustre: MGC172.17.128.135@tcp: Reactivating import
[  174.074569] Lustre: Server MGS version (2.4.1.0) is much newer than client version. Consider upgrading client (2.1.6)
[  174.086076] LustreError: 1200:0:(mgc_request.c:247:do_config_log_add()) failed processing sptlrpc log: -2
[  174.180021] Lustre: Server lus03-MDT0000_UUID version (2.4.1.0) is much newer than client version. Consider upgrading client (2.1.6)
4M read, lustre
root@isg-dev4:/root# bash test.sh

4K read, lustre
root@isg-dev4:/root# bash test.sh

4k reads, tmpfs
root@isg-dev4:/root# ./test.sh

4M reads, tmpfs
root@isg-dev4:/root# ./test.sh | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 13/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Sorry, I just saw this ticket today. I didn't read all the comments, but the result mentioned in "James Beal added a comment - 18/Nov/13 4:08 AM" seems quite reasonable to me.

For serial reading, only the first read comes from the server; the following reads are served from the client cache, which is why they are super fast.

For parallel reading, things are different. Basically only one process issues the read RPC and the other 19 processes are just waiting for the result of that read. For example, for the first 4MB of pages, the first process sends a read RPC; the other processes then try to read the same portion of data but find the pages locked, so they have to wait for that RPC to finish. The same thing happens for the 2nd 4MB block, and so on. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kit Westneat (Inactive) [ 13/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
But then wouldn't you expect the first serial process time to be about the same as the first parallel process time? You can see from the data that the first serial process completes in about 1/3 less time than any of the parallel processes. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 13/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Actually you can't expect that, because there must be a lot of page-lock waiting for the parallel processes, while things are much better for serial reading thanks to readahead. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 13/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I note that the files are only being read and the filesystem is mounted with localflock. If we compare the ratio of speeds against tmpfs, we see no effective difference for tmpfs compared to a slowdown by a factor of 15.6 for Lustre. Note that we have applications that read files one character at a time. Our current workaround involves a hack with an LD_PRELOAD library which changes the reads into memory-mapped access.
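For context, a shim of the kind described above is typically built and injected like this (a hypothetical sketch only; mmap_read.c, mmap_read.so and the reader binary are placeholder names, not our actual tool):

# Build a shared object that overrides read() with an mmap-backed version
# (implementation not shown), then inject it into the unmodified reader.
gcc -shared -fPIC -o mmap_read.so mmap_read.c -ldl
LD_PRELOAD=$PWD/mmap_read.so ./char_by_char_reader /lustre/scratch110/some/input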
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 13/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Unless there is a difference in what is being tested between the filesystems, or I'm misunderstanding your results, I find it interesting that Lustre concurrent 4MB reads (@4843s) are significantly faster than tmpfs 4MB reads (@~8019s)?

It is reasonable that there is overhead for Lustre with smaller read sizes and with concurrent IO compared to tmpfs. Lustre needs to do significantly more work per syscall than tmpfs. Even with a single client, there is locking overhead to maintain consistency (whether it is needed in this particular use or not), overhead in being able to potentially distribute the IO across multiple servers, etc., while tmpfs is as close to a "no-op" as a filesystem can be.

That said, I agree the 15x slowdown for 4kB reads is excessive, and it indicates there is too much contention/overhead in this part of the code. In 1.8 there was a fast path implemented for small reads: since most of the pages were prefetched via readahead and already in the page cache, only a minimal amount of work was needed before returning the page to userspace. I think Jinshan is working on a patch to improve this for 2.x as well. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 13/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Andreas, the numbers I pasted were for a 1.8.9 client; I will try and see what a 2.5 client gives us.
For all these reads the results should be "in the cache". I will try to do a run with 1-character reads, which is what the program that showed this issue in the first place used, so it is apparent how painful the difference is. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 15/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
The change between Lustre 1.8.9 and Lustre 2.5.0 looks very unpleasant. (The run with one-character reads has been running over the weekend and has not completed the first part yet.) Given that we are moving to Lustre 2.5 as our client on all our hosts very soon, I am very wary of these numbers.
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 16/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
It seems bad. Let's start by investigating the single-file concurrent-read case. Can you please send me the output of:

lctl get_param llite.*.read_ahead_stats

when running the 1.8 and 2.5 clients specifically? Please remount the client before running the test case so that we can get accurate stats. Also, can you please run it with a single thread on both clients to check how much difference there is between them? | ||||||||||||||||||||||||||||||||||||||||||||||||
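A minimal sketch of that capture, assuming a client mount at /mnt/lustre and a placeholder MGS NID/fsname and test file (adjust to the real setup):

# Remount so the llite/readahead counters start from zero, run the concurrent
# read of one file, then save the readahead stats requested above.
umount /mnt/lustre
mount -t lustre -o localflock mgsnode@tcp0:/lustre /mnt/lustre
for i in `seq -w 1 20`; do
    dd if=/mnt/lustre/testfile bs=4k of=/dev/null &
done
wait
lctl get_param llite.*.read_ahead_stats > /tmp/read_ahead_stats.out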
| Comment by James Beal [ 16/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
To complete the tests in a reasonable time I had to reduce the size of the file to 128MB
The single-client reads with a larger file: this is with a 4GB file, which seems to follow the smaller-file pattern.
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 16/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
The logs have been added as attachments, along with a new copy of the test script. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 16/Dec/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
It looks like the per-IO overhead of 2.x is much higher than 1.8's. I did some experiments on my local machine and cl_lock contributes most of the overhead. I hope we can solve this problem in the ongoing CLIO simplification project. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kit Westneat (Inactive) [ 16/Jan/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Jinshan, Can you elaborate on the clio simplification project and what sort of impact it might have on this IO pattern? Thanks, | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 16/Jan/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
The CLIO project is funded by OpenSFS. We will simplify the cl_lock implementation there. There will be a performance tuning phase at the end of the project, which will give us a chance to look at this problem.

I have previously set up an environment to check which major components slow it down. From what I have seen, cl_lock contributes some overhead. Other components, for example cl_io data initialization, layout refresh, etc., also contribute significant overhead. I'm on a really tight schedule, so I haven't been able to find time to work on this recently. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Richard Henwood (Inactive) [ 16/Jan/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Kit: for clarity, OpenSFS have funded a design phase for CLIO Simplification. Implementation and an associated performance analysis are not within scope for the CLIO Simplification design phase. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 01/Feb/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I'd like to share some results I've got so far. First of all, I can reproduce the problem mentioned by James. So I worked out a patch; with that plus the prototype of the cl_lock simplification, I have some results to share.

There are two types of read in my test (see the legend below). I also ran the test cases both with the data in cache and not in cache. Here are the results (seconds):
ind: threads read individual files

Raw data is here:

[root@c01 tests]# CACHE=yes ~/test.sh
Dropping cache
Read everything into memory cache..................................done
Creating source files................................
Block size 4k
cache: yes, Threads read individual files - wall: 16, real: 198.53 seconds

cache: yes, All threads reading the same file - wall: 43, real: 111.02 seconds
Block size 4M
cache: yes, Threads read individual files - wall: 1, real: 24.96 seconds

cache: yes, All threads reading the same file - wall: 1, real: 20.65 seconds
[root@c01 tests]# CACHE=no ~/test.sh
Dropping cache
Creating source files................................
Block size 4k
Dropping cache
cache: no, Threads read individual files - wall: 101, real: 157.37 seconds
Dropping cache
cache: no, All threads reading the same file - wall: 43, real: 115.51 seconds
Block size 4M
Dropping cache
cache: no, Threads read individual files - wall: 94, real: 161.8 seconds
Dropping cache
cache: no, All threads reading the same file - wall: 1, real: 13.44 seconds

and this is my test script:

[root@c01 tests]# cat ~/test.sh
#!/bin/bash
CACHE=${CACHE:-"no"}
LOC=/mnt/lustre/lustre_test
mkdir -p $LOC
#Where to store the results
RES=~/res
# How many processes to run at one time
PAR=32
BASE=`dirname $0`
# How many times to read the file ( to multiple the caching effect )
MULT=1
rm -rf ${RES} && mkdir -p ${RES}
function remount_lustre {
echo Dropping cache
echo 3 > /proc/sys/vm/drop_caches
umount /mnt/lustre
mount | grep c01
mount -t lustre -o localflock c01@tcp0:/lustre /mnt/lustre
if [ "$CACHE" = "yes" ]; then
echo -n Read everything into memory cache..
for i in `seq -f "%03.f" 1 ${PAR}`; do
dd if=${LOC}/${i} of=/dev/null bs=4M > /dev/null 2>&1
echo -n .
done
echo done
fi
}
remount_lustre
echo -n Creating source files
for i in `seq -f "%03.f" 1 ${PAR}`
do
[ -e ${LOC}/${i} ] || dd if=/dev/zero of=${LOC}/${i} bs=4M count=32 > ${RES}/${i}_create 2>&1
echo -n .
done
echo
for BLOCK in "4k" "4M"
do
echo "Block size $BLOCK"
[ "$CACHE" = "no" ] && remount_lustre
echo -n "cache: $CACHE, Threads read individual files - "
st=$(date +%s)
for i in `seq -f "%03.f" 1 ${PAR}`; do
/usr/bin/time -p -a -o ${RES}/single.${BLOCK} dd if=${LOC}/$i of=/dev/null bs=${BLOCK} > /dev/null 2>&1 &
done
wait
echo -n "wall: $((`date +%s`-st)), "
t=`grep -h real ${RES}/single.${BLOCK} | sed -e 's/real *[^ ]//' -e 's/s$//' -e 's/m/*60+/' | bc | awk '{ sum+=$1} END {print sum}'`
echo "real: $t seconds"
echo
[ "$CACHE" = "no" ] && remount_lustre
echo -n "cache: $CACHE, All threads reading the same file - "
st=$(date +%s)
for i in `seq -f "%03.f" 1 ${PAR}`; do
/usr/bin/time -p -a -o ${RES}/par.${BLOCK} dd if=${LOC}/001 of=/dev/null bs=${BLOCK} > /dev/null 2>&1 &
done
wait
echo -n "wall: $((`date +%s`-st)), "
t=`grep -h real ${RES}/par.${BLOCK} | sed -e 's/real *[^ ]//' -e 's/s$//' -e 's/m/*60+/' | bc | awk '{ sum+=$1} END {print sum}'`
echo "real: $t seconds"
done
[root@c01 tests]#
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 01/Feb/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Some of the patches are still in the prototype phase; I'll share the patches with you as soon as I can, so that you can try to reproduce what I have done. Today is Chinese New Year. Happy Chinese New Year, everybody! | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 01/Feb/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Happy New Year to you too. I completely understand about prototypes; it would be good to have before-and-after numbers in the tables. Thank you for looking at this for us. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Dmitry Eremin (Inactive) [ 03/Feb/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Could you test with my patch http://review.whamcloud.com/#/c/9095? | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 03/Feb/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I will build a new client tomorrow. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 03/Feb/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Here are the results before and after my patches are applied (in seconds):
single: a single thread reads a single file. The file size is 1GB and the file is already in cache before the threads start; the read block size is 4k. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James Beal [ 13/Mar/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
While we wait, I found the following interesting: http://lwn.net/Articles/590243/

"Performance-oriented patches should, of course, always be accompanied by benchmark results. In this case, Waiman included a set of AIM7 benchmark results with his patch set (which did not include the pending-bit optimization). Some workloads regressed a little, but others showed improvements of 1-2% — a good result for a low-level locking improvement. The disk benchmark runs, however, improved by as much as 116%; that benchmark suffers from especially strong contention for locks in the virtual filesystem layer and ext4 filesystem code." | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Manish Patel (Inactive) [ 10/Apr/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Jinshan, are there any updates on the request made by James? We are looking to try out the new patch for the performance issues, based on the above comment. Thank you, | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 28/Jun/14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I came up with a solution for this issue and the initial testing result is exciting.

The test case I used is based on the test cases shared by Sanger, please check below:

#!/bin/bash
nr_cpus=$(grep -c ^processor /proc/cpuinfo)
for CACHE in no yes; do
    for BS in 4k 1M; do
        echo "===== cache: $CACHE, block size: $BS ====="
        [ "$CACHE" = "yes" ] && { dd if=/mnt/lustre/testfile bs=1M of=/dev/null > /dev/null 2>&1; }
        [ "$CACHE" = "no" ] && { echo 3 > /proc/sys/vm/drop_caches; }
        echo -n "  single read: "
        dd if=/mnt/lustre/testfile bs=$BS of=/dev/null 2>&1 | grep copied | awk -F, '{print $3}'
        [ "$CACHE" = "no" ] && { echo 3 > /proc/sys/vm/drop_caches; }
        echo -n "  parallel read: "
        for i in `seq -w 1 ${nr_cpus}`; do
            dd if=/mnt/lustre/testfile bs=$BS of=/dev/null > results/${i}_out 2>&1 &
        done
        wait
        grep copied results/1_out | awk -F, '{print $3}'
    done
done

The test file is 2G in size so that it can fit in memory for cache-enabled testing. I applied the patches and compared the test results w/ and w/o the patches.

The result w/ my patches:

===== cache: no, block size: 4k =====
single read: 1.2 GB/s
parallel read: 576 MB/s
===== cache: no, block size: 1M =====
single read: 1.4 GB/s
parallel read: 566 MB/s
===== cache: yes, block size: 4k =====
single read: 3.8 GB/s
parallel read: 1.8 GB/s
===== cache: yes, block size: 1M =====
single read: 6.4 GB/s
parallel read: 1.3 GB/s
The test w/o my patches:

===== cache: no, block size: 4k =====
single read: 257 MB/s
parallel read: 148 MB/s
===== cache: no, block size: 1M =====
single read: 1.1 GB/s
parallel read: 420 MB/s
===== cache: yes, block size: 4k =====
single read: 361 MB/s
parallel read: 147 MB/s
===== cache: yes, block size: 1M =====
single read: 5.8 GB/s
parallel read: 1.3 GB/s
The small-IO performance improved significantly. I'm still doing some fine tuning of the patches and I will release them as soon as I can so that you can do some evaluation. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 02/Jun/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/20574 | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 03/Jun/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20254/ | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 03/Jun/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20256/ | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 03/Jun/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20255/ | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 03/Jun/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Landed for 2.9 | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 07/Jun/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
This patch introduced an intermittent test failure | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 09/Jun/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/20647/ |