[LU-4257] parallel dds are slower than serial dds Created: 15/Nov/13  Updated: 16/Jun/20  Resolved: 03/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File debug_file.out.gz     PNG File io.png     Text File lu-4257.tar.gz     File lustre_1.8.9     File lustre_2.5     File readfile.sh     File test.sh    
Issue Links:
Duplicate
Related
is related to LU-8248 sanity test_248: fast read was not 4 ... Resolved
is related to LU-9106 ASSERTION( vio->vui_tot_nrsegs >= vi... Resolved
is related to LU-7382 (vvp_io.c:573:vvp_io_update_iov()) AS... Reopened
is related to LU-9491 v2.9: silent data corruption with wri... Resolved
is related to LU-4650 contention on ll_inode_size_lock with... Resolved
is related to LU-4588 replace semaphores with mutexes Resolved
Severity: 3
Rank (Obsolete): 11618

 Description   

Sanger has an interesting test in which they read the same file from 20 processes. They first run the reads in parallel and then run them serially (after flushing the cache). Their expectation is that the serial and parallel runs should take about the same amount of time. What they see, however, is that the parallel reads are about 50% slower than the serial reads:

client1# cat readfile.sh
#!/bin/sh

dd if=/lustre/scratch110/sanger/jb23/test/delete bs=4M of=/dev/null

client1# for i in `seq -w 1 20 `
do
  (time $LOC/readfile.sh )  > $LOC/results/${i}_out 2>&1 &
done

In parallel

01_out:real 3m36.228s
02_out:real 3m36.227s
03_out:real 3m36.226s
04_out:real 3m36.224s
05_out:real 3m36.224s
06_out:real 3m36.224s
07_out:real 3m36.222s
08_out:real 3m36.221s
09_out:real 3m36.228s
10_out:real 3m36.222s
11_out:real 3m36.220s
12_out:real 3m36.220s
13_out:real 3m36.228s
14_out:real 3m36.219s
15_out:real 3m36.217s
16_out:real 3m36.218s
17_out:real 3m36.214s
18_out:real 3m36.214s
19_out:real 3m36.211s
20_out:real 3m36.212s
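The *_serial results below presumably come from the serial variant of the same loop, along the lines of the following (a sketch only; the cache flush and output naming are assumptions based on the description above):

echo 3 > /proc/sys/vm/drop_caches
for i in `seq -w 1 20`
do
  (time $LOC/readfile.sh )  > $LOC/results/${i}_out_serial 2>&1
done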

A serial read (I expect all the time to be spent in the first read):

grep -i real *_serial
01_out_serial:real 2m31.372s
02_out_serial:real 0m1.190s
03_out_serial:real 0m0.654s
04_out_serial:real 0m0.562s
05_out_serial:real 0m0.574s
06_out_serial:real 0m0.570s
07_out_serial:real 0m0.574s
08_out_serial:real 0m0.461s
09_out_serial:real 0m0.456s
10_out_serial:real 0m0.462s
11_out_serial:real 0m0.475s
12_out_serial:real 0m0.473s
13_out_serial:real 0m0.582s
14_out_serial:real 0m0.580s
15_out_serial:real 0m0.569s
16_out_serial:real 0m0.679s
17_out_serial:real 0m0.565s
18_out_serial:real 0m0.573s
19_out_serial:real 0m0.579s
20_out_serial:real 0m0.472s

And the same experiment with NFS:

Serial access.

root@farm3-head4:~/tmp/test/results# grep -i real *
results/01_out_serial:real 0m19.923s
results/02_out_serial:real 0m1.373s
results/03_out_serial:real 0m1.237s
results/04_out_serial:real 0m1.276s
results/05_out_serial:real 0m1.289s
results/06_out_serial:real 0m1.297s
results/07_out_serial:real 0m1.265s
results/08_out_serial:real 0m1.278s
results/09_out_serial:real 0m1.224s
results/10_out_serial:real 0m1.225s
results/11_out_serial:real 0m1.221s
...

So the question is:
Why is access slower when we read the file in parallel and it is not in the cache?

Is there some lock contention going on with multiple readers? Or is the Lustre client sending multiple RPCs for the same data, even though there is already an outstanding request? They have tried this on 1.8.x clients as well as 2.5.0.

Thanks.



 Comments   
Comment by Peter Jones [ 15/Nov/13 ]

Bobijam

Could you please advise on this one?

Thanks

Peter

Comment by Andreas Dilger [ 15/Nov/13 ]

How large are the files being tested? One stripe or multiple stripes per file? It surprises me that Lustre is slower than NFS, but I guess that isn't directly relevant for the problem itself.

It would be useful to collect +rpctrace +dlmtrace +vfstrace +reada debug logs on the client for the parallel and serial runs. That would allow us to see if there are duplicate RPCs being sent. This would be most useful with a 2.5.0 client. It may be that Jinshan's client-side single-threaded performance patches would help improve the overall performance, but it isn't totally clear if they would affect the parallel vs. serial behaviour.
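A minimal way to capture those debug flags around a run might look like the following (a sketch; the dump path is a placeholder, and the dd test is run in between):

lctl set_param debug="+rpctrace +dlmtrace +vfstrace +reada"
lctl clear
# ... run the parallel or serial dd test here ...
lctl dk > /tmp/lu-4257-debug.out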

Comment by Kit Westneat (Inactive) [ 15/Nov/13 ]

They are 2GB, single-striped files. The client nodes have 4GB of RAM and are mounting with the localflock option.

I'll work on getting the debug logs.

Comment by Andreas Dilger [ 15/Nov/13 ]

It is really terrible performance to be reading a 2GB file in 151s (= 13.5 MB/s) in the serial case, or even 216s (= 9.5 MB/s) in the parallel case. It seems like something else is broken in this filesystem that is causing everything to run slowly; even my 8-year-old home system runs this fast.

Comment by James Beal [ 18/Nov/13 ]

Note that when I filtered on +rpctrace +dlmtrace +vfstrace +reada I got no events, so I have given the full logs.

Sorry.

Comment by James Beal [ 18/Nov/13 ]

This is from my desktop, which is a Lustre 2.5.1 client.

root@deskpro21498:/tmp/results# grep copied /tmp/results/ser
/tmp/results/01_out_ser:2147483648 bytes (2.1 GB) copied, 19.4224 s, 111 MB/s
/tmp/results/02_out_ser:2147483648 bytes (2.1 GB) copied, 1.1007 s, 2.0 GB/s
/tmp/results/03_out_ser:2147483648 bytes (2.1 GB) copied, 1.03958 s, 2.1 GB/s
/tmp/results/04_out_ser:2147483648 bytes (2.1 GB) copied, 1.03101 s, 2.1 GB/s
/tmp/results/05_out_ser:2147483648 bytes (2.1 GB) copied, 1.03045 s, 2.1 GB/s
/tmp/results/06_out_ser:2147483648 bytes (2.1 GB) copied, 1.03191 s, 2.1 GB/s
/tmp/results/07_out_ser:2147483648 bytes (2.1 GB) copied, 1.0305 s, 2.1 GB/s
/tmp/results/08_out_ser:2147483648 bytes (2.1 GB) copied, 1.03164 s, 2.1 GB/s
/tmp/results/09_out_ser:2147483648 bytes (2.1 GB) copied, 1.03211 s, 2.1 GB/s
/tmp/results/10_out_ser:2147483648 bytes (2.1 GB) copied, 1.0314 s, 2.1 GB/s
/tmp/results/11_out_ser:2147483648 bytes (2.1 GB) copied, 1.03315 s, 2.1 GB/s
/tmp/results/12_out_ser:2147483648 bytes (2.1 GB) copied, 1.03305 s, 2.1 GB/s
/tmp/results/13_out_ser:2147483648 bytes (2.1 GB) copied, 1.03047 s, 2.1 GB/s
/tmp/results/14_out_ser:2147483648 bytes (2.1 GB) copied, 1.03324 s, 2.1 GB/s
/tmp/results/15_out_ser:2147483648 bytes (2.1 GB) copied, 1.03284 s, 2.1 GB/s
/tmp/results/16_out_ser:2147483648 bytes (2.1 GB) copied, 1.03156 s, 2.1 GB/s
/tmp/results/17_out_ser:2147483648 bytes (2.1 GB) copied, 1.03013 s, 2.1 GB/s
/tmp/results/18_out_ser:2147483648 bytes (2.1 GB) copied, 1.03125 s, 2.1 GB/s
/tmp/results/19_out_ser:2147483648 bytes (2.1 GB) copied, 1.03179 s, 2.1 GB/s
/tmp/results/20_out_ser:2147483648 bytes (2.1 GB) copied, 1.02893 s, 2.1 GB/s
root@deskpro21498:/tmp/results# grep copied /tmp/results/par
/tmp/results/01_out_par:2147483648 bytes (2.1 GB) copied, 28.5889 s, 75.1 MB/s
/tmp/results/02_out_par:2147483648 bytes (2.1 GB) copied, 28.535 s, 75.3 MB/s
/tmp/results/03_out_par:2147483648 bytes (2.1 GB) copied, 28.5606 s, 75.2 MB/s
/tmp/results/04_out_par:2147483648 bytes (2.1 GB) copied, 28.5644 s, 75.2 MB/s
/tmp/results/05_out_par:2147483648 bytes (2.1 GB) copied, 28.5557 s, 75.2 MB/s
/tmp/results/06_out_par:2147483648 bytes (2.1 GB) copied, 28.5399 s, 75.2 MB/s
/tmp/results/07_out_par:2147483648 bytes (2.1 GB) copied, 28.5578 s, 75.2 MB/s
/tmp/results/08_out_par:2147483648 bytes (2.1 GB) copied, 28.5544 s, 75.2 MB/s
/tmp/results/09_out_par:2147483648 bytes (2.1 GB) copied, 28.5705 s, 75.2 MB/s
/tmp/results/10_out_par:2147483648 bytes (2.1 GB) copied, 28.5436 s, 75.2 MB/s
/tmp/results/11_out_par:2147483648 bytes (2.1 GB) copied, 28.5596 s, 75.2 MB/s
/tmp/results/12_out_par:2147483648 bytes (2.1 GB) copied, 28.5437 s, 75.2 MB/s
/tmp/results/13_out_par:2147483648 bytes (2.1 GB) copied, 28.5611 s, 75.2 MB/s
/tmp/results/14_out_par:2147483648 bytes (2.1 GB) copied, 28.5543 s, 75.2 MB/s
/tmp/results/15_out_par:2147483648 bytes (2.1 GB) copied, 28.5604 s, 75.2 MB/s
/tmp/results/16_out_par:2147483648 bytes (2.1 GB) copied, 28.5546 s, 75.2 MB/s
/tmp/results/17_out_par:2147483648 bytes (2.1 GB) copied, 28.5358 s, 75.3 MB/s
/tmp/results/18_out_par:2147483648 bytes (2.1 GB) copied, 28.5552 s, 75.2 MB/s
/tmp/results/19_out_par:2147483648 bytes (2.1 GB) copied, 28.5478 s, 75.2 MB/s
/tmp/results/20_out_par:2147483648 bytes (2.1 GB) copied, 28.5414 s, 75.2 MB/s

Comment by James Beal [ 20/Nov/13 ]

Can I add any information?

Can the problem be reproduced elsewhere?

Comment by Kit Westneat (Inactive) [ 21/Nov/13 ]

Hi James,

Can you confirm how large the original file was? Also can you double check your debug settings? When I run it on a Lustre 2.1.5 VM I get a ton of debug data.

sysctl lnet.debug="+rpctrace +dlmtrace +vfstrace +reada"
lnet.debug = ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace reada config console

Here's the script I used:

#!/bin/bash

lctl clear
lctl debug_daemon start /scratch/debug_file 2048
echo 3 > /proc/sys/vm/drop_caches
lctl mark parallel read start
for x in 1 2 3 4 5; do dd if=/lustre/pfs/client/testfile bs=4M of=/dev/null & done
wait
lctl mark parallel read done
echo 3 > /proc/sys/vm/drop_caches
lctl mark serial read start
for x in 1 2 3 4 5; do dd if=/lustre/pfs/client/testfile bs=4M of=/dev/null ; done
lctl mark serial read done
lctl debug_daemon stop
lctl df /scratch/debug_file > /scratch/debug_file.out

I was able to reproduce the slow down you mentioned. I'll attach the debug logs that I got.

parallel:

62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.40159 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.40117 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.39914 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.39789 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.39852 s, 59.6 MB/s

serial:

62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.57006 s, 57.4 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.0536567 s, 4.9 GB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.046017 s, 5.7 GB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.0454 s, 5.8 GB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.0453329 s, 5.8 GB/s
Comment by Andreas Dilger [ 22/Nov/13 ]

In the most recent results with the 2.5.0 client (we don't have a 2.5.1 yet?) the parallel reads take only 28.5s, while the serial reads take 39s in total, so the parallel reads are faster in aggregate than serial reads, though the single serial read is faster (19.4s) than any of the parallel reads. These results also are more in line with NFS (19.9s for the first serial read, 1.3s for each later read).

I updated the script a bit and ran it on my 2.4.0 x86_64 2-core 4-thread client, 2.4.1 server:

#!/bin/sh
FILE=/myth/tmp/100M  # previously created
OST=myth-OST0000

lctl set_param ldlm.namespaces.$OST*.lru_size=clear
lctl set_param llite.*.stats=0 osc.$OST*.rpc_stats=0

for i in $(seq -w 1 20); do
	(time dd if=$FILE bs=4M of=/dev/null )  > /tmp/par_${i}.out 2>&1 &
done

wait
lctl get_param llite.*.stats osc.$OST*.rpc_stats | tee /tmp/par_stats.out

lctl set_param ldlm.namespaces.$OST*.lru_size=clear
lctl set_param llite.*.stats=0 osc.$OST*.rpc_stats=0

for i in $(seq -w 1 20); do
	(time dd if=$FILE bs=4M of=/dev/null )  > /tmp/ser_${i}.out 2>&1
done
lctl get_param llite.*.stats osc.$OST*.rpc_stats | tee /tmp/par_stats.out

This produced the following from rpc_stats, which showed that there are not multiple RPCs being sent:

SERIAL  		read			write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
64:		         2   1   1   |          0   0   0
256:		       100  98 100   |          0   0   0

PARALLEL		read			write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
256:		       100 100 100   |          0   0   0

The time profile is as I would expect: a shorter time for the single serial read but a longer total time (no parallelism), and a shorter total time for the parallel run (the reads overlap):


SERIAL
real 0m9.657s sys 0m1.556s
real 0m0.284s sys 0m0.281s
real 0m0.244s sys 0m0.241s
real 0m0.377s sys 0m0.368s
real 0m0.202s sys 0m0.201s
real 0m0.207s sys 0m0.195s
real 0m0.200s sys 0m0.196s
real 0m0.210s sys 0m0.209s
real 0m0.198s sys 0m0.197s
real 0m0.197s sys 0m0.194s
real 0m0.198s sys 0m0.196s
real 0m0.202s sys 0m0.198s
real 0m0.295s sys 0m0.291s
real 0m0.164s sys 0m0.162s
real 0m0.163s sys 0m0.162s
real 0m0.162s sys 0m0.160s
real 0m0.162s sys 0m0.160s
real 0m0.162s sys 0m0.159s
real 0m0.164s sys 0m0.160s
real 0m0.162s sys 0m0.160s

PARALLEL
real 0m12.817s sys 0m0.918s
real 0m12.816s sys 0m1.018s
real 0m12.804s sys 0m0.837s
real 0m12.791s sys 0m0.865s
real 0m12.783s sys 0m0.922s
real 0m12.768s sys 0m0.866s
real 0m12.773s sys 0m0.888s
real 0m12.773s sys 0m0.968s
real 0m12.761s sys 0m0.842s
real 0m12.741s sys 0m0.897s
real 0m12.734s sys 0m0.856s
real 0m12.718s sys 0m0.869s
real 0m12.700s sys 0m0.896s
real 0m12.711s sys 0m0.933s
real 0m12.685s sys 0m0.931s
real 0m12.656s sys 0m0.854s
real 0m12.669s sys 0m0.970s
real 0m12.663s sys 0m0.848s
real 0m12.655s sys 0m0.947s
real 0m12.649s sys 0m0.880s

Comment by James Beal [ 26/Nov/13 ]

Just as a note: I think I can see the problem; however, I am busy upgrading a Lustre file system and I want to ensure that my communication is clear. Apologies for the delay.

Comment by James Bonfield [ 03/Dec/13 ]

I have a basic C program that runs a while(read()) style loop until EOF on an ~800MB file with varying buffer sizes. I tested this on Lustre, tmpfs and NFS using a mix of block sizes and 1, 4 and 32 concurrent copies. For example, on Lustre with 4 copies (a cleaned-up sketch of this sweep follows the timings below):

jkb@sf-7-1-02[work/benchmarks] fn=/lustre/scratch110/srpipe/references/Human/default/all/bowtie/human_g1k_v37.fasta.1.ebwt; for i in `seq 8 30`;do bs=$(perl -e "print 2 ** $i");echo -n "$bs";(for x in `seq 1 4`;do ./linear2 $fn & done) | grep Took|tail -1;done
256 Took 243.905528 seconds
512 Took 121.146812 seconds
1024 Took 62.789902 seconds
2048 Took 30.558433 seconds
4096 Took 14.908553 seconds
8192 Took 7.506444 seconds
16384 Took 3.727981 seconds
32768 Took 1.864213 seconds
65536 Took 0.882102 seconds
131072 Took 0.468359 seconds
262144 Took 0.405715 seconds
524288 Took 0.411299 seconds
1048576 Took 0.420036 seconds
2097152 Took 0.470832 seconds
4194304 Took 0.680612 seconds
8388608 Took 0.776593 seconds
16777216 Took 0.752007 seconds
33554432 Took 0.765050 seconds
67108864 Took 0.741756 seconds
134217728 Took 0.712213 seconds
268435456 Took 0.677054 seconds
536870912 Took 0.753955 seconds
1073741824 Took 0.931833 seconds
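For reference, a roughly equivalent sweep can be scripted with dd alone (a sketch under the assumption that dd's bs= stands in for the linear2 buffer size; the file path is a placeholder):

#!/bin/bash
# Time 4 concurrent readers of the same (already cached) file at each
# power-of-two block size from 256 bytes up to 1 GiB.
FILE=/lustre/scratch/testfile
for i in `seq 8 30`; do
    bs=$((2 ** i))
    start=$(date +%s.%N)
    for x in `seq 1 4`; do
        dd if=$FILE of=/dev/null bs=$bs 2>/dev/null &
    done
    wait
    echo "$bs took $(echo "$(date +%s.%N) - $start" | bc) seconds"
done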

I then graphed these. Thin lines are 1 process running, the thicker ones are 4, and the thickest are 32 (the labels lustre, lustre4 and lustre32 correspond to the crosses, circles and squares on the data points).

Naturally, for any read there will be a per-system-call overhead, so obviously tiny buffers for read calls are inefficient. This is clearly evident for all file systems.

Also there is clearly a per-byte overhead. Even reading from a file cached in memory, it takes time to copy from the kernel-space buffer cache to the user-space buffer. This is a per-byte delay and will scale linearly with the data size.

These two timings combine to give the net speed of the while(read()) loops. Indeed we see this for NFS. Below a 1k block size there is a decrease in speed as the block size decreases, indicating that the overhead of making more system calls is the major bottleneck. At 10k and above the time is more or less constant, so the per-system-call overhead has become insignificant. In between 1k and 10k we see the gradual bend in the graph where real time is more evenly split between memory copying and per-system-call costs. The Lustre single-threaded run demonstrates the same bend between 10k and 100k, implying that the per-system-call cost is higher.

Now if we have 4 processes running we could expect the per-system call overhead to be the same on all processes, and much the same delay as we get with 1 or 32 processes running (more or less). This is visible in the NFS benchmark as the 32-process version still shows a linear decrease in performance for blocks below 1k (as for the single-process NFS stats) and the bend is "more or less" in the same place.

However, Lustre shows a marked difference. As we add more simultaneous processes, the location of the trade-off between per-system-call and per-byte costs moves from ~10k through ~100k for 4 processes and up to ~1MB for 32 processes. Why is this? Basically it indicates that the CPU overhead of each and every system call goes up as other processes issue more system calls. The net effect is that even with moderately sized buffers (e.g. 10k) Lustre becomes a couple of orders of magnitude slower than NFS.

To my eyes that implies there is some contention between independent system calls, possibly a thread lock. I appreciate that coherency is HARD, so I wonder if there is some way we can provide hints to the Lustre client to make things easier. E.g. is it possible to set the immutable flag on a file to avoid this lock? Does opening in read-only mode help? Does having the file itself be read-only help?

James

Comment by Kit Westneat (Inactive) [ 03/Dec/13 ]

Hi James,

Can you confirm what version of Lustre the clients and servers are running? Are you able to get the debug logs that Andreas requested while running these tests? I had to use the debug_daemon function in order to capture all the logs.

Thanks,
Kit

Comment by James Beal [ 03/Dec/13 ]

I have run my test on Lustre 1.8.9 clients and 2.5.0 clients; however, James above would have only used Lustre 1.8.9. I am rather deep in a set of Lustre server upgrades right now, so I will not be able to look at this until Tuesday, unless it is critical before then.

Comment by Andreas Dilger [ 03/Dec/13 ]

James, you are correct that there is per-syscall overhead in Lustre, and contention will indeed increase with multiple threads. Lustre has fully coherent locking between nodes, while NFS has no locking at all. That means if one node is writing to a file and another node is reading at the same time, or if two nodes are writing on non-block-aligned boundaries, NFS does not guarantee anything about the state of your data at the end. NFS only ensures that local data is flushed when the file is closed, and that another client can see this data when the file is opened afterward.

That said, I agree the client-side locking overhead is higher than it should be, and we are working to improve this for the 2.6 release. If you are able to test prototype code on a client node, it would be possible to see if the current performance patch series is showing improvement on your system. Most of the patches are already in the lustre-master branch in Git.

The last patch in this series that I'm aware of is:
http://review.whamcloud.com/7895

There are pre-built RPMs available at (click on blue circle for your arch distro combo):
http://build.whamcloud.com/job/lustre-reviews/20005/

Comment by James Beal [ 12/Dec/13 ]

This is a set of test results from a production node that has been taken out of one of our clusters. Note that the Lustre client is mounted with localflock. We expect that a number of processes all reading the same file should be quicker than the same number of processes reading different files, since the file is in cache and we would not expect additional IO; however, there appears to be significant contention which we do not understand. I note that the block size we read the file with makes a very significant difference. The problem was discovered with a program that reads the file one character at a time. I will repeat the runs with a Lustre 2.1 client pulled down from lustre-reviews.

A little table which summarises the results.

                                            4M reads    4K reads     slow down
lustre reading multiple files concurrently  4843 secs   7202 secs    1.5
lustre reading a single file concurrently   8470 secs   131779 secs  15.6
tmpfs reading multiple files concurrently   8019 secs   6142 secs    0.8
tmpfs reading a single file concurrently    8161 secs   6131 secs    0.8
Lustre 1.8.9 client, Lustre 2.4.1 server (32 files, each 512MB in size, read with dd using a 4MB block size), mounted using localflock

Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread 
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 4843.25
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 8469.83

Lustre 1.8.9 client, Lustre 2.4.1 server (32 files, each 512MB in size, read with dd using a 4KB block size), mounted using localflock

root@sf-7-1-02:/root# ./test.sh
Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 7202.32
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 131779

tmpfs (32 files, each 512MB in size, read with dd using a 4MB block size)

Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread 
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 8018.69
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 8160.59

tmpfs (32 files, each 512MB in size, read with dd using a 4KB block size)

Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread 
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 6141.82
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 6130.82

The scripts are as follows:

cat readfile.sh 
#!/bin/sh
TYPE=$1
LOC=$2
PAR=$3
MULT=$4

if [ ${TYPE} = "SEQ" ] ; then 
   for i in `seq -w 1 ${PAR}`
   do
     echo ${LOC}/${i}
     for j in `seq 1 ${MULT}`
     do
       dd if=${LOC}/${i} of=/dev/null bs=4k
     done
   done
fi
if [ ${TYPE} = "PAR" ] ; then
  echo $LOC
  for i in `seq -w 1 ${PAR}`
  do
    echo ${LOC}
    for j in `seq 1 ${MULT}`
    do    
      dd if=${LOC} of=/dev/null bs=4k
    done
  done
fi

And the test script

cat test.sh 
#!/bin/sh

LOC=/mnt/tmp/4k
#Where to store the results
RES=`echo $LOC| sed -e 's#/#-#g' -e 's/^-//' `
# How many processes to run at one time
PAR=32
BASE=`dirname $0`
# How many times to read the file ( to multiple the caching effect )
MULT=4
mkdir -p  ${RES}

echo -n Creating source files
for i in `seq -w  1 ${PAR}`
do
  dd if=/dev/zero of=${LOC}/${i} bs=4M count=128 > ${RES}/${i}_create 2>&1 
  echo -n .
done

echo
echo Dropping cache
echo 3 > /proc/sys/vm/drop_caches
echo "All threads reading the same file"
for i in `seq -w 1 ${PAR}`
do
   (time $BASE/readfile.sh SEQ ${LOC} ${PAR} ${MULT})  > ${RES}/${i}_parallel_same_file  2>&1 &
echo -n
done
wait

echo Dropping cache
echo 3 > /proc/sys/vm/drop_caches
echo "Reading the same file in a single thread "
for i in `seq -w 1 ${PAR} `
do
   (time ${BASE}/readfile.sh PAR ${LOC}/${i} ${PAR} ${MULT}  )  > ${RES}/${i}_parallel_different_file  2>&1 &
done
wait

echo "Time taken using read pattern"
echo "Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo "FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H"
echo -n "Pattern takes "
grep -h real ${RES}/*_parallel_different_file | sed -e 's/real *[^ ]//' -e 's/s$//' -e 's/m/*60+/'  | bc | awk '{ sum+=$1} END {print sum}'

echo "Time taken using read pattern"
echo "Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H"
echo "File_A File_A File_A File_A File_A File_A File_A File_A"
echo "File_B File_B File_B File_B File_B File_B File_B File_B"
echo "File_C File_C File_C File_C File_C File_C File_C File_C"
echo "File_D File_D File_D File_D File_D File_D File_D File_D"
echo "File_E File_E File_E File_E File_E File_E File_E File_E"
echo "File_F File_F File_F File_F File_F File_F File_F File_F"
echo "File_G File_G File_G File_G File_G File_G File_G File_G"
echo "File_H File_H File_H File_H File_H File_H File_H File_H"
echo -n "Pattern takes "
grep -h real ${RES}/*_parallel_same_file | sed -e 's/real *[^ ]//' -e 's/s$//' -e 's/m/*60+/'  | bc | awk '{ sum+=$1} END {print sum}'
Comment by Kit Westneat (Inactive) [ 12/Dec/13 ]

Hi James,

Can you also try with a master branch client? If you have time, a 2.5.x client would also be interesting I think.

I'll try to test with different client versions as well.

Thanks,
Kit

Comment by James Beal [ 12/Dec/13 ]

This is from an old system that has only 16GB of RAM and only 8 cores. The size of each file was increased to 1GB, but the amount of parallelism was reduced to 8.

[   67.821266] Lustre: Lustre: Build Version: jenkins-gb641cc0-PRISTINE-2.6.32-41-server
[   68.182300] Lustre: Added LNI 172.17.96.27@tcp [8/256/0/180]
[   68.182364] Lustre: Accept secure, port 988
[   68.414016] Lustre: Lustre OSC module (ffffffffa08556e0).
[   68.538800] Lustre: Lustre LOV module (ffffffffa08ea0e0).
[   68.701672] Lustre: Lustre client module (ffffffffa09d6ca0).
[  174.074540] Lustre: MGC172.17.128.135@tcp: Reactivating import
[  174.074569] Lustre: Server MGS version (2.4.1.0) is much newer than client version. Consider upgrading client (2.1.6)
[  174.086076] LustreError: 1200:0:(mgc_request.c:247:do_config_log_add()) failed processing sptlrpc log: -2
[  174.180021] Lustre: Server lus03-MDT0000_UUID version (2.4.1.0) is much newer than client version. Consider upgrading client (2.1.6)
                                            4M reads   4K reads    slow down
lustre reading multiple files concurrently  868 secs   1702 secs   2.0
lustre reading a single file concurrently   988 secs   4744 secs   4.8
tmpfs reading multiple files concurrently   424 secs   207 secs    0.5
tmpfs reading a single file concurrently    537 secs   191 secs    0.4

4M read, lustre

root@isg-dev4:/root# bash test.sh
Creating source files........
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 867.87
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 988.253
root@isg-dev4:/root#

4K read,lustre

root@isg-dev4:/root# bash test.sh
Creating source files........
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 1702.89
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 4744.43
root@isg-dev4:/root#

4k reads, tmpfs

root@isg-dev4:/root# ./test.sh
Creating source files........
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 207.198
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 191.962
root@isg-dev4:/root#

4M reads,tmpfs

root@isg-dev4:/root# ./test.sh
Creating source files........
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 424.451
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 537.192


Comment by Jinshan Xiong (Inactive) [ 13/Dec/13 ]

Sorry, I just saw this ticket today. I didn't read all the comments, but the result mentioned in "James Beal added a comment - 18/Nov/13 4:08 AM" seems quite reasonable to me.

For serial reading, only the first read goes to the server; the following reads come from the client cache, which is why they are super fast.

For parallel reading, things are different. Basically only one process issues the read RPC and the other 19 processes just wait for the result. For example, for the first 4MB of pages, the first process sends the read RPC; the other processes then try to read the same portion of data but find the pages locked, so they have to wait for the RPC to finish. The same thing happens for the second 4MB block, and so on.

Comment by Kit Westneat (Inactive) [ 13/Dec/13 ]

But then wouldn't you expect the first serial process time to be about the same as the first parallel process time? You can see from the data that the first serial process completes in about 1/3 less time than any of the parallel processes.

Comment by Jinshan Xiong (Inactive) [ 13/Dec/13 ]

Actually you can't expect that, because there must be a lot of page-lock waiting for the parallel processes, while serial reading does much better thanks to readahead.

Comment by James Beal [ 13/Dec/13 ]

I note that the files are only being read and the file system is mounted with localflock. If we compare the ratio of speeds, tmpfs shows no effective difference, compared to a slowdown by a factor of 15.6 for Lustre. Note that we have applications that read files one character at a time. Our current workaround involves an LD_PRELOAD hack that changes the reads into memory-mapped access (see the usage sketch after the table below).

                                            4M reads    4K reads     slow down
lustre reading multiple files concurrently  4843 secs   7202 secs    1.5
lustre reading a single file concurrently   8470 secs   131779 secs  15.6
tmpfs reading multiple files concurrently   8019 secs   6142 secs    0.8
tmpfs reading a single file concurrently    8161 secs   6131 secs    0.8
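For context, such a preload shim is typically built and injected like this (a usage sketch only; mmap_read.c and character_reader are hypothetical names, not our actual shim or application):

# Build the shim as a shared object, then preload it into the reader
gcc -shared -fPIC -o mmap_read.so mmap_read.c -ldl
LD_PRELOAD=$PWD/mmap_read.so ./character_reader /lustre/scratch/input_file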
Comment by Andreas Dilger [ 13/Dec/13 ]

Unless there is a difference in what is being tested between the filesystems, or I'm misunderstanding your results, I find it interesting that Lustre concurrent 4MB reads (4843s) are significantly faster than tmpfs 4MB reads (~8019s). Even with 4KB reads the total time for Lustre (7202s) is less than the 4MB reads for tmpfs (8019s).

It is reasonable that there is overhead for Lustre with smaller read sizes and with concurrent IO compared to tmpfs. Lustre needs to do significantly more work per syscall than tmpfs. Even with a single client, there is locking overhead to maintain consistency (whether it is needed in this particular use or not), overhead in being able to potentially distribute the IO across multiple servers, etc., while tmpfs is as close to a "no-op" as a filesystem can be. That said, I agree the 15x slowdown for 4kB reads is excessive, and indicates there is too much contention/overhead in this part of the code.

In 1.8 there was a fast path implemented for small reads, since most of the pages were prefetched via readahead and already in the page cache, so only a minimal amount of work was needed before returning the page to userspace. I think Jinshan is working on a patch to improve this for 2.x as well.

Comment by James Beal [ 13/Dec/13 ]

Andreas, the numbers I pasted were for a 1.8.9 client; I will try and see what a 2.5 client gives us.

Lustre 1.8.9                                4M reads    4K reads     slow down
lustre reading multiple files concurrently  4843 secs   7202 secs    1.5
lustre reading a single file concurrently   8470 secs   131779 secs  15.6
tmpfs reading multiple files concurrently   8019 secs   6142 secs    0.8
tmpfs reading a single file concurrently    8161 secs   6131 secs    0.8

For all these reads the results should be "in the cache". I will try to do a run with 1-character reads, which is what the program that showed this issue in the first place does, so it is apparent how painful the difference is.

Comment by James Beal [ 15/Dec/13 ]

The change between Lustre 1.8.9 and Lustre 2.5.0 looks very unpleasant (the run with one-character reads has been running over the weekend and has not completed the first part yet). Given that we are moving to Lustre 2.5 as our client on all our hosts very soon, I am very wary of these numbers.

                                                  4M reads     4K reads      slow down
lustre 1.8.9 reading multiple files concurrently  4843 secs    7202 secs     1.5
lustre 2.5.0 reading multiple files concurrently  4578 secs    55267 secs    12.1
lustre 1.8.9 reading a single file concurrently   8470 secs    131779 secs   15.6
lustre 2.5.0 reading a single file concurrently   18500 secs   2033260 secs  110.0
tmpfs reading multiple files concurrently         8019 secs    6142 secs     0.8
tmpfs reading a single file concurrently          8161 secs    6131 secs     0.8
Comment by Jinshan Xiong (Inactive) [ 16/Dec/13 ]

It seems bad. Let's start by investigating the single-file concurrent read case. Can you please send me the output of:

lctl get_param llite.*.read_ahead_stats
lctl get_param osc.*.rpc_stats

when running 1.8 and 2.5 clients specifically. Please remount the client before running the test case so that we can get accurate stats.

Also, can you please run it with a single thread on both clients to check how much difference there is between them?
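For example, something along these lines around each run (a sketch; the counters are zeroed the same way as in the earlier script, and test.sh stands in for whichever test command is used):

# After remounting the client, zero the counters, run the test, then capture
lctl set_param llite.*.read_ahead_stats=0 osc.*.rpc_stats=0
./test.sh
lctl get_param llite.*.read_ahead_stats osc.*.rpc_stats > /tmp/stats.out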

Comment by James Beal [ 16/Dec/13 ]

To complete the tests in a reasonable time I had to reduce the size of the file to 128MB

                                                 4M reads    4K reads      slow down
lustre 1.8.9 reading a single file concurrently  768 secs    5373 secs     7
lustre 2.5.0 reading a single file concurrently  1363 secs   132559 secs   97
Slow down (2.5.0 vs 1.8.9)                       1.8         24.7

                                                 4M reads    4K reads      slow down
lustre 1.8.9 reading a single file               0.45 secs   0.35 secs     0.78
lustre 2.5.0 reading a single file               0.38 secs   1.25 secs     3.29
Slow down (2.5.0 vs 1.8.9)                       0.85        3.57

The single client reads with a larger file

This is with a 4GB file, which seems to follow the smaller-file pattern.

                                    4M reads   4K reads    slow down
lustre 1.8.9 reading a single file  9.1 secs   9.0 secs    1.01
lustre 2.5.0 reading a single file  8.8 secs   36.9 secs   4.19
Slow down (2.5.0 vs 1.8.9)          0.97       4.1
Comment by James Beal [ 16/Dec/13 ]

The logs have been added as attachments, along with a new copy of the test script.

Comment by Jinshan Xiong (Inactive) [ 16/Dec/13 ]

It looks like the per-IO overhead of 2.x is much higher than in 1.8. I did some experiments on my local machine, and cl_lock contributes most of the overhead.

I hope we can solve this problem in the ongoing CLIO simplification project.

Comment by Kit Westneat (Inactive) [ 16/Jan/14 ]

Hi Jinshan,

Can you elaborate on the clio simplification project and what sort of impact it might have on this IO pattern?

Thanks,
Kit

Comment by Jinshan Xiong (Inactive) [ 16/Jan/14 ]

The CLIO project is funded by OpenSFS. We will simplify the cl_lock implementation there. There will be a performance tuning phase at the end of the project, which will give us a chance to look at this problem.

I previously set up an environment to check which components contribute most to the slowdown. From what I have seen, cl_lock contributes some overhead. Other components, for example cl_io data initialization, layout refresh, etc., also contribute significant overhead.

I'm on a really tight schedule, so I haven't been able to find time to work on this recently.

Comment by Richard Henwood (Inactive) [ 16/Jan/14 ]

Kit: for clarity, OpenSFS have funded a design phase for CLIO Simplification. Implementation and an associated performance analysis are not within scope for the CLIO Simplification design phase.

Comment by Jinshan Xiong (Inactive) [ 01/Feb/14 ]

I'd like to share some results I've got so far.

First of all, I can reproduce the problem mentioned by James. I worked out a patch and, together with the prototype of the cl_lock simplification, I have some results to share.

There are two types of read in my test:
1. 32 threads read 32 individual files;
2. 32 threads read the same file.

I also ran the above test cases with the data in the cache and not in the cache.

Here are the results (seconds):

                 4M reads   4K reads
ind + cache      24.96      198.53
par + cache      20.65      111.02
ind + no cache   161.8      157.37
par + no cache   13.44      115.51

ind: threads read individual files
par: threads read the same file
cache: before reading, the data is already in memory
no cache: clean up memory cache before reading

Raw data is here:

[root@c01 tests]# CACHE=yes ~/test.sh 
Dropping cache
Read everything into memory cache..................................done
Creating source files................................
Block size 4k
cache: yes, Threads read individual files - wall: 16, real: 198.53 seconds

cache: yes, All threads reading the same file - wall: 43, real: 111.02 seconds
Block size 4M
cache: yes, Threads read individual files - wall: 1, real: 24.96 seconds

cache: yes, All threads reading the same file - wall: 1, real: 20.65 seconds
[root@c01 tests]# CACHE=no ~/test.sh 
Dropping cache
Creating source files................................
Block size 4k
Dropping cache
cache: no, Threads read individual files - wall: 101, real: 157.37 seconds

Dropping cache
cache: no, All threads reading the same file - wall: 43, real: 115.51 seconds
Block size 4M
Dropping cache
cache: no, Threads read individual files - wall: 94, real: 161.8 seconds

Dropping cache
cache: no, All threads reading the same file - wall: 1, real: 13.44 seconds

and this is my test script:

[root@c01 tests]# cat ~/test.sh 
#!/bin/bash

CACHE=${CACHE:-"no"}
LOC=/mnt/lustre/lustre_test
mkdir -p $LOC

#Where to store the results
RES=~/res
# How many processes to run at one time
PAR=32
BASE=`dirname $0`
# How many times to read the file ( to multiple the caching effect )
MULT=1

rm -rf ${RES} && mkdir -p ${RES}

function remount_lustre {
  echo Dropping cache
  echo 3 > /proc/sys/vm/drop_caches
  umount /mnt/lustre
  mount | grep c01
  mount -t lustre -o localflock  c01@tcp0:/lustre /mnt/lustre 

  if [ "$CACHE" = "yes" ]; then
    echo -n Read everything into memory cache..
    for i in `seq -f "%03.f" 1 ${PAR}`; do
          dd if=${LOC}/${i} of=/dev/null bs=4M > /dev/null 2>&1 
          echo -n .
    done
    echo done
  fi
}

remount_lustre

echo -n Creating source files
for i in `seq -f "%03.f" 1 ${PAR}`
do
  [ -e ${LOC}/${i} ] || dd if=/dev/zero of=${LOC}/${i} bs=4M count=32 > ${RES}/${i}_create 2>&1 
  echo -n .
done
echo

for BLOCK in "4k" "4M"
do
  echo "Block size $BLOCK"

  [ "$CACHE" = "no" ] && remount_lustre

  echo -n "cache: $CACHE, Threads read individual files - "

  st=$(date +%s)
  for i in `seq -f "%03.f" 1 ${PAR}`; do
    /usr/bin/time -p -a -o ${RES}/single.${BLOCK} dd if=${LOC}/$i of=/dev/null bs=${BLOCK} > /dev/null 2>&1 &
  done
  wait

  echo -n "wall: $((`date +%s`-st)), "
  t=`grep -h real ${RES}/single.${BLOCK} | sed -e 's/real *[^ ]//' -e 's/s$//' -e 's/m/*60+/'  | bc | awk '{ sum+=$1} END {print sum}'`
  echo "real: $t seconds"
  echo

  [ "$CACHE" = "no" ] && remount_lustre

  echo -n "cache: $CACHE, All threads reading the same file - "

  st=$(date +%s)
  for i in `seq -f "%03.f" 1 ${PAR}`; do
    /usr/bin/time -p -a -o ${RES}/par.${BLOCK} dd if=${LOC}/001 of=/dev/null bs=${BLOCK} > /dev/null 2>&1 &
  done
  wait

  echo -n "wall: $((`date +%s`-st)), "
  t=`grep -h real ${RES}/par.${BLOCK} | sed -e 's/real *[^ ]//' -e 's/s$//' -e 's/m/*60+/'  | bc | awk '{ sum+=$1} END {print sum}'`
  echo "real: $t seconds"
done
[root@c01 tests]# 
Comment by Jinshan Xiong (Inactive) [ 01/Feb/14 ]

Some of the patches are still in prototype phase, and I'll share the patches with you guys as soon as I can, so that you can try to reproduce what I have done.

Today is Chinese New Year. Happy Chinese New Year, everybody!

Comment by James Beal [ 01/Feb/14 ]

Happy New Year to you too.

I completely understand about prototypes. It would be good to have before-and-after numbers in the tables.

Thank you for looking at this for us.

Comment by Dmitry Eremin (Inactive) [ 03/Feb/14 ]

Could you test with my patch http://review.whamcloud.com/#/c/9095?
This patch significantly reduces the number of context switches on ll_inode_size_lock(), so that overhead is gone and I see a speedup in parallel dds on my machine.

Comment by James Beal [ 03/Feb/14 ]

I will build a new client tomorrow.

Comment by Jinshan Xiong (Inactive) [ 03/Feb/14 ]

Here are the results before and after my patches are applied (in seconds):

                single   parallel
master          3.28     18.91
master + patch  2.84     5.80
b1_8            0.62     11.54

single: a single thread reads a single file
parallel: NR_CPUS threads read a single file in parallel (8 threads in my case).

The file size is 1GB and the file is already in cache before the threads start; the read block size is 4KB.

Comment by James Beal [ 13/Mar/14 ]

While we wait, I found the following interesting:

http://lwn.net/Articles/590243/

Performance-oriented patches should, of course, always be accompanied by benchmark results. In this case, Waiman included a set of AIM7 benchmark results with his patch set (which did not include the pending-bit optimization). Some workloads regressed a little, but others shows improvements of 1-2% — a good result for a low-level locking improvement. The disk benchmark runs, however, improved by as much as 116%; that benchmark suffers from especially strong contention for locks in the virtual filesystem layer and ext4 filesystem code.

Comment by Manish Patel (Inactive) [ 10/Apr/14 ]

Hi Jinshan,

Any updates on the request made by James? We are looking to try out the new patch for the performance issues, based on the above comment.

Thank You,
Manish

Comment by Jinshan Xiong (Inactive) [ 28/Jun/14 ]

I came up with a solution for this issue and the initial testing result is exciting.

The test case I used is based on the test cases shared by Sanger; please see below:

#!/bin/bash

nr_cpus=$(grep -c ^processor /proc/cpuinfo)

for CACHE in no yes; do
	for BS in 4k 1M; do
		echo "===== cache: $CACHE, block size: $BS ====="
		[ "$CACHE" = "yes" ] && { dd if=/mnt/lustre/testfile bs=1M of=/dev/null > /dev/null 2>&1; }
		[ "$CACHE" = "no" ] && { echo 3 > /proc/sys/vm/drop_caches; }

		echo -n "      single read: "
		dd if=/mnt/lustre/testfile bs=$BS of=/dev/null 2>&1 |grep copied |awk -F, '{print $3}'

		[ "$CACHE" = "no" ] && { echo 3 > /proc/sys/vm/drop_caches; }

		echo -n "      parallel read: "
		for i in `seq -w 1 ${nr_cpus}`; do
			dd if=/mnt/lustre/testfile bs=$BS of=/dev/null > results/${i}_out 2>&1 &
		done
		wait
		grep copied results/1_out | awk -F, '{print $3}'
	done
done

The test file is 2GB in size so that it can fit in memory for cache-enabled testing. I applied the patches and compared the test results with and without the patches.

The results with my patches:

===== cache: no, block size: 4k =====
      single read:  1.2 GB/s
      parallel read:  576 MB/s
===== cache: no, block size: 1M =====
      single read:  1.4 GB/s
      parallel read:  566 MB/s
===== cache: yes, block size: 4k =====
      single read:  3.8 GB/s
      parallel read:  1.8 GB/s
===== cache: yes, block size: 1M =====
      single read:  6.4 GB/s
      parallel read:  1.3 GB/s

The results without my patches:

===== cache: no, block size: 4k =====
      single read:  257 MB/s
      parallel read:  148 MB/s
===== cache: no, block size: 1M =====
      single read:  1.1 GB/s
      parallel read:  420 MB/s
===== cache: yes, block size: 4k =====
      single read:  361 MB/s
      parallel read:  147 MB/s
===== cache: yes, block size: 1M =====
      single read:  5.8 GB/s
      parallel read:  1.3 GB/s

The small-IO performance improved significantly. I'm still doing some fine tuning of the patches, and I will release them as soon as I can so that you can do some evaluation.

Comment by Gerrit Updater [ 02/Jun/16 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/20574
Subject: LU-4257 clio: replace semaphore with mutex
Project: fs/lustre-release
Branch: b2_4
Current Patch Set: 1
Commit: 3e1cbe0b81eaee6e509c825455669c89df157915

Comment by Gerrit Updater [ 03/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20254/
Subject: LU-4257 obdclass: Get rid of cl_env hash table
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 45332712783a4756bf5930d6bd5f697bbc27acdb

Comment by Gerrit Updater [ 03/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20256/
Subject: LU-4257 llite: fix up iov_iter implementation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1101120d3258509fa74f952cd8664bfdc17bd97d

Comment by Gerrit Updater [ 03/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20255/
Subject: LU-4257 llite: fast read implementation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 172048eaefa834e310e6a0fa37e506579f4079df

Comment by Peter Jones [ 03/Jun/16 ]

Landed for 2.9

Comment by Andreas Dilger [ 07/Jun/16 ]

This patch introduced an intermittent test failure LU-8248 in sanity.sh test_248.

Comment by Gerrit Updater [ 09/Jun/16 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/20647/
Subject: LU-4257 test: Correct error_ignore message
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5f03bf91e68e925149f2331a44d1e4ad858b8006

Generated at Sat Feb 10 01:41:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.