[LU-2001] read operation is slow when mmap is enabled Created: 21/Sep/12  Updated: 08/Feb/18  Resolved: 08/Feb/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Won't Fix Votes: 0
Labels: patch
Environment:

lustre-1.8.8 CentOS5.8


Attachments: File debug-1.txt.gz     File debug-2.txt.gz     File debug-3.txt.gz     Zip Archive lustre-2.1.3-TCP-8thr.zip     Zip Archive lustre.8.zip     Microsoft Word miss-rates.xlsx     Zip Archive nfs_on_lustre.8.zip    
Severity: 3
Rank (Obsolete): 8891

 Description   

There is an application running on Lustre.

When we run this application through the Lustre client, it is really slow and the application seems to stall.
However, if the Lustre client exports the Lustre filesystem over NFS and mounts it on itself, so that the application does its I/O to Lustre through NFS, the performance seems reasonable.

  1. mount -t lustre
    192.168.100.131@tcp:/lustre on /lustre type lustre (rw)
  2. exportfs
    /lustre <world>
  3. mount -t nfs
    localhost:/lustre on /lustre-nfs type nfs (rw,addr=127.0.0.1)
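
For reference, this setup is roughly equivalent to the following commands (the NFS export options shown are just an example, not necessarily what we used):

  # mount the Lustre filesystem on the client
  mount -t lustre 192.168.100.131@tcp:/lustre /lustre

  # export it over NFS from the same node (export options are an assumption)
  echo '/lustre *(rw,no_root_squash,fsid=1)' >> /etc/exports
  exportfs -ra

  # loopback-mount the NFS export on the same client
  mkdir -p /lustre-nfs
  mount -t nfs localhost:/lustre /lustre-nfs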

The application performs the following operations:
1. Four jobs (the same application) run on a single client, and these jobs read a single shared file as input. The file is opened with open() and mapped with mmap().
2. Each job computes on the data and writes its output to a separate file. The block size is 8 KB for read() and write().

Any ideas why the application is faster when it goes through the NFS layer to Lustre?



 Comments   
Comment by Peter Jones [ 21/Sep/12 ]

Ihara

Is this different behaviour compared to earlier 1.8.x releases? If not, could this be a known issue?

Jinshan

You did the performance improvements for mmap that went into 2.2 (under LU-884) so can you comment?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 21/Sep/12 ]

This seems like a readahead issue.

Ihara, does the application use mmap to read the file and write it via the VFS interface? If that is the case, we can try to reproduce it by issuing mmap reads only.

> You did the performance improvements for mmap that went into 2.2 (under LU-884) so can you comment?

It smells unrelated to me. If my guess above is wrong, I will verify this.

Comment by Shuichi Ihara (Inactive) [ 21/Sep/12 ]

Jinshan,
I believe so.
Any good ideas on what data to collect to confirm what you said, that it writes via the VFS interface?

Comment by Jinshan Xiong (Inactive) [ 22/Sep/12 ]

This could be an NFS or a Lustre problem. Please show me the stats of:
1. llite.*.read_ahead_stats
2. llite.*.extents_stats and llite.*.extents_stats_per_process
3. osc.*.rpc_stats
4. nfsiostat on the client node (the NFS server)
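
For example, a collection sequence along these lines should capture all of those (output file names are just examples; extents_stats may need to be enabled first on some versions):

  # reset the counters before the run (writing to a stats file clears it)
  lctl set_param llite.*.read_ahead_stats=clear osc.*.rpc_stats=clear

  # ... run the jobs for the sampling period ...

  lctl get_param llite.*.read_ahead_stats > read_ahead_stats.txt
  lctl get_param llite.*.extents_stats llite.*.extents_stats_per_process > extents_stats.txt
  lctl get_param osc.*.rpc_stats > rpc_stats.txt

  # on the NFS-server/Lustre-client node, sample NFS I/O if nfsiostat is available
  nfsiostat 5 > nfsiostat.txt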

Comment by Shuichi Ihara (Inactive) [ 23/Sep/12 ]

Jinshan,

The job takes a long time to finish, so I started 8 jobs on a single client and collected statistics only for the first 1800 seconds. Even within the first 1800 seconds, we can see the performance difference between NFS on Lustre and native Lustre.

There are two test cases (1x OSS, 1x OST and 1x client):

1. Native Lustre client
2. NFS on Lustre (the Lustre client mounts Lustre and exports it over NFS; the same client mounts it again via NFS as a loopback mount)

192.168.100.126@tcp:/lustre
                     15283944352  68657584 14450657920   1% /lustre
localhost:/lustre    15283944448  68657152 14450658304   1% /lustre-nfs

Here are the output files after 1800 seconds. A 2.3 GB file per job is created when the application runs through the native Lustre client, but a 7.4 GB file per job is created when the application goes through NFS to Lustre.

Native Lustre

# cat lustre.8/output.log 
total 18161696
-rw-r--r-- 1 root root 2325741568 Sep 24 03:01 18855.out
-rw-r--r-- 1 root root 2325741568 Sep 24 03:01 18858.out
-rw-r--r-- 1 root root 2325741568 Sep 24 03:01 18861.out
-rw-r--r-- 1 root root 2325741568 Sep 24 03:01 18864.out
-rw-r--r-- 1 root root 2325741568 Sep 24 03:01 18867.out
-rw-r--r-- 1 root root 2325741568 Sep 24 03:01 18870.out
-rw-r--r-- 1 root root 2325741568 Sep 24 03:01 18873.out
-rw-r--r-- 1 root root 2325741568 Sep 24 03:01 18876.out

NFS on Lustre

# cat nfs_on_lustre.8/output.log 
total 58710112
-rw-r--r-- 1 root root 7584350208 Sep 24 02:28 17205.out
-rw-r--r-- 1 root root 7462715392 Sep 24 02:28 17208.out
-rw-r--r-- 1 root root 7462715392 Sep 24 02:28 17211.out
-rw-r--r-- 1 root root 7462715392 Sep 24 02:28 17214.out
-rw-r--r-- 1 root root 7523532800 Sep 24 02:28 17217.out
-rw-r--r-- 1 root root 7523532800 Sep 24 02:28 17220.out
-rw-r--r-- 1 root root 7643070464 Sep 24 02:28 17223.out
-rw-r--r-- 1 root root 7523532800 Sep 24 02:28 17226.out

You can see the Lustre statistics as well. With native Lustre, it is mostly 1 page per RPC.

Lustre

			read			write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:		   4973077  99  99   |          0   0   0
2:		      9224   0  99   |          0   0   0
4:		      1216   0  99   |          1   0   0
8:		       949   0  99   |          0   0   0
16:		      1081   0  99   |          0   0   0
32:		      1255   0  99   |          0   0   0
64:		      1099   0  99   |          0   0   0
128:		       600   0  99   |          3   0   0
256:		     16625   0 100   |      17736  99 100

NFS on Lustre

			read			write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:		      2394   7   7   |          0   0   0
2:		       812   2   9   |          0   0   0
4:		       935   2  12   |          1   0   0
8:		      1111   3  16   |          1   0   0
16:		      1624   5  21   |          1   0   0
32:		      2283   7  28   |          1   0   0
64:		      3083   9  37   |          0   0   0
128:		      4324  13  51   |          0   0   0
256:		     15755  48 100   |      57389  99 100
Comment by Jinshan Xiong (Inactive) [ 24/Sep/12 ]

From the readahead stats in the log file, we can interpret what is going on:

llite.lustre-ffff810744e39800.read_ahead_stats=
snapshot_time             1348423289.615683 secs.usecs
hits                      4345662 samples [pages]
misses                    5015426 samples [pages]
readpage not consecutive  5469845 samples [pages]
miss inside window        1812 samples [pages]
failed grab_cache_page    3187455 samples [pages]
read but discarded        15712 samples [pages]
zero size window          6238737 samples [pages]

and nfs over lustre:

llite.lustre-ffff81054d3c6c00.read_ahead_stats=
snapshot_time             1348421290.346850 secs.usecs
hits                      4263548 samples [pages]
misses                    17453 samples [pages]
readpage not consecutive  335391 samples [pages]
failed grab_cache_page    5578837 samples [pages]
read but discarded        24 samples [pages]
zero size window          113 samples [pages]
hit max r-a issue         3 samples [pages]

See the miss rate: native Lustre's is much higher, and this caused single-page RPCs to be issued. The application is accessing the file in a random pattern; see the zero-size-window stats in the native Lustre case.

(I have not checked the NFS code.) I think there are two layers of readahead in the NFS-over-Lustre case: 1) mmap readahead on the NFS client, and 2) readahead from vfs_read in Lustre. This makes the readahead a little more aggressive; the side effect is the high rate of 'failed grab_cache_page', which I guess means the client was short of memory at the time.

In general, this kind of issue is hard to fix, as the application's access pattern itself is to blame. Anyway, I'll think about how to improve the readahead algorithm for the mmap case when I have a free moment.
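
As a data point while investigating, the client-side readahead tunables can be checked (and experimented with) via lctl; these should be the standard llite parameters, and the set_param value below is only an example, not a recommendation:

  # current readahead window settings on the client
  lctl get_param llite.*.max_read_ahead_mb
  lctl get_param llite.*.max_read_ahead_per_file_mb
  lctl get_param llite.*.max_read_ahead_whole_mb

  # example of experimenting with a larger per-file readahead window
  lctl set_param llite.*.max_read_ahead_per_file_mb=64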

Comment by Shuichi Ihara (Inactive) [ 25/Sep/12 ]

Jinshan,

I just tested this on lustre-2.1.3 and it worked well; the numbers are comparable to what we are seeing over NFS on lustre-1.8.x.
Do you have any hints for what we can try on lustre-1.8.x to get the improved performance?

Comment by Jinshan Xiong (Inactive) [ 25/Sep/12 ]

Interesting... both the RPC and readahead stats are much better in 2.1.3. I remember a readahead bug was fixed by wangdi; I'll update this ticket after I track it down.

Comment by Jinshan Xiong (Inactive) [ 25/Sep/12 ]

Two readahead patches were committed recently:

LU-983: f1a4b79e378407e4161c2a922478d625a38452b5
LU-15: b7eb1d769cc80216e353c4e4217dfe9070927139

Can you please check if they are in your branch?

Comment by Shuichi Ihara (Inactive) [ 25/Sep/12 ]

Hmm, these already landed in 1.8.8 (or before 1.8.8). We are using lustre-1.8.8 here, so these patches should be included.

Comment by Di Wang [ 25/Sep/12 ]

Ihara, could you please gather a "+reada" debug log (maybe only reada: lctl set_param debug=reada) for me while the job is running? Thanks.
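
For completeness, a capture sequence along these lines should produce that log (buffer size and output path are just examples):

  lctl set_param debug=reada              # restrict the debug mask to readahead tracing
  lctl set_param debug_mb=256             # enlarge the kernel debug buffer (example size)
  lctl clear                              # empty the debug buffer before the run
  # ... run the application for a while ...
  lctl dk > /tmp/lustre-reada.debug.log   # dump the debug buffer to a file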

Comment by Shuichi Ihara (Inactive) [ 25/Sep/12 ]

Wangdi,
I collected a couple of debug logs while running the application. Please check them.
Thanks

Comment by Di Wang [ 25/Sep/12 ]

Ihara, I just posted a patch here: http://review.whamcloud.com/4097

Could you please try it? According to the debug log you posted, it seems readahead does not work well with mmap. The reason exporting over NFS works better is that NFS opens the file for every read, which refreshes the readahead state and happens to help "correct" the behaviour. I also found a lot of random 4k reads in the debug log; is this expected for this job? And I also do not understand why 2.1.3 works better here. So could you please:

1. Set the debug mask to "warning error emerg vfstrace reada console".

2. Run the test over 1.8.8 and 2.1.3.

And post the debug_log, llite.*.read_ahead_stats and osc.*.rpc_stats here. Thanks.
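
Put differently, the collection on each client under test would look roughly like this (output file names are just examples):

  lctl set_param debug="warning error emerg vfstrace reada console"
  lctl clear
  # ... run the job for the sampling period ...
  lctl dk > debug_log.txt
  lctl get_param llite.*.read_ahead_stats > read_ahead_stats.txt
  lctl get_param osc.*.rpc_stats > rpc_stats.txt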

Comment by Shuichi Ihara (Inactive) [ 26/Sep/12 ]

Wangdi, yes, there are a few input files for a single job. The application reads these input files at certain points, but it is very random read I/O, I believe.

Anyway, I tested the patch, but it did not help very much. I collected debug logs and statistics for the following versions and uploaded the files under /uploads/LU-2001/ on the FTP site. The job is very long, so for testing I killed it after the first 30 minutes. I collected debug files at 3 points: 5, 15 and 25 minutes after the job started.

centos5.8-lustre-1.8.8
centos5.8-lustre-1.8.8-patch1 (http://review.whamcloud.com/4097)
centos6.3-lustre-1.8.8
centos6.3-lustre-2.1.3

The Linux distro listed is the client side; the server side is running lustre-2.1.3 on CentOS 6.3. I also tested lustre-1.8.8 on CentOS 5.8 for the server, but there was not much difference. For the 2.1.3 vs 1.8.x testing, I used 2.1.3 on CentOS 6.3 for the server.

Please check and analyze them.

Thank you!

Comment by Di Wang [ 27/Sep/12 ]

Ihara, I did not see the 2.1.3 debug log there:
wangdi@brent:/scratch/ftp/uploads/LU-2001$ ll -h
total 56M
drwxr-xr-x  2 ftp mlocate  4.0K 2012-09-26 15:10 ./
d-wxrwx--- 36 ftp ipausers 4.0K 2012-09-26 15:09 ../
-rw-r--r--  1 ftp mlocate   20M 2012-09-26 15:09 centos5.8-lustre-1.8.8-IB-patch1.tar.gz
-rw-r--r--  1 ftp mlocate   18M 2012-09-26 15:09 centos5.8-lustre-1.8.8-IB.tar.gz
-rw-r--r--  1 ftp mlocate   19M 2012-09-26 15:10 centos6.3-lustre-1.8.8-IB.tar.gz

Could you please upload the 2.1.3 log for me if you have it? Thanks.

Comment by Shuichi Ihara (Inactive) [ 27/Sep/12 ]

Di, sorry, just uploaded centos6.3-lustre-2.1.3-IB.tar.gz

BTW, the customer environment is 1.8.x; we want a fix and speedup on 1.8.x so that we get the numbers we are seeing with NFS on lustre-1.8.x.

thank you very much!

Comment by Di Wang [ 27/Sep/12 ]

Ihara, I just updated the patch: http://review.whamcloud.com/#change,4097. Could you please try this one and collect that information for me? Thanks.

Comment by Shuichi Ihara (Inactive) [ 29/Sep/12 ]

Di, I tested the new patch (patch set 4), but still no luck; the original behaviour was better. I have uploaded the debug information and results to the same place. The application is open source, so I uploaded it as well, in case it helps with the analysis.

Thanks!

Comment by Di Wang [ 05/Oct/12 ]

Ihara, unfortunately I am busy with another urgent project these days, so I have not had time to look at your logs yet. I will be on vacation next week; my colleague will probably continue working on this problem. Could you please tell me how you ran LAST (the exact command line, please)? Thanks.

Comment by Shuichi Ihara (Inactive) [ 05/Oct/12 ]

Hi WangDi,
Let me check whether we can give you sample input files so that you can reproduce the problem.

Comment by Shuichi Ihara (Inactive) [ 13/Oct/12 ]

Hi, I uploaded all components (including the input files) to uploads/LU-2001.
If you mount Lustre at /lustre on the client, you can just extract last.tar.bz2 onto it, then copy run.sh to /lustre and run it; that is exactly what we are doing.
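
Roughly, the steps are as follows, assuming the archive and script have been downloaded from the FTP area (the download paths are placeholders):

  cd /lustre
  tar xjf /path/to/last.tar.bz2   # extract the LAST package and input files onto Lustre
  cp /path/to/run.sh /lustre/
  sh /lustre/run.sh               # starts the jobs the same way as in our tests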

Please let me know if you have any questions.

Comment by Shuichi Ihara (Inactive) [ 01/Nov/12 ]

WangDi, did you have a chance to test LAST in your environment?

Comment by Di Wang [ 03/Nov/12 ]

Hi, Ihara

Sorry, not yet. I have been quite busy with DNE since I got back from vacation. I will try to set aside some time next week. BTW, did you run this test in a local environment? Does that mean I have to set up a 30 GB Lustre environment to repeat what you have done?

Comment by Di Wang [ 05/Nov/12 ]

Hi, Ihara

Hmm, I just checked the patch-4 (centos5.8-lustre-1.8.8-IB-patch4) result. The RPC quality seems better:

osc.lustre-OST0000-osc-ffff88106a6ab000.rpc_stats=
snapshot_time: 1348974722.320932 (secs.usecs)
read RPCs in flight: 0
write RPCs in flight: 0
dio read RPCs in flight: 0
dio write RPCs in flight: 0
pending write pages: 0
pending read pages: 0

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                   46178  12  12   |          0   0   0
2:                   21499   5  18   |          0   0   0
4:                   26737   7  25   |          0   0   0
8:                   32963   9  34   |          0   0   0
16:                  41309  11  46   |          0   0   0
32:                  51106  14  60   |          0   0   0
64:                  52684  14  74   |          0   0   0
128:                 45786  12  87   |          4   0   0
256:                 45763  12 100   |       4350  99 100

Here is the original result

osc.lustre-OST0000-osc-ffff81107910cc00.rpc_stats=
snapshot_time: 1348683325.358237 (secs.usecs)
read RPCs in flight: 0
write RPCs in flight: 0
dio read RPCs in flight: 0
dio write RPCs in flight: 0
pending write pages: 0
pending read pages: 0

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                 6490484  99  99   |          0   0   0
2:                   21527   0  99   |          0   0   0
4:                    4072   0  99   |          0   0   0
8:                    3115   0  99   |          0   0   0
16:                   2570   0  99   |          0   0   0
32:                   1454   0  99   |          1   0   0
64:                    548   0  99   |          0   0   0
128:                   195   0  99   |          5   0   0
256:                 16623   0 100   |      27792  99 100

So why did you say "the original behaviour was better"? Please correct me if I missed something.

Comment by Shuichi Ihara (Inactive) [ 05/Nov/12 ]

WangDi,

Thanks for your testing. Yes, I also saw better behaviour in rpc_stats with patch 4, but the read performance dropped compared to running without the patch.

Also, how long after the job started did you collect these statistics? My understanding is that once the job starts, it reads files with mmap(), then computes something (with some read() calls) and writes out the files with write(). My statistics were collected after the mmap() phase.

Comment by Di Wang [ 05/Nov/12 ]

Oh, I did not run the test; I just looked at the results you posted here. I do not have a RHEL5 environment locally. I will see what I can do here. Thanks.

Comment by Di Wang [ 06/Nov/12 ]

Hi, Ihara

Sorry, I cannot find a node to run the test on locally; all my nodes here are RHEL6. I investigated the uploaded logs again: is centos6.3-lustre-2.1.3-IB for lustre 2.1 or lustre 2.1 + NFS? I see a lot of read, instead of mmap, in this log. Could you please check? Do you have a debug log for lustre 1.8 + NFS?

Comment by Shuichi Ihara (Inactive) [ 29/Jun/13 ]

Comparing the miss rates of lustre-2.4, lustre-1.8 and NFS over lustre-1.8.

Comment by Shuichi Ihara (Inactive) [ 29/Jun/13 ]

lustre-1.8's peak miss rates were much higher than those of lustre-2.x and NFS over lustre-1.8 while the application was running.
llap_shrink_cache_internal() shrinks memory even when the page is still active. This is related to LU-1878 (http://review.whamcloud.com/#/c/4026), but that does not solve our problem yet. We will file new patches soon.

Comment by Li Xi (Inactive) [ 29/Jun/13 ]

Here is the patch that makes the application much faster.
http://review.whamcloud.com/#/c/6826/

Comment by Li Xi (Inactive) [ 22/Jul/13 ]

Again, the following patch makes the application run much faster.
http://review.whamcloud.com/#/c/7064/

Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ]

close old tickets

Generated at Sat Feb 10 01:21:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.