[LU-2032] small random read i/o performance regression Created: 26/Sep/12  Updated: 21/Sep/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Robin Humble Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

2.1.3 server with 2.1.3 and 1.8.8 clients. CentOS 6, x86_64, QDR IB


Attachments: File dk_ra_vfs.100m_1.8.8.txt.gz     File dk_ra_vfs.100m_2.1.3.txt.gz    
Issue Links:
Duplicate
duplicates LU-5561 Lustre random reads: 80% performance ... Resolved
Related
is related to LU-3321 2.x single thread/process throughput ... Resolved
is related to LU-11416 Improve readahead for random read of ... Open
Severity: 3
Rank (Obsolete): 4173

 Description   

I've been doing some 2.1.3 pre-rollout testing, and there seems to be a client-side problem with small random reads. Performance is considerably worse on 2.1.3 clients than on 1.8.8 clients: about a 35x slowdown for 4k random read I/O.

Tests use the same files on a RHEL6 x86_64 2.1.3 server (stock 2.1.3 RPMs), a QDR IB fabric, and either a single-disk or md 8+2 LUN for the OST, with all client and server VFS caches dropped between trials.
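For anyone reproducing: one common way to drop all client and server VFS caches between trials is along these lines (a sketch only; the exact commands used for these runs aren't recorded in the ticket):

lctl set_param ldlm.namespaces.*.lru_size=clear   # flush client DLM locks and the cached pages they cover
sync
echo 3 > /proc/sys/vm/drop_caches                 # drop pagecache, dentries and inodes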

Turning checksums on or off and setting the client's RPCs in flight to 8 or 32 makes little difference. I've also tried unmounting the filesystem from the 1.8.8 client between tests to make sure there's no hidden caching, but that didn't change anything.
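For reference, those client tunables can be flipped with lctl, along these lines (a sketch using the standard osc parameter names; the exact commands used aren't shown):

lctl set_param osc.*.checksums=0            # or =1 to re-enable wire checksums
lctl set_param osc.*.max_rpcs_in_flight=32  # tested with 8 (the default) and 32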

random read ->
IOR -a POSIX -C -r -F -k -e -t $i -z -b 1024m -o /mnt/yo96/rjh/blah

i/o size   client version   single disk lun   md8+2 lun
4k         1.8.8            20.70 MB/s        22.0 MB/s
4k         2.1.3            0.55 MB/s         0.6 MB/s
1M         1.8.8            87 MB/s           137 MB/s
1M         2.1.3            63 MB/s           83 MB/s

Although these numbers are for a single process, the same trend applies when the IOR is scaled up to 8 processes per node and to multiple nodes.
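For reference, a sweep like the one above can be driven with a small shell loop (a sketch; the actual wrapper script isn't shown, and the cache drop is as sketched earlier):

for i in 4k 1m; do
    # drop client and server caches here between trials
    IOR -a POSIX -C -r -F -k -e -t $i -z -b 1024m -o /mnt/yo96/rjh/blah
done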



 Comments   
Comment by Robin Humble [ 26/Sep/12 ]

It seems current master has the same problem.
/proc/fs/lustre/version is 2.3.50-g1d2bf12-CHANGED-2.6.32-279.9.1.el6-2.3.50-ish.a

i/o size   client version   single disk lun
4k         master           0.54 MB/s
Comment by Robin Humble [ 27/Sep/12 ]

RPCs, especially the pages and bytes per RPC, look pretty different between 1.8.8 and 2.1.3,
e.g. after running
IOR -a POSIX -C -r -F -k -e -t $i -z -b 100m -o /mnt/yo*/rjh/blah

osc/*/rpc_stats ->

1.8.8

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                      68  12  12   |          0   0   0
2:                      34   6  19   |          0   0   0
4:                      57  10  30   |          0   0   0
8:                      52   9  39   |          0   0   0
16:                     63  11  51   |          0   0   0
32:                     50   9  61   |          0   0   0
64:                     66  12  73   |          0   0   0
128:                    67  12  86   |          0   0   0
256:                    72  13 100   |          0   0   0

                        read                    write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                     483  91  91   |          0   0   0
1:                      39   7  98   |          0   0   0
2:                       5   0  99   |          0   0   0
3:                       2   0 100   |          0   0   0

                        read                    write
offset                rpcs   % cum % |       rpcs   % cum %
0:                     101  19  19   |          0   0   0
1:                      26   4  24   |          0   0   0
2:                      49   9  33   |          0   0   0
4:                      48   9  42   |          0   0   0
8:                      61  11  53   |          0   0   0
16:                     64  12  65   |          0   0   0
32:                     56  10  76   |          0   0   0
64:                     55  10  86   |          0   0   0
128:                    69  13 100   |          0   0   0

2.1.3

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                   25568  99  99   |          0   0   0
2:                       1   0  99   |          0   0   0
4:                       0   0  99   |          0   0   0
8:                       5   0 100   |          0   0   0

                        read                    write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                   25574 100 100   |          0   0   0
                      
                        read                    write
offset                rpcs   % cum % |       rpcs   % cum %
0:                     100   0   0   |          0   0   0
1:                     100   0   0   |          0   0   0
2:                     200   0   1   |          0   0   0
4:                     400   1   3   |          0   0   0
8:                     800   3   6   |          0   0   0
16:                   1600   6  12   |          0   0   0
32:                   3200  12  25   |          0   0   0
64:                   6383  24  49   |          0   0   0
128:                 12791  50 100   |          0   0   0
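(These histograms come from osc/*/rpc_stats; to compare runs they can be reset before each trial and read back afterwards, e.g. with lctl. A sketch; on the clients here writing any value to the file should reset the counters:

lctl set_param osc.*.rpc_stats=0   # writing to the file resets the counters
# ... run the IOR trial ...
lctl get_param osc.*.rpc_stats
)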

osc/*/import ->

1.8.8

...
    read_data_averages:
       bytes_per_rpc: 200200
       usec_per_rpc: 7673
       MB_per_sec: 26.09

2.1.3

    read_data_averages:
       bytes_per_rpc: 4100
       usec_per_rpc: 6108
       MB_per_sec: 0.67
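(Sanity check on those averages: 4100 bytes / 6108 usec ≈ 0.67 MB/s and 200200 bytes / 7673 usec ≈ 26.1 MB/s, matching the MB_per_sec lines. The per-RPC latency is similar in both cases, so the throughput gap tracks the RPC size: the 1.8.8 client averages roughly 200200/4096 ≈ 49 pages per read RPC, while 2.1.3 is down to a single 4 KB page per RPC.)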

Also, the readahead stats hit/miss ratios are pretty much reversed between 1.8.8 and 2.1.3, but that might just be accounting?
llite/*/read_ahead_stats ->

1.8.8

snapshot_time             1348738815.173086 secs.usecs
hits                      25120 samples [pages]
misses                    480 samples [pages]
readpage not consecutive  25583 samples [pages]
failed grab_cache_page    25583 samples [pages]
read but discarded        256 samples [pages]

2.1.3

snapshot_time             1348738815.198485 secs.usecs
hits                      26 samples [pages]
misses                    25574 samples [pages]
readpage not consecutive  25589 samples [pages]
miss inside window        3 samples [pages]
failed grab_cache_page    25599 samples [pages]
zero size window          1 samples [pages]
Comment by Jinshan Xiong (Inactive) [ 27/Sep/12 ]

Can you please run your test case again with D_READA enabled on the client side? This seems to be a readahead problem for sure.

Comment by Robin Humble [ 28/Sep/12 ]

Sure. These are debug logs captured with +reada +vfstrace, for a 1.8.8 client and a 2.1.3 client.
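For reference, dumps like the attached ones can be captured along these lines (a sketch; the exact capture commands aren't shown here):

lctl set_param debug=+reada
lctl set_param debug=+vfstrace
lctl clear                      # empty the kernel debug buffer before the run
# ... run the 100m IOR ...
lctl dk > dk_ra_vfs.txt         # dump the kernel debug log to a file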
