[LU-8775] ZFS OST has an extra 128 byte for each IO Created: 24/Oct/16  Updated: 28/Mar/17  Resolved: 29/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Improvement Priority: Minor
Reporter: Zhiqi Tao (Inactive) Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None
Environment:

ZFS-Lustre: lustre 2.8.57_62_g919224d, zfs-0.7.0
ldiskfs-lustre: lustre 2.7.16.7


Issue Links:
Related
Rank (Obsolete): 9223372036854775807

 Description   

It seems that, when ZFS is used as the backend file system for a Lustre OST, an additional 128-byte disk I/O is recorded for each bulk I/O.

For example, writing 1GB to a ZFS-based file system:

# dd if=/dev/zero of=File_1GB count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.869192 s, 1.2 GB/s
[root@kapollo01 ~]# lctl get_param obdfilter.edafs-OST0001.brw_stats
obdfilter.edafs-OST0001.brw_stats=
snapshot_time:         1477281605.706719 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
1:                       0   0   0   | 1024  50  50
2:                       0   0   0   |    0   0  50
4:                       0   0   0   |    0   0  50
8:                       0   0   0   |    0   0  50
16:                      0   0   0   |    0   0  50
32:                      0   0   0   |    0   0  50
64:                      0   0   0   |    0   0  50
128:                     0   0   0   |    0   0  50
256:                     0   0   0   | 1024  50 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs        % cum %
0:                       0   0   0   | 1024  50  50
1:                       0   0   0   |    0   0  50
2:                       0   0   0   |    0   0  50
3:                       0   0   0   |    0   0  50
4:                       0   0   0   |    0   0  50
5:                       0   0   0   |    0   0  50
6:                       0   0   0   |    0   0  50
7:                       0   0   0   |    0   0  50
8:                       0   0   0   |    0   0  50
9:                       0   0   0   |    0   0  50
10:                      0   0   0   |    0   0  50
11:                      0   0   0   |    0   0  50
12:                      0   0   0   |    0   0  50
13:                      0   0   0   |    0   0  50
14:                      0   0   0   |    0   0  50
15:                      0   0   0   |    0   0  50
16:                      0   0   0   |    0   0  50
17:                      0   0   0   |    0   0  50
18:                      0   0   0   |    0   0  50
19:                      0   0   0   |    0   0  50
20:                      0   0   0   |    0   0  50
21:                      0   0   0   |    0   0  50
22:                      0   0   0   |    0   0  50
23:                      0   0   0   |    0   0  50
24:                      0   0   0   |    0   0  50
25:                      0   0   0   |    0   0  50
26:                      0   0   0   |    0   0  50
27:                      0   0   0   |    0   0  50
28:                      0   0   0   |    0   0  50
29:                      0   0   0   |    0   0  50
30:                      0   0   0   |    0   0  50
31:                      0   0   0   | 1024  50 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios         % cum %
1:                       0   0   0   | 2039  99  99
2:                       0   0   0   |    9   0 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios         % cum %

                           read      |     write
disk I/O size          ios   % cum % |  ios         % cum %
128:                     0   0   0   | 1024  50  50
256:                     0   0   0   |    0   0  50
512:                     0   0   0   |    0   0  50
1K:                      0   0   0   |    0   0  50
2K:                      0   0   0   |    0   0  50
4K:                      0   0   0   |    0   0  50
8K:                      0   0   0   |    0   0  50
16K:                     0   0   0   |    0   0  50
32K:                     0   0   0   |    0   0  50
64K:                     0   0   0   |    0   0  50
128K:                    0   0   0   |    0   0  50
256K:                    0   0   0   |    0   0  50
512K:                    0   0   0   |    0   0  50
1M:                      0   0   0   | 1024  50 100

On the ldiskfs side, this additional 128-byte I/O does not exist.

[root@kcli01 ost01]# dd if=/dev/zero of=File_1GB count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 1.79642 s, 598 MB/s
[root@koss02 ~]# lctl get_param obdfilter.krakenfs-OST0001.brw_stats
obdfilter.krakenfs-OST0001.brw_stats=
snapshot_time:         1477297530.705071 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
1K:                      0   0   0   |  256 100 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs        % cum %
0:                       0   0   0   |  256 100 100

                           read      |     write
discontiguous blocks   rpcs  % cum % |  rpcs        % cum %
0:                       0   0   0   |  256 100 100

                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios         % cum %
4:                       0   0   0   |  255  99  99
5:                       0   0   0   |    0   0  99
6:                       0   0   0   |    1   0 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios         % cum %
1:                       0   0   0   |    4   0   0
2:                       0   0   0   |    5   0   0
3:                       0   0   0   |    5   0   1
4:                       0   0   0   |    5   0   1
5:                       0   0   0   |    2   0   2
6:                       0   0   0   |    2   0   2
7:                       0   0   0   |    1   0   2
8:                       0   0   0   |    1   0   2
9:                       0   0   0   |    1   0   2
10:                      0   0   0   |    1   0   2
11:                      0   0   0   |    1   0   2
12:                      0   0   0   |    1   0   2
13:                      0   0   0   |    3   0   3
14:                      0   0   0   |   11   1   4
15:                      0   0   0   |   23   2   6
16:                      0   0   0   |   44   4  10
17:                      0   0   0   |   81   7  18
18:                      0   0   0   |  118  11  30
19:                      0   0   0   |  144  14  44
20:                      0   0   0   |  133  12  57
21:                      0   0   0   |  104  10  67
22:                      0   0   0   |   72   7  74
23:                      0   0   0   |   45   4  78
24:                      0   0   0   |   40   3  82
25:                      0   0   0   |   42   4  86
26:                      0   0   0   |   42   4  90
27:                      0   0   0   |   37   3  94
28:                      0   0   0   |   29   2  97
29:                      0   0   0   |   16   1  98
30:                      0   0   0   |    8   0  99
31:                      0   0   0   |    5   0 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios         % cum %
4:                       0   0   0   |    3   1   1
8:                       0   0   0   |    0   0   1
16:                      0   0   0   |   13   5   6
32:                      0   0   0   |  172  67  73
64:                      0   0   0   |   54  21  94
128:                     0   0   0   |   14   5 100

                           read      |     write
disk I/O size          ios   % cum % |  ios         % cum %
512K:                    0   0   0   |    2   0   0
1M:                      0   0   0   | 1024  99 100
 


 Comments   
Comment by Andreas Dilger [ 24/Oct/16 ]

There is a ZFS patch that AFAIK Alex was working on to do I/O queue plugging, so that ZFS didn't submit 1MB writes to disk without giving them a chance to merge.

Comment by Gabriele Paciucci (Inactive) [ 24/Oct/16 ]

Zhiqi, I think this is the reason why we suggest 9+2 instead of 8+2?

Comment by Jinshan Xiong (Inactive) [ 24/Oct/16 ]

From what I can see in the osd_zfs code, it (wrongly) records statistics in osd_write(), which is supposed to be used to update internal data structures such as last_rcvd, llog, etc. ldiskfs only records I/O statistics from the client side.
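One way to sanity-check this from the OSS is to compare the 128-byte bucket against the bulk-write bucket in the "disk I/O size" histogram; if the small-write count tracks the bulk-write count one-for-one, the small writes are almost certainly the per-commit internal updates (last_rcvd etc.) rather than client data. A minimal sketch, assuming the brw_stats layout shown above (the OST name is just the one from this report):

lctl get_param -n obdfilter.edafs-OST0001.brw_stats | awk '
    /disk I\/O size/ { insize = 1 }                # only look at the size histogram
    insize && $1 == "128:" { small = $6 }          # write count, 128-byte bucket
    insize && $1 == "1M:"  { bulk  = $6 }          # write count, 1M bucket
    END { printf "128-byte writes: %d, 1M writes: %d\n", small+0, bulk+0 }'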

Comment by Zhiqi Tao (Inactive) [ 24/Oct/16 ]
[root@kcli01 ost01]# dd if=/dev/zero of=File_2 count=1024 bs=100K
1024+0 records in
1024+0 records out
104857600 bytes (105 MB) copied, 0.111276 s, 942 MB/s

[root@kcli01 ost01]# ls -lh File_2
-rw-r--r-- 1 root root 100M Oct 24 14:51 File_2

It appears that the 128 bytes are associated with each 1MB of committed data rather than with each individual I/O (hence the 50/50 split between the 128-byte and 1M buckets below). It's nice to see that the Lustre ZFS OSD merged the data together and committed it together.

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios         % cum %
1:                       0   0   0   |  201 100 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios         % cum %

                           read      |     write
disk I/O size          ios   % cum % |  ios         % cum %
128:                     0   0   0   |  101  50  50
256:                     0   0   0   |    0   0  50
512:                     0   0   0   |    0   0  50
1K:                      0   0   0   |    0   0  50
2K:                      0   0   0   |    0   0  50
4K:                      0   0   0   |    0   0  50
8K:                      0   0   0   |    0   0  50
16K:                     0   0   0   |    0   0  50
32K:                     0   0   0   |    0   0  50
64K:                     0   0   0   |    0   0  50
128K:                    0   0   0   |    0   0  50
256K:                    0   0   0   |    0   0  50
512K:                    0   0   0   |    0   0  50
1M:                      0   0   0   |  100  49 100
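A rough check of the totals in the histogram above (counts taken straight from the 128-byte and 1M write buckets):

# 101 entries of 128 bytes vs 100 entries of 1M: tiny in bytes, but it roughly
# doubles the number of entries in the disk I/O size histogram
echo "extra: $((101 * 128)) bytes vs data: $((100 * 1024 * 1024)) bytes"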
Comment by Zhiqi Tao (Inactive) [ 24/Oct/16 ]

Ldiskfs obviously does the same merge and commit.

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios         % cum %
4:                       0   0   0   |    1   4   4
8:                       0   0   0   |    0   0   4
16:                      0   0   0   |    0   0   4
32:                      0   0   0   |    1   4   8
64:                      0   0   0   |    1   4  12
128:                     0   0   0   |   12  48  60
256:                     0   0   0   |    2   8  68
512:                     0   0   0   |    7  28  96
1K:                      0   0   0   |    1   4 100

                           read      |     write
disk I/O size          ios   % cum % |  ios         % cum %
512K:                    0   0   0   |    4   3   3
1M:                      0   0   0   |   99  96 100
Comment by Zhiqi Tao (Inactive) [ 24/Oct/16 ]

Actually the merge happened on the client side.

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       0   0   0   |          0   0   0
2:                       0   0   0   |          0   0   0
4:                       0   0   0   |          0   0   0
8:                       0   0   0   |          0   0   0
16:                      0   0   0   |          0   0   0
32:                      0   0   0   |          0   0   0
64:                      0   0   0   |          0   0   0
128:                     0   0   0   |          0   0   0
256:                     0   0   0   |          0   0   0
512:                     0   0   0   |          0   0   0
1024:                    0   0   0   |         25 100 100
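The aggregation limit is visible on the client as the osc max_pages_per_rpc tunable (the 1024-page bucket above suggests 4MB RPCs); for example:

# how many 4K pages the client will merge into a single BRW RPC
lctl get_param osc.*.max_pages_per_rpc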
Comment by Zhiqi Tao (Inactive) [ 24/Oct/16 ]

With sync, ldiskfs OST:

[root@kcli01 ost01]# dd if=/dev/zero of=File_2 count=1024 bs=100K oflag=sync
1024+0 records in
1024+0 records out
104857600 bytes (105 MB) copied, 2.58155 s, 40.6 MB/s

[root@kcli01 ost01]# lctl get_param osc.krakenfs-OST0001*.rpc_stats
osc.krakenfs-OST0001-osc-ffff880035853800.rpc_stats=
snapshot_time:         1477343197.889076 (secs.usecs)
read RPCs in flight:  0
write RPCs in flight: 0
pending write pages:  0
pending read pages:   0

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       0   0   0   |          0   0   0
2:                       0   0   0   |          0   0   0
4:                       0   0   0   |          0   0   0
8:                       0   0   0   |          0   0   0
16:                      0   0   0   |          0   0   0
32:                      0   0   0   |       1024 100 100

                        read                    write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       0   0   0   |       1024 100 100

[root@koss02 ~]# lctl get_param obdfilter.krakenfs-OST0001.brw_stats
obdfilter.krakenfs-OST0001.brw_stats=
snapshot_time:         1477353369.955434 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
32:                      0   0   0   | 1024 100 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs        % cum %
0:                       0   0   0   | 1024 100 100

                           read      |     write
discontiguous blocks   rpcs  % cum % |  rpcs        % cum %
0:                       0   0   0   | 1024 100 100

                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios         % cum %
1:                       0   0   0   | 1024 100 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios         % cum %
1:                       0   0   0   | 1024 100 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios         % cum %
1:                       0   0   0   | 1023  99  99
2:                       0   0   0   |    0   0  99
4:                       0   0   0   |    1   0 100

                           read      |     write
disk I/O size          ios   % cum % |  ios         % cum %
128K:                    0   0   0   | 1024 100 100
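(Quick arithmetic for the sync runs: with oflag=sync each dd block becomes its own RPC, and a 100K block is 25 4KB pages, which brw_stats reports in the power-of-two 32-page and 128K buckets.)

echo "pages per 100K dd block: $((100 * 1024 / 4096))"   # -> 25, reported in the 32-page bucket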

With sync, ZFS OST:

[root@kcli01 ost01]# dd if=/dev/zero of=File_2 count=1024 bs=100K oflag=sync
1024+0 records in
1024+0 records out
104857600 bytes (105 MB) copied, 905.785 s, 116 kB/s

[root@kcli01 ost01]# lctl get_param osc.edafs-OST0001*.rpc_stats
osc.edafs-OST0001-osc-ffff880fc7dce800.rpc_stats=
snapshot_time:         1477344403.356697 (secs.usecs)
read RPCs in flight:  0
write RPCs in flight: 0
pending write pages:  0
pending read pages:   0

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       0   0   0   |          0   0   0
2:                       0   0   0   |          0   0   0
4:                       0   0   0   |          0   0   0
8:                       0   0   0   |          0   0   0
16:                      0   0   0   |          0   0   0
32:                      0   0   0   |       1024 100 100

                        read                    write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       0   0   0   |       1024 100 100

                        read                    write
offset                rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          1   0   0
1:                       0   0   0   |          0   0   0
2:                       0   0   0   |          0   0   0
4:                       0   0   0   |          0   0   0
8:                       0   0   0   |          0   0   0
16:                      0   0   0   |          1   0   0
32:                      0   0   0   |          1   0   0
64:                      0   0   0   |          3   0   0
128:                     0   0   0   |          5   0   1
256:                     0   0   0   |         10   0   2
512:                     0   0   0   |         20   1   4
1024:                    0   0   0   |         41   4   8
2048:                    0   0   0   |         82   8  16
4096:                    0   0   0   |        164  16  32
8192:                    0   0   0   |        328  32  64
16384:                   0   0   0   |        368  35 100

[root@kapollo01 ~]# lctl get_param obdfilter.edafs-OST0001.brw_stats
obdfilter.edafs-OST0001.brw_stats=
snapshot_time:         1477362848.412812 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
1:                       0   0   0   | 1025  50  50
2:                       0   0   0   |    0   0  50
4:                       0   0   0   |    0   0  50
8:                       0   0   0   |    0   0  50
16:                      0   0   0   |    0   0  50
32:                      0   0   0   | 1024  49 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs        % cum %
0:                       0   0   0   | 1025  50  50
1:                       0   0   0   |    0   0  50
2:                       0   0   0   |    0   0  50
3:                       0   0   0   |    0   0  50
4:                       0   0   0   |    0   0  50
5:                       0   0   0   |    0   0  50
6:                       0   0   0   |    0   0  50
7:                       0   0   0   |    0   0  50
8:                       0   0   0   |    0   0  50
9:                       0   0   0   |    0   0  50
10:                      0   0   0   |    0   0  50
11:                      0   0   0   |    0   0  50
12:                      0   0   0   |    0   0  50
13:                      0   0   0   |    0   0  50
14:                      0   0   0   |    0   0  50
15:                      0   0   0   |    0   0  50
16:                      0   0   0   |    0   0  50
17:                      0   0   0   |    0   0  50
18:                      0   0   0   |    0   0  50
19:                      0   0   0   |    0   0  50
20:                      0   0   0   |    0   0  50
21:                      0   0   0   |    0   0  50
22:                      0   0   0   |    0   0  50
23:                      0   0   0   |    0   0  50
24:                      0   0   0   | 1024  49 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios         % cum %
1:                       0   0   0   | 2049 100 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios         % cum %

                           read      |     write
disk I/O size          ios   % cum % |  ios         % cum %
128:                     0   0   0   | 1025  50  50
256:                     0   0   0   |    0   0  50
512:                     0   0   0   |    0   0  50
1K:                      0   0   0   |    0   0  50
2K:                      0   0   0   |    0   0  50
4K:                      0   0   0   |    0   0  50
8K:                      0   0   0   |    0   0  50
16K:                     0   0   0   |    0   0  50
32K:                     0   0   0   |    0   0  50
64K:                     0   0   0   |    0   0  50
128K:                    0   0   0   | 1024  49 100

So this 128-byte write is actually associated with each disk I/O rather than with each megabyte of data: with oflag=sync every 128KB RPC is committed on its own, and each commit brings its own 128-byte write with it.

Comment by Andreas Dilger [ 24/Oct/16 ]

I'd agree with Jinshan - brw_stats should only contain information about the BRW data read/write operations. If the administrator wants to monitor all writes going to a device then they can use iostat or similar.
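For example, device-level writes can be watched directly, independent of brw_stats (the device and pool names below are placeholders):

iostat -xm sdb 1              # all I/O hitting the backing block device
zpool iostat -v ostpool 1     # per-vdev view for a ZFS-backed OST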

Zhiqi, your latest data shows that the client is only sending 128KB RPCs, but the osd-zfs code is incorrectly accounting the extra 128-byte local write (I suspect last_rcvd) in the brw_stats when it shouldn't. I don't know if this is easily fixed or not.

Alex, could you please take a quick look?

Comment by Gerrit Updater [ 25/Oct/16 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/23363
Subject: LU-8775 osd: do not report special writes in brw stats
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a68320ec32de1a3e765cc5fa0e7610128135e019

Comment by Zhiqi Tao (Inactive) [ 28/Oct/16 ]

I tried this build and confirmed that the 128-byte writes are no longer there.

[root@kcli05 ~]# dd if=/dev/zero of=/mnt/lustre/File_2 count=100 bs=101K oflag=sync
100+0 records in
100+0 records out
10342400 bytes (10 MB) copied, 14.1616 s, 730 kB/s
[root@kcli05 ~]# lctl get_param obdfilter.lustre-OST0001.brw_stats
obdfilter.lustre-OST0001.brw_stats=
snapshot_time:         1477631083.556473 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
32:                      0   0   0   |  100 100 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs        % cum %
25:                      0   0   0   |  100 100 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios         % cum %
1:                       0   0   0   |  100 100 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios         % cum %

                           read      |     write
disk I/O size          ios   % cum % |  ios         % cum %
128K:                    0   0   0   |  100 100 100
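A quick way to confirm the same thing after any test run is to look for non-empty sub-4K buckets in the size histogram (no output means no small companion writes were recorded); this assumes the brw_stats layout shown above:

lctl get_param -n obdfilter.lustre-OST0001.brw_stats | awk '
    /disk I\/O size/ { insize = 1 }
    insize && $1 ~ /^(128|256|512|1K|2K):$/ && $6 > 0'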
Comment by Gerrit Updater [ 28/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23363/
Subject: LU-8775 osd: do not report special writes in brw stats
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ff9ac8af1e5bb009f0353e28fcea63902cff0066

Comment by Peter Jones [ 29/Oct/16 ]

Landed for 2.9
