[LU-8775] ZFS OST has an extra 128 bytes for each IO Created: 24/Oct/16 Updated: 28/Mar/17 Resolved: 29/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Zhiqi Tao (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
ZFS-Lustre: lustre 2.8.57_62_g919224d, zfs-0.7.0 |
||
| Description |
|
It seems that, when ZFS is used as the backend file system for a Lustre OST, there is an additional 128-byte write for each IO. For example, writing 1GB to a ZFS-based file system:

# dd if=/dev/zero of=File_1GB count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.869192 s, 1.2 GB/s
[root@kapollo01 ~]# lctl get_param obdfilter.edafs-OST0001.brw_stats
obdfilter.edafs-OST0001.brw_stats=
snapshot_time: 1477281605.706719 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
1: 0 0 0 | 1024 50 50
2: 0 0 0 | 0 0 50
4: 0 0 0 | 0 0 50
8: 0 0 0 | 0 0 50
16: 0 0 0 | 0 0 50
32: 0 0 0 | 0 0 50
64: 0 0 0 | 0 0 50
128: 0 0 0 | 0 0 50
256: 0 0 0 | 1024 50 100
read | write
discontiguous pages rpcs % cum % | rpcs % cum %
0: 0 0 0 | 1024 50 50
1: 0 0 0 | 0 0 50
2: 0 0 0 | 0 0 50
3: 0 0 0 | 0 0 50
4: 0 0 0 | 0 0 50
5: 0 0 0 | 0 0 50
6: 0 0 0 | 0 0 50
7: 0 0 0 | 0 0 50
8: 0 0 0 | 0 0 50
9: 0 0 0 | 0 0 50
10: 0 0 0 | 0 0 50
11: 0 0 0 | 0 0 50
12: 0 0 0 | 0 0 50
13: 0 0 0 | 0 0 50
14: 0 0 0 | 0 0 50
15: 0 0 0 | 0 0 50
16: 0 0 0 | 0 0 50
17: 0 0 0 | 0 0 50
18: 0 0 0 | 0 0 50
19: 0 0 0 | 0 0 50
20: 0 0 0 | 0 0 50
21: 0 0 0 | 0 0 50
22: 0 0 0 | 0 0 50
23: 0 0 0 | 0 0 50
24: 0 0 0 | 0 0 50
25: 0 0 0 | 0 0 50
26: 0 0 0 | 0 0 50
27: 0 0 0 | 0 0 50
28: 0 0 0 | 0 0 50
29: 0 0 0 | 0 0 50
30: 0 0 0 | 0 0 50
31: 0 0 0 | 1024 50 100
read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 0 0 0 | 2039 99 99
2: 0 0 0 | 9 0 100
read | write
I/O time (1/1000s) ios % cum % | ios % cum %
read | write
disk I/O size ios % cum % | ios % cum %
128: 0 0 0 | 1024 50 50
256: 0 0 0 | 0 0 50
512: 0 0 0 | 0 0 50
1K: 0 0 0 | 0 0 50
2K: 0 0 0 | 0 0 50
4K: 0 0 0 | 0 0 50
8K: 0 0 0 | 0 0 50
16K: 0 0 0 | 0 0 50
32K: 0 0 0 | 0 0 50
64K: 0 0 0 | 0 0 50
128K: 0 0 0 | 0 0 50
256K: 0 0 0 | 0 0 50
512K: 0 0 0 | 0 0 50
1M: 0 0 0 | 1024 50 100
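Reading the "disk I/O size" histogram above: the 1GB file arrives as 1024 bulk RPCs of 256 pages (1MB each), yet the disk sees 2048 writes, 1024 of 1MB and 1024 of 128 bytes. That is one extra 128-byte write per 1MB bulk write, i.e. 1024 x 128B = 128KB of additional traffic for the whole file, negligible by volume but doubling the apparent I/O count.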
On the ldiskfs side, by contrast, the additional 128-byte IO does not exist:

[root@kcli01 ost01]# dd if=/dev/zero of=File_1GB count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 1.79642 s, 598 MB/s
[root@koss02 ~]# lctl get_param obdfilter.krakenfs-OST0001.brw_stats
obdfilter.krakenfs-OST0001.brw_stats=
snapshot_time: 1477297530.705071 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
1K: 0 0 0 | 256 100 100
read | write
discontiguous pages rpcs % cum % | rpcs % cum %
0: 0 0 0 | 256 100 100
read | write
discontiguous blocks rpcs % cum % | rpcs % cum %
0: 0 0 0 | 256 100 100
read | write
disk fragmented I/Os ios % cum % | ios % cum %
4: 0 0 0 | 255 99 99
5: 0 0 0 | 0 0 99
6: 0 0 0 | 1 0 100
read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 0 0 0 | 4 0 0
2: 0 0 0 | 5 0 0
3: 0 0 0 | 5 0 1
4: 0 0 0 | 5 0 1
5: 0 0 0 | 2 0 2
6: 0 0 0 | 2 0 2
7: 0 0 0 | 1 0 2
8: 0 0 0 | 1 0 2
9: 0 0 0 | 1 0 2
10: 0 0 0 | 1 0 2
11: 0 0 0 | 1 0 2
12: 0 0 0 | 1 0 2
13: 0 0 0 | 3 0 3
14: 0 0 0 | 11 1 4
15: 0 0 0 | 23 2 6
16: 0 0 0 | 44 4 10
17: 0 0 0 | 81 7 18
18: 0 0 0 | 118 11 30
19: 0 0 0 | 144 14 44
20: 0 0 0 | 133 12 57
21: 0 0 0 | 104 10 67
22: 0 0 0 | 72 7 74
23: 0 0 0 | 45 4 78
24: 0 0 0 | 40 3 82
25: 0 0 0 | 42 4 86
26: 0 0 0 | 42 4 90
27: 0 0 0 | 37 3 94
28: 0 0 0 | 29 2 97
29: 0 0 0 | 16 1 98
30: 0 0 0 | 8 0 99
31: 0 0 0 | 5 0 100
read | write
I/O time (1/1000s) ios % cum % | ios % cum %
4: 0 0 0 | 3 1 1
8: 0 0 0 | 0 0 1
16: 0 0 0 | 13 5 6
32: 0 0 0 | 172 67 73
64: 0 0 0 | 54 21 94
128: 0 0 0 | 14 5 100
read | write
disk I/O size ios % cum % | ios % cum %
512K: 0 0 0 | 2 0 0
1M: 0 0 0 | 1024 99 100
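Here the same 1GB arrives as 256 bulk RPCs of 1024 pages (4MB each) and is submitted as 1MB disk I/Os (1024 of them, plus two of 512KB), with no 128-byte companions.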
|
| Comments |
| Comment by Andreas Dilger [ 24/Oct/16 ] |
|
There is a ZFS patch that AFAIK Alex was working on to do I/O queue plugging, so that ZFS didn't submit 1MB writes to disk without giving them a chance to merge. |
| Comment by Gabriele Paciucci (Inactive) [ 24/Oct/16 ] |
|
Zhiqi, I think this is the reason why we suggest 9+2 instead of 8+2? |
| Comment by Jinshan Xiong (Inactive) [ 24/Oct/16 ] |
|
From what I can see in the osd-zfs code, it (wrongly) records statistics in osd_write(), which is supposed to be used to update internal data structures such as last_rcvd, llog, etc. ldiskfs only records I/O statistics from the client side.
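To make the accounting problem concrete, here is a toy user-space model. This is not Lustre code: record_io() is a hypothetical stand-in for the brw_stats "disk I/O size" hook. If the hook fires both for the client bulk data and for the internal 128-byte last_rcvd update made per transaction, the histogram comes out 50/50 at 128 bytes and 1MB, exactly as in the report above.

#include <stdio.h>

#define BUCKETS 14                      /* 128B, 256B, ..., 1M */

static unsigned long write_hist[BUCKETS];

/* toy stand-in for the brw_stats "disk I/O size" accounting hook */
static void record_io(unsigned long bytes)
{
    unsigned long sz = 128;
    int b = 0;

    while (sz < bytes && b < BUCKETS - 1) {
        sz <<= 1;
        b++;
    }
    write_hist[b]++;
}

int main(void)
{
    int i;

    /* 1GB written as 1024 x 1MB client bulk writes */
    for (i = 0; i < 1024; i++) {
        record_io(1024 * 1024);  /* the bulk data itself */
        record_io(128);          /* the bug: the server's own 128-byte
                                  * last_rcvd update is accounted too */
    }
    printf("128-byte IOs: %lu, 1MB IOs: %lu\n",
           write_hist[0], write_hist[BUCKETS - 1]);
    return 0;
}
|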
| Comment by Zhiqi Tao (Inactive) [ 24/Oct/16 ] |
[root@kcli01 ost01]# dd if=/dev/zero of=File_2 count=1024 bs=100K
1024+0 records in
1024+0 records out
104857600 bytes (105 MB) copied, 0.111276 s, 942 MB/s
[root@kcli01 ost01]# ls -lh File_2
-rw-r--r-- 1 root root 100M Oct 24 14:51 File_2

It appears that the 128 bytes are associated with each 1MB instead of with individual IOs. It's nice to see that Lustre/ZFS merged the data together and committed it together.
read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 0 0 0 | 201 100 100
read | write
I/O time (1/1000s) ios % cum % | ios % cum %
read | write
disk I/O size ios % cum % | ios % cum %
128: 0 0 0 | 101 50 50
256: 0 0 0 | 0 0 50
512: 0 0 0 | 0 0 50
1K: 0 0 0 | 0 0 50
2K: 0 0 0 | 0 0 50
4K: 0 0 0 | 0 0 50
8K: 0 0 0 | 0 0 50
16K: 0 0 0 | 0 0 50
32K: 0 0 0 | 0 0 50
64K: 0 0 0 | 0 0 50
128K: 0 0 0 | 0 0 50
256K: 0 0 0 | 0 0 50
512K: 0 0 0 | 0 0 50
1M: 0 0 0 | 100 49 100
|
| Comment by Zhiqi Tao (Inactive) [ 24/Oct/16 ] |
|
Ldiskfs obviously does the same merge and commit:

read | write
I/O time (1/1000s) ios % cum % | ios % cum %
4: 0 0 0 | 1 4 4
8: 0 0 0 | 0 0 4
16: 0 0 0 | 0 0 4
32: 0 0 0 | 1 4 8
64: 0 0 0 | 1 4 12
128: 0 0 0 | 12 48 60
256: 0 0 0 | 2 8 68
512: 0 0 0 | 7 28 96
1K: 0 0 0 | 1 4 100
read | write
disk I/O size ios % cum % | ios % cum %
512K: 0 0 0 | 4 3 3
1M: 0 0 0 | 99 96 100
|
| Comment by Zhiqi Tao (Inactive) [ 24/Oct/16 ] |
|
Actually the merge happened on the client side:

read write
pages per rpc rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 0 0 0
64: 0 0 0 | 0 0 0
128: 0 0 0 | 0 0 0
256: 0 0 0 | 0 0 0
512: 0 0 0 | 0 0 0
1024: 0 0 0 | 25 100 100
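With 4KB pages, 1024 pages per RPC is 4MB, so the client merged the 1024 x 100KB writes into 25 RPCs of 4MB each (25 x 4MB = 100MB).
|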
| Comment by Zhiqi Tao (Inactive) [ 24/Oct/16 ] |
|
With sync, on the ldiskfs OST:

[root@kcli01 ost01]# dd if=/dev/zero of=File_2 count=1024 bs=100K oflag=sync
1024+0 records in
1024+0 records out
104857600 bytes (105 MB) copied, 2.58155 s, 40.6 MB/s
[root@kcli01 ost01]# lctl get_param osc.krakenfs-OST0001*.rpc_stats
osc.krakenfs-OST0001-osc-ffff880035853800.rpc_stats=
snapshot_time: 1477343197.889076 (secs.usecs)
read RPCs in flight: 0
write RPCs in flight: 0
pending write pages: 0
pending read pages: 0
read write
pages per rpc rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 1024 100 100
read write
rpcs in flight rpcs % cum % | rpcs % cum %
0: 0 0 0 | 0 0 0
1: 0 0 0 | 1024 100 100
[root@koss02 ~]# lctl get_param obdfilter.krakenfs-OST0001.brw_stats
obdfilter.krakenfs-OST0001.brw_stats=
snapshot_time: 1477353369.955434 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
32: 0 0 0 | 1024 100 100
read | write
discontiguous pages rpcs % cum % | rpcs % cum %
0: 0 0 0 | 1024 100 100
read | write
discontiguous blocks rpcs % cum % | rpcs % cum %
0: 0 0 0 | 1024 100 100
read | write
disk fragmented I/Os ios % cum % | ios % cum %
1: 0 0 0 | 1024 100 100
read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 0 0 0 | 1024 100 100
read | write
I/O time (1/1000s) ios % cum % | ios % cum %
1: 0 0 0 | 1023 99 99
2: 0 0 0 | 0 0 99
4: 0 0 0 | 1 0 100
read | write
disk I/O size ios % cum % | ios % cum %
128K: 0 0 0 | 1024 100 100
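Each 100KB sync write is 25 pages, which lands in the power-of-two 32-page bucket; with oflag=sync every write becomes its own RPC and its own single disk I/O in the 128K bucket, hence 1024 entries in every row above.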
With sync, on the ZFS OST:

[root@kcli01 ost01]# dd if=/dev/zero of=File_2 count=1024 bs=100K oflag=sync
1024+0 records in
1024+0 records out
104857600 bytes (105 MB) copied, 905.785 s, 116 kB/s
[root@kcli01 ost01]# lctl get_param osc.edafs-OST0001*.rpc_stats
osc.edafs-OST0001-osc-ffff880fc7dce800.rpc_stats=
snapshot_time: 1477344403.356697 (secs.usecs)
read RPCs in flight: 0
write RPCs in flight: 0
pending write pages: 0
pending read pages: 0
read write
pages per rpc rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 1024 100 100
read write
rpcs in flight rpcs % cum % | rpcs % cum %
0: 0 0 0 | 0 0 0
1: 0 0 0 | 1024 100 100
read write
offset rpcs % cum % | rpcs % cum %
0: 0 0 0 | 1 0 0
1: 0 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 1 0 0
32: 0 0 0 | 1 0 0
64: 0 0 0 | 3 0 0
128: 0 0 0 | 5 0 1
256: 0 0 0 | 10 0 2
512: 0 0 0 | 20 1 4
1024: 0 0 0 | 41 4 8
2048: 0 0 0 | 82 8 16
4096: 0 0 0 | 164 16 32
8192: 0 0 0 | 328 32 64
16384: 0 0 0 | 368 35 100
[root@kapollo01 ~]# lctl get_param obdfilter.edafs-OST0001.brw_stats
obdfilter.edafs-OST0001.brw_stats=
snapshot_time: 1477362848.412812 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
1: 0 0 0 | 1025 50 50
2: 0 0 0 | 0 0 50
4: 0 0 0 | 0 0 50
8: 0 0 0 | 0 0 50
16: 0 0 0 | 0 0 50
32: 0 0 0 | 1024 49 100
read | write
discontiguous pages rpcs % cum % | rpcs % cum %
0: 0 0 0 | 1025 50 50
1: 0 0 0 | 0 0 50
2: 0 0 0 | 0 0 50
3: 0 0 0 | 0 0 50
4: 0 0 0 | 0 0 50
5: 0 0 0 | 0 0 50
6: 0 0 0 | 0 0 50
7: 0 0 0 | 0 0 50
8: 0 0 0 | 0 0 50
9: 0 0 0 | 0 0 50
10: 0 0 0 | 0 0 50
11: 0 0 0 | 0 0 50
12: 0 0 0 | 0 0 50
13: 0 0 0 | 0 0 50
14: 0 0 0 | 0 0 50
15: 0 0 0 | 0 0 50
16: 0 0 0 | 0 0 50
17: 0 0 0 | 0 0 50
18: 0 0 0 | 0 0 50
19: 0 0 0 | 0 0 50
20: 0 0 0 | 0 0 50
21: 0 0 0 | 0 0 50
22: 0 0 0 | 0 0 50
23: 0 0 0 | 0 0 50
24: 0 0 0 | 1024 49 100
read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 0 0 0 | 2049 100 100
read | write
I/O time (1/1000s) ios % cum % | ios % cum %
read | write
disk I/O size ios % cum % | ios % cum %
128: 0 0 0 | 1025 50 50
256: 0 0 0 | 0 0 50
512: 0 0 0 | 0 0 50
1K: 0 0 0 | 0 0 50
2K: 0 0 0 | 0 0 50
4K: 0 0 0 | 0 0 50
8K: 0 0 0 | 0 0 50
16K: 0 0 0 | 0 0 50
32K: 0 0 0 | 0 0 50
64K: 0 0 0 | 0 0 50
128K: 0 0 0 | 1024 49 100
So this 128-byte write is actually associated with each disk IO. |
| Comment by Andreas Dilger [ 24/Oct/16 ] |
|
I'd agree with Jinshan - brw_stats should only contain information about the BRW data read/write operations. If the administrator wants to monitor all writes going to a device then they can use iostat or similar. Zhiqi, your latest data shows that the client is only sending 128KB RPCs, but the osd-zfs code is incorrectly accounting the extra 128-byte local write (I suspect last_rcvd) in the brw_stats when it shouldn't. I don't know if this is easily fixed or not. Alex, could you please take a quick look? |
| Comment by Gerrit Updater [ 25/Oct/16 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/23363 |
| Comment by Zhiqi Tao (Inactive) [ 28/Oct/16 ] |
|
I tried this build and can confirm that the 128-byte entries are no longer there:

[root@kcli05 ~]# dd if=/dev/zero of=/mnt/lustre/File_2 count=100 bs=101K oflag=sync
100+0 records in
100+0 records out
10342400 bytes (10 MB) copied, 14.1616 s, 730 kB/s
[root@kcli05 ~]# lctl get_param obdfilter.lustre-OST0001.brw_stats
obdfilter.lustre-OST0001.brw_stats=
snapshot_time: 1477631083.556473 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
32: 0 0 0 | 100 100 100
read | write
discontiguous pages rpcs % cum % | rpcs % cum %
25: 0 0 0 | 100 100 100
read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 0 0 0 | 100 100 100
read | write
I/O time (1/1000s) ios % cum % | ios % cum %
read | write
disk I/O size ios % cum % | ios % cum %
128K: 0 0 0 | 100 100 100
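With the patch, each 101KB sync write now shows up exactly once: a 26-page RPC in the 32-page bucket, submitted as a single disk I/O in the 128K bucket, and the 128-byte entries are gone.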
|
| Comment by Gerrit Updater [ 28/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23363/ |
| Comment by Peter Jones [ 29/Oct/16 ] |
|
Landed for 2.9 |