[LU-10026] Client-side data compression Created: 23/Sep/17  Updated: 25/Oct/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.17.0

Type: New Feature Priority: Minor
Reporter: Andreas Dilger Assignee: Artem Blagodarenko
Resolution: Unresolved Votes: 0
Labels: patch

Attachments: BGI-Intel-QAT-Performance-2016-09.pdf, LAD2016-compression_kuhn.pdf, Lustre-client-compression-poster_ISC17.pdf, compression-lz4-chunksize.png, intel_hub_isc17.pdf, ipcc_12_17.pdf
Issue Links:
Related
is related to LU-16085 Ubuntu 22.04 sanityn test_106c: suppo... Resolved
is related to LU-16837 interop: client skip unknown componen... Resolved

 Description   

Due to the increasing gap between computational speed, network speed and storage capacity, it has become necessary to investigate data reduction techniques. Storage systems have become a significant part of the total cost of ownership due to the increasing number of storage devices, their acquisition cost and their energy consumption.

Ultimately, we are aiming for compression support in Lustre at multiple levels:

  • Client-side compression allows using the available network and storage capacity more efficiently,
  • Client hints empower applications to provide information useful for compression and
  • Adaptive compression makes it possible to choose appropriate settings depending on performance metrics and projected benefits.

Compression will be completely transparent to applications because it will be performed by the client and/or server on their behalf. However, it will be possible for users to tune Lustre's behavior to obtain the best trade-off between performance, compression ratio, and so on. With client-side compression, the single-stream performance bottleneck benefits directly from the reduced amount of data sent over the network. Initial studies have shown that a compression ratio of 1.5 can be achieved for scientific data using lz4.



 Comments   
Comment by Gerrit Updater [ 06/Dec/17 ]

Anna Fuchs (anna.fuchs@informatik.uni-hamburg.de) uploaded a new patch: https://review.whamcloud.com/30393
Subject: LU-10026: WIP: client-side compression (write path)
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bc288aebb2d07df8d931a7a0a30e188659380473

Comment by Anna Fuchs [ 06/Dec/17 ]

The patch contains a first version of application-transparent client-side compression (write path).

The data is compressed on the client, transferred compressed over
the network and decompressed on the object server side. It reaches the
ZFS backend unmodified and is stored as if it had never been touched.

The current version works with clients and object servers built
with the configure option --enable-compression. It supports only writes
to new files, not modification of existing ones.

The size of the compression pool is currently static. For test purposes, change cmp_pool_init(number_of_buffers, size_of_buffer_in_kb) in osc/osc_request.c:osc_init() and target/tgt_main.c:tgt_mod_init() to an appropriate size. Also, set ppchunk (pages per chunk) in osc_request.c:calc_chunks() to the desired value for testing. See TODO 20.
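For illustration, the test-time change could look like the fragment below; only the cmp_pool_init() prototype is taken from the description above, and the concrete numbers are arbitrary examples rather than recommendations:

/* Hypothetical test tuning inside osc_init() / tgt_mod_init():
 * cmp_pool_init(number_of_buffers, size_of_buffer_in_kb).
 * 256 buffers of 128 KiB each reserve 32 MiB at module load time. */
cmp_pool_init(256, 128);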

Approach:

Compression is introduced on the client side just before the data transfer happens.
The potential RPC is split into chunks. The chunk size is at least 1 page and at most
as large as the preset ZFS record size. In general, chunking is based on stripes.
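As a rough userspace sketch of this chunking rule (assuming 4 KiB pages and a 128 KiB ZFS record size; calc_chunk_pages() and all constants are illustrative, not taken from the patch):

#include <stdio.h>

#define PAGE_SIZE_BYTES   4096          /* assumed client page size */
#define ZFS_RECORD_SIZE   (128 * 1024)  /* assumed preset ZFS recordsize */

/* Pages per chunk: at least one page, at most one ZFS record. */
static unsigned int calc_chunk_pages(unsigned int pages_per_rpc)
{
        unsigned int max_pages = ZFS_RECORD_SIZE / PAGE_SIZE_BYTES;

        if (pages_per_rpc < 1)
                return 1;
        return pages_per_rpc < max_pages ? pages_per_rpc : max_pages;
}

int main(void)
{
        unsigned int rpc_pages = 256;   /* e.g. a 1 MiB RPC of 4 KiB pages */
        unsigned int ppchunk = calc_chunk_pages(rpc_pages);
        unsigned int nchunks = (rpc_pages + ppchunk - 1) / ppchunk;

        printf("%u pages per chunk, %u chunks for a %u-page RPC\n",
               ppchunk, nchunks, rpc_pages);
        return 0;
}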

Every chunk is compressed separately and, currently, serially. Parallel compression within
one RPC is generally possible, but is not prioritized because RPCs are already processed
in parallel. All compressed chunks are concatenated without any gaps in between and sent as one
RPC over the network. Every chunk is forced to be a separate network I/O buffer. If a chunk
could not be compressed for any non-critical reason (e.g. incompressible random data), it stays
uncompressed at the beginning, the end or between other, potentially compressed, chunks in a stripe.

In this version, the data is decompressed on the server. In a future version it
could be passed through the server layers and persisted compressed on the ZFS backend.
However, to decompress the data and deliver it to the application, one needs to know at least
which algorithm was used and, in most cases, how large the compressed data is and how large
the original was. We distinguish here between the physical size (the actual byte-exact size in
memory, compressed) and the logical size (the application view, uncompressed). This information
must not be lost after compression and is currently kept in two places. Also see the TODOs for
storing compressed data.

The first place is the chunk descriptor. There is one chunk descriptor per chunk. This
descriptor is sent as part of the RPC (together with the body, ioobject and niobuf) before
the transfer. For convenience, it currently contains more information than necessary (see the
comments in the code), but could be shrunk to fewer fields. Based on this information, every chunk
can be decompressed on the server side. It is important to note that some algorithms, such as
lz4, require the byte-exact size of the compressed data for decompression.
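Purely for illustration, a hypothetical chunk descriptor along these lines could look as follows; the actual structure and field names in the patch may differ:

#include <stdint.h>

/* Hypothetical per-chunk descriptor sent with the RPC (body, ioobject,
 * niobuf).  One such descriptor exists per chunk; it carries everything
 * the server needs to decompress that chunk. */
struct chunk_descriptor {
        uint32_t cd_algo;          /* compression algorithm, e.g. lz4 */
        uint32_t cd_logical_size;  /* uncompressed size (application view) */
        uint32_t cd_physical_size; /* byte-exact compressed size in memory;
                                    * required by lz4 for decompression */
        uint32_t cd_offset;        /* chunk offset within the RPC/stripe */
};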

If the data is to be stored compressed, that information still needs to be present.
ZFS, which supports its own compression, can already record the rounded physical size and its
own compression algorithms. This list is to be extended by a value indicating external
compression ("data has been compressed by Lustre, don't touch it"). The second place is therefore
a data header in front of the actual data, which stores the compression algorithm actually used
and the byte-exact sizes. There is one header for every compressed chunk. Uncompressed or
incompressible chunks do not have a header and are marked as not externally compressed in ZFS.
The size of this header defines the file layout for compressed data and, for compatibility
reasons, should never be changed in the future. The code for this already exists (and is included
in the patch) but is currently not needed since the data is decompressed by the server.
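Again hypothetically (the patch defines the real layout), such a fixed-size per-chunk data header might look like this:

#include <stdint.h>

/* Hypothetical fixed-size header stored in front of each compressed chunk
 * when data is persisted compressed.  Uncompressed chunks carry no header
 * and are marked as "not externally compressed" in ZFS. */
struct chunk_data_header {
        uint32_t ch_magic;            /* identifies a Lustre-compressed chunk */
        uint16_t ch_algo;             /* algorithm actually used (lz4, ...) */
        uint16_t ch_flags;            /* reserved for future use */
        uint32_t ch_compressed_len;   /* byte-exact physical size */
        uint32_t ch_uncompressed_len; /* logical size after decompression */
} __attribute__((packed));            /* 16 bytes; size must never change */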

The read path is not affected in this version and the data is read as in an unpatched build.

It is not planned to compress any data transfer between clients and MDSs.
The MDS will be involved only for configuration and further settings.

This version supports only lz4.

Technical details:

Compression forces the use of at least one additional memory buffer for compressed/decompressed data.
Algorithms like lz4 can handle only contiguous memory buffers for input and output. The data
in Lustre is accessible page-wise, and the pages are commonly not contiguous. Therefore, the original
data first has to be copied into temporary contiguous buffers, which increases memory usage
and forces additional copies. To keep memory usage minimal, this version uses one temporary
buffer per chunk (src) and a second large buffer for the output. The output (dst) buffer is
as large as the original data size (page_count pages). Also see the TODOs for page-based compression.

Compression process (client-side). Mainly in osc_request.c:compressed_osc_brw_prep_request:

1. Input data array (pga), page_count many pages.
2. Calculate chunks, from 1 page to zfs_record_size large
3. Prepare 1 src-buffer (sizeof chunk) and many dst-buffers (sizeof logical_page_count in total)
4. Copy chunk into src-buffer
5. Compress src-buffer to dst[chunk] buffer.
6. Continue composing the output data array (cpga) from dst (no additional copy!).
Skip gaps in dst so that compressed chunks are consecutive. If the chunk was compressible, write a header before the compressed data. If the chunk was incompressible, use the original pga pages, write no header and release dst[incompressible_chunk] immediately.
7. Repeat steps 4-6 for all chunks.
8. Release src-buffer.
9. Build request with cpga.
10. Transfer cpga.
11. After request successfully finished, release compressed pages, try to release uncompressed pages (see TODOs).
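A condensed userspace sketch of steps 4-6, using the liblz4 library API as a stand-in for the kernel lz4 module; compress_chunk() and everything around the LZ4 call are illustrative only:

#include <string.h>
#include <lz4.h>        /* userspace stand-in for the kernel lz4 module */

#define PAGE_SZ 4096

/* Gather the non-contiguous pages of one chunk into a contiguous src
 * buffer, compress it into this chunk's slice of dst, and report the
 * physical (compressed) size.  Returns 1 if the chunk was compressed,
 * 0 if it should stay uncompressed (caller then sends the original pages). */
static int compress_chunk(char **pages, int npages,
                          char *src, char *dst, int *phys_size)
{
        int logical = npages * PAGE_SZ;
        int i, clen;

        /* Step 4: lz4 needs contiguous input; pages usually are not. */
        for (i = 0; i < npages; i++)
                memcpy(src + i * PAGE_SZ, pages[i], PAGE_SZ);

        /* Step 5: compress into the per-chunk slice of the dst buffer. */
        clen = LZ4_compress_default(src, dst, logical, logical);

        /* Step 6: if compression failed or did not save space, keep the
         * chunk uncompressed (no header, original pages are sent). */
        if (clen <= 0 || clen >= logical) {
                *phys_size = logical;
                return 0;
        }
        *phys_size = clen;
        return 1;       /* compressed; a data header would precede it */
}

Compiled with -llz4, this mirrors the per-chunk decision described above: compressed chunks go into cpga behind a header, while incompressible chunks reuse the original pga pages.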

Transfer. Main changes in tgt_handler.c:tgt_brw_write_compressed:

1. Receive RPC.
2. Prepare remote network buffers for compressed data, recognize physical sizes
3. Prepare local network buffers for decompressed data, recognize logical sizes
4. Build bulk descriptor with local network buffers, but physical sizes
5. Receive compressed (physical size) data into local network buffers
6. Decompress data from and to local network buffers (see next section)
7. Commit decompressed data to the backend

Decompression process (server-side). Main changes in tgt_handler.c:tgt_brw_write_compressed:

1. Input data in lnb[logical_page_count] array.
2. Prepare 1 src-buffer (sizeof chunk) and many dst-buffers (sizeof logical_page_count in total)
3. If input chunk is not contiguous, copy chunk from lnb into src-buffer
4. Decompress src-buffer into dst[chunk] buffer.
5. If the chunk was not compressed, the data still needs to be copied from src to dst.
6. Repeat 3-5 for every chunk.
7. Copy complete dst to local buffers.
8. Release src and dst-buffers.
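The matching server-side step can be sketched the same way (again with the userspace liblz4 API and illustrative names; the patch implements this in kernel context in tgt_brw_write_compressed):

#include <string.h>
#include <lz4.h>

/* Decompress one received chunk into its slot in the dst buffer.
 * phys_size is the byte-exact compressed size from the chunk descriptor,
 * logical_size the expected uncompressed size.  Returns 0 on success. */
static int decompress_chunk(const char *src, int phys_size,
                            char *dst, int logical_size, int compressed)
{
        int dlen;

        /* Step 5: an uncompressed chunk is simply copied through. */
        if (!compressed) {
                memcpy(dst, src, logical_size);
                return 0;
        }

        /* Step 4: lz4 needs the exact compressed size to decompress. */
        dlen = LZ4_decompress_safe(src, dst, phys_size, logical_size);
        return dlen == logical_size ? 0 : -1;   /* corrupt or truncated data */
}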

Since memory allocation (especially contiguous) in the kernel is not only expensive, but also not
always possible, we have introduced an exclusive memory pool for compression. This module is
initialized at system start, when the osc and the target modules are inserted. At that time most of
the kernel memory is hopefully not in use and we should have the best chances to get big contiguous
memory blocks.

Buffers can be loaned from the pool only in a fixed size. If a smaller buffer is requested, the
larger fixed-size buffer is returned; a request for a buffer larger than the fixed size fails.
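As a toy userspace model of these pool semantics (the real cmp_pool is kernel code and its interfaces may differ):

#include <stdlib.h>

/* Toy model of a pool of buffers preallocated at init time,
 * all contiguous and of the same fixed size. */
struct cmp_pool {
        char   **buffers;      /* preallocated contiguous buffers */
        int      nbuffers;     /* total number of buffers */
        int      nfree;        /* how many are still available */
        size_t   buf_size;     /* fixed size of every buffer */
};

/* A request for up to buf_size bytes gets a full fixed-size buffer;
 * a request for more than buf_size fails, as described above. */
static char *cmp_pool_get(struct cmp_pool *pool, size_t wanted)
{
        if (wanted > pool->buf_size || pool->nfree == 0)
                return NULL;
        return pool->buffers[--pool->nfree];
}

static void cmp_pool_put(struct cmp_pool *pool, char *buf)
{
        pool->buffers[pool->nfree++] = buf;
}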

The current state is work in progress. There is a long list of things to be done. Some of them are
already in progress, some need to be discussed until we can proceed with the development.

TODOs:

1. Style check (checkpatch.pl)

2. The compression pool already provides the ability to loan several buffers atomically to prevent deadlocks
for large parallel requests. It needs a few fixes for releasing those buffers.

3. Discuss whether the chunk descriptor can be present even if the build was configured without compression. Otherwise, isolate everything that affects compression.

4. Add/improve error handling. Check return values on failure and prepare fallbacks. Add debugging messages.

5. The header size within the data should be discussed.

6. Refactoring and merging duplicated functions into original ones as far as possible (compressed_osc_brw_prep_request, tgt_brw_write_compressed, osd_bufs_get_compressed_write, etc.)

7. Shrink the chunk descriptor to the strictly necessary fields, since it is sent over the network.

8. Add read path.

9. Currently there are no compatibility checks for the compression feature. The current version is based on the
2.10+ state. Clients and object servers have to be built with compression enabled. The MDS is not affected by
compression in this version and might simply remain compatible with an unpatched 2.10 version. In the future, it
also has to support compression.

10. Mixing different algorithms of the same type should be possible chunk-wise to allow more flexibility for future features.

11. Currently, compression is only controlled via the --enable-compression configure option at build time.
We plan to additionally allow enabling/disabling it dynamically (lctl, lfs, /proc).

12. Expand compression feature to store the data compressed.

13. Release original pga-pages on the client. See comments in osc_request.c:osc_release_cppga

14. Memory summary on the server might not recognize all the used memory. See comments in ofd_io.c:ofd_preprw_write_compressed

15. Minor TODOs in the code comments

16. Write compression tests (build and functionality) for Gerrit

17. Support read-modify-write.

18. In the near future we will support an algorithm that can handle page arrays as input and output (pga-able) instead of contiguous buffers (cbuf-able).
Then we will not need to allocate as many buffers or make additional copies.

19. We should restrict mixing of cbuf-able and pga-able algorithms. In theory, mixing all algorithms for every chunk is possible; in practice it results
in complex code and a potentially not very smart use of resources. Maybe restrict to one type (cbuf-able or pga-able) per file.

20. Get/set correct values for
– compression threshold: max(short IO, x)
– chunk size: get from ZFS
– compression pool size: allocate MAX(32, number_of_cpus)*default_RPC_size large pool with record-size sized contiguous chunks
– for future: algorithm (certain one, quick one, best compression ratio, etc.), policy (force, dynamic, etc.) via ladvise

21. Dynamic decision making for compression usage, algorithms, strategy, etc.

Further, see the project description and updated information at http://wiki.lustre.org/Enhanced_Adaptive_Compression_in_Lustre

Comment by Gerrit Updater [ 05/Apr/19 ]

Anna Fuchs (anna.fuchs@informatik.uni-hamburg.de) uploaded a new patch: https://review.whamcloud.com/34603
Subject: LU-10026: Configuration for compression support and lz4 module backport
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3599c495bb2a430d780ecdd0459270a1c9463820

Comment by Anna Fuchs [ 05/Apr/19 ]

We have taken care of some of the todos and split up the feature into multiple patches.

The first patch of the series adds compression support to the build system and a backport of the lz4 module needed for kernel versions below 3.11.

The remaining patches will follow soon and add support for the page buffer pool, basic data structures, client-side compression and server-side handling.

Comment by Gerrit Updater [ 12/Apr/19 ]

Anna Fuchs (anna.fuchs@informatik.uni-hamburg.de) uploaded a new patch: https://review.whamcloud.com/34649
Subject: LU-10026: Page buffer pool for compression
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2446a55b697d744d7f503ce6c1ae9c17ca6d0bd2

Comment by Gerrit Updater [ 12/Apr/19 ]

Anna Fuchs (anna.fuchs@informatik.uni-hamburg.de) uploaded a new patch: https://review.whamcloud.com/34650
Subject: LU-10026: Prepare structures for compression
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9e39536b65f078d5f2dc3d2455b020451c35ea33
(duplicate of 34651)

Comment by Gerrit Updater [ 12/Apr/19 ]

Anna Fuchs (anna.fuchs@informatik.uni-hamburg.de) uploaded a new patch: https://review.whamcloud.com/34651
Subject: LU-10026: Prepare structures for compression
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8170c9d07f38e0ead8dfd0db99e380d9691aaf03

Comment by Gerrit Updater [ 06/May/19 ]

Anna Fuchs (anna.fuchs@informatik.uni-hamburg.de) uploaded a new patch: https://review.whamcloud.com/34810
Subject: LU-10026: Compress data on client with lz4, send compressed over
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c216f2445d829e776ed8db1fe25eb71ee22091bb

Comment by Gerrit Updater [ 08/May/19 ]

Anna Fuchs (anna.fuchs@informatik.uni-hamburg.de) uploaded a new patch: https://review.whamcloud.com/34831
Subject: LU-10026: Decompress data by server (previously compressed by client) and
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 59b59c6ff201119bc6eb43af5ae280b30a6a4869

Comment by Sebastien Buisson [ 13/Jun/22 ]

intel_hub_isc17.pdf is the initial description of work to be done.
ipcc_12_17.pdf is the status as of December 2017.

Comment by Andreas Dilger [ 07/Dec/22 ]

Hi Anna, Michael, Matt,
I'm wondering if any of you could share some "real" data input files for testing compression effectiveness, rather than just using e.g. the Linux or Lustre source tree for input? It doesn't need to be huge, tens of MB to single GB is enough, but small parts of it would be included directly into the Lustre source tree for regression testing, so it should be freely available/redistributable. I'm sure there are e.g. weather or CFD or other datasets that can be downloaded, I just don't know where to look for them.

Comment by Andreas Dilger [ 09/Dec/22 ]

Links from Anna with publicly available data:

https://www.earthdata.nasa.gov/eosdis/daacs
https://github.com/pangeo-data/WeatherBench

Comment by Gerrit Updater [ 19/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49170/
Subject: LU-10026 csdc: reserve layout bits for compress component
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 570d76dc3fd677357d76187ba718e339dd695788

Comment by Gerrit Updater [ 18/Aug/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51978
Subject: LU-10026 csdc: DoM pattern could be a combined value
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 12f9b0d34cd172374133b159236b8d2c8bc14652

Comment by Gerrit Updater [ 06/Sep/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51978/
Subject: LU-10026 csdc: DoM pattern could be a combined value
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bb0cc84fbed51e006bfac230dada426bfac4f500

Comment by Gerrit Updater [ 03/Oct/23 ]

"Sarah Liu <sarah@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52572
Subject: LU-10026 tests: Fix sanity test_56ab for CSDC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1ce661dd56fb4b6ecc9e909805c6101bbd9c3161

Comment by Gerrit Updater [ 25/Oct/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52572/
Subject: LU-10026 tests: Fix sanity test_56ab for CSDC
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4fc8ca3325d2eb49d8342eb943e3a17932f24517
