[LU-7659] Replace KUC by more standard mechanisms Created: 13/Jan/16 Updated: 15/Dec/21 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Upstream |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Henri Doreau (Inactive) | Assignee: | Yohan Pipereau |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | cea, patch |
| Description |
|
The Kernel Userland Communication (KUC) subsystem is a Lustre-specific API for something relatively common: delivering a stream of records from the kernel to userland and transmitting feedback from userland back to the kernel. We propose to replace it with character devices. Besides being more standard, this can also increase performance significantly, since a process can read large chunks from the character device at once. Our proof of concept shows a 5-10x speedup when reading changelogs in blocks of 4k. I would like feedback and suggestions. The proposed implementation works as follows:
The implementation for the copytool has not been done yet but would work in a similar way. |
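To make the proposal concrete, here is a minimal sketch of the bulk-read loop described above. The device path /dev/changelog-MDT0000 and the idea that records are packed back to back in the buffer are assumptions for illustration, not the final interface:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK 4096

int main(void)
{
	char buf[CHUNK];
	ssize_t nread;
	int fd;

	fd = open("/dev/changelog-MDT0000", O_RDONLY); /* hypothetical path */
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	/*
	 * A single read() returns as many whole records as fit in 4k,
	 * instead of two read() calls per record as with KUC.
	 */
	while ((nread = read(fd, buf, sizeof(buf))) > 0) {
		/* parse the records packed in buf[0..nread) here */
	}

	close(fd);
	return EXIT_SUCCESS;
}
```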
| Comments |
| Comment by Bruno Faccini (Inactive) [ 13/Jan/16 ] |
|
Hello Henri, |
| Comment by Henri Doreau (Inactive) [ 13/Jan/16 ] |
|
Thanks Bruno. This thread makes record retrieval operations asynchronous and speeds up the whole thing: while userland processes a batch of records, a new one is retrieved in the background. Do you think it's overkill? |
| Comment by Henri Doreau (Inactive) [ 13/Jan/16 ] |
|
I should add that KUC can deliver ~100k records per second in my benchmarks, while we already have a robinhood setup able to consume 70k/s. This motivated my work. It would also make it easier to read records from any language. And it would look nicer. |
| Comment by Andreas Dilger [ 14/Jan/16 ] |
|
Some comments and questions:
One issue that had come up with ChangeLogs in the past was that they are single-threaded in the kernel, which limits performance during metadata operations. If we are changing the API in userspace, it might also be good to change the on-disk format to allow multiple ChangeLog files to be written in parallel. Probably not one per core (that may become too many on large MDS nodes), but maybe 4-8 or so. The records could be merge-sorted in the kernel by the helper thread at read time. |
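As a sketch of how such a merge could work, the helper thread would repeatedly pick the stream whose head record carries the smallest index. The rec_stream structure below is invented purely for illustration:

```c
#include <stddef.h>
#include <stdint.h>

struct rec_stream {
	uint64_t head_index;	/* index of the next record in this stream */
	int	 exhausted;	/* no more records in this stream */
};

/*
 * Return the stream holding the globally smallest record index,
 * or -1 when all streams are exhausted.
 */
static int pick_next(struct rec_stream *streams, size_t n)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (streams[i].exhausted)
			continue;
		if (best < 0 ||
		    streams[i].head_index < streams[best].head_index)
			best = (int)i;
	}
	return best;
}
```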
| Comment by Andreas Dilger [ 14/Jan/16 ] |
|
Also, in theory the user-kernel interface could be changed without changing the userspace API, though this may be less desirable because of the licensing. Note also that it may be possible to simply change the existing pipe interface to allow reading multiple records at once, instead of the current implementation, which does two read() calls per record (one for the header and one for the body). I think it would be possible to read up to 64KB chunks from the pipe, and it could return as many full records as fit into the buffer as a short read. |
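For illustration, a parse loop for such a bulk pipe read might look like the sketch below; struct rec_hdr is a placeholder layout, not the actual KUC header:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct rec_hdr {
	uint32_t len;	/* total record length, header included */
	uint32_t type;
};

static void parse_chunk(const char *buf, size_t nread)
{
	size_t off = 0;

	while (off + sizeof(struct rec_hdr) <= nread) {
		struct rec_hdr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || off + hdr.len > nread)
			break;	/* a short read never splits a record */
		/* handle the record body at buf + off + sizeof(hdr) here */
		off += hdr.len;
	}
}
```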
| Comment by Robert Read (Inactive) [ 14/Jan/16 ] |
|
The inotify API could be a good model for this, particularly providing a file descriptor that can be used with select or poll. |
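For example, an inotify-style consumer would block in poll() and then drain a chunk; this sketch assumes the changelog fd supports poll() the way an inotify fd does:

```c
#include <poll.h>
#include <unistd.h>

static ssize_t wait_and_read(int fd, char *buf, size_t len)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	/* Block until records are available, then drain one chunk. */
	if (poll(&pfd, 1, -1) <= 0)
		return -1;
	return read(fd, buf, len);
}
```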
| Comment by Gerrit Updater [ 14/Mar/16 ] |
|
Henri Doreau (henri.doreau@cea.fr) uploaded a new patch: http://review.whamcloud.com/18900 |
| Comment by Andreas Dilger [ 28/Mar/16 ] |
|
I was thinking that this might also be more useful to expose via .lustre/changelog/MDTxxxx rather than /dev/XXX, so that it is easily accessed by applications when multiple filesystems are mounted on the same node.

Also, very similar to this would be exposing (probably only on the server?) the virtual "all objects" iterator for each target under similar .lustre/iterator/MDTxxxx and .lustre/iterator/OSTxxxx virtual files (or similar). The MDTxxxx iterators are useful for listing all inodes in order so that they can be processed efficiently for initial RBH scans of all files. The OSTxxxx iterators might be useful for e.g. migrating objects off OSTs, replicating file data, and other operations that touch every object on an online OST, but could be implemented separately as needed. The caveat is that these would only be easily accessible while the OST is online, unless they were handled virtually by traversing the MDT layouts when the OST is offline, which would not be nearly as efficient. |
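To illustrate the suggested layout, an application could build the per-MDT path from the mount point; the .lustre/changelog/MDTxxxx naming here is only the suggestion above, not a landed interface:

```c
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>

static int open_changelog(const char *mntpt, const char *mdt)
{
	char path[PATH_MAX];

	/* e.g. /mnt/lustre/.lustre/changelog/MDT0000 (hypothetical) */
	snprintf(path, sizeof(path), "%s/.lustre/changelog/%s", mntpt, mdt);
	return open(path, O_RDONLY);
}
```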
| Comment by James A Simmons [ 25/Apr/16 ] |
|
As I explore netlink, I wonder if that API could be used for this? In my research I discovered it being used by the SCSI layer, which surprised me. |
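For reference, a netlink consumer starts from an AF_NETLINK socket along these lines; NETLINK_GENERIC is used only as a placeholder family here, since no Lustre netlink protocol existed at the time of this comment:

```c
#include <linux/netlink.h>
#include <sys/socket.h>
#include <unistd.h>

static int open_netlink(void)
{
	struct sockaddr_nl addr = { .nl_family = AF_NETLINK };
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);

	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```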
| Comment by Gerrit Updater [ 19/May/16 ] |
|
Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20327 |
| Comment by Gerrit Updater [ 30/May/16 ] |
|
Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20501 |
| Comment by Gerrit Updater [ 30/May/16 ] |
|
Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20502 |
| Comment by Henri Doreau (Inactive) [ 05/Mar/17 ] |
|
Andreas, I realize that I have not answered your questions; sorry for that, see below.
Yes, multiple processes can open the char device. By default they start reading from the beginning of the llog, and they can lseek to wherever they want in the log to start at a given record (see the sketch after these answers). Very similar to the existing implementation in this sense.
Done.
It is a misc char device.
Done, using the record number as the offset to jump to.
Done.
I'd love that. It is beyond the scope of this patch, I'd say, but I'll keep it in mind. Maybe indexes instead of llog catalogs? |
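A minimal sketch of the seek-to-record behaviour described in the answers above, assuming the record number is accepted directly as the lseek offset; the device path is left to the caller since the naming is not specified here:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

static int start_at_record(const char *dev, off_t rec_no)
{
	int fd = open(dev, O_RDONLY);

	if (fd < 0)
		return -1;
	/* Offsets are record numbers, not byte offsets. */
	if (lseek(fd, rec_no, SEEK_SET) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```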
| Comment by Andreas Dilger [ 06/Mar/17 ] |
|
Yes, we've discussed changing llogs over to use an index instead of a flat file. The benefit of the llog file is that it can be written mostly sequentially, and record cancellation only needs to update the bitmap in the header. The drawbacks are that updating the header is serialized, reserving space in the llog file is difficult if the record size is unknown, and there is added complexity when the order of the log records does not match the order in which transactions are completed. On a related note, did you look into connecting the LFSCK iterator to the new char interface to speed up the initial scanning for RBH? |
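As a toy illustration of the cancellation model described above (a flat record file plus a header bitmap, so cancelling a record only flips a bit and never rewrites record bodies); the header layout is invented for illustration:

```c
#include <stdint.h>

#define TOY_LLOG_BITMAP_WORDS 2048	/* invented size for illustration */

struct toy_llog_hdr {
	uint32_t bitmap[TOY_LLOG_BITMAP_WORDS];
};

static void cancel_record(struct toy_llog_hdr *hdr, unsigned int idx)
{
	/* Clearing the record's bit marks it cancelled in place. */
	hdr->bitmap[idx / 32] &= ~(1U << (idx % 32));
}
```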
| Comment by Gerrit Updater [ 06/Apr/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/18900/ |
| Comment by Peter Jones [ 09/Apr/17 ] |
|
Landed for 2.10 |
| Comment by Gerrit Updater [ 06/Aug/18 ] |
|
Yohan Pipereau (yohan.pipereau.ocre@cea.fr) uploaded a new patch: https://review.whamcloud.com/32941 |
| Comment by Gerrit Updater [ 14/Feb/19 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34258 |