[LU-7659] Replace KUC by more standard mechanisms Created: 13/Jan/16  Updated: 15/Dec/21

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Upstream

Type: Improvement Priority: Minor
Reporter: Henri Doreau (Inactive) Assignee: Yohan Pipereau
Resolution: Unresolved Votes: 0
Labels: cea, patch

Issue Links:
Related
is related to LU-15373 changelog improvements tracking Open
is related to LU-10968 add coordinator bypass upcalls for HS... Reopened
is related to LU-12506 Client unable to mount filesystem wit... Resolved
is related to LU-11626 mdc: obd might go away while referenc... Resolved
is related to LU-9680 Improve the user land to kernel space... In Progress
is related to LU-10141 Integer overflow in llapi_changelog_s... Resolved

 Description   

The Kernel Userland Communication (KUC) subsystem is a Lustre-specific API for something relatively common (deliver a stream of records from kernel to userland, transmit feedback from userland to kernel). We propose to replace it with character devices.

Besides being more standard, it can also increase performance significantly. A process can read large chunks from the character device. Our proof of concept shows a 5-10x speedup when reading changelogs in blocks of 4k.

I would like feedback and suggestions. The proposed implementation works as follows:

  • Register a misc char device at mdc_setup (e.g. /dev/changelog-lustre0-MDT0000). The minor number is associated with the corresponding OBD.
  • The .open handler starts a background thread that iterates over the llog and enqueues up to X records into a ring buffer
  • The .read dequeues records from the ring buffer. We can make it blocking or not.
  • .release stops the background thread and releases resources
  • changelog clear is not yet implemented. It can either be a .write or a .unlocked_ioctl handler. Which would be preferable?
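
The open/read/release steps listed above can be sketched in userland Python as a bounded queue filled by a background thread. This only illustrates the producer/consumer pattern and is not the kernel code from the patch; the class name `ChangelogStream`, the `ring_size` parameter and the `None` end-of-log marker are all assumptions made for the example.

```python
import queue
import threading


class ChangelogStream:
    """Hypothetical userland model of the proposed open/read/release flow."""

    def __init__(self, llog_records, ring_size=4):
        # "open": remember the source and start the background filler thread.
        self._source = list(llog_records)
        self._ring = queue.Queue(maxsize=ring_size)  # holds at most ring_size records
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._fill, daemon=True)
        self._thread.start()

    def _put(self, item):
        # Retry while the ring is full, but give up as soon as release() is called.
        while not self._stop.is_set():
            try:
                self._ring.put(item, timeout=0.05)
                return True
            except queue.Full:
                pass
        return False

    def _fill(self):
        for rec in self._source:
            if not self._put(rec):
                return
        self._put(None)  # end-of-log marker (an assumption of this sketch)

    def read(self, block=True):
        # Blocking by default; with block=False raises queue.Empty if nothing is queued.
        return self._ring.get(block=block)

    def release(self):
        # "release": stop the background thread and wait for it to exit.
        self._stop.set()
        self._thread.join()
```

A consumer would call `read()` in a loop until it gets the end-of-log marker, then call `release()`. The bounded queue is what lets the producer fetch the next batch while the consumer is still processing the previous one.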

The implementation for the copytool has not been done yet but would work in a similar way.



 Comments   
Comment by Bruno Faccini (Inactive) [ 13/Jan/16 ]

Hello Henri,
This looks really promising, and also a much better fit than KUC for the growing volume of ChangeLogs and the bandwidth they require.
Can you detail the need for the background thread being started during open?

Comment by Henri Doreau (Inactive) [ 13/Jan/16 ]

Thanks Bruno,

This thread makes record retrieval asynchronous and speeds up the whole thing. While userland processes a batch of records, a new one gets retrieved in the background. Do you think it's overkill?

Comment by Henri Doreau (Inactive) [ 13/Jan/16 ]

I should add that KUC can deliver ~100k records per second in my benchmarks, though we have a Robinhood setup able to consume 70k/s. This motivated my work. It would also make it easier to read records from any language. And would look nicer.

Comment by Andreas Dilger [ 14/Jan/16 ]

Some comments and questions:

  • would this new mechanism be able to handle multiple ChangeLog consumers?
  • my preference would be to use read and write for the interface, instead of ioctl, since this can be used even from scripts
  • I would have suggested a /proc file instead of a char device, but new /proc files are frowned upon, and /sys files are only one value per file. The (minor) issue with a char device is the registration of the char major/minor, but it could use a misc char device?
  • the .llseek() operation should allow seeking to a specific record, so that if there are multiple consumers and old records are not yet cancelled the new records can be found easily
  • the char device should also have a .poll() method so that userspace can wait for new records efficiently instead of busy looping

One issue that had come up with ChangeLogs in the past was that they are single-threaded in the kernel, which limits performance during metadata operations. If we are changing the API in userspace, it might also be good to change the on-disk format to allow multiple ChangeLog files to be written in parallel. Probably not one per core (that may become too many on large MDS nodes), but maybe 4-8 or so. The records could be merge sorted in the kernel by the helper thread at read time.
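
The read-time merge suggested above can be illustrated with a k-way merge. This is a minimal sketch, assuming each record carries a globally increasing index and that each of the parallel ChangeLog files is already sorted by that index; the `(index, payload)` tuple layout and the function name `merge_changelogs` are hypothetical.

```python
import heapq


def merge_changelogs(streams):
    """Yield (index, payload) records from several index-sorted streams in global order."""
    # heapq.merge is lazy: it keeps only one pending record per stream in memory.
    return heapq.merge(*streams, key=lambda rec: rec[0])
```

For example, merging `[(1, "a"), (4, "d")]` with `[(2, "b"), (3, "c")]` yields the four records in index order.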

Comment by Andreas Dilger [ 14/Jan/16 ]

Also, in theory the user-kernel interface could be changed without changing the userspace API, though this may be less desirable because of the licensing.

Also note that it may be possible to just change the existing pipe interface to allow reading multiple records at once, instead of the current implementation that does two read() calls per record (one for the header and one for the body). It would be possible to read up to 64KB chunks from the pipe I think, and it could get as many full records as fit into the buffer and return a short read.
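
The "as many full records as fit, then a short read" idea can be sketched as follows. This is an illustrative model only, assuming records are already serialized as byte strings waiting in a queue; the function name `fill_read_buffer` and its arguments are hypothetical and do not correspond to any existing Lustre function.

```python
from collections import deque


def fill_read_buffer(pending, bufsize):
    """Pop whole records from `pending` until the next one would overflow `bufsize`.

    Returns the concatenated bytes, which may be shorter than bufsize (a short
    read) and never end with a partial record.
    """
    out = bytearray()
    while pending and len(out) + len(pending[0]) <= bufsize:
        out += pending.popleft()
    return bytes(out)
```

In this sketch, a record larger than the buffer yields an empty result, so the caller would have to retry with a larger buffer.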

Comment by Robert Read (Inactive) [ 14/Jan/16 ]

The inotify API could be a good model for this, particularly providing a file descriptor that can be used with select or poll.
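
From the consumer side, waiting on a pollable file descriptor could look roughly like the sketch below. A pipe stands in for the changelog descriptor, since the real device path and record format are not specified here; the helper name `wait_for_records` is hypothetical.

```python
import os
import selectors


def wait_for_records(fd, timeout, bufsize=4096):
    """Block up to `timeout` seconds for data on fd; return bytes, or None on timeout."""
    with selectors.DefaultSelector() as sel:
        sel.register(fd, selectors.EVENT_READ)
        if not sel.select(timeout):
            return None  # nothing became readable, no busy looping needed
        return os.read(fd, bufsize)
```

The same descriptor could be registered alongside sockets or timers in an existing event loop, which is the main attraction of the inotify-style model.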

Comment by Gerrit Updater [ 14/Mar/16 ]

Henri Doreau (henri.doreau@cea.fr) uploaded a new patch: http://review.whamcloud.com/18900
Subject: LU-7659 mdc: expose changelog through char devices
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d34ff747f04f30e37a6f13b970bbcaa6ffe9e813

Comment by Andreas Dilger [ 28/Mar/16 ]

I was thinking that this might also be more useful to expose via .lustre/changelog/MDTxxxx rather than /dev/XXX so that it is easily accessed by applications when multiple filesystems are mounted on the same node.

Also, very similar to this would be exposing (probably only on the server?) the virtual "all objects" iterator for each target under similar .lustre/iterator/MDTxxxx and .lustre/iterator/OSTxxxx virtual files (or similar). The MDTxxxx iterators are useful for listing all inodes in order so that they can efficiently be processed for initial RBH scans of all files. The OSTxxxx iterators might be useful for e.g. migrating objects off OSTs, replication of file data, and other operations that touch every object on an online OST, but could be implemented separately as needed. The caveat is that this would only be easily accessed if the OST is online, unless it was handled virtually by traversing the MDT layouts when the OST is offline which would not be nearly as efficient.

Comment by James A Simmons [ 25/Apr/16 ]

As I explore netlink I wonder if the API could be used for this? In my research I discovered it being used by the SCSI layer, which surprised me.

Comment by Gerrit Updater [ 19/May/16 ]

Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20327
Subject: LU-7659 mdc: expose hsm requests through char devices
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 52935cd7190a1b4d4b5def5a9244ce1e5ca60c3a

Comment by Gerrit Updater [ 30/May/16 ]

Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20501
Subject: LU-7659 mdc: add an ioctl call to the copytool char device
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3d4f22eab430874560db611f9bd95fb31f63350f

Comment by Gerrit Updater [ 30/May/16 ]

Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20502
Subject: LU-7659 mdc: revise copytool char device locking
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ebae9fe2b1fdf458638cabbde8a02fc1522ebb75

Comment by Henri Doreau (Inactive) [ 05/Mar/17 ]

Andreas, I realize that I have not answered your questions, sorry for that, see below.

would this new mechanism be able to handle multiple ChangeLog consumers?

Yes, multiple processes can open the char device. By default they start reading from the beginning of the llog and they can lseek to wherever they want in the log to start at a given record. Very similar to the existing implementation in this sense.

my preference would be to use read and write for the interface, instead of ioctl, since this can be used even from scripts

Done.

I would have suggested a /proc file instead of a char device, but new /proc files are frowned upon, and /sys files are only one value per file. The (minor) issue with a char device is the registration of the char major/minor, but it could use a misc char device?

It is a misc char device.

the .llseek() operation should allow seeking to a specific record, so that if there are multiple consumers and old records are not yet cancelled the new records can be found easily

Done, using the record number as the offset to jump to.
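
The seek-by-record-number behaviour with several independent consumers can be modelled as below. This is a simplified sketch, assuming each open handle keeps its own position and that seeking to a record number that has already been cancelled lands on the next record still present; the class name `ConsumerHandle` is hypothetical.

```python
import bisect


class ConsumerHandle:
    """Hypothetical per-open state: a private cursor over the shared log."""

    def __init__(self, record_numbers):
        self._recnos = record_numbers  # sorted record numbers still in the log
        self._pos = 0                  # by default, start at the beginning

    def seek(self, recno):
        # Jump to the first record whose number is >= recno.
        self._pos = bisect.bisect_left(self._recnos, recno)

    def read(self, count):
        batch = self._recnos[self._pos:self._pos + count]
        self._pos += len(batch)
        return batch
```

Two handles over the same log advance independently, so one consumer seeking ahead does not affect where another one reads.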

the char device should also have a .poll() method so that userspace can wait for new records efficiently instead of busy looping

Done.

One issue that had come up with ChangeLogs in the past was that they are single-threaded in the kernel, which limits performance during metadata operations. If we are changing the API in userspace, it might also be good to change the on-disk format to allow multiple ChangeLog files to be written in parallel. Probably not one per core (that may become too many on large MDS nodes), but maybe 4-8 or so. The records could be merge sorted in the kernel by the helper thread at read time.

I'd love that. It is beyond the scope of this patch I'd say, but I'll keep it in mind. Maybe indexes instead of llog catalogs?

Comment by Andreas Dilger [ 06/Mar/17 ]

Yes, we've discussed changing llogs over to use an index instead of a flat file. The benefit of the llog file is that it can be written mostly sequentially, and record cancellation only needs to update the bitmap in the header. The drawback is that updating the header is serialized, reserving space in the llog file is difficult if the record size is unknown, and there is added complexity if the order of the log records does not match the order in which transactions are completed.

On a related note, did you look into connecting the LFSCK iterator to the new char interface to speed up the initial scanning for RBH?

Comment by Gerrit Updater [ 06/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/18900/
Subject: LU-7659 mdc: expose changelog through char devices
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1d40214d96dd6e36bd39a35f8419f753bae8d305

Comment by Peter Jones [ 09/Apr/17 ]

Landed for 2.10

Comment by Gerrit Updater [ 06/Aug/18 ]

Yohan Pipereau (yohan.pipereau.ocre@cea.fr) uploaded a new patch: https://review.whamcloud.com/32941
Subject: LU-7659 libcfs: Use netlink for KUC communication
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 31a839a7e846b7e53c0e846452435fffe83a0585

Comment by Gerrit Updater [ 14/Feb/19 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34258
Subject: LU-7659 hsm: Use netlink for KUC communication
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e3e1c2f7fa81dd149308b41c74ad190e32c858ae
