LU-7659: Replace KUC by more standard mechanisms

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Labels: Upstream
    • Fix Version/s: Lustre 2.10.0
    • 9223372036854775807

    Description

      The Kernel Userland Communication (KUC) subsystem is a Lustre-specific API for something relatively common: delivering a stream of records from the kernel to userland and transmitting feedback from userland back to the kernel. We propose to replace it with character devices.

      Besides being more standard, this can also improve performance significantly: a process can read large chunks from the character device. Our proof of concept shows a 5-10x speedup when reading changelogs in 4 KB blocks.

      I would like feedback and suggestions. The proposed implementation works as follows:

      • Register a misc char device at mdc_setup() time (e.g. /dev/changelog-lustre0-MDT0000). The minor number is associated with the corresponding OBD.
      • The .open handler starts a background thread that iterates over the llog and enqueues up to X records into a ring buffer.
      • The .read handler dequeues records from the ring buffer. We can make it blocking or not.
      • The .release handler stops the background thread and releases resources.
      • Changelog clear is not yet implemented. It could be either a .write or a .unlocked_ioctl handler; which would be preferable?

      The implementation for the copytool has not been done yet but would work in a similar way.
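      For reference, here is a minimal, generic sketch of the kind of misc char device described above. The chlg_* names, the kfifo ring buffer and the kthread producer are illustrative assumptions only, not the actual patch; the real code would hook into mdc_setup() and the llog APIs.

/*
 * Minimal sketch (not the actual Lustre patch): a misc char device whose
 * .open starts a producer thread that fills a ring buffer, and whose .read
 * drains it. A single consumer per device is assumed. All chlg_* names are
 * hypothetical.
 */
#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/kthread.h>
#include <linux/kfifo.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

#define CHLG_FIFO_SIZE 4096             /* ring buffer size, in bytes */

struct chlg_dev {
        struct miscdevice        misc;
        struct task_struct      *prod;  /* background llog reader */
        struct kfifo             fifo;  /* enqueued records */
        wait_queue_head_t        waitq; /* readers sleep here */
};

static int chlg_producer(void *arg)
{
        struct chlg_dev *dev = arg;

        while (!kthread_should_stop()) {
                /* Real code: iterate the llog, kfifo_in() up to X records. */
                wake_up_interruptible(&dev->waitq);
                schedule_timeout_interruptible(HZ);
        }
        return 0;
}

static int chlg_open(struct inode *inode, struct file *file)
{
        /* misc_open() stored the miscdevice pointer in private_data. */
        struct chlg_dev *dev = container_of(file->private_data,
                                            struct chlg_dev, misc);

        dev->prod = kthread_run(chlg_producer, dev, "chlg_prod");
        if (IS_ERR(dev->prod))
                return PTR_ERR(dev->prod);
        file->private_data = dev;
        return 0;
}

static ssize_t chlg_read(struct file *file, char __user *buf, size_t count,
                         loff_t *ppos)
{
        struct chlg_dev *dev = file->private_data;
        unsigned int copied;
        int rc;

        /* Blocking variant: wait until the producer has queued something. */
        rc = wait_event_interruptible(dev->waitq, !kfifo_is_empty(&dev->fifo));
        if (rc)
                return rc;
        rc = kfifo_to_user(&dev->fifo, buf, count, &copied);
        return rc ? rc : copied;
}

static int chlg_release(struct inode *inode, struct file *file)
{
        struct chlg_dev *dev = file->private_data;

        kthread_stop(dev->prod);
        return 0;
}

static const struct file_operations chlg_fops = {
        .owner   = THIS_MODULE,
        .open    = chlg_open,
        .read    = chlg_read,
        .release = chlg_release,
};

/* Registration, called from something like mdc_setup() in the real patch. */
static int chlg_dev_register(struct chlg_dev *dev, const char *name)
{
        int rc = kfifo_alloc(&dev->fifo, CHLG_FIFO_SIZE, GFP_KERNEL);

        if (rc)
                return rc;
        init_waitqueue_head(&dev->waitq);
        dev->misc.minor = MISC_DYNAMIC_MINOR;
        dev->misc.name  = name;         /* e.g. "changelog-lustre0-MDT0000" */
        dev->misc.fops  = &chlg_fops;
        return misc_register(&dev->misc);
}

      Using a misc device keeps registration simple: the kernel hands out a dynamic minor and udev creates the /dev node from the registered name, so no char major number has to be reserved.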



          Activity


            Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20502
            Subject: LU-7659 mdc: revise copytool char device locking
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ebae9fe2b1fdf458638cabbde8a02fc1522ebb75


            Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20501
            Subject: LU-7659 mdc: add an ioctl call to the copytool char device
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3d4f22eab430874560db611f9bd95fb31f63350f


            Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20327
            Subject: LU-7659 mdc: expose hsm requests through char devices
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 52935cd7190a1b4d4b5def5a9244ce1e5ca60c3a


            James A Simmons added a comment -

            As I explore netlink, I wonder if that API could be used for this. In my research I discovered it being used by the SCSI layer, which surprised me.

            Andreas Dilger added a comment -

            I was thinking that this might also be more useful to expose via .lustre/changelog/MDTxxxx rather than /dev/XXX, so that it is easily accessed by applications when multiple filesystems are mounted on the same node.

            Also, very similar to this would be exposing (probably only on the server?) the virtual "all objects" iterator for each target under similar .lustre/iterator/MDTxxxx and .lustre/iterator/OSTxxxx virtual files (or similar). The MDTxxxx iterators are useful for listing all inodes in order so that they can be processed efficiently for initial RBH scans of all files. The OSTxxxx iterators might be useful for e.g. migrating objects off OSTs, replicating file data, and other operations that touch every object on an online OST, but they could be implemented separately as needed. The caveat is that these would only be easily accessible if the OST is online, unless that were handled virtually by traversing the MDT layouts when the OST is offline, which would not be nearly as efficient.
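            As an illustration of the per-mount approach, a consumer could resolve the changelog stream relative to whichever mount point it operates on. The path layout and helper below are hypothetical, following the .lustre/changelog/MDTxxxx naming suggested above.

/* Hypothetical: open the changelog of a given MDT through a per-filesystem
 * .lustre/changelog/<target> virtual file, relative to the mount point. */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>

static int open_changelog(const char *mntpt, const char *mdt)
{
        char path[PATH_MAX];

        snprintf(path, sizeof(path), "%s/.lustre/changelog/%s", mntpt, mdt);
        return open(path, O_RDONLY);    /* e.g. ("/mnt/lustre1", "MDT0000") */
}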

            Henri Doreau (henri.doreau@cea.fr) uploaded a new patch: http://review.whamcloud.com/18900
            Subject: LU-7659 mdc: expose changelog through char devices
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d34ff747f04f30e37a6f13b970bbcaa6ffe9e813

            Robert Read added a comment -

            The inotify API could be a good model for this, particularly providing a file descriptor that can be used with select or poll.
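            For illustration, a userland consumer in that style could look like the sketch below. It assumes the device supports .poll() and packs whole records into each read(), and it reuses the example device name from the description; none of this is the final API.

/* Hypothetical consumer: poll the changelog char device and read records
 * in large chunks, inotify-style. Sketch only, not the actual Lustre API. */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        static char buf[65536];
        struct pollfd pfd;
        ssize_t nread;

        pfd.fd = open("/dev/changelog-lustre0-MDT0000", O_RDONLY | O_NONBLOCK);
        if (pfd.fd < 0) {
                perror("open");
                return EXIT_FAILURE;
        }
        pfd.events = POLLIN;

        for (;;) {
                /* Sleep until the kernel has new records for us. */
                if (poll(&pfd, 1, -1) < 0) {
                        perror("poll");
                        break;
                }
                /* Each read returns as many whole records as fit in buf. */
                nread = read(pfd.fd, buf, sizeof(buf));
                if (nread <= 0)
                        break;
                printf("got %zd bytes of changelog records\n", nread);
                /* ... parse individual records from buf here ... */
        }
        close(pfd.fd);
        return EXIT_SUCCESS;
}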

            Andreas Dilger added a comment -

            Also, in theory the user-kernel interface could be changed without changing the userspace API, though this may be less desirable because of the licensing.

            Also note that it may be possible to simply change the existing pipe interface to allow reading multiple records at once, instead of the current implementation that does two read() calls per record (one for the header and one for the body). It should be possible to read up to 64 KB chunks from the pipe, I think, and it could return as many full records as fit into the buffer, as a short read.
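            To make the multi-record read concrete, the copy loop could look roughly like the sketch below; struct enq_rec and the pending list are hypothetical and do not correspond to the existing KUC pipe code.

/* Sketch: copy as many whole records as fit into the caller's buffer and
 * return a short read at a record boundary. All names are hypothetical. */
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/uaccess.h>

/* A queued record: a full changelog record (header + body) back to back. */
struct enq_rec {
        struct list_head link;
        size_t           len;
        char             data[];
};

static ssize_t chlg_copy_records(char __user *buf, size_t count,
                                 struct list_head *pending)
{
        struct enq_rec *rec, *tmp;
        size_t copied = 0;

        list_for_each_entry_safe(rec, tmp, pending, link) {
                if (copied + rec->len > count)
                        break;                  /* stop on a record boundary */
                if (copy_to_user(buf + copied, rec->data, rec->len))
                        return copied ? copied : -EFAULT;
                copied += rec->len;
                list_del(&rec->link);
                kfree(rec);
        }
        return copied;                          /* possibly a short read */
}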

            Andreas Dilger added a comment -

            Some comments and questions:

            • Would this new mechanism be able to handle multiple ChangeLog consumers?
            • My preference would be to use read and write for the interface, instead of ioctl, since those can be used even from scripts.
            • I would have suggested a /proc file instead of a char device, but new /proc files are frowned upon, and /sys files are only one value per file. The (minor) issue with a char device is the registration of the char major/minor, but it could use a misc char device?
            • The .llseek() operation should allow seeking to a specific record, so that if there are multiple consumers and old records are not yet cancelled, the new records can be found easily.
            • The char device should also have a .poll() method so that userspace can wait for new records efficiently instead of busy-looping.

            One issue that has come up with ChangeLogs in the past is that they are single-threaded in the kernel, which limits performance during metadata operations. If we are changing the API in userspace, it might also be good to change the on-disk format to allow multiple ChangeLog files to be written in parallel. Probably not one per core (that may become too many on large MDS nodes), but perhaps 4-8 or so. The records could be merge-sorted in the kernel by the helper thread at read time.
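            For illustration, the merge step described in the last paragraph could be as simple as repeatedly picking the stream whose head record has the lowest index. This is a generic sketch; the chlg_stream callbacks are hypothetical, not existing llog code.

/* Sketch: k-way merge of several parallel ChangeLog streams. Each stream
 * yields records in increasing index order; the reader thread always
 * consumes the globally smallest head. */
#include <stdint.h>
#include <stddef.h>

struct chlg_stream {
        uint64_t (*peek_index)(struct chlg_stream *); /* UINT64_MAX if empty */
        void     (*pop)(struct chlg_stream *);        /* consume head record */
};

/* Return the stream holding the next record in global order, or NULL. */
static struct chlg_stream *chlg_merge_next(struct chlg_stream **streams, int n)
{
        struct chlg_stream *best = NULL;
        uint64_t best_idx = UINT64_MAX;
        int i;

        for (i = 0; i < n; i++) {
                uint64_t idx = streams[i]->peek_index(streams[i]);

                if (idx < best_idx) {
                        best_idx = idx;
                        best = streams[i];
                }
        }
        return best; /* caller reads the head record, then calls best->pop() */
}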
            Henri Doreau (Inactive) added a comment - edited

            I should add that KUC can deliver ~100k records per second in my benchmarks, though we have a robinhood setup able to consume 70k/s. This motivated my work. The new interface would also make it easier to read records from any language, and would look nicer.

            Henri Doreau (Inactive) added a comment -

            Thanks Bruno,

            this thread makes record retrieval asynchronous and speeds the whole thing up: while userland processes one batch of records, the next one is being retrieved in the background. Do you think it's overkill?

            People

              Assignee: Yohan Pipereau (Inactive)
              Reporter: Henri Doreau (Inactive)
              Votes: 0
              Watchers: 22
