Details

    • New Feature
    • Resolution: Won't Do
    • Minor
    • Lustre 2.10.0
    • 9223372036854775807

    Description

      Tracking bug for fixing the Lustre lloop driver. There are a number of improvements to be made internally to better integrate with the loop driver in the upstream kernel, which will allow removal of a lot of code that is just copied directly from the existing loop.c file.

      While most applications deal with files, in a number of cases it is desirable to export a block device interface on a client in an efficient manner. Use cases include loopback filesystem images for VM hosting, containers for very small files, swap, etc. A prototype block device was created for Lustre, based on the Linux loop.c driver, but it was never completed and has become outdated as kernel APIs have evolved. The goal of this project is to update or rewrite the Lustre lloop driver so that it can be used for high-performance block device access in a reliable manner.

      A further goal would be to investigate and resolve deadlocks in the lloop IO path by using preallocation or memory pools to avoid allocation under memory pressure. This could be used for swapping on the client, which is useful on HPC systems where the clients do not have any disks. When running on an RDMA network (which is typical for Lustre) the space for replies is reserved in advance, so no memory allocation is needed to receive replies from the server, unlike with TCP-based networks.

      • Salvage/replace existing prototype block device driver
      • High performance loop driver for Lustre files
      • Avoid memory allocation deadlocks under load
      • Bypass kernel VFS for efficient network IO
      • Stretch Goal: swap on Lustre on RDMA network
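
      As a rough illustration of the memory-pool approach mentioned above, the sketch below preallocates per-request structures at module init so the IO path never allocates memory under pressure. It is only a sketch: the lloop_cmd structure and the function names are hypothetical and do not come from the existing driver.

      #include <linux/bio.h>
      #include <linux/mempool.h>
      #include <linux/slab.h>

      /* Hypothetical per-request structure for a rewritten lloop driver. */
      struct lloop_cmd {
              struct bio *lc_bio;
      };

      static struct kmem_cache *lloop_cmd_cache;
      static mempool_t *lloop_cmd_pool;

      /* Reserve a fixed number of request structures at module init so the
       * IO path never has to allocate memory while the system is under
       * memory pressure (e.g. when the device is backing swap). */
      static int lloop_pool_init(void)
      {
              lloop_cmd_cache = kmem_cache_create("lloop_cmd",
                                                  sizeof(struct lloop_cmd),
                                                  0, 0, NULL);
              if (!lloop_cmd_cache)
                      return -ENOMEM;

              lloop_cmd_pool = mempool_create_slab_pool(64, lloop_cmd_cache);
              if (!lloop_cmd_pool) {
                      kmem_cache_destroy(lloop_cmd_cache);
                      return -ENOMEM;
              }
              return 0;
      }

      /* In the IO path: GFP_NOIO avoids recursing into filesystem writeback,
       * and the preallocated reserve guarantees forward progress. */
      static struct lloop_cmd *lloop_cmd_get(struct bio *bio)
      {
              struct lloop_cmd *cmd = mempool_alloc(lloop_cmd_pool, GFP_NOIO);

              cmd->lc_bio = bio;
              return cmd;
      }

      static void lloop_cmd_put(struct lloop_cmd *cmd)
      {
              mempool_free(cmd, lloop_cmd_pool);
      }

      The same pattern could be applied to any other per-request allocations in the IO path (bounce pages, RPC descriptors), which is what makes swap on a diskless client feasible.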

    Activity

            [LU-6585] Virtual block device (lloop)

            simmonsja James A Simmons added a comment - I did both, but I don't have those numbers anymore. I sent the results to Andreas, so he might still have that email.

            Jinshan Jinshan Xiong added a comment - Which mode did the kernel loop driver use, direct or cached IO? I would appreciate it if you could share any test data (I know it was a long time ago).

            simmonsja James A Simmons added a comment - One of the main reasons for deleting llite_loop was that its performance was so poor compared to the upstream loop driver. Before we start reinventing the wheel we should find out why this limitation exists. I also strongly suggest that we contribute to the Linux kernel to improve the loopback device, to earn credit with the kernel community. Plus, it saves us the cost of maintenance in the long term.

            Jinshan Jinshan Xiong added a comment - We probably have to bring the llite_loop driver back because of an obvious drawback in the kernel loopback device. Even though the kernel's loopback device already has direct IO support, there seems to be no way to increase the I/O size.
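
            For reference, a minimal user-space sketch of the direct IO mode of the upstream loop device referred to above, using the LOOP_SET_FD and LOOP_SET_DIRECT_IO ioctls from <linux/loop.h>; the device and backing-file paths are placeholders only:

            #include <fcntl.h>
            #include <linux/loop.h>
            #include <stdio.h>
            #include <sys/ioctl.h>
            #include <unistd.h>

            int main(void)
            {
                    /* Placeholder paths: a free loop device and a backing file on Lustre. */
                    int loop_fd = open("/dev/loop0", O_RDWR);
                    int back_fd = open("/mnt/lustrefs/image.img", O_RDWR);

                    if (loop_fd < 0 || back_fd < 0) {
                            perror("open");
                            return 1;
                    }

                    /* Attach the backing file to the loop device. */
                    if (ioctl(loop_fd, LOOP_SET_FD, back_fd) < 0) {
                            perror("LOOP_SET_FD");
                            return 1;
                    }

                    /* Ask the loop driver to issue direct IO against the backing
                     * file, bypassing the page cache on the loop side. */
                    if (ioctl(loop_fd, LOOP_SET_DIRECT_IO, 1) < 0)
                            perror("LOOP_SET_DIRECT_IO");

                    close(back_fd);
                    close(loop_fd);
                    return 0;
            }

            (util-linux losetup exposes the same control via its --direct-io option.)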

            simmonsja James A Simmons added a comment - The llite_lloop device is no longer supported, so we can close this ticket.

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment - In the life sciences, many applications are not able to run in parallel across different nodes on the same data (especially in the genomics sector). These applications can benefit from local NVM/SSD devices, but the space available is limited.
            I measured the performance of a very recent PCIe NVM device versus Lustre versus a loopback device hosted on Lustre using compilebench; these are the results:

            LUSTRE

            [cwuser2@cw-1-00 compilebench-0.6]$ ./compilebench -D /mnt/lustrefs/cwuser2/ -i 2 -r 2 --makej -n
            using working directory /mnt/lustrefs/cwuser2/, 2 intial dirs 2 runs
            native unpatched native-0 222MB in 69.37 seconds (3.21 MB/s)
            native patched native-0 109MB in 20.61 seconds (5.32 MB/s)
            native patched compiled native-0 691MB in 10.98 seconds (62.99 MB/s)
            create dir kernel-0 222MB in 70.10 seconds (3.17 MB/s)
            create dir kernel-1 222MB in 69.68 seconds (3.19 MB/s)
            compile dir kernel-1 680MB in 12.31 seconds (55.29 MB/s)
            compile dir kernel-0 680MB in 12.19 seconds (55.84 MB/s)
            read dir kernel-1 in 30.01 30.09 MB/s
            read dir kernel-0 in 30.09 30.01 MB/s
            read dir kernel-1 in 22.83 39.55 MB/s
            delete kernel-1 in 14.75 seconds
            delete kernel-0 in 15.10 seconds
            

            INTEL NVM

            [cwuser2@cw-1-00 compilebench-0.6]$ ./compilebench -D /mnt/intel-nvm/ -i 2 -r 2 --makej -n
            using working directory /mnt/intel-nvm/, 2 intial dirs 2 runs
            native unpatched native-0 222MB in 0.83 seconds (267.92 MB/s)
            native patched native-0 109MB in 0.30 seconds (365.57 MB/s)
            native patched compiled native-0 691MB in 0.66 seconds (1047.87 MB/s)
            create dir kernel-0 222MB in 0.82 seconds (271.19 MB/s)
            create dir kernel-1 222MB in 0.98 seconds (226.91 MB/s)
            compile dir kernel-1 680MB in 0.68 seconds (1000.93 MB/s)
            compile dir kernel-0 680MB in 0.68 seconds (1000.93 MB/s)
            read dir kernel-1 in 0.48 1881.27 MB/s
            read dir kernel-0 in 0.47 1921.29 MB/s
            read dir kernel-1 in 0.43 2100.02 MB/s
            delete kernel-1 in 0.55 seconds
            delete kernel-0 in 0.54 seconds
            

            LUSTRE LOOPBACK

            [cwuser2@cw-1-00 compilebench-0.6]$ ./compilebench -D /mnt/lustre-loopback/ -i 2 -r 2 --makej -n
            using working directory /mnt/lustre-loopback/, 2 intial dirs 2 runs
            native unpatched native-0 222MB in 0.70 seconds (317.68 MB/s)
            native patched native-0 109MB in 0.23 seconds (476.83 MB/s)
            native patched compiled native-0 691MB in 0.44 seconds (1571.81 MB/s)
            create dir kernel-0 222MB in 0.68 seconds (327.02 MB/s)
            create dir kernel-1 222MB in 0.68 seconds (327.02 MB/s)
            compile dir kernel-1 680MB in 0.46 seconds (1479.64 MB/s)
            compile dir kernel-0 680MB in 0.47 seconds (1448.16 MB/s)
            read dir kernel-1 in 0.45 2006.69 MB/s
            read dir kernel-0 in 0.46 1963.06 MB/s
            read dir kernel-1 in 0.43 2100.02 MB/s
            delete kernel-1 in 0.43 seconds
            delete kernel-0 in 0.43 seconds
            
            rread Robert Read added a comment - Another interesting use case would be to combine multiple devices with AUFS or OverlayFS to provide a COW-style filesystem, similar to what Docker does. The lower, read-only device could be a common, shared filesystem image, and the upper layer could be a writable, private filesystem stored on local storage, or perhaps even on Lustre as well.
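
            A minimal sketch of the OverlayFS stacking described above, using the mount(2) syscall; the /ro, /rw, /work, and /merged paths are placeholders only (in this scenario /ro would be the read-only, loop-mounted shared image):

            #include <stdio.h>
            #include <sys/mount.h>

            int main(void)
            {
                    /* Placeholder paths: /ro is the read-only, loop-mounted shared
                     * image, /rw and /work live on local (or Lustre) storage, and
                     * /merged is where the combined COW view appears. */
                    const char *opts = "lowerdir=/ro,upperdir=/rw,workdir=/work";

                    if (mount("overlay", "/merged", "overlay", 0, opts) != 0) {
                            perror("mount overlay");
                            return 1;
                    }
                    printf("COW filesystem mounted at /merged\n");
                    return 0;
            }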

    People

      Assignee: simmonsja James A Simmons
      Reporter: adilger Andreas Dilger
      Votes: 0
      Watchers: 16
