Lustre / LU-11264

llapi_* routines demonstrate poor performance

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Argonne's Theta machine is a Cray running Lustre 2.7.3.74
      kernel: patchless_client
      build: 2.7.3.74

    Description

      I'm going to bug Cray about this, but Andreas said "we can't fix problems that aren't in Jira", so here I am.

      In the old days one would get striping information via an ioctl:

      struct lov_user_md *lum;
      size_t lumsize = sizeof(struct lov_user_md) +
                       LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data);

      lum = malloc(lumsize);
      lum->lmm_magic = LOV_USER_MAGIC;
      ret = ioctl(fd, LL_IOC_LOV_GETSTRIPE, (void *)lum);

      Then the llapi_ routines showed up:

         ret = llapi_file_get_stripe(opt_file, lum);

      and even more sophisticated llapi_layout_ routines:

      struct llapi_layout *layout;
      uint64_t count, size, offset;

      layout = llapi_layout_get_by_fd(fd, 0);
      llapi_layout_stripe_count_get(layout, &count);   /* these take uint64_t *,  */
      llapi_layout_stripe_size_get(layout, &size);     /* not the narrow fields   */
      llapi_layout_ost_index_get(layout, 0, &offset);  /* of struct lov_user_md   */
      llapi_layout_free(layout);

       

      I haven't done a full scaling run, but small-scale tests don't look great for the newer interfaces. Average time to get the stripe size for a 2-node, 128-process MPI job:

      ioctl:   17.2 ms

      llapi_file_get_stripe: 142.0 ms

      llapi_layout:  2220 ms

       

      Wow! I want to use the new interfaces, but these overheads are bonkers.

       

      Attachments

      Issue Links

      Activity

            [LU-11264] llapi_* routines demonstrate poor performance

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35093/
            Subject: LU-11264 llapi: clean up llapi_file_create_foreign()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: efe6615d7558ee6bda106546c55d92758bec528d

            gerrit Gerrit Updater added a comment

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35092/
            Subject: LU-11264 llapi: clean up llapi_search_tgt() code
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 096db80e0810b4abb8e268a58623c23bf0b57d2c

            gerrit Gerrit Updater added a comment

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35091/
            Subject: LU-11264 llapi: reduce llapi_stripe_limit_check() overhead
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1c8ce141c6b6b621cf0fc89d32e1e7a2db43729d

            gerrit Gerrit Updater added a comment

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35091
            Subject: LU-11264 llapi: reduce llapi_stripe_limit_check() overhead
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: dfea5bd239531a91d19f603945601e39f9254f84

            gerrit Gerrit Updater added a comment

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35028
            Subject: LU-11264 utils: avoid pool overhead in lfs setstripe
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e1246538418d2b58f5457d91956c7c75a9fb1153

            gerrit Gerrit Updater added a comment

            I just happened to be looking at this code on the setstripe side and see a number of places where it could be optimized. Running a simple test comparing lfs setstripe with touch and createmany shows a big difference in performance (files removed between each test):

            $ time lfs setstripe -c 1 /myth/tmp/tmp/f{1..1000} :  real: 0m11.089s, user: 0m0.115s, sys: 0m2.132s
            $ time touch /myth/tmp/tmp/f{1..1000}              :  real:  0m6.400s, user: 0m0.018s, sys: 0m0.863s
            $ time createmany -o /myth/tmp/tmp/f    1000       :  real:  0m2.329s, user: 0m0.008s, sys: 0m0.185s
            

            It is pretty clear from the elapsed time and the system time that lfs setstripe is doing a lot more stuff for each file create compared to the others (touch is additionally setting the timestamps). Looking at the strace of lfs setstripe -c1 on a client shows what is going on:

            gettid()                                = 3237
            open("/dev/urandom", O_RDONLY|O_NOFOLLOW) = 3
            read(3, "Fg#\306", 4)                   = 4
            close(3)                                = 0
            lstat("/myth", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
            lstat("/myth/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0
            lstat("/myth/tmp/tmp", {st_mode=S_IFDIR|0775, st_size=53248, ...}) = 0
            lstat("/myth/tmp/tmp/foo", 0x7ffe9dc99410) = -1 ENOENT (No such file or directory)
            lstat("/myth", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
            lstat("/myth/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0
            lstat("/myth/tmp/tmp", {st_mode=S_IFDIR|0775, st_size=53248, ...}) = 0
            lstat("/myth/tmp/tmp/foo", 0x7ffe9dc99410) = -1 ENOENT (No such file or directory)
            lstat("/myth", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
            lstat("/myth/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0
            lstat("/myth/tmp/tmp", {st_mode=S_IFDIR|0775, st_size=53248, ...}) = 0
            open("/proc/mounts", O_RDONLY|O_CLOEXEC) = 3
            fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
            read(3, "rootfs / rootfs rw 0 0\nproc /pro"..., 1024) = 1024
            read(3, "tmp ext4 ro,relatime,barrier=1,d"..., 1024) = 88
            read(3, "", 1024)                       = 0
            close(3)                                = 0
            open("/myth/tmp/tmp/foo", O_WRONLY|O_CREAT|O_LOV_DELAY_CREATE, 0644) = 3
            ioctl(3, LL_IOC_LOV_SETSTRIPE, 0x942060)          = 0
            

            Firstly, it appears that the liblustreapi_init() function always does some work to initialize the PRNG, but the PRNG is only used in a small number of cases. This initialization could be skipped unless the PRNG is actually needed in a particular function; even though it is a one-time cost, it is still needless overhead.
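            A minimal sketch of that lazy-seeding idea (the helper name and structure here are hypothetical, not liblustreapi's actual code): seed on first use instead of in a library constructor, so callers that never need random numbers skip the /dev/urandom syscalls entirely:

            ```c
            #include <stdio.h>
            #include <stdlib.h>
            #include <fcntl.h>
            #include <unistd.h>

            /* Hypothetical sketch: seed the PRNG on first use rather than at
             * library init time. */
            static void seed_prng_once(void)
            {
                    static int seeded;
                    unsigned int seed = 0;
                    int fd;

                    if (seeded)
                            return;

                    fd = open("/dev/urandom", O_RDONLY);
                    if (fd >= 0) {
                            if (read(fd, &seed, sizeof(seed)) != sizeof(seed))
                                    seed = (unsigned int)getpid(); /* weak fallback */
                            close(fd);
                    }
                    srandom(seed);
                    seeded = 1;
            }

            int main(void)
            {
                    seed_prng_once();   /* first call pays the open/read/close cost */
                    seed_prng_once();   /* later calls return immediately */
                    printf("seeded\n");
                    return 0;
            }
            ```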

            More importantly, it is doing multiple traversals of the full pathname (bad) to do realpath() on the filename and processing /proc/mounts for each file to determine if the file is on a Lustre filesystem or not. IMHO, it should be enough to return an error from ioctl(LL_IOC_LOV_SETSTRIPE) to determine that the file is not on a Lustre filesystem, or if that is really deemed necessary we can do statfs() on the directory and check the magic without doing dozens of syscalls.
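            The statfs() check suggested above might look like the following sketch (not actual liblustreapi code; the helper name is made up, and LL_SUPER_MAGIC is Lustre's superblock magic value):

            ```c
            #include <stdio.h>
            #include <sys/vfs.h>

            #ifndef LL_SUPER_MAGIC
            #define LL_SUPER_MAGIC 0x0BD00BD0 /* Lustre superblock magic */
            #endif

            /* One statfs() call replaces realpath() plus a /proc/mounts scan:
             * returns 1 if path is on Lustre, 0 if not, -1 on error. */
            static int is_lustre(const char *path)
            {
                    struct statfs sfs;

                    if (statfs(path, &sfs) < 0)
                            return -1;
                    return sfs.f_type == LL_SUPER_MAGIC;
            }

            int main(void)
            {
                    printf("%s\n", is_lustre("/tmp") == 1 ? "lustre" : "not lustre");
                    return 0;
            }
            ```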

            There is also extra overhead when a pool name is specified, to verify that it is valid; this could be a one-time check for repeated use of the same pool(s). The same applies to verifying stripe_size >= PAGE_SIZE, since PAGE_SIZE does not change from one file to the next.
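            Hoisting the PAGE_SIZE lookup is straightforward; a sketch with hypothetical helper names:

            ```c
            #include <stdio.h>
            #include <unistd.h>

            /* Cache the page size so repeated stripe_size validations don't
             * re-query it for every file created. */
            static long cached_page_size(void)
            {
                    static long page_size;

                    if (page_size == 0)
                            page_size = sysconf(_SC_PAGESIZE);
                    return page_size;
            }

            static int stripe_size_valid(unsigned long long stripe_size)
            {
                    return stripe_size >= (unsigned long long)cached_page_size();
            }

            int main(void)
            {
                    /* 1 MiB is valid on any common page size; 1 byte never is. */
                    printf("%d %d\n", stripe_size_valid(1ULL << 20),
                           stripe_size_valid(1));
                    return 0;
            }
            ```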

            Finally, I observe in the debug logs that between open(O_LOV_DELAY_CREATE) and ioctl(LL_IOC_LOV_SETSTRIPE) there is a DLM lock callback from the MDS to the client for the layout lock (I think). It doesn't make sense to grant the MDS_INODELOCK_LAYOUT lock to the client opening with O_LOV_DELAY_CREATE, when we know the client will immediately be changing it.

            adilger Andreas Dilger added a comment

            By the way, thanks for your quick reply last week.

            paf Patrick Farrell (Inactive) added a comment

            Robert,

            Here's the problem in a nutshell.

            lfs getstripe asks the client for the stripe count information that it has locally, and gets it locally if it's present.  This is fast because that information is generally already cached locally when the file is open.

            The other two functions do not ask the client for the layout as such. Instead, the ioctl they use directly reads the layout extended attribute from the MDS. This read of an extended attribute cannot be satisfied by the layout information held in memory on the client (which it usually has if the file is open); instead, it goes directly to the MDS.

            The solution would be to rewrite the implementation of those new ioctls to get the layout that the client has in memory and request it if it's not present.

            paf Patrick Farrell (Inactive) added a comment

            Oh yeah, lots of variance in these tests. It's not statistically rigorous, but the above numbers came from me running each test 5 times and reporting an average.
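            For reference, a run-and-average harness along those lines might look like the following (hypothetical; the reporter's actual test is an MPI program, and the timed no-op here is just a stand-in for the real ioctl/llapi stripe query):

            ```c
            #include <stdio.h>
            #include <time.h>

            /* Time op() `runs` times and return the mean wall-clock elapsed
             * time in milliseconds. */
            static double avg_elapsed_ms(void (*op)(void), int runs)
            {
                    struct timespec t0, t1;
                    double total_ms = 0.0;
                    int i;

                    for (i = 0; i < runs; i++) {
                            clock_gettime(CLOCK_MONOTONIC, &t0);
                            op();
                            clock_gettime(CLOCK_MONOTONIC, &t1);
                            total_ms += (t1.tv_sec - t0.tv_sec) * 1000.0 +
                                        (t1.tv_nsec - t0.tv_nsec) / 1e6;
                    }
                    return total_ms / runs;
            }

            static void noop(void) { }

            int main(void)
            {
                    printf("%s\n", avg_elapsed_ms(noop, 5) >= 0.0 ? "ok" : "fail");
                    return 0;
            }
            ```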

            robl Robert Latham added a comment

            Thanks, Rob - I actually just grabbed your test and did a bunch of testing myself.

            I see a really wildly large gap even with 1 process - interesting.

            This is with debug turned on, on a small VM setup:

            [root@cent7c01 cent7s01]# rm -f testfile; lctl clear; mpirun --allow-run-as-root -n 1 /tmp/a.out -c -i testfile; lctl dk > /tmp/out3
            opt_file: testfile, opt_create: 1, opt_fstat: 0, opt_lseek: 0, opt_realpath: 0, opt_llapi: 0, opt_ioctl: 1, opt_fpp: 0, opt_llapi_fd: 0, nprocs: 1, time: 2.622000 ms

            [root@cent7c01 cent7s01]# rm -f testfile; lctl clear; mpirun --allow-run-as-root -n 1 /tmp/a.out -c -a testfile; lctl dk > /tmp/out
            opt_file: testfile, opt_create: 1, opt_fstat: 0, opt_lseek: 0, opt_realpath: 0, opt_llapi: 1, opt_ioctl: 0, opt_fpp: 0, opt_llapi_fd: 0, nprocs: 1, time: 3.715000 ms
            [root@cent7c01 cent7s01]# rm -f testfile; lctl clear; mpirun --allow-run-as-root -n 1 /tmp/a.out -c -A testfile; lctl dk > /tmp/out2
            opt_file: testfile, opt_create: 1, opt_fstat: 0, opt_lseek: 0, opt_realpath: 0, opt_llapi: 0, opt_ioctl: 0, opt_fpp: 0, opt_llapi_fd: 1, nprocs: 1, time: 5.026000 ms

            So that's:

            ioctl at 2.6 ms
            get_stripe at 3.7 ms
            get_by_fd at 5.0 ms

            That's... wild.  I'll try to dig in later.

            EDIT:
            My previous comment about no additional server communications was wrong; had an error in my data collection.

            The simple create+ioctl does 2 server communications, one of which is close().

            The llapi_get_stripe function does 3 (create, some sort of get, then close).

            The layout get function does 4.

            I've got some ideas why - I'll take a further look next week.

            paf Patrick Farrell (Inactive) added a comment - edited

            People

              Assignee: wc-triage WC Triage
              Reporter: robl Robert Latham
              Votes: 0
              Watchers: 6