[LU-11264] llapi_* routines demonstrate poor performance Created: 17/Aug/18  Updated: 14/Sep/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Robert Latham Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Argonne's Theta machine is a Cray running lustre: 2.7.3.74
kernel: patchless_client
build: 2.7.3.74


Attachments: File llapi-perf.c, PNG File lustre-ioctl-vs-llapi.png
Issue Links:
Related
is related to LU-14645 setstripe cleanup Resolved
is related to LU-13693 lfs getstripe should avoid opening re... Resolved
is related to LU-17025 Invalid pool name does not return error Open
is related to LU-14316 lfs: using old ioctl(LL_IOC_LOV_GETST... Open
is related to LU-14897 "lfs setstripe/migrate" should check ... Open
is related to LU-10500 Buffer overflow of 'buffer' due to no... Resolved
is related to LU-12327 DNE3: add llapi_dir_create_from_xattr... Open
is related to LU-12369 lfs setstripe causes MDS DLM lock cal... Open
is related to LU-14396 cache llapi_search_mounts() result Open
Epic/Theme: Performance
Rank (Obsolete): 9223372036854775807

 Description   

I'm going to bug Cray about this, but Andreas said "we can't fix problems that aren't in Jira", so here I am.

In the old days, one would get striping information via an ioctl:

struct lov_user_md *lum;
size_t lumsize = sizeof(struct lov_user_md) +
                 LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data);

lum = malloc(lumsize);    /* lum must be allocated before the ioctl */
lum->lmm_magic = LOV_USER_MAGIC;
ret = ioctl(fd, LL_IOC_LOV_GETSTRIPE, (void *)lum);

Then the llapi_ routines showed up:

 

   ret = llapi_file_get_stripe(opt_file, lum);

and even more sophisticated llapi_layout_ routines:

struct llapi_layout *layout;
uint64_t count, size, offset;    /* the accessors take uint64_t *, not the narrower lum fields */

layout = llapi_layout_get_by_fd(fd, 0);
llapi_layout_stripe_count_get(layout, &count);
llapi_layout_stripe_size_get(layout, &size);
llapi_layout_ost_index_get(layout, 0, &offset);
llapi_layout_free(layout);

 

I haven't done a full scaling run, but small-scale tests don't look great for the newer interfaces. Average time to get the stripe size for a 2-node, 128-process MPI job:

ioctl:                  17.2 ms
llapi_file_get_stripe:  142.0 ms
llapi_layout:           2220 ms

 

Wow!  I want to use the new interfaces but these overheads are bonkers.
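
For context, the measurement is roughly of this shape (a minimal sketch rather than the attached llapi-perf.c itself; the file handling, iteration, and reporting details here are assumptions):

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <mpi.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
    struct llapi_layout *layout;
    uint64_t count = 0;
    double start, elapsed, worst;
    int rank, fd;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    fd = open(argv[1], O_RDONLY);    /* error handling omitted for brevity */

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();

    layout = llapi_layout_get_by_fd(fd, 0);
    llapi_layout_stripe_count_get(layout, &count);
    llapi_layout_free(layout);

    elapsed = MPI_Wtime() - start;

    /* the job is only as fast as its slowest rank, so report the max */
    MPI_Reduce(&elapsed, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("llapi_layout: %.1f ms (stripe count %llu)\n",
               worst * 1000.0, (unsigned long long)count);

    close(fd);
    MPI_Finalize();
    return 0;
}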

 



 Comments   
Comment by Robert Latham [ 17/Aug/18 ]

I attached a graph of a scaling experiment on NERSC's Cray Lustre deployment on Cori from, I think, two years ago, so the problem is not a recent regression.

Comment by Patrick Farrell (Inactive) [ 07/Sep/18 ]

Robert,

I'm a Cray person, but I'll discuss this here so as not to leave the Whamcloud guys out in the cold.

What exactly are you measuring?  Is this execution time of the function directly, within a binary, or is it execution time of an entire binary containing just this function call?

Can you share whatever your test code is here?  Just to make our lives easier.

A few more thoughts/questions.

  1. The llapi_file_get_stripe function you selected is not a direct equivalent of that ioctl.  It also has to open() the file.  So that may be a source of some of the overhead.
  2. Which of the llapi_layout functions is taking that much time?  You don't specify.

Those last three functions (stripe_count_get, stripe_size_get, and ost_index_get) literally just access pointers in the memory given, so if they're taking milliseconds by themselves...

 

Comment by Patrick Farrell (Inactive) [ 07/Sep/18 ]

Also, can you compare execution time from one thread?  If it's really as bad as your measurements indicate, it should be straightforward to do so.

Comment by Robert Latham [ 07/Sep/18 ]

I attached the test program I used. It's testing a bunch of different things, but to exercise the routines in this test case:

  • ioctl to get information: llapi-perf -i LUSTRE_FILE
  • llapi_file_get_stripe: llapi-perf -a LUSTRE_FILE
  • llapi_layout_get_by_fd: llapi-perf -A LUSTRE_FILE

Absolutely agreed that llapi_file_get_stripe should be more expensive, since it operates on a filename. I was astonished to see the llapi_layout approach take so long, considering it operates on the fd directly.

I assumed the initial llapi_layout_get_by_fd call took all the time, but I was measuring the time to call the four routines together; after all, there's no reason for an application to call just get_by_fd.
I'll add some finer-grained timing information and re-run the experiment; something like the sketch below would split out the calls.
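
A minimal sketch of that finer-grained timing (illustrative only, not the attached test program; the helper names are mine):

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <lustre/lustreapi.h>

/* monotonic wall-clock time in milliseconds */
static double now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

static void time_layout_calls(int fd)
{
    struct llapi_layout *layout;
    uint64_t count, size, offset;
    double t0, t1, t2;

    t0 = now_ms();
    layout = llapi_layout_get_by_fd(fd, 0);
    t1 = now_ms();
    llapi_layout_stripe_count_get(layout, &count);
    llapi_layout_stripe_size_get(layout, &size);
    llapi_layout_ost_index_get(layout, 0, &offset);
    t2 = now_ms();
    llapi_layout_free(layout);

    printf("get_by_fd: %.3f ms, accessors: %.3f ms (count=%llu size=%llu ost=%llu)\n",
           t1 - t0, t2 - t1, (unsigned long long)count,
           (unsigned long long)size, (unsigned long long)offset);
}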
 

Comment by Patrick Farrell (Inactive) [ 07/Sep/18 ]

Thanks, Rob - I actually just grabbed your test and did a bunch of testing myself.

I see a wildly large gap even with 1 process - interesting.

This is with debug turned on, on a small VM setup:

[root@cent7c01 cent7s01]# rm -f testfile; lctl clear; mpirun --allow-run-as-root -n 1 /tmp/a.out -c -i testfile; lctl dk > /tmp/out3
opt_file: testfile, opt_create: 1, opt_fstat: 0, opt_lseek: 0, opt_realpath: 0, opt_llapi: 0, opt_ioctl: 1, opt_fpp: 0, opt_llapi_fd: 0, nprocs: 1, time: 2.622000 ms

[root@cent7c01 cent7s01]# rm -f testfile; lctl clear; mpirun --allow-run-as-root -n 1 /tmp/a.out -c -a testfile; lctl dk > /tmp/out
opt_file: testfile, opt_create: 1, opt_fstat: 0, opt_lseek: 0, opt_realpath: 0, opt_llapi: 1, opt_ioctl: 0, opt_fpp: 0, opt_llapi_fd: 0, nprocs: 1, time: 3.715000 ms
[root@cent7c01 cent7s01]# rm -f testfile; lctl clear; mpirun --allow-run-as-root -n 1 /tmp/a.out -c -A testfile; lctl dk > /tmp/out2
opt_file: testfile, opt_create: 1, opt_fstat: 0, opt_lseek: 0, opt_realpath: 0, opt_llapi: 0, opt_ioctl: 0, opt_fpp: 0, opt_llapi_fd: 1, nprocs: 1, time: 5.026000 ms

So that's "ioctl" at 2.6 ms, get_stripe at 3.7 ms, and get_by_fd at 5.0 ms.

That's... wild.  I'll try to dig in later.

EDIT:
My previous comment about no additional server communications was wrong; I had an error in my data collection.

The simple create+ioctl does 2 server requests, one of which is close().

The llapi_get_stripe function does 3 (create, some sort of get, then close).

The layout get function does 4.

I've got some ideas why - I'll take a further look next week.

Comment by Robert Latham [ 07/Sep/18 ]

Oh yeah, lots of variance in these tests. It's not statistically rigorous, but the above numbers came from running each test 5 times and reporting an average.

Comment by Patrick Farrell (Inactive) [ 10/Sep/18 ]

Robert,

Here's the problem in a nutshell.

lfs getstripe asks the client for the stripe count information that it has locally, and gets it locally if it's present.  This is fast because that information is generally already cached locally when the file is open.

The other two functions do not ask the client for the layout as such; instead, the ioctl they use directly reads the layout extended attribute from the MDS. This read of an extended attribute cannot be satisfied by the layout information held in memory on the client (which it usually has if the file is open); instead, it goes directly to the MDS.

The solution would be to rewrite the implementation of those new ioctls to return the layout that the client already has in memory, requesting it from the MDS only if it's not present.

Comment by Patrick Farrell (Inactive) [ 10/Sep/18 ]

By the way, thanks for your quick reply last week.

Comment by Andreas Dilger [ 31/May/19 ]

I just happened to be looking at this code on the setstripe side and see a number of places where it could be optimized. Running a simple test comparing lfs setstripe with touch and createmany shows a big difference in performance (files removed between each test):

$ time lfs setstripe -c 1 /myth/tmp/tmp/f{1..1000} :  real: 0m11.089s, user: 0m0.115s, sys: 0m2.132s
$ time touch /myth/tmp/tmp/f{1..1000}              :  real:  0m6.400s, user: 0m0.018s, sys: 0m0.863s
$ time createmany -o /myth/tmp/tmp/f    1000       :  real:  0m2.329s, user: 0m0.008s, sys: 0m0.185s

It is pretty clear from the elapsed time and the system time that lfs setstripe is doing a lot more stuff for each file create compared to the others (touch is additionally setting the timestamps). Looking at the strace of lfs setstripe -c1 on a client shows what is going on:

gettid()                                = 3237
open("/dev/urandom", O_RDONLY|O_NOFOLLOW) = 3
read(3, "Fg#\306", 4)                   = 4
close(3)                                = 0
lstat("/myth", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
lstat("/myth/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0
lstat("/myth/tmp/tmp", {st_mode=S_IFDIR|0775, st_size=53248, ...}) = 0
lstat("/myth/tmp/tmp/foo", 0x7ffe9dc99410) = -1 ENOENT (No such file or directory)
lstat("/myth", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
lstat("/myth/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0
lstat("/myth/tmp/tmp", {st_mode=S_IFDIR|0775, st_size=53248, ...}) = 0
lstat("/myth/tmp/tmp/foo", 0x7ffe9dc99410) = -1 ENOENT (No such file or directory)
lstat("/myth", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
lstat("/myth/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0
lstat("/myth/tmp/tmp", {st_mode=S_IFDIR|0775, st_size=53248, ...}) = 0
open("/proc/mounts", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3, "rootfs / rootfs rw 0 0\nproc /pro"..., 1024) = 1024
read(3, "tmp ext4 ro,relatime,barrier=1,d"..., 1024) = 88
read(3, "", 1024)                       = 0
close(3)                                = 0
open("/myth/tmp/tmp/foo", O_WRONLY|O_CREAT|O_LOV_DELAY_CREATE, 0644) = 3
ioctl(3, LL_IOC_LOV_SETSTRIPE, 0x942060)          = 0

Firstly, it appears that the liblustreapi_init() function always does some work to initialize the PRNG, but this PRNG is only used in a small number of cases. This initialization could be skipped unless the PRNG is actually needed in particular functions; even though it is only a one-time cost, it is still needless overhead.
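
A lazy-initialization pattern along these lines would defer that cost to the callers that actually need random numbers (a sketch, not the actual liblustreapi code; the function names are mine):

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_once_t seed_once = PTHREAD_ONCE_INIT;

static void seed_prng(void)
{
    unsigned int seed = 0;
    int fd = open("/dev/urandom", O_RDONLY | O_NOFOLLOW);

    if (fd >= 0) {
        if (read(fd, &seed, sizeof(seed)) != sizeof(seed))
            seed = getpid();    /* weak fallback, for illustration only */
        close(fd);
    }
    srandom(seed);
}

/* call before the first random() use, instead of in library init */
static void prng_init_lazy(void)
{
    pthread_once(&seed_once, seed_prng);
}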

More importantly, it is doing multiple traversals of the full pathname (bad) to run realpath() on the filename, and it processes /proc/mounts for each file to determine whether the file is on a Lustre filesystem. IMHO, it should be enough to return an error from ioctl(LL_IOC_LOV_SETSTRIPE) to determine that the file is not on a Lustre filesystem; or, if that check is really deemed necessary, we can do statfs() on the directory and check the magic without doing dozens of syscalls.
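
A sketch of that statfs() check (the helper name is mine; LL_SUPER_MAGIC is the Lustre superblock magic from the user headers):

#include <sys/vfs.h>

#ifndef LL_SUPER_MAGIC
#define LL_SUPER_MAGIC 0x0BD00BD0
#endif

static int dir_is_lustre(const char *dir)
{
    struct statfs sfs;

    if (statfs(dir, &sfs) < 0)
        return -1;
    /* one syscall instead of parsing /proc/mounts per file */
    return sfs.f_type == LL_SUPER_MAGIC;
}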

There is also extra overhead when a pool name is specified, to verify that it is valid; that could be a one-time check for repeated use of the same pool(s). The same goes for verifying stripe_size >= PAGE_SIZE, where again PAGE_SIZE does not change from one file to the next (see the sketch below).
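
Hoisting the PAGE_SIZE lookup, for example, is trivial (an illustrative sketch; caching validated pool names would follow the same shape):

#include <unistd.h>

static int stripe_size_ok(unsigned long long stripe_size)
{
    static long page_size;    /* fetched once, not per file */

    if (page_size == 0)
        page_size = sysconf(_SC_PAGESIZE);
    return stripe_size >= (unsigned long long)page_size;
}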

Finally, I observe in the debug logs that between open(O_LOV_DELAY_CREATE) and ioctl(LL_IOC_LOV_SETSTRIPE) there is a DLM lock callback from the MDS to the client for the layout lock (I think). It doesn't make sense to grant the MDS_INODELOCK_LAYOUT lock to a client opening with O_LOV_DELAY_CREATE when we know the client will immediately be changing it.

Comment by Gerrit Updater [ 01/Jun/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35028
Subject: LU-11264 utils: avoid pool overhead in lfs setstripe
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e1246538418d2b58f5457d91956c7c75a9fb1153

Comment by Gerrit Updater [ 07/Jun/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35091
Subject: LU-11264 llapi: reduce llapi_stripe_limit_check() overhead
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dfea5bd239531a91d19f603945601e39f9254f84

Comment by Gerrit Updater [ 20/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35091/
Subject: LU-11264 llapi: reduce llapi_stripe_limit_check() overhead
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1c8ce141c6b6b621cf0fc89d32e1e7a2db43729d

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35092/
Subject: LU-11264 llapi: clean up llapi_search_tgt() code
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 096db80e0810b4abb8e268a58623c23bf0b57d2c

Comment by Gerrit Updater [ 09/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35093/
Subject: LU-11264 llapi: clean up llapi_file_create_foreign()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: efe6615d7558ee6bda106546c55d92758bec528d
