[LU-11264] llapi_* routines demonstrate poor performance | Created: 17/Aug/18 | Updated: 14/Sep/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major |
| Reporter: | Robert Latham | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: | Argonne's Theta machine is a Cray running Lustre 2.7.3.74 |
| Attachments: | |
| Issue Links: | |
| Epic/Theme: | Performance |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I'm going to bug Cray about this but Andreas said "we can't fix problems that aren't in Jira" so here I am. In the old days one would get striping information via an ioctl:
struct lov_user_md *lum;
size_t lumsize = sizeof(struct lov_user_md) +
                 LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data);

lum = malloc(lumsize);
lum->lmm_magic = LOV_USER_MAGIC;
ret = ioctl(fd, LL_IOC_LOV_GETSTRIPE, (void *)lum);
Then the llapi_ routines showed up:
ret = llapi_file_get_stripe(opt_file, lum);
and even more sophisticated llapi_layout_ routines:
struct llapi_layout *layout;

layout = llapi_layout_get_by_fd(fd, 0);
llapi_layout_stripe_count_get(layout, &(lum->lmm_stripe_count));
llapi_layout_stripe_size_get(layout, &(lum->lmm_stripe_size));
llapi_layout_ost_index_get(layout, 0, &(lum->lmm_stripe_offset));
I haven't done a full scaling run, but small-scale tests don't look great for the newer interfaces. Average time to get the stripe size for a 2-node, 128-process MPI job:
ioctl: 17.2 ms
llapi_file_get_stripe: 142.0 ms
llapi_layout: 2220 ms
Wow! I want to use the new interfaces but these overheads are bonkers.
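A minimal sketch of the kind of per-rank measurement being reported here, assuming an MPI_Wtime-based harness around the llapi_layout calls shown above (this is a hypothetical illustration, not the attached test program):

/* Hypothetical illustration only (not the attached test program): each rank
 * opens the shared file and times the four llapi_layout calls together,
 * then rank 0 prints the average across ranks. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
        struct llapi_layout *layout;
        uint64_t count, size, ost;
        double t0, t1, mine, sum;
        int rank, nprocs, fd;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (argc < 2) {
                if (rank == 0)
                        fprintf(stderr, "usage: %s FILE\n", argv[0]);
                MPI_Finalize();
                return 1;
        }

        fd = open(argv[1], O_RDONLY);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        layout = llapi_layout_get_by_fd(fd, 0);
        llapi_layout_stripe_count_get(layout, &count);
        llapi_layout_stripe_size_get(layout, &size);
        llapi_layout_ost_index_get(layout, 0, &ost);
        t1 = MPI_Wtime();

        llapi_layout_free(layout);
        close(fd);

        mine = t1 - t0;
        MPI_Reduce(&mine, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
                printf("avg llapi_layout time: %.3f ms\n",
                       sum / nprocs * 1000.0);

        MPI_Finalize();
        return 0;
}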
|
| Comments |
| Comment by Robert Latham [ 17/Aug/18 ] |
|
I attached a graph of a scaling experiment on NERSC's Cray Lustre deployment on Cori from, I think, two years ago. The problem is not a recent regression. |
| Comment by Patrick Farrell (Inactive) [ 07/Sep/18 ] |
|
Robert, I'm a Cray person, but I'll discuss this here so as not to leave the Whamcloud guys in the cold. What exactly are you measuring? Is this the execution time of the function directly, within a binary, or the execution time of an entire binary containing just this function call? Can you share whatever test code you're using here, just to make our lives easier? A few more thoughts/questions:
Those last three functions (stripe_count_get, stripe_size_get, and ost_index_get) literally just access pointers in the memory given, so if they're taking milliseconds by themselves...
|
| Comment by Patrick Farrell (Inactive) [ 07/Sep/18 ] |
|
Also, can you compare execution time from one thread? If it's really as bad as your measurements indicate, it should be straightforward to do so. |
| Comment by Robert Latham [ 07/Sep/18 ] |
|
I attached the test program I used. It's testing a bunch of different things, but to get the routines in this test case:
Absolutely agreed that llapi_file_get_stripe should be expensive, since it is operating on a filename. I was astonished to see the llapi_layout approach take so long, considering it operates on the fd directly. I assumed the initial llapi_layout_get_by_fd call took all the time, but I was measuring the time to call the four routines together; after all, there's no reason for an application to call just get_by_fd. |
| Comment by Patrick Farrell (Inactive) [ 07/Sep/18 ] |
|
Thanks, Rob - I actually just grabbed your test and did a bunch of testing myself. I see a really wildly large gap even with 1 process - interesting. This is with debug turned on, on a small VM setup:
[root@cent7c01 cent7s01]# rm -f testfile; lctl clear; mpirun --allow-run-as-root -n 1 /tmp/a.out -c -i testfile; lctl dk > /tmp/out3
[root@cent7c01 cent7s01]# rm -f testfile; lctl clear; mpirun --allow-run-as-root -n 1 /tmp/a.out -c -a testfile; lctl dk > /tmp/out
So that's "ioctl" at 2.6 ms, get_stripe at 3.7 ms. That's... wild. I'll try to dig in later.
EDIT: The simple create+ioctl does 2, one of which is close(). The llapi_get_stripe function does 3 (create, some sort of get, then close). The layout get function does 4.
I've got some ideas why - I'll take a further look next week. |
| Comment by Robert Latham [ 07/Sep/18 ] |
|
Oh yeah, lots of variance in these tests. It's not statistically rigorous, but the above numbers came from me running each test 5 times and reporting an average. |
| Comment by Patrick Farrell (Inactive) [ 10/Sep/18 ] |
|
Robert,
Here's the problem in a nutshell. lfs getstripe asks the client for the stripe count information that it has locally, and gets it locally if it's present. This is fast because that information is generally already cached on the client when the file is open.
The other two functions do not ask the client for the layout as such - instead, the ioctl they use directly reads the layout extended attribute from the MDS. This read of an extended attribute cannot be satisfied by the layout information held in memory on the client (which it usually has if the file is open); instead, it goes directly to the MDS.
The solution would be to rewrite the implementation of those new ioctls to get the layout that the client already has in memory, and only request it from the MDS if it's not present. |
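To make that distinction concrete in userspace terms, here is a minimal sketch contrasting the two request paths. The fgetxattr("lustre.lov") call is used only as an illustration of "read the layout extended attribute" and is an assumption for this sketch, not a statement about what llapi_layout_get_by_fd does internally on any particular client version:

/* Illustration only (not the llapi implementation): two ways userspace can
 * ask for a file's layout.  Per the explanation above, a request that is
 * answered from the client's cached layout is cheap, while one that turns
 * into a getxattr of the layout EA has to go to the MDS. */
#include <sys/ioctl.h>
#include <sys/xattr.h>
#include <lustre/lustreapi.h>

/* Old-style path: LL_IOC_LOV_GETSTRIPE on an open fd; per the comment
 * above, the client can satisfy this from the layout it already holds. */
static int layout_via_ioctl(int fd, struct lov_user_md *lum)
{
        lum->lmm_magic = LOV_USER_MAGIC;
        return ioctl(fd, LL_IOC_LOV_GETSTRIPE, lum);
}

/* Assumed "new" path, shown for illustration: fetch the "lustre.lov"
 * extended attribute, which (per the comment above) is forwarded to the
 * MDS rather than answered from the client's in-memory layout. */
static ssize_t layout_via_xattr(int fd, void *buf, size_t len)
{
        return fgetxattr(fd, "lustre.lov", buf, len);
}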
| Comment by Patrick Farrell (Inactive) [ 10/Sep/18 ] |
|
By the way, thanks for your quick reply last week. |
| Comment by Andreas Dilger [ 31/May/19 ] |
|
I just happened to be looking at this code on the setstripe side and see a number of places where it could be optimized. Running a simple test comparing lfs setstripe with touch and createmany shows a big difference in performance (files removed between each test):
$ time lfs setstripe -c 1 /myth/tmp/tmp/f{1..1000}   real 0m11.089s, user 0m0.115s, sys 0m2.132s
$ time touch /myth/tmp/tmp/f{1..1000}                real 0m6.400s,  user 0m0.018s, sys 0m0.863s
$ time createmany -o /myth/tmp/tmp/f 1000            real 0m2.329s,  user 0m0.008s, sys 0m0.185s
It is pretty clear from the elapsed time and the system time that lfs setstripe is doing a lot more work for each file create compared to the others (touch is additionally setting the timestamps). Looking at the strace of lfs setstripe -c1 on a client shows what is going on:
gettid() = 3237
open("/dev/urandom", O_RDONLY|O_NOFOLLOW) = 3
read(3, "Fg#\306", 4) = 4
close(3) = 0
lstat("/myth", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
lstat("/myth/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0
lstat("/myth/tmp/tmp", {st_mode=S_IFDIR|0775, st_size=53248, ...}) = 0
lstat("/myth/tmp/tmp/foo", 0x7ffe9dc99410) = -1 ENOENT (No such file or directory)
lstat("/myth", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
lstat("/myth/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0
lstat("/myth/tmp/tmp", {st_mode=S_IFDIR|0775, st_size=53248, ...}) = 0
lstat("/myth/tmp/tmp/foo", 0x7ffe9dc99410) = -1 ENOENT (No such file or directory)
lstat("/myth", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
lstat("/myth/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0
lstat("/myth/tmp/tmp", {st_mode=S_IFDIR|0775, st_size=53248, ...}) = 0
open("/proc/mounts", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3, "rootfs / rootfs rw 0 0\nproc /pro"..., 1024) = 1024
read(3, "tmp ext4 ro,relatime,barrier=1,d"..., 1024) = 88
read(3, "", 1024) = 0
close(3) = 0
open("/myth/tmp/tmp/foo", O_WRONLY|O_CREAT|O_LOV_DELAY_CREATE, 0644) = 3
ioctl(3, LL_IOC_LOV_SETSTRIPE, 0x942060) = 0
Firstly, it appears that the liblustreapi_init() function is always doing some work to initialize the PRNG, but the PRNG is only used in a small number of cases. This initialization could be skipped unless the PRNG is actually needed in particular functions; even though it is only a one-time thing, it is still needless overhead.
More importantly, it is doing multiple traversals of the full pathname (bad) to do realpath() on the filename, and it processes /proc/mounts for each file to determine whether the file is on a Lustre filesystem or not. IMHO, it should be enough to return an error from ioctl(LL_IOC_LOV_SETSTRIPE) to determine that the file is not on a Lustre filesystem, or if that check is really deemed necessary we can do statfs() on the directory and check the magic without doing dozens of syscalls.
There is also extra overhead if a pool name is specified, to verify that it is valid; that could be a one-time check for repeated use of the same pool(s). The same goes for verifying stripe_size >= PAGE_SIZE, where again the PAGE_SIZE does not change from one file to the next.
Finally, I observe in the debug logs that between open(O_LOV_DELAY_CREATE) and ioctl(LL_IOC_LOV_SETSTRIPE) there is a DLM lock callback from the MDS to the client for the layout lock (I think). It doesn't make sense to grant the MDS_INODELOCK_LAYOUT lock to the client opening with O_LOV_DELAY_CREATE when we know the client will immediately be changing it. |
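As a rough illustration of the statfs()-based check mentioned above (a sketch only, not the eventual patch; the helper name is hypothetical, and LL_SUPER_MAGIC is the Lustre superblock magic from the user headers):

/* Sketch of the statfs()-based Lustre check described above: one statfs()
 * on the target directory and a magic-number compare, instead of walking
 * /proc/mounts with dozens of syscalls for every file.  The helper name
 * is hypothetical; LL_SUPER_MAGIC comes from the Lustre user headers. */
#include <sys/vfs.h>

#ifndef LL_SUPER_MAGIC
#define LL_SUPER_MAGIC 0x0BD00BD0
#endif

static int path_is_lustre(const char *path)
{
        struct statfs sfs;

        if (statfs(path, &sfs) < 0)
                return -1;      /* caller can inspect errno */

        return sfs.f_type == LL_SUPER_MAGIC;
}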
| Comment by Gerrit Updater [ 01/Jun/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35028 |
| Comment by Gerrit Updater [ 07/Jun/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35091 |
| Comment by Gerrit Updater [ 20/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35091/ |
| Comment by Gerrit Updater [ 12/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35092/ |
| Comment by Gerrit Updater [ 09/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35093/ |