[LU-10948] client cache open lock after N opens Created: 24/Apr/18 Updated: 22/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Bob Ciotti | Assignee: | Oleg Drokin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS servers / SLES clients |
| Epic/Theme: | Performance |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Listed as minor, but when a user does this we start to get phone calls from other users and then have to page the POC to identify the code/user. The workaround is to terminate the user's job(s). Oleg has said that Lustre has an existing feature for a client to acquire the open lock, but it is off by default; this mimics NFS behavior. The ideal change would let us specify a number of times a file is opened on a single client at which the lock is acquired (e.g. the 10th open). The use case is a naive user who loops in this way on something like 5000+ threads in Java:

do_until_the_sun_turns_black() {
    fd = open(*my_thread_ID, O_APPEND)
    calculate_something_small_but_useful()
    write(fd, *fortytwo, 42)
    close(fd)
}

Users often don't have complete control over the code they run and as a result may not be able to quickly make even simple changes. |
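For concreteness, a minimal C sketch of the same per-thread pattern (illustrative only; the real workload is Java across thousands of threads, and the file name, payload size, and loop structure here are made up):

#include <fcntl.h>
#include <unistd.h>

static void append_loop(const char *per_thread_file)
{
    char fortytwo[42] = { 0 };

    for (;;) {                      /* "until the sun turns black" */
        int fd = open(per_thread_file, O_WRONLY | O_APPEND);
        if (fd == -1)
            break;
        /* calculate_something_small_but_useful() would go here */
        write(fd, fortytwo, sizeof(fortytwo));
        close(fd);
    }
}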
| Comments |
| Comment by Oleg Drokin [ 25/Apr/18 ] |
|
So I took an initial look into this with a simple patch that implements an inode open counter. This uncovered a number of unanticipated problems: 1. If you never properly look up the file before opening it (or do an ls of the directory that contains it), the lookups don't match (the invalid flag in the dentry makes dcompare reject it), so we always enter Lustre via ll_atomic_open() with a new/negative dentry. That means we have no inode, so we cannot reach the counter until it's too late. I am somewhat surprised to find that we do not return a lookup lock from an open request even when the open lock was not requested, since normally we'd want to do that anyway and also include a layout lock (saving an RPC to fetch the layout so we can start doing I/O right away), but I guess this is a function of mdtest penalizing us if we return any lock? There's still some looking to be done here on my part, I guess. |
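To make the two access patterns above concrete, a small user-space sketch (illustrative only; whether the kernel actually enters ll_atomic_open() with a negative dentry depends on the dcache state at the time):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Pattern 1: open() with no prior lookup -- the dentry is new/negative,
 * so there is no inode on the client yet to hold an open counter. */
int open_without_lookup(const char *path)
{
    return open(path, O_WRONLY | O_APPEND);
}

/* Pattern 2: stat() the file (or ls its directory) first -- the lookup
 * instantiates the inode, so a per-inode counter would be reachable. */
int open_after_lookup(const char *path)
{
    struct stat st;

    if (stat(path, &st) == -1)
        return -1;
    return open(path, O_WRONLY | O_APPEND);
}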
| Comment by Gerrit Updater [ 25/Apr/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/32156 |
| Comment by Gerrit Updater [ 25/Apr/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/32157 |
| Comment by Gerrit Updater [ 25/Apr/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/32158 |
| Comment by Oleg Drokin [ 25/Apr/18 ] |
|
I just pushed 3 patches that make the problem go away. The 3rd patch is a bit of a WIP in that you cannot actually tune the counter value, but it is good enough for quick verification. I don't know if this will apply to 2.9; that is something to try at some other time, I guess (please confirm the exact version you would be able to try this on). I tested the patches with this reproducer and it does the expected thing:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    char fortytwo[42];
    int i;

    for (i = 0; i < 100; i++) {
        fd = open(argv[1], O_WRONLY|O_APPEND);
        if (fd == -1) {
            perror("open");
            return -1;
        }
        write(fd, fortytwo, sizeof(fortytwo));
        close(fd);
    }
    return 0;
}

Note you need to create the file beforehand. I also verified that if you add O_CREAT (which does not require creating the file beforehand) things still work as expected and the caching still works too. Likely side effect: you won't be able to write into your executables on Lustre after 10 executions - we'll need to pull in |
| Comment by Oleg Drokin [ 25/Apr/18 ] |
|
Also, I forgot to add that the patches do not fix problem #3 outlined above, but it is sidestepped by virtue of adding the lookup bit into the mix, which suddenly makes that whole lock discoverable by ELC as a conflicting one. We still need to add the update bit into the ELC match pattern for open requests as a separate patch, I suspect. |
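Conceptually, the reasoning above amounts to a bitmask intersection: ELC only treats a held lock as a conflict if it covers some of the inode bits the new request wants. A hedged sketch of that idea (not the actual Lustre ELC code; the demo bits merely mirror the MDS_INODELOCK_* flags and the function is made up):

#define DEMO_INODELOCK_LOOKUP  0x01   /* mirrors MDS_INODELOCK_LOOKUP */
#define DEMO_INODELOCK_UPDATE  0x02   /* mirrors MDS_INODELOCK_UPDATE */
#define DEMO_INODELOCK_OPEN    0x04   /* mirrors MDS_INODELOCK_OPEN   */

/* A lock holding only the OPEN bit never intersects what an open request
 * asks to cancel; once LOOKUP is part of the held bits, it does. */
static int demo_elc_sees_conflict(unsigned int held_bits, unsigned int wanted_bits)
{
    return (held_bits & wanted_bits) != 0;
}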
| Comment by Bob Ciotti [ 25/Apr/18 ] |
|
From JLan (our build engineer): "All 3 patches were applied to nas-2.10.3 cleanly. Jay"
— so we will test soon |
| Comment by Mahmoud Hanafi [ 18/May/18 ] |
|
Attaching our testing results.
|
| Comment by Oleg Drokin [ 18/May/18 ] |
|
Thanks for the results! Yes, change 32156 is server-side and it's a pretty important part of the set; that's why you don't observe any positive impact when the server is not patched. That said, your open counts are still elevated, which means at least something is not working as planned, unless I misunderstood your test case. |
| Comment by Mahmoud Hanafi [ 20/May/18 ] |
|
This is my test case.
program multi_stats
! compile with ifort -o multi_stats multi_stats.f -lmpi
use mpi
integer (kind=8), parameter :: niter=10000
integer (kind=8) :: i
real (kind=8) :: t0, t1, t2
logical :: ex
character*50 :: filename
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world, myid, ierr)
call mpi_comm_size(mpi_comm_world, nprocs, ierr)
t0 = mpi_wtime()
write(filename,'(a,i0.8)') "test.", myid
! print *, 'my filename is', filename
open(10, file=filename, status="new", IOSTAT=IERR)
close(10)
do i = 1,niter
open(10, file=filename, status='old', position='append')
write ( 10,*) "test", i
close(10)
! if (myid .eq. nprocs-1) write(0,*) i
! call sleep(1)
call mpi_barrier(mpi_comm_world, ierr)
enddo
60 call mpi_barrier(mpi_comm_world, ierr)
t1 = mpi_wtime()
if (myid .eq. 0) print *, 'Total runtime = ',t1 - t0
call mpi_finalize(ierr)
end
|
| Comment by Oleg Drokin [ 21/May/18 ] |
|
Does the result change if you only run one instance of it, not 10 (are all 10 on the same node)? Do you have the ability to test on 2.11? I wonder if it's just some 2.10 difference that makes this not work as expected, though a cursory check does not seem to indicate anything like that. If a single-process test still results in elevated open counts, please try my C reproducer; if that one works as expected, please run your reproducer under strace. |
| Comment by Mahmoud Hanafi [ 23/May/18 ] |
|
Here is a comparison of the C code vs the FORTRAN, including strace of the FORTRAN.

C, 100 iterations
-----------------
/proc/fs/lustre/mdc/nbptest2-MDT0000-mdc-ffff88203d2dd800/md_stats
  close 9
  intent_lock 11
/proc/fs/lustre/mdc/nbptest2-MDT0000-mdc-ffff88203d2dd800/stats
  req_waittime 29
  req_active 29
  mds_close 9
  ldlm_cancel 9
/proc/fs/lustre/llite/nbptest2-ffff88203d2dd800/stats
  write_bytes 100
  open 100
  close 100
  getxattr 10
  getxattr_hits 9
  inode_permission 200
/proc/fs/lustre/mdt/nbptest2-MDT0000/exports/10.151.31.132@o2ib/ldlm_stats
  ldlm_enqueue 11
  ldlm_cancel 9
/proc/fs/lustre/mdt/nbptest2-MDT0000/exports/10.151.31.132@o2ib/stats
  open 10
  close 9
  getxattr 1

Fortran, 100 iterations
-----------------------
/proc/fs/lustre/mdc/nbptest2-MDT0000-mdc-ffff88203d2dd800/md_stats
  close 99
  intent_lock 302
  setattr 100
/proc/fs/lustre/mdc/nbptest2-MDT0000-mdc-ffff88203d2dd800/stats
  req_waittime 497
  req_active 497
  mds_close 99
  ldlm_cancel 99
  obd_ping 1
/proc/fs/lustre/llite/nbptest2-ffff88203d2dd800/stats
  write_bytes 100
  ioctl 200
  open 100
  close 100
  seek 200
  truncate 100
  getattr 101
  getxattr 100
  getxattr_hits 100
  inode_permission 406
/proc/fs/lustre/mdt/nbptest2-MDT0000/exports/10.151.31.132@o2ib/ldlm_stats
  ldlm_enqueue 198
  ldlm_cancel 99
/proc/fs/lustre/mdt/nbptest2-MDT0000/exports/10.151.31.132@o2ib/stats
  open 99
  close 99
  getattr 100
  setattr 100
|
| Comment by Patrick Farrell (Inactive) [ 01/Aug/18 ] |
|
For what it's worth, we've observed a 70-80% reduction in the mdsrate open() benchmark from the change that https://review.whamcloud.com/#/c/32156/ reverses. Just something to consider - The impact of that change alone is enormous. |
| Comment by Andreas Dilger [ 02/Aug/18 ] |
|
Patrick, when you say "reduction in the mdsrate open() benchmark", does that mean "reduction in the time taken" (== good), or "reduction in the open rate" (== bad)? |
| Comment by Patrick Farrell (Inactive) [ 02/Aug/18 ] |
|
Ah, sorry, that was ambiguous. Reduction in the open rate, so, bad news. On a real system, this mdsrate open benchmark drops from 35K opens/second to around 11K. On a much smaller VM, I see a drop from 8K opens/second to 4K with this benchmark. As discussed elsewhere, I'll open an LU for this in a few minutes. |
| Comment by Gerrit Updater [ 01/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32157/ |
| Comment by Gerrit Updater [ 03/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35039 |
| Comment by Gerrit Updater [ 16/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35039/ |
| Comment by James A Simmons [ 22/Sep/19 ] |
|
Is this resolved? |
| Comment by Alex Kulyavtsev [ 17/Apr/20 ] |
|
Is this resolved? In what Lustre release is this patch available?
|
| Comment by Andreas Dilger [ 29/Apr/20 ] |
|
This issue has not been resolved, as the main functional patches have not landed yet. The few patches that have landed so far just clean up issues in the code. There are two outstanding patches from this ticket that add a simple client-side per-inode counter and threshold to decide whether the client should fetch the open lock:
There is also an older patch on LU-7915 that just provides a whole-client tunable to enable opencache:
Getting the patches for this ticket updated to the latest master and reviewing/investigating the test failures would be useful for getting them landed sooner. The LU-7915 patch might be useful for testing how much opencache would help a specific workload, but always-on opencache has been shown to be a performance hit in some cases in the past, which is why we didn't enable opencache permanently. Still, landing patch 19664 would at least give an available tunable to turn this on/off on a per-client basis if it shows a net improvement for a site's workload. |
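For illustration only, here is a minimal user-space-style sketch of the per-inode counter + threshold idea those outstanding patches implement; the structure, field, and function names below are hypothetical and are not Lustre's actual data structures or tunables.

/* Hypothetical sketch: count opens per inode on the client and only
 * request the open lock from the MDS once a threshold is crossed. */
struct demo_inode_info {
    unsigned int open_count;                        /* opens seen on this client */
};

static unsigned int demo_open_lock_threshold = 10;  /* e.g. "acquire on the 10th open" */

static int demo_should_request_open_lock(struct demo_inode_info *lli)
{
    /* Request (and then cache) the open lock only after the file has been
     * opened enough times on this client to make caching worthwhile. */
    return ++lli->open_count >= demo_open_lock_threshold;
}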
| Comment by Gerrit Updater [ 25/Dec/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41091 |
| Comment by Gerrit Updater [ 04/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41091/ |
| Comment by Gerrit Updater [ 12/May/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32158/ |
| Comment by Jay Lan (Inactive) [ 22/May/21 ] |
|
Can you port #32158 to b2_12? Thanks! |
| Comment by Gerrit Updater [ 03/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43903 |
| Comment by Gerrit Updater [ 03/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43907 |
| Comment by Gerrit Updater [ 30/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43907/ |
| Comment by Andreas Dilger [ 22/Jan/24 ] |
|
Patch https://review.whamcloud.com/32156 "LU-10948 mdt: Always return lookup lock on opens" is not landed yet. |