
[LU-10948] client cache open lock after N opens

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Environment: CentOS server / SLES client

    Description

      Listed as minor, but when a user does this we start to get phone calls from other users and then page the POC to identify the code/user. The workaround is to terminate the user's job(s).

      Oleg has said that Lustre has an existing feature for a client to acquire an open lock, but it is off by default. This mimics NFS behavior.

      The ideal change would be that we can specify the number of times a file is opened on a single client, at which point the lock is acquired (e.g. on the 10th open).
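
      For illustration, here is a minimal user-space sketch of the counting idea; the structure, names, and threshold below are hypothetical and do not reflect the actual Lustre client code:

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* Hypothetical threshold; the request above is that this be tunable. */
      #define OPEN_LOCK_THRESHOLD 10

      /* Per-inode state the client might keep (illustrative, not Lustre's). */
      struct inode_open_state {
              atomic_uint open_count;
      };

      /* Returns true on the open that crosses the threshold, i.e. the open on
       * which the client would additionally request a cached open lock. */
      static bool should_request_open_lock(struct inode_open_state *s)
      {
              return atomic_fetch_add(&s->open_count, 1) + 1 == OPEN_LOCK_THRESHOLD;
      }

      int main(void)
      {
              struct inode_open_state st = { 0 };
              int i;

              for (i = 1; i <= 12; i++)
                      if (should_request_open_lock(&st))
                              printf("open #%d: request open lock\n", i);
              return 0;
      }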

      The use case is a naive user who loops in this way across 5000+ threads in Java:

      do_until_the_sun_turns_black()
      {
          fd = open(*my_thread_ID, O_APPEND)
          calculate_something_small_but_useful()
          write(fd, *fortytwo, 42)
          close(fd)
      }

      Users often don't have complete control over the code they run and as a result may not be able to quickly make even simple changes.

      Attachments

        Issue Links

          Activity

            [LU-10948] client cache open lock after N opens

            adilger Andreas Dilger added a comment -

            Patrick, when you say "reduction in the mdsrate open() benchmark", does that mean "reduction in the time taken" (== good), or "reduction in the open rate" (== bad)?

            paf Patrick Farrell (Inactive) added a comment -

            For what it's worth, we've observed a 70-80% reduction in the mdsrate open() benchmark from the change that https://review.whamcloud.com/#/c/32156/ reverses.

            Just something to consider - the impact of that change alone is enormous.

            mhanafi Mahmoud Hanafi added a comment -

            Here is a comparison of the C code vs. FORTRAN, including an strace of the FORTRAN run.

             C 100 iterations
            ----------------  
            /proc/fs/lustre/mdc/nbptest2-MDT0000-mdc-ffff88203d2dd800/md_stats
            close               9
            intent_lock         11
            /proc/fs/lustre/mdc/nbptest2-MDT0000-mdc-ffff88203d2dd800/stats
            req_waittime        29
            req_active          29
            mds_close           9
            ldlm_cancel         9
            /proc/fs/lustre/llite/nbptest2-ffff88203d2dd800/stats
            write_bytes         100
            open                100
            close               100
            getxattr            10
            getxattr_hits       9
            inode_permission    200
            
            /proc/fs/lustre/mdt/nbptest2-MDT0000/exports/10.151.31.132@o2ib/ldlm_stats
            ldlm_enqueue        11
            ldlm_cancel         9
            /proc/fs/lustre/mdt/nbptest2-MDT0000/exports/10.151.31.132@o2ib/stats
            open                10
            close               9
            getxattr            1
            
            Fortran 100 iterations
            ---------------------
            /proc/fs/lustre/mdc/nbptest2-MDT0000-mdc-ffff88203d2dd800/md_stats
            close               99
            intent_lock         302 
            setattr             100
            /proc/fs/lustre/mdc/nbptest2-MDT0000-mdc-ffff88203d2dd800/stats
            req_waittime        497
            req_active          497
            mds_close           99
            ldlm_cancel         99
            obd_ping            1
            /proc/fs/lustre/llite/nbptest2-ffff88203d2dd800/stats
            write_bytes         100
            ioctl               200
            open                100
            close               100
            seek                200
            truncate            100
            getattr             101
            getxattr            100
            getxattr_hits       100 
            inode_permission    406
            
            /proc/fs/lustre/mdt/nbptest2-MDT0000/exports/10.151.31.132@o2ib/ldlm_stats
            ldlm_enqueue        198
            ldlm_cancel         99
            /proc/fs/lustre/mdt/nbptest2-MDT0000/exports/10.151.31.132@o2ib/stats
            open                99
            close               99
            getattr             100
            setattr             100
            

             

            fortran.100iter.strace
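
            Reading the counters above: the C run reaches the MDT with only ~10 opens out of 100 (the open lock appears to be cached after the first few opens), while the Fortran run sends 99 opens plus 100 setattrs, which suggests the per-iteration truncate/setattr keeps defeating the open cache. As a rough illustration, here is a C analogue of the per-iteration syscall pattern the llite counters (extra getattr, seek, truncate per open) suggest the Fortran runtime performs; this is a sketch inferred from the stats above, not from the runtime source:

            #include <fcntl.h>
            #include <string.h>
            #include <sys/stat.h>
            #include <unistd.h>

            /* One Fortran-runtime-like append iteration, inferred from the
             * llite counters above; illustrative only, not the actual
             * Fortran runtime implementation. */
            static int fortran_like_iteration(const char *path, const char *line)
            {
            	struct stat st;
            	int fd = open(path, O_WRONLY);	/* open the 'old' file */

            	if (fd == -1)
            		return -1;
            	fstat(fd, &st);			/* the getattr in the stats */
            	lseek(fd, 0, SEEK_END);		/* seek to the append position */
            	ftruncate(fd, st.st_size);	/* the truncate/setattr in the stats */
            	write(fd, line, strlen(line));
            	return close(fd);
            }

            int main(int argc, char **argv)
            {
            	int i;

            	for (i = 0; i < 100 && argc > 1; i++)
            		if (fortran_like_iteration(argv[1], "test\n"))
            			return 1;
            	return 0;
            }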

            green Oleg Drokin added a comment -

            Does the result change if you only run one instance of it, not 10? (Are all 10 on the same node?)

            Do you have the ability to test on 2.11? I wonder if it's just some 2.10 difference that makes this not work as expected, though a cursory check does not seem to indicate anything like that.

            If a single-process test still results in elevated open counts, please try my C reproducer. If that one works as expected, please run your reproducer under strace.


            mhanafi Mahmoud Hanafi added a comment -

            This is my test case.

                   program multi_stats
            ! compile with ifort -o multi_stats multi_stats.f -lmpi
                  use mpi
                  integer (kind=8), parameter :: niter=10000
                  integer (kind=8) :: i
                  real (kind=8) :: t0, t1, t2
                  logical :: ex
                  character*50 :: filename
                  call mpi_init(ierr)
                  call mpi_comm_rank(mpi_comm_world, myid, ierr)
                  call mpi_comm_size(mpi_comm_world, nprocs, ierr)
                  t0 = mpi_wtime()
                  write(filename,'(a,i0.8)') "test.", myid
            !      print *, 'my filename is', filename
                  open(10, file=filename, status="new", IOSTAT=IERR)
                  close(10)
                  do i = 1,niter
                     open(10, file=filename, status='old', position='append')
                     write ( 10,*) "test", i
                     close(10)
            !      if (myid .eq. nprocs-1) write(0,*) i
            !      call sleep(1)
                  call mpi_barrier(mpi_comm_world, ierr)
                  enddo
             60   call mpi_barrier(mpi_comm_world, ierr)
                  t1 = mpi_wtime()
                  if (myid .eq. 0) print *, 'Total runtime = ',t1 - t0
                  call mpi_finalize(ierr)
                  end 
            
            green Oleg Drokin added a comment -

            Thanks for the results!

            Yes, change 32156 is server-side, and it's a pretty important part of the set; that's why you don't observe any positive impact when you don't patch the server.

            That said, your open counts are still elevated, which means at least something is not working as planned, unless I misunderstood your testcase.


            mhanafi Mahmoud Hanafi added a comment -

            Attaching our testing results.

            1. We observed a big difference in run time.
            2. Non-patched clients mixed with patched clients had the same results as patched clients.

            LU-10948_testing.pdf

             


            Bob.C Bob Ciotti (Inactive) added a comment -

            From JLan (our build engineer):

            All 3 patches were applied to nas-2.10.3 cleanly.

            Jay

             

            — so we test soon

            green Oleg Drokin added a comment -

            Also, I forgot to add that the patches do not fix problem #3 as outlined, but it's sidestepped by virtue of adding the lookup bit into the mix, which suddenly makes that whole lock discoverable by ELC as a conflicting one.

            We still need to add the update bit into the ELC match pattern for open requests as a separate patch, I suspect.
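
            As a schematic of the inodebits matching involved here (the MDS_INODELOCK_* bit names and values match the Lustre headers, but the matching helper below is illustrative, not the actual ELC code):

            #include <stdbool.h>
            #include <stdio.h>

            /* Inodebits as defined in Lustre; the helper is an illustration,
             * not the real ELC (early lock cancel) implementation. */
            #define MDS_INODELOCK_LOOKUP	0x01
            #define MDS_INODELOCK_UPDATE	0x02
            #define MDS_INODELOCK_OPEN	0x04

            /* ELC treats a cached lock as cancellable only if the bits it
             * holds intersect the bits the new request asks for. */
            static bool elc_sees_conflict(unsigned int held, unsigned int requested)
            {
            	return (held & requested) != 0;
            }

            int main(void)
            {
            	unsigned int before = MDS_INODELOCK_OPEN;	/* open bit only */
            	unsigned int after = MDS_INODELOCK_OPEN | MDS_INODELOCK_LOOKUP;

            	/* A request matching on LOOKUP misses the old lock (0) but
            	 * discovers the lock that also carries the lookup bit (1). */
            	printf("before: %d, after: %d\n",
            	       elc_sees_conflict(before, MDS_INODELOCK_LOOKUP),
            	       elc_sees_conflict(after, MDS_INODELOCK_LOOKUP));
            	return 0;
            }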

            green Oleg Drokin added a comment - edited

            I just pushed 3 patches that make the problem go away. The 3rd patch is a bit of a WIP in that you cannot actually tune the counter value, but it's good for quick verification. I don't know if this will apply to 2.9; that's something to try at some other time, I guess (please confirm the exact version you would be able to try this on).

            I tested the patches with this reproducer and it does the expected thing:

            #include <stdio.h>
            #include <stdlib.h>
            #include <unistd.h>
            #include <fcntl.h>
            
            int main(int argc, char **argv)
            {
            	int fd;
            	char fortytwo[42];
            	int i;
            
            	for (i = 0; i < 100; i++) {
            		fd = open(argv[1], O_WRONLY|O_APPEND);
            		if (fd == -1) {
            			perror("open");
            			return -1;
            		}
            
            		write(fd, fortytwo, sizeof(fortytwo));
            		close(fd);
            	}
            
            	return 0;
            }
            

            Note you need to create the file beforehand.

            I also verified that if you add O_CREAT (which does not require creating the file beforehand), things still work as expected and the caching still works too.

            Likely side effect: you won't be able to write into your executables on Lustre after 10 executions; we'll need to pull in LU-4398 for that.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/32158
            Subject: LU-10948 llite: WIP! Introduce inode open counter
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 07d1059dcd4f0e4965b8403cd1bcf52ba391343b


            People

              Assignee: green Oleg Drokin
              Reporter: Bob.C Bob Ciotti (Inactive)
              Votes: 0
              Watchers: 18
