
[LU-3277] LU-2139 may cause the performance regression

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.0
    • Environment: RHEL6.3 and current master
    • 2
    • 8114

    Description

      There is a performance regression on the current master (c864582b5d4541c7830d628457e55cd859aee005) if we have multiple IOR threads per client. As far as I can test, LU-2576 might be the cause of this performance regression. Here are quick test results on each commit.

      Client: commit ac37e7b4d101761bbff401ed12fcf671d6b68f9c

      # mpirun -np 8 /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      IOR-2.10.3: MPI Coordinated Test of Parallel I/O
      
      Run began: Sun May  5 12:24:09 2013
      Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
      Using synchronized MPI timer
      Start time skew across all tasks: 0.00 sec
      Path: /lustre/ior.out
      FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
      Participating tasks: 8
      Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
      task 0 on s08
      task 1 on s08
      task 2 on s08
      task 3 on s08
      task 4 on s08
      task 5 on s08
      task 6 on s08
      task 7 on s08
      
      Summary:
      	api                = POSIX
      	test filename      = /lustre/ior.out/file
      	access             = file-per-process
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file=constant task offsets = 1
      	clients            = 8 (8 per node)
      	repetitions        = 1
      	xfersize           = 1 MiB
      	blocksize          = 8 GiB
      	aggregate filesize = 64 GiB
      
      Using Time Stamp 1367781849 (0x5186b1d9) for Data Signature
      Commencing write performance test.
      Sun May  5 12:24:09 2013
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s) total(s)  iter
      ------    ---------  ---------- ---------  --------   --------   --------  --------   ----
      write     3228.38    8388608    1024.00    0.001871   20.30      1.34       20.30      0    XXCEL
      Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
      
      ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
      write        3228.38    3228.38     3228.38      0.00    3228.38    3228.38     3228.38      0.00  20.29996   8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL
      
      Max Write: 3228.38 MiB/sec (3385.20 MB/sec)
      
      Run finished: Sun May  5 12:24:30 2013
      

      Client: commit 5661651b2cc6414686e7da581589c2ea0e1f1969

      # mpirun -np 8 /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      IOR-2.10.3: MPI Coordinated Test of Parallel I/O
      
      Run began: Sun May  5 12:16:35 2013
      Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
      Using synchronized MPI timer
      Start time skew across all tasks: 0.00 sec
      Path: /lustre/ior.out
      FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
      Participating tasks: 8
      Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
      task 0 on s08
      task 1 on s08
      task 2 on s08
      task 3 on s08
      task 4 on s08
      task 5 on s08
      task 6 on s08
      task 7 on s08
      
      Summary:
      	api                = POSIX
      	test filename      = /lustre/ior.out/file
      	access             = file-per-process
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file=constant task offsets = 1
      	clients            = 8 (8 per node)
      	repetitions        = 1
      	xfersize           = 1 MiB
      	blocksize          = 8 GiB
      	aggregate filesize = 64 GiB
      
      Using Time Stamp 1367781395 (0x5186b013) for Data Signature
      Commencing write performance test.
      Sun May  5 12:16:35 2013
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s) total(s)  iter
      ------    ---------  ---------- ---------  --------   --------   --------  --------   ----
      write     550.28     8388608    1024.00    0.001730   119.10     2.76       119.10     0    XXCEL
      Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
      
      ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
      write         550.28     550.28      550.28      0.00     550.28     550.28      550.28      0.00 119.09623   8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL
      
      Max Write: 550.28 MiB/sec (577.01 MB/sec)
      
      Run finished: Sun May  5 12:18:34 2013
      

      In both tests, the servers were running the current master (c864582b5d4541c7830d628457e55cd859aee005).

      Attachments

        1. collectl.log (13 kB)
        2. stat.log (16 kB)


          Activity


            prakash Prakash Surya (Inactive) added a comment -

            Since I've been assigned to this, I'm marking it resolved. The "bad patch" was reverted, and there have been no reports of this since, which leads me to believe it is no longer an issue. Feel free to reopen if there is a compelling case to do so.

            General discussion of the LU-2139 issue and its pending patch stack is better suited to the LU-2139 ticket.

            prakash Prakash Surya (Inactive) added a comment -

            In case it proves useful, here's an example stack for a thread waiting in the state I described in my previous comment:

                fio           S 00000fffae72633c     0 59338  59283 0x00000000
                Call Trace:
                [c0000003e0deed20] [c0000003e0deede0] 0xc0000003e0deede0 (unreliable)
                [c0000003e0deeef0] [c000000000008e10] .__switch_to+0xc4/0x100
                [c0000003e0deef80] [c00000000042b0e0] .schedule+0x858/0x9c0
                [c0000003e0def230] [c00000000042b7c8] .schedule_timeout+0x1f8/0x240
                [c0000003e0def310] [c00000000042a444] .io_schedule_timeout+0x54/0x98
                [c0000003e0def3a0] [c00000000009ddfc] .balance_dirty_pages+0x294/0x390
                [c0000003e0def520] [c000000000095a2c] .generic_file_buffered_write+0x268/0x354
                [c0000003e0def660] [c000000000096074] .__generic_file_aio_write+0x374/0x3d8
                [c0000003e0def760] [c000000000096150] .generic_file_aio_write+0x78/0xe8
                [c0000003e0def810] [8000000006a7062c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
                [c0000003e0def8e0] [800000000249a81c] .cl_io_start+0xcc/0x220 [obdclass]
                [c0000003e0def980] [80000000024a2634] .cl_io_loop+0x194/0x2c0 [obdclass]
                [c0000003e0defa30] [80000000069ea208] .ll_file_io_generic+0x498/0x670 [lustre]
                [c0000003e0defb30] [80000000069ea864] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
                [c0000003e0defc00] [80000000069eab80] .ll_file_write+0x150/0x320 [lustre]
                [c0000003e0defce0] [c0000000000d1e9c] .vfs_write+0xd0/0x1c4
                [c0000003e0defd80] [c0000000000d208c] .SyS_write+0x54/0x98
                [c0000003e0defe30] [c000000000000580] syscall_exit+0x0/0x2c
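
            For what it's worth, here is a rough sketch of how a stack like this can be captured on a stalled writer (hypothetical commands; they assume the blocked process is named fio and that /proc/<pid>/stack is available on the client kernel):

                # dump the kernel stack of each fio process (needs CONFIG_STACKTRACE)
                for pid in $(pgrep fio); do
                    echo "=== pid $pid ==="
                    cat /proc/$pid/stack
                done

                # or dump every task's stack to the kernel log (requires sysrq to be enabled)
                echo t > /proc/sysrq-trigger
                dmesg | tail -n 500

            Either output should show whether the writers are parked in balance_dirty_pages.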
            

            jlevi Jodi Levi (Inactive) added a comment -

            Lowering priority as the LU-2139 patch was reverted.

            prakash Prakash Surya (Inactive) added a comment (edited) -

            > Prakash, attached are llite.*.unstable_stats and grep NFS_Unstable /proc/meminfo output during IOR.

            Thanks, I'll give these a look.

            > it's doing like this: writing; idle; writing; idle; ...

            OK, that sounds exactly like what I saw during my testing when I was hitting the kernel's dirty page limits. Allow me to explain.

            So, with that patch in place, we're now properly informing the kernel of our unstable pages by incrementing and decrementing the NFS_Unstable zone page counter. You can see this by watching the NFS_Unstable field in /proc/meminfo (before the patch it will always be zero; after the patch it will fluctuate with Lustre IO). That's all well and good, but how does it relate to the idle time seen during IO? What I think is happening is that the newly accounted-for unstable pages are factored in when the kernel calls balance_dirty_pages; balance_dirty_pages then determines the system is out of dirty page headroom and sleeps, waiting for writeback to flush the dirty pages to disk.
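
            For reference, a minimal way to watch that accounting live during a run (a sketch; these are standard /proc/meminfo fields on RHEL6):

                # sample dirty/writeback/unstable page accounting once per second
                watch -n 1 'grep -E "^(Dirty|Writeback|NFS_Unstable):" /proc/meminfo'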

            You can verify this theory by dumping the stacks of all processes while the IO is stalled and checking whether any of the write threads are stuck sleeping in balance_dirty_pages. What do these files show on your system: /proc/sys/vm/dirty_background_bytes, /proc/sys/vm/dirty_background_ratio, /proc/sys/vm/dirty_bytes, /proc/sys/vm/dirty_ratio? I think the dirty limit on the system is calculated as dirty_ratio * available_memory when dirty_bytes is 0. So in your case the limit is about 12.8 GB (assuming a dirty_ratio of 20, which is the default on my system, and dirty_bytes of 0). Seeing as the value of NFS_Unstable in your logs hovers around 11556864 kB, it's plausible that dirty + unstable is hitting the limit.
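
            As a rough illustration of that calculation (a sketch only; the kernel actually uses its "dirtyable" memory estimate rather than MemTotal, so treat the numbers as approximate):

                # approximate global dirty limit when dirty_bytes == 0
                ratio=$(cat /proc/sys/vm/dirty_ratio)                     # e.g. 20
                total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)   # e.g. ~64 GB client
                echo "approx dirty limit: $(( total_kb * ratio / 100 / 1024 )) MB"

                # what is currently counted against it
                awk '/^(Dirty|NFS_Unstable):/ {sum += $2} END {print sum/1024 " MB dirty+unstable"}' /proc/meminfo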

            If my above hypothesis is correct, the behavior you were seeing was expected and working as designed. The same problem would occur if you could push NFS at the same rates. If you had the full LU-2139 patch stack applied to the client and servers (http://review.whamcloud.com/4245, http://review.whamcloud.com/4374, http://review.whamcloud.com/4375, http://review.whamcloud.com/5935), I'd expect this effect to go away.

            If you can, try setting dirty_ratio and max_dirty_mb to a large fraction of memory and rerun the test.
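
            A hedged example of that tuning (illustrative values for a 64 GB client, not recommendations; max_dirty_mb here is the global /proc/sys/lustre knob referenced elsewhere in this ticket):

                # raise the VM dirty thresholds (illustrative values)
                sysctl -w vm.dirty_ratio=60
                sysctl -w vm.dirty_background_ratio=30

                # raise Lustre's global dirty page cap, e.g. to ~48 GB on a 64 GB client
                echo $((48 * 1024)) > /proc/sys/lustre/max_dirty_mb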


            ihara Shuichi Ihara (Inactive) added a comment -

            Prakash, attached are the llite.*.unstable_stats and grep NFS_Unstable /proc/meminfo output collected during IOR. I also collected a collectl log and found that the client was not writing the whole time; it's doing like this: writing; idle; writing; idle; ...

            Our client has 64 GB of memory, and I tested a larger max_dirty_mb (more than 3/4 of the memory size, up from the default of 1/2), but it didn't help either.


            ihara Shuichi Ihara (Inactive) added a comment -

            Peter, yes, the performance is back with the latest commit (2.3.65). However, I'm hitting another issue with IOR; I will investigate, and if it's a different problem I will open a new ticket.


            prakash Prakash Surya (Inactive) added a comment -

            How much memory is there on the client? And what is the commit frequency on the servers? I would expect the performance to be worse with the patch Niu points to if the client has sufficient bandwidth (combined with async server commits) to fill its available dirty page space with unstable pages. So this performance regression might be "working as intended", but that all depends on how many unstable pages the client is consuming during the test.

            Can you sample lctl get_param 'llite.*.unstable_stats' and grep NFS_Unstable /proc/meminfo a few times while the test is running, to give me an idea of how many unstable pages are being consumed? If this value is anywhere near the limit set in /proc/sys/lustre/max_dirty_mb then maybe we need to rethink the default value of max_dirty_mb and set it to something larger.
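
            A small sketch of that sampling loop, run on the client while IOR is active (assumes lctl is in PATH):

                # take 10 samples, 5 seconds apart, during the run
                for i in $(seq 1 10); do
                    date
                    lctl get_param 'llite.*.unstable_stats'
                    grep NFS_Unstable /proc/meminfo
                    cat /proc/sys/lustre/max_dirty_mb
                    sleep 5
                done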

            pjones Peter Jones added a comment -

            The patch has now been reverted. Ihara, can you confirm whether the problem has now disappeared?


            jlevi Jodi Levi (Inactive) added a comment -

            Oleg will revert the patch in LU-2139 that caused this regression.


            niu Niu Yawei (Inactive) added a comment -

            The commit 5661651b2cc6414686e7da581589c2ea0e1f1969 is from LU-2139, which added unstable pages accounting to Lustre, and the following code change could cause many more sync writes.

            @@ -1463,7 +1465,8 @@ static int osc_enter_cache_try(struct client_obd *cli,
                            return 0;
            
                    if (cli->cl_dirty + CFS_PAGE_SIZE <= cli->cl_dirty_max &&
            -           cfs_atomic_read(&obd_dirty_pages) + 1 <= obd_max_dirty_pages) {
            +           cfs_atomic_read(&obd_unstable_pages) + 1 +
            +           cfs_atomic_read(&obd_dirty_pages) <= obd_max_dirty_pages) {
                            osc_consume_write_grant(cli, &oap->oap_brw_page);
                            if (transient) {
                                    cli->cl_dirty_transit += CFS_PAGE_SIZE;
            @@ -1576,9 +1579,9 @@ void osc_wake_cache_waiters(struct client_obd *cli)
            
                            ocw->ocw_rc = -EDQUOT;
                            /* we can't dirty more */
            -               if ((cli->cl_dirty + CFS_PAGE_SIZE > cli->cl_dirty_max) ||
            -                   (cfs_atomic_read(&obd_dirty_pages) + 1 >
            -                    obd_max_dirty_pages)) {
            +               if (cli->cl_dirty + CFS_PAGE_SIZE > cli->cl_dirty_max ||
            +                   cfs_atomic_read(&obd_unstable_pages) + 1 +
            +                   cfs_atomic_read(&obd_dirty_pages) > obd_max_dirty_pages) {
                                    CDEBUG(D_CACHE, "no dirty room: dirty: %ld "
                                           "osc max %ld, sys max %d\n", cli->cl_dirty,
                                           cli->cl_dirty_max, obd_max_dirty_pages);
            

            I think the title of this ticket should be changed to "LU-2139 may cause the performance regression".


            People

              Assignee: prakash Prakash Surya (Inactive)
              Reporter: ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 9
