Lustre / LU-3277

LU-2139 may cause a performance regression

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.0
    • Environment: RHEL6.3 and current master
    • Severity: 2
    • 8114

    Description

      There is a performance regression on the current master (c864582b5d4541c7830d628457e55cd859aee005) when running multiple IOR threads per client. As far as I can tell from testing, LU-2576 might be the cause of this regression. Here are quick test results for each commit.

      client : commit ac37e7b4d101761bbff401ed12fcf671d6b68f9c

      # mpirun -np 8 /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      IOR-2.10.3: MPI Coordinated Test of Parallel I/O
      
      Run began: Sun May  5 12:24:09 2013
      Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
      Using synchronized MPI timer
      Start time skew across all tasks: 0.00 sec
      Path: /lustre/ior.out
      FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
      Participating tasks: 8
      Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
      task 0 on s08
      task 1 on s08
      task 2 on s08
      task 3 on s08
      task 4 on s08
      task 5 on s08
      task 6 on s08
      task 7 on s08
      
      Summary:
      	api                = POSIX
      	test filename      = /lustre/ior.out/file
      	access             = file-per-process
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file=constant task offsets = 1
      	clients            = 8 (8 per node)
      	repetitions        = 1
      	xfersize           = 1 MiB
      	blocksize          = 8 GiB
      	aggregate filesize = 64 GiB
      
      Using Time Stamp 1367781849 (0x5186b1d9) for Data Signature
      Commencing write performance test.
      Sun May  5 12:24:09 2013
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s) total(s)  iter
      ------    ---------  ---------- ---------  --------   --------   --------  --------   ----
      write     3228.38    8388608    1024.00    0.001871   20.30      1.34       20.30      0    XXCEL
      Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
      
      ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
      write        3228.38    3228.38     3228.38      0.00    3228.38    3228.38     3228.38      0.00  20.29996   8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL
      
      Max Write: 3228.38 MiB/sec (3385.20 MB/sec)
      
      Run finished: Sun May  5 12:24:30 2013
      

      client : commit 5661651b2cc6414686e7da581589c2ea0e1f1969

      # mpirun -np 8 /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      IOR-2.10.3: MPI Coordinated Test of Parallel I/O
      
      Run began: Sun May  5 12:16:35 2013
      Command line used: /lustre/IOR -w -b 8g -t 1m -e -C -F -vv -o /lustre/ior.out/file
      Machine: Linux s08 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP Sat Feb 9 21:55:32 PST 2013 x86_64
      Using synchronized MPI timer
      Start time skew across all tasks: 0.00 sec
      Path: /lustre/ior.out
      FS: 683.5 TiB   Used FS: 0.0%   Inodes: 5.0 Mi   Used Inodes: 0.0%
      Participating tasks: 8
      Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
      task 0 on s08
      task 1 on s08
      task 2 on s08
      task 3 on s08
      task 4 on s08
      task 5 on s08
      task 6 on s08
      task 7 on s08
      
      Summary:
      	api                = POSIX
      	test filename      = /lustre/ior.out/file
      	access             = file-per-process
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file=constant task offsets = 1
      	clients            = 8 (8 per node)
      	repetitions        = 1
      	xfersize           = 1 MiB
      	blocksize          = 8 GiB
      	aggregate filesize = 64 GiB
      
      Using Time Stamp 1367781395 (0x5186b013) for Data Signature
      Commencing write performance test.
      Sun May  5 12:16:35 2013
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s) total(s)  iter
      ------    ---------  ---------- ---------  --------   --------   --------  --------   ----
      write     550.28     8388608    1024.00    0.001730   119.10     2.76       119.10     0    XXCEL
      Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
      
      ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
      write         550.28     550.28      550.28      0.00     550.28     550.28      550.28      0.00 119.09623   8 8 1 1 1 1 0 0 1 8589934592 1048576 68719476736 -1 POSIX EXCEL
      
      Max Write: 550.28 MiB/sec (577.01 MB/sec)
      
      Run finished: Sun May  5 12:18:34 2013
      

      In both tests, the servers were running the current master (c864582b5d4541c7830d628457e55cd859aee005).
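
      For reference, the two client builds compared above map to these commits; a minimal sketch of reproducing the comparison (checkout only, since build and install steps are site specific and omitted here):

      # faster client, ~3228 MiB/s write in the run above
      git checkout ac37e7b4d101761bbff401ed12fcf671d6b68f9c
      # slower client, ~550 MiB/s write in the run above
      git checkout 5661651b2cc6414686e7da581589c2ea0e1f1969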

      Attachments

        1. collectl.log
          13 kB
        2. stat.log
          16 kB

        Issue Links

          Activity

            [LU-3277] LU-2139 may cause a performance regression
            prakash Prakash Surya (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            prakash Prakash Surya (Inactive) added a comment -

            Since I've been assigned to this, I'm marking it resolved. The "bad patch" was reverted, and there have been no reports of this since, which leads me to believe it is no longer an issue. Feel free to reopen if there is a compelling case to do so.

            General discussion of the LU-2139 issue and the pending patch stack is better suited to the LU-2139 ticket.
            adilger Andreas Dilger made changes -
            Assignee Original: Niu Yawei [ niu ] New: Prakash Surya [ prakash ]
            adilger Andreas Dilger made changes -
            Link New: This issue is blocking LU-2139 [ LU-2139 ]
            adilger Andreas Dilger made changes -
            Fix Version/s New: Lustre 2.6.0 [ 10595 ]

            prakash Prakash Surya (Inactive) added a comment -

            In case it proves useful, here's an example stack for a thread waiting in the state I described in my previous comment:

                fio           S 00000fffae72633c     0 59338  59283 0x00000000
                Call Trace:
                [c0000003e0deed20] [c0000003e0deede0] 0xc0000003e0deede0 (unreliable)
                [c0000003e0deeef0] [c000000000008e10] .__switch_to+0xc4/0x100
                [c0000003e0deef80] [c00000000042b0e0] .schedule+0x858/0x9c0
                [c0000003e0def230] [c00000000042b7c8] .schedule_timeout+0x1f8/0x240
                [c0000003e0def310] [c00000000042a444] .io_schedule_timeout+0x54/0x98
                [c0000003e0def3a0] [c00000000009ddfc] .balance_dirty_pages+0x294/0x390
                [c0000003e0def520] [c000000000095a2c] .generic_file_buffered_write+0x268/0x354
                [c0000003e0def660] [c000000000096074] .__generic_file_aio_write+0x374/0x3d8
                [c0000003e0def760] [c000000000096150] .generic_file_aio_write+0x78/0xe8
                [c0000003e0def810] [8000000006a7062c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
                [c0000003e0def8e0] [800000000249a81c] .cl_io_start+0xcc/0x220 [obdclass]
                [c0000003e0def980] [80000000024a2634] .cl_io_loop+0x194/0x2c0 [obdclass]
                [c0000003e0defa30] [80000000069ea208] .ll_file_io_generic+0x498/0x670 [lustre]
                [c0000003e0defb30] [80000000069ea864] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
                [c0000003e0defc00] [80000000069eab80] .ll_file_write+0x150/0x320 [lustre]
                [c0000003e0defce0] [c0000000000d1e9c] .vfs_write+0xd0/0x1c4
                [c0000003e0defd80] [c0000000000d208c] .SyS_write+0x54/0x98
                [c0000003e0defe30] [c000000000000580] syscall_exit+0x0/0x2c
            
            jlevi Jodi Levi (Inactive) made changes -
            Priority Original: Blocker [ 1 ] New: Major [ 3 ]

            jlevi Jodi Levi (Inactive) added a comment -

            Lowering priority as LU-2139 was reverted.
            prakash Prakash Surya (Inactive) added a comment - - edited

            > Prakash, attached are llite.*.unstable_stats and grep NFS_Unstable /proc/meminfo output during IOR.

            Thanks, I'll give these a look.

            > it's doing like this: writing; idle; writing; idle; ...

            OK, that sounds exactly like what I saw during my testing when I was hitting the kernel's dirty page limits. Allow me to explain.

            So, with that patch in place, we're now properly informing the kernel of our unstable pages by incrementing and decrementing the NFS_Unstable zone page counter. You can see this by watching the NFS_Unstable field in /proc/meminfo (before the patch it is always zero; after the patch it fluctuates with Lustre IO). So that's all well and good, but how does it relate to the idle time seen during IO? What I think is happening is that the newly accounted-for unstable pages are being factored in when the kernel calls balance_dirty_pages; balance_dirty_pages then determines that the system has hit its dirty page limit and sleeps, waiting for writeback to flush dirty pages to disk.
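
            As a rough illustration (not from the original comment, and the llite.*.unstable_stats file only exists with the LU-2139 client patch applied), a small watch loop makes the fluctuation easy to see while IOR runs:

                # poll the page accounting once a second during the IOR job
                while true; do
                    grep -E '^Dirty:|^Writeback:|^NFS_Unstable:' /proc/meminfo
                    lctl get_param llite.*.unstable_stats
                    sleep 1
                done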

            You can verify this theory by dumping the stacks of all processes while the IO is stalled and checking whether any of the write threads are stuck sleeping in balance_dirty_pages. What do these files show on your system: /proc/sys/vm/dirty_background_bytes, /proc/sys/vm/dirty_background_ratio, /proc/sys/vm/dirty_bytes, /proc/sys/vm/dirty_ratio? I believe the dirty limit on the system is calculated as dirty_ratio * available_memory when dirty_bytes is 0. So in your case the limit is about 12.8 GB (assuming a dirty_ratio of 20, which is the default on my system, and dirty_bytes of 0). Seeing as the value of NFS_Unstable in your logs hovers around 11556864 kB, it's plausible that dirty + unstable is hitting the limit.
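
            For example, one generic way to run that check (not from the original comment; it assumes the magic sysrq key is enabled on the client):

                # dump every task's stack to the kernel log, then look for write
                # threads parked in balance_dirty_pages while the IO is stalled
                echo t > /proc/sysrq-trigger
                dmesg | grep balance_dirty_pages

                # the knobs the limit is derived from; with dirty_bytes = 0 and
                # dirty_ratio = 20, ~64 GB of dirtyable memory gives the ~12.8 GB
                # limit mentioned above
                cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_bytes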

            If my above hypothesis is correct, the behavior you were seeing was expected and working as designed. The same problem would occur if you could push NFS at the same rates. If you had the full LU-2139 patch stack applied to the clients and servers (http://review.whamcloud.com/4245, http://review.whamcloud.com/4374, http://review.whamcloud.com/4375, http://review.whamcloud.com/5935), I'd expect this effect to go away.

            If you can, try setting dirty_ratio and max_dirty_mb to a large fraction of memory and rerun the test.
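
            For instance (illustrative values only, and assuming the usual osc.*.max_dirty_mb tunable is the max_dirty_mb being referred to):

                # raise the kernel's dirty page ceiling
                sysctl -w vm.dirty_ratio=50
                sysctl -w vm.dirty_background_ratio=25
                # raise the Lustre client's per-OSC dirty cache cap as well
                lctl set_param osc.*.max_dirty_mb=1024
                # then rerun the IOR job and watch NFS_Unstable in /proc/meminfo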

            ihara Shuichi Ihara (Inactive) made changes -
            Attachment New: collectl.log [ 12639 ]
            Attachment New: stat.log [ 12640 ]

            People

              Assignee: prakash Prakash Surya (Inactive)
              Reporter: ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: