  Lustre / LU-3442

MDS performance degraded by reading of ZFS spacemaps

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0
    • Environment:
      server: lustre-2.4.0-RC2_2chaos_2.6.32_358.6.1.3chaos.ch5.1.ch5.1.x86_64
      clients: mix of PPC/Lustre 2.4 and x86_64/Lustre 2.1
    • 3
    • 8581

    Description

      We started to experience degraded performance on our MDS with a ZFS backend. Certain RPCs were taking many seconds or even minutes to service, so users saw very slow interactive responsiveness. On investigation, this turned out to be due to ZFS transaction groups taking a very long time to sync, which blocked request handlers that needed to write out an llog record. That, in turn, was caused by zio processing threads waiting in space_map_load_wait():

      [<ffffffffa038cdad>] cv_wait_common+0xed/0x100 [spl]                              
      [<ffffffffa038ce15>] __cv_wait+0x15/0x20 [spl]                                    
      [<ffffffffa0480f2f>] space_map_load_wait+0x2f/0x40 [zfs]                          
      [<ffffffffa046ab47>] metaslab_activate+0x77/0x160 [zfs]                           
      [<ffffffffa046b67e>] metaslab_alloc+0x4fe/0x950 [zfs]                             
      [<ffffffffa04c801a>] zio_dva_allocate+0xaa/0x350 [zfs]                            
      [<ffffffffa04c93e0>] zio_ready+0x3c0/0x460 [zfs]                                  
      [<ffffffffa04c93e0>] zio_ready+0x3c0/0x460 [zfs]                                  
      [<ffffffffa04c6293>] zio_execute+0xb3/0x130 [zfs]                                 
      [<ffffffffa0389277>] taskq_thread+0x1e7/0x3f0 [spl]                               
      [<ffffffff81096c76>] kthread+0x96/0xa0                                            
      [<ffffffff8100c0ca>] child_rip+0xa/0x20                                           
      [<ffffffffffffffff>] 0xffffffffffffffff          
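
      To make the chain of events above concrete, here is a deliberately simplified, self-contained model of the control flow shown in the stack trace. It is not the real ZFS code: the structures and the simulated disk read are stand-ins made up for illustration, and only the shape of metaslab_alloc() -> metaslab_activate() -> "load the space map before you can search it" is taken from the trace.

      /*
       * Toy model of the stall in the trace above (NOT real ZFS code).
       * The point: a metaslab cannot hand out space until its space map
       * has been read from disk and turned into an in-core free-segment
       * structure, so a "cold" space map turns an allocation into
       * synchronous I/O and stalls the zio pipeline (and the txg sync).
       */
      #include <stdio.h>
      #include <stdint.h>
      #include <unistd.h>

      struct space_map {            /* on-disk record of alloc/free extents */
          int      loaded;          /* is the in-core copy populated?       */
          uint64_t nsegs;           /* free segments, once loaded           */
      };

      struct metaslab {
          struct space_map sm;
          int active;
      };

      /* Stand-in for space_map_load(): synchronously read the space map
       * object and build the in-core free list.  The calling zio thread
       * makes no progress while this runs. */
      static void space_map_load(struct space_map *sm)
      {
          sleep(1);                 /* pretend: many random disk reads */
          sm->nsegs = 1000;
          sm->loaded = 1;
      }

      /* Stand-in for metaslab_activate(): allocation from a metaslab is
       * only possible once its space map is resident in memory. */
      static void metaslab_activate(struct metaslab *ms)
      {
          if (!ms->sm.loaded)
              space_map_load(&ms->sm);   /* <-- where the trace is blocked */
          ms->active = 1;
      }

      /* Stand-in for metaslab_alloc(): activate, then search the free
       * segments for a fit.  Every time the chosen metaslab's map has
       * been dropped from memory, the reload cost is paid again. */
      static uint64_t metaslab_alloc(struct metaslab *ms, uint64_t size)
      {
          metaslab_activate(ms);
          return (size <= ms->sm.nsegs) ? 0 : UINT64_MAX;   /* fake offset */
      }

      int main(void)
      {
          struct metaslab ms = { .sm = { .loaded = 0 } };

          printf("allocating (space map cold)...\n");
          metaslab_alloc(&ms, 128);      /* slow: must load the space map */
          printf("allocating (space map warm)...\n");
          metaslab_alloc(&ms, 128);      /* fast: map already resident    */
          return 0;
      }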
      

      We are able to mitigate this problem by setting the zfs module option metaslab_debug=1, which forces all spacemaps to stay resident in memory. However, this solution is a bit heavy-handed, and we'd like to gain a better understanding of why we're reading spacemaps from disk so often, and what should be done about it.
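
      For reference, besides the persistent "options zfs metaslab_debug=1" line in /etc/modprobe.d/, the tunable can usually also be flipped at run time, assuming the running ZFS build exports it writable under /sys/module/zfs/parameters/ (this varies by release). A trivial sketch of the run-time toggle:

      /* Sketch only: write "1" to the metaslab_debug module parameter,
       * assuming it is exposed writable in sysfs on this ZFS release. */
      #include <stdio.h>

      int main(void)
      {
          const char *path = "/sys/module/zfs/parameters/metaslab_debug";
          FILE *fp = fopen(path, "w");

          if (fp == NULL) {
              perror(path);          /* parameter absent or read-only */
              return 1;
          }
          fputs("1\n", fp);          /* keep all space maps resident  */
          fclose(fp);
          return 0;
      }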

      Our first thought was that pool fragmentation was the underlying cause, forcing the block allocator to search all of the spacemaps to find a suitable free interval. Our thinking was that llog cancellation promotes fragmentation by punching holes in otherwise contiguously allocated regions. But I'm not sure this theory is consistent with how llogs actually work, or with how the ZFS allocator works, for that matter.

      Another idea is that a concurrent write and unlink workload could cause this behaviour, but it's all just speculation until we better understand the workload and how ZFS manages spacemaps.

      The most appealing approach we've discussed so far is to modify ZFS to use the ARC to cache spacemap objects. I believe ZFS currently keeps only one spacemap (per vdev?) active in memory at a time, and it bypasses the ARC for these objects. Using the ARC would keep the hot spacemaps in memory while still allowing them to be evicted under memory pressure.
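
      To make the caching idea concrete, here is a minimal, self-contained sketch of the general mechanism being proposed: keep recently used space maps in a small cache and evict the least-recently-used one when a (stand-in) memory limit is reached. It is not ZFS code and not a proposed patch; the ARC's real MRU/MFU lists and memory-pressure callbacks are far more involved, and the fixed slot count below just stands in for "evict under memory pressure".

      /*
       * Toy LRU cache of "loaded space maps" (NOT the ZFS ARC).
       * Hot maps stay resident across allocations; cold ones get
       * evicted instead of pinning memory forever.
       */
      #include <stdio.h>
      #include <stdint.h>

      #define CACHE_SLOTS 4                  /* stand-in for a memory limit */

      struct sm_entry {
          uint64_t ms_id;                    /* which metaslab's space map  */
          uint64_t last_used;                /* LRU clock                   */
          int      valid;
      };

      static struct sm_entry cache[CACHE_SLOTS];
      static uint64_t lru_clock;

      /* Pretend to read a space map from disk (the expensive step). */
      static void sm_load_from_disk(uint64_t ms_id)
      {
          printf("  load space map for metaslab %llu from disk\n",
                 (unsigned long long)ms_id);
      }

      /* Return the cached space map for ms_id, loading it (and evicting
       * the least recently used entry) on a miss. */
      static struct sm_entry *sm_cache_lookup(uint64_t ms_id)
      {
          struct sm_entry *victim = &cache[0];

          for (int i = 0; i < CACHE_SLOTS; i++) {
              if (cache[i].valid && cache[i].ms_id == ms_id) {
                  cache[i].last_used = ++lru_clock;     /* hit: mark hot */
                  return &cache[i];
              }
              if (!cache[i].valid ||
                  cache[i].last_used < victim->last_used)
                  victim = &cache[i];                   /* empty or LRU slot */
          }

          if (victim->valid)
              printf("  evict space map for metaslab %llu\n",
                     (unsigned long long)victim->ms_id);

          sm_load_from_disk(ms_id);                     /* miss: pay the I/O */
          victim->ms_id = ms_id;
          victim->last_used = ++lru_clock;
          victim->valid = 1;
          return victim;
      }

      int main(void)
      {
          /* Metaslabs 1-3 are hot and stay resident; 9 and 10 churn. */
          uint64_t workload[] = { 1, 2, 3, 1, 2, 9, 1, 2, 3, 10, 1 };

          for (size_t i = 0; i < sizeof(workload) / sizeof(workload[0]); i++) {
              printf("alloc from metaslab %llu\n",
                     (unsigned long long)workload[i]);
              sm_cache_lookup(workload[i]);
          }
          return 0;
      }

      In the real proposal the ARC would supply the eviction policy and memory-pressure integration for free, so only the space map read path would need to change.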

      So, I'm not sure there's a Lustre bug here, but it's an issue to be aware of when using ZFS backends.

    People

      Assignee: Niu Yawei (Inactive)
      Reporter: Ned Bass (Inactive)