  1. Lustre
  2. LU-18218

MDT performance impact with changelogs enabled


Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • None
    • Upstream
    • 3

    Description

      We tested MDT performance with and without changelogs enabled. With changelogs enabled, the mean file creation and file removal rates drop by more than half, while the stat and read rates are essentially unchanged.

      pdsh -g mds 'lctl get_param mdd.*.changelog*' | dshbak -c
      ----------------
      lmm1302
      ----------------
      mdd.lmm13-MDT0000.changelog_deniednext=60
      mdd.lmm13-MDT0000.changelog_gc=1
      mdd.lmm13-MDT0000.changelog_max_idle_indexes=2097446912
      mdd.lmm13-MDT0000.changelog_max_idle_time=2592000
      mdd.lmm13-MDT0000.changelog_min_free_cat_entries=2
      mdd.lmm13-MDT0000.changelog_min_gc_interval=3600
      mdd.lmm13-MDT0000.changelog_size=1637620216
      mdd.lmm13-MDT0000.changelog_striped_dir_real_pfid=0
      mdd.lmm13-MDT0000.changelog_current_mask=
      MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC
      mdd.lmm13-MDT0000.changelog_mask=
      MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC
      mdd.lmm13-MDT0000.changelog_users=
      current_index: 227636205
      ID                            index (idle) mask
      cl3                       219246813 (76)
      
      SUMMARY rate: (of 3 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         File creation             :      64293.217      61191.096      62598.992       1282.215
         File stat                 :     697756.541     690173.219     694512.173       3152.598
         File read                 :     293942.923     292588.428     293054.714        615.889
         File removal              :      64479.412      57107.824      61776.225       3314.447
         Tree creation             :        169.033        145.318        154.133         10.595
         Tree removal              :         82.949         44.846         69.342         17.357
      V-1: Entering PrintTimestamp...
      -- finished at 11/30/2023 12:19:44 --
       

      When we disable changelogs, performance comes back:

      # cscli lustre changelog disable
      lmm13-MDT0000: Deregistered changelog user #3
      lmm13-MDT0001: Deregistered changelog user #3
      
      SUMMARY rate: (of 3 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         File creation             :     158468.362     153578.205     155692.937       2048.523
         File stat                 :     703378.629     665431.521     689570.511      17093.925
         File read                 :     290063.030     278902.768     284177.285       4568.720
         File removal              :     141796.451     136881.639     139915.690       2163.940
         Tree creation             :        199.212        131.040        173.177         30.070
         Tree removal              :         95.939         39.770         74.663         24.871
      V-1: Entering PrintTimestamp...
      -- finished at 11/30/2023 12:23:45 --
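
      To put a number on the difference between the two runs, here is a small sketch that computes the relative change of the mean rates. The values are copied from the Mean column of the two SUMMARY tables above; the helper name relative_drop is made up for illustration and is not part of any Lustre tooling.

```python
# Mean rates (ops/sec) copied from the two SUMMARY tables in this ticket.
MEAN_WITH_CHANGELOG = {
    "File creation": 62598.992,
    "File stat": 694512.173,
    "File read": 293054.714,
    "File removal": 61776.225,
}
MEAN_WITHOUT_CHANGELOG = {
    "File creation": 155692.937,
    "File stat": 689570.511,
    "File read": 284177.285,
    "File removal": 139915.690,
}


def relative_drop(enabled: float, disabled: float) -> float:
    """Fraction of throughput lost when changelogs are enabled.

    Positive means slower with changelogs, negative means faster.
    """
    return 1.0 - enabled / disabled


# Relative drop per operation, keyed by the operation name used in the tables.
drops = {
    op: relative_drop(MEAN_WITH_CHANGELOG[op], MEAN_WITHOUT_CHANGELOG[op])
    for op in MEAN_WITH_CHANGELOG
}

if __name__ == "__main__":
    for op, drop in drops.items():
        print(f"{op:15s} {drop * 100:+6.1f}%")
```

      With these inputs, creation and removal lose roughly 60% and 56% of their mean rate, while stat and read stay within a few percent. Only the operations that write a changelog record are affected.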
      

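      I've taken a perf report with changelogs enabled. It looks like lock acquisition in llog_cat_add_rec() is the bottleneck: almost all of the time under it is spent in rwsem_down_write_slowpath, i.e. in a contended down_write().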

         --14.94%--mdt_reint_create                                                     
       |                                                                               
        --14.93%--mdt_create                                                           
                  |                                                                    
                  |--14.63%--mdd_create                                                
                  |          |                                                         
                  |          |--12.85%--mdd_changelog_ns_store                         
                  |          |          |                                              
                  |          |           --12.84%--mdd_changelog_store                 
                  |          |                     |                                   
                  |          |                      --12.84%--llog_add                 
                  |          |                                |                        
                  |          |                                 --12.84%--llog_cat_add_rec
                  |          |                                           |             
                  |          |                                           |--12.70%--rwsem_down_write_slowpath        
                  |          |                                           |          |  
                  |          |                                           |          |--11.80%--osq_lock   
                  |          |                                           |          |  
                  |          |                                           |           --0.46%--rwsem_spin_on_owner    
                  |          |                                           |             
                  |          |                                            --0.12%--llog_write_rec         
                  |          |                                                      |  
                  |          |                                                       --0.12%--mdd_changelog_write_rec
      

      Without changelogs, the perf report looks like this:

      |--5.90%--mdd_create                                                             
                  |                                                                            
                  |--4.74%--mdd_create_object                                                  
                  |          |                                                                 
                  |          |--3.01%--mdd_create_object_internal                              
                  |          |          |                                                      
                  |          |           --3.01%--lod_create                                   
                  |          |                     |                                           
                  |          |                      --3.01%--lod_sub_create                    
                  |          |                                |                                
                  |          |                                 --3.01%--osd_create             
                  |          |                                           |                     
                  |          |                                           |--2.78%--osd_mkfile.constprop.104                                           
                  |          |                                           |          |          
                  |          |                                           |           --2.78%--ldiskfs_create_inode                                    
                  |          |                                           |                     |          
                  |          |                                           |                      --2.78%--__ldiskfs_new_inode 
      
      

      In the Lustre llog design and implementation, adding a record to the changelog is synchronized by down_write(plain_llog->lgh_lock). This is the top-level semaphore.
      The full locking sequence for adding a record looks like this:

      down_write(&loghandle->lgh_lock); synchronizes writers
      ----down_write(&loghandle->lgh_last_sem); synchronizes the write with parallel reads
      --------mutex_lock(&loghandle->lgh_hdr_mutex); protects llog header/bitmap data from concurrent update/cancel
      --------dt_write_lock(env, o, 0); makes the header and record update atomic for remote readers
      ------------write header update
      --------mutex_unlock(&loghandle->lgh_hdr_mutex);
      --------write a record
      --------dt_write_unlock(env, o);
      ----up_write(&loghandle->lgh_last_sem);
      up_write(&loghandle->lgh_lock);
      

      So there is a hard limit on the rate at which records can be added to the changelog: all metadata threads (512 or so) sleep on the top-level semaphore while a record is being added, and only one of them can be the writer at any moment. This is the bottleneck.
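
      The effect of the top-level lock can be illustrated with a toy model. This is only a sketch, not Lustre code: the PlainLog class, its fields and the run() helper are hypothetical, and a plain threading.Lock stands in for the write side of lgh_lock. It shows that however many threads call add_record(), at most one is ever inside the append path.

```python
import threading


class PlainLog:
    """Toy stand-in for a plain llog (hypothetical, not the Lustre structure)."""

    def __init__(self):
        self.top_lock = threading.Lock()  # plays the role of lgh_lock (write side)
        self.records = []
        self.writers_inside = 0           # writers currently in the critical section
        self.max_writers_inside = 0       # highest value ever observed

    def add_record(self, rec):
        # Every caller must take the single top-level lock first, so the
        # header update and the record write are fully serialized.
        with self.top_lock:
            self.writers_inside += 1
            self.max_writers_inside = max(self.max_writers_inside,
                                          self.writers_inside)
            self.records.append(rec)
            self.writers_inside -= 1


def run(num_threads=16, records_per_thread=1000):
    """Start num_threads writers that each append records_per_thread records."""
    log = PlainLog()

    def worker(tid):
        for i in range(records_per_thread):
            log.add_record((tid, i))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    log = run()
    print("records written:", len(log.records))
    print("max concurrent writers:", log.max_writers_inside)
```

      Adding more threads in this model only adds more waiters on top_lock; the append rate stays bounded by how long a single writer holds the lock, which matches the osq_lock time seen in the perf report.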

          People

            aboyko Alexander Boyko