
Performance impact on MDT performance with changelogs enabled

Details

    • Improvement
    • Resolution: Fixed
    • Major
    • Lustre 2.17.0
    • Upstream
    • 3

    Description

      We tested MDT performance with and without changelogs enabled and observed a large drop in file creation and removal rates with changelogs enabled.

      pdsh -g mds 'lctl get_param mdd.*.changelog*' | dshbak -c
      ----------------
      lmm1302
      ----------------
      mdd.lmm13-MDT0000.changelog_deniednext=60
      mdd.lmm13-MDT0000.changelog_gc=1
      mdd.lmm13-MDT0000.changelog_max_idle_indexes=2097446912
      mdd.lmm13-MDT0000.changelog_max_idle_time=2592000
      mdd.lmm13-MDT0000.changelog_min_free_cat_entries=2
      mdd.lmm13-MDT0000.changelog_min_gc_interval=3600
      mdd.lmm13-MDT0000.changelog_size=1637620216
      mdd.lmm13-MDT0000.changelog_striped_dir_real_pfid=0
      mdd.lmm13-MDT0000.changelog_current_mask=
      MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC
      mdd.lmm13-MDT0000.changelog_mask=
      MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC
      mdd.lmm13-MDT0000.changelog_users=
      current_index: 227636205
      ID                            index (idle) mask
      cl3                       219246813 (76)
      
      SUMMARY rate: (of 3 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         File creation             :      64293.217      61191.096      62598.992       1282.215
         File stat                 :     697756.541     690173.219     694512.173       3152.598
         File read                 :     293942.923     292588.428     293054.714        615.889
         File removal              :      64479.412      57107.824      61776.225       3314.447
         Tree creation             :        169.033        145.318        154.133         10.595
         Tree removal              :         82.949         44.846         69.342         17.357
      V-1: Entering PrintTimestamp...
      -- finished at 11/30/2023 12:19:44 --
       

      When we disable changelogs, performance recovers:

      # cscli lustre changelog disable
      lmm13-MDT0000: Deregistered changelog user #3
      lmm13-MDT0001: Deregistered changelog user #3
      
      SUMMARY rate: (of 3 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         File creation             :     158468.362     153578.205     155692.937       2048.523
         File stat                 :     703378.629     665431.521     689570.511      17093.925
         File read                 :     290063.030     278902.768     284177.285       4568.720
         File removal              :     141796.451     136881.639     139915.690       2163.940
         Tree creation             :        199.212        131.040        173.177         30.070
         Tree removal              :         95.939         39.770         74.663         24.871
      V-1: Entering PrintTimestamp...
      -- finished at 11/30/2023 12:23:45 --
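
To put the two SUMMARY tables side by side, here is a minimal sketch that computes each operation's mean rate with changelogs enabled as a fraction of the rate with changelogs disabled. The numbers are the Mean columns copied from the tables above. The dictionary and function names are made up for this illustration.

```python
# Mean rates (ops/sec) copied from the two SUMMARY tables in this ticket.
with_changelog = {
    "File creation": 62598.992,
    "File stat": 694512.173,
    "File read": 293054.714,
    "File removal": 61776.225,
    "Tree creation": 154.133,
    "Tree removal": 69.342,
}
without_changelog = {
    "File creation": 155692.937,
    "File stat": 689570.511,
    "File read": 284177.285,
    "File removal": 139915.690,
    "Tree creation": 173.177,
    "Tree removal": 74.663,
}


def relative_rate(op: str) -> float:
    """Rate with changelogs enabled divided by the rate with them disabled."""
    return with_changelog[op] / without_changelog[op]


for op in with_changelog:
    print(f"{op:15s} {relative_rate(op):6.1%} of the no-changelog rate")
```

By this measure, file creation and file removal run at roughly 40% and 44% of their no-changelog rates, while stat and read, which do not write changelog records, stay within a few percent.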
      

      I've taken a perf report with changelogs enabled. It looks like lock contention in llog_cat_add_rec() (rwsem_down_write_slowpath in the trace below) is the bottleneck.

         --14.94%--mdt_reint_create                                                     
       |                                                                               
        --14.93%--mdt_create                                                           
                  |                                                                    
                  |--14.63%--mdd_create                                                
                  |          |                                                         
                  |          |--12.85%--mdd_changelog_ns_store                         
                  |          |          |                                              
                  |          |           --12.84%--mdd_changelog_store                 
                  |          |                     |                                   
                  |          |                      --12.84%--llog_add                 
                  |          |                                |                        
                  |          |                                 --12.84%--llog_cat_add_rec
                  |          |                                           |             
                  |          |                                           |--12.70%--rwsem_down_write_slowpath        
                  |          |                                           |          |  
                  |          |                                           |          |--11.80%--osq_lock   
                  |          |                                           |          |  
                  |          |                                           |           --0.46%--rwsem_spin_on_owner    
                  |          |                                           |             
                  |          |                                            --0.12%--llog_write_rec         
                  |          |                                                      |  
                  |          |                                                       --0.12%--mdd_changelog_write_rec
      

      Without changelogs, the perf report looks like this:

      |--5.90%--mdd_create                                                             
                  |                                                                            
                  |--4.74%--mdd_create_object                                                  
                  |          |                                                                 
                  |          |--3.01%--mdd_create_object_internal                              
                  |          |          |                                                      
                  |          |           --3.01%--lod_create                                   
                  |          |                     |                                           
                  |          |                      --3.01%--lod_sub_create                    
                  |          |                                |                                
                  |          |                                 --3.01%--osd_create             
                  |          |                                           |                     
                  |          |                                           |--2.78%--osd_mkfile.constprop.104                                           
                  |          |                                           |          |          
                  |          |                                           |           --2.78%--ldiskfs_create_inode                                    
                  |          |                                           |                     |          
                  |          |                                           |                      --2.78%--__ldiskfs_new_inode 
      
      

      In the Lustre llog design/implementation, adding a record to the changelog synchronizes on down_write(plain_llog->lgh_lock), which is the top-level semaphore.
      The full locking sequence for adding a record looks like this:

        down_write(&loghandle->lgh_lock)                  synchronizes writers
        ----down_write(&loghandle->lgh_last_sem)          synchronizes write and parallel read
        --------mutex_lock(&loghandle->lgh_hdr_mutex)     protects llog header/bitmap data from concurrent update/cancel
        --------dt_write_lock(env, o, 0)                  makes the header and record update atomic for remote readers
        ------------write header update
        --------mutex_unlock(&loghandle->lgh_hdr_mutex)
        --------write a record
        --------dt_write_unlock(env, o)
        ----up_write(&loghandle->lgh_last_sem)
        up_write(&loghandle->lgh_lock)
      

      So there is a hard limit on the rate at which records can be added to the changelog: all metadata threads (512 or so) wait on the top-level semaphore while adding a changelog record, and only one of them can be the writer at any moment. This is the bottleneck.
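
The effect of that top-level lock can be illustrated with a toy model. This is not Lustre code: it is a sketch in which a single `threading.Lock` stands in for lgh_lock and the whole "add record" path runs under it. The class, its fields, and the `run()` helper are hypothetical names made up for the illustration. It records how many threads are ever inside the write path at the same time.

```python
import threading


class SerializedLog:
    """Toy model: the entire append runs under one top-level lock."""

    def __init__(self):
        self.top_lock = threading.Lock()    # stands in for lgh_lock
        self.records = []                   # (index, payload) in write order
        self._stat_lock = threading.Lock()  # protects the counters below
        self._in_write = 0
        self.max_in_write = 0

    def add_record(self, payload):
        with self.top_lock:
            with self._stat_lock:
                self._in_write += 1
                self.max_in_write = max(self.max_in_write, self._in_write)
            # Index calculation and the write itself are both serialized.
            index = len(self.records) + 1
            self.records.append((index, payload))
            with self._stat_lock:
                self._in_write -= 1


def run(n_threads=16, per_thread=200):
    """Start n_threads writers that each add per_thread records."""
    log = SerializedLog()

    def worker(tid):
        for i in range(per_thread):
            log.add_record(f"t{tid}-r{i}")

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    log = run()
    print("records:", len(log.records), "max concurrent writers:", log.max_in_write)
```

However many worker threads are started, `max_in_write` never exceeds 1 in this model, which is the behavior the description attributes to the top-level semaphore.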

          Activity

            [LU-18218] Performance impact on MDT performance with changelogs enabled

            bzzz Alex Zhuravlev added a comment -

            "A dt_write() failure for a changelog is already a problem without this patch, because the operation fails."

            It doesn't leave holes, AFAIU.

            aboyko Alexander Boyko added a comment -

            A dt_write() failure for a changelog is already a problem without this patch, because the operation fails.

            bzzz Alex Zhuravlev added a comment -

            "Actually, I think you mean the opposite problem, when offset A < B."

            Yes, literally I mean a gap.

            "Currently, when a bad record is encountered, the code proceeds to the next chunk."

            That would be a regression leading to missing records. Personally, I don't think this is acceptable.

            aboyko Alexander Boyko added a comment -

            "What would happen if a write to offset A fails while another write to offset B succeeds, where A > B? While this is very unlikely, it is still possible, I think."

            Case A > B: the layout [offset B | offset A] holds [rec B | 0x00 0x00 0x00 0x00].

            Actually, I think you mean the opposite problem, when offset A < B.

            Case A < B: the layout [offset A | offset B] holds [0x00 0x00 0x00 0x00 | rec B].

            The index bit would not be set for record A, while the tail, the bit for B, and record B itself would be fine.

            In this scenario, the reader (llog_process_thread()) should handle null records appropriately. Currently, when a bad record is encountered, the code proceeds to the next chunk. However, I believe we could improve this by skipping the null record and processing the next valid one, similar to how the patch handles null padding at the end of a block.
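
The reader-side idea in the comment above (skip a null record and continue with the next valid one) can be sketched as follows. This is only an illustration, not code from the patch and not the llog on-disk format. It assumes a made-up record layout (an 8-byte header holding the total and payload lengths, with records aligned to 8 bytes) and models a failed write as a zero-filled hole the size of the missing record. `make_record` and `read_records` are hypothetical helpers.

```python
import struct

ALIGN = 8                   # assumed record alignment for this toy format
HDR = struct.Struct("<II")  # (total_len, payload_len): a made-up header


def make_record(payload: bytes) -> bytes:
    """Build one record, zero-padded to a multiple of ALIGN bytes."""
    total = (HDR.size + len(payload) + ALIGN - 1) // ALIGN * ALIGN
    return HDR.pack(total, len(payload)) + payload.ljust(total - HDR.size, b"\x00")


def read_records(chunk: bytes) -> list:
    """Return payloads of valid records, stepping over zero-filled holes."""
    records, off = [], 0
    while off + HDR.size <= len(chunk):
        total, plen = HDR.unpack_from(chunk, off)
        if total == 0:
            # Null record or padding: advance one alignment unit and retry
            # instead of abandoning the rest of the chunk.
            off += ALIGN
            continue
        if total % ALIGN or HDR.size + plen > total or off + total > len(chunk):
            break  # genuinely corrupt header: stop processing this chunk
        records.append(chunk[off + HDR.size: off + HDR.size + plen])
        off += total
    return records


rec_a = make_record(b"rec A")
rec_b = make_record(b"rec B")
hole = bytes(len(rec_a))  # what a failed write of rec A leaves behind

print(read_records(hole + rec_b))  # case A < B: hole first, then rec B
print(read_records(rec_b + hole))  # case A > B: rec B, then trailing zeros
```

In both layouts this reader still returns rec B. Whether silently skipping the hole is acceptable is the point debated in this thread.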

            bzzz Alex Zhuravlev added a comment -

            "The key enhancement is the ability to perform parallel writes. The mdd_changelog_write_rec() function calculates the file offset and record index and then releases &loghandle->lgh_lock. A semaphore is used to protect the offset/index calculation only, enabling dt_write() to execute concurrently. This reduces contention and improves efficiency."

            What would happen if a write to offset A fails while another write to offset B succeeds, where A > B? While this is very unlikely, it is still possible, I think.

            aboyko Alexander Boyko added a comment -

            bzzz

            "Looking at the patch LU-18218 mdd: changelog specific write function, I still don't quite understand where the improvement comes from. The patch replaces the mutex with a spinlock, but there is an outer down_write(&loghandle->lgh_last_sem)."

            With the patch, the "changelog add record" operation uses its own dedicated function, mdd_changelog_write_rec(), to write an llog record, replacing the previous use of llog_osd_write_rec().

            The key enhancement is the ability to perform parallel writes. The mdd_changelog_write_rec() function calculates the file offset and record index and then releases &loghandle->lgh_lock. A semaphore is used to protect the offset/index calculation only, enabling dt_write() to execute concurrently. This reduces contention and improves efficiency.
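
The "reserve under the lock, write outside it" pattern described in this comment can be modelled with a short sketch. It is a toy model rather than the patch itself: a `threading.Lock` stands in for the semaphore that protects the offset/index calculation, and a slice assignment into a preallocated `bytearray` stands in for dt_write(). The fixed `RECORD_SIZE`, the `ParallelLog` class, and the `run_parallel()` helper are assumptions made for the illustration.

```python
import threading

RECORD_SIZE = 16  # hypothetical fixed record size, for simplicity


class ParallelLog:
    """Toy model: only offset/index reservation is serialized."""

    def __init__(self, capacity: int):
        self.reserve_lock = threading.Lock()  # protects the two counters only
        self.next_index = 1
        self.next_offset = 0
        self.storage = bytearray(capacity * RECORD_SIZE)

    def add_record(self, payload: bytes):
        # Short critical section: reserve a slot and leave.
        with self.reserve_lock:
            if self.next_offset + RECORD_SIZE > len(self.storage):
                raise RuntimeError("log is full")
            index, offset = self.next_index, self.next_offset
            self.next_index += 1
            self.next_offset += RECORD_SIZE
        # The write happens outside the lock, so writers holding different
        # reserved offsets do not wait for each other here.
        body = payload[:RECORD_SIZE - 4].ljust(RECORD_SIZE - 4, b"\x00")
        self.storage[offset:offset + RECORD_SIZE] = index.to_bytes(4, "little") + body
        return index, offset


def run_parallel(n_threads=8, per_thread=50):
    """Start n_threads writers that each add per_thread records."""
    log = ParallelLog(n_threads * per_thread)

    def worker(tid):
        for i in range(per_thread):
            log.add_record(f"t{tid}r{i}".encode())

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    log = run_parallel()
    print("records written:", log.next_index - 1)
```

Because each writer owns a disjoint byte range once it has reserved its slot, the records end up contiguous and in index order even though the writes themselves are not serialized. The model does not cover what happens when one of those writes fails, which is the case questioned elsewhere in the thread.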

            bzzz Alex Zhuravlev added a comment -

            Looking at the patch LU-18218 mdd: changelog specific write function, I still don't quite understand where the improvement comes from. The patch replaces the mutex with a spinlock, but there is an outer down_write(&loghandle->lgh_last_sem).

            pjones Peter Jones added a comment -

            Merged for 2.17

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56342/
            Subject: LU-18218 mdd: changelog specific write function
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cb1290768df9fca6ead194c2812fb0182d85191c

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/57920/
            Subject: LU-18218 llog: repeat declare for remote obj
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9b2de53f9b39f0e421e97e6a16a2f5998fe8cbfb

            gerrit Gerrit Updater added a comment -

            "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57920
            Subject: LU-18218 llog: repeat declare for remote obj
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ead97bde20156f50e9c21b0211170d60f98ae0fd

            People

              aboyko Alexander Boyko
              aboyko Alexander Boyko