
LU-18218: Performance impact on MDT performance with changelogs enabled

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.17.0
    • Upstream
    • 3

    Description

      We tested MDT performance with and without changelogs and observed a significant performance drop when changelogs were enabled.

      pdsh -g mds 'lctl get_param mdd.*.changelog*' | dshbak -c
      ----------------
      lmm1302
      ----------------
      mdd.lmm13-MDT0000.changelog_deniednext=60
      mdd.lmm13-MDT0000.changelog_gc=1
      mdd.lmm13-MDT0000.changelog_max_idle_indexes=2097446912
      mdd.lmm13-MDT0000.changelog_max_idle_time=2592000
      mdd.lmm13-MDT0000.changelog_min_free_cat_entries=2
      mdd.lmm13-MDT0000.changelog_min_gc_interval=3600
      mdd.lmm13-MDT0000.changelog_size=1637620216
      mdd.lmm13-MDT0000.changelog_striped_dir_real_pfid=0
      mdd.lmm13-MDT0000.changelog_current_mask=
      MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC
      mdd.lmm13-MDT0000.changelog_mask=
      MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC
      mdd.lmm13-MDT0000.changelog_users=
      current_index: 227636205
      ID                            index (idle) mask
      cl3                       219246813 (76)
      
      SUMMARY rate: (of 3 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         File creation             :      64293.217      61191.096      62598.992       1282.215
         File stat                 :     697756.541     690173.219     694512.173       3152.598
         File read                 :     293942.923     292588.428     293054.714        615.889
         File removal              :      64479.412      57107.824      61776.225       3314.447
         Tree creation             :        169.033        145.318        154.133         10.595
         Tree removal              :         82.949         44.846         69.342         17.357
      V-1: Entering PrintTimestamp...
      -- finished at 11/30/2023 12:19:44 --
       

      When we disable changelogs, performance recovers:

      # cscli lustre changelog disable
      lmm13-MDT0000: Deregistered changelog user #3
      lmm13-MDT0001: Deregistered changelog user #3
      
      SUMMARY rate: (of 3 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         File creation             :     158468.362     153578.205     155692.937       2048.523
         File stat                 :     703378.629     665431.521     689570.511      17093.925
         File read                 :     290063.030     278902.768     284177.285       4568.720
         File removal              :     141796.451     136881.639     139915.690       2163.940
         Tree creation             :        199.212        131.040        173.177         30.070
         Tree removal              :         95.939         39.770         74.663         24.871
      V-1: Entering PrintTimestamp...
      -- finished at 11/30/2023 12:23:45 --
      

      I've taken a perf report with changelogs enabled; it looks like the write-lock acquisition in llog_cat_add_rec() (rwsem_down_write_slowpath) is a bottleneck:

         --14.94%--mdt_reint_create                                                     
       |                                                                               
        --14.93%--mdt_create                                                           
                  |                                                                    
                  |--14.63%--mdd_create                                                
                  |          |                                                         
                  |          |--12.85%--mdd_changelog_ns_store                         
                  |          |          |                                              
                  |          |           --12.84%--mdd_changelog_store                 
                  |          |                     |                                   
                  |          |                      --12.84%--llog_add                 
                  |          |                                |                        
                  |          |                                 --12.84%--llog_cat_add_rec
                  |          |                                           |             
                  |          |                                           |--12.70%--rwsem_down_write_slowpath        
                  |          |                                           |          |  
                  |          |                                           |          |--11.80%--osq_lock   
                  |          |                                           |          |  
                  |          |                                           |           --0.46%--rwsem_spin_on_owner    
                  |          |                                           |             
                  |          |                                            --0.12%--llog_write_rec         
                  |          |                                                      |  
                  |          |                                                       --0.12%--mdd_changelog_write_rec
      

      Without changelogs, the perf profile looks like this:

      |--5.90%--mdd_create                                                             
                  |                                                                            
                  |--4.74%--mdd_create_object                                                  
                  |          |                                                                 
                  |          |--3.01%--mdd_create_object_internal                              
                  |          |          |                                                      
                  |          |           --3.01%--lod_create                                   
                  |          |                     |                                           
                  |          |                      --3.01%--lod_sub_create                    
                  |          |                                |                                
                  |          |                                 --3.01%--osd_create             
                  |          |                                           |                     
                  |          |                                           |--2.78%--osd_mkfile.constprop.104                                           
                  |          |                                           |          |          
                  |          |                                           |           --2.78%--ldiskfs_create_inode                                    
                  |          |                                           |                     |          
                  |          |                                           |                      --2.78%--__ldiskfs_new_inode 
      
      

      In the Lustre llog design/implementation, adding a record to the changelog is synchronized on down_write(&plain_llog->lgh_lock). That is the top-level semaphore. The full locking for adding a record looks like this:

        down_write(&loghandle->lgh_lock)  synchronizes writers
        ----down_write(&loghandle->lgh_last_sem)  synchronizes writes with parallel reads
        --------mutex_lock(&loghandle->lgh_hdr_mutex)  protects llog header/bitmap data from concurrent update/cancel
        --------dt_write_lock(env, o, 0);  for atomic update of header and record with respect to remote readers
        -----------  write header update
        --------mutex_unlock(&loghandle->lgh_hdr_mutex);
        --------write a record
        --------dt_write_unlock(env, o);
        ----up_write(&loghandle->lgh_last_sem);
      up_write(&loghandle->lgh_lock);
      

      So there is a hard limit on the rate at which records can be added to the changelog: all metadata threads (512 or so) sleep on the top-level semaphore while a record is being added, and only one of them can be the writer at any moment. That is the bottleneck.
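
      As a rough back-of-envelope check: ~62,600 creates/s with changelogs corresponds to about 1 s / 62,600 ≈ 16 µs per create, so the serialized llog append costs on the order of 16 µs per record. Since only one thread can hold the write lock at a time, the aggregate create rate is capped at roughly one record per serialized section, no matter how many of the ~512 service threads are queued behind it, which is consistent with the drop from ~155,700 creates/s (changelogs off) to ~62,600 creates/s (changelogs on).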

          Activity


            The key enhancement is the ability to perform parallel writes. The mdd_changelog_write_rec() function calculates the file offset and record index and then releases &loghandle->lgh_lock. A semaphore is used to protect only the offset/index calculation, enabling dt_write() to execute concurrently. This reduces contention and improves efficiency.

            What would happen if a write to offset A fails while another write to offset B succeeds, where A > B? This is very unlikely, but still possible, I think.

            bzzz Alex Zhuravlev added a comment

            bzzz

            Looking at the "LU-18218 mdd: changelog specific write function" patch, I still don't quite understand where the improvement comes from. The patch replaces the mutex with a spinlock, but there is an outer down_write(&loghandle->lgh_last_sem).

            With the patch, the "changelog add record" operation now uses its own dedicated function, mdd_changelog_write_rec(), to write an llog record, replacing the previous use of llog_osd_write_rec().

            The key enhancement is the ability to perform parallel writes. The mdd_changelog_write_rec() function calculates the file offset and record index and then releases &loghandle->lgh_lock. A semaphore is used to protect only the offset/index calculation, enabling dt_write() to execute concurrently. This reduces contention and improves efficiency.

            aboyko Alexander Boyko added a comment
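
            To illustrate the pattern described above, a minimal userspace sketch of "reserve the offset/index under a short lock, write outside of it" could look as follows. The pthread mutex, in-memory buffer, and record layout are illustrative assumptions rather than the actual mdd_changelog_write_rec() code; in the real patch the slow step outside the lock is the dt_write() into the llog object, not a memory copy.

            #include <pthread.h>
            #include <stdint.h>
            #include <stdio.h>

            #define NTHREADS        8       /* stands in for the MDT service threads */
            #define RECS_PER_THREAD 1000
            #define RECSIZE         64

            static pthread_mutex_t reserve_lock = PTHREAD_MUTEX_INITIALIZER;
            static uint64_t next_offset;    /* next free offset in the plain llog */
            static uint32_t next_index;     /* next llog record index */
            static char log_buf[NTHREADS * RECS_PER_THREAD * RECSIZE];

            static void changelog_write_rec(const char *payload)
            {
                    uint64_t off;
                    uint32_t idx;

                    /* short critical section: reserve offset and record index only */
                    pthread_mutex_lock(&reserve_lock);
                    off = next_offset;
                    idx = ++next_index;
                    next_offset += RECSIZE;
                    pthread_mutex_unlock(&reserve_lock);

                    /* the expensive part (dt_write() in the real code) runs concurrently */
                    snprintf(log_buf + off, RECSIZE, "idx=%u %s", idx, payload);
            }

            static void *worker(void *arg)
            {
                    (void)arg;
                    for (int i = 0; i < RECS_PER_THREAD; i++)
                            changelog_write_rec("CREAT");
                    return NULL;
            }

            int main(void)
            {
                    pthread_t tid[NTHREADS];

                    for (int i = 0; i < NTHREADS; i++)
                            pthread_create(&tid[i], NULL, worker, NULL);
                    for (int i = 0; i < NTHREADS; i++)
                            pthread_join(tid[i], NULL);
                    printf("wrote %u records, next offset %llu\n",
                           next_index, (unsigned long long)next_offset);
                    return 0;
            }

            With this layout the only serialized work per record is a couple of integer updates, so concurrent writers no longer sleep for the duration of the whole record write.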

            Looking at the "LU-18218 mdd: changelog specific write function" patch, I still don't quite understand where the improvement comes from. The patch replaces the mutex with a spinlock, but there is an outer down_write(&loghandle->lgh_last_sem).

            bzzz Alex Zhuravlev added a comment
            pjones Peter Jones added a comment -

            Merged for 2.17


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56342/
            Subject: LU-18218 mdd: changelog specific write function
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cb1290768df9fca6ead194c2812fb0182d85191c

            gerrit Gerrit Updater added a comment

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/57920/
            Subject: LU-18218 llog: repeat declare for remote obj
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9b2de53f9b39f0e421e97e6a16a2f5998fe8cbfb

            gerrit Gerrit Updater added a comment

            "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57920
            Subject: LU-18218 llog: repeat declare for remote obj
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ead97bde20156f50e9c21b0211170d60f98ae0fd

            gerrit Gerrit Updater added a comment

            > not sure what you have reproduced, but I'm hitting this symptom:
            The symptoms are ESTALE, like the ones you have.

            aboyko Alexander Boyko added a comment

            not sure what you have reproduced, but I'm hitting this symptom:

            rm: cannot remove '/mnt/lustre/d60g.sanity/subdir1': Stale file handle
            rm: cannot remove '/mnt/lustre/d60g.sanity/subdir13': Stale file handle
            rm: cannot remove '/mnt/lustre/d60g.sanity/subdir21': Stale file handle
             sanity test_60g: @@@@@@ FAIL: rmdir failed 
            

            This is because of the llog_declare_write_rec() you call for remote objects in llog_declare_write_rec() -> llog_declare_write_rec().
            Going to try a small patch.

            bzzz Alex Zhuravlev added a comment

            I've reproduced the 60g failure.

            It looks like the transaction abort from the test happened during llog osp file creation, and later it leads to ESTALE at llog_osd_declare_write_rec():

            00080000:02000000:5.0:1737730681.616524:0:1162613:0:(osd_handler.c:2091:osd_trans_start()) *** cfs_fail_loc=19a, val=0***
            00080000:00000001:5.0:1737730681.616524:0:1162613:0:(osd_handler.c:2092:osd_trans_start()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
            00080000:00000001:5.0:1737730681.616525:0:1162613:0:(osd_handler.c:2119:osd_trans_start()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            00000020:00000001:5.0:1737730681.616525:0:1162613:0:(out_handler.c:903:out_tx_end()) Process leaving via stop (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
            00080000:00000001:5.0:1737730681.616525:0:1162613:0:(osd_handler.c:2176:osd_trans_stop()) Process entered
            00040000:00000001:5.0:1737730681.616526:0:1162613:0:(qsd_handler.c:1120:qsd_op_end()) Process entered
            00040000:00000001:5.0:1737730681.616526:0:1162613:0:(qsd_handler.c:1151:qsd_op_end()) Process leaving
            00080000:00000001:5.0:1737730681.616526:0:1162613:0:(osd_handler.c:2271:osd_trans_stop()) Process leaving (rc=0 : 0 : 0)
            00000004:00000001:5.0:1737730681.616527:0:1162613:0:(mdt_handler.c:6840:mdt_object_free()) Process entered
            00000004:00000040:5.0:1737730681.616527:0:1162613:0:(mdt_handler.c:6843:mdt_object_free()) object free, fid = [0x240000401:0x8cb0:0x0]
            00000004:00000001:5.0:1737730681.616527:0:1162613:0:(mdt_handler.c:6854:mdt_object_free()) Process leaving
            00000020:00000002:5.0:1737730681.616527:0:1162613:0:(lu_object.c:224:lu_object_put()) Add ffff98c42af76700/ffff98c42d0e0150 to site lru. bkt: ffff98c4320e5338
            00000020:00000002:5.0:1737730681.616528:0:1162613:0:(lu_object.c:224:lu_object_put()) Add ffff98c42bcd4b00/ffff98c430506a80 to site lru. bkt: ffff98c4320e6670
            00000020:00000001:5.0:1737730681.616528:0:1162613:0:(out_handler.c:944:out_tx_end()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            00000020:00000001:5.0:1737730681.616528:0:1162613:0:(out_handler.c:1275:out_handle()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            00010000:00000040:5.0:1737730681.616528:0:1162613:0:(ldlm_lib.c:3209:target_committed_to_req()) last_committed 4295177667, transno 0, xid 1822138016289024
            00010000:00000001:5.0:1737730681.616529:0:1162613:0:(ldlm_lib.c:3279:target_send_reply()) Process entered
            00010000:00000200:5.0:1737730681.616529:0:1162613:0:(ldlm_lib.c:3267:target_send_reply_msg()) @@@ sending reply  req@ffff98c342d1c450 x1822138016289024/t0(0) o1000->lustre-MDT0000-mdtlov_UUID@0@lo:42/0 lens 1320/4320 e 0 to 0 dl 1737730692 ref 1 fl Interpret:/200/0 rc -5/0 job:'osp_up1-0.0' uid:0 gid:0
            ....
            00000100:00000001:5.0:1737730681.616576:0:1163369:0:(niobuf.c:447:ptlrpc_unregister_bulk()) Process entered
            00000100:00000001:5.0:1737730681.616576:0:1163369:0:(niobuf.c:461:ptlrpc_unregister_bulk()) Process leaving (rc=1 : 1 : 1)
            00000004:00000001:5.0:1737730681.616577:0:1163369:0:(osp_trans.c:600:osp_update_interpret()) Process entered
            00000004:00080000:5.0:1737730681.616577:0:1163369:0:(osp_md_object.c:1164:osp_write_interpreter()) error [0x240000401:0x1:0x0]: rc = -5
            00000004:00080000:5.0:1737730681.616577:0:1163369:0:(osp_md_object.c:1164:osp_write_interpreter()) error [0x240000401:0x1:0x0]: rc = -5
            00000004:00080000:5.0:1737730681.616578:0:1163369:0:(osp_md_object.c:1164:osp_write_interpreter()) error [0x240000401:0x1:0x0]: rc = -5
            00000004:00080000:5.0:1737730681.616578:0:1163369:0:(osp_md_object.c:1164:osp_write_interpreter()) error [0x240000401:0x1:0x0]: rc = -5
            00000004:00000001:5.0:1737730681.616578:0:1163369:0:(osp_object.c:1374:osp_invalidate()) Process entered
            00000004:00080000:5.0:1737730681.616578:0:1163369:0:(osp_object.c:1376:osp_invalidate()) Invalidate osp_object [0x240000401:0x1:0x0]
            00000004:00000001:5.0:1737730681.616579:0:1163369:0:(osp_object.c:1396:osp_invalidate()) Process leaving (rc=0 : 0 : 0)
            00000004:00000001:5.0:1737730681.616579:0:1163369:0:(osp_trans.c:689:osp_update_interpret()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
             

            But the real problem for this llog fid [0x240000401:0x8cb0:0x0] happened earlier:

            00000020:00000001:5.0:1737730681.616457:0:1163628:0:(out_lib.c:251:out_attr_set_pack()) Process leaving (rc=0 : 0 : 0)
            00000040:00001000:5.0:1737730681.616458:0:1163628:0:(llog_cat.c:186:llog_cat_new_log()) new plain log [0x240000401:0x8cb0:0x0].61 of catalog [0x240000401:0x1:0x0]
            00000004:00000001:5.0:1737730681.616458:0:1163628:0:(osp_dev.c:793:osp_statfs()) Process entered
            00000004:00001000:5.0:1737730681.616458:0:1163628:0:(osp_dev.c:808:osp_statfs()) lustre-MDT0001-osp-MDT0000: blocks=68640, bfree=17648, bavail=11756, bsize=4096, reserved_mb_low=1, reserved_mb_high=3, files=199976, ffree=199442, state=0
            00000004:00000001:5.0:1737730681.616459:0:1163628:0:(osp_dev.c:816:osp_statfs()) Process leaving (rc=0 : 0 : 0)
            00000004:00000001:5.0:1737730681.616459:0:1163628:0:(osp_trans.c:1555:osp_trans_stop()) Process entered
            00000004:00080000:5.0:1737730681.616459:0:1163628:0:(osp_trans.c:1581:osp_trans_stop()) lustre-MDT0001-osp-MDT0000: add oth ffff98c42b331c00 with version 137317
            00000004:00000001:5.0:1737730681.616460:0:1163628:0:(osp_trans.c:1603:osp_trans_stop()) Process leaving (rc=0 : 0 : 0)
            00000040:00000001:5.0:1737730681.616460:0:1163628:0:(llog_cat.c:222:llog_cat_new_log()) Process leaving (rc=0 : 0 : 0)
            00000040:00001000:5.0:1737730681.616460:0:1163628:0:(llog_cat.c:366:llog_cat_prep_log()) lustre-MDT0001-osp-MDT0000: open log [0x240000401:0x8cb0:0x0] for catalog [0x240000401:0x1:0x0] rc=0
            

            This means the following logic does not work: based on the Lustre logs there is no synchronous llog creation, and the OUT request happened later, at 1737730681.616525.

            	if (dt_object_remote(cathandle->lgh_obj)) {
            		/* For remote operation, if we put the llog object
            		 * creation in the current transaction, then the
            		 * llog object will not be created on the remote
            		 * target until the transaction stop, if other
            		 * operations start before the transaction stop,
            		 * and use the same llog object, will be dependent
            		 * on the success of this transaction. So let's
            		 * create the llog object synchronously here to
            		 * remove the dependency.
            		 */
            		rc = llog_cat_new_log(env, cathandle, loghandle, NULL);
            		if (rc == -ESTALE) {
            

            I guess it is because the check "if (likely(!list_empty(&our->our_list)))" comes first, so it does not matter that llog_cat_new_log() tries to make the operation synchronous (by setting handle->th_result to 0).

                 CDEBUG(D_HA, "%s: add oth %p with version %llu\n",                              
                           osp->opd_obd->obd_name, oth, our->our_version);                          
                                                                                                    
                    LASSERT(our->our_req_ready == 0);                                               
                    spin_lock(&our->our_list_lock);                                                 
                    if (likely(!list_empty(&our->our_list))) {                                      
                            /* notify sending thread */                                             
                            our->our_req_ready = 1;                                                 
                            wake_up(&osp->opd_update->ou_waitq);                                    
                            spin_unlock(&our->our_list_lock);                                       
                    } else if (th->th_result == 0) {                                                
                            /* if the request does not needs to be serialized,                      
                             * read-only request etc, let's send it right away */                   
                            spin_unlock(&our->our_list_lock);                                       
                            rc = osp_send_update_req(env, osp, our);                                
                    } else {                                                                        
                            spin_unlock(&our->our_list_lock);                                       
                            osp_trans_callback(env, oth, th->th_result);                            
                    }                                                                               
            out:                                                                                    
                    osp_thandle_put(env, oth);                                                      
                                                                                                    
                    RETURN(rc);                                        
            

            bzzz any thoughts about fixing this?

            aboyko Alexander Boyko added a comment

            Subject: LU-18218 llog: catalog lgh_lock refactoring

            With this patch applied I hit a failure in sanity/60g:

            # ONLY=60g MDSCOUNT=2 ONLY_REPEAT=1000 MDSSIZE=3000000  bash sanity.sh:
            [  795.666375] Lustre: DEBUG MARKER: == sanity test 60g: transaction abort won't cause MDT hung (repeat 233/1000 iter, 12/ min) ========================================================== 07:31:05 (1737617465)
            
            ...
            /mnt/lustre/d60g.sanity/subdir97:
            rm: cannot remove '/mnt/lustre/d60g.sanity/subdir22': Stale file handle
            rm: cannot remove '/mnt/lustre/d60g.sanity/subdir14': Stale file handle
            rm: cannot remove '/mnt/lustre/d60g.sanity/subdir6': Stale file handle
            rm: cannot remove '/mnt/lustre/d60g.sanity/subdir32': Stale file handle
             sanity test_60g: @@@@@@ FAIL: rmdir failed 
            [  799.015419] Lustre: DEBUG MARKER: sanity test_60g: @@@@@@ FAIL: rmdir failed
            

            It takes quite a while, but without the patch I'm unable to reproduce it at all.

            bzzz Alex Zhuravlev added a comment

            People

              Assignee: aboyko Alexander Boyko
              Reporter: aboyko Alexander Boyko
              Votes: 0
              Watchers: 5
