Details
Improvement
Resolution: Unresolved
Major
None
Upstream
3
Description
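We tested MDT performance with and without changelogs enabled and observed a large impact on modifying operations with changelogs enabled: the mean file creation rate fell from about 155,700 to about 62,600 operations per second and the mean file removal rate from about 139,900 to about 61,800, while the stat and read rates were essentially unchanged.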
pdsh -g mds 'lctl get_param mdd.*.changelog*' | dshbak -c
----------------
lmm1302
----------------
mdd.lmm13-MDT0000.changelog_deniednext=60
mdd.lmm13-MDT0000.changelog_gc=1
mdd.lmm13-MDT0000.changelog_max_idle_indexes=2097446912
mdd.lmm13-MDT0000.changelog_max_idle_time=2592000
mdd.lmm13-MDT0000.changelog_min_free_cat_entries=2
mdd.lmm13-MDT0000.changelog_min_gc_interval=3600
mdd.lmm13-MDT0000.changelog_size=1637620216
mdd.lmm13-MDT0000.changelog_striped_dir_real_pfid=0
mdd.lmm13-MDT0000.changelog_current_mask=
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC
mdd.lmm13-MDT0000.changelog_mask=
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC
mdd.lmm13-MDT0000.changelog_users=
current_index: 227636205
ID    index (idle) mask
cl3   219246813 (76)

SUMMARY rate: (of 3 iterations)
   Operation                  Max            Min           Mean        Std Dev
   ---------                  ---            ---           ----        -------
   File creation    :   64293.217      61191.096      62598.992       1282.215
   File stat        :  697756.541     690173.219     694512.173       3152.598
   File read        :  293942.923     292588.428     293054.714        615.889
   File removal     :   64479.412      57107.824      61776.225       3314.447
   Tree creation    :     169.033        145.318        154.133         10.595
   Tree removal     :      82.949         44.846         69.342         17.357
V-1: Entering PrintTimestamp...
-- finished at 11/30/2023 12:19:44 --
When we disable the changelog, performance recovers:
# cscli lustre changelog disable
lmm13-MDT0000: Deregistered changelog user #3
lmm13-MDT0001: Deregistered changelog user #3

SUMMARY rate: (of 3 iterations)
   Operation                  Max            Min           Mean        Std Dev
   ---------                  ---            ---           ----        -------
   File creation    :  158468.362     153578.205     155692.937       2048.523
   File stat        :  703378.629     665431.521     689570.511      17093.925
   File read        :  290063.030     278902.768     284177.285       4568.720
   File removal     :  141796.451     136881.639     139915.690       2163.940
   Tree creation    :     199.212        131.040        173.177         30.070
   Tree removal     :      95.939         39.770         74.663         24.871
V-1: Entering PrintTimestamp...
-- finished at 11/30/2023 12:23:45 --
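To put the two mdtest summaries side by side, the reported means can be compared directly. The following is a small sketch that uses only the Mean column from the two runs above; the dictionary and function names are illustrative and not part of any Lustre or mdtest tooling.

```python
# Mean rates copied from the two mdtest SUMMARY tables above.
MEAN_WITH_CHANGELOG = {
    "File creation": 62598.992,
    "File stat": 694512.173,
    "File read": 293054.714,
    "File removal": 61776.225,
    "Tree creation": 154.133,
    "Tree removal": 69.342,
}

MEAN_WITHOUT_CHANGELOG = {
    "File creation": 155692.937,
    "File stat": 689570.511,
    "File read": 284177.285,
    "File removal": 139915.690,
    "Tree creation": 173.177,
    "Tree removal": 74.663,
}


def relative_rates(with_cl, without_cl):
    """Return rate(with changelog) / rate(without changelog) per operation."""
    return {op: with_cl[op] / without_cl[op] for op in with_cl}


if __name__ == "__main__":
    for op, ratio in relative_rates(MEAN_WITH_CHANGELOG,
                                    MEAN_WITHOUT_CHANGELOG).items():
        # A ratio well below 1.0 means the operation is slower with changelogs.
        print(f"{op:15s} {ratio:6.3f}")
```

With these inputs, only the operations that write changelog records at a high rate (file creation and removal) drop to well under half of their baseline rate; stat and read stay within a few percent of it.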
I've taken a perf report with changelogs enabled. It looks like the lock taken in llog_cat_add_rec() is the bottleneck; the time is spent in rwsem_down_write_slowpath:
--14.94%--mdt_reint_create
          |
           --14.93%--mdt_create
                     |
                     |--14.63%--mdd_create
                     |          |
                     |          |--12.85%--mdd_changelog_ns_store
                     |          |          |
                     |          |           --12.84%--mdd_changelog_store
                     |          |                     |
                     |          |                      --12.84%--llog_add
                     |          |                                |
                     |          |                                 --12.84%--llog_cat_add_rec
                     |          |                                           |
                     |          |                                           |--12.70%--rwsem_down_write_slowpath
                     |          |                                           |          |
                     |          |                                           |          |--11.80%--osq_lock
                     |          |                                           |          |
                     |          |                                           |           --0.46%--rwsem_spin_on_owner
                     |          |                                           |
                     |          |                                            --0.12%--llog_write_rec
                     |          |                                                      |
                     |          |                                                       --0.12%--mdd_changelog_write_rec
Without changelogs, the perf report looks like this:
|--5.90%--mdd_create
           |
           |--4.74%--mdd_create_object
           |          |
           |          |--3.01%--mdd_create_object_internal
           |          |          |
           |          |           --3.01%--lod_create
           |          |                     |
           |          |                      --3.01%--lod_sub_create
           |          |                                |
           |          |                                 --3.01%--osd_create
           |          |                                           |
           |          |                                           |--2.78%--osd_mkfile.constprop.104
           |          |                                           |          |
           |          |                                           |           --2.78%--ldiskfs_create_inode
           |          |                                           |                     |
           |          |                                           |                      --2.78%--__ldiskfs_new_inode
In the Lustre llog design and implementation, adding a record to the changelog is serialized by down_write(plain_llog->lgh_lock). This is the top-level (outermost) semaphore.
The full locking sequence for adding a record looks like this:
down_write(&loghandle->lgh_lock)                  synchronizes writers
----down_write(&loghandle->lgh_last_sem)          synchronizes the write with parallel reads
--------mutex_lock(&loghandle->lgh_hdr_mutex)     protects llog header/bitmap data from concurrent update/cancel
--------dt_write_lock(env, o, 0);                 makes the header and record update atomic for remote readers
----------- write header update
--------mutex_unlock(&loghandle->lgh_hdr_mutex);
--------write a record
--------dt_write_unlock(env, o);
----up_write(&loghandle->lgh_last_sem);
up_write(&loghandle->lgh_lock);
So there is a hard limit on the rate at which records can be added to the changelog: all metadata service threads (512 or so) sleep on the top-level semaphore while a changelog record is being added, and only one of them can be the writer at any moment. This is the bottleneck.
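The effect of the outermost lock can be illustrated with a toy user-space model. This is only a sketch, not Lustre code: the class ToyPlainLlog, its counters, and the run() helper are hypothetical, and plain Python mutexes stand in for the kernel rw_semaphores and dt object lock (only the write side is modeled). It follows the lock order listed above and records how many threads are ever inside the add path at the same time.

```python
import threading


class ToyPlainLlog:
    """Toy stand-in for a plain llog handle (illustrative only)."""

    def __init__(self):
        self.lgh_lock = threading.Lock()       # models down_write(lgh_lock)
        self.lgh_last_sem = threading.Lock()   # models down_write(lgh_last_sem)
        self.lgh_hdr_mutex = threading.Lock()  # models mutex_lock(lgh_hdr_mutex)
        self.obj_lock = threading.Lock()       # models dt_write_lock(env, o, 0)
        self.header_count = 0                  # stands in for the header/bitmap
        self.records = []
        # Bookkeeping for the experiment, not part of the modeled design.
        self._stats_lock = threading.Lock()
        self.writers_inside = 0
        self.max_writers_inside = 0

    def add_rec(self, rec):
        with self.lgh_lock:                    # every writer queues here
            with self._stats_lock:
                self.writers_inside += 1
                self.max_writers_inside = max(self.max_writers_inside,
                                              self.writers_inside)
            with self.lgh_last_sem:
                self.lgh_hdr_mutex.acquire()
                self.obj_lock.acquire()
                self.header_count += 1         # write header update
                self.lgh_hdr_mutex.release()
                self.records.append(rec)       # write a record
                self.obj_lock.release()
            with self._stats_lock:
                self.writers_inside -= 1


def run(num_threads=16, recs_per_thread=200):
    """Have num_threads workers each add recs_per_thread records."""
    log = ToyPlainLlog()

    def worker(tid):
        for i in range(recs_per_thread):
            log.add_rec((tid, i))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    log = run()
    print("records written:", len(log.records))
    print("max concurrent writers:", log.max_writers_inside)
```

However many worker threads are started, the observed maximum number of concurrent writers stays at one, because every thread has to pass through the same outer lock before it can do any work on the log.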