
[LU-6365] Eliminate unnecessary loop in lu_cache_shrink to improve performance

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Severity: 3

    Description

      Customer has a test application that tries to allocate 20000 2M huge pages. After the node has been up and running for some period of time and fragmentation has occurred, the allocation takes several minutes before failing.

      # time aprun -m20000h -L 2168 -n 2 -N 2 -ss -cc cpu -j 2 -d 24 ./prog2
      > 
      > Started at 11:34:02
      > 
      >   Node   Name   Huge Page(MB)   MemFree(MB) Small0 (MB)  Small1(MB)  Huge0(MB)   Huge1(MB)  
      >    0 nid02168       40000       19908        9814        1109           0        8398
      > 
      > Allocated  19800MB
      > At address      10000080000
      > Application 48079344 is crashing. ATP analysis proceeding...
      > 
      > ATP Stack walkback for Rank 0 starting:
      >   start_thread@0x2aaaaee6f805
      >   _new_slave_entry@0x2aaaab3b0f79
      >   memcheck__cray$mt$p0001@prog2.f90:103
      >   sub_@sub.f90:5
      >   sub2_@sub2.f90:9
      >   touch_@touch.f90:1
      > ATP Stack walkback for Rank 0 done
      > Process died with signal 7: 'Bus error'
      > Forcing core dumps of ranks 0, 1
      [skip]
      > [NID 02168] 2015-03-09 11:34:13 Apid 48079344: Huge page could not be allocated.  Process terminated via bus error.
      > Application 48079344 exit codes: 139
      > Application 48079344 resources: utime ~92s, stime ~9s, Rss ~123412, inblocks ~628, outblocks ~98
      > 
      > real    2m19.012s
      > user    0m0.984s
      > sys     0m0.172s
      

      In researching the slowness, I noticed that lu_cache_shrink takes noticeably longer to execute than the other shrinkers (roughly 350 µs in the trace below, versus a few microseconds for each of the others).

      Kernel function trace (the timestamp, in seconds, is the field just before each function name):
                 <...>-35127 [013] 2965136.085636: shrink_slab <-do_try_to_free_pages
                 <...>-35127 [013] 2965136.085636: down_read_trylock <-shrink_slab
                 <...>-35127 [013] 2965136.085636: shrink_dcache_memory <-shrink_slab
                 <...>-35127 [013] 2965136.085636: shrink_icache_memory <-shrink_slab
                 <...>-35127 [013] 2965136.085636: shrink_dqcache_memory <-shrink_slab
                 <...>-35127 [013] 2965136.085637: lu_cache_shrink <-shrink_slab
                 <...>-35127 [013] 2965136.085990: enc_pools_shrink <-shrink_slab
                 <...>-35127 [013] 2965136.085990: ldlm_pools_srv_shrink <-shrink_slab
                 <...>-35127 [013] 2965136.085994: ldlm_pools_cli_shrink <-shrink_slab
                 <...>-35127 [013] 2965136.086004: up_read <-shrink_slab
                 <...>-35127 [013] 2965136.086004: _cond_resched <-shrink_slab
                 <...>-35127 [013] 2965136.086006: shrink_slab <-do_try_to_free_pages
      

      Most of the time in lu_cache_shrink is spent repeatedly getting a spinlock in lu_site_stats_get().

      >            <...>-46696 [012] 4654974.718236: cfs_hash_hd_hhead_size <-lu_site_stats_get
      >            <...>-46696 [012] 4654974.718237: _raw_spin_lock <-cfs_hash_spin_lock
      

      The lu_cache_shrink algorithm loops over the entries in lu_sites (8 in this customer's case), calling lu_site_stats_get for each to compute the number of freeable objects. lu_site_stats_get itself loops over every bucket in the ls_obj_hash (256 buckets), taking each bucket's spinlock and summing the length of its LRU list; a simplified sketch of this pattern follows.
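
      A minimal userspace sketch of that scan pattern. The struct and function names here (scan_site, scan_bucket, site_count_lru_objects, cache_shrink_count) are hypothetical stand-ins for the real lu_site / ls_obj_hash / lsb_lru_len structures, not the actual Lustre kernel code:

      #include <pthread.h>
      #include <stddef.h>

      /* Illustrative stand-ins for the real structures: each bucket of a
       * site's ls_obj_hash keeps its own LRU length behind its own lock. */
      struct scan_bucket {
          pthread_spinlock_t lock;     /* per-bucket spinlock */
          size_t             lru_len;  /* objects on this bucket's LRU list */
      };

      struct scan_site {
          struct scan_bucket buckets[256];  /* 256 buckets per site */
      };

      /* Pre-patch pattern: the freeable-object count is recomputed on every
       * call by locking and reading each of the 256 buckets in turn. */
      static size_t site_count_lru_objects(struct scan_site *s)
      {
          size_t total = 0;

          for (size_t i = 0; i < 256; i++) {
              struct scan_bucket *bkt = &s->buckets[i];

              pthread_spin_lock(&bkt->lock);   /* the lock acquisitions */
              total += bkt->lru_len;           /* seen in the trace     */
              pthread_spin_unlock(&bkt->lock);
          }
          return total;
      }

      /* The shrinker repeats that scan for every registered site (8 here),
       * i.e. 8 * 256 = 2048 lock round trips per shrink_slab pass. */
      static size_t cache_shrink_count(struct scan_site *sites, size_t nr_sites)
      {
          size_t total = 0;

          for (size_t i = 0; i < nr_sites; i++)
              total += site_count_lru_objects(&sites[i]);
          return total;
      }

      The repeated _raw_spin_lock calls under lu_site_stats_get in the trace above correspond to the per-bucket lock in this loop.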

      The debug data suggests that most of the time taken by lu_cache_shrink is spent in lu_site_stats_get. This overhead can be eliminated simply by keeping an aggregated total of the lsb_lru_len from all buckets in the lu_site struct. In other words, keep a running count of the total number of LRU objects rather than recomputing the total each time lu_cache_shrink is called; see the sketch below.
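
      A sketch of that running-count approach, again as simplified userspace C rather than the actual patch (the landed change, visible in the merge-conflict hunk quoted in the comments below, bumps bkt->lsb_lru_len and an LU_SS_LRU_LEN counter in site->ls_stats at the points where objects enter and leave a bucket's LRU). The names lru_bucket, counted_site, site_lru_add/del and cache_shrink_count are illustrative only:

      #include <stdatomic.h>
      #include <stddef.h>

      /* Simplified stand-ins: the real patch keeps bkt->lsb_lru_len per bucket
       * and an LU_SS_LRU_LEN lprocfs counter on the site; an atomic counter is
       * used here only to make the userspace sketch self-contained. */
      struct lru_bucket {
          size_t lru_len;              /* per-bucket LRU length */
      };

      struct counted_site {
          struct lru_bucket buckets[256];
          atomic_size_t     lru_len_total;  /* running total across all buckets */
      };

      /* Called where an object is added to a bucket's LRU list (the bucket
       * lock is already held there, so the per-bucket field needs no atomics). */
      static void site_lru_add(struct counted_site *s, struct lru_bucket *bkt)
      {
          bkt->lru_len++;
          atomic_fetch_add_explicit(&s->lru_len_total, 1, memory_order_relaxed);
      }

      /* Mirror image where an object is removed from a bucket's LRU list. */
      static void site_lru_del(struct counted_site *s, struct lru_bucket *bkt)
      {
          bkt->lru_len--;
          atomic_fetch_sub_explicit(&s->lru_len_total, 1, memory_order_relaxed);
      }

      /* The shrinker now answers "how many freeable objects?" with one read
       * instead of locking all 256 buckets of every site. */
      static size_t cache_shrink_count(struct counted_site *s)
      {
          return atomic_load_explicit(&s->lru_len_total, memory_order_relaxed);
      }

      Maintaining the total costs one extra counter update per LRU insertion or removal, which is trivial next to the per-call bucket scan it removes from every shrinker invocation.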

      Attachments

        Issue Links

          Activity

            [LU-6365] Eliminate unnecessary loop in lu_cache_shrink to improve performance

            amk Ann Koehler (Inactive) added a comment -

            There's no dependency on LU-6147 or LU-5722.

            (Sorry for the slow reply. I was out of the office and not reading e-mail.)

            jaylan Jay Lan (Inactive) added a comment - edited

            The conflict I had was caused by two LUs that I do not have: LU-6147 and LU-5722.
            Are these prerequisites to this LU? If there is no dependency, I can resolve the cherry-pick conflicts.
            Thank you for your help!

            amk Ann Koehler (Inactive) added a comment -

            Attached 2.5.3 patch file. If this doesn't work, give me a copy of your lustre/obdclass/lu_object.c and I'll make the necessary changes.

            jaylan Jay Lan (Inactive) added a comment -

            When trying to backport the commit to 2.5.3-fe, I got a conflict in lu_object_put():

            <<<<<<< HEAD
                    if (!lu_object_is_dying(top)) {
                            LASSERT(cfs_list_empty(&top->loh_lru));
                            cfs_list_add_tail(&top->loh_lru, &bkt->lsb_lru);
            =======
                    if (!lu_object_is_dying(top) &&
                        (lu_object_exists(orig) || lu_object_is_cl(orig))) {
                            LASSERT(list_empty(&top->loh_lru));
                            list_add_tail(&top->loh_lru, &bkt->lsb_lru);
                            bkt->lsb_lru_len++;
                            lprocfs_counter_incr(site->ls_stats, LU_SS_LRU_LEN);
                            CDEBUG(D_INODE, "Add %p to site lru. hash: %p, bkt: %p, "
                                   "lru_len: %ld\n",
                                   o, site->ls_obj_hash, bkt, bkt->lsb_lru_len);
            >>>>>>> fc2f0c3... LU-6365 obd: Eliminate hash bucket scans in lu_cache_shrink

            Could you provide me a backport? Or do I need to pick up another commit before this?
            Thanks!

            pjones Peter Jones added a comment -

            Landed for 2.8

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14066/
            Subject: LU-6365 obd: Eliminate hash bucket scans in lu_cache_shrink
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fc2f0c39d8bed5060774e3b4cca7bf13faa3692a

            People

              Assignee: wc-triage WC Triage
              Reporter: amk Ann Koehler (Inactive)
              Votes: 0
              Watchers: 9
