Lustre / LU-6770

use per_cpu request pool osc_rq_pools

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.8.0

    Description

      With many OSCs, each OSC pre-allocates request pool memory at startup.
      This takes memory away from applications, especially when the
      client needs to interact with hundreds of OSTs.

      We can fix this by using a global per-CPU pool 'osc_rq_pools' instead of
      a local pool per OSC. The upper limit on the total size of requests in
      the pools is about 1 percent of total memory.

      In addition, an administrator can limit the memory usage with a module
      parameter:
      options osc osc_reqpool_mem_max=num
      The unit of num is MB, and the effective upper limit will be:
      MIN(num, 1% of total memory)
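
      For illustration, below is a minimal userspace sketch (not the actual
      Lustre kernel code) of how the effective cap can be derived from the
      module parameter and total RAM. The parameter name osc_reqpool_mem_max
      and the 1% fraction come from this ticket; the example default value
      and all helper code are assumptions.

      /* Sketch only: compute MIN(osc_reqpool_mem_max, 1% of total memory).
       * The default value below is an arbitrary example, not Lustre's default. */
      #include <stdio.h>
      #include <sys/sysinfo.h>

      /* Would come from "options osc osc_reqpool_mem_max=<num>" (in MB). */
      static unsigned long long osc_reqpool_mem_max_mb = 64;

      int main(void)
      {
              struct sysinfo si;
              unsigned long long total, one_percent, param, cap;

              if (sysinfo(&si) != 0) {
                      perror("sysinfo");
                      return 1;
              }

              total = (unsigned long long)si.totalram * si.mem_unit;
              one_percent = total / 100;
              param = osc_reqpool_mem_max_mb << 20;

              /* Effective request pool cap: MIN(num, 1% of total memory). */
              cap = param < one_percent ? param : one_percent;

              printf("total RAM     : %llu MB\n", total >> 20);
              printf("1%% of RAM     : %llu MB\n", one_percent >> 20);
              printf("effective cap : %llu MB\n", cap >> 20);
              return 0;
      }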


        Activity

          gerrit Gerrit Updater added a comment - edited

          Wrong ticket number.

          ys Yang Sheng added a comment -

          Patch landed. Closing this ticket.


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15422/
          Subject: LU-6770 osc: use global osc_rq_pool to reduce memory usage
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 44c4f47c4d1f185831d4629cc9ca5ae5f50a8e07


          ihara Shuichi Ihara (Inactive) added a comment -

          This is the memory usage on a client with a 200-OST configuration.

                          Mem usage   Slab allocation
          master          444616      136412
          master+15422    91324       64412

          Mem usage = MemFree(before mount) - MemFree(after mount)
          Slab allocation = Slab(after mount) - Slab(before mount)

          Patch 15422 significantly reduces memory usage.


          ihara Shuichi Ihara (Inactive) added a comment -

          Here are quick benchmark results on master with and without http://review.whamcloud.com/#/c/15422
          4 x OSS and one client (2 x E5-2660v3, 20 CPU cores, 128 GB memory, and 1 x FDR InfiniBand)

                          master (w/o stress)  master (w/ stress)  master+15422 (w/o stress)  master+15422 (w/ stress)
          Write (MB/sec)  5604                 4838                5702                       4846
          Read (MB/sec)   4218                 3703                4261                       3939

          Here is the IOR command line used for this test.

          # mpirun -np 10 /work/ihara/IOR -w -e -t 1m -b 26g -k -F -o /scratch1/file
          # pdsh -g oss,client "sync; echo 3 > /proc/sys/vm/drop_caches"
          # mpirun -np 10 /work/ihara/IOR -r -e -t 1m -b 26g -F -o /scratch1/file
          

          For the IOR run under stress, I generated memory pressure with the "stress" command and ran IOR under it.

          # stress --vm 10 --vm-bytes 10G
          

          No performance regression with patch 15422 so far.

          ihara Shuichi Ihara (Inactive) added a comment -

          Patch http://review.whamcloud.com/15585 abandoned. The new patch is http://review.whamcloud.com/#/c/15422/

          gerrit Gerrit Updater added a comment -

          Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/15585
          Subject: LU-6770 osc: use global osc_rq_pool to reduce memory usage
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: c22aa4d8c553e974214a27e516728d88df73663c

          ys Yang Sheng added a comment -

          Hi, Shilong,

          I think we may still need to consider the number of OSTs when we decide the mem_max parameter, unless it is bigger than a fixed number or overridden by the module parameter. What do you think about it?

          Thanks,
          Yang Sheng


          wangshilong Wang Shilong (Inactive) added a comment -

          Hi Yang Sheng,

          So maybe we can set osc_reqpool_mem_max=100MB or so by default, and it will try to allocate memory by checking
          min(100M, 1% of memory). Does this make sense to you?

          Best Regards,
          Shilong

          ys Yang Sheng added a comment -

          The patch has passed the test and review process, but Oleg has some comments, as below:

          [CST 1:53:29 PM] yang sheng: could you please give me some pointers about http://review.whamcloud.com/#/c/15422/
          [CST 1:54:01 PM] yang sheng: can it be landed, or does it still need to wait a while?
          [CST 1:54:07 PM] Oleg Drokin: why do we need a per-CPU pool there?
          [CST 1:54:28 PM] Oleg Drokin: I mean it's still an improvement, but what if I have 260 CPUs?
          [CST 1:54:40 PM] Oleg Drokin: I would imagine having a static pool of a fixed size is probably best of all
          [CST 1:57:45 PM] Oleg Drokin: I think the pool does not need to be super big. Just a fixed number of requests, something like 50 (or 100, need to see how big they are) should be enough. We only expect to use them during severe OOM anyway
          [CST 1:58:00 PM] Oleg Drokin: with perhaps a module parameter if somebody wants an override
          [CST 1:59:05 PM] yang sheng: Yes, it is reasonable.
          [CST 1:59:44 PM] yang sheng: as this patch gives, the limit is 1% of total memory.
          [CST 2:00:16 PM] yang sheng: that seems bigger than what you suggest.
          [CST 2:01:19 PM] Oleg Drokin: Yes. I feel it's really excessive. But the initial reasoning was that every OSC could have up to 32M of dirty pages and can send up to 8 (default) RPCs in flight.
          [CST 2:01:40 PM] Oleg Drokin: so every OSC had this pool in order to send the many RPCs even in OOM
          [CST 2:02:02 PM] Oleg Drokin: in reality if you have 2000 OSTs, it's unlikely you'd have dirty pages in all of them at the same time
          [CST 2:02:11 PM] Oleg Drokin: so we need to be reasonable here
          [CST 2:02:38 PM] Oleg Drokin: 1% of 1T of memory is still a cool 10G
          [CST 2:03:34 PM] yang sheng: so a fixed size is enough to handle such a situation.
          [CST 2:03:46 PM] Oleg Drokin: finding a proper number is going to be tricky, but I feel it should be on the lower side, somewhere in the tens or low hundreds for most cases, except perhaps the most extreme ones
          [CST 2:04:10 PM] Oleg Drokin: that's why having an override is important of course, with good documentation about it like I explained above
          [CST 2:05:02 PM] yang sheng: I see, got it. Thank you very much, Oleg.
          

          People

            ys Yang Sheng
            wangshilong Wang Shilong (Inactive)
            Votes: 0
            Watchers: 9
