Lustre / LU-6770

use per_cpu request pool osc_rq_pools

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.8.0

    Description

      With many OSCs, each OSC pre-allocates request pool memory at startup.
      This takes memory away from applications, especially when the
      client needs to interact with hundreds of OSTs.

      We can fix this by using a global per-CPU pool 'osc_rq_pools' instead of
      a local pool per OSC. The upper limit on the total size of requests in
      the pools is about 1 percent of total memory.

      In addition, an administrator can limit the memory usage with a module
      parameter:
      options osc osc_reqpool_mem_max=num
      The unit of num is MB, and the effective upper limit will be:
      MIN(num, 1% of total memory)
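
      For illustration, below is a minimal userspace sketch (not the actual
      Lustre kernel code) of how the effective cap can be derived from the
      module parameter and total RAM. The parameter name osc_reqpool_mem_max
      and the 1% fraction come from this ticket; the example default value
      and all helper code are assumptions.

      /* Sketch only: compute MIN(osc_reqpool_mem_max, 1% of total memory).
       * The default value below is an arbitrary example, not Lustre's default. */
      #include <stdio.h>
      #include <sys/sysinfo.h>

      /* Would come from "options osc osc_reqpool_mem_max=<num>" (in MB). */
      static unsigned long long osc_reqpool_mem_max_mb = 64;

      int main(void)
      {
              struct sysinfo si;
              unsigned long long total, one_percent, param, cap;

              if (sysinfo(&si) != 0) {
                      perror("sysinfo");
                      return 1;
              }

              total = (unsigned long long)si.totalram * si.mem_unit;
              one_percent = total / 100;
              param = osc_reqpool_mem_max_mb << 20;

              /* Effective request pool cap: MIN(num, 1% of total memory). */
              cap = param < one_percent ? param : one_percent;

              printf("total RAM     : %llu MB\n", total >> 20);
              printf("1%% of RAM     : %llu MB\n", one_percent >> 20);
              printf("effective cap : %llu MB\n", cap >> 20);
              return 0;
      }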


        Activity

          gerrit Gerrit Updater added a comment - edited

          Wrong ticket number.

          ys Yang Sheng added a comment -

          Patch landed. Closing this ticket.


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15422/
          Subject: LU-6770 osc: use global osc_rq_pool to reduce memory usage
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 44c4f47c4d1f185831d4629cc9ca5ae5f50a8e07


          ihara Shuichi Ihara (Inactive) added a comment -

          This is the memory usage on a client with a 200-OST configuration.

                          Mem usage   Slab allocation
          master          444616      136412
          master+15422    91324       64412

          Mem usage = MemFree(before mount) - MemFree(after mount)
          Slab allocation = Slab(after mount) - Slab(before mount)

          Patch 15422 significantly reduces memory usage.


          ihara Shuichi Ihara (Inactive) added a comment -

          Here are quick benchmark results on master with and without http://review.whamcloud.com/#/c/15422
          4 x OSS and one client (2 x E5-2660v3, 20 CPU cores, 128 GB memory, and 1 x FDR InfiniBand)

                          master (w/o stress)  master (w/ stress)  master+15422 (w/o stress)  master+15422 (w/ stress)
          Write (MB/sec)  5604                 4838                5702                       4846
          Read (MB/sec)   4218                 3703                4261                       3939

          Here is the IOR command line used for this test.

          # mpirun -np 10 /work/ihara/IOR -w -e -t 1m -b 26g -k -F -o /scratch1/file
          # pdsh -g oss,client "sync; echo 3 > /proc/sys/vm/drop_caches"
          # mpirun -np 10 /work/ihara/IOR -r -e -t 1m -b 26g -F -o /scratch1/file
          

          For the IOR run under stress, I generated memory pressure with the "stress" command and ran IOR under it.

          # stress --vm 10 --vm-bytes 10G
          

          No performance regression with patch 15422 so far.

          ihara Shuichi Ihara (Inactive) added a comment -

          Patch http://review.whamcloud.com/15585 abandoned. The new patch is http://review.whamcloud.com/#/c/15422/

          gerrit Gerrit Updater added a comment -

          Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/15585
          Subject: LU-6770 osc: use global osc_rq_pool to reduce memory usage
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: c22aa4d8c553e974214a27e516728d88df73663c

          ys Yang Sheng added a comment -

          Hi, Shilong,

          I think we may still need to consider the number of OSTs when we decide the mem_max parameter, unless it is bigger than a fixed number or overridden by the module parameter. What do you think about it?

          Thanks,
          Yang Sheng


          wangshilong Wang Shilong (Inactive) added a comment -

          Hi Yang Sheng,

          So maybe we can set osc_reqpool_mem_max=100MB or so by default, and it will try to allocate memory by checking
          min(100M, 1% of memory). Does this make sense to you?

          Best Regards,
          Shilong

          ys Yang Sheng added a comment -

          The patch has passed the test and review process, but Oleg has some comments, as below:

          [CST 1:53:29 PM] yang sheng: could you please give me some pointers about http://review.whamcloud.com/#/c/15422/
          [CST 1:54:01 PM] yang sheng: can it be landed, or does it still need to wait a while?
          [CST 1:54:07 PM] Oleg Drokin: why do we need a per-CPU pool there?
          [CST 1:54:28 PM] Oleg Drokin: I mean it's still an improvement, but what if I have 260 CPUs?
          [CST 1:54:40 PM] Oleg Drokin: I would imagine having a static pool of a fixed size is probably best of all
          [CST 1:57:45 PM] Oleg Drokin: I think the pool does not need to be super big. Just a fixed number of requests, something like 50 (or 100, need to see how big they are) should be enough. We only expect to use them during severe OOM anyway
          [CST 1:58:00 PM] Oleg Drokin: with perhaps a module parameter if somebody wants an override
          [CST 1:59:05 PM] yang sheng: Yes, it is reasonable.
          [CST 1:59:44 PM] yang sheng: as this patch gives, the limit is 1% of total memory.
          [CST 2:00:16 PM] yang sheng: that seems bigger than what you suggest.
          [CST 2:01:19 PM] Oleg Drokin: Yes. I feel it's really excessive. But the initial reasoning was that every OSC could have up to 32M of dirty pages and can send up to 8 (default) RPCs in flight.
          [CST 2:01:40 PM] Oleg Drokin: so every OSC had this pool in order to send the many RPCs even in OOM
          [CST 2:02:02 PM] Oleg Drokin: in reality if you have 2000 OSTs, it's unlikely you'd have dirty pages in all of them at the same time
          [CST 2:02:11 PM] Oleg Drokin: so we need to be reasonable here
          [CST 2:02:38 PM] Oleg Drokin: 1% of 1T of memory is still a cool 10G
          [CST 2:03:34 PM] yang sheng: so a fixed size is enough to handle such a situation.
          [CST 2:03:46 PM] Oleg Drokin: finding a proper number is going to be tricky, but I feel it should be on the lower side, somewhere in the tens or low hundreds for most cases, except perhaps the most extreme ones
          [CST 2:04:10 PM] Oleg Drokin: that's why having an override is important of course, with good documentation about it like I explained above
          [CST 2:05:02 PM] yang sheng: I see, got it. Thank you very much, Oleg.
          

          People

            ys Yang Sheng
            wangshilong Wang Shilong (Inactive)
            Votes: 0
            Watchers: 9
