
Lustre 2.1 client memory usage at mount is excessive

Details


    Description

      The memory usage at mount time for lustre 2.1 appears to be significantly worse than under 1.8. In particular, it looks like slab-8192 usage has grown significantly.

      On 1.8 clients, the memory usage by lustre is maybe 1GB of memory to mount four of our filesystems.

      On 2.1 clients, the memory usage has jumped to 5GB of memory to mount the same four filesystems.

      It looks like there are 3144 oscs at this time.

      The memory pretty clearly increases with each filesystem mounted, and then reduces again at each umount. I would suspect that we have some bad new per-osc memory usage or something along those lines, or otherwise there would be more fallout.

      But this is a pretty significant loss of memory, and it means that our applications are now OOMing on the 2.1 clients. Many of the applications are very specifically tuned in their memory usage, and the loss of 4GB of memory per node is quite a problem.
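
      For rough scale, dividing the extra usage by the OSC count (a back-of-envelope from the numbers above, assuming the growth is spread roughly evenly across OSCs):

        (5GB - 1GB) / 3144 OSCs  ~=  4 * 1024 MB / 3144  ~=  1.3 MB of additional memory per OSC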


          Activity


            simmonsja James A Simmons added a comment -

            Old ticket for unsupported version

            adegremont Aurelien Degremont (Inactive) added a comment -

            It is a really good idea to reduce the memory used by Lustre stats, as it is already huge, but I would like to point out that simply replacing a per-cpu struct with a common struct is not a good idea.
            The purpose of per-cpu structs is mostly to avoid SMP/NUMA issues. There were several patches in the past that introduced per_cpu structs to reduce pressure here. As Chris already pointed out, if you change this you can increase memory contention and severely degrade performance, on servers or clients depending on where the patch applies.
            I clearly remember an issue with stats in Lustre 1.4 on a 32-core client node due to this kind of global struct, which has to be protected when modified.
            The lock value and struct values are held in a ping-pong game between all sockets, where local caches have to be invalidated and a memory barrier applied each time another socket modifies this kind of common struct.

            You need to be very careful when removing this kind of per-cpu structure and be sure it will not impact SMP performance.
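
            (As a generic illustration of the trade-off being described here — a minimal kernel-style sketch, not Lustre code — updating one shared counter bounces its cacheline between sockets, while per-cpu counters keep updates local and only pay an aggregation cost when the stats are read:)

              #include <linux/atomic.h>
              #include <linux/cpumask.h>
              #include <linux/percpu.h>

              static atomic64_t shared_count;            /* one cacheline contended by every CPU */
              static DEFINE_PER_CPU(s64, percpu_count);  /* one copy per CPU, updates stay local */

              static inline void count_event_shared(void)
              {
                      atomic64_inc(&shared_count);       /* cross-socket cacheline ping-pong */
              }

              static inline void count_event_percpu(void)
              {
                      this_cpu_inc(percpu_count);        /* no cross-CPU traffic on the hot path */
              }

              static s64 read_percpu_count(void)         /* aggregation only when stats are read */
              {
                      s64 sum = 0;
                      int cpu;

                      for_each_possible_cpu(cpu)
                              sum += per_cpu(percpu_count, cpu);
                      return sum;
              }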

            I would advise fixing this in 2 ways:

            • A first, simple version would be to make it possible to disable these stats. This offers the possibility of reclaiming all that memory for people not interested in those stats, and maybe only activating them when needed, for a short period of time (see the sketch after this list).
            • Then, find a way to reduce the memory consumption without impacting NUMA performance (we have a 128-core client node with a 2-level NUMA architecture which will suffer a lot from this kind of patch).
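
            (A hypothetical sketch of the first option — the tunable name is made up, and I am assuming the allocation goes through something like lprocfs_alloc_stats() in the ptlrpc registration path; this is not an existing Lustre option:)

              #include <linux/moduleparam.h>

              /* hypothetical module parameter gating per-opcode RPC stats */
              static bool obd_opcode_stats = true;
              module_param(obd_opcode_stats, bool, 0444);
              MODULE_PARM_DESC(obd_opcode_stats, "collect per-opcode RPC statistics");

              /* ... in the stats registration path, skip the allocation when disabled ... */
              svc_stats = obd_opcode_stats ?
                          lprocfs_alloc_stats(EXTRA_MAX_OPCODES + LUSTRE_MAX_OPCODES, 0) :
                          NULL;
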
            jaylan Jay Lan (Inactive) added a comment - - edited

            Is it safe and useful for us to cherry-pick to 2.1.3
            http://review.whamcloud.com/3246
            (LU-1282 lprocfs: reduce lprocfs stats memory use)
            until a refined version becomes available?

            We are concerned about memory usage and stack overflow issues.

            bobijam Zhenyu Xu added a comment -

            Liang has been planning to make a more complete refinement patch based upon http://review.whamcloud.com/3246, so without an urgent request I don't think we need to land it.

            pjones Peter Jones added a comment -

            Bobijam

            Am I right in thinking that only one of these patches still needs to land?

            http://review.whamcloud.com/3246
            LU-1282 lprocfs: reduce lprocfs stats memory use

            If so, could you please request reviews?

            Thanks

            Peter

            bobijam Zhenyu Xu added a comment - - edited

            updated http://review.whamcloud.com/3240

            LU-1282 lprocfs: fix multi-thread contention glitch

            • Revert the clear-stats patch committed in 8c831cb8; it is not
              multi-thread safe.
            • Protect changes to lprocfs_stats::ls_biggest_alloc_num.

            -------------------------------------------------------------------

            and http://review.whamcloud.com/3246

            LU-1282 lprocfs: reduce lprocfs stats memory use

            Move the per-cpu counters' common data out; there is no need to store
            it redundantly in every per-cpu counter.

            -------------------------------------------------------------------

            http://review.whamcloud.com/3248
            LU-1282 lprocfs: make lprocfs control a global percpu data

            This control data exists for stats-collection accuracy; it does not
            need a copy for every lprocfs stats instance. A global per-cpu array
            is good enough for correct data collection and performance
            efficiency.

            liang Liang Zhen (Inactive) added a comment - - edited

            There are many things that can be improved in lprocfs_stats to save a lot of memory:

            • ptlrpc_lprocfs_register_obd() always allocates 95 (EXTRA_MAX_OPCODES+LUSTRE_MAX_OPCODES) percpu lprocfs_counters for all obd_devices, which doesn't make sense to me; different devices can have different numbers of counters (see the rough scale estimate after this list).
              #define LUSTRE_MAX_OPCODES (OPC_RANGE(OST)  + \
                                          OPC_RANGE(MDS)  + \
                                          OPC_RANGE(LDLM) + \
                                          OPC_RANGE(MGS)  + \
                                          OPC_RANGE(OBD)  + \
                                          OPC_RANGE(LLOG) + \
                                          OPC_RANGE(SEC)  + \
                                          OPC_RANGE(SEQ)  + \
                                          OPC_RANGE(SEC)  + \
                                          OPC_RANGE(FLD)  )
              
            • lprocfs_counter can be percpu, so its size should be as small as possible, however:
              struct lprocfs_counter {
                      struct lprocfs_atomic  lc_cntl;  /* can be moved into lprocfs_stats */
                      unsigned int           lc_config; /* can be moved into lprocfs_stats */
                      __s64                  lc_count;
                      __s64                  lc_sum;
                      __s64                  lc_sum_irq;
                      __s64                  lc_min;
                      __s64                  lc_max;
                      __s64                  lc_sumsquare;
                      const char            *lc_name;   /* doesn't have to be percpu */
                      const char            *lc_units;  /* doesn't have to be percpu */
              };
              
              • this will save 32 bytes for each counter, and (32 * online_cpus) for each percpu counter, and (32 * online_cpus * N_COUNTERs) for each lprocfs_stats
              • if all counters under one lprocfs_stats always have the same lc_config (well, it's true for all our use cases), we can have different structures for different lc_config values; this is not a small change, but it will save us a lot of memory, i.e.:
                • simple counter
                • counter with min/max/sum
                • counter with min/max/sum/sumsquare
                • counter needs to be safe for softirq context
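
            For a rough sense of scale (my own back-of-envelope, assuming sizeof(struct lprocfs_counter) is around 80 bytes on x86_64 given the layout above, and a 16-core client):

              95 counters/device * ~80 bytes * 16 CPUs     ~=  ~120 KB per obd_device
              ~3144 OSCs (from the description) * ~120 KB  ~=  ~370 MB

            and that is only the per-opcode RPC stats, before the other per-device stats blocks, so the percpu counters alone can plausibly account for a large part of the reported growth.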
            bobijam Zhenyu Xu added a comment - - edited

            http://review.whamcloud.com/3240

            LU-1282 lprocfs: disable some client percpu stats data

            • Clients collect OPC stats through their mgc/mdc/osc obd devices;
              it's unnecessary to use percpu stats on the client side.
            • Revert the clear-stats patch committed in 8c831cb8; it is not
              multi-thread safe.
            • Protect changes to lprocfs_stats::ls_biggest_alloc_num.

            The major percpu data allocation is in ptlrpc_lprocfs_register(), which asks for (EXTRA_MAX_OPCODES+LUSTRE_MAX_OPCODES) items for each per-cpu block, i.e. 5 (PTLRPC ops) + 20 (OST) + 21 (MDS) + 6 (LDLM) + 6 (MGS) + 3 (OBD) + 9 (LLOG) + 3 (SEC) + 1 (SEQ) + 1 (FLD) = 75

            This patch stops these stats from being allocated for each cpu on the client side, though this could possibly affect client performance.

            I think we could also shrink lprocfs_counter by keeping its common data in a separate lprocfs_counter_header structure; those fields are essentially the same for every cpu for each type of counter.
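
            (A sketch of that split — the lprocfs_counter_header name is from the sentence above and the field names come from the struct quoted earlier; the exact layout of a landed patch may differ:)

              /* shared, read-mostly part: one instance per counter index, not per CPU */
              struct lprocfs_counter_header {
                      unsigned int     lc_config;
                      const char      *lc_name;
                      const char      *lc_units;
                      /* lc_cntl would move up into lprocfs_stats, per Liang's comment above */
              };

              /* hot per-cpu part: only the __s64 values updated on each event */
              struct lprocfs_counter {
                      __s64            lc_count;
                      __s64            lc_sum;
                      __s64            lc_sum_irq;
                      __s64            lc_min;
                      __s64            lc_max;
                      __s64            lc_sumsquare;
              };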

            bobijam Zhenyu Xu added a comment -

            Christopher,

            you are right, it's not multi-thread safe to free stat memory this way.


            morrone Christopher Morrone (Inactive) added a comment -

            b2_1 patch port tracking at http://review.whamcloud.com/3208

            Looks like the patch for b2_1 was already in progress here:

            http://review.whamcloud.com/2578


            People

              bobijam Zhenyu Xu
              morrone Christopher Morrone (Inactive)