
Lustre 2.1 client memory usage at mount is excessive

Details


    Description

      The memory usage at mount time for lustre 2.1 appears to be significantly worse than under 1.8. In particular, it looks like slab-8192 usage has grown significantly.

      On 1.8 clients, the memory usage by lustre is maybe 1GB of memory to mount four of our filesystems.

      On 2.1 clients, the memory usage has jumped to 5GB of memory to mount the same four filesystems.

      It looks like there are 3144 oscs at this time.

      The memory pretty clearly increases with each filesystem mounted, and then reduces again at each umount. I would suspect that we have some bad new per-osc memory usage or something along those lines, or otherwise there would be more fallout.

      But this is a pretty significant loss of memory, and it means that our applications are now OOMing on the 2.1 clients. Many of the applications are very specifically tuned in their memory usage, and the loss of 4GB of memory per node is quite a problem.
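
      For rough scale, dividing the extra usage by the OSC count (a back-of-envelope from the numbers above, assuming the growth is spread roughly evenly across OSCs):

        (5GB - 1GB) / 3144 OSCs  ~=  4 * 1024 MB / 3144  ~=  1.3 MB of additional memory per OSC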


          Activity


            simmonsja James A Simmons added a comment -

            Old ticket for unsupported version

            adegremont Aurelien Degremont (Inactive) added a comment -

            It is a really good idea to reduce the memory used by Lustre stats, as it is already huge, but I would like to point out that simply replacing a per-cpu struct with a common struct is not a good idea.
            The purpose of per-cpu structs is mostly to avoid SMP/NUMA issues. There were several patches in the past that introduced per_cpu structs to reduce pressure here. As Chris already pointed out, if you change this you can increase memory contention and severely degrade performance, on servers or clients depending on where the patch applies.
            I clearly remember an issue with stats in Lustre 1.4 on a 32-core client node due to this kind of global struct, which has to be protected when modified.
            The lock value and struct values are held in a ping-pong game between all sockets, where local caches have to be invalidated and a memory barrier applied each time another socket modifies this kind of common struct.

            You need to be very careful when removing this kind of per-cpu structure and be sure it will not impact SMP performance.
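
            (As a generic illustration of the trade-off being described here — a minimal kernel-style sketch, not Lustre code — updating one shared counter bounces its cacheline between sockets, while per-cpu counters keep updates local and only pay an aggregation cost when the stats are read:)

              #include <linux/atomic.h>
              #include <linux/cpumask.h>
              #include <linux/percpu.h>

              static atomic64_t shared_count;            /* one cacheline contended by every CPU */
              static DEFINE_PER_CPU(s64, percpu_count);  /* one copy per CPU, updates stay local */

              static inline void count_event_shared(void)
              {
                      atomic64_inc(&shared_count);       /* cross-socket cacheline ping-pong */
              }

              static inline void count_event_percpu(void)
              {
                      this_cpu_inc(percpu_count);        /* no cross-CPU traffic on the hot path */
              }

              static s64 read_percpu_count(void)         /* aggregation only when stats are read */
              {
                      s64 sum = 0;
                      int cpu;

                      for_each_possible_cpu(cpu)
                              sum += per_cpu(percpu_count, cpu);
                      return sum;
              }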

            I would advise fixing this in 2 ways:

            • A first, simple version would be to make it possible to disable these stats. This offers the possibility of reclaiming all that memory for people not interested in those stats, and maybe only activating them when needed, for a short period of time (see the sketch after this list).
            • Then, find a way to reduce the memory consumption without impacting NUMA performance (we have a 128-core client node with a 2-level NUMA architecture which will suffer a lot from this kind of patch).
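
            (A hypothetical sketch of the first option — the tunable name is made up, and I am assuming the allocation goes through something like lprocfs_alloc_stats() in the ptlrpc registration path; this is not an existing Lustre option:)

              #include <linux/moduleparam.h>

              /* hypothetical module parameter gating per-opcode RPC stats */
              static bool obd_opcode_stats = true;
              module_param(obd_opcode_stats, bool, 0444);
              MODULE_PARM_DESC(obd_opcode_stats, "collect per-opcode RPC statistics");

              /* ... in the stats registration path, skip the allocation when disabled ... */
              svc_stats = obd_opcode_stats ?
                          lprocfs_alloc_stats(EXTRA_MAX_OPCODES + LUSTRE_MAX_OPCODES, 0) :
                          NULL;
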
            jaylan Jay Lan (Inactive) added a comment - - edited

            Is it safe and useful for us to cherry-pick to 2.1.3
            http://review.whamcloud.com/3246
            (LU-1282 lprocfs: reduce lprocfs stats memory use)
            until a refined version becomes available?

            We are concerned about memory usage and stack overflow issues.

            bobijam Zhenyu Xu added a comment -

            Liang has been planning to make a more complete refinement patch based upon http://review.whamcloud.com/3246, so without an urgent request I don't think we need to land it.

            pjones Peter Jones added a comment -

            Bobijam

            Am I right in thinking that only one of these patches still needs to land?

            http://review.whamcloud.com/3246
            LU-1282 lprocfs: reduce lprocfs stats memory use

            If so, could you please request reviews?

            Thanks

            Peter

            bobijam Zhenyu Xu added a comment - - edited

            updated http://review.whamcloud.com/3240

            LU-1282 lprocfs: fix multi-thread contention glitch

            • Revert the clear-stats patch committed in 8c831cb8; it is not
              multi-thread safe.
            • Protect changes to lprocfs_stats::ls_biggest_alloc_num.

            -------------------------------------------------------------------

            and http://review.whamcloud.com/3246

            LU-1282 lprocfs: reduce lprocfs stats memory use

            Move the per-cpu counters' common data out; there is no need to store
            it redundantly in every per-cpu counter.

            -------------------------------------------------------------------

            http://review.whamcloud.com/3248
            LU-1282 lprocfs: make lprocfs control a global percpu data

            This control data exists for stats-collection accuracy; it does not
            need a copy for every lprocfs stats instance. A global per-cpu array
            is good enough for correct data collection and performance
            efficiency.

            liang Liang Zhen (Inactive) added a comment - - edited

            There are many things that can be improved in lprocfs_stats to save a lot of memory:

            • ptlrpc_lprocfs_register_obd() always allocates 95 (EXTRA_MAX_OPCODES+LUSTRE_MAX_OPCODES) percpu lprocfs_counters for all obd_devices, which doesn't make sense to me; different devices can have different numbers of counters (see the rough scale estimate after this list).
              #define LUSTRE_MAX_OPCODES (OPC_RANGE(OST)  + \
                                          OPC_RANGE(MDS)  + \
                                          OPC_RANGE(LDLM) + \
                                          OPC_RANGE(MGS)  + \
                                          OPC_RANGE(OBD)  + \
                                          OPC_RANGE(LLOG) + \
                                          OPC_RANGE(SEC)  + \
                                          OPC_RANGE(SEQ)  + \
                                          OPC_RANGE(SEC)  + \
                                          OPC_RANGE(FLD)  )
              
            • lprocfs_counter can be percpu, so its size should be as small as possible, however:
              struct lprocfs_counter {
                      struct lprocfs_atomic  lc_cntl;  /* can be moved into lprocfs_stats */
                      unsigned int           lc_config; /* can be moved into lprocfs_stats */
                      __s64                  lc_count;
                      __s64                  lc_sum;
                      __s64                  lc_sum_irq;
                      __s64                  lc_min;
                      __s64                  lc_max;
                      __s64                  lc_sumsquare;
                      const char            *lc_name;   /* doesn't have to be percpu */
                      const char            *lc_units;  /* doesn't have to be percpu */
              };
              
              • this will save 32 bytes for each counter, and (32 * online_cpus) for each percpu counter, and (32 * online_cpus * N_COUNTERs) for each lprocfs_stats
              • if all counters under one lprocfs_stats always have the same lc_config (well, it's true for all our use cases), we can have different structures for different lc_config values; this is not a small change, but it will save us a lot of memory, i.e.:
                • simple counter
                • counter with min/max/sum
                • counter with min/max/sum/sumsquare
                • counter needs to be safe for softirq context
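
            For a rough sense of scale (my own back-of-envelope, assuming sizeof(struct lprocfs_counter) is around 80 bytes on x86_64 given the layout above, and a 16-core client):

              95 counters/device * ~80 bytes * 16 CPUs     ~=  ~120 KB per obd_device
              ~3144 OSCs (from the description) * ~120 KB  ~=  ~370 MB

            and that is only the per-opcode RPC stats, before the other per-device stats blocks, so the percpu counters alone can plausibly account for a large part of the reported growth.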
            bobijam Zhenyu Xu added a comment - - edited

            http://review.whamcloud.com/3240

            LU-1282 lprocfs: disable some client percpu stats data

            • Clients collect OPC stats through their mgc/mdc/osc obd devices;
              it's unnecessary to use percpu stats on the client side.
            • Revert the clear-stats patch committed in 8c831cb8; it is not
              multi-thread safe.
            • Protect changes to lprocfs_stats::ls_biggest_alloc_num.

            The major percpu data allocation is in ptlrpc_lprocfs_register(), which asks for (EXTRA_MAX_OPCODES+LUSTRE_MAX_OPCODES) items for each per-cpu block, i.e. 5 (PTLRPC ops) + 20 (OST) + 21 (MDS) + 6 (LDLM) + 6 (MGS) + 3 (OBD) + 9 (LLOG) + 3 (SEC) + 1 (SEQ) + 1 (FLD) = 75

            This patch stops these stats from being allocated for each cpu on the client side, though this could possibly affect client performance.

            I think we could also shrink lprocfs_counter by keeping its common data in a separate lprocfs_counter_header structure; those fields are essentially the same for every cpu for each type of counter.
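
            (A sketch of that split — the lprocfs_counter_header name is from the sentence above and the field names come from the struct quoted earlier; the exact layout of a landed patch may differ:)

              /* shared, read-mostly part: one instance per counter index, not per CPU */
              struct lprocfs_counter_header {
                      unsigned int     lc_config;
                      const char      *lc_name;
                      const char      *lc_units;
                      /* lc_cntl would move up into lprocfs_stats, per Liang's comment above */
              };

              /* hot per-cpu part: only the __s64 values updated on each event */
              struct lprocfs_counter {
                      __s64            lc_count;
                      __s64            lc_sum;
                      __s64            lc_sum_irq;
                      __s64            lc_min;
                      __s64            lc_max;
                      __s64            lc_sumsquare;
              };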

            bobijam Zhenyu Xu added a comment -

            Christopher,

            you are right, it's not multi-thread safe to free stat memory this way.


            morrone Christopher Morrone (Inactive) added a comment -

            b2_1 patch port tracking at http://review.whamcloud.com/3208

            Looks like the patch for b2_1 was already in progress here:

            http://review.whamcloud.com/2578


            People

              bobijam Zhenyu Xu
              morrone Christopher Morrone (Inactive)