[LU-1282] Lustre 2.1 client memory usage at mount is excessive Created: 03/Apr/12  Updated: 16/Aug/16  Resolved: 16/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.4.0, Lustre 2.1.3

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Zhenyu Xu
Resolution: Won't Fix Votes: 0
Labels: llnl
Environment:

https://github.com/chaos/lustre/tree/2.1.0-24chaos


Issue Links:
Related
is related to LU-2979 sanity 133a: proc counter for mkdir o... Closed
Severity: 3
Rank (Obsolete): 4554

 Description   

The memory usage at mount time for lustre 2.1 appears to be significantly worse than under 1.8. In particular, it looks like slab-8192 usage has grown significantly.

On 1.8 clients, the memory usage by lustre is maybe 1GB of memory to mount four of our filesystems.

On 2.1 clients, the memory usage has jumped to 5GB of memory to mount the same four filesystems.

It looks like there are 3144 oscs at this time.

The memory usage pretty clearly increases with each filesystem mounted, and then drops again at each umount. I suspect that we have some bad new per-osc memory usage or something along those lines; otherwise there would be more fallout.

But this is a pretty significant loss of memory, and it means that our applications are now OOMing on the 2.1 clients. Many of the applications are very specifically tuned in their memory usage, and the loss of 4GB of memory per node is quite a problem.



 Comments   
Comment by Christopher Morrone [ 03/Apr/12 ]

Specifically, mounting the same filesystems at the same time, the 1.8 client currently has 84948 size-8192 slabs and the 2.1 client has 487811 size-8192 slabs.

Before mounting lustre, less than 1000 size-8192 slabs are used.

Comment by Christopher Morrone [ 03/Apr/12 ]

I turned on the "malloc" debug level and looked for likely memory abusers at mount time in the log. The most prominent by far is this one:

lprocfs_alloc_stats()) kmalloced 'stats->ls_percpu[i]': 7552 at <address>

Cumulatively, I saw that line about 165,000 times while mounting 4 filesystems.

It is odd that I see that line appearing in batches of 128. There are only 16 cpus on these nodes, as reported by /proc/cpuinfo. Perhaps not coincidentally, 3144 (number of oscs) * 128 = 402432. I think it is quite likely that this is the main culprit.

So I am led to believe that "cfs_num_possible_cpus()" is returning 128, when in fact we have only 16 cores (2 sockets, 8 cores each), and no hyperthreading. These are Sandy Bridge processors. Even with hyperthreading enabled (which it is not, as far as I know), I don't think we can get up to 128 processors. So that number seems odd and problematic.
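
A quick back-of-the-envelope check of those numbers (a standalone sketch using the figures above; the 7552-byte stats block rounds up to a size-8192 slab) shows how much the possible-cpu count dominates:

    /* Illustrative arithmetic only: one stats block is allocated per
     * "possible" cpu per osc, and each lands in a size-8192 slab. */
    #include <stdio.h>

    int main(void)
    {
            long oscs     = 3144;  /* oscs across the four filesystems     */
            long possible = 128;   /* what cfs_num_possible_cpus() returns */
            long online   = 16;    /* cores reported by /proc/cpuinfo      */
            long slab     = 8192;  /* size-8192 slab per stats block       */

            printf("possible cpus: %.1f GB\n",
                   (double)oscs * possible * slab / 1e9);
            printf("online cpus:   %.1f GB\n",
                   (double)oscs * online * slab / 1e9);
            return 0;
    }

That is roughly 3.3GB of slab for the osc stats blocks alone with 128 possible cpus, versus about 0.4GB if only the 16 online cpus were used, which lines up well with the slab-8192 growth reported above.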

The next most common was:

cfs_hash_buckets_realloc()) kmalloced 'new_bkts[i]': 4180 at <address>

This one in particular is unfortunately wasteful. But I only see it 20,000 times in the logs, so perhaps it is not that big of a problem.

Comment by Oleg Drokin [ 04/Apr/12 ]

Wow, that's quite a bit indeed even with 1.8

I imagine another large chunk is per-osc preallocated pools that we eventually need to convert into a common pool.

Plus, in 2.2, when you have many OSTs the clients currently assume the worst and allocate metadata requests assuming you will have 3000 stripes too, which adds to the bloat; this is also planned to be addressed.

Plus there is this obviously excessive per-cpu statistics memory usage.

Comment by Peter Jones [ 04/Apr/12 ]

Bobi

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 05/Apr/12 ]

patch tracking at http://review.whamcloud.com/2451

misc: Use present cpu numbers to save memory.

lprocfs stats data should be allocated according to the number of present
cpus instead of the number of possible cpus, which wastes a lot of memory.

The OSS minimum thread count should also be decided by the number of
present cpus.

Comment by Christopher Morrone [ 05/Apr/12 ]

That should work I believe, even with dynamic hyperthreading, which uses the kernel's "cpu hotplug" facility. It looks like "present" cpus is 32, and "active" and "online" are currently 16. So I think we are covered by using "present" instead of "possible".

I think this patch should be reviewed and landed for b2_1.

After that, I think we need to figure out the next step. This patch will reduce the usage by a factor of 4, but we'll still be using 3144*32*8k = ~800MB of memory just for osc stat structures. That is kind of ridiculous.

Why are we using per-cpu counters for these stats at all? On clients, statically allocated memory is at a premium. Could the performance benefits possibly outweigh the costs in memory? It certainly doesn't scale.

I believe that one of our monitoring efforts is now gathering osc stat information at job completion time, so I am reluctant to just disable osc stats. But maybe it is the right thing to do near term.

Slightly longer term, we should probably be aggregating stats in some way rather than per-cpu and per-osc.

Comment by Jinshan Xiong (Inactive) [ 06/Apr/12 ]

Another significant memory consumer would be the lu_object cache. Please try configuring it with a module option, for example:

options obdclass lu_cache_percent=5

to see if memory usage can become better. Typically lu_cache_percent should be an integer between 1 and 20.

Comment by Christopher Morrone [ 06/Apr/12 ]

I suggested a change in the http://review.whamcloud.com/2451 patch that led me down a path that I'll expand on here.

Rather than adding ls_num_cpu to struct lprocfs_stats, I think that we should have the cpu check its own per-cpu pointer for NULL, and allocate its memory on demand. That eliminates the need to lock the lprocfs_stats structure like we would need if we add ls_num_cpu and have a single cpu allocate multiple fields.

Then I would suggest that we have ALL per-cpu fields in the lprocfs_stats structure begin as NULL. No static allocation when the lprocfs_stats structure is created.

Then if we could figure out a way to free the per-cpu fields when user-space zeroes the stats structure (as LLNL does at the end of every user job that is run), we would very significantly reduce the total memory usage. Although now that I think it through, freeing the per-cpu fields probably implies adding the lock that I just eliminated above. Not sure that that will work.

Which makes me ask again: WHY do we need per-cpu-per-osc fields? It is flat out unscalable.

Also, when I ran this by Brian he raised an issue that I had not considered. If we are now performing allocations under the stat increment/decrement functions, we really need to make sure that we can't block. Even if we completely audit the code and assure ourselves that there are no interrupt handlers that increment stats right now, that is not something that anyone will expect. In the future this will cause problems for us.

So we should make sure that we use an atomic allocation. It's not the end of the world if the allocation fails; we've just missed incrementing a stats counter.
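
A rough kernel-style sketch of that idea (hypothetical names, not the actual Lustre structures or the patch under review) might look like this:

    #include <linux/types.h>
    #include <linux/slab.h>
    #include <linux/smp.h>

    struct my_stats {
            __s64 *st_percpu[NR_CPUS];  /* one counter array per cpu, or NULL */
    };

    /* Allocate a cpu's counter array on first use from the update path.
     * GFP_ATOMIC because updates may run under spinlocks or in softirq
     * context; cmpxchg() installs the array without locking the parent
     * structure.  The caller is assumed to have preemption disabled. */
    static __s64 *get_cpu_counters(struct my_stats *stats, unsigned int num)
    {
            unsigned int cpu = smp_processor_id();
            __s64 *cntrs = stats->st_percpu[cpu];

            if (likely(cntrs != NULL))
                    return cntrs;

            cntrs = kzalloc(num * sizeof(*cntrs), GFP_ATOMIC);
            if (cntrs == NULL)
                    return NULL;            /* drop the sample, never sleep */

            if (cmpxchg(&stats->st_percpu[cpu], NULL, cntrs) != NULL) {
                    kfree(cntrs);           /* lost a rare race on this cpu */
                    cntrs = stats->st_percpu[cpu];
            }
            return cntrs;
    }

If the GFP_ATOMIC allocation fails under memory pressure the update is simply dropped, which matches the "not the end of the world" reasoning above.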

Comment by Christopher Morrone [ 06/Apr/12 ]

options obdclass lu_cache_percent=5

That does not exist in 2.1. But I also don't understand why it would be relevant if we did have it.

Comment by Christopher Morrone [ 10/Apr/12 ]

I think that the patch you are working on is good.

I also think we need another patch that gives us a configuration option (a module option most likely) to choose to use a single stats structure rather than per-cpu. I suspect that our users would be happier with the spin lock than the loss of memory. Although without some testing we won't really know how much of an impact there is on the clients. A configuration option would make it easier for our admins to try it out.

Comment by Christopher Morrone [ 11/Apr/12 ]

Any chance that you guys can give us an option to disable per-cpu stats usage in a short period of time? Do I need to allocate one of my developers to the task?

Comment by Zhenyu Xu [ 11/Apr/12 ]

Christopher,

Is this (http://review.whamcloud.com/2515) what you need?

Comment by Christopher Morrone [ 12/Apr/12 ]

Yes. Is it safe to pull that into my tree?

Comment by Christopher Morrone [ 16/Apr/12 ]

The http://review.whamcloud.com/2515 patch needs work. I think that's the one I would like to see finished first, though.

When we try to load the module with that flag enabled we get:

2012-04-16 16:21:47 BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
2012-04-16 16:21:47 IP: [<ffffffffa03b5043>] lprocfs_read_helper+0x23/0x140 [lvfs]
Comment by Zhenyu Xu [ 16/Apr/12 ]

Does it contain LU-1245 patch (http://review.whamcloud.com/2384)? If not, I think that issue could be the culprit.

Comment by Christopher Morrone [ 16/Apr/12 ]

That wasn't in 2.1.1 (not on b2_1 at all it looks like), so we do not have it. But it does look like a likely candidate, so I'll give it a try.

Comment by Christopher Morrone [ 17/Apr/12 ]

Thanks, the LU-1245 patch does appear to address the crash.

Comment by Build Master (Inactive) [ 18/Apr/12 ]

Integrated in lustre-master #490 (i686 and x86_64; client and server; el5, el6 and sles11; inkernel and ofa)
LU-1282 lprocfs: Add a module param to disable percpu stats (Revision 46684f926792a4532030698d90d9bc59b93e225d)

Result = SUCCESS
Oleg Drokin : 46684f926792a4532030698d90d9bc59b93e225d
Files :

  • lustre/obdclass/lprocfs_status.c
Comment by Christopher Morrone [ 18/Apr/12 ]

It looks like we're deadlocking when we turn on the flag for http://review.whamcloud.com/2384. It looks like it is trying to grab the spin lock in an interrupt while already holding the lock:

2012-04-18 11:48:38  <IRQ>
2012-04-18 11:48:38  [<ffffffffa020ca80>] ? lprocfs_counter_sub+0xc0/0x100 [lvfs]
2012-04-18 11:48:38  [<ffffffffa069ada6>] ? ldlm_lock_free+0x36/0xe0 [ptlrpc]
2012-04-18 11:48:38  [<ffffffffa058788f>] ? class_handle_free_cb+0x2f/0xe0 [obdclass]
2012-04-18 11:48:38  [<ffffffff810deddd>] ? __rcu_process_callbacks+0x10d/0x330
2012-04-18 11:48:38  [<ffffffff81012b59>] ? read_tsc+0x9/0x20
2012-04-18 11:48:38  [<ffffffff810df02b>] ? rcu_process_callbacks+0x2b/0x50
2012-04-18 11:48:38  [<ffffffff81072011>] ? __do_softirq+0xc1/0x1d0
2012-04-18 11:48:38  [<ffffffff81095970>] ? hrtimer_interrupt+0x140/0x250
2012-04-18 11:48:38  [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
2012-04-18 11:48:38  [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
2012-04-18 11:48:38  [<ffffffff81071df5>] ? irq_exit+0x85/0x90
2012-04-18 11:48:38  [<ffffffff814f6ad0>] ? smp_apic_timer_interrupt+0x70/0x9b
2012-04-18 11:48:38  [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
2012-04-18 11:48:38  <EOI>
2012-04-18 11:48:38  [<ffffffff814f1002>] ? _spin_lock+0x12/0x30
2012-04-18 11:48:38  [<ffffffffa020c970>] ? lprocfs_counter_add+0x140/0x190 [lvfs]
2012-04-18 11:48:38  [<ffffffffa091ec66>] ? lov_sublock_alloc+0xc6/0x350 [lov]
2012-04-18 11:48:38  [<ffffffffa091f97c>] ? lov_lock_init_raid0+0x3fc/0xba0 [lov]
2012-04-18 11:48:38  [<ffffffffa0919a7b>] ? lov_lock_init+0x5b/0xe0 [lov]
2012-04-18 11:48:38  [<ffffffffa05aedec>] ? cl_lock_hold_mutex+0x33c/0x5c0 [obdclass]
2012-04-18 11:48:38  [<ffffffffa05b014e>] ? cl_lock_request+0x5e/0x180 [obdclass]
2012-04-18 11:48:38  [<ffffffffa09d73b8>] ? cl_io_get+0x68/0xd0 [lustre]
2012-04-18 11:48:38  [<ffffffffa09d716b>] ? cl_glimpse_lock+0x14b/0x330 [lustre]
2012-04-18 11:48:38  [<ffffffffa09d77f4>] ? cl_glimpse_size+0x124/0x130 [lustre]
2012-04-18 11:48:38  [<ffffffffa099fb6a>] ? ll_inode_revalidate_it+0x5a/0x140 [lustre]
2012-04-18 11:48:39  [<ffffffff81184155>] ? putname+0x45/0x50
2012-04-18 11:48:39  [<ffffffffa099fc99>] ? ll_getattr_it+0x49/0x170 [lustre]
2012-04-18 11:48:39  [<ffffffffa099fdf7>] ? ll_getattr+0x37/0x40 [lustre]
2012-04-18 11:48:39  [<ffffffff8117c891>] ? vfs_getattr+0x51/0x80
2012-04-18 11:48:39  [<ffffffff8126c005>] ? _atomic_dec_and_lock+0x55/0x80
2012-04-18 11:48:39  [<ffffffff8117c928>] ? vfs_fstatat+0x68/0x80
2012-04-18 11:48:39  [<ffffffff8117c9ae>] ? vfs_lstat+0x1e/0x20
2012-04-18 11:48:39  [<ffffffff8117c9d4>] ? sys_newlstat+0x24/0x50
2012-04-18 11:48:39  [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Comment by Christopher Morrone [ 18/Apr/12 ]

So we need to use the irq-disabling (irqsave) version of the spin lock.
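
In kernel terms the fix is the difference between the two locking forms in the sketch below (illustrative structure names, not the actual patch): plain spin_lock() leaves interrupts enabled, so the softirq shown in the trace above can re-enter and deadlock on the same lock, while the irqsave variant disables local interrupts for as long as the lock is held.

    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct my_stats {
            spinlock_t st_lock;
            __s64      st_cntr[1];
    };

    static void counter_add(struct my_stats *stats, int idx, long amount)
    {
            unsigned long flags;

            spin_lock_irqsave(&stats->st_lock, flags);  /* not spin_lock() */
            stats->st_cntr[idx] += amount;
            spin_unlock_irqrestore(&stats->st_lock, flags);
    }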

Comment by Christopher Morrone [ 18/Apr/12 ]

By the way, this also perfectly illustrates why we are concerned about using a non-atomic memory allocation in the increment/decrement functions.

Comment by Build Master (Inactive) [ 18/Apr/12 ]

Integrated in lustre-master #490 (x86_64; client and server; el6; inkernel and ofa)
LU-1282 lprocfs: Add a module param to disable percpu stats (Revision 46684f926792a4532030698d90d9bc59b93e225d)

Result = SUCCESS
Oleg Drokin : 46684f926792a4532030698d90d9bc59b93e225d
Files :

  • lustre/obdclass/lprocfs_status.c
Comment by Zhenyu Xu [ 18/Apr/12 ]

Thank you Christopher, will update the patch.

Comment by Christopher Morrone [ 19/Apr/12 ]

FYI, using the lprocfs_no_percpu_stats flag, our memory usage with all four filesystems mounted went from 5GB to around 1.4GB. So that's much better. Still more than we'd like, but we can live with that for a while.

I logged the malloc level again with the lprocfs_no_percpu_stats flag set, and it looks like the next obvious large consumer of memory is cfs_hash_buckets_realloc(). Out of about 360MB used when mounting one filesystem, I added up 180MB used by hashes. The hashes could be much harder to shrink, though.

As predicted by Oleg, another fairly large consumer of memory is ptlrpc_add_rqs_to_pool(). I added up about 60MB out of the total 360MB for the single filesystem mount.
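
For reference, an illustrative way to set that flag at module load time (the parameter name comes from the comment above; the module name is assumed from the patch touching lustre/obdclass/lprocfs_status.c, so check the landed patch for the exact spelling) is a modprobe option such as:

    options obdclass lprocfs_no_percpu_stats=1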

Comment by Ned Bass [ 19/Apr/12 ]

I pushed a patch here to use the irqsave version of the spinlock. We will use this for local testing.

http://review.whamcloud.com/2578

The build will fail though because I failed to get gerrit to make it dependent on http://review.whamcloud.com/2566.

Comment by Build Master (Inactive) [ 19/Apr/12 ]

Integrated in lustre-master #491 (x86_64; client and server; el6; inkernel and ofa)
LU-1282 lprocfs: Add a module param to disable percpu stats (Revision 46684f926792a4532030698d90d9bc59b93e225d)

Result = SUCCESS
Oleg Drokin : 46684f926792a4532030698d90d9bc59b93e225d
Files :

  • lustre/obdclass/lprocfs_status.c
Comment by Andreas Dilger [ 20/Apr/12 ]

Chris, the 1.4GB works out to be about 445kB/OSC, which doesn't seem excessive until you have a lot of OSCs as you do... Also, this number will go up once the percpu counters are allocated, if you don't have the nopercpu setting active.

I suspect that there are other space reductions * 3144 that could help. Every 300 bytes per OSC is one MB saved. There are quite a number of server-only fields in struct obd_device (which is a whopping 6kB) that could be moved into obd_device_target. There is e.g. 2kB per OSC of "obd_histogram" structs in client_obd that are only used when activated, and could be dynamically allocated.

Comment by Nathan Rutman [ 20/Apr/12 ]

Which makes me ask again: WHY do we need per-cpu-per-osc fields? It is flat out unscalable.

I don't see an answer to that, but my belief was that the per-cpu stats allowed them to remain in L1 cache, which can't be done if they're being shared between CPUs. I believe there was a significant performance improvement for this. Andreas should be the authority here.

Comment by Christopher Morrone [ 20/Apr/12 ]

Yes, I realize that further drops will be more difficult, and I'm not expecting any more changes than are already under way. I just wanted to record what I'd seen for posterity.

Comment by Andreas Dilger [ 21/Apr/12 ]

The per-cpu statistics were created mostly to avoid contention on the server side. That said, there is the potential for increased CPU contention on clients as cores proliferate.

Another idea that I had is that instead of rolling our own per-cpu counters, we could see whether using the kernel percpu counters and/or percpu memory allocation would reduce the memory overhead due to the L1 cacheline alignment.
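
A minimal sketch of what that could look like with the kernel's own API (illustrative only; the osc_stat names are hypothetical, and percpu_counter_init() gained an extra gfp argument in later kernels than the ones discussed here):

    #include <linux/percpu_counter.h>

    struct osc_stat {
            struct percpu_counter os_read_bytes;
    };

    static int osc_stat_init(struct osc_stat *s)
    {
            /* two-argument form as on 2.6.32-era kernels */
            return percpu_counter_init(&s->os_read_bytes, 0);
    }

    static void osc_stat_add(struct osc_stat *s, long bytes)
    {
            /* batches a per-cpu delta, folding it into the shared
             * count only when it exceeds a threshold */
            percpu_counter_add(&s->os_read_bytes, bytes);
    }

The trade-off is that percpu_counter keeps an approximate running total rather than per-counter min/max/sumsquare values, so it would not cover everything the current lprocfs counters report.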

Comment by Build Master (Inactive) [ 23/Apr/12 ]

Integrated in lustre-b2_1 #50 (i686 and x86_64; client and server; el5, el6 and sles11; inkernel and ofa)
LU-1282 lprocfs: Add a module param to disable percpu stats (Revision 68ed546e43dbc4ba31b409e9dbf8a65ef9a7f425)

Result = SUCCESS
Oleg Drokin : 68ed546e43dbc4ba31b409e9dbf8a65ef9a7f425
Files :

  • lustre/obdclass/lprocfs_status.c
Comment by Peter Jones [ 30/Apr/12 ]

Landed for 2.1.2 and 2.3

Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev #340 (i686 and x86_64; client and server; el5 and el6; inkernel)
LU-1282 lprocfs: Add a module param to disable percpu stats (Revision 46684f926792a4532030698d90d9bc59b93e225d)

Result = SUCCESS
Oleg Drokin : 46684f926792a4532030698d90d9bc59b93e225d
Files :

  • lustre/obdclass/lprocfs_status.c
Comment by Christopher Morrone [ 11/May/12 ]

Either this or Ned's patch should really land soon, since that other LU-1282 patch already landed. Anyone using that new tunable (e.g. us on Sequoia...) is going to hit client deadlocks until the fix is landed.

Comment by Peter Jones [ 14/May/12 ]

By "this" do you mean this - http://review.whamcloud.com/#change,2451?

Comment by Christopher Morrone [ 14/May/12 ]

Yes, either http://review.whamcloud.com/#change,2451 or http://review.whamcloud.com/#change,2578.

It would be nice to finally finish change 2451 and close this bug.

Comment by Peter Jones [ 04/Jun/12 ]

Landed for 2.1.2 and 2.3

Comment by Christopher Morrone [ 04/Jun/12 ]

I think we need to reopen this. The work-around that we are using to disable per-cpu stats structures on clients is, we have found, limiting the client performance quite badly (down to 500MB/s or so). That is still probably better than using gigs of RAM at mount time for most of our users, but a better fix is needed.

The fix that Whamcloud made reduces the static memory usage to something closer to 1.8 levels on systems that over-report "possible cpus", but even those levels are quite excessive.

I think we really do need to dynamically allocate and free (when stats are zeroed, which we do after every job is run) all stats structures. Either that or we need to switch to only collecting aggregate stats in a scalable fashion rather than trying to do it on a per-OSC basis.

Comment by Peter Jones [ 04/Jun/12 ]

ok Chris

Comment by Zhenyu Xu [ 04/Jun/12 ]

http://review.whamcloud.com/3026

LU-1282 lprocfs: free stat memory as it is cleared

Write "clear" into proc stats entries to free the stat memory
occupation.
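
An illustrative way to trigger that from user space (the osc path is just an example; the same applies to the other stats entries) would be:

    for f in /proc/fs/lustre/osc/*/stats; do echo clear > "$f"; done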

Comment by Peter Jones [ 11/Jun/12 ]

This latest patch has now been landed to master. Can this ticket now be closed?

Comment by Peter Jones [ 18/Jun/12 ]

ok I will reopen again if need be

Comment by Zhenyu Xu [ 28/Jun/12 ]

b2_1 patch port tracking at http://review.whamcloud.com/3208

Comment by Christopher Morrone [ 28/Jun/12 ]

http://review.whamcloud.com/3026

LU-1282 lprocfs: free stat memory as it is cleared

Write "clear" into proc stats entries to free the stat memory
occupation.

Sorry, I missed this one. Can you explain to me how that is multi-thread safe? It looks to me like anyone using those memory allocations while this one thread is freeing them will cause a kernel crash.

Comment by Christopher Morrone [ 28/Jun/12 ]

b2_1 patch port tracking at http://review.whamcloud.com/3208

Looks like the patch for b2_1 was already in progress here:

http://review.whamcloud.com/2578

Comment by Zhenyu Xu [ 29/Jun/12 ]

Christopher,

you are right, it's not multi-thread safe to free stat memory this way.

Comment by Zhenyu Xu [ 29/Jun/12 ]

http://review.whamcloud.com/3240

LU-1282 lprocfs: disable some client percpu stats data

  • Clients collect OPC stats through their mgc/mdc/osc obd devices;
    it's unnecessary to use percpu stats on the client side.
  • Revert clear stats patch committed on 8c831cb8, it is not multi-
    thread safe.
  • Should protect the change of lprocfs_stats::ls_biggest_alloc_num

The major percpu data allocation is in ptlrpc_lprocfs_register(), which asks for (EXTRA_MAX_OPCODES+LUSTRE_MAX_OPCODES) items for each cpu block, i.e. 5 (PTLRPC ops) + 20 (OST ops) + 21 (MDS ops) + 6 (LDLM ops) + 6 (MGS ops) + 3 (OBD ops) + 9 (LLOG ops) + 3 (SEC ops) + 1 (SEQ op) + 1 (FLD op) = 75.

This patch stops these stats from being allocated for each cpu on the client side, though this could possibly affect client performance.

I think we could shrink lprocfs_counter by moving its common data out into a separate lprocfs_counter_header structure; those fields are essentially the same on every cpu for each type of counter.

Comment by Liang Zhen (Inactive) [ 29/Jun/12 ]

There are many things that can be improved in lprocfs_stats to save a lot of memory:

  • ptlrpc_lprocfs_register_obd() always allocates 95 (EXTRA_MAX_OPCODES+LUSTRE_MAX_OPCODES) percpu lprocfs_counters for every obd_device, which doesn't make sense to me; different devices can have different numbers of counters.
    #define LUSTRE_MAX_OPCODES (OPC_RANGE(OST)  + \
                                OPC_RANGE(MDS)  + \
                                OPC_RANGE(LDLM) + \
                                OPC_RANGE(MGS)  + \
                                OPC_RANGE(OBD)  + \
                                OPC_RANGE(LLOG) + \
                                OPC_RANGE(SEC)  + \
                                OPC_RANGE(SEQ)  + \
                                OPC_RANGE(SEC)  + \
                                OPC_RANGE(FLD)  )
    
  • lprocfs_counter can be percpu, so its size should be as small as possible, however:
    struct lprocfs_counter {
            struct lprocfs_atomic  lc_cntl;  /* can be moved into lprocfs_stats */
            unsigned int           lc_config; /* can be moved into lprocfs_stats */
            __s64                  lc_count;
            __s64                  lc_sum;
            __s64                  lc_sum_irq;
            __s64                  lc_min;
            __s64                  lc_max;
            __s64                  lc_sumsquare;
            const char            *lc_name;   /* doesn't have to be percpu */
            const char            *lc_units;  /* doesn't have to be percpu */
    };
    
    • this will save 32 bytes for each counter, and (32 * online_cpus) for each percpu counter, and (32 * online_cpus * N_COUNTERs) for each lprocfs_stats
    • if all counters under one lprocfs_stats always have the same lc_config (which is true for all our use cases), we can use a different structure for each lc_config; this is not a small change, but it will save us a lot of memory, i.e.:
      • simple counter
      • counter with min/max/sum
      • counter with min/max/sum/sumsquare
      • counter needs to be safe for softirq context
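
A sketch of the split Liang describes (field names here are illustrative, not necessarily what later releases used): the per-counter constants move into a shared header, and only the hot values stay in each per-cpu copy.

    struct lprocfs_counter_header {         /* one per counter type       */
            unsigned int     lc_config;
            const char      *lc_name;
            const char      *lc_units;
    };

    struct lprocfs_counter {                /* one per cpu per counter    */
            __s64            lc_count;
            __s64            lc_sum;
            __s64            lc_sum_irq;
            __s64            lc_min;
            __s64            lc_max;
            __s64            lc_sumsquare;
    };

That takes the per-cpu copy from roughly 80 bytes down to 48 bytes per counter, in line with the 32-byte saving estimated above.
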
Comment by Zhenyu Xu [ 29/Jun/12 ]

updated http://review.whamcloud.com/3240

LU-1282 lprocfs: fix multi-thread contention glitch

  • Revert clear stats patch committed on 8c831cb8, it is not multi-
    thread safe.
  • Should protect the change of lprocfs_stats::ls_biggest_alloc_num

-------------------------------------------------------------------

and http://review.whamcloud.com/3246

LU-1282 lprocfs: reduce lprocfs stats memory use

Move the percpu counter common data out; there is no need to store it
redundantly in every percpu counter.

-------------------------------------------------------------------

http://review.whamcloud.com/3248
LU-1282 lprocfs: make lprocfs control a global percpu data

This control data is for stats data collecting accuracy; it does not
need a copy for every lprocfs stats instance. A global percpu array is
good enough for data collecting correctness and performance efficiency.

Comment by Peter Jones [ 26/Aug/12 ]

Bobijam

Am I right in thinking that only one of these patches still needs to land?

and http://review.whamcloud.com/3246

LU-1282 lprocfs: reduce lprocfs stats memory use

If so could you please request reviews?

Thanks

Peter

Comment by Zhenyu Xu [ 26/Aug/12 ]

Liang was planning a more complete refinement patch based upon http://review.whamcloud.com/3246, so without an urgent request I don't think we need to land it.

Comment by Jay Lan (Inactive) [ 28/Aug/12 ]

Is it safe and useful for us to cherry-pick http://review.whamcloud.com/3246 (LU-1282 lprocfs: reduce lprocfs stats memory use) to 2.1.3 until a refined version becomes available?

We are concerned about memory usage and stack overflow issues.

Comment by Aurelien Degremont (Inactive) [ 05/Dec/12 ]

It is a really good idea to reduce the memory used by Lustre stats, as it is already huge, but I would like to point out that simply replacing per-cpu structs with a common struct is not a good idea.
The purpose of per-cpu structs is mostly to avoid SMP/NUMA issues. There were several patches in the past that introduced per-cpu structs to reduce pressure here. As already pointed out by Chris, if you change this you can increase memory contention and severely drop performance, on both servers and clients depending on where the patch applies.
I clearly remember an issue with stats on a 32-core client node with Lustre 1.4, due to this kind of global struct, which has to be protected when modified.
The lock value and struct values get bounced in a ping-pong game between all the sockets, where local caches have to be invalidated and a memory barrier applied each time another socket modifies this kind of common struct.

You need to be very careful when removing this kind of per-cpu structure and be sure it will not impact SMP performance.

I would advise fixing this in two ways:

  • A first, simple version would just make it possible to disable these stats. That offers the possibility of reclaiming all of this memory for people not interested in those stats, and of activating them only when needed, for a short period of time.
  • Then, find a way to reduce the memory consumption without impacting NUMA performance (we have a 128-core client node with a two-level NUMA architecture which would suffer a lot with this kind of patch).
Comment by James A Simmons [ 16/Aug/16 ]

Old ticket for unsupported version
