[LU-685] Wide busy lock in kiblnd_pool_alloc_node Created: 14/Sep/11  Updated: 04/May/12  Resolved: 12/Dec/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.2.0, Lustre 2.1.2

Type: Bug Priority: Major
Reporter: Sebastien Buisson (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4837

 Description   

Hi,

At CEA they have experienced at least 3 hangs of an MDS in the following context:

  • most of the memory (> 99%) is allocated by the kernel, caching inodes + attributes (the system has 64 GB of memory)
  • 31 threads (out of 32 cores) are busy-waiting with the following backtrace (all are kiblnd_sd_xx)

_spin_lock
kiblnd_pool_alloc_node
kiblnd_get_idle_tx
kiblnd_check_sends
kiblnd_post_rx
kiblnd_recv
lnet_ni_recv
lnet_recv_put
lnet_parse
kiblnd_handle_rx
kiblnd_rx_complete
kiblnd_complete
kiblnd_scheduler
kernel_thread

  • 1 thread (kiblnd_sd_11 in this case) is scheduled with the following bt

schedule
cfs_schedule
kiblnd_pool_alloc_node
kiblnd_get_idle_tx
kiblnd_check_sends
kiblnd_post_rx
kiblnd_recv
lnet_ni_recv
lnet_recv_put
lnet_parse
kiblnd_handle_rx
kiblnd_rx_complete
kiblnd_complete
kiblnd_scheduler
kernel_thread

  • 1 thread (router_checker) is busy waiting with the following bt (the 32nd)

spin_lock
kiblnd_pool_alloc_node
kiblnd_get_idle_tx
kiblnd_send
lnet_ni_send
lnet_send
LNetGet
lnet_router_checker
kernel_thread

  • 1 thread (kiblnd_connd) is scheduled with the following bt (the one owning the lock?)

schedule_timeout
__down
down
ldlm_pools_shrink
ldlm_pools_cli_shrink
shrink_slab
do_try_to_free_pages
__alloc_pages_nodemask
kmem_getpages
fallback_alloc
____cache_alloc_node
__kmalloc
cfs_alloc
kiblnd_alloc_pages
kiblnd_create_tx_pool
kiblnd_pool_alloc_node
kiblnd_get_idle_tx
kiblnd_check_sends
kiblnd_check_conns
kiblnd_connd
kernel_thread

So, having #cpu threads busy-waiting on a spinlock is not ideal, especially when the owner appears to be scheduled out, waiting for something else to happen. At the moment it is not possible to verify whether the lock is the same for all threads, because the crash dump is not accessible.
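
For illustration only, the suspected pattern is roughly the following (a minimal sketch with hypothetical names and simplified logic; the real kiblnd pool code is more involved):

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct tx {
        struct list_head tx_list;
};

struct tx_pool {
        spinlock_t       tp_lock;  /* contended by every scheduler thread */
        struct list_head tp_free;  /* idle tx descriptors                 */
};

static struct tx *pool_alloc_node(struct tx_pool *p)
{
        struct tx *tx = NULL;

        /* Every kiblnd_sd_xx thread takes this lock on each rx/tx. */
        spin_lock(&p->tp_lock);
        if (!list_empty(&p->tp_free)) {
                tx = list_first_entry(&p->tp_free, struct tx, tx_list);
                list_del(&tx->tx_list);
        }
        spin_unlock(&p->tp_lock);

        if (tx != NULL)
                return tx;

        /* Pool empty: grow it. A GFP_KERNEL allocation under memory
         * pressure can enter direct reclaim (__alloc_pages_nodemask ->
         * shrink_slab -> ldlm_pools_cli_shrink) and block, which is
         * where kiblnd_connd sits in the backtrace above, while the
         * other threads keep hammering tp_lock. */
        return kmalloc(sizeof(*tx), GFP_KERNEL);
}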

Trying to work around this issue, we made a patch that allows choosing how many kiblnd threads are started via a kernel module option (see the sketch below). We set this number to 24 (for 32 cores).
The issue reproduced, this time with 24 threads spinning on a spinlock, 4 kswapd threads waiting to acquire a lock in lu_cache_shrink, and most of the other threads stuck in congestion_wait.
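
For reference, exposing such a tunable as a module option looks roughly like this (a sketch; the parameter name nscheds is hypothetical, not necessarily what our patch uses):

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical tunable: number of kiblnd scheduler threads to start;
 * 0 would mean the default of one thread per CPU. */
static int nscheds;
module_param(nscheds, int, 0444);
MODULE_PARM_DESC(nscheds, "number of kiblnd scheduler threads to start");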

Cheers,
Sebastien.



 Comments   
Comment by Alexandre Louvet [ 14/Sep/11 ]

Well, we made some progress on this one. The root issue seems to be in the reclaim path, which doesn't work. When we are in bad shape, forcing a reclaim (echo 3 > /proc/sys/vm/drop_caches) doesn't free memory. We have verified, by activating the Lustre debug log, that the number of objects doesn't change between lu_cache_shrink invocations.

After more investigation, we have verified that the assumption in the comment of lu_site_purge ("less than ten") does not hold. By scanning the first 100000 nodes (out of 22 million!) we found about 1500 objects with h->loh_ref > 0 (the first 128 were in this state, which explains why no nodes are selected during reclaim).

Our tentative workaround for this issue is to move busy objects found during the lu_site_purge scan to the tail of the LRU, but maybe not having those nodes in the LRU at all would be a better idea? (This is based on the expectation that the number of busy objects we have is normal.)

Alex.

Comment by Peter Jones [ 14/Sep/11 ]

Lai

Can you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 15/Sep/11 ]

I am not sure the inefficient lu_site_purge() can cause a system hang; IMHO it should only make the system load high and memory reclamation slow.

I will commit a patch that puts only non-referenced objects in the LRU list to speed up object reclamation, and you may verify whether it helps.

Comment by Lai Siyao [ 15/Sep/11 ]

Patch is on http://review.whamcloud.com/#change,1384

Comment by Alexandre Louvet [ 15/Sep/11 ]

My understanding is that this is not a performance problem in lu_site_purge but more of a 'design' issue.
Please correct me if I'm wrong, but my understanding is:

  • the kernel VM manager calls lu_cache_shrink with nr_to_scan set to SHRINK_BATCH (= 128)
  • lu_cache_shrink then calls lu_site_purge with nr set to (at most) nr_to_scan
  • lu_site_purge scans the first nr objects of the LRU list, skipping every object with h->loh_ref > 0

Our problem is that the first nr objects of the LRU list have h->loh_ref > 0, so lu_site_purge doesn't select any objects to be freed. Since the VM didn't make progress, it returns to the shrinker callback; as nothing moved forward, the first nr entries of the list are still busy and we are no longer able to free memory... The callback process continues again and again. At this point the system is out of memory and stops answering.

I haven't seen any mechanism that scans anything other than the first SHRINK_BATCH entries of the list.

In any case, moving busy objects out of the list should fix the problem.
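
To make the failure mode concrete, here is a simplified sketch of the scan behaviour described above (hypothetical names; not the actual lu_site_purge() code, and free_obj() stands in for the real release path):

#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/slab.h>

/* Simplified object with a refcount and an LRU linkage. */
struct obj {
        atomic_t         ref;
        struct list_head lru_link;
};

/* Hypothetical stand-in for the real object release path. */
static void free_obj(struct obj *o)
{
        kfree(o);
}

static int purge(struct list_head *lru, int nr)
{
        struct obj *o, *tmp;
        int freed = 0;

        list_for_each_entry_safe(o, tmp, lru, lru_link) {
                if (nr-- == 0)
                        break;     /* only the first nr entries
                                    * (SHRINK_BATCH in practice) are
                                    * ever looked at */
                if (atomic_read(&o->ref) > 0)
                        continue;  /* busy: skipped, but it stays at the
                                    * head of the LRU, so the next call
                                    * rescans the same busy entries */
                list_del(&o->lru_link);
                free_obj(o);
                freed++;
        }
        /* If the first nr entries are all busy, freed == 0: the VM sees
         * no progress and invokes the shrinker again and again. */
        return freed;
}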

Comment by Johann Lombardi (Inactive) [ 15/Sep/11 ]

I definitely agree with Alexandre. I looked at this code some time ago and also found that memory pressure handling on the lu_object_header slab is quite poor. Not only are in-use objects kept in the LRU, but we have one LRU list per hash bucket, and the buckets are scanned in ascending order. As a result, if some buckets have a lot of in-use objects, we still waste time scanning all those objects sitting at the end of the list before moving on to the next bucket.

While I would very much like to have in-use objects out of the LRU list, I wonder if a quick patch like this would help a bit:

diff --git a/lustre/obdclass/lu_object.c b/lustre/obdclass/lu_object.c
index 7b9a87c..60ac319 100644
--- a/lustre/obdclass/lu_object.c
+++ b/lustre/obdclass/lu_object.c
@@ -303,6 +303,14 @@ int lu_site_purge(const struct lu_env *env, struct lu_site *s, int nr)
                 cfs_hash_bd_lock(s->ls_obj_hash, &bd, 1);
                 bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, &bd);
 
+                if (bkt->lsb_busy >= cfs_hash_bd_count_get(&bd)) {
+                        /* all objects are busy, skip this bucket */
+                        cfs_hash_bd_unlock(s->ls_obj_hash, &bd, 1);
+                        continue;
+                } else if (count > 0)
+                        count = min_t(__u32, count,
+                                    cfs_hash_bd_count_get(&bd) - bkt->lsb_busy);
+
                 cfs_list_for_each_entry_safe(h, temp, &bkt->lsb_lru, loh_lru) {
                         /*
                         * Objects are sorted in lru order, and "busy"

Comment by Lai Siyao [ 15/Sep/11 ]

Hi Alexandre, looking into lu_site_purge(), I think it will try its best to reclaim 'nr' objects, not just scan 'nr' objects. Did I miss anything?

Comment by Bruno Faccini (Inactive) [ 16/Sep/11 ]

In fact, when working on it, we thought about 3 possible ways to fix or work around this issue:

  • just raise the hard-coded SHRINK_BATCH value used in shrink_slab() to invoke the shrinkers. Quite a basic solution, isn't it!

  • swap the "nr" zero-test/decrement with the busy/loh_ref test before looping back, in order to allow "nr" non-busy objects to be reclaimed. But considering that the "busy" state of objects on the LRU list may not be transient, we thought this may lead to a situation where all the busy objects progressively accumulate at the very beginning of the LRU, causing a very long search for the non-busy ones! What do you think?

  • this is why we thought a better idea might be to still scan only SHRINK_BATCH objects, but move each busy object encountered to the end of the LRU, to allow non-busy objects to be found during the next search.

Comment by Alexandre Louvet [ 16/Sep/11 ]

I don't have a clear idea of how many objects with a ref > 0 we can find, but based on our scan of the first 100000 entries, we have already found more than 1500 in this state.
Based on that, the first solution will be difficult to implement (which value?), and the second will certainly lead us to a performance issue at some point (we would have to scan a lot of entries before finding free ones).

The last one seems a better approach if we don't want to move those objects out of the LRU, no? A simple move_tail when we find loh_ref > 0 should help:

-                        if (cfs_atomic_read(&h->loh_ref) > 0)
+                        if (cfs_atomic_read(&h->loh_ref) > 0) {
+                                cfs_list_move_tail(&h->loh_lru, &s->ls_lru);
                                 continue;
+                        }

What do you think?

Comment by Lai Siyao [ 16/Sep/11 ]

Could you check out the patch I posted above? In that patch an object is placed in the LRU list when the last refcount is put, and removed from the LRU list when it becomes referenced. In this way 'busy' objects won't be scanned during reclamation. Since the LRU list is only used for object reclamation, this should be the most efficient approach.
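
The idea amounts to the following pattern (a minimal sketch with assumed names, simplified to a single global LRU protected by its own lock; not the actual patch, which works on Lustre's per-bucket structures):

#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct obj {
        atomic_t         ref;
        struct list_head lru_link;
};

static DEFINE_SPINLOCK(lru_lock);
static LIST_HEAD(lru);

/* Take a reference: an object going from 0 to 1 references leaves the
 * LRU entirely, so the shrinker never sees busy objects. */
static void obj_get(struct obj *o)
{
        if (atomic_inc_return(&o->ref) == 1) {
                spin_lock(&lru_lock);
                list_del_init(&o->lru_link);
                spin_unlock(&lru_lock);
        }
}

/* Drop a reference: only when the last reference is put does the
 * object become eligible for reclaim and go on the LRU tail. */
static void obj_put(struct obj *o)
{
        if (atomic_dec_and_test(&o->ref)) {
                spin_lock(&lru_lock);
                list_add_tail(&o->lru_link, &lru);
                spin_unlock(&lru_lock);
        }
}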

Comment by Johann Lombardi (Inactive) [ 16/Sep/11 ]

Well, in-use objects are supposed to live near the end of the list already (see lu_object_find_try()).
As long as we assume that we have to free "nr / #buckets" objects per bucket, if some buckets have fewer free objects than that, we would end up scanning the whole list. That's why I thought the patch I posted above might help a bit.

Comment by Johann Lombardi (Inactive) [ 16/Sep/11 ]

> Could you check out the patch I posted above? In that patch an object is placed in the LRU list
> when the last refcount is put, and removed from the LRU list when it becomes referenced

That's great. This is really the only clean solution IMHO.

Comment by Build Master (Inactive) [ 07/Dec/11 ]

Integrated in lustre-master #370 for the following configurations: x86_64,client,el5,ofa; x86_64,client,el5,inkernel; i686,client,el6,inkernel; x86_64,server,el6,inkernel; x86_64,client,sles11,inkernel; x86_64,server,el5,inkernel; x86_64,client,el6,inkernel; x86_64,client,ubuntu1004,inkernel; x86_64,server,el5,ofa; i686,client,el5,ofa; i686,client,el5,inkernel; i686,server,el6,inkernel; i686,server,el5,inkernel; i686,server,el5,ofa
LU-685 obdclass: lu_object reclamation is inefficient (Revision 598996a12f5818190635e1f3a948c9e6a77297b5)

Result = SUCCESS for all of the above
Oleg Drokin : 598996a12f5818190635e1f3a948c9e6a77297b5
Files :

  • lustre/obdclass/lu_object.c
Comment by Bob Glossman (Inactive) [ 02/May/12 ]

Back-port to b2_1: http://review.whamcloud.com/#change,2628
