[LU-685] Wide busy lock in kiblnd_pool_alloc_node Created: 14/Sep/11 Updated: 04/May/12 Resolved: 12/Dec/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4837 |
| Description |
|
Hi,

At CEA they have experienced at least 3 hangs of an MDS in the following context (relevant stack frames):

_spin_lock
schedule
spin_lock
schedule_timeout

So, having #cpu threads busy-waiting on a spinlock is not ideal, especially when the owner seems to be scheduled, waiting for something else to happen. At the moment it is not possible to verify whether the lock is the same for all threads, because the crash dump is not accessible. To work around this issue, we made a patch that allows choosing how many kiblnd threads are started, via a kernel module option. We set this number to 24 (for 32 cores).

Cheers, |
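A minimal, hypothetical sketch of the kind of module option described above (the parameter name, default and helper are illustrative assumptions, not the actual CEA patch):

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/cpumask.h>

/* Hypothetical tunable: number of kiblnd scheduler threads to start.
 * 0 keeps the old behaviour of one thread per CPU.  The name and the
 * default are assumptions made for illustration only. */
static int kiblnd_nscheds = 0;
module_param(kiblnd_nscheds, int, 0444);
MODULE_PARM_DESC(kiblnd_nscheds,
                 "number of kiblnd scheduler threads (0 = one per CPU)");

/* At startup the thread count would then be chosen roughly like this
 * (e.g. kiblnd_nscheds=24 on a 32-core MDS, as in the workaround). */
static int kiblnd_thread_count(void)
{
        return kiblnd_nscheds > 0 ? kiblnd_nscheds : num_online_cpus();
}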
| Comments |
| Comment by Alexandre Louvet [ 14/Sep/11 ] |
|
Well, we made some progress on this one. The root issue seems to be in the reclaim path, which doesn't work. When we are in bad shape, forcing a reclaim (echo 3 > /proc/sys/vm/drop_caches) doesn't free memory. We have verified, by activating the Lustre debug log, that the number of objects doesn't change between lu_cache_shrink invocations. After more investigation, we have verified that the assertion in the comment of lu_site_purge ("less than ten") does not hold: by scanning the first 100000 nodes (out of 22 million!) we found about 1500 objects with h->loh_ref > 0, and the first 128 were in this state, which explains why no nodes are selected during reclaim. Our tentative workaround is to move busy objects found during the lu_site_purge scan to the tail of the lru, but maybe not having those nodes in the lru at all would be a better idea? (Based on the expectation that the number of busy objects we have is normal.) Alex. |
| Comment by Peter Jones [ 14/Sep/11 ] |
|
Lai, can you please look into this one? Thanks, Peter |
| Comment by Lai Siyao [ 15/Sep/11 ] |
|
I am not sure the inefficient lu_site_purge() will cause a system hang; IMHO it should only make the system load high and memory reclamation slow. I will commit a patch to put only non-referenced objects in the lru list to speed up object reclamation, and you may verify whether it helps. |
| Comment by Lai Siyao [ 15/Sep/11 ] |
|
Patch is on http://review.whamcloud.com/#change,1384 |
| Comment by Alexandre Louvet [ 15/Sep/11 ] |
|
My understanding is that this is not a performance problem of lu_site_purge but more of a 'design' issue.
Our problem is that the first nr objects of the lru list have h->loh_ref > 0, so lu_site_purge doesn't select any objects for freeing. Since the VM didn't make progress, it returns to the callback logic, and as nothing has moved forward, the first nr entries of the list are still busy and we are no longer able to free memory... the callback process continues again and again. At this point the system is out of memory and stops answering. I haven't seen any code path that scans anything other than the first SHRINK_BATCH entries of the list. In any case, moving busy objects out of the list should fix the problem. |
| Comment by Johann Lombardi (Inactive) [ 15/Sep/11 ] |
|
I definitely agree with Alexandre. I looked at this code some time ago and also found that memory pressure handling on the lu_object_header slab is quite poor. Not only are in-use objects kept in the LRU, but we have one LRU list per hash bucket, and the buckets are scanned in ascending order. As a result, if some buckets have a lot of in-use objects, we still waste time scanning all those objects sitting at the end of the list before moving on to the next bucket. While I would very much like to have in-use objects out of the LRU list, I wonder if a quick patch like this would help a bit:

diff --git a/lustre/obdclass/lu_object.c b/lustre/obdclass/lu_object.c
index 7b9a87c..60ac319 100644
--- a/lustre/obdclass/lu_object.c
+++ b/lustre/obdclass/lu_object.c
@@ -303,6 +303,14 @@ int lu_site_purge(const struct lu_env *env, struct lu_site *s, int nr)
                 cfs_hash_bd_lock(s->ls_obj_hash, &bd, 1);
                 bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, &bd);
+                if (bkt->lsb_busy >= cfs_hash_bd_count_get(&bd)) {
+                        /* all objects are busy, skip this bucket */
+                        cfs_hash_bd_unlock(s->ls_obj_hash, &bd, 1);
+                        continue;
+                } else if (count > 0)
+                        count = min_t(__u32, count,
+                                      cfs_hash_bd_count_get(&bd) - bkt->lsb_busy);
+
                 cfs_list_for_each_entry_safe(h, temp, &bkt->lsb_lru, loh_lru) {
                         /*
                          * Objects are sorted in lru order, and "busy"
|
| Comment by Lai Siyao [ 15/Sep/11 ] |
|
Hi Alexandre, looking into lu_site_purge(), I think it will try its best to reclaim 'nr' objects, not just scan 'nr' objects. Did I miss anything? |
| Comment by Bruno Faccini (Inactive) [ 16/Sep/11 ] |
|
In fact, when working on it, we thought about 3 possible ways to fix/work around this issue:
_ just/simply raise the "SHRINK_BATCH" hard-coded value used in shrink_slab() to invoke the shrinkers. Quite a basic solution, isn't it!
_ swap the "nr" 0-test/decrement and the busy/loh_ref test before looping back, in order to allow "nr" non-busy objects to be reclaimed. But considering that the "busy" state of objects on the LRU list may not be transient, we thought this could lead to a situation where all the busy objects progressively end up linked at the very beginning of the LRU, causing a very/too long search for the non-busy ones!... What do you think?
_ this is why we thought that maybe a better idea would be to still work on/parse only "SHRINK_BATCH" objects but move each busy object encountered to the end of the LRU, to allow non-busy objects to be found during the next search... |
| Comment by Alexandre Louvet [ 16/Sep/11 ] |
|
I don't have a clear idea of how many objects with a ref > 0 we can find, but based on our scan of the first 100000 entries, we have already found more than 1500 in this state. The latter seems to be the better approach if we don't want to move those objects out of the lru, no? A simple move_tail when we find loh_ref > 0 should help:
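Something along these lines, as a purely hypothetical sketch (not a tested patch) of what that move_tail could look like inside the existing lu_site_purge() scan; h, temp and bkt are the variables already used there, and the libcfs list/atomic helpers are assumed to behave like their Linux list.h/atomic counterparts:

/* Park busy objects on a temporary list while scanning, then requeue
 * them at the tail of the bucket LRU, so that the next shrinker pass
 * starts with idle objects.  All of this runs under the bucket lock,
 * as the existing loop does. */
cfs_list_t busy;

CFS_INIT_LIST_HEAD(&busy);
cfs_list_for_each_entry_safe(h, temp, &bkt->lsb_lru, loh_lru) {
        if (cfs_atomic_read(&h->loh_ref) > 0) {
                /* busy: set it aside instead of rescanning it next time */
                cfs_list_move_tail(&h->loh_lru, &busy);
                continue;
        }
        /* ... existing selection of idle objects for freeing ... */
}
/* put the busy objects back, at the end of the LRU */
while (!cfs_list_empty(&busy))
        cfs_list_move_tail(busy.next, &bkt->lsb_lru);

Parking the busy entries on a side list, rather than moving them straight to the LRU tail, keeps the _safe iteration from walking into the entries it has just relocated.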
|
| Comment by Lai Siyao [ 16/Sep/11 ] |
|
Could you check out the patch I posted above? In that patch an object is placed in the lru list when the last refcount is put, and removed from the lru list when it's referenced. In this way 'busy' objects won't be scanned during reclamation. Since the lru list is only used for object reclamation, this should be the most efficient. |
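For illustration, a hedged sketch of that idea (not the literal change 1384; field names are the ones already discussed in this ticket, and the bucket lock is assumed to be held around both paths):

/* An object sits on the bucket LRU only while it is idle, so the
 * purge scan never sees busy objects.  Illustrative only -- see
 * http://review.whamcloud.com/#change,1384 for the real patch. */

/* when the last reference is dropped (lu_object_put() path): */
if (cfs_atomic_dec_and_test(&h->loh_ref))
        cfs_list_add_tail(&h->loh_lru, &bkt->lsb_lru);

/* when the object is found and referenced again (lookup path): */
if (cfs_atomic_read(&h->loh_ref) == 0)
        cfs_list_del_init(&h->loh_lru);
cfs_atomic_inc(&h->loh_ref);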
| Comment by Johann Lombardi (Inactive) [ 16/Sep/11 ] |
|
Well, in-use objects are supposed to live near the end of the list already (see lu_object_find_try()). |
| Comment by Johann Lombardi (Inactive) [ 16/Sep/11 ] |
|
> Could you check out the patch I posted above? In that patch object will be placed in lru list

That's great. This is really the only clean solution IMHO. |
| Comment by Build Master (Inactive) [ 07/Dec/11 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Bob Glossman (Inactive) [ 02/May/12 ] |
|
http://review.whamcloud.com/#change,2628 |