[LU-569] make lu_object cache size adjustable Created: 05/Aug/11 Updated: 20/Jan/12 Resolved: 09/Aug/11 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Jinshan Xiong (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Issue Links: | |
| Rank (Obsolete): | 4900 |
| Description |
|
The lu_object cache is sized to consume 20% of total memory. This limits the number of clients that can be mounted on one node to about 200. We should make it adjustable so that customers can configure it to their needs. |
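As a minimal sketch of what "making it adjustable" could look like (this is illustrative only, not the actual http://review.whamcloud.com/1188 change): a Linux module parameter lets the percentage be set at load time, e.g. `modprobe obdclass lu_cache_percent=1`, or at runtime via /sys/module/.../parameters/. Only the name lu_cache_percent comes from this ticket; everything else is an assumption.

```c
/* Hypothetical sketch, not the actual LU-569 patch: expose the lu_object
 * cache sizing percentage as a writable module parameter. */
#include <linux/module.h>
#include <linux/moduleparam.h>

static unsigned int lu_cache_percent = 20;	/* historical default: ~20% of RAM */
module_param(lu_cache_percent, uint, 0644);
MODULE_PARM_DESC(lu_cache_percent,
		 "percentage of total memory the lu_object hash tables may use");

MODULE_LICENSE("GPL");
```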
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 05/Aug/11 ] |
|
patch is at: http://review.whamcloud.com/1188 |
| Comment by Andreas Dilger [ 05/Aug/11 ] |
|
In addition to just allowing Lustre to consume more memory on the client, I think it is also (or more) important to determine WHY it is consuming so much memory, and to try to reduce the actual memory used. Is it because of overly large hash tables that could be started at a small size and grown dynamically only as needed? Is it because of other large/static arrays per mountpoint? My 1.8 client consumes about 7MB after flushing the LDLM cache (lctl get_param memused). It should be fairly straightforward to run with +malloc debug for a second/third/fourth mount, dump the debug logs, parse them with lustre/tests/leakfinder.pl (which may need some fixing), and determine where all of the memory is being used. |
| Comment by Jinshan Xiong (Inactive) [ 05/Aug/11 ] |
|
The memory usage is high because a large hash table is allocated at mount time. With this patch and lu_cache_percent set to 1, I can run 1K clients on one node without any problem. I agree it would be good to have a dynamic hash table size, especially on the server side. Personally I don't think we need it on clients, because it's not desirable for clients to hold a huge number of objects. |
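To make the "prorated by memory size" point concrete, here is a small user-space C sketch (not the real lu_htable_order() code) of how a hash-table order could be derived from total RAM and a configurable percentage. The 4 GiB figure and the 64-byte per-bucket cost are assumptions for the example only.

```c
#include <stdio.h>

/* Take a percentage of total RAM, divide by an assumed per-bucket cost,
 * and round down to a power-of-two order. */
static unsigned int htable_order(unsigned long long total_ram_bytes,
                                 unsigned int cache_percent,
                                 unsigned int bucket_cost)
{
        unsigned long long budget = total_ram_bytes / 100 * cache_percent;
        unsigned long long nbuckets = budget / bucket_cost;
        unsigned int order = 0;

        while ((1ULL << (order + 1)) <= nbuckets)
                order++;
        return order;
}

int main(void)
{
        unsigned long long ram = 4ULL << 30;    /* assume 4 GiB of RAM */

        /* 20%: the historical default; 1%: the value used in the comment above. */
        printf("order at 20%%: %u\n", htable_order(ram, 20, 64));
        printf("order at  1%%: %u\n", htable_order(ram,  1, 64));
        return 0;
}
```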
| Comment by Liang Zhen (Inactive) [ 06/Aug/11 ] |
|
The reason we don't allow rehashing of lu_site, especially on the server side, is that enabling "rehash" (by passing the CFS_HASH_REHASH flag to cfs_hash_create()) requires a single rwlock to protect the whole hash table, which would be significant overhead for such a highly contended hash table. |
| Comment by Andreas Dilger [ 06/Aug/11 ] |
|
Liang, is this needed also for a hash table that can only grow? Probably yes, but just to confirm. Is the improved hash table code already landed on master? Unfortunately (I think) there is no way to know, at the time the lu_cache is set up, whether there is going to be a server or only a client on that node. I also assume that it is not possible/safe to share the lu_cache on the client between mountpoints. I wonder if we might have some scalable method for hash table resize that does not need a single rwlock for the whole table? One option is to implement rehash as two independent hash tables; as long as the migration of entries from the old table to the new table is done while locking both the source and target bucket, it should be transparent to the users, and have relatively low contention (only two of all the buckets are locked at any one time). |
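A rough user-space sketch of the two-table idea above, using one pthread mutex per bucket instead of Lustre's cfs_hash locking; all names are illustrative, and only the source-bucket/target-bucket locking pattern is the point.

```c
#include <pthread.h>
#include <stdlib.h>

struct node {
        unsigned long key;
        struct node *next;
};

struct bucket {
        pthread_mutex_t lock;
        struct node *head;
};

struct table {
        unsigned int nbuckets;
        struct bucket *buckets;
};

static struct table *table_alloc(unsigned int nbuckets)
{
        struct table *t = malloc(sizeof(*t));
        unsigned int i;

        t->nbuckets = nbuckets;
        t->buckets = calloc(nbuckets, sizeof(*t->buckets));
        for (i = 0; i < nbuckets; i++)
                pthread_mutex_init(&t->buckets[i].lock, NULL);
        return t;
}

/* Move every entry of old bucket "i" into the larger table, holding only
 * the source bucket lock and one target bucket lock at a time, so lookups
 * on all other buckets proceed unblocked. */
static void migrate_bucket(struct table *old, unsigned int i, struct table *grown)
{
        struct bucket *src = &old->buckets[i];

        pthread_mutex_lock(&src->lock);
        while (src->head != NULL) {
                struct node *n = src->head;
                struct bucket *dst = &grown->buckets[n->key % grown->nbuckets];

                src->head = n->next;
                pthread_mutex_lock(&dst->lock);
                n->next = dst->head;
                dst->head = n;
                pthread_mutex_unlock(&dst->lock);
        }
        pthread_mutex_unlock(&src->lock);
}

int main(void)
{
        struct table *old = table_alloc(4), *grown = table_alloc(8);
        unsigned long k;

        for (k = 0; k < 16; k++) {              /* populate the old table */
                struct node *n = malloc(sizeof(*n));
                struct bucket *b = &old->buckets[k % old->nbuckets];

                n->key = k;
                n->next = b->head;
                b->head = n;
        }
        for (k = 0; k < old->nbuckets; k++)     /* grow: 4 -> 8 buckets */
                migrate_bucket(old, k, grown);
        return 0;
}
```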
| Comment by Liang Zhen (Inactive) [ 07/Aug/11 ] |
|
Andreas, yes, it's only for hash tables that can grow, and it has been on master for a while. Another thing to note is that lu_site does not use the high-level cfs_hash APIs like cfs_hash_find/add/del, which hide the cfs_hash locks; lu_site refers directly to the cfs_hash locks and low-level bucket APIs, so that it can use those hash locks to protect its own data, for example counters, the LRU for the shrinker, some waitqs, etc. This means we would need to make some changes to lu_site if we want to enable rehash. I think there is another option to support growing the lu_site: we could have multiple cfs_hash tables per lu_site, e.g. 64 hash tables, and hash objects to the different tables (see the sketch below). Any of these hash tables can grow when necessary, so we don't need to worry about a "big rehash" with millions of elements, and a global lock wouldn't be an issue either because there are many of these tables. BTW: shouldn't the caller of lu_site_init() know which stack (server/client) the lu_site is created for? If so, can we just pass in a flag or similar to indicate that the client stack should use a smaller hash table? |
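A minimal sketch of the "many small tables" idea, in plain C rather than cfs_hash: the site holds a fixed array of sub-tables, the object hash picks a sub-table, and each sub-table grows independently under its own lock, so a rehash only ever touches 1/64th of the objects. The count of 64 comes from the comment above; everything else is an assumption.

```c
#include <pthread.h>
#include <stdlib.h>

#define SITE_NR_SUBTABLES 64

struct node {
        unsigned long key;
        struct node *next;
};

struct subtable {
        pthread_mutex_t  lock;      /* protects this sub-table only */
        struct node    **buckets;   /* chained buckets, resized independently */
        unsigned int     nbuckets;
};

struct site {
        struct subtable subs[SITE_NR_SUBTABLES];
};

/* First level: high bits of the hash pick the sub-table;
 * second level: low bits pick a bucket inside it. */
static struct subtable *site_subtable(struct site *s, unsigned long hash)
{
        return &s->subs[(hash >> 26) % SITE_NR_SUBTABLES];
}

/* Grow a single sub-table: only its own lock is taken, the other 63
 * sub-tables keep serving lookups while this one is rehashed. */
static void subtable_grow(struct subtable *st)
{
        unsigned int new_nbuckets = st->nbuckets * 2;
        struct node **new_buckets = calloc(new_nbuckets, sizeof(*new_buckets));
        unsigned int i;

        pthread_mutex_lock(&st->lock);
        for (i = 0; i < st->nbuckets; i++) {
                while (st->buckets[i] != NULL) {
                        struct node *n = st->buckets[i];

                        st->buckets[i] = n->next;
                        n->next = new_buckets[n->key % new_nbuckets];
                        new_buckets[n->key % new_nbuckets] = n;
                }
        }
        free(st->buckets);
        st->buckets = new_buckets;
        st->nbuckets = new_nbuckets;
        pthread_mutex_unlock(&st->lock);
}
```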
| Comment by Jinshan Xiong (Inactive) [ 08/Aug/11 ] |
|
Rehash is way too complex for me. Yes, we can add a parameter to lu_site_init() so that client and server can have different hash table sizes. However, I'm afraid we still need a way to configure it for special needs - for example, I have to mount 1K mountpoints to test the scalability of IR. |
| Comment by Andreas Dilger [ 08/Aug/11 ] |
|
If lu_site_init() picks a very small hash table size for clients, say 4096 entries, does that prevent you from mounting 1K clients on a single node? Is the lu_cache hash table the only significant memory user for each client mount? How much memory does the lu_cache hash table use on a server if it uses the default lu_htable_order() value? |
| Comment by Jinshan Xiong (Inactive) [ 08/Aug/11 ] |
|
If the table can be as small as 4096 entries, I think that is absolutely fine. I don't know exactly how much memory it consumes - it is prorated by memory size - but after I changed lu_cache_percent from 20 to 1, I could mount 1K mountpoints, whereas before it was 200 at most. |
| Comment by Liang Zhen (Inactive) [ 09/Aug/11 ] |
|
It's kind of off-topic, but I think we can improve cfs_hash to support rehash-in-bucket in the future:
|
| Comment by Jinshan Xiong (Inactive) [ 09/Aug/11 ] |
|
It will help, but if you are using an evenly distributed hash function, I would say the times at which the first-level buckets need to be rehashed will be really close to each other. |
| Comment by Liang Zhen (Inactive) [ 09/Aug/11 ] |
|
Yes, they should be close, but that doesn't matter if they are handled by different threads on different CPUs, instead of hogging one thread on one CPU for seconds. BTW: although it is not fully tested, I remember the new cfs_hash can also support a non-blocking "shrink" of the hash table; we should probably test and enable that in the future. |
| Comment by Jinshan Xiong (Inactive) [ 09/Aug/11 ] |
|
I'll use this patch for IR testing only. |
| Comment by Build Master (Inactive) [ 03/Oct/11 ] |
|
Integrated in Oleg Drokin : c8d7c99ec50c81a33eea43ed1c535fa4d65cef23
|