Details

    • Improvement
    • Resolution: Fixed
    • Minor

    Description

      lru_max_age has been set to 65 minutes for ages. That is a very long time, and it makes clients keep lots of LDLM locks in cache, and therefore more data in cache. The only reason for this that I'm aware of is to help login nodes avoid requesting the same locks too often.

      I know a lot of sites tune this value to something much smaller. Compute nodes usually have a more aggressive value, and the LRU cache is cleared between jobs anyway.

      I think it would be valuable to change the default to a much more reasonable value, so that most users do not have to decrease it themselves. I think going down to 5 minutes would be a good move.

      If login nodes have to re-enqueue some locks every 5 minutes, I think that is not a problem at all.

      What do you think of that?
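      As a rough, hypothetical illustration of the proposal (not Lustre code; the 10 locks/s enqueue rate is an invented number): for use-once locks, the steady-state number of cached locks scales roughly linearly with lru_max_age, so reducing the default from 65 minutes to 5 minutes would shrink the cache by a factor of 13 under a constant workload:

```python
# Back-of-envelope estimate: steady-state cached locks ~= enqueue_rate * lru_max_age
# (use-once locks only; assumes no lru_size cap and no dynamic pool pressure).

def steady_state_locks(enqueue_rate_per_s: float, lru_max_age_s: int) -> float:
    """Little's law: resident locks = arrival rate * residence time."""
    return enqueue_rate_per_s * lru_max_age_s

current = steady_state_locks(10, 65 * 60)   # current default: 65 minutes
proposed = steady_state_locks(10, 5 * 60)   # proposed default: 5 minutes

print(current, proposed, current / proposed)  # → 39000.0 3000.0 13.0
```

      The same linear scaling also bounds the extra enqueue traffic: a lock that is genuinely reused every few minutes is re-enqueued at most once per lru_max_age interval.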

          Activity

            [LU-14517] Decrease default lru_max_age value

            adilger Andreas Dilger added a comment - I'm going to close this as a duplicate of LU-17428, which has a patch close to landing to reduce the default to lru_max_age=600s.
            adegremont_nvda Aurelien Degremont added a comment (edited) - Getting addressed (without LRU/ARC) in LU-17428.

            degremoa Aurelien Degremont (Inactive) added a comment -

            I was not arguing against the advantages of replacing the LRU algorithm with something else like ARC; I think that makes sense.

            My point was only this: 65 minutes is too big a default value in my opinion, because it makes clients keep unused locks for that long. If we reduce it to, say, 10 minutes, the lock volume will be smaller and the lock callback traffic will be smaller (fewer lock conflicts, fewer evictions due to lock callbacks), at the price of increased lock enqueue traffic, which I suspect will be small. I'm curious to know the p90 lock age on most clients.

            adilger Andreas Dilger added a comment -

            You are correct that lru_max_age is based on the idle time of a lock, which is refreshed on lock usage. However, consider the competing needs of keeping an often-used lock in cache vs. flushing many use-once locks from cache. If lru_max_age is high, it helps locks that may be used repeatedly, but not continuously; however, the client may then accumulate a large number of use-once locks before they age out. If lru_max_age is low enough to keep use-once locks out of the cache, then locks reused many times may also be cancelled whenever there is a gap in their usage. Currently, there is no frequency counter on locks, so all locks that hit lru_max_age (or lru_size, if set) will be cancelled regardless of how useful they are to the client. That is what LU-11509 is about - improving the algorithm for flushing locks from the cache instead of strict LRU.
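            The idle-time semantics described above can be sketched in a toy model (hypothetical Python illustration, not the Lustre implementation; names like IdleExpiryLRU are invented):

```python
from collections import OrderedDict

class IdleExpiryLRU:
    """Sketch of lru_max_age semantics: a lock's age is its *idle* time,
    reset whenever the lock is used, and expiry is strict LRU with no
    frequency counter (a model, not Lustre code)."""

    def __init__(self, max_age_s: float):
        self.max_age_s = max_age_s
        self._last_used: "OrderedDict[str, float]" = OrderedDict()

    def use(self, lock_id: str, now: float) -> None:
        # Using a lock refreshes its idle timer and moves it to the MRU end.
        self._last_used[lock_id] = now
        self._last_used.move_to_end(lock_id)

    def expire(self, now: float) -> list:
        # Cancel every lock idle longer than max_age, oldest first,
        # regardless of how often it was used before (strict LRU).
        cancelled = []
        for lock_id, last in list(self._last_used.items()):
            if now - last <= self.max_age_s:
                break  # entries are ordered by last use; the rest are younger
            cancelled.append(lock_id)
            del self._last_used[lock_id]
        return cancelled

cache = IdleExpiryLRU(max_age_s=300)
cache.use("lock-a", now=0)
cache.use("lock-b", now=100)
cache.use("lock-a", now=200)       # refreshes lock-a's idle timer
print(cache.expire(now=450))       # → ['lock-b'] (idle 350s > 300s); lock-a kept
```

            Note how a heavily reused lock survives only as long as its usage gaps stay under max_age_s; one long gap cancels it just like a use-once lock, which is the weakness the comment above points out.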

            degremoa Aurelien Degremont (Inactive) added a comment -

            I can see the bad effects of a fixed-size lru_size. That's why I'm not using it, and this value makes more sense as a read-only one nowadays.

            However, I may have misunderstood the behavior of lru_max_age. I'm using it as a way to dynamically limit the number of locks on the client side, based on the access pattern. I understood it to be the time since the resource was last accessed. A filesystem scan can indeed bring a lot of locks onto a client, but I thought that every time a directory is accessed again, its lock is considered "young" again and not evicted from the cache. I'm also using lru_max_age to force clients to flush dirty cache more aggressively.

            adilger Andreas Dilger added a comment -

            Aurelien, note that lru_max_age should really only be a fallback upper limit for the dynamic LRU pool management. Unfortunately, the dynamic LRU code has not been working well for a long time (see LU-7266), and users often disable it by setting lru_max_age and lru_size=N.

            However, that is sub-optimal, since it means some clients may have too many locks while others have too few, and setting too high a limit causes memory pressure on the servers and/or clients.

            What is really needed here is some investigation into the LDLM pool "Lock Volume" calculations to see why this is not working. The basic theory is that sum(age of locks) is a "volume" that the server distributes among clients, and the client can manage locks within that volume as it sees fit (many short-lived locks, or few long-lived locks). If the client lock volume grows to exceed its assigned limit (due to aging of old locks and/or acquiring many new locks), then it should cancel the oldest unused locks to reduce the volume again. The client is really in the best position to judge which of its locks are most important, but as a workaround for memory pressure issues, LU-6529 was implemented to give the server the ability to cancel locks more aggressively to avoid OOM.

            It may be that LDLM_POOL_MAX_AGE is just set much too high, and/or the DLM server is allowing too much memory to be put toward locks (e.g. not considering multiple namespaces, or just assigning too large a fraction of RAM to LDLM vs. filesystem cache, etc.), so the clients are not cancelling locks aggressively enough. There may also be issues w.r.t. hooking into the kernel slab cache shrinkers not working properly (this should reduce the lock volume on the server to force clients to cancel locks, and on the client to directly cancel locks).

            The other area that could benefit is replacing the strict LRU managing the locks on the client. For clients doing things like filesystem scanning, strict LRU is not a very good algorithm, since it flushes out "valuable" locks too quickly (e.g. parent directory locks) and doesn't drop "boring" locks (e.g. the use-once locks for the individual files). Using a better caching algorithm (e.g. LFRU, 2Q/SLRU, ARC) would go a long way toward improving lock cache usage on the client. ARC is probably the best choice, since it would be possible to keep the FIDs in the "ghost" lists without actually caching the lock/pages, so if a frequently-used lock had to be cancelled due to contention, it would not immediately lose the "value" that had been built up for it.
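            The ghost-list idea can be sketched with a toy model (hypothetical Python illustration, not Lustre code; FID strings stand in for lock resources, and real ARC adaptively balances its segments, which this sketch does not):

```python
from collections import OrderedDict

class GhostListCache:
    """Toy sketch of the ARC 'ghost list' idea: FIDs of recently cancelled
    locks are remembered without caching the lock itself, so a re-acquired
    lock starts in the 'frequent' segment instead of losing its built-up
    value. Hypothetical illustration only."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.recent = OrderedDict()    # seen once (T1-like segment)
        self.frequent = OrderedDict()  # seen again (T2-like segment)
        self.ghost = OrderedDict()     # FIDs of evicted locks, no payload

    def acquire(self, fid):
        if fid in self.frequent:
            self.frequent.move_to_end(fid)
        elif fid in self.recent:
            # Second use: promote from the use-once segment.
            del self.recent[fid]
            self.frequent[fid] = None
        elif fid in self.ghost:
            # Ghost hit: the FID was valuable before it was cancelled,
            # so re-admit it directly to the frequent segment.
            del self.ghost[fid]
            self.frequent[fid] = None
        else:
            self.recent[fid] = None
        self._evict()

    def _evict(self):
        # Evict use-once locks first; remember only the FID in the
        # ghost list, itself bounded to the cache capacity.
        while len(self.recent) + len(self.frequent) > self.capacity:
            victim_map = self.recent if self.recent else self.frequent
            fid, _ = victim_map.popitem(last=False)
            self.ghost[fid] = None
            while len(self.ghost) > self.capacity:
                self.ghost.popitem(last=False)
```

            In this model a scan's use-once locks churn through the recent segment without displacing frequently used locks, which is exactly the property strict LRU lacks.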

            adilger Andreas Dilger added a comment -

            Aurelien, I wouldn't be against this. Maybe 5 minutes is a bit too short, but 10 minutes would be better? This is set with the LDLM_DEFAULT_MAX_ALIVE value.

            See also LU-6402, which is about the LDLM_POOL_MAX_AGE value. That value is still at the very old 36000s/10h setting, and it controls how (badly) the dynamic LRU pressure on the client keeps locks on the client. It makes sense to reduce it at least to 3600s, and probably lower still.

            simmonsja James A Simmons added a comment - The reason this is done is that a common use case is to do a checkpoint every hour.

            People

              wc-triage WC Triage
              degremoa Aurelien Degremont (Inactive)
              Votes: 0
              Watchers: 6
