[LU-14221] Client hangs when using DoM with a fixed mdc lru_size

Details

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major
    • Affects Version/s: Lustre 2.12.5, Lustre 2.12.6

    Description

      After enabling DoM and beginning to use one of our file systems more heavily recently, we discovered a bug seemingly related to locking.

      Basically, with any fixed `lru_size`, everything works normally until the number of locks hits the `lru_size`. From that point, everything hangs until the `lru_max_age` is reached, at which point the client clears the locks and moves on, until the LRU fills again. We confirmed this by setting the `lru_size` pretty low, setting a low (10s) `lru_max_age`, and kicking off a tar extraction. The tar would extract until the `lock_count` hit our `lru_size` value (basically 1 for 1 with the number of files), then hang for 10s, then continue with another batch after the locks had been cleared. The same behavior can be reproduced by letting it hang and then running `lctl set_param ldlm.namespaces.*mdc*.lru_size=clear`, which frees up the process temporarily as well.
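
      For reference, a minimal reproducer along the lines described above might look like the following. The mdc namespace glob, tarball path, and target directory are illustrative placeholders, and on older clients `lru_max_age` may expect a plain millisecond value rather than the `10s` form shown here:

        # pin the MDC lock LRU to a small fixed size and a short max age
        lctl set_param ldlm.namespaces.*-mdc-*.lru_size=200
        lctl set_param ldlm.namespaces.*-mdc-*.lru_max_age=10s
        # extract enough files into a DoM directory to exceed lru_size
        # (placeholder tarball and target directory)
        tar -xf /tmp/many-small-files.tar -C /mnt/lustre/dom_dir
        # in another shell, watch lock_count climb to lru_size and the extraction stall
        watch -n 1 'lctl get_param ldlm.namespaces.*-mdc-*.lock_count'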

       

      Our current workaround is to set `lru_size` to 0 and set the `lru_max_age` to 30s to keep the number of locks to a manageable level.
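
      Concretely, the workaround amounts to something like this (the mdc namespace glob is again illustrative, and the same millisecond caveat for `lru_max_age` applies on older clients):

        # dynamic LRU sizing, but age unused locks out quickly
        lctl set_param ldlm.namespaces.*-mdc-*.lru_size=0
        lctl set_param ldlm.namespaces.*-mdc-*.lru_max_age=30s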

       

      This appears to only occur on our SLES clients. RHEL clients running the same Lustre version encounter no such problems. This may be due to the kernel version on SLES (4.12.14-197) vs. RHEL (3.10.0-1160).

       

      James believes this may be related to LU-11518.

       

      lru_size and lock_count while it's stuck:

      lctl get_param ldlm.namespaces.*.lru_size
      ldlm.namespaces.cyclone-MDT0000-mdc-ffff88078946d800.lru_size=200
      lctl get_param ldlm.namespaces.*.lock_count
      ldlm.namespaces.cyclone-MDT0000-mdc-ffff88078946d800.lock_count=201

       

      Process stack while it's stuck:

      [<ffffffffa0ad1932>] ptlrpc_set_wait+0x362/0x700 [ptlrpc]
      [<ffffffffa0ad1d57>] ptlrpc_queue_wait+0x87/0x230 [ptlrpc]
      [<ffffffffa0ab7217>] ldlm_cli_enqueue+0x417/0x8f0 [ptlrpc]
      [<ffffffffa0a6105d>] mdc_enqueue_base+0x3ad/0x1990 [mdc]
      [<ffffffffa0a62e38>] mdc_intent_lock+0x288/0x4c0 [mdc]
      [<ffffffffa0bf29ca>] lmv_intent_lock+0x9ca/0x1670 [lmv]
      [<ffffffffa0cfea99>] ll_layout_intent+0x319/0x660 [lustre]
      [<ffffffffa0d09fe2>] ll_layout_refresh+0x282/0x11d0 [lustre]
      [<ffffffffa0d47c73>] vvp_io_init+0x233/0x370 [lustre]
      [<ffffffffa085d4d1>] cl_io_init0.isra.15+0xa1/0x150 [obdclass]
      [<ffffffffa085d641>] cl_io_init+0x41/0x80 [obdclass]
      [<ffffffffa085fb64>] cl_io_rw_init+0x104/0x200 [obdclass]
      [<ffffffffa0d02c5b>] ll_file_io_generic+0x2cb/0xb70 [lustre]
      [<ffffffffa0d03825>] ll_file_write_iter+0x125/0x530 [lustre]
      [<ffffffff81214c9b>] __vfs_write+0xdb/0x130
      [<ffffffff81215581>] vfs_write+0xb1/0x1a0
      [<ffffffff81216ac6>] SyS_write+0x46/0xa0
      [<ffffffff81002af5>] do_syscall_64+0x75/0xf0
      [<ffffffff8160008f>] entry_SYSCALL_64_after_hwframe+0x42/0xb7
      [<ffffffffffffffff>] 0xffffffffffffffff
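
      For anyone reproducing this, a stack like the one above can be captured from the hung task with standard kernel facilities; the `tar` process name below is just an example:

        # dump the kernel stack of the stuck process (run as root)
        cat /proc/$(pidof tar)/stack
        # or log all blocked tasks to the kernel ring buffer
        echo w > /proc/sysrq-trigger
        dmesg | tail -n 100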

      I can reproduce and provide any other debug data as necessary.

      Attachments

        Issue Links

          Activity

            [LU-14221] Client hangs when using DoM with a fixed mdc lru_size

            simmonsja James A Simmons added a comment -

            Moved to Lustre 2.15 which has DoM working natively.

            simmonsja James A Simmons added a comment -

            Patch 41008 is ready to land.

            adilger Andreas Dilger added a comment -

            Cory previously asked:

            May I ask what harm there is with the large (default) lru_max_age? You say that it is bad that lots of clients may have lots of locks. Is the server not able to handle the lock pressure? Does back pressure not get applied to the clients? Are the servers unable to revoke locks upon client request in a timely manner? I guess I just don't understand why it is inherently bad to use the defaults. Could you explain more? Thanks!

            I think there are two things going on here. Having a large lru_max_age means that unused locks (and potentially data cached under those locks) may linger on the client for a long time. That consumes memory on the MDS and OSS for every lock that every client holds, which could probably be better used somewhere else. Also, there is more work needed at recovery time if the MDS/OSS crashes to recover those locks. Also, having a large number of locks on the client or server adds some overhead to all lock processing due to having more locks to deal with because of longer hash collision chains.

            There is the "dynamic LRU" code that has existed for many years to try and balance MDS lock memory usage vs. client lock requests, but I've never really been convinced that it works properly (see e.g. LU-7266 and related tickets). I also think that when clients have so much RAM these days, it can cause a large number of locks to stay in memory for a long time until there is a sudden shortage of memory on the server, and the server only has limited mechanisms to revoke locks from the clients. It can reduce the "lock volume" (part of the "dynamic LRU" functionality) but this is at best a "slow burn" that is intended (if working properly) to keep the steady-state locking traffic in check. More recently, there was work done under LU-6529 "Server side lock limits to avoid unnecessary memory exhaustion" to allow more direct reclaim of DLM memory on the server when it is under pressure. We want to avoid the server cancelling locks that are actively in use by the client, but the server has no real idea about which locks the client is reusing, and which ones were only used once, so it does the best job it can with the information it has, but it is better if the client does a better job of keeping the number of locks under control.

            So there is definitely a balance between being able to cache locks and data on the client vs. sending more RPCs to the server and reducing memory usage on both sides. That is why having a shorter lru_max_age is useful, but longer term LU-11509 "LDLM: replace lock LRU with improved cache algorithm" would improve the selection of which locks to keep cached on the client, and which (possibly newer, but use-once locks) should be dropped. That is as much a research task as a development effort.

            tappro Mikhail Pershin added a comment - edited

            For anyone interested, the patch from LU-11518, https://review.whamcloud.com/41008, is the one that solves the problem on 2.12.6 for me. After applying it, untar no longer freezes when lru_size has a fixed size.
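
            For anyone wanting to try the same change on their own 2.12 tree, it can be fetched from Gerrit and cherry-picked. This is only a sketch: the repository path on the review server and the branch name are assumptions, and <N> stands for the patch-set number shown on the review page.

              # clone the Lustre source (or use an existing checkout)
              git clone git://git.whamcloud.com/fs/lustre-release.git
              cd lustre-release
              git checkout b2_12                     # assumed 2.12 maintenance branch name
              # fetch change 41008; replace <N> with the patch-set number from Gerrit
              git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/08/41008/<N>
              git cherry-pick FETCH_HEAD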


            simmonsja James A Simmons added a comment -

            I think the LU-11518 work should resolve the rest of the problems.

            nilesj Jeff Niles added a comment -

            Glad you're able to reproduce on 2.12.5. I do find it a bit odd that we experience problems with 2.12.6 while you don't; perhaps it's the larger dataset, like you mention. I think it would be beneficial to figure out what code changed to fix the issue for you in 2.12.6, as it may reveal why we still see issues. Probably not the highest priority work though.

            tappro Mikhail Pershin added a comment - edited

            I am able to reproduce that issue on the initial 2.12.5 release with a 3.10 kernel RHEL client, and I also checked that everything works with the latest 2.12.6 version. It seems there is a patch in between that fixed the issue. I will run git bisect to find it, if that is what we need.

            With the latest 2.12.6 I have no problems with a fixed lru_size=100, but maybe my test set is just not big enough.
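
            If someone wants to narrow it down the same way, the bisect would look roughly like this. The tag names follow the usual lustre-release convention but should be double-checked, and custom terms are used because we are hunting the commit that fixed the behaviour rather than one that broke it.

              cd lustre-release
              # hunt for the fixing commit: old releases are "broken", new ones "fixed"
              git bisect start --term-old=broken --term-new=fixed
              git bisect broken v2_12_5      # hang reproduces here (assumed tag name)
              git bisect fixed v2_12_6       # hang no longer reproduces (assumed tag name)
              # at each step: build and install the client, rerun the untar reproducer
              # with a fixed lru_size, then mark the result
              git bisect fixed               # or: git bisect broken
              # when finished
              git bisect reset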

            nilesj Jeff Niles added a comment -

            Unfortunately that's where my knowledge ends. I do know that a large number of locks puts memory pressure on the MDSs, but from Andreas' comment above, it seems like it should start applying back pressure to the clients at some point?

            Historically, on our large systems we've had to limit the lru_size to prevent overload issues with the MDS. This was the info that we were operating off of, but maybe that's not the case any more.

            spitzcor Cory Spitz added a comment -

            Thanks for the clarification. Good sleuthing too!
            May I ask what harm there is with the large (default) lru_max_age? You say that it is bad that lots of clients may have lots of locks. Is the server not able to handle the lock pressure? Does back pressure not get applied to the clients? Are the servers unable to revoke locks upon client request in a timely manner? I guess I just don't understand why it is inherently bad to use the defaults. Could you explain more? Thanks!

            nilesj Jeff Niles added a comment -

            Hey Cory,

            When set to a fixed LRU size, a 2.12.6 client will complete write actions in a DoM directory in a time about equal to (number of files to process / lru_size) * lru_max_age. Essentially it completes work as the max age is hit, 200 (or whatever the number is) tasks at a time.
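
            To put rough numbers on that formula, take a hypothetical extraction of 10,000 files with the lru_size=200 and lru_max_age=10s values from the reproducer in the description:

              (10,000 files / 200 locks per batch) * 10s per batch ≈ 500s total

            i.e. the run time is dominated by the hangs between 200-file batches rather than by the I/O itself.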

            When set to a dynamic LRU size (0), a 2.12.6 client will work as expected, except that it will leave every single lock open until it hits the max_age limit (by default 65 minutes). Obviously this is less than ideal for a large-scale system with a bunch of clients all at 50k locks. This is the basis of our workaround: set a dynamic LRU size and set a max_age of 30s or so to time them out quickly. Not ideal, but it'll work for now.

             

            The determination that something was fixed between 2.12.6 and 2.14 was based on our reproducer finishing in a normal amount of time with a fixed LRU size (2000) on 2.14, rather than in (number of files to process / lru_size) * lru_max_age, as we were seeing with 2.12.6. Since I don't think I said it above, even with lru_size=2000 on 2.12.6, we were still seeing issues where it would process about 2000 files, hang until those 2000 locks hit the max_age value, and then proceed. The issue isn't just limited to low lru_size settings.

             

            To be clear, have you run the experiment with default lru_size and lru_max_age? Does the LTS client behave poorly? Or, does it match the non-DoM performance?

            Yes. A 2.12.6 LTS client works great with default (0) lru_size, except that it keeps all the locks open until max_age. This LU is specifically about the bug as it relates to fixed mdc lru_size settings.


            People

              tappro Mikhail Pershin
              nilesj Jeff Niles
              Votes: 0
              Watchers: 11
