
MDS became unresponsive, clients hanging until MDS fail over

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.2
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 15580

    Description

      This morning some of our clients were hanging (others had not been checked at that time); the active MDS was unresponsive and flooding the console with stack traces. We had to fail over to the second MDS to get the file system back.

      Looking at the system logs, we see a large number of these messages:
      kernel: socknal_sd00_02: page allocation failure. order:2, mode:0x20
      all followed by many stack traces; the full log is attached. Our monitoring shows that memory was mainly used by buffers, but that had been the case for all of last week already and was stable, only slowly increasing. After the restart, the memory used by buffers quickly increased to about 60% and currently seems to be stable around there.
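
      A minimal sketch of the kind of periodic capture that could help correlate the next incident with memory state, using only standard /proc interfaces; the output directory and interval are arbitrary examples and the loop assumes root on the MDS:

        # capture overall memory, per-order free pages and slab usage every 5 minutes
        while true; do
            d=$(date +%Y%m%d-%H%M%S)
            cat /proc/meminfo   > /var/tmp/meminfo.$d     # buffer/cache usage over time
            cat /proc/buddyinfo > /var/tmp/buddyinfo.$d   # free pages per order (the failures above were order:2)
            cat /proc/slabinfo  > /var/tmp/slabinfo.$d    # per-slab counts, e.g. ldlm_locks, ldlm_resources
            sleep 300
        done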

      Just before these page allocation failure messages we noticed a few client reconnect messages, but have not been able to find any network problems so far. Since the restart of the MDT, no unexpected client reconnects have been seen.

      We are running lustre 2.5.2 + 4 patches as recommended in LU-5529 and LU-5514.

      We've been hammering the MDS a bit since the upgrade, creating files, stat'ing many files/directories from many clients, and removing many files, but I would still expect the MDS not to fall over like this.

      Is this a problem/memory leak in Lustre or something else? Could it be related to different compile options when compiling Lustre? We did compile the version on the MDS in house with these patches, and there is always a chance we didn't use quite the same compile time options that the automatic build process would use.

      What can we do to debug this further and avoid it in the future?

      Attachments

        Issue Links

          Activity

            [LU-5585] MDS became unresponsive, clients hanging until MDS fail over

            ferner Frederik Ferner (Inactive) added a comment -

            Depends on your view; we've got just under 300 clients on this file system.

            We'll try limiting the lru_size and will continue to monitor; looking at LU-5727, I'm not sure how much this will give us.

            Considering that we have been cleaning the file system, it is also entirely possible that we hit something similar to LU-5726, i.e. we almost certainly have run 'rm -rf' or similar in parallel on multiple clients. I will try to reproduce this tomorrow.
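
            A rough sketch of what that reproduction attempt might look like, assuming pdsh is available; the client list and the scratch directory are placeholders:

              # each client removes its own throwaway subtree in parallel,
              # mimicking the concurrent 'rm -rf' cleanup described above
              pdsh -w client[01-20] 'rm -rf /mnt/lustre/scratch/$(hostname)'

              # meanwhile, watch the DLM lock counts on the MDS
              watch -n 10 'lctl get_param ldlm.namespaces.*.lock_count'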

            pjones Peter Jones added a comment -

            Bobijam

            Could this be related to the issue reported in LU-5727?

            Peter

            green Oleg Drokin added a comment -

            Do you have many clients on this system?

            It's been a known problem in the past that if you let client LRUs grow uncontrollably, servers become somewhat memory starved.

            One possible workaround is to set lru_size on the clients to something conservative like 100 or 200. Also, if you have mostly non-intersecting jobs on the clients that don't reuse the same files between different jobs, some sites drop lock LRUs (and other caches) forcefully in between job runs.
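
            For reference, a minimal sketch of that workaround as it might be run on the clients; the value and the namespace wildcard are only examples:

              # cap each DLM namespace LRU at a fixed, conservative size
              lctl set_param ldlm.namespaces.*.lru_size=200

              # or, between unrelated job runs, drop lock LRUs and page/dentry caches entirely
              lctl set_param ldlm.namespaces.*.lru_size=clear
              echo 3 > /proc/sys/vm/drop_caches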


            ferner Frederik Ferner (Inactive) added a comment -

            Unfortunately this started to severely affect file system performance, so we had to fail over. I was nearly in time to do a clean unmount, but not quite. By the time I started typing the umount command, the MDS froze completely and I was not able to collect any debug_log.

            Since this is now a recurring feature of this file system, any idea how we could prevent it from re-occurring would be much appreciated. If there is anything we can do to help debug this, let us know; we'll do what we can.

            Frederik


            ferner Frederik Ferner (Inactive) added a comment -

            I've done that; unfortunately it didn't seem to free up much memory.

            During the initial sweep of lctl get_param ldlm.namespaces.*MDT*.lru_size for this file system, adding up the numbers for all reachable clients (a few are currently unresponsive and are being looked at; we assume this is unrelated), we seem to have about 5.3M locks on the clients (corresponding to the most recent slabinfo snapshot showing 5.4M ldlm_locks).

            After the lru_size=clear, both numbers dropped; now, about 20 minutes later, they are back at about 1.5M each.

            Fresh meminfo/slabinfo snapshots taken about 20 minutes after clearing the locks are attached.
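
            For completeness, a sketch of how the server-side numbers quoted above can be pulled on the MDS, using standard slabinfo plus the DLM's own counters (the exact slab names may vary by version):

              # active/total objects for the DLM slabs
              awk '/^ldlm_locks |^ldlm_resources / {print $1, "active:", $2, "total:", $3}' /proc/slabinfo

              # per-namespace lock/resource counters kept by the DLM itself
              lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.resource_count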


            adilger Andreas Dilger added a comment -

            What is a bit odd here is that there are 5.5M in-use ldlm_locks on 2.9M ldlm_resources, yet there are only 190K inodes in memory (166K objects). This implies there is something kind of strange happening in the DLM, since there should only be a single resource per MDT object. There should be at least one ldlm_resource for each ldlm_lock, though having more locks than resources is OK as multiple clients may lock the same resource, or a single client may lock different parts of the same resource.

            One experiment you might do is to run lctl get_param ldlm.namespaces.*MDT*.lru_size to get the count of locks held by all the clients, and then lctl set_param ldlm.namespaces.*MDT*.lru_size=clear on the clients to drop all their DLM locks. The set_param will cancel the corresponding locks on the server and flush the client metadata cache as a result, which may have a short-term negative impact on metadata performance, in case that is a concern.
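
            A sketch of how this experiment could be run across many clients at once, assuming pdsh; the hostlist and the *MDT* wildcard are placeholders that depend on the local setup:

              # 1) sum the per-client lock counts for the MDT namespace(s)
              pdsh -w client[001-300] 'lctl get_param -n ldlm.namespaces.*MDT*.lru_size' 2>/dev/null \
                  | awk '{sum += $2} END {print "total client-held MDT locks:", sum}'

              # 2) drop all client-held DLM locks (this also flushes the client metadata caches)
              pdsh -w client[001-300] 'lctl set_param ldlm.namespaces.*MDT*.lru_size=clear'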

            The cancellation of locks on the clients should result in all of the ldlm_locks structures being freed on the MDS (or at least the sum of the locks on the clients should match the number of ACTIVE ldlm_locks allocated on the MDS). If that isn't the case, it seems we have some kind of leak in the DLM.


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: ferner Frederik Ferner (Inactive)
              Votes: 0
              Watchers: 7
