Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.1.0
    • Environment: Lustre 2.1.0-21chaos (github.com/chaos/lustre)
    • Severity: 3
    • 10139

    Description

      Today we had an IB-connected 2.1 server OOM out of the blue. After rebooting the node, the OSS OOMs again once it has been in recovery for a little while. This OSS serves 15 OSTs.

      With one of the reboots, we added the "malloc" debug level and used the debug daemon to collect a log. Note that we saw messages about dropped log messages, so be aware that lines are missing from it. I will upload it to the ftp site, as it is too large for Jira. The filename will be sumom31-lustre.log.txt.bz2.
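      (For reference, capturing a log like that typically looks something like the sketch below; this is an illustrative sequence using the standard lctl debug interface, not the exact commands or file name from our script:)

        lctl set_param debug="+malloc"                        # add D_MALLOC to the debug mask
        lctl debug_daemon start /tmp/lustre-debug.log 1024    # stream the debug buffer to a file (size cap in MB)
        # ... reproduce the recovery / OOM ...
        lctl debug_daemon stop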

      We also extracted a Lustre log at our default logging level from the original crash dump after the first OOM. I will attach that here.

      Note also that we have the obdfilter writethrough and read caches disabled at this time.
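      (For reference, disabling those caches is normally done with the standard obdfilter tunables shown below; a sketch, not the exact lines from our mount script:)

        lctl set_param obdfilter.*.writethrough_cache_enable=0
        lctl set_param obdfilter.*.read_cache_enable=0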

      Using the crash "kmem" command, it is clear that most of the memory is used by slab, but not attributed to any of the Lustre named slabs. Here is the short kmem -i output:

      crash> kmem -i
                    PAGES        TOTAL      PERCENTAGE
       TOTAL MEM  6117058      23.3 GB         ----
            FREE    37513     146.5 MB    0% of TOTAL MEM
            USED  6079545      23.2 GB   99% of TOTAL MEM
          SHARED     3386      13.2 MB    0% of TOTAL MEM
         BUFFERS     3297      12.9 MB    0% of TOTAL MEM
          CACHED    26240     102.5 MB    0% of TOTAL MEM
            SLAB  5908658      22.5 GB   96% of TOTAL MEM
      

      The OSS has 24G of RAM total.

      The biggest consumers by far are size-8192, size-1024, and size-2048. I will attach the full "kmem -s" output as well.
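      (For reference, one way to rank the biggest slab consumers from within the same crash session, assuming the usual kmem -s column order of cache address, name, object size, allocated, total, slabs, and slab size; this pipeline is a sketch, not taken from the report:)

        crash> kmem -s | awk '$3 ~ /^[0-9]+$/ {printf "%-16s %10.1f MB\n", $2, $3*$4/1048576}' | sort -k2 -rn | head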

      We attempted to work around the problem by starting one OST at a time, and allowing it to fully recover before starting the next OST. By the end of the third OST's recovery, memory usage was normal. During the fourth OST's recovery the memory usage spiked and the node OOMed.

      We finally gave up and mounted with the abort_recovery option, and things seem to be running fine at the moment.
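      (For reference, the two workarounds expressed as mount invocations; the device paths and mount points below are hypothetical, and the canonical spelling of the mount option is abort_recov:)

        # start one OST at a time, waiting for recovery to complete before the next
        mount -t lustre /dev/mapper/ost0000 /mnt/lustre/ost0000
        lctl get_param obdfilter.*OST0000*.recovery_status    # wait for status: COMPLETE

        # or skip recovery entirely
        mount -t lustre -o abort_recov /dev/mapper/ost0001 /mnt/lustre/ost0001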

          Activity

            [LU-1099] Lustre OSS OOMs repeatedly

            niu Niu Yawei (Inactive) added a comment -

            Was the OOM triggered during recovery or after recovery? From the log (sumom31_oss_after_reboot_lustre.log.txt.bz2), I see that all 15 OSTs have finished recovery.

            The huge rqbd buffer (a grow-only buffer) could be caused by request bursting during recovery, since only lock replay isn't throttled during recovery (resends after recovery aren't throttled either, but I don't think there are many requests to resend). Were there a huge number of cached write locks on the clients before the OSS reboot?
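            (For reference, one rough way to gauge how many LDLM locks the clients are caching for these OSTs, run on a client with the standard ldlm namespace parameters; the glob is left generic and would depend on the client's view of the targets:)

              lctl get_param ldlm.namespaces.*osc*.lock_count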

            liang Liang Zhen (Inactive) added a comment -

            The ptlrpc service sets "Lazy" on the request portal, which means requests can be blocked inside the LNet layer (instead of being dropped), but the ptlrpc service will always try to allocate enough buffers (rqbds) and ptlrpc_requests for incoming requests. These rqbds are not freed until the service is shut down, which is the kind of thing we might improve. But my question is: is it expected that we have so many pending requests on this OSS?

            ffff88033fd304c0 size-8192               8192    1413671   1413672 1413672     8k
            ffff88033fcd0340 size-1024               1024    8298628   8298832 2074708     4k
            

            I remember that each ptlrpc_request takes about 900 bytes, so I think most of these 1K objects are ptlrpc_requests, and there are over 8 million requests here...
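            (For reference, a rough back-of-the-envelope check of those two caches from the kmem -s lines above; the arithmetic is the editor's, not from the original comments:)

              size-8192: 1,413,671 allocated x 8,192 bytes ~ 10.8 GB
              size-1024: 8,298,628 allocated x 1,024 bytes ~  7.9 GB

            Together that is roughly 18.7 GB, which accounts for most of the ~22.5 GB reported in slab by kmem -i, and 8.3 million 1K objects is consistent with the ~900-byte ptlrpc_request estimate above.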
            green Oleg Drokin added a comment -

            Ok, so the huge number of 8k allocations comes from rqbd allocs, and they are never freed (for the duration of the log, anyway).
            I imagine that if you have a big enough cluster with quite a few active clients, a lot of them might have uncommitted transactions that we would accumulate in transaction order to replay, especially if there is a dead (or just very slow to reconnect) client out there that happens to hold some early transaction that is not yet committed.
            Unfortunately, without the D_HA debug level I cannot know how far from the truth that idea is.
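            (For reference, adding the D_HA debug level before the next reproduction would look roughly like the line below; a sketch using the standard debug-mask interface, and it can be combined with the malloc/debug_daemon capture mentioned in the description:)

              lctl set_param debug="+ha"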

            morrone Christopher Morrone (Inactive) added a comment -

            Hmmm, I could have sworn that I uploaded the file. Maybe I got distracted. Sorry about that. I just pushed sumom31-lustre.log.txt.bz2 to the ftp uploads directory.
            green Oleg Drokin added a comment -

            Ah, indeed I missed the caches-disabled line somehow.

            Anyway, the default log unfortunately does not have allocation information collected, so it's hard to know what was actually allocating.
            I suspected it might be related to LDLM locks, but apparently not, based on the data in the kmem -s output.

            I wonder what sort of allocation we might have tied to the connected clients other than LDLM locks, and nothing obvious comes to mind, especially at these sizes.

            You say you'll upload the log with malloc debug under the name sumom31-lustre.log.txt.bz2, but it's nowhere to be found on our ftp. Any chance I can take a look at it?

            morrone Christopher Morrone (Inactive) added a comment -

            That is what I reported, yes. But now that I think about it, the caches were probably still enabled on this server cluster when we saw the first occurrence of an OOM.

            The admins are using lctl set_param to disable the caches from a lustre script after the mount occurs. That should be fairly early in the recovery process.
            green Oleg Drokin added a comment -

            Is this one with the server-side read and writethrough caches disabled?


            People

              Assignee: green Oleg Drokin
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 8
