
OOM happens on OSS during Lustre recovery for more than 5000 clients

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.6
    • Labels: None
    • Environment: Servers running b2_7_fe; clients are a mix of IEEL3 (RH7/SCS5), 2.5.3.90 (RH6/AE4), 2.7.3 (CentOS7)
    • Severity: 3

    Description

      I have been on-site to work with Bruno Travouillon (Atos) on one of the crash-dumps they have.

      After joint analysis, it looks like a huge part of memory is being consumed by "ptlrpc_request_buffer_desc" objects (17KB each due to the embedded request buffer, and allocated from 32KB slabs, which nearly doubles the footprint!).

      Looking at the relevant source code, it appears that additional "ptlrpc_request_buffer_desc" objects can be allocated on demand by ptlrpc_check_rqbd_pool(), but they are never freed until OST umount/stop by ptlrpc_service_purge_all().

      This problem has caused several OSS failovers to fail due to OOM.
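
      For a rough sense of scale, here is a back-of-the-envelope illustration in plain userspace C (not Lustre source; the buffer count is a hypothetical figure for a busy recovery, not a measured value):

      #include <stdio.h>

      int main(void)
      {
          const long long ost_bufsize = 16 * 1024 + 1024; /* OST_MAXREQSIZE + 1KB sptlrpc payload = 17408 bytes */
          const long long slab_object = 32 * 1024;        /* next power-of-two slab object actually used */
          const long long nrqbd       = 100 * 1000;       /* hypothetical rqbd count reached during recovery */

          printf("wasted per buffer: %lld bytes\n", slab_object - ost_bufsize);        /* ~15KB lost to rounding */
          printf("pool footprint   : %lld MB\n", nrqbd * slab_object / (1024 * 1024)); /* ~3GB with these numbers */
          return 0;
      }

      Since the pool is only drained when the service stops, that footprint can only grow while clients reconnect during recovery.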

      Attachments

        Issue Links

          Activity

            [LU-9372] OOM happens on OSS during Lustre recovery for more than 5000 clients

            We use pacemaker for HA. When the first OSS crashed, the target resources failed over to the partner OSS, which explains the MMP messages.

            These OSS are KVM guests running on top of a DDN SFA14KX-E controller. Indeed, there is only one NUMA node in an OSS.

            Unfortunately, I can't confirm the slabs were size-1024 and size-32768. As far as I remember, they were, but I can't assert it...my bad.

            I will provide an action plan to capture some relevant data during the next occurrence.

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Ok, now that I have had a better look at the OSS log you provided, I am concerned about the multiple MMP errors encountered for OSTs. They tend to indicate that ownership of some of the OSTs was transferred to another/HA OSS during the restart+recovery process, which may have induced some "anarchy" in the recovery process...

            Also, the OOM stats only show one NUMA node; is that the case? And this node's SLAB content does not look so excessive relative to the memory size ("Node 0 Normal free:64984kB ... present:91205632kB managed:89727696kB ... slab_reclaimable:1356412kB slab_unreclaimable:6092980kB ..."), so can you confirm that you could still see the "size-1024" and "size-32768" slabs growing to billions of objects?

            bfaccini Bruno Faccini (Inactive) added a comment -
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            But of course!!!! Well, I must have been really tired today, since I was not able to find the forced crash in the log and I was looking at MDS request sizes!!...
            Thanks Bruno, I will try to push another patch soon, based on the thoughts in my earlier comment.


            About the 17K size, look at OST_MAXREQSIZE and OST_BUFSIZE in lustre/include/lustre_net.h.

            #define OST_MAXREQSIZE (16 * 1024)
            /** OST_BUFSIZE = max_reqsize + max sptlrpc payload size */
            #define OST_BUFSIZE max_t(int, OST_MAXREQSIZE + 1024, 16 * 1024)
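
            For reference, a quick userspace check (not Lustre source; MAX_T_INT is just a stand-in for the kernel's max_t) shows that OST_BUFSIZE evaluates to 17408 bytes, just above 16KB, which is why each buffer ends up in a 32KB power-of-two slab object:

            #include <stdio.h>

            #define OST_MAXREQSIZE (16 * 1024)
            /* userspace stand-in for the kernel's max_t(int, a, b) */
            #define MAX_T_INT(a, b) ((int)(a) > (int)(b) ? (int)(a) : (int)(b))
            #define OST_BUFSIZE MAX_T_INT(OST_MAXREQSIZE + 1024, 16 * 1024)

            int main(void)
            {
                printf("OST_BUFSIZE = %d bytes\n", OST_BUFSIZE); /* 17408 */
                return 0;
            }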


            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi Bruno,

            This OSS smurf623 was upgraded to CentOS 7, Lustre 2.7.3 with patch 26752 on July 5th. Because of LU-8685, we had to upgrade the kernel the next day, on July 6th, while 5201 clients had the filesystem mounted. We then hit the OOM issue while mounting the 20 OSTs on the OSS (between 14:38:48 and 15:03:13).

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hello Bruno,
            Is "smurf623.log-20170709" the log of one of the crashed OSSs from last occurrence of problem and running with CentOS and my patch?

            By the way, the fact that your servers are now running CentOS 7 may have some implications: in terms of kernel memory allocation, these kernels use SLUB, whereas the kernels shipped for CentOS 6 used SLAB.

            Anyway, I am investigating new possible ways to avoid OOM during server failover+recovery, such as setting a hard limit on the volume/number of ptlrpc_request_buffer_desc+buffer allocations, better understanding the reason for the 17K size of the most heavily used buffers, and/or maybe creating a specific kmem_cache for this purpose and thus limiting wasted space...
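
            To make the kmem_cache idea concrete, here is a minimal, hypothetical kernel-module sketch (illustrative names and sizes only, not the eventual patch); a cache created for the ~17KB buffer size can let the allocator pack these buffers more tightly than the generic 32KB kmalloc bucket:

            #include <linux/module.h>
            #include <linux/slab.h>

            /* Hypothetical demo cache for ~17KB request buffers. */
            #define DEMO_RQBD_BUF_SIZE (17 * 1024)

            static struct kmem_cache *demo_rqbd_cache;

            static int __init demo_init(void)
            {
                /* A dedicated cache avoids the power-of-two rounding done by kmalloc(). */
                demo_rqbd_cache = kmem_cache_create("demo_rqbd_buf",
                                                    DEMO_RQBD_BUF_SIZE, 0, 0, NULL);
                return demo_rqbd_cache ? 0 : -ENOMEM;
            }

            static void __exit demo_exit(void)
            {
                kmem_cache_destroy(demo_rqbd_cache);
            }

            module_init(demo_init);
            module_exit(demo_exit);
            MODULE_LICENSE("GPL");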

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Bruno,
            Well, too bad, both for the new occurrence happening with the patch and for the lack of an available crash dump!
            But can you at least provide the syslog/dmesg/console for one of the crashed OSSs?
            And also, do I understand correctly that the last successful failover/recovery with the patch occurred on CentOS 6.x and not CentOS 7?

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hi Bruno,

            For the record, we hit a similar occurrence after the upgrade of our Lustre servers to CentOS 7. The OSSs panicked on OOM during the recovery process (20 OSTs per OSS).

            We were able to start the filesystem by mounting successive subsets of 5 OSTs on each OSS. With only a few OSTs, memory consumption increases more slowly, and thanks to your patch https://review.whamcloud.com/#/c/26752/ the memory was freed after the completion of the recovery.

            Unfortunately, I don't have any vmcore available.

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi,

            The patch has been backported into the CEA 2.7 branch.

            FYI, we have been able to successfully recover 5000+ clients this morning. Thank you!

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -
            pjones Peter Jones added a comment -

            Landed for 2.10


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26752/
            Subject: LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 220124bff7b13cd26b1b7b81ecf46e137ac174d3

            gerrit Gerrit Updater added a comment -

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: bfaccini Bruno Faccini (Inactive)
              Votes: 0
              Watchers: 11
