
OOM happens on OSS during Lustre recovery for more than 5000 clients

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.6
    • Environment: Server running with b2_7_fe
      Clients are a mix of IEEL3 (RH7/SCS5), 2.5.3.90 (RH6/AE4), 2.7.3 (CentOS7)
    • Severity: 3

    Description

      I have been on-site to work with Bruno Travouillon (Atos) on one of the crash-dumps they have.

      After joint analysis, it looks like a huge part of the memory is being consumed by "ptlrpc_request_buffer_desc" objects (17KB each due to the embedded request buffer, and allocated from 32KB slabs, which nearly doubles the footprint!).

      Looking at the relevant source code, additional "ptlrpc_request_buffer_desc" objects can be allocated on demand by ptlrpc_check_rqbd_pool(), but they are never freed until the OST is unmounted/stopped, via ptlrpc_service_purge_all().

      This problem has caused several OSS failovers to fail due to OOM.
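
      A minimal, compilable sketch of this grow-only behaviour (hypothetical names and numbers, not the actual Lustre code), just to illustrate how a long recovery that keeps the buffers busy makes the pool, and its 32KB-per-buffer footprint, grow without bound:

      /* Hypothetical illustration of a grow-only buffer pool; the real logic
       * lives in ptlrpc_check_rqbd_pool()/ptlrpc_grow_req_bufs(). */
      #include <stdio.h>

      #define RQBD_LOW_WATER  64      /* hypothetical low-water mark */
      #define RQBD_GROW_COUNT 64      /* hypothetical grow increment */

      struct rqbd_pool {
              unsigned int nrqbds_total;      /* ever-growing, cf. scp_nrqbds_total */
              unsigned int nrqbds_idle;       /* buffers currently unused */
      };

      /* Grows the pool whenever idle buffers run low; note there is no shrink
       * path, buffers are only released when the service is purged at umount/stop. */
      static void check_rqbd_pool(struct rqbd_pool *pool)
      {
              if (pool->nrqbds_idle < RQBD_LOW_WATER) {
                      pool->nrqbds_total += RQBD_GROW_COUNT;
                      pool->nrqbds_idle  += RQBD_GROW_COUNT;
              }
      }

      int main(void)
      {
              struct rqbd_pool pool = { .nrqbds_total = 512, .nrqbds_idle = 8 };

              /* Simulate a long recovery: every pass finds the buffers busy,
               * so the pool keeps growing and the memory is never returned. */
              for (int i = 0; i < 1000; i++) {
                      check_rqbd_pool(&pool);
                      pool.nrqbds_idle = 8;   /* clients keep the buffers in use */
              }
              printf("total RQBDs: %u (~%u MB in 32KB slabs)\n",
                     pool.nrqbds_total, pool.nrqbds_total * 32 / 1024);
              return 0;
      }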

      Attachments

        Issue Links

          Activity

            [LU-9372] OOM happens on OSS during Lustre recovery for more than 5000 clients
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Andreas, I can't answer about your Lustre RPC and HA config/behavior concerns, but I am sure Bruno will do so soon. But you may be right that something should be done to prevent a peer from trying to mount all targets upon restart/reboot.

            Concerning the memory consumption, I can confirm that the huge number of size-1024 and size-32768 objects was very close to the current number of RQBDs (sum of scp_nrqbds_total).

            About your comment concerning "a race condition during RQBD allocation, that may cause it to allocate more buffers than it needed ...", I presume you refer to the checks/code in the ptlrpc_check_rqbd_pool()/ptlrpc_grow_req_bufs() routines, which handle the case where a lot of RQBDs+buffers are already in use. This is where my two patches try to limit the allocations.

            Lastly, do you mean that I should add some auto-tuning code to my patch (based on memory size/load, but also the number of targets? only when failover/recovery is running? ...) on top of the current, manual-only possibility of setting a limit?


            It isn't clear what the benefit of a tunable to limit the number of RQBDs is if it is off by default, since most users will not even know it exists. Even if users do know the tunable exists, there probably isn't an easy way to choose a good value for the maximum number of rqbd buffers, since that depends greatly on the RAM size of the server, the number of clients, and the client load.

            Looking at older comments here, there are several things that concern me:

            • the total number of RQBDs allocated seems far more than could possibly ever be used, since clients should typically have at most 8 RPCs in flight per OST
            • during recovery, clients should normally only have a single RPC in flight per OST

            This means there shouldn't be more than about 5000 clients * 20 OSTs/OSS = 100000 RPCs/OSS outstanding on the OSS (maybe 200000 RPCs if you have 40 OSTs/OSS in failover mode, or is 10 OSTs/OSS the normal config and 20 OSTs/OSS is the failover config?).
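
            For scale, assuming each outstanding RPC pins one request buffer that lands in a 32KB slab allocation (per the issue description), that ceiling corresponds to roughly:

                100,000 RPCs/OSS * 32 KiB ≈ 3 GiB of request buffers
                200,000 RPCs/OSS * 32 KiB ≈ 6 GiB in the 40-OST failover case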

            Are we sure that clients are not generating a flood of RPCs per OST during recovery (more than max_rpcs_in_flight)? I also recall there may be a race condition during RQBD allocation that can cause it to allocate more buffers than needed if many threads try to send an RPC at the same time and the buffers run out.

            On a related note, having so many OSTs on a single OSS is not really a good configuration: it doesn't provide very good performance, it has a very high RAM requirement (as you are seeing), and it creates a larger point of failure if the OSS goes down. In addition to the outstanding RPCs that clients may send, a fair amount of RAM is also used by the ldiskfs journals.

            Also, if you are seeing messages like the following in your logs:

            LDISKFS-fs warning (device sfa0049): ldiskfs_multi_mount_protect:331: Device is already active on another node
            

            then there is something significantly wrong with your HA or STONITH configuration. MMP is meant as a backup sanity check to prevent a double import and the resulting filesystem corruption (which it did successfully for 20 OSTs in this case), but it is not intended to be the primary HA exclusion method for the storage.

            adilger Andreas Dilger added a comment

            J-B, it would be cool if we could give this new/2nd patch a try on a sizeable peacock setup (let's say hundreds of clients, or thousands?), by forcing an OSS/OSTs failover+recovery and monitoring kmem memory consumption, to verify that the new limitation mechanism works as expected and has no unwanted side effects.

            bfaccini Bruno Faccini (Inactive) added a comment

            It looks like my first change #26752 (LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects) is not effective enough at draining rqbds during a long period of heavy request load from a huge number of clients, and thus can still allow an OOM...
            So, I have just pushed #29064 (LU-9372 ptlrpc: allow to limit number of service's rqbds) in order to allow setting a limit on the maximum number of rqbds per service.
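
            As a rough sketch of what such a limit could look like (field and tunable names here are only illustrative, not necessarily those used in #29064): the pool is only allowed to grow while the per-service total stays under a configured ceiling, with 0 meaning "no limit" to preserve the current default behaviour.

            /* Illustrative sketch of a per-service rqbd ceiling; names are
             * hypothetical and may differ from change #29064. */
            #include <stdbool.h>

            struct svc_part_sketch {
                    unsigned int scp_nrqbds_total;  /* rqbds currently allocated */
            };

            /* 0 = no limit, i.e. today's behaviour by default. */
            static unsigned int svc_rqbd_max;

            static bool may_grow_req_bufs(const struct svc_part_sketch *scp,
                                          unsigned int nbufs)
            {
                    if (svc_rqbd_max == 0)
                            return true;            /* limit disabled */
                    /* Refuse growth once the configured ceiling would be exceeded. */
                    return scp->scp_nrqbds_total + nbufs <= svc_rqbd_max;
            }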

            bfaccini Bruno Faccini (Inactive) added a comment

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/29064
            Subject: LU-9372 ptlrpc: allow to limit number of service's rqbds
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 78d7f8f02ba35bbcd84dd44f103ed1326d94c30c

            gerrit Gerrit Updater added a comment

            We use pacemaker for HA. When the first OSS crashed, the target resources failed over to the partner OSS, which explains the MMP messages.

            These OSS are KVM guests running on top of a DDN SFA14KX-E controller. Indeed, there is only one NUMA node in an OSS.

            Unfortunately, I can't confirm the slabs were size-1024 and size-32768. As far as I remember, they were, but I can't assert it...my bad.

            I will provide an action plan to capture some relevant data during the next occurrence.

            bruno.travouillon Bruno Travouillon (Inactive) added a comment

            Ok, now that I have had a better look at the OSS log you provided, I am concerned about the multiple MMP errors encountered for the OSTs: they tend to indicate that ownership of some of the OSTs was transferred to the other/HA OSS during the restart+recovery process, which may have induced some "anarchy" in the recovery process...

            Also, the OOM stats print shows only one NUMA node, is that the case? And this node's SLAB content does not look so excessive relative to the memory size ("Node 0 Normal free:64984kB ... present:91205632kB managed:89727696kB ... slab_reclaimable:1356412kB slab_unreclaimable:6092980kB ..."), so do you confirm that you could still see the "size-1024" and "size-32768" slabs growing to billions of objects???

            bfaccini Bruno Faccini (Inactive) added a comment
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            But of course!!! Well, I must have been really tired today, since I was not able to find the forced crash in the log and I was looking at MDS request sizes!...
            Thanks Bruno, I will try to push another patch soon, based on the thoughts in my earlier comment.


            About the 17K size, look at OST_MAXREQSIZE and OST_BUFSIZE in lustre/include/lustre_net.h.

            #define OST_MAXREQSIZE (16 * 1024)
            /** OST_BUFSIZE = max_reqsize + max sptlrpc payload size */
            #define OST_BUFSIZE max_t(int, OST_MAXREQSIZE + 1024, 16 * 1024)
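
            With those definitions, OST_BUFSIZE works out to max(16*1024 + 1024, 16*1024) = 17408 bytes (~17KB). Since that is just above 16KB, the kernel's power-of-two kmalloc rounding serves each buffer from a 32KB allocation, which matches the size-32768 slab objects and the wasted space mentioned in the description:

                OST_BUFSIZE = max(17408, 16384) = 17408 bytes ≈ 17KB -> rounded up to 32KB per allocation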

             

            bruno.travouillon Bruno Travouillon (Inactive) added a comment

            Hi Bruno,

            This OSS smurf623 was upgraded to CentOS 7 and Lustre 2.7.3 with patch 26752 on July 5th. Because of LU-8685, we had to upgrade the kernel the next day, July 6th, while 5201 clients had the filesystem mounted. We then hit the OOM issue while mounting the 20 OSTs on the OSS (between 14:38:48 and 15:03:13).

            bruno.travouillon Bruno Travouillon (Inactive) added a comment

            Hello Bruno,
            Is "smurf623.log-20170709" the log of one of the crashed OSSs from last occurrence of problem and running with CentOS and my patch?

            By the way, the fact that your servers are now running CentOS7 may have some implications for kernel memory allocation: the CentOS7 kernels use SLUB, whereas the kernels shipped for CentOS6 used SLAB.

            Anyway, I am investigating new possible ways to avoid OOM during server failover+recovery, such as setting a hard limit on the number/volume of ptlrpc_request_buffer_desc+buffer allocations, better understanding the reason for the 17K size of the most heavily used buffers, and/or maybe creating a specific kmem_cache for this purpose to limit wasted space...
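
            For the kmem_cache idea, a minimal sketch (hypothetical names, not an actual patch) would be a dedicated cache sized for the ~17KB buffer, so the allocator can pack these objects instead of rounding each one up to a 32KB kmalloc allocation:

            /* Illustrative only: dedicated slab cache for the ~17KB OST request
             * buffer, avoiding the generic kmalloc power-of-two rounding. */
            #include <linux/slab.h>
            #include <linux/errno.h>

            #define OST_RQBD_BUFSIZE        (16 * 1024 + 1024)      /* 17408 bytes, cf. OST_BUFSIZE */

            static struct kmem_cache *ost_rqbd_buf_cachep;

            static int ost_rqbd_cache_init(void)
            {
                    ost_rqbd_buf_cachep = kmem_cache_create("ost_rqbd_buf",
                                                            OST_RQBD_BUFSIZE,
                                                            0, 0, NULL);
                    return ost_rqbd_buf_cachep != NULL ? 0 : -ENOMEM;
            }

            static void ost_rqbd_cache_fini(void)
            {
                    kmem_cache_destroy(ost_rqbd_buf_cachep);
            }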

            bfaccini Bruno Faccini (Inactive) added a comment

            People

              bfaccini Bruno Faccini (Inactive)
              bfaccini Bruno Faccini (Inactive)
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: