
OOM happens on OSS during Lustre recovery for more than 5000 clients

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.6
    • None
    • Environment: Server running with b2_7_fe
      Clients are a mix of IEEL3 (RH7/SCS5), 2.5.3.90 (RH6/AE4), 2.7.3 (CentOS7)
    • Severity: 3

    Description

      I have been on-site to work with Bruno Travouillon (Atos) on one of the crash-dumps they have.

      After joint analysis, it looks like a huge part of memory is being consumed by "ptlrpc_request_buffer_desc" structures (17KB each due to the embedded request buffer, and allocated from 32KB slabs, which nearly doubles the real footprint!).

      Looking at the relevant source code, it appears that additional "ptlrpc_request_buffer_desc" structures can be allocated on demand by ptlrpc_check_rqbd_pool(), but they are never freed until OST unmount/stop, when ptlrpc_service_purge_all() runs (see the sketch below).

      This problem has caused several OSS failovers to fail due to OOM.
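
      For illustration only, here is a minimal sketch of the allocate-on-demand, never-freed pattern described above. The structure and function names are simplified stand-ins, not the actual Lustre implementation (the real routines, ptlrpc_check_rqbd_pool() and ptlrpc_grow_req_bufs(), live in lustre/ptlrpc/service.c):

       #include <linux/slab.h>
       #include <linux/list.h>

       struct rqbd {                      /* stand-in for ptlrpc_request_buffer_desc */
               struct list_head list;
               void *buffer;              /* ~17KB request buffer, served from a 32KB slab */
       };

       struct svc_part {                  /* stand-in for ptlrpc_service_part */
               struct list_head idle_rqbds;
               int nrqbds_total;
               int nrqbds_avail;
               int low_water;
       };

       /* Grow the request-buffer pool whenever free buffers run low under load. */
       static void check_rqbd_pool(struct svc_part *svcpt)
       {
               while (svcpt->nrqbds_avail < svcpt->low_water) {
                       struct rqbd *rqbd = kzalloc(sizeof(*rqbd), GFP_NOFS);

                       if (rqbd == NULL)
                               break;
                       rqbd->buffer = kzalloc(17 * 1024, GFP_NOFS);
                       if (rqbd->buffer == NULL) {
                               kfree(rqbd);
                               break;
                       }
                       list_add(&rqbd->list, &svcpt->idle_rqbds);
                       svcpt->nrqbds_total++;
                       svcpt->nrqbds_avail++;
               }
               /* Note: there is no shrink path here; buffers are only freed
                * when the whole service is purged at OST unmount/stop. */
       }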

      Attachments

        Issue Links

          Activity

            [LU-9372] OOM happens on OSS during Lustre recovery for more than 5000 clients

            bfaccini Bruno Faccini (Inactive) added a comment -

            Both patches, from LU-10803 and LU-10826, are also must-have follow-ons to the LU-9372 series.


            gerrit Gerrit Updater added a comment -

            Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/31622
            Subject: LU-9372 ptlrpc: fix req_buffers_max and req_history_max setting
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: aa9005eb5c9e873e9e83619ff830ba848917f118


            bfaccini Bruno Faccini (Inactive) added a comment -

            Master patch https://review.whamcloud.com/31162 from LU-10603 is required to make the associated tunable visible to the external world, and thus to allow the https://review.whamcloud.com/29064/ patch/feature to be usable.

            So just in case, Minh: any back-port of #29064 also requires back-porting #31162.


            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31108
            Subject: LU-9372 ptlrpc: allow to limit number of service's rqbds
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 69ad99bf62cf461df93419e57adb323a6d537e31

            pjones Peter Jones added a comment -

            Landed for 2.11


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29064/
            Subject: LU-9372 ptlrpc: allow to limit number of service's rqbds
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d9e57a765e73e1bc3046124433eb6e2186f7e07c


            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            We allocate 90 GB of RAM and 8 CPU cores to each OSS. We can't allocate more resources per virtual guest in an SFA14KXE. The cores are HT.
             

             oss# cat /proc/sys/lnet/cpu_partition_table
             0 : 0 1 2 3 
             1 : 4 5 6 7 
             2 : 8 9 10 11 
             3 : 12 13 14 15
            

            Last time we hit OOM, the memory consumption of the OSS was at its maximum (90GB).


            adilger Andreas Dilger added a comment -

            I couldn’t find in the ticket how much RAM is on this OSS for the 20 OSTs. I’m wondering if we are also having problems here with CPT allocations all happening on one CPT and hitting OOM while there is plenty of RAM available on a second CPT?


            bfaccini Bruno Faccini (Inactive) added a comment -

            I was just talking about this problem, and I found that I had never clearly indicated in this ticket that the reason for the 32k allocation for each ptlrpc_rqbd (for a real size of 17k) is the patch for LU-4755 ("LU-4755 ptlrpc: enlarge OST_MAXREQSIZE for 4MB RPC").

            Since the way this size was chosen looks a bit empirical, we may also want to try 15k (+ payload size, thus leading to 16k) in order to halve the real consumed size.
            To achieve almost the same size reduction, we could also use a dedicated kmem_cache/slab for the 17k ptlrpc_rqbd buffers (see the sketch below), keeping in mind that it may be made useless by kernel slab merging.
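
            For illustration only, a minimal sketch of what such a dedicated cache could look like; the cache name, helpers and error handling are hypothetical, not actual Lustre code, and the kernel may still merge this cache with an existing one of compatible size:

             #include <linux/slab.h>
             #include <linux/errno.h>

             #define RQBD_BUF_SIZE  (17 * 1024)   /* real rqbd buffer size */

             static struct kmem_cache *rqbd_buf_cachep;

             /* Create a cache sized to the real 17k need instead of falling back
              * to the generic 32k slab.  Slab merging may still fold it into an
              * existing compatible cache and defeat the purpose. */
             static int rqbd_buf_cache_init(void)
             {
                     rqbd_buf_cachep = kmem_cache_create("ptlrpc_rqbd_buf",
                                                         RQBD_BUF_SIZE, 0, 0, NULL);
                     return rqbd_buf_cachep != NULL ? 0 : -ENOMEM;
             }

             static void *rqbd_buf_alloc(void)
             {
                     return kmem_cache_alloc(rqbd_buf_cachep, GFP_NOFS);
             }

             static void rqbd_buf_free(void *buf)
             {
                     kmem_cache_free(rqbd_buf_cachep, buf);
             }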


            adilger Andreas Dilger added a comment -

            Bruno, my main concern is that a static tunable will not avoid a similar problem for most users.

            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            Andreas, I can't answer your concerns about the Lustre RPC and HA config/behavior, but I am sure Bruno will do so soon. But you may be right that something should be done to prevent a peer from trying to mount all targets upon restart/reboot.

            Concerning the memory consumption, I can confirm that the huge number of size-1024 and size-32768 objects was very close to the current number of RQBDs (sum of scp_nrqbds_total).

            About your comment concerning "a race condition during RQBD allocation, that may cause it to allocate more buffers than it needed ...", I presume you refer to the checks/code in the ptlrpc_check_rqbd_pool()/ptlrpc_grow_req_bufs() routines, where the case of a lot of RQBDs+buffers already being in use is handled. And this is where my 2 patches try to limit the allocations (see the sketch below).

            Lastly, do you mean that I should add some auto-tuning code to my patch (based on memory size/load, but also on the number of targets? only while failover/recovery is running? ...) on top of the current, manual-only possibility of setting a limit?
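
            For context, a minimal sketch of the growth-limiting idea discussed here, assuming a per-service "req_buffers_max" ceiling as in the patches under review; the structure and field names are simplified stand-ins, not the literal patch code:

             /* Simplified stand-in for a ptlrpc service partition; illustrative only. */
             struct svc_partition {
                     int nrqbds_total;       /* request buffers allocated so far */
                     int nrqbds_avail;       /* buffers currently idle/posted */
                     int low_water;          /* threshold that triggers growth */
                     int req_buffers_max;    /* new tunable: 0 means no limit */
             };

             /* Decide whether ptlrpc_check_rqbd_pool()-style code may grow the
              * pool: only when free buffers run low, and never beyond the
              * administrator-configured ceiling. */
             static int can_grow_req_bufs(const struct svc_partition *svcpt)
             {
                     if (svcpt->nrqbds_avail >= svcpt->low_water)
                             return 0;
                     if (svcpt->req_buffers_max != 0 &&
                         svcpt->nrqbds_total >= svcpt->req_buffers_max)
                             return 0;
                     return 1;
             }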


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: bfaccini Bruno Faccini (Inactive)
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: