
OOM happens on OSS during Lustre recovery for more than 5000 clients

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.6
    • Environment: Servers running b2_7_fe; clients are a mix of IEEL3 (RH7/SCS5), 2.5.3.90 (RH6/AE4), 2.7.3 (CentOS7)
    • Severity: 3

    Description

      I have been on-site to work with Bruno Travouillon (Atos) on one of their crash-dumps.

      After joint analysis, it looks like a huge amount of memory is being consumed by "ptlrpc_request_buffer_desc" buffers (about 17KB each due to the embedded request, and allocated from 32KB slabs, which roughly doubles the footprint!).

      Looking at the relevant source code, it appears that these "ptlrpc_request_buffer_desc" objects are allocated on demand by ptlrpc_check_rqbd_pool(), but are never freed until the OST is unmounted/stopped via ptlrpc_service_purge_all().

      This problem has caused several OSS failovers to fail due to OOM.
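
      The pattern is easy to picture with a minimal userspace sketch. This is not the real Lustre code: the helpers below only mimic the roles of ptlrpc_check_rqbd_pool() and ptlrpc_service_purge_all() described above, with illustrative sizes. The pool grows whenever idle buffers run low, and nothing ever shrinks it until the service is purged:

          #include <stdio.h>
          #include <stdlib.h>

          #define RQBD_BUF_SIZE (17 * 1024)   /* ~17KB per buffer, as in the dump */

          struct rqbd {
              struct rqbd *next;
              char buf[RQBD_BUF_SIZE];
          };

          static struct rqbd *pool;            /* every rqbd ever allocated */
          static unsigned long nrqbds_total;
          static unsigned long nrqbds_idle;

          /* grow the pool when idle buffers run low (ptlrpc_check_rqbd_pool role) */
          static void check_rqbd_pool(unsigned long low_water)
          {
              while (nrqbds_idle < low_water) {
                  struct rqbd *b = malloc(sizeof(*b));  /* kernel: ~17KB kmalloc */

                  if (b == NULL)
                      return;                           /* this is where an OSS OOMs */
                  b->next = pool;
                  pool = b;
                  nrqbds_total++;
                  nrqbds_idle++;
              }
          }

          /* the only free path (ptlrpc_service_purge_all role, at OST umount) */
          static void service_purge_all(void)
          {
              while (pool != NULL) {
                  struct rqbd *b = pool;

                  pool = b->next;
                  free(b);
              }
              nrqbds_total = nrqbds_idle = 0;
          }

          int main(void)
          {
              /* each "recovery burst" consumes the idle buffers, so the pool only grows */
              for (int burst = 1; burst <= 5; burst++) {
                  nrqbds_idle = 0;
                  check_rqbd_pool(1000);
                  printf("after burst %d: %lu rqbds, ~%lu MB\n", burst, nrqbds_total,
                         nrqbds_total * (unsigned long)sizeof(struct rqbd) >> 20);
              }
              service_purge_all();
              return 0;
          }

      During a recovery storm with thousands of clients the same pattern runs at a much larger scale, which is how the rqbd pool alone can exhaust an OSS.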

      Attachments

        Issue Links

          Activity


            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi Bruno,

            This OSS, smurf623, was upgraded to CentOS 7 and Lustre 2.7.3 with patch 26752 on July 5th. Because of LU-8685, we had to upgrade the kernel the next day, on July 6th, while 5201 clients had the filesystem mounted. We then hit the OOM issue while mounting the 20 OSTs on the OSS (between 14:38:48 and 15:03:13).

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Bruno,
            Is "smurf623.log-20170709" the log of one of the crashed OSSs from the last occurrence of the problem, running CentOS and my patch?

            By the way, the fact that your servers are now running CentOS 7 may have some implications: in terms of kernel memory allocation, the kernel now uses SLUB instead of the SLAB allocator used by the kernels shipped for CentOS 6.

            Anyway, I am investigating new possible ways to avoid OOM during server failover+recovery, such as setting a hard limit on the number/volume of ptlrpc_request_buffer_desc+buffer allocations, better understanding why the most commonly used buffers are 17K in size, and/or perhaps creating a dedicated kmem_cache for this purpose to limit wasted space...
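
            On the wasted-space point, a minimal userspace sketch shows why a ~17K rqbd buffer ends up occupying a 32K slab object, wasting almost half of it. The kmalloc_bucket() helper below is not kernel code, only a stand-in for how large kmalloc() allocations are rounded up to power-of-two size classes:

                #include <stdio.h>

                /* stand-in for the power-of-two rounding of kmalloc size classes */
                static size_t kmalloc_bucket(size_t size)
                {
                    size_t bucket = 32;

                    while (bucket < size)
                        bucket <<= 1;
                    return bucket;
                }

                int main(void)
                {
                    size_t req = 17 * 1024;         /* rqbd buffer + embedded request */
                    size_t bucket = kmalloc_bucket(req);

                    printf("%zu bytes requested -> %zu byte slab object, %zu bytes (%.0f%%) wasted\n",
                           req, bucket, bucket - req,
                           100.0 * (double)(bucket - req) / (double)bucket);
                    return 0;
                }

            A dedicated kmem_cache sized to the exact object, as suggested above, could in principle reduce that rounding loss, although how much it helps in practice depends on how the allocator packs such large objects into slab pages.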

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Bruno,
            Well, too bad... both for the new occurrence with the patch and for the lack of a crash-dump!
            But can you at least provide the syslog/dmesg/console output for one of the crashed OSSs?
            Also, do I understand correctly that the last successful failover/recovery with the patch occurred on CentOS 6.x and not CentOS 7?

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi Bruno,

            For the record, we hit a similar occurrence after the upgrade of our Lustre servers to CentOS 7. The OSS panicked on OOM during the recovery process (20 OSTs per OSS).

            We were able to start the filesystem by mounting successive subsets of 5 OSTs on each OSS. With a few OSTs, the memory consumption increases more slowly, and thanks to your patch https://review.whamcloud.com/#/c/26752/ the memory was freed after the completion of the recovery.

            Unfortunately, I don't have any vmcore available.

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi,

            The patch has been backported into the CEA 2.7 branch.

            FYI, we have been able to successfully recover 5000+ clients this morning. Thank you!
            pjones Peter Jones added a comment -

            Landed for 2.10


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26752/
            Subject: LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 220124bff7b13cd26b1b7b81ecf46e137ac174d3

            bfaccini Bruno Faccini (Inactive) added a comment -

            Bruno,
            Since you told me that you have some OSSs in this situation, could you also check these same counters/lists for me (in fact, the values/entry counts of all the ptlrpc_service_part counters/lists would be nice to have) on a live system where rqbds may have been massively allocated during a similar recovery storm but the OOM was fortunately avoided?
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Bruno,
            Thanks for this clarification, which makes me rethink my patch: before I had access to this necessary information, I had assumed that all the rqbds allocated on demand, and now in excess because they are never freed, would be linked on the scp_rqbd_idle/scp_hist_rqbds lists.

            I will now try to get a new patch version available as soon as possible.
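
            To picture the drain idea behind the patch, here is a conceptual, self-contained sketch. It is not the actual change in 26752: the svc_part/drain_idle_rqbds names are made up, and as explained above the first version wrongly assumed the excess rqbds would sit on the scp_rqbd_idle/scp_hist_rqbds lists. It only shows the general "free rqbds above a limit" idea:

                #include <stdlib.h>

                /* simplified stand-ins for the real ptlrpc structures */
                struct rqbd {
                    struct rqbd *next;
                    /* ... the ~17KB request buffer would hang off here ... */
                };

                struct svc_part {
                    struct rqbd   *rqbd_idle;     /* stand-in for an idle-rqbd list */
                    unsigned long  nrqbds_total;  /* stand-in for scp_nrqbds_total */
                };

                /* free idle rqbds until the partition is back under 'limit' */
                static void drain_idle_rqbds(struct svc_part *scp, unsigned long limit)
                {
                    while (scp->nrqbds_total > limit && scp->rqbd_idle != NULL) {
                        struct rqbd *b = scp->rqbd_idle;

                        scp->rqbd_idle = b->next;
                        free(b);
                        scp->nrqbds_total--;
                    }
                }

                int main(void)
                {
                    struct svc_part scp = { .rqbd_idle = NULL, .nrqbds_total = 0 };

                    drain_idle_rqbds(&scp, 64);   /* no-op on an empty partition */
                    return 0;
                }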

            bruno.travouillon Bruno Travouillon (Inactive) added a comment - edited

            To complete the analysis, there are 958249 size-32768 slab objects in the vmcore and 1068498 size-1024 slab objects (the ptlrpc_request_buffer_desc structures themselves are kmalloc'd from size-1024).

            There are 4 instances of ptlrpc_service_part which have a lot of rqbds:

            crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809cf4dd800
              scp_nrqbds_total = 98342
            crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809dbe9ec00
              scp_nrqbds_total = 302031
            crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809ddc14000
              scp_nrqbds_total = 272040
            crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809ddc1c400
              scp_nrqbds_total = 285039
            
            
            

            Most of the other ptlrpc_service_part instances have scp_nrqbds_total <= 64.

            For these 4 instances, the rqbds are in the scp_rqbd_posted list, while
            scp_nrqbds_posted is quite low:

            crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809cf4dd800
              scp_nrqbds_posted = 12
            crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809cf4dd800
              scp_rqbd_posted = {
                next = 0xffff8809e0758800,
                prev = 0xffff8809db055800
              }
            crash> list 0xffff8809e0758800|wc -l
            98343
            
            crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809dbe9ec00
              scp_nrqbds_posted = 191
            crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809dbe9ec00
              scp_rqbd_posted = {
                next = 0xffff8809ed5b7400,
                prev = 0xffff8809cf4d1000
              }
            crash> list 0xffff8809ed5b7400|wc -l
            302032
            
            crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809ddc14000
              scp_nrqbds_posted = 1
            crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809ddc14000
              scp_rqbd_posted = {
                next = 0xffff8809ec199400,
                prev = 0xffff8809dc6e7800
              }
            crash> list 0xffff8809ec199400|wc -l
            272041
            
            crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809ddc1c400
              scp_nrqbds_posted = 0
            crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809ddc1c400
              scp_rqbd_posted = {
                next = 0xffff8809e4880800,
                prev = 0xffff88097c4ddc00
              }
            crash> list 0xffff8809e4880800|wc -l
            285040
            
            

            In request_in_callback(), svcpt->scp_nrqbds_posted is decremented if ev->unlinked, but it looks like no rqbd is removed from the scp_rqbd_posted list. I have to admit that I don't clearly understand whether this is normal or not...
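
            Taking these numbers at face value, and assuming each rqbd owns one ~17KB request buffer that lands in a size-32768 slab object (as described in this ticket), the four partitions account for roughly:

                98342 + 302031 + 272040 + 285039 = 957452 rqbds
                957452 buffers     x 32 KiB each ≈ 29.2 GiB
                957452 descriptors x  1 KiB each ≈  0.9 GiB

            i.e. close to 30 GiB tied up in request buffers alone, consistent with the 958249 size-32768 objects reported above.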


            gerrit Gerrit Updater added a comment -

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/26752
            Subject: LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2d0dff4f5d2dd25ca55f5401c1139519207a0a02

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: bfaccini Bruno Faccini (Inactive)
              Votes: 0
              Watchers: 11
