
OOM happens on OSS during Lustre recovery for more than 5000 clients

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.6
    • Environment: Servers running b2_7_fe; clients are a mix of IEEL3 (RH7/SCS5), 2.5.3.90 (RH6/AE4), 2.7.3 (CentOS7)
    • Severity: 3

    Description

      I have been on-site to work with Bruno Travouillon (Atos) on one of their crash-dumps.

      After joint analysis, it looks like a huge amount of memory is being consumed by "ptlrpc_request_buffer_desc" buffers (about 17KB each due to the embedded request, and allocated from 32KB slabs, which roughly doubles the footprint!).

      Looking at the relevant source code, it appears that these "ptlrpc_request_buffer_desc" objects are allocated on demand by ptlrpc_check_rqbd_pool(), but are never freed until the OST is unmounted/stopped via ptlrpc_service_purge_all().

      This problem has caused several OSS failovers to fail due to OOM.
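
      The pattern is easy to picture with a minimal userspace sketch. This is not the real Lustre code: the helpers below only mimic the roles of ptlrpc_check_rqbd_pool() and ptlrpc_service_purge_all() described above, with illustrative sizes. The pool grows whenever idle buffers run low, and nothing ever shrinks it until the service is purged:

          #include <stdio.h>
          #include <stdlib.h>

          #define RQBD_BUF_SIZE (17 * 1024)   /* ~17KB per buffer, as in the dump */

          struct rqbd {
              struct rqbd *next;
              char buf[RQBD_BUF_SIZE];
          };

          static struct rqbd *pool;            /* every rqbd ever allocated */
          static unsigned long nrqbds_total;
          static unsigned long nrqbds_idle;

          /* grow the pool when idle buffers run low (ptlrpc_check_rqbd_pool role) */
          static void check_rqbd_pool(unsigned long low_water)
          {
              while (nrqbds_idle < low_water) {
                  struct rqbd *b = malloc(sizeof(*b));  /* kernel: ~17KB kmalloc */

                  if (b == NULL)
                      return;                           /* this is where an OSS OOMs */
                  b->next = pool;
                  pool = b;
                  nrqbds_total++;
                  nrqbds_idle++;
              }
          }

          /* the only free path (ptlrpc_service_purge_all role, at OST umount) */
          static void service_purge_all(void)
          {
              while (pool != NULL) {
                  struct rqbd *b = pool;

                  pool = b->next;
                  free(b);
              }
              nrqbds_total = nrqbds_idle = 0;
          }

          int main(void)
          {
              /* each "recovery burst" consumes the idle buffers, so the pool only grows */
              for (int burst = 1; burst <= 5; burst++) {
                  nrqbds_idle = 0;
                  check_rqbd_pool(1000);
                  printf("after burst %d: %lu rqbds, ~%lu MB\n", burst, nrqbds_total,
                         nrqbds_total * (unsigned long)sizeof(struct rqbd) >> 20);
              }
              service_purge_all();
              return 0;
          }

      During a recovery storm with thousands of clients the same pattern runs at a much larger scale, which is how the rqbd pool alone can exhaust an OSS.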

      Attachments

        Issue Links

          Activity


            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi Bruno,

            This OSS, smurf623, was upgraded to CentOS 7 and Lustre 2.7.3 with patch 26752 on July 5th. Because of LU-8685, we had to upgrade the kernel the next day, on July 6th, while 5201 clients had the filesystem mounted. We then hit the OOM issue while mounting the 20 OSTs on the OSS (between 14:38:48 and 15:03:13).

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Bruno,
            Is "smurf623.log-20170709" the log of one of the crashed OSSs from the last occurrence of the problem, running CentOS and my patch?

            By the way, the fact that your servers are now running CentOS 7 may have some implications: in terms of kernel memory allocation, the kernel now uses SLUB instead of the SLAB allocator used by the kernels shipped for CentOS 6.

            Anyway, I am investigating new possible ways to avoid OOM during server failover+recovery, such as setting a hard limit on the number/volume of ptlrpc_request_buffer_desc+buffer allocations, better understanding why the most commonly used buffers are 17K in size, and/or perhaps creating a dedicated kmem_cache for this purpose to limit wasted space...
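
            On the wasted-space point, a minimal userspace sketch shows why a ~17K rqbd buffer ends up occupying a 32K slab object, wasting almost half of it. The kmalloc_bucket() helper below is not kernel code, only a stand-in for how large kmalloc() allocations are rounded up to power-of-two size classes:

                #include <stdio.h>

                /* stand-in for the power-of-two rounding of kmalloc size classes */
                static size_t kmalloc_bucket(size_t size)
                {
                    size_t bucket = 32;

                    while (bucket < size)
                        bucket <<= 1;
                    return bucket;
                }

                int main(void)
                {
                    size_t req = 17 * 1024;         /* rqbd buffer + embedded request */
                    size_t bucket = kmalloc_bucket(req);

                    printf("%zu bytes requested -> %zu byte slab object, %zu bytes (%.0f%%) wasted\n",
                           req, bucket, bucket - req,
                           100.0 * (double)(bucket - req) / (double)bucket);
                    return 0;
                }

            A dedicated kmem_cache sized to the exact object, as suggested above, could in principle reduce that rounding loss, although how much it helps in practice depends on how the allocator packs such large objects into slab pages.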

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Bruno,
            Well, too bad... both for the new occurrence with the patch and for the lack of a crash-dump!
            But can you at least provide the syslog/dmesg/console output for one of the crashed OSSs?
            Also, do I understand correctly that the last successful failover/recovery with the patch occurred on CentOS 6.x and not CentOS 7?

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi Bruno,

            For the record, we hit a similar occurrence after the upgrade of our Lustre servers to CentOS 7. The OSS panicked on OOM during the recovery process (20 OSTs per OSS).

            We were able to start the filesystem by mounting successive subsets of 5 OSTs on each OSS. With a few OSTs, the memory consumption increases more slowly, and thanks to your patch https://review.whamcloud.com/#/c/26752/ the memory was freed after the completion of the recovery.

            Unfortunately, I don't have any vmcore available.

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi,

            The patch has been backported into the CEA 2.7 branch.

            FYI, we have been able to successfully recover 5000+ clients this morning. Thank you!
            pjones Peter Jones added a comment -

            Landed for 2.10


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26752/
            Subject: LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 220124bff7b13cd26b1b7b81ecf46e137ac174d3

            bfaccini Bruno Faccini (Inactive) added a comment -

            Bruno,
            Since you told me that you have some OSSs in this situation, could you also check these same counters/lists for me (in fact, the values/entry counts of all the ptlrpc_service_part counters/lists would be nice to have) on a live system where rqbds may have been massively allocated during a similar recovery storm but the OOM was fortunately avoided?
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Bruno,
            Thanks for this clarification, which makes me rethink my patch: before I had access to this necessary information, I had assumed that all the rqbds allocated on demand, and now in excess because they are never freed, would be linked on the scp_rqbd_idle/scp_hist_rqbds lists.

            I will now try to get a new patch version available as soon as possible.
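
            To picture the drain idea behind the patch, here is a conceptual, self-contained sketch. It is not the actual change in 26752: the svc_part/drain_idle_rqbds names are made up, and as explained above the first version wrongly assumed the excess rqbds would sit on the scp_rqbd_idle/scp_hist_rqbds lists. It only shows the general "free rqbds above a limit" idea:

                #include <stdlib.h>

                /* simplified stand-ins for the real ptlrpc structures */
                struct rqbd {
                    struct rqbd *next;
                    /* ... the ~17KB request buffer would hang off here ... */
                };

                struct svc_part {
                    struct rqbd   *rqbd_idle;     /* stand-in for an idle-rqbd list */
                    unsigned long  nrqbds_total;  /* stand-in for scp_nrqbds_total */
                };

                /* free idle rqbds until the partition is back under 'limit' */
                static void drain_idle_rqbds(struct svc_part *scp, unsigned long limit)
                {
                    while (scp->nrqbds_total > limit && scp->rqbd_idle != NULL) {
                        struct rqbd *b = scp->rqbd_idle;

                        scp->rqbd_idle = b->next;
                        free(b);
                        scp->nrqbds_total--;
                    }
                }

                int main(void)
                {
                    struct svc_part scp = { .rqbd_idle = NULL, .nrqbds_total = 0 };

                    drain_idle_rqbds(&scp, 64);   /* no-op on an empty partition */
                    return 0;
                }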

            bruno.travouillon Bruno Travouillon (Inactive) added a comment - edited

            To complete the analysis, there are 958249 size-32768 slab objects in the vmcore and 1068498 size-1024 slab objects (the ptlrpc_request_buffer_desc structures themselves are kmalloc'd from size-1024).

            There are 4 instances of ptlrpc_service_part which have a lot of rqbds:

            crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809cf4dd800
              scp_nrqbds_total = 98342
            crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809dbe9ec00
              scp_nrqbds_total = 302031
            crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809ddc14000
              scp_nrqbds_total = 272040
            crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809ddc1c400
              scp_nrqbds_total = 285039
            
            
            

            Most of the other ptlrpc_service_part instances have scp_nrqbds_total <= 64.

            For these 4 instances, the rqbds are in the scp_rqbd_posted list, while
            scp_nrqbds_posted is quite low:

            crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809cf4dd800
              scp_nrqbds_posted = 12
            crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809cf4dd800
              scp_rqbd_posted = {
                next = 0xffff8809e0758800,
                prev = 0xffff8809db055800
              }
            crash> list 0xffff8809e0758800|wc -l
            98343
            
            crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809dbe9ec00
              scp_nrqbds_posted = 191
            crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809dbe9ec00
              scp_rqbd_posted = {
                next = 0xffff8809ed5b7400,
                prev = 0xffff8809cf4d1000
              }
            crash> list 0xffff8809ed5b7400|wc -l
            302032
            
            crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809ddc14000
              scp_nrqbds_posted = 1
            crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809ddc14000
              scp_rqbd_posted = {
                next = 0xffff8809ec199400,
                prev = 0xffff8809dc6e7800
              }
            crash> list 0xffff8809ec199400|wc -l
            272041
            
            crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809ddc1c400
              scp_nrqbds_posted = 0
            crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809ddc1c400
              scp_rqbd_posted = {
                next = 0xffff8809e4880800,
                prev = 0xffff88097c4ddc00
              }
            crash> list 0xffff8809e4880800|wc -l
            285040
            
            

            In request_in_callback(), svcpt->scp_nrqbds_posted is decremented if ev->unlinked, but it looks like no rqbd is removed from the scp_rqbd_posted list. I have to admit that I don't clearly understand whether this is normal or not...
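
            Taking these numbers at face value, and assuming each rqbd owns one ~17KB request buffer that lands in a size-32768 slab object (as described in this ticket), the four partitions account for roughly:

                98342 + 302031 + 272040 + 285039 = 957452 rqbds
                957452 buffers     x 32 KiB each ≈ 29.2 GiB
                957452 descriptors x  1 KiB each ≈  0.9 GiB

            i.e. close to 30 GiB tied up in request buffers alone, consistent with the 958249 size-32768 objects reported above.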


            gerrit Gerrit Updater added a comment -

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/26752
            Subject: LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2d0dff4f5d2dd25ca55f5401c1139519207a0a02

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: bfaccini Bruno Faccini (Inactive)
              Votes: 0
              Watchers: 11
