
OOM happens on OSS during Lustre recovery for more than 5000 clients

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.6
    • Labels: None
    • Environment: Servers running b2_7_fe; clients are a mix of IEEL3 (RH7/SCS5), 2.5.3.90 (RH6/AE4), 2.7.3 (CentOS7)
    • Severity: 3

    Description

      I have been on-site to work with Bruno Travouillon (Atos) on one of the crash-dumps they have.

      After joint analysis, it looks like a huge part of memory is being consumed by "ptlrpc_request_buffer_desc" objects (17KB each due to the embedded request buffer, and allocated from 32KB slabs, which nearly doubles the footprint!).

      Looking at the relevant source code, it appears that additional "ptlrpc_request_buffer_desc" objects can be allocated on demand by ptlrpc_check_rqbd_pool(), but they are never freed until OST umount/stop by ptlrpc_service_purge_all().

      This problem has caused several OSS failovers to fail due to OOM.
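
      For a rough sense of scale, here is a back-of-the-envelope illustration in plain userspace C (not Lustre source; the buffer count is a hypothetical figure for a busy recovery, not a measured value):

      #include <stdio.h>

      int main(void)
      {
          const long long ost_bufsize = 16 * 1024 + 1024; /* OST_MAXREQSIZE + 1KB sptlrpc payload = 17408 bytes */
          const long long slab_object = 32 * 1024;        /* next power-of-two slab object actually used */
          const long long nrqbd       = 100 * 1000;       /* hypothetical rqbd count reached during recovery */

          printf("wasted per buffer: %lld bytes\n", slab_object - ost_bufsize);        /* ~15KB lost to rounding */
          printf("pool footprint   : %lld MB\n", nrqbd * slab_object / (1024 * 1024)); /* ~3GB with these numbers */
          return 0;
      }

      Since the pool is only drained when the service stops, that footprint can only grow while clients reconnect during recovery.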

      Attachments

        Issue Links

          Activity

            [LU-9372] OOM happens on OSS during Lustre recovery for more than 5000 clients

            We use pacemaker for HA. When the first OSS crashed, the target resources failed over to the partner OSS, which explains the MMP messages.

            These OSS are KVM guests running on top of a DDN SFA14KX-E controller. Indeed, there is only one NUMA node in an OSS.

            Unfortunately, I can't confirm the slabs were size-1024 and size-32768. As far as I remember, they were, but I can't assert it...my bad.

            I will provide an action plan to capture some relevant data during the next occurrence.

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Ok, now that I have had a better look at the OSS log you provided, I am concerned about the multiple MMP errors encountered for OSTs. They tend to indicate that ownership of some of the OSTs was transferred to another/HA OSS during the restart+recovery process, which may have induced some "anarchy" in the recovery process...

            Also, the OOM stats only show one NUMA node; is that the case? And this node's SLAB content does not look so excessive relative to the memory size ("Node 0 Normal free:64984kB ... present:91205632kB managed:89727696kB ... slab_reclaimable:1356412kB slab_unreclaimable:6092980kB ..."), so can you confirm that you could still see the "size-1024" and "size-32768" slabs growing to billions of objects?

            bfaccini Bruno Faccini (Inactive) added a comment -
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            But of course!!!! Well, I must have been really tired today, since I was not able to find the forced crash in the log and I was looking at MDS request sizes!!...
            Thanks Bruno, I will try to push another patch soon, based on the thoughts in my earlier comment.


            About the 17K size, look at OST_MAXREQSIZE and OST_BUFSIZE in lustre/include/lustre_net.h.

            #define OST_MAXREQSIZE (16 * 1024)
            /** OST_BUFSIZE = max_reqsize + max sptlrpc payload size */
            #define OST_BUFSIZE max_t(int, OST_MAXREQSIZE + 1024, 16 * 1024)
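
            For reference, a quick userspace check (not Lustre source; MAX_T_INT is just a stand-in for the kernel's max_t) shows that OST_BUFSIZE evaluates to 17408 bytes, just above 16KB, which is why each buffer ends up in a 32KB power-of-two slab object:

            #include <stdio.h>

            #define OST_MAXREQSIZE (16 * 1024)
            /* userspace stand-in for the kernel's max_t(int, a, b) */
            #define MAX_T_INT(a, b) ((int)(a) > (int)(b) ? (int)(a) : (int)(b))
            #define OST_BUFSIZE MAX_T_INT(OST_MAXREQSIZE + 1024, 16 * 1024)

            int main(void)
            {
                printf("OST_BUFSIZE = %d bytes\n", OST_BUFSIZE); /* 17408 */
                return 0;
            }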


            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi Bruno,

            This OSS smurf623 was upgraded to CentOS 7, Lustre 2.7.3 with patch 26752 on July 5th. Because of LU-8685, we had to upgrade the kernel the next day, on July 6th, while 5201 clients had the filesystem mounted. We then hit the OOM issue while mounting the 20 OSTs on the OSS (between 14:38:48 and 15:03:13).

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hello Bruno,
            Is "smurf623.log-20170709" the log of one of the crashed OSSs from last occurrence of problem and running with CentOS and my patch?

            By the way, the fact that your servers are now running CentOS 7 may have some implications: in terms of kernel memory allocation, these kernels use SLUB, whereas the kernels shipped for CentOS 6 used SLAB.

            Anyway, I am investigating new possible ways to avoid OOM during server failover+recovery, such as setting a hard limit on the volume/number of ptlrpc_request_buffer_desc+buffer allocations, better understanding the reason for the 17K size of the most heavily used buffers, and/or maybe creating a specific kmem_cache for this purpose and thus limiting wasted space...
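
            To make the kmem_cache idea concrete, here is a minimal, hypothetical kernel-module sketch (illustrative names and sizes only, not the eventual patch); a cache created for the ~17KB buffer size can let the allocator pack these buffers more tightly than the generic 32KB kmalloc bucket:

            #include <linux/module.h>
            #include <linux/slab.h>

            /* Hypothetical demo cache for ~17KB request buffers. */
            #define DEMO_RQBD_BUF_SIZE (17 * 1024)

            static struct kmem_cache *demo_rqbd_cache;

            static int __init demo_init(void)
            {
                /* A dedicated cache avoids the power-of-two rounding done by kmalloc(). */
                demo_rqbd_cache = kmem_cache_create("demo_rqbd_buf",
                                                    DEMO_RQBD_BUF_SIZE, 0, 0, NULL);
                return demo_rqbd_cache ? 0 : -ENOMEM;
            }

            static void __exit demo_exit(void)
            {
                kmem_cache_destroy(demo_rqbd_cache);
            }

            module_init(demo_init);
            module_exit(demo_exit);
            MODULE_LICENSE("GPL");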

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Bruno,
            Well, too bad, both for the new occurrence happening with the patch and for the lack of an available crash dump!
            But can you at least provide the syslog/dmesg/console for one of the crashed OSSs?
            And also, do I understand correctly that the last successful failover/recovery with the patch occurred on CentOS 6.x and not CentOS 7?

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hi Bruno,

            For the record, we hit a similar occurrence after the upgrade of our Lustre servers to CentOS 7. The OSSs panicked on OOM during the recovery process (20 OSTs per OSS).

            We were able to start the filesystem by mounting successive subsets of 5 OSTs on each OSS. With only a few OSTs, memory consumption increases more slowly, and thanks to your patch https://review.whamcloud.com/#/c/26752/ the memory was freed after the completion of the recovery.

            Unfortunately, I don't have any vmcore available.

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi,

            The patch has been backported into the CEA 2.7 branch.

            FYI, we have been able to successfully recover 5000+ clients this morning. Thank you!

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -
            pjones Peter Jones added a comment -

            Landed for 2.10


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26752/
            Subject: LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 220124bff7b13cd26b1b7b81ecf46e137ac174d3

            gerrit Gerrit Updater added a comment -

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: bfaccini Bruno Faccini (Inactive)
              Votes: 0
              Watchers: 11
