It isn't clear what the benefit of a tunable to limit the number of RQBDs is if it is off by default, since most users will not even know it exists. Even for users who do know about the tunable, there probably isn't an easy way to pick a good value for the maximum number of RQBD buffers, since that depends greatly on the RAM size of the server, the number of clients, and the client load.
Looking at older comments here, there are several things that concern me:
- the total number of RQBDs allocated seems far more than could ever possibly be used, since clients should typically only have at most 8 RPCs in flight per OST
- during recovery, clients should normally only have a single RPC in flight per OST
This means there shouldn't be more than about 5000 clients * 20 OSTs/OSS = 100000 RPCs/OSS outstanding on the OSS (maybe 200000 RPCs if you have 40 OSTs/OSS in failover mode, or is 10 OSTs/OSS the normal config and 20 OSTs/OSS the failover config?).
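For scale, here is a rough upper bound on the buffer memory that would imply, assuming as a worst case one 32KB RQBD allocation per outstanding RPC (the 32KB-per-rqbd figure comes from the later comments in this ticket):

```c
/* Back-of-envelope only: worst case of one 32KB RQBD allocation per
 * outstanding RPC; real RQBD usage may be lower. */
#include <stdio.h>

int main(void)
{
	long long clients = 5000;
	long long osts_per_oss = 20;              /* 40 in failover mode */
	long long rpcs = clients * osts_per_oss;  /* 100000 RPCs/OSS */
	long long rqbd_alloc = 32 * 1024;         /* bytes per rqbd buffer */

	printf("%lld RPCs * %lld bytes = %.2f GB\n",
	       rpcs, rqbd_alloc, (double)(rpcs * rqbd_alloc) / 1e9);
	return 0;
}
```

That is on the order of 3.3 GB (double it for the failover case), which is already significant but still well below what has been reported, so the RQBD count itself looks like the problem.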
Are we sure that clients are not generating a flood of RPCs per OST during recovery (more than max_rpcs_in_flight)? I also recall there may be a race condition during RQBD allocation that can cause it to allocate more buffers than needed if many threads try to send an RPC at the same time and the buffers run out, as in the toy sketch below.
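To illustrate the class of race I mean (this is a toy userspace sketch, not the actual ptlrpc RQBD code): if the "are we out of buffers?" check and the pool growth are not done under one lock, several service threads can all see the pool as empty and each grow it, overshooting the target.

```c
/* Toy check-then-allocate race, not Lustre code: nfree is read without
 * the lock, so every thread can decide to grow the pool at once. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8

static int nfree;                       /* free buffers in the pool */
static int nallocated;                  /* total buffers ever allocated */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void *service_thread(void *arg)
{
	if (nfree == 0) {               /* unlocked check ... */
		pthread_mutex_lock(&pool_lock);
		nallocated += 64;       /* ... so each thread grows by 64 */
		nfree += 64;
		pthread_mutex_unlock(&pool_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, service_thread, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	/* With the race, this can print far more than the 64 needed. */
	printf("allocated %d buffers\n", nallocated);
	return 0;
}
```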
On a related note, having so many OSTs on a single OSS is not really a good configuration: it doesn't provide very good performance, it has a very high RAM requirement (as you are seeing), and it makes the OSS a larger point of failure if it goes down. In addition to the RAM consumed by the outstanding RPCs that clients may send, a significant amount of RAM is also used by the ldiskfs journals.
Also, if you are seeing messages like the following in your logs:
then there is something significantly wrong with your HA or STONITH configuration. MMP is meant as a backup sanity check that prevents a double mount/import and the resulting filesystem corruption (which it did successfully for 20 OSTs in this case), but it is not intended to be the primary HA exclusion method for the storage.
I was just talking about this problem, and I realized I had never clearly indicated in this ticket that the reason for the 32k allocation for each ptlrpc_rqbd (for a real size of 17k) is the patch for LU-4755 ("LU-4755 ptlrpc: enlarge OST_MAXREQSIZE for 4MB RPC"). Since the way this size was chosen looks a bit empirical, we may also want to try 15k (+ payload size, thus leading to 16k) in order to halve the real consumed size.
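If I understand the halving correctly, it comes from the allocator rounding the 17k request up to the next power-of-two kmalloc size class (32k), while anything at or below 16k stays in the 16k class. A quick userspace sketch of that rounding (the sizes are the ones from this comment; the helper is purely illustrative):

```c
/* Illustrative only: assumes the rqbd buffers come from the generic
 * power-of-two kmalloc size classes (..., 16k, 32k, ...). */
#include <stdio.h>

static unsigned long roundup_pow2(unsigned long x)
{
	unsigned long p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

int main(void)
{
	/* current request size: 17k rounds up to the 32k class */
	printf("17k request -> %luk class\n", roundup_pow2(17 * 1024) / 1024);
	/* proposed: 15k + payload = 16k fits the 16k class exactly */
	printf("16k request -> %luk class\n", roundup_pow2(16 * 1024) / 1024);
	return 0;
}
```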
To achieve almost the same size reduction, we could also try using a dedicated kmem_cache/slab for the 17k ptlrpc_rqbd buffers, keeping in mind that this may be made useless by the kernel's slab merging.
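Something along these lines, as a rough sketch only (the cache name and the RQBD_BUF_SIZE constant are placeholders, not names from the Lustre tree); the slab-merging caveat is that the kernel may fold this cache into a compatible generic one unless merging is disabled, e.g. with the slab_nomerge boot option:

```c
/* Sketch of a dedicated slab cache sized for the ~17k rqbd buffer, so
 * allocations are not rounded up to the 32k kmalloc class.  Names are
 * illustrative only. */
#include <linux/errno.h>
#include <linux/slab.h>

#define RQBD_BUF_SIZE	(17 * 1024)

static struct kmem_cache *rqbd_buf_cache;

static int rqbd_cache_init(void)
{
	rqbd_buf_cache = kmem_cache_create("rqbd_buf_cache",
					   RQBD_BUF_SIZE, 0, 0, NULL);
	return rqbd_buf_cache ? 0 : -ENOMEM;
}

static void *rqbd_buf_alloc(void)
{
	/* GFP_NOFS is typical for allocations on the server I/O path */
	return kmem_cache_alloc(rqbd_buf_cache, GFP_NOFS);
}

static void rqbd_buf_free(void *buf)
{
	kmem_cache_free(rqbd_buf_cache, buf);
}
```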