[LU-9372] OOM happens on OSS during Lustre recovery for more than 5000 clients Created: 20/Apr/17  Updated: 08/Jun/20  Resolved: 31/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0, Lustre 2.10.6

Type: Bug Priority: Major
Reporter: Bruno Faccini (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: cea
Environment:

Server running with b2_7_fe
Clients are a mix of IEEL3 (RH7/SCS5), 2.5.3.90 (RH6/AE4), 2.7.3 (CentOS7)


Attachments: File smurf623.log-20170709    
Issue Links:
Duplicate
is duplicated by LU-1099 Lustre OSS OOMs repeatedly Resolved
Related
is related to LU-10803 req_buffers_max and req_history_max s... Resolved
is related to LU-10826 Regression in LU-9372 on OPA envirome... Resolved
is related to LU-10603 ptlrpc_lprocfs_req_buffers_max_fops u... Resolved
is related to LU-13600 limit number of RPCs in flight during... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I have been on-site to work with Bruno Travouillon (Atos) on one of the crash-dumps they have.

After joint analysis, it looks like a huge part of memory is being consumed by "ptlrpc_request_buffer_desc" objects (17KB each due to the embedded request buffer, and allocated from 32KB slabs, which nearly doubles the actual footprint as a side effect!).

Looking at the relevant source code, these "ptlrpc_request_buffer_desc" objects can be allocated on demand by ptlrpc_check_rqbd_pool(), but are never freed until OST umount/stop by ptlrpc_service_purge_all().

This problem has caused several OSS failovers to fail due to OOM.
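
For illustration only, here is a minimal userspace sketch of the grow-only behaviour described above (this is not the actual Lustre code; the structure, threshold and burst loop are invented): the pool check keeps allocating descriptors whenever the posted buffers run low, but nothing ever returns them until the service is purged.

#include <stdio.h>

#define LOW_WATERMARK 64                  /* hypothetical refill threshold */

struct pool { long total, posted; };

/* modelled on the described behaviour of ptlrpc_check_rqbd_pool():
 * top the posted buffers back up on demand, never shrink */
static void check_rqbd_pool(struct pool *p)
{
        while (p->posted < LOW_WATERMARK) {
                p->total++;               /* new ~17KB buffer from a 32KB slab */
                p->posted++;
        }
}

int main(void)
{
        struct pool p = { 0, 0 };
        long burst;

        for (burst = 0; burst < 5000; burst++) {
                p.posted = 0;             /* a request burst drains the pool   */
                check_rqbd_pool(&p);      /* ...so more buffers get allocated  */
        }
        printf("rqbds after 5000 bursts: %ld (~%.1f GiB of 32KB slab objects)\n",
               p.total, p.total * 32768.0 / (1 << 30));
        return 0;
}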



 Comments   
Comment by Bruno Faccini (Inactive) [ 20/Apr/17 ]

According to the relevant source code (and its comments), one possible option would be to run with "test_req_buffer_pressure=1" as a ptlrpc module parameter, to avoid dynamic allocation of new "ptlrpc_request_buffer_desc" objects, but this would need to be carefully tested.

On the other hand, it looks like the "history" code in ptlrpc_server_drop_request() could be changed slightly in order to progressively free "ptlrpc_request_buffer_desc" objects.

Comment by Gerrit Updater [ 20/Apr/17 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/26752
Subject: LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2d0dff4f5d2dd25ca55f5401c1139519207a0a02

Comment by Bruno Travouillon (Inactive) [ 27/Apr/17 ]

To complete the analysis, there are 958249 size-32768 slab objects in the vmcore, and
1068498 size-1024 slab objects (each ptlrpc_request_buffer_desc is a size-1024 kmalloc allocation).

There are 4 instances of ptlrpc_service_part which have a lot of rqbds:

crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809cf4dd800
  scp_nrqbds_total = 98342
crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809dbe9ec00
  scp_nrqbds_total = 302031
crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809ddc14000
  scp_nrqbds_total = 272040
crash> struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809ddc1c400
  scp_nrqbds_total = 285039


Most of the other ptlrpc_service_part instances have scp_nrqbds_total <= 64.

For these 4 instances, the rqbds are in the scp_rqbd_posted list, while
scp_nrqbds_posted is quite low:

crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809cf4dd800
  scp_nrqbds_posted = 12
crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809cf4dd800
  scp_rqbd_posted = {
    next = 0xffff8809e0758800,
    prev = 0xffff8809db055800
  }
crash> list 0xffff8809e0758800|wc -l
98343

crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809dbe9ec00
  scp_nrqbds_posted = 191
crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809dbe9ec00
  scp_rqbd_posted = {
    next = 0xffff8809ed5b7400,
    prev = 0xffff8809cf4d1000
  }
crash> list 0xffff8809ed5b7400|wc -l
302032

crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809ddc14000
  scp_nrqbds_posted = 1
crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809ddc14000
  scp_rqbd_posted = {
    next = 0xffff8809ec199400,
    prev = 0xffff8809dc6e7800
  }
crash> list 0xffff8809ec199400|wc -l
272041

crash> struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809ddc1c400
  scp_nrqbds_posted = 0
crash> struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809ddc1c400
  scp_rqbd_posted = {
    next = 0xffff8809e4880800,
    prev = 0xffff88097c4ddc00
  }
crash> list 0xffff8809e4880800|wc -l
285040

In request_in_callback(), svcpt->scp_nrqbds_posted is decremented if ev->unlinked, but it looks like no rqbd is removed from the scp_rqbd_posted list. I have to admit I don't clearly understand whether this is normal or not...

Comment by Bruno Faccini (Inactive) [ 28/Apr/17 ]

Bruno,
Thanks for this clarification, which makes me re-think my patch: before I had access to this necessary information, I had assumed that all the rqbds allocated on demand, and now in excess because they are never freed, would be linked on the scp_rqbd_idle/scp_hist_rqbds lists.

Will now try to get a new patch version available asap.

Comment by Bruno Faccini (Inactive) [ 02/May/17 ]

Bruno,
Since you have told me that some of your OSSs are in this situation, could you also check the same counters/lists for me (in fact, the values/entry counts of all ptlrpc_service_part counters/lists would be nice to have) on a live system where rqbds may have been massively allocated during a similar recovery storm but which fortunately avoided the OOM?

Comment by Gerrit Updater [ 12/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26752/
Subject: LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 220124bff7b13cd26b1b7b81ecf46e137ac174d3

Comment by Peter Jones [ 12/May/17 ]

Landed for 2.10

Comment by Bruno Travouillon (Inactive) [ 14/Jun/17 ]

Hi,

The patch has been backported into the CEA 2.7 branch.

FYI, we have been able to successfully recover 5000+ clients this morning. Thank you!

Comment by Bruno Travouillon (Inactive) [ 23/Aug/17 ]

Hi Bruno,

For the record, we hit a similar occurrence after the upgrade of our Lustre servers to CentOS 7. The OSS panicked on OOM during the recovery process (20 OSTs per OSS).

We have been able to start the filesystem by mounting successive subsets of 5 OSTs on each OSS. With only a few OSTs, memory consumption increases more slowly, and thanks to your patch https://review.whamcloud.com/#/c/26752/ , the memory was freed after the recovery completed.

Unfortunately, I don't have any vmcore available.

Comment by Bruno Faccini (Inactive) [ 24/Aug/17 ]

Hello Bruno,
Well, too bad... both for the new occurrence with the patch and for the lack of a crash-dump!
But can you at least provide the syslog/dmesg/console for one of the crashed OSSs?
And also, do I understand correctly that the last successful failover/recovery with the patch occurred with CentOS 6.x and not CentOS 7?

Comment by Bruno Faccini (Inactive) [ 07/Sep/17 ]

Hello Bruno,
Is "smurf623.log-20170709" the log of one of the crashed OSSs from last occurrence of problem and running with CentOS and my patch?

By the way, the fact that your servers are now running CentOS 7 may have some implications for kernel memory allocation, since the kernel now uses the SLUB allocator instead of the SLAB allocator shipped with CentOS 6 kernels.

Anyway, I am investigating other possible ways to avoid OOM during server failover+recovery, such as setting a hard limit on the number/volume of ptlrpc_request_buffer_desc objects and their buffers, better understanding the reason for the 17K size of the most heavily used buffers, and/or perhaps creating a specific kmem_cache for this purpose and thus limiting wasted space...

Comment by Bruno Travouillon (Inactive) [ 07/Sep/17 ]

Hi Bruno,

This OSS, smurf623, was upgraded to CentOS 7 and Lustre 2.7.3 with patch 26752 on July 5th. Because of LU-8685, we had to upgrade the kernel the next day, on July 6th, while 5201 clients had the filesystem mounted. We then hit the OOM issue while mounting the 20 OSTs on the OSS (between 14:38:48 and 15:03:13).

Comment by Bruno Travouillon (Inactive) [ 07/Sep/17 ]

About the 17K size, look at OST_MAXREQSIZE and OST_BUFSIZE in lustre/include/lustre_net.h.

#define OST_MAXREQSIZE (16 * 1024)
/** OST_BUFSIZE = max_reqsize + max sptlrpc payload size */
#define OST_BUFSIZE max_t(int, OST_MAXREQSIZE + 1024, 16 * 1024)
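
For reference, here is a minimal userspace sketch (assuming the kernel's generic power-of-two kmalloc size classes; the SPTLRPC_PAYLOAD name and the simplified OST_BUFSIZE definition below are placeholders for the real macro) showing why the resulting ~17KB buffer ends up in a 32KB slab object:

#include <stdio.h>

#define OST_MAXREQSIZE  (16 * 1024)
#define SPTLRPC_PAYLOAD 1024                  /* "max sptlrpc payload size" */
#define OST_BUFSIZE     (OST_MAXREQSIZE + SPTLRPC_PAYLOAD)

int main(void)
{
        size_t size = OST_BUFSIZE;            /* 17408 bytes, ~17KB */
        size_t slab = 1;

        while (slab < size)                   /* kmalloc rounds up to the */
                slab <<= 1;                   /* next power-of-two class  */

        printf("OST_BUFSIZE = %zu bytes -> %zu-byte slab (%.0f%% wasted)\n",
               size, slab, 100.0 * (slab - size) / slab);
        return 0;
}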

 

Comment by Bruno Faccini (Inactive) [ 07/Sep/17 ]

But of course!!!! Well, I must have been really tired today, since I was not able to find the forced crash in the log and I was looking at MDS request sizes!!...
Thanks Bruno, I will try to push another patch soon, based on the thoughts in my earlier comments.

Comment by Bruno Faccini (Inactive) [ 08/Sep/17 ]

Ok, now that I have taken a better look at the OSS log you provided, I am concerned about the multiple MMP errors encountered for the OSTs. They tend to indicate that ownership of some OSTs was transferred to the other/HA OSS during the restart+recovery process, which may have induced some "anarchy" in the recovery process...

Also, the OOM statistics only show one NUMA node; is that really the case? And this node's slab content does not look that excessive relative to the memory size ("Node 0 Normal free:64984kB ... present:91205632kB managed:89727696kB ... slab_reclaimable:1356412kB slab_unreclaimable:6092980kB ..."), so can you confirm that you could still see the "size-1024" and "size-32768" slabs growing to billions of objects???

Comment by Bruno Travouillon (Inactive) [ 14/Sep/17 ]

We use Pacemaker for HA. When the first OSS crashed, the target resources failed over to the partner OSS, which explains the MMP messages.

These OSS are KVM guests running on top of a DDN SFA14KX-E controller. Indeed, there is only one NUMA node in an OSS.

Unfortunately, I can't confirm the slabs were size-1024 and size-32768. As far as I remember, they were, but I can't assert it...my bad.

I will provide an action plan to capture some relevant data during the next occurrence.

Comment by Gerrit Updater [ 18/Sep/17 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/29064
Subject: LU-9372 ptlrpc: allow to limit number of service's rqbds
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 78d7f8f02ba35bbcd84dd44f103ed1326d94c30c

Comment by Bruno Faccini (Inactive) [ 18/Sep/17 ]

It looks like my first change #26752 (LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects) may not be enough to drain rqbds during a long period of heavy request load from a huge number of clients, and thus could still allow an OOM...
So, I have just pushed #29064 (LU-9372 ptlrpc: allow to limit number of service's rqbds) in order to allow setting a limit on the maximum number of rqbds per service.

Comment by Bruno Faccini (Inactive) [ 20/Sep/17 ]

J-B, it would be cool if we could give this new/2nd patch a try on a significant peacock setup (say with hundreds of clients, or thousands?), by forcing an OSS/OSTs failover+recovery, monitoring kmem memory consumption, and checking that the new limitation mechanism works as expected and has no adverse effect.

Comment by Andreas Dilger [ 21/Sep/17 ]

It isn't clear what the benefit of a tunable to limit the number of RQBDs is if it is off by default, since most users will not even know it exists. Even if users do know this tunable exists, there probably isn't an easy way to choose a good value for the maximum number of rqbd buffers, since that depends greatly on the RAM size of the server, the number of clients, and the client load.

Looking at older comments here, there are several things that concern me:

  • the total number of RQBDs allocated seems far more than could possibly ever be used, since clients should typically have at most 8 RPCs in flight per OST
  • during recovery, clients should normally only have a single RPC in flight per OST

This means there shouldn't be more than about 5000 clients * 20 OSTs/OSS = 100000 RPCs/OSS outstanding on the OSS (maybe 200000 RPCs if you have 40 OSTs/OSS in failover mode, or is 10 OSTs/OSS the normal config and 20 OSTs/OSS is the failover config?).
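
As a quick back-of-the-envelope check (plain C, using only numbers already reported in this ticket), the rqbd count seen in the vmcore is roughly ten times that upper bound and accounts for close to 30 GiB of 32KB slab objects:

#include <stdio.h>

int main(void)
{
        /* expected upper bound: 5000 clients * 20 OSTs/OSS * 1 RPC in flight */
        long expected = 5000L * 20 * 1;
        /* observed: sum of the four scp_nrqbds_total values from the vmcore */
        long observed = 98342L + 302031 + 272040 + 285039;
        /* each rqbd buffer is served from a 32KB slab object */
        double gib = observed * 32768.0 / (1L << 30);

        printf("expected upper bound: %ld rqbds\n", expected);      /* 100000 */
        printf("observed in vmcore:   %ld rqbds (~%.1f GiB)\n",     /* 957452 */
               observed, gib);
        return 0;
}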

Are we sure that clients are not generating a flood of RPCs per OST during recovery (more than max_rpcs_in_flight)? I also recall there may be a race condition during RQBD allocation that can cause more buffers to be allocated than needed if many threads try to send an RPC at the same time and the buffers run out.

On a related note, having so many OSTs on a single OSS is not really a good configuration, since it doesn't provide very good performance, and as you see it has a very high RAM requirement, and also causes a larger point of failure if the OSS goes down. In addition to the number of outstanding RPCs that the clients may send, there is also a bunch of RAM used by the ldiskfs journals.

Also, if you are seeing messages like the following in your logs:

LDISKFS-fs warning (device sfa0049): ldiskfs_multi_mount_protect:331: Device is already active on another node

then there is something significantly wrong with your HA or STONITH configuration. MMP is meant as a backup sanity check that prevents double mounting and the resulting filesystem corruption (which it did successfully for 20 OSTs in this case), but it is not intended to be the primary HA exclusion method for the storage.

Comment by Bruno Faccini (Inactive) [ 03/Oct/17 ]

Andreas, I can't answer your concerns about the Lustre RPC and HA config/behavior, but I am sure Bruno will soon. But you may be right that something should be done to prevent a peer from trying to mount all targets upon restart/reboot.

Concerning the memory consumption, I can confirm that the huge number of size-1024 and size-32768 objects was very close to the current number of RQBDs (sum of scp_nrqbds_total).

About your comment concerning "a race condition during RQBD allocation, that may cause it to allocate more buffers than it needed ...", I presume you refer to the checks/code in the ptlrpc_check_rqbd_pool()/ptlrpc_grow_req_bufs() routines, where such over-allocation of RQBDs+buffers can currently happen. And this is where my two patches try to limit the allocations.

Lastly, do you mean that I should add some auto-tuning code to my patch (based on memory size/load, but also the number of targets? only when failover/recovery is running? ...) in addition to the current/only possibility of manually setting a limit?

Comment by Andreas Dilger [ 04/Oct/17 ]

Bruno, my main concern is that a static tunable will not avoid a similar problem for most users.

Comment by Bruno Faccini (Inactive) [ 22/Nov/17 ]

I was just talking about this problem, and I realized that I had never clearly indicated in this ticket that the reason for the 32K allocation for each ptlrpc_rqbd (for a real size of 17K) is the patch for LU-4755 ("LU-4755 ptlrpc: enlarge OST_MAXREQSIZE for 4MB RPC").

Since the way this size was chosen looks a bit empirical, we may also want to try 15K (+ payload size, thus leading to 16K) in order to halve the real consumed size.
To achieve almost the same size reduction, we could also try a dedicated kmem_cache/slab for the 17K ptlrpc_rqbd buffers, keeping in mind that this may be rendered useless by kernel slab merging.
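
For what it's worth, here is a hypothetical sketch of the dedicated-cache idea (the cache name and init function below are made up, this is not an actual patch): a kmem_cache sized to the real ~17K buffer would pack objects at their real size instead of 32KB kmalloc slots, unless the kernel merges the cache with a compatible one and the saving is lost.

#include <linux/errno.h>
#include <linux/slab.h>

static struct kmem_cache *rqbd_buf_cache;

static int rqbd_buf_cache_init(size_t bufsize)
{
        /* one dedicated cache for the ~17K rqbd buffers; without any
         * special flags the kernel may still merge it with a compatible
         * cache, which would cancel the space saving */
        rqbd_buf_cache = kmem_cache_create("ptlrpc_rqbd_buf", bufsize,
                                           0, 0, NULL);
        return rqbd_buf_cache != NULL ? 0 : -ENOMEM;
}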

Comment by Andreas Dilger [ 16/Dec/17 ]

I couldn't find in the ticket how much RAM is on this OSS for the 20 OSTs. I'm wondering if we are also having problems here with CPT allocations all happening on one CPT and hitting OOM while there is plenty of RAM available on a second CPT?

Comment by Bruno Travouillon (Inactive) [ 16/Dec/17 ]

We allocate 90 GB of RAM and 8 CPU cores to each OSS. We can't allocate more resources per virtual guest in a SFA14KXE. The cores are HT.
 

 oss# cat /proc/sys/lnet/cpu_partition_table
 0 : 0 1 2 3 
 1 : 4 5 6 7 
 2 : 8 9 10 11 
 3 : 12 13 14 15

Last time we hit OOM, the memory consumption of the OSS was at its maximum (90GB).

Comment by Gerrit Updater [ 31/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29064/
Subject: LU-9372 ptlrpc: allow to limit number of service's rqbds
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d9e57a765e73e1bc3046124433eb6e2186f7e07c

Comment by Peter Jones [ 31/Jan/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 31/Jan/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31108
Subject: LU-9372 ptlrpc: allow to limit number of service's rqbds
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 69ad99bf62cf461df93419e57adb323a6d537e31

Comment by Bruno Faccini (Inactive) [ 20/Feb/18 ]

Master patch https://review.whamcloud.com/31162 from LU-10603 is required to make the associated tunable visible externally and thus to make the https://review.whamcloud.com/29064/ patch/feature usable.

So just in case, Minh: any back-port of #29064 also requires back-porting #31162.

Comment by Gerrit Updater [ 12/Mar/18 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/31622
Subject: LU-9372 ptlrpc: fix req_buffers_max and req_history_max setting
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aa9005eb5c9e873e9e83619ff830ba848917f118

Comment by Bruno Faccini (Inactive) [ 21/Mar/18 ]

Both patches from LU-10803 and LU-10826 are also must-have follow-ons to the LU-9372 series.
