
OOM happens on OSS during Lustre recovery for more than 5000 clients

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.6
    • None
    • Environment: Server running with b2_7_fe
      Clients are a mix of IEEL3 (RH7/SCS5), 2.5.3.90 (RH6/AE4), 2.7.3 (CentOS7)
    • Severity: 3

    Description

      I have been on-site to work with Bruno Travouillon (Atos) on one of the crash-dumps they have.

      After joint analysis, it looks like a huge part of memory is being consumed by "ptlrpc_request_buffer_desc" structures (17KB each due to the embedded request buffer, and allocated from 32KB slabs, which nearly doubles the real footprint!).

      Looking at the relevant source code, it appears that additional "ptlrpc_request_buffer_desc" structures can be allocated on demand by ptlrpc_check_rqbd_pool(), but they are never freed until OST unmount/stop, when ptlrpc_service_purge_all() runs (see the sketch below).

      This problem has caused several OSS failovers to fail due to OOM.
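
      For illustration only, here is a minimal sketch of the allocate-on-demand, never-freed pattern described above. The structure and function names are simplified stand-ins, not the actual Lustre implementation (the real routines, ptlrpc_check_rqbd_pool() and ptlrpc_grow_req_bufs(), live in lustre/ptlrpc/service.c):

       #include <linux/slab.h>
       #include <linux/list.h>

       struct rqbd {                      /* stand-in for ptlrpc_request_buffer_desc */
               struct list_head list;
               void *buffer;              /* ~17KB request buffer, served from a 32KB slab */
       };

       struct svc_part {                  /* stand-in for ptlrpc_service_part */
               struct list_head idle_rqbds;
               int nrqbds_total;
               int nrqbds_avail;
               int low_water;
       };

       /* Grow the request-buffer pool whenever free buffers run low under load. */
       static void check_rqbd_pool(struct svc_part *svcpt)
       {
               while (svcpt->nrqbds_avail < svcpt->low_water) {
                       struct rqbd *rqbd = kzalloc(sizeof(*rqbd), GFP_NOFS);

                       if (rqbd == NULL)
                               break;
                       rqbd->buffer = kzalloc(17 * 1024, GFP_NOFS);
                       if (rqbd->buffer == NULL) {
                               kfree(rqbd);
                               break;
                       }
                       list_add(&rqbd->list, &svcpt->idle_rqbds);
                       svcpt->nrqbds_total++;
                       svcpt->nrqbds_avail++;
               }
               /* Note: there is no shrink path here; buffers are only freed
                * when the whole service is purged at OST unmount/stop. */
       }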

      Attachments

        Issue Links

          Activity

            [LU-9372] OOM happens on OSS during Lustre recovery for more than 5000 clients

            bfaccini Bruno Faccini (Inactive) added a comment -

            Both patches, from LU-10803 and LU-10826, are also must-have follow-ons to the LU-9372 series.


            gerrit Gerrit Updater added a comment -

            Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/31622
            Subject: LU-9372 ptlrpc: fix req_buffers_max and req_history_max setting
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: aa9005eb5c9e873e9e83619ff830ba848917f118


            bfaccini Bruno Faccini (Inactive) added a comment -

            Master patch https://review.whamcloud.com/31162 from LU-10603 is required to make the associated tunable visible to the external world, and thus to allow the https://review.whamcloud.com/29064/ patch/feature to be usable.

            So just in case, Minh: any back-port of #29064 also requires back-porting #31162.


            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31108
            Subject: LU-9372 ptlrpc: allow to limit number of service's rqbds
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 69ad99bf62cf461df93419e57adb323a6d537e31

            pjones Peter Jones added a comment -

            Landed for 2.11


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29064/
            Subject: LU-9372 ptlrpc: allow to limit number of service's rqbds
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d9e57a765e73e1bc3046124433eb6e2186f7e07c


            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            We allocate 90 GB of RAM and 8 CPU cores to each OSS. We can't allocate more resources per virtual guest in an SFA14KXE. The cores are HT.
             

             oss# cat /proc/sys/lnet/cpu_partition_table
             0 : 0 1 2 3 
             1 : 4 5 6 7 
             2 : 8 9 10 11 
             3 : 12 13 14 15
            

            Last time we hit OOM, the memory consumption of the OSS was at its maximum (90GB).


            adilger Andreas Dilger added a comment -

            I couldn’t find in the ticket how much RAM is on this OSS for the 20 OSTs. I’m wondering if we are also having problems here with CPT allocations all happening on one CPT and hitting OOM while there is plenty of RAM available on a second CPT?


            bfaccini Bruno Faccini (Inactive) added a comment -

            I was just talking about this problem, and I found that I had never clearly indicated in this ticket that the reason for the 32k allocation for each ptlrpc_rqbd (for a real size of 17k) is the patch for LU-4755 ("LU-4755 ptlrpc: enlarge OST_MAXREQSIZE for 4MB RPC").

            Since the way this size was chosen looks a bit empirical, we may also want to try 15k (+ payload size, thus leading to 16k) in order to halve the real consumed size.
            To achieve almost the same size reduction, we could also use a dedicated kmem_cache/slab for the 17k ptlrpc_rqbd buffers (see the sketch below), keeping in mind that it may be made useless by kernel slab merging.
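
            For illustration only, a minimal sketch of what such a dedicated cache could look like; the cache name, helpers and error handling are hypothetical, not actual Lustre code, and the kernel may still merge this cache with an existing one of compatible size:

             #include <linux/slab.h>
             #include <linux/errno.h>

             #define RQBD_BUF_SIZE  (17 * 1024)   /* real rqbd buffer size */

             static struct kmem_cache *rqbd_buf_cachep;

             /* Create a cache sized to the real 17k need instead of falling back
              * to the generic 32k slab.  Slab merging may still fold it into an
              * existing compatible cache and defeat the purpose. */
             static int rqbd_buf_cache_init(void)
             {
                     rqbd_buf_cachep = kmem_cache_create("ptlrpc_rqbd_buf",
                                                         RQBD_BUF_SIZE, 0, 0, NULL);
                     return rqbd_buf_cachep != NULL ? 0 : -ENOMEM;
             }

             static void *rqbd_buf_alloc(void)
             {
                     return kmem_cache_alloc(rqbd_buf_cachep, GFP_NOFS);
             }

             static void rqbd_buf_free(void *buf)
             {
                     kmem_cache_free(rqbd_buf_cachep, buf);
             }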


            adilger Andreas Dilger added a comment -

            Bruno, my main concern is that a static tunable will not avoid a similar problem for most users.

            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            Andreas, I can't answer your concerns about the Lustre RPC and HA config/behavior, but I am sure Bruno will do so soon. But you may be right that something should be done to prevent a peer from trying to mount all targets upon restart/reboot.

            Concerning the memory consumption, I can confirm that the huge number of size-1024 and size-32768 objects was very close to the current number of RQBDs (sum of scp_nrqbds_total).

            About your comment concerning "a race condition during RQBD allocation, that may cause it to allocate more buffers than it needed ...", I presume you refer to the checks/code in the ptlrpc_check_rqbd_pool()/ptlrpc_grow_req_bufs() routines, where the case of a lot of RQBDs+buffers already being in use is handled. And this is where my 2 patches try to limit the allocations (see the sketch below).

            Lastly, do you mean that I should add some auto-tuning code to my patch (based on memory size/load, but also on the number of targets? only while failover/recovery is running? ...) on top of the current, manual-only possibility of setting a limit?
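
            For context, a minimal sketch of the growth-limiting idea discussed here, assuming a per-service "req_buffers_max" ceiling as in the patches under review; the structure and field names are simplified stand-ins, not the literal patch code:

             /* Simplified stand-in for a ptlrpc service partition; illustrative only. */
             struct svc_partition {
                     int nrqbds_total;       /* request buffers allocated so far */
                     int nrqbds_avail;       /* buffers currently idle/posted */
                     int low_water;          /* threshold that triggers growth */
                     int req_buffers_max;    /* new tunable: 0 means no limit */
             };

             /* Decide whether ptlrpc_check_rqbd_pool()-style code may grow the
              * pool: only when free buffers run low, and never beyond the
              * administrator-configured ceiling. */
             static int can_grow_req_bufs(const struct svc_partition *svcpt)
             {
                     if (svcpt->nrqbds_avail >= svcpt->low_water)
                             return 0;
                     if (svcpt->req_buffers_max != 0 &&
                         svcpt->nrqbds_total >= svcpt->req_buffers_max)
                             return 0;
                     return 1;
             }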


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: bfaccini Bruno Faccini (Inactive)
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: