[LU-9728] out of memory on OSS causing allocation failures or hung threads
Created: 30/Jun/17  Updated: 16/Oct/21  Resolved: 29/Jul/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.3, Lustre 2.10.0
Fix Version/s: Lustre 2.10.1, Lustre 2.11.0

Type: Bug
Priority: Major
Reporter: Andreas Dilger
Assignee: Andreas Dilger
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-15117 ofd_read_lock vs transaction deadlock... Resolved
Severity: 3

 Description   

In several recent cases, memory allocations on the OSS have failed, or service threads have appeared hung, because the Lustre read cache was using a large amount of RAM:

LNet: Service thread pid 4950 was inactive for 200.73s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

schedule+0x29/0x70
schedule_timeout+0x209/0x2d0
io_schedule_timeout+0xae/0x130
io_schedule+0x18/0x20
sleep_on_page+0xe/0x20
__wait_on_bit_lock+0x5b/0xc0
__lock_page+0x78/0xa0
__find_lock_page+0x54/0x70
find_or_create_page+0x34/0xa0
osd_bufs_get+0x20f/0x410 [osd_ldiskfs]
ofd_preprw+0x647/0x11a0 [ofd]
tgt_brw_read+0x9a1/0x14c0 [ptlrpc]
tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
ptlrpc_main+0xc00/0x1f60 [ptlrpc]

Looking at the page allocation path from osd_bufs_get() to osd_get_page(), it appears that allocations only ever use GFP_NOFS, to avoid recursing into the filesystem:

static struct page *osd_get_page(struct dt_object *dt, loff_t offset, int rw)
{
        page = find_or_create_page(inode->i_mapping, offset >> PAGE_SHIFT,
                                   GFP_NOFS | __GFP_HIGHMEM);

However, the equivalent pre-OSD code used GFP_HIGHUSER for requests from non-local clients, so that OSS threads could generate memory pressure and perform direct memory reclaim when memory was short:

/*
 * the routine is used to request pages from pagecache
 *
 * use GFP_NOFS for requests from a local client not allowing to enter FS
 * as we might end up waiting on a page he sent in the request we're serving.
 * use __GFP_HIGHMEM so that the pages can use all of the available memory
 * on 32-bit machines
 * use more aggressive GFP_HIGHUSER flags from non-local clients to be able to
 * generate more memory pressure.
 *
 * See Bug 19529 and Bug 19917 for details.
 */
static struct page *filter_get_page(struct obd_device *obd, struct inode *inode,
                                    obd_off offset, int localreq)
{
        page = find_or_create_page(inode->i_mapping, offset >> CFS_PAGE_SHIFT,
                                   (localreq ? (GFP_NOFS | __GFP_HIGHMEM) :
                                             GFP_HIGHUSER));

It looks like something similar can be done in the OSD code, at least for ldiskfs. It is less clear what is possible for ZFS, since buffer allocation is handled quite differently there.
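
As a rough illustration of the idea above, here is a minimal userspace sketch of choosing the allocation mask by request origin, in the way filter_get_page() did. It is not the patch itself. The sketch_* and SK_* names and all bit values are hypothetical stand-ins, not the kernel's real gfp_t definitions; only the composition is meant to mirror the kernel, where GFP_NOFS omits __GFP_FS and GFP_HIGHUSER is GFP_USER plus __GFP_HIGHMEM.

```c
#include <stdbool.h>

/* Illustrative stand-in bits only; NOT the kernel's real gfp_t values. */
typedef unsigned int sketch_gfp_t;

#define SK_GFP_RECLAIM  0x01u  /* caller may sleep and reclaim memory */
#define SK_GFP_IO       0x02u  /* reclaim may start physical I/O */
#define SK_GFP_FS       0x04u  /* reclaim may call back into the filesystem */
#define SK_GFP_HARDWALL 0x08u  /* enforce cpuset allocation policy */
#define SK_GFP_HIGHMEM  0x10u  /* page may come from high memory */

/* Composite masks, modeled on how the kernel builds its own. */
#define SK_GFP_NOFS     (SK_GFP_RECLAIM | SK_GFP_IO)
#define SK_GFP_USER     (SK_GFP_RECLAIM | SK_GFP_IO | SK_GFP_FS | SK_GFP_HARDWALL)
#define SK_GFP_HIGHUSER (SK_GFP_USER | SK_GFP_HIGHMEM)

/*
 * Pick the mask for a page cache allocation (hypothetical helper).
 * Local requests keep the restrictive mask so reclaim cannot re-enter
 * the filesystem; non-local requests get the more aggressive mask.
 */
static sketch_gfp_t sketch_page_gfp_mask(bool local_request)
{
        return local_request ? (SK_GFP_NOFS | SK_GFP_HIGHMEM)
                             : SK_GFP_HIGHUSER;
}

/* True if reclaim triggered under this mask may enter the filesystem. */
static bool sketch_may_enter_fs(sketch_gfp_t mask)
{
        return (mask & SK_GFP_FS) != 0;
}
```

With these definitions, sketch_page_gfp_mask(false) yields a mask for which sketch_may_enter_fs() is true, while the local case stays false. That difference is what lets a non-local request push reclaim into the read cache.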



 Comments   
Comment by Gerrit Updater [ 01/Jul/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/27908
Subject: LU-9728 osd: use GFP_HIGHUSER for non-local IO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2c79106a72d42070768c887f8a1b85a508d4f9b3

Comment by Gerrit Updater [ 29/Jul/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27908/
Subject: LU-9728 osd: use GFP_HIGHUSER for non-local IO
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b0ab95d6133e783acacc6329c025d17fb282775e

Comment by Peter Jones [ 29/Jul/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 02/Aug/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28318
Subject: LU-9728 osd: use GFP_HIGHUSER for non-local IO
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 06e40d2220cf9895a7fac74a2f86582d3fc38c1f

Comment by Gerrit Updater [ 10/Aug/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28318/
Subject: LU-9728 osd: use GFP_HIGHUSER for non-local IO
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: a4c7545f6e77229a3eabe537eb9ed161ff3c88ee
