[LU-11163] CPT-bound allocations can fail if NUMA node is OOM Created: 21/Jul/18  Updated: 02/Nov/18  Resolved: 02/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7553 Lustre cpu_npartitions default value ... Resolved
is related to LU-5050 cpu partitioning oddities Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With a simple tcp/socklnd configuration, memory allocations fail even for very small (sub-PAGE_SIZE) requests:

[oss00 ~]# modprobe lustre -v
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/libcfs.ko
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/lnet.ko networks=tcp0(ens801)
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/obdclass.ko
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/ptlrpc.ko
modprobe: ERROR: could not insert 'lustre': Cannot allocate memory
[oss00 ~]# dmesg
[ 998.653695] libcfs: loading out-of-tree module taints kernel.
[ 998.654085] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[ 998.658330] LNet: HW NUMA nodes: 2, HW CPU cores: 28, npartitions: 2
[ 998.660558] alg: No test for adler32 (adler32-zlib)
[ 998.660606] alg: No test for crc32 (crc32-table)
[ 999.418205] Lustre: Lustre: Build Version: 2.10.1
[ 999.447239] LNet: Added LNI 192.168.99.3@tcp [8/256/0/180]
[ 999.447295] LNet: Accept secure, port 988
[ 999.448457] SLUB: Unable to allocate memory on node 1 (gfp=0x8050)
[ 999.448460] cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
[ 999.448500] node 0: slabs: 597, objs: 25074, free: 8257
[ 1001.448048] Lustre: 150394:0:(ptlrpcd.c:640:ptlrpcd_stop()) Thread for pc ffff880074f00018 was not started
[ 1001.448066] Lustre: 150394:0:(ptlrpcd.c:659:ptlrpcd_free()) Thread for pc ffff880074f00018 was not started
[ 1003.446986] LustreError: 150394:0:(events.c:631:ptlrpc_init_portals()) rpcd initialisation failed
[ 1004.446991] LNet: Removed LNI 192.168.99.3@tcp 
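
When triaging similar reports, it can help to pull the failing node and gfp mask out of the kernel log automatically. The following is a minimal sketch, assuming the SLUB message has exactly the format shown in the dmesg output above; the function name `find_slub_failures` and the sample text are illustrative and not part of Lustre or the kernel.

```python
import re

# Matches the SLUB failure line as it appears in the dmesg output above.
# Assumption: other kernel versions may word this message differently.
SLUB_FAIL = re.compile(
    r"SLUB: Unable to allocate memory on node (?P<node>-?\d+) "
    r"\(gfp=(?P<gfp>0x[0-9a-fA-F]+)\)"
)


def find_slub_failures(dmesg_text):
    """Return a list of (node, gfp_mask) tuples, one per SLUB failure line."""
    failures = []
    for line in dmesg_text.splitlines():
        match = SLUB_FAIL.search(line)
        if match:
            node = int(match.group("node"))
            gfp_mask = int(match.group("gfp"), 16)
            failures.append((node, gfp_mask))
    return failures


# Sample lines copied from the log in this ticket.
sample = """\
[ 999.447295] LNet: Accept secure, port 988
[ 999.448457] SLUB: Unable to allocate memory on node 1 (gfp=0x8050)
[ 999.448460] cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
"""

for node, gfp_mask in find_slub_failures(sample):
    print("allocation failed on node %d, gfp=%#x" % (node, gfp_mask))
```

For the log in this ticket the sketch reports a single failure on node 1, which is the node that has no memory in the numactl output below.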

This can happen if there are two NUMA nodes, but only one of them has memory:

[oss01 tmp]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32671 MB
node 0 free: 30820 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 0 MB
node 1 free: 0 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

It can also happen during normal operation if allocations are significantly imbalanced between the nodes (e.g. LU-5050).
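
The memoryless-node case can be checked for before loading the modules. The following is a minimal sketch that parses `numactl -H` output in the format shown above and lists nodes that have CPUs but report 0 MB of memory; the function name `memoryless_nodes` is hypothetical, and the parsing assumes the "node N cpus:" and "node N size: X MB" line formats seen in this ticket. It only detects nodes with no memory at all, not the runtime imbalance case.

```python
import re


def memoryless_nodes(numactl_output):
    """Return sorted IDs of NUMA nodes that have CPUs but a size of 0 MB."""
    cpus = {}     # node id -> list of CPU ids (as strings)
    size_mb = {}  # node id -> reported memory size in MB
    for raw_line in numactl_output.splitlines():
        line = raw_line.strip()
        match = re.match(r"node (\d+) cpus:(.*)", line)
        if match:
            cpus[int(match.group(1))] = match.group(2).split()
            continue
        match = re.match(r"node (\d+) size: (\d+) MB", line)
        if match:
            size_mb[int(match.group(1))] = int(match.group(2))
    return sorted(
        node for node, cpu_list in cpus.items()
        if cpu_list and size_mb.get(node, 0) == 0
    )


# Sample copied from the numactl -H output in this ticket.
sample_numactl = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32671 MB
node 0 free: 30820 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 0 MB
node 1 free: 0 MB
"""

print(memoryless_nodes(sample_numactl))
```

On a live system the input could come from `subprocess.run(["numactl", "-H"], capture_output=True, text=True).stdout`, assuming numactl is installed.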



 Comments   
Comment by Gerrit Updater [ 21/Jul/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32848
Subject: LU-11163 libcfs: fix CPT NUMA memory failures
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6f974a9772990e2532fa15b5f7cd60e836336550

Comment by Gerrit Updater [ 02/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32848/
Subject: LU-11163 libcfs: fix CPT NUMA memory failures
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1ca7f6329833d551f69fd8aec29b66845bedb0c9

Comment by Peter Jones [ 02/Nov/18 ]

Landed for 2.12
