Details
Type: Bug
Resolution: Fixed
Priority: Minor
Description
With a simple tcp/socklnd configuration, memory allocations fail even when they are very small (sub-PAGE_SIZE):
[oss00 ~]# modprobe lustre -v
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/libcfs.ko
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/lnet.ko networks=tcp0(ens801)
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/obdclass.ko
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/ptlrpc.ko
modprobe: ERROR: could not insert 'lustre': Cannot allocate memory
[oss00 ~]# dmesg
[ 998.653695] libcfs: loading out-of-tree module taints kernel.
[ 998.654085] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[ 998.658330] LNet: HW NUMA nodes: 2, HW CPU cores: 28, npartitions: 2
[ 998.660558] alg: No test for adler32 (adler32-zlib)
[ 998.660606] alg: No test for crc32 (crc32-table)
[ 999.418205] Lustre: Lustre: Build Version: 2.10.1
[ 999.447239] LNet: Added LNI 192.168.99.3@tcp [8/256/0/180]
[ 999.447295] LNet: Accept secure, port 988
[ 999.448457] SLUB: Unable to allocate memory on node 1 (gfp=0x8050)
[ 999.448460] cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
[ 999.448500] node 0: slabs: 597, objs: 25074, free: 8257
[ 1001.448048] Lustre: 150394:0:(ptlrpcd.c:640:ptlrpcd_stop()) Thread for pc ffff880074f00018 was not started
[ 1001.448066] Lustre: 150394:0:(ptlrpcd.c:659:ptlrpcd_free()) Thread for pc ffff880074f00018 was not started
[ 1003.446986] LustreError: 150394:0:(events.c:631:ptlrpc_init_portals()) rpcd initialisation failed
[ 1004.446991] LNet: Removed LNI 192.168.99.3@tcp
This can happen if there are two NUMA nodes, but only one of them has memory:
[oss01 tmp]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32671 MB
node 0 free: 30820 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 0 MB
node 1 free: 0 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
but it can also happen during normal operation if allocations are significantly imbalanced between the nodes (e.g. LU-5050).
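For anyone checking whether a server is in the memoryless-node configuration described above, here is a minimal sketch that scans `numactl -H` output for nodes reporting a size of 0 MB. The function name `memoryless_nodes` and the `SAMPLE` string are illustrative and not part of Lustre or numactl. The sample is abbreviated from the output in this report, and the parsing assumes the `node N size: X MB` line format shown there.

```python
import re

def memoryless_nodes(numactl_output):
    """Return the sorted ids of NUMA nodes whose reported size is 0 MB.

    Assumes lines of the form 'node <id> size: <n> MB', as printed by
    `numactl -H` in this report.
    """
    sizes = re.findall(r"node (\d+) size: (\d+) MB", numactl_output)
    return sorted(int(node) for node, mb in sizes if int(mb) == 0)

# Abbreviated copy of the oss01 output quoted in this report.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32671 MB
node 0 free: 30820 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 0 MB
node 1 free: 0 MB
"""

print(memoryless_nodes(SAMPLE))  # prints [1]
```

On a live system the same function could be fed the captured output of `numactl -H`. A non-empty result only indicates that the first condition applies. It does not detect the allocation-imbalance case.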
Landed for 2.12