[LU-11163] CPT-bound allocations can fail if NUMA node is OOM Created: 21/Jul/18 Updated: 02/Nov/18 Resolved: 02/Nov/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
With a simple tcp/socklnd configuration there are memory allocation failures even for very small (sub-PAGE_SIZE) allocations:

[oss00 ~]# modprobe lustre -v
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/libcfs.ko
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/lnet.ko networks=tcp0(ens801)
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/obdclass.ko
insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/ptlrpc.ko
modprobe: ERROR: could not insert 'lustre': Cannot allocate memory
[oss00 ~]# dmesg
[ 998.653695] libcfs: loading out-of-tree module taints kernel.
[ 998.654085] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[ 998.658330] LNet: HW NUMA nodes: 2, HW CPU cores: 28, npartitions: 2
[ 998.660558] alg: No test for adler32 (adler32-zlib)
[ 998.660606] alg: No test for crc32 (crc32-table)
[ 999.418205] Lustre: Lustre: Build Version: 2.10.1
[ 999.447239] LNet: Added LNI 192.168.99.3@tcp [8/256/0/180]
[ 999.447295] LNet: Accept secure, port 988
[ 999.448457] SLUB: Unable to allocate memory on node 1 (gfp=0x8050)
[ 999.448460] cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
[ 999.448500] node 0: slabs: 597, objs: 25074, free: 8257
[ 1001.448048] Lustre: 150394:0:(ptlrpcd.c:640:ptlrpcd_stop()) Thread for pc ffff880074f00018 was not started
[ 1001.448066] Lustre: 150394:0:(ptlrpcd.c:659:ptlrpcd_free()) Thread for pc ffff880074f00018 was not started
[ 1003.446986] LustreError: 150394:0:(events.c:631:ptlrpc_init_portals()) rpcd initialisation failed
[ 1004.446991] LNet: Removed LNI 192.168.99.3@tcp

This can happen if there are two NUMA nodes, but only one of them has memory:

[oss01 tmp]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32671 MB
node 0 free: 30820 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 0 MB
node 1 free: 0 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
but can also happen during normal operations if there is a significant imbalance in allocations between the nodes (e.g. |
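For anyone checking whether a server is exposed to the memoryless-node case described above, the following is a minimal Python sketch that scans `numactl -H` output for nodes that have CPUs but report 0 MB of memory. The function name `memoryless_nodes` and the `SAMPLE` constant are hypothetical names used only for illustration; this is not part of Lustre or of the patch, and it assumes the output layout shown in this ticket.

```python
import re

# Sample copied from the `numactl -H` output quoted in this ticket.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32671 MB
node 0 free: 30820 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 0 MB
node 1 free: 0 MB
"""


def memoryless_nodes(numactl_output):
    """Return sorted NUMA node ids that list CPUs but report 0 MB of memory.

    Assumes lines of the form 'node N cpus: ...' and 'node N size: X MB',
    as printed by `numactl -H` in the ticket above.
    """
    cpus = {}   # node id -> list of CPU ids (as strings)
    sizes = {}  # node id -> total memory in MB
    for line in numactl_output.splitlines():
        m = re.match(r"\s*node (\d+) cpus:(.*)$", line)
        if m:
            cpus[int(m.group(1))] = m.group(2).split()
            continue
        m = re.match(r"\s*node (\d+) size:\s*(\d+) MB", line)
        if m:
            sizes[int(m.group(1))] = int(m.group(2))
    # A node is a candidate for this bug if CPUs are bound to it but it has no memory.
    return sorted(n for n, mb in sizes.items() if mb == 0 and cpus.get(n))


if __name__ == "__main__":
    print(memoryless_nodes(SAMPLE))
```

On a live system the same function could be fed the captured output of `numactl -H`. A non-empty result suggests that allocations bound strictly to that node may fail in the way shown in the dmesg log.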
| Comments |
| Comment by Gerrit Updater [ 21/Jul/18 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32848 |
| Comment by Gerrit Updater [ 02/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32848/ |
| Comment by Peter Jones [ 02/Nov/18 ] |
|
Landed for 2.12 |