[LU-11163] CPT-bound allocations can fail if NUMA node is OOM

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0

    Description

      With a simple tcp/socklnd configuration, memory allocations fail even for very small (sub-PAGE_SIZE) requests:

      [oss00 ~]# modprobe lustre -v
      insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/libcfs.ko
      insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/lnet.ko networks=tcp0(ens801)
      insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/obdclass.ko
      insmod /lib/modules/3.10.0-693.11.1.el7.x86_64/extra/ptlrpc.ko
      modprobe: ERROR: could not insert 'lustre': Cannot allocate memory
      [oss00 ~]# dmesg
      [ 998.653695] libcfs: loading out-of-tree module taints kernel.
      [ 998.654085] libcfs: module verification failed: signature and/or required key missing - tainting kernel
      [ 998.658330] LNet: HW NUMA nodes: 2, HW CPU cores: 28, npartitions: 2
      [ 998.660558] alg: No test for adler32 (adler32-zlib)
      [ 998.660606] alg: No test for crc32 (crc32-table)
      [ 999.418205] Lustre: Lustre: Build Version: 2.10.1
      [ 999.447239] LNet: Added LNI 192.168.99.3@tcp [8/256/0/180]
      [ 999.447295] LNet: Accept secure, port 988
      [ 999.448457] SLUB: Unable to allocate memory on node 1 (gfp=0x8050)
      [ 999.448460] cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
      [ 999.448500] node 0: slabs: 597, objs: 25074, free: 8257
      [ 1001.448048] Lustre: 150394:0:(ptlrpcd.c:640:ptlrpcd_stop()) Thread for pc ffff880074f00018 was not started
      [ 1001.448066] Lustre: 150394:0:(ptlrpcd.c:659:ptlrpcd_free()) Thread for pc ffff880074f00018 was not started
      [ 1003.446986] LustreError: 150394:0:(events.c:631:ptlrpc_init_portals()) rpcd initialisation failed
      [ 1004.446991] LNet: Removed LNI 192.168.99.3@tcp 
      
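The gfp=0x8050 mask in the SLUB message above can be decoded by hand. The following is a minimal sketch, assuming the flag bit values from upstream Linux 3.10 include/linux/gfp.h (these values differ between kernel versions). The gfp_decode() helper and its table are illustrative names, not kernel or Lustre code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: bit values as in upstream Linux 3.10 include/linux/gfp.h.
 * Other kernel versions number these flags differently. */
struct gfp_bit {
	unsigned int bit;
	const char *name;
};

static const struct gfp_bit gfp_bits[] = {
	{ 0x10u,    "__GFP_WAIT" },
	{ 0x40u,    "__GFP_IO" },
	{ 0x80u,    "__GFP_FS" },
	{ 0x8000u,  "__GFP_ZERO" },
	{ 0x40000u, "__GFP_THISNODE" },
};

/* Write a '|'-separated list of the known flags set in mask into buf.
 * Returns the bits that were not recognised by the table above. */
static unsigned int gfp_decode(unsigned int mask, char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(gfp_bits) / sizeof(gfp_bits[0]); i++) {
		if (!(mask & gfp_bits[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, gfp_bits[i].name, len - strlen(buf) - 1);
		mask &= ~gfp_bits[i].bit;
	}
	return mask;
}
```

With these assumed values, gfp_decode(0x8050, ...) yields __GFP_WAIT|__GFP_IO|__GFP_ZERO with no bits left over, which corresponds to GFP_NOFS plus zeroing. The assumed __GFP_THISNODE bit is not set in the mask.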

      This can happen if there are two NUMA nodes, but only one of them has memory:

      [oss01 tmp]# numactl -H
      available: 2 nodes (0-1)
      node 0 cpus: 0 1 2 3 4 5 6 7
      node 0 size: 32671 MB
      node 0 free: 30820 MB
      node 1 cpus: 8 9 10 11 12 13 14 15
      node 1 size: 0 MB
      node 1 free: 0 MB
      node distances:
      node   0   1
        0:  10  20
        1:  20  10
      

      This can also happen during normal operation if allocations are significantly imbalanced between the nodes (e.g. LU-5050).
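The failure mode described above, and one way to tolerate it, can be modelled in userspace. This is only an illustrative sketch and is not the change made in the landed patch. The sim_* names, the two-node table and the byte budgets are all hypothetical stand-ins for real NUMA nodes. A strictly node-bound allocation fails when its node is empty, while a "prefer this node, then try the others" allocation still succeeds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SIM_NR_NODES 2

/* Hypothetical per-node budget in bytes, standing in for node memory. */
struct sim_node {
	size_t free_bytes;
};

static struct sim_node sim_nodes[SIM_NR_NODES];

/* Allocate strictly from one simulated node; NULL if it has no room. */
static void *sim_alloc_on_node(int node, size_t size)
{
	void *p;

	assert(node >= 0 && node < SIM_NR_NODES);
	if (sim_nodes[node].free_bytes < size)
		return NULL;
	p = calloc(1, size);
	if (p != NULL)
		sim_nodes[node].free_bytes -= size;
	return p;
}

/* Prefer the given node, then fall back to every other node in turn.
 * *used reports the node that served the request, or -1 on failure. */
static void *sim_alloc_prefer_node(int preferred, size_t size, int *used)
{
	void *p = sim_alloc_on_node(preferred, size);
	int node;

	if (p != NULL) {
		*used = preferred;
		return p;
	}
	for (node = 0; node < SIM_NR_NODES; node++) {
		if (node == preferred)
			continue;
		p = sim_alloc_on_node(node, size);
		if (p != NULL) {
			*used = node;
			return p;
		}
	}
	*used = -1;
	return NULL;
}
```

Setting sim_nodes[1].free_bytes to 0 mirrors the memoryless node 1 in the numactl output. A 192-byte request bound to node 1 then fails in sim_alloc_on_node() but is served from node 0 by sim_alloc_prefer_node().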

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32848/
            Subject: LU-11163 libcfs: fix CPT NUMA memory failures
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1ca7f6329833d551f69fd8aec29b66845bedb0c9

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32848
            Subject: LU-11163 libcfs: fix CPT NUMA memory failures
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6f974a9772990e2532fa15b5f7cd60e836336550

            People

              Assignee: adilger Andreas Dilger
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 3
