Lustre / LU-7553

Lustre cpu_npartitions default value breaks memory allocation on clients


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical

    Description

  I brought this up in LU-5050, but failed to get traction. This is a specific example of how Lustre's default cpu_partition_table and the related memory-node code are broken out of the box.

  We have Power7 nodes that appear to have 48 CPUs under Linux (12 physical cores, 4-way SMT). There is only a single memory zone on this machine:

      Node 0, zone      DMA   4840   3290   3289   1676    749    325    114    105     69     10      4      1   3664 
      

      For no good reason at all, Lustre decides to lay out the cpu_partition_table like this:

      0       : 0 1 2 3 4 5 
      1       : 6 7 8 9 10 11 
      2       : 12 13 14 15 16 17 
      3       : 18 19 20 21 22 23 
      4       : 24 25 26 27 28 29 
      5       : 30 31 32 33 34 35 
      6       : 36 37 38 39 40 41 
      7       : 42 43 44 45 46 47 
      
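The table above is nothing more than a sequential split of the 48 logical CPU IDs into 8 blocks of 6. A few lines (an illustrative sketch, not Lustre code) reproduce it exactly, which makes it clear the layout encodes nothing about the real topology of 12 physical cores with 4 SMT threads each:

```python
# Reproduce the 8-partition table Lustre printed: a plain sequential
# split of 48 logical CPU IDs into 8 blocks of 6.
ncpus, nparts = 48, 8
size = ncpus // nparts  # 6 logical CPUs per partition
table = {p: list(range(p * size, (p + 1) * size)) for p in range(nparts)}
for p, cpus in table.items():
    print(p, ":", *cpus)

# On this Power7 node the 48 logical CPUs are really 12 cores x 4 SMT
# threads, so a topology-aware grouping would be by core (groups of 4
# sibling threads), not arbitrary blocks of 6.
```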

      This table has no basis in reality. Not only that, the code seems to assume two memory zones, again for no clear reason that I can see. The memory zone selection doesn't seem to be visible anywhere, so I needed to add debugging code to figure out what was going on. Take a look at this:

      00000100:00100000:24.0:1450144617.022705:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[0] started 0 min 2 max 2
      00000100:00100000:24.0:1450144617.022707:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb00_000'
      00000400:00100000:29.0:1450144617.022761:1296:4718:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=63 nodemask=1
      00000100:00100000:24.0:1450144617.022809:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[0] started 1 min 2 max 2
      00000100:00100000:24.0:1450144617.022811:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb00_001'
      00000400:00100000:33.0F:1450144617.022906:1296:4720:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=63 nodemask=1
      00000100:00100000:24.0:1450144617.022930:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[1] started 0 min 2 max 2
      00000100:00100000:24.0:1450144617.022932:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb01_000'
      00000400:00100000:29.0:1450144617.022973:1296:4721:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4032 nodemask=1
      00000100:00100000:24.0:1450144617.023029:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[1] started 1 min 2 max 2
      00000100:00100000:24.0:1450144617.023031:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb01_001'
      00000400:00100000:29.0:1450144617.023071:1296:4722:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4032 nodemask=1
      00000100:00100000:24.0:1450144617.023087:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[2] started 0 min 2 max 2
      00000100:00100000:24.0:1450144617.023089:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb02_000'
      00000400:00100000:29.0:1450144617.023127:1296:4723:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=258048 nodemask=1
      00000100:00100000:24.0:1450144617.023165:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[2] started 1 min 2 max 2
      00000100:00100000:24.0:1450144617.023167:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb02_001'
      00000400:00100000:29.0:1450144617.023203:1296:4724:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=258048 nodemask=1
      00000100:00100000:24.0:1450144617.023218:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[3] started 0 min 2 max 2
      00000100:00100000:24.0:1450144617.023219:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb03_000'
      00000400:00100000:29.0:1450144617.023257:1296:4725:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=16515072 nodemask=1
      00000100:00100000:24.0:1450144617.023296:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[3] started 1 min 2 max 2
      00000100:00100000:24.0:1450144617.023299:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb03_001'
      00000400:00100000:29.0:1450144617.023335:1296:4726:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=16515072 nodemask=1
      00000100:00100000:24.0:1450144617.023351:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[4] started 0 min 2 max 2
      00000100:00100000:24.0:1450144617.023353:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb04_000'
      00000400:00100000:29.0:1450144617.023388:1296:4727:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=1056964608 nodemask=2
      00000100:00100000:24.0:1450144617.023416:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[4] started 1 min 2 max 2
      00000100:00100000:24.0:1450144617.023418:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb04_001'
      00000400:00100000:29.0:1450144617.023453:1296:4728:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=1056964608 nodemask=2
      00000100:00100000:24.0:1450144617.023464:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[5] started 0 min 2 max 2
      00000100:00100000:24.0:1450144617.023466:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb05_000'
      00000400:00100000:29.0:1450144617.023503:1296:4729:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=67645734912 nodemask=2
      00000100:00100000:24.0:1450144617.023537:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[5] started 1 min 2 max 2
      00000100:00100000:24.0:1450144617.023540:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb05_001'
      00000400:00100000:29.0:1450144617.023576:1296:4730:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=67645734912 nodemask=2
      00000100:00100000:24.0:1450144617.023594:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[6] started 0 min 2 max 2
      00000100:00100000:24.0:1450144617.023596:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb06_000'
      00000400:00100000:29.0:1450144617.023635:1296:4731:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4329327034368 nodemask=2
      00000100:00100000:24.0:1450144617.023670:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[6] started 1 min 2 max 2
      00000100:00100000:24.0:1450144617.023673:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb06_001'
      00000400:00100000:29.0:1450144617.023709:1296:4732:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4329327034368 nodemask=2
      00000100:00100000:24.0:1450144617.023724:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[7] started 0 min 2 max 2
      00000100:00100000:24.0:1450144617.023726:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb07_000'
      00000400:00100000:29.0:1450144617.023766:1296:4733:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=277076930199552 nodemask=2
      00000100:00100000:24.0:1450144617.023806:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[7] started 1 min 2 max 2
      00000100:00100000:24.0:1450144617.023808:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb07_001'
      00000400:00100000:29.0:1450144617.023843:1296:4734:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=277076930199552 nodemask=2
      
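For reference, the cpumask/nodemask values in the trace are plain decimal bitmasks. A small decoder (an illustrative helper, not Lustre code) shows that each cpumask matches one row of the partition table, and that the CPTs logging nodemask=2 are being bound to NUMA node 1, which does not exist on this single-node machine:

```python
def decode(mask):
    """Return the set bit positions (CPU or node IDs) of a bitmask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

print(decode(63))               # CPT 0: CPUs 0-5
print(decode(4032))             # CPT 1: CPUs 6-11
print(decode(277076930199552))  # CPT 7: CPUs 42-47
print(decode(1))                # nodemask=1 -> node 0, which exists
print(decode(2))                # nodemask=2 -> node 1, which does not
```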

  kmalloc()s are failing on the threads that have nodemask=2. The failed memory allocations are not visible in the trace above only because I commented out the call to set_mems_allowed() in cfs_cpt_bind().

  So now we know that the default cpu_partition_table layout code is broken in:

      • Robin Humble's example in LU-5050
      • The broken behavior shown above on a Power7 node
      • My report in LU-5050 that the default layout algorithm did not match the actual hardware on any LLNL system at the time

  I think we now have overwhelming evidence that we should set cpu_npartitions to 1 by default in Lustre until such time as the cpu_partition_table code can actually make sane decisions on its own.

  Lustre must have sane defaults. A default that makes things fast only on the tiny subset of systems where the table happens to match the hardware does not justify turning this on by default. That small, unlikely benefit does not outweigh the many ways in which the current default outright breaks things.

  cpu_npartitions=1 would totally work for everyone by default.
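In the meantime, this can be set per node via the libcfs module parameter. A sketch of the module option (the file path is illustrative; any modprobe.d config file works):

```
# /etc/modprobe.d/lustre.conf (example path)
# Collapse Lustre to a single CPU partition so thread and memory
# binding cannot select a nonexistent NUMA node.
options libcfs cpu_npartitions=1
```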

  Let's please restore a sane default, already!

      People

        Assignee: WC Triage (wc-triage)
        Reporter: Christopher Morrone (morrone) (Inactive)
        Votes: 0
        Watchers: 2