[LU-7553] Lustre cpu_npartitions default value breaks memory allocation on clients Created: 15/Dec/15 Updated: 21/Jul/18 Resolved: 15/Dec/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I brought this up in an earlier ticket. We have Power7 nodes that appear to have 48 CPUs under Linux (12 physical cores, 4-way SMT). There is only a single memory zone on this machine:

Node 0, zone DMA 4840 3290 3289 1676 749 325 114 105 69 10 4 1 3664

For no good reason at all, Lustre decides to lay out the cpu_partition_table like this:

0 : 0 1 2 3 4 5
1 : 6 7 8 9 10 11
2 : 12 13 14 15 16 17
3 : 18 19 20 21 22 23
4 : 24 25 26 27 28 29
5 : 30 31 32 33 34 35
6 : 36 37 38 39 40 41
7 : 42 43 44 45 46 47

This table has no basis in reality. Not only that, the code seems to assume two memory zones, again for no clear reason that I can see. The memory zone selection doesn't seem to be visible anywhere, so I needed to add debugging code to figure out what was going on. Take a look at this:

00000100:00100000:24.0:1450144617.022705:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[0] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.022707:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb00_000'
00000400:00100000:29.0:1450144617.022761:1296:4718:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=63 nodemask=1
00000100:00100000:24.0:1450144617.022809:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[0] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.022811:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb00_001'
00000400:00100000:33.0F:1450144617.022906:1296:4720:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=63 nodemask=1
00000100:00100000:24.0:1450144617.022930:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[1] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.022932:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb01_000'
00000400:00100000:29.0:1450144617.022973:1296:4721:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4032 nodemask=1
00000100:00100000:24.0:1450144617.023029:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[1] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023031:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb01_001'
00000400:00100000:29.0:1450144617.023071:1296:4722:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4032 nodemask=1
00000100:00100000:24.0:1450144617.023087:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[2] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023089:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb02_000'
00000400:00100000:29.0:1450144617.023127:1296:4723:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=258048 nodemask=1
00000100:00100000:24.0:1450144617.023165:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[2] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023167:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb02_001'
00000400:00100000:29.0:1450144617.023203:1296:4724:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=258048 nodemask=1
00000100:00100000:24.0:1450144617.023218:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[3] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023219:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb03_000'
00000400:00100000:29.0:1450144617.023257:1296:4725:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=16515072 nodemask=1
00000100:00100000:24.0:1450144617.023296:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[3] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023299:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb03_001'
00000400:00100000:29.0:1450144617.023335:1296:4726:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=16515072 nodemask=1
00000100:00100000:24.0:1450144617.023351:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[4] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023353:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb04_000'
00000400:00100000:29.0:1450144617.023388:1296:4727:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=1056964608 nodemask=2
00000100:00100000:24.0:1450144617.023416:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[4] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023418:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb04_001'
00000400:00100000:29.0:1450144617.023453:1296:4728:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=1056964608 nodemask=2
00000100:00100000:24.0:1450144617.023464:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[5] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023466:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb05_000'
00000400:00100000:29.0:1450144617.023503:1296:4729:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=67645734912 nodemask=2
00000100:00100000:24.0:1450144617.023537:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[5] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023540:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb05_001'
00000400:00100000:29.0:1450144617.023576:1296:4730:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=67645734912 nodemask=2
00000100:00100000:24.0:1450144617.023594:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[6] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023596:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb06_000'
00000400:00100000:29.0:1450144617.023635:1296:4731:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4329327034368 nodemask=2
00000100:00100000:24.0:1450144617.023670:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[6] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023673:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb06_001'
00000400:00100000:29.0:1450144617.023709:1296:4732:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4329327034368 nodemask=2
00000100:00100000:24.0:1450144617.023724:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[7] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023726:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb07_000'
00000400:00100000:29.0:1450144617.023766:1296:4733:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=277076930199552 nodemask=2
00000100:00100000:24.0:1450144617.023806:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[7] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023808:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread 'ldlm_cb07_001'
00000400:00100000:29.0:1450144617.023843:1296:4734:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=277076930199552 nodemask=2

kmalloc()s are failing on the threads that have nodemask=2. You can't see the failed memory allocations in the above trace only because I commented out the call to set_mems_allowed() in cfs_cpt_bind(). So now we know that the default cpu_partition_table layout code is broken in at least two ways: it invents a CPU topology with no basis in the actual hardware, and it binds half of the service threads to a second memory node that buddyinfo says does not exist.
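To make the trace readable: cpumask and nodemask are printed as decimal bitmasks, so nodemask=2 (binary 10) means memory node 1, the node /proc/buddyinfo does not show. A quick sketch of the decoding, using mask values taken from the trace above:

# Convert the decimal masks to hex so the bit ranges are obvious.
printf '0x%x\n' 63                 # 0x3f           -> CPUs 0-5   (CPT 0)
printf '0x%x\n' 1056964608         # 0x3f000000     -> CPUs 24-29 (CPT 4)
printf '0x%x\n' 277076930199552    # 0xfc0000000000 -> CPUs 42-47 (CPT 7)
printf '0x%x\n' 2                  # 0x2            -> memory node 1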
I think we now have overwhelming evidence that we should set cpu_npartitions to 1 by default in Lustre until such time that the cpu_partition_table code can actually make sane decisions on its own. Lustre must have sane defaults. A default that makes things fast only on the tiny subset of systems where the generated table happens to match the hardware does not justify turning this on by default. That small, unlikely benefit does not outweigh the many ways in which the current default outright breaks things. cpu_npartitions=1 would work for everyone by default. Let's please restore a sane default, already! |
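Until the default changes, a site can force a single partition itself. A minimal sketch of that workaround, assuming only the libcfs module parameter named in this ticket's title (the modprobe.d file name is illustrative):

# /etc/modprobe.d/lustre.conf -- example file name; any modprobe.d fragment
# works. Must be in place before the libcfs module loads, so already-loaded
# clients need their Lustre modules unloaded and reloaded.
options libcfs cpu_npartitions=1

With a single partition there is one CPT spanning all CPUs, so cfs_cpt_bind() has no fabricated second memory node to bind service threads to.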
| Comments |
| Comment by Andreas Dilger [ 15/Dec/15 ] |
|
Closing this as a duplicate of |
| Comment by Andreas Dilger [ 17/Dec/15 ] |
|
Chris, I was looking at this ticket again to see how we can fix the memory allocation binding, but am confused about something. If there is only a single memory zone on this system, the set_mems_allowed() call shouldn't make any difference, because all of the allocations would be coming from the same zone no matter which CPU they are made on? |
| Comment by Christopher Morrone [ 17/Dec/15 ] |
|
The set_mems_allowed() call is all tied in with the CPU partition table code, so setting cpu_npartitions=1 and disabling all of that has brought production operations back online. Yes, CPU binding and memory node binding are not necessarily related, but the code has them fairly tangled together. Since the code can't figure out what sockets, cores, and SMT threads really exist and map them correctly, I would not be terribly surprised if Lustre is messing up the binding to memory nodes and assigning half of the processes to a node that doesn't really exist. Granted, there is some speculation there, so take it with a grain of salt. But I do know this much: /proc/buddyinfo shows one memory node. When Lustre binds processes to the second (presumably non-existent) node, those processes go on to fail very simple, small kmalloc() calls, despite there being nearly 60GB of free memory, and buddyinfo verifies that there are plenty of order-0 blocks free (and plenty in all of the other orders as well). The processes that were bound to the first memory node (presumably the real one) did not exhibit memory allocation problems. Like I said, you can't see the failure in the Lustre log snippet that I provided because I had already commented out set_mems_allowed(). But for all of the earlier runs where set_mems_allowed() was active, the first process that used nodemask=2 always hit a kmalloc() failure, and Lustre completely aborted the setup at that point. |
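For anyone reproducing this comparison, a quick way to line up the kernel's view of the memory nodes against Lustre's generated table (the lnet proc path below is the Lustre 2.x-era location and is an assumption; it may differ on other versions):

# Memory nodes the kernel actually has: one "Node N" prefix per node.
cat /proc/buddyinfo

# Cross-check the NUMA layout as the rest of the system sees it.
numactl --hardware

# Lustre's generated CPU partition table, to compare against the above.
cat /proc/sys/lnet/cpu_partition_table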