[LU-3992] Fix NUMA emulated mode Created: 23/Sep/13  Updated: 29/May/14  Resolved: 18/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Minor
Reporter: Andriy Skulysh Assignee: Liang Zhen (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 10675

 Description   

Kernel commit c1c3443c9c5e9be92641029ed229a41563e44506
assigns all allowed CPUs to each emulated node.



 Comments   
Comment by Andriy Skulysh [ 23/Sep/13 ]

patch: http://review.whamcloud.com/7724

Comment by Andriy Skulysh [ 23/Sep/13 ]

Without the fix, insmod libcfs.ko fails when numa=fake=16 is set on the kernel boot command line:

    LNetError: 4055:0:(linux-cpu.c:881:cfs_cpt_table_create()) Failed to setup CPU-partition-table with 4 CPU-partitions, online HW nodes: 16, HW cpus: 32.
    LNetError: 4055:0:(linux-cpu.c:1093:cfs_cpu_init()) Failed to create ptable with npartitions 0

Comment by Liang Zhen (Inactive) [ 23/Sep/13 ]

Sorry, I don't understand how this can happen. Could you give an example?

Comment by Andriy Skulysh [ 23/Sep/13 ]

Each emulated node has all CPUs in its cpumask (cpumask_of_node()),
so every CPU appears in every node's mask.
We need to stop the loop once all CPUs are assigned. It doesn't matter which CPU is chosen for a node.
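
The failure mode described above can be illustrated with a toy model. This is a minimal sketch in Python, not the libcfs code: the function name build_partitions, the stop_when_full flag, and the assignment loop are hypothetical stand-ins, and the model assumes the CPU count divides evenly by the number of partitions. It only shows why masks that all contain every CPU break a loop that expects each node to contribute its own CPUs, and why stopping once every partition is full avoids that.

```python
def build_partitions(node_masks, ncpt, stop_when_full):
    """Toy model: hand out CPUs node by node into ncpt equal partitions.

    node_masks     -- list of sets, one CPU set per (possibly emulated) node
    ncpt           -- number of CPU partitions to fill
    stop_when_full -- hypothetical switch modelling the fix: stop as soon
                      as every partition has its share of CPUs
    """
    all_cpus = set().union(*node_masks)
    per_part = len(all_cpus) // ncpt      # assumes an even split
    parts = [[] for _ in range(ncpt)]
    cpt = 0                               # partition currently being filled
    for mask in node_masks:
        pending = sorted(mask)
        while pending:
            if cpt == ncpt:
                if stop_when_full:
                    return parts          # all CPUs are assigned, done
                raise RuntimeError("node still has CPUs but all partitions are full")
            parts[cpt].append(pending.pop(0))
            if len(parts[cpt]) == per_part:
                cpt += 1                  # cpt only ever increases
    if cpt != ncpt:
        raise RuntimeError("not all partitions were filled")
    return parts


# Emulated NUMA as described in this ticket: 16 nodes, each mask holds all 32 CPUs.
fake_numa = [set(range(32)) for _ in range(16)]
# Ordinary NUMA for comparison: two nodes with disjoint CPU sets.
real_numa = [set(range(0, 16)), set(range(16, 32))]

print([len(p) for p in build_partitions(fake_numa, 4, stop_when_full=True)])
print([len(p) for p in build_partitions(real_numa, 4, stop_when_full=False)])
```

In this model the first emulated node already fills all four partitions, so without the early stop the second node arrives with CPUs still in its mask and nowhere to put them. With disjoint masks the two variants behave identically.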

Comment by Liang Zhen (Inactive) [ 23/Sep/13 ]

I think the module parameter cpu_pattern can work around this. Would it be OK to use that parameter instead of adding a patch? I'd like users to see these errors when a situation like this happens.
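
For reference, a workaround along these lines might be generated as follows. This is a hedged sketch: the helper build_cpu_pattern is hypothetical, the even split of 32 CPUs into 4 partitions is an assumption taken from the error log in this ticket, and the "partition[cpu,cpu,...]" syntax follows the form documented for the libcfs cpu_pattern parameter, so it should be checked against the manual for the Lustre version in use.

```python
def build_cpu_pattern(ncpus, ncpt):
    """Build a cpu_pattern string spreading CPUs 0..ncpus-1 evenly over ncpt partitions.

    Hypothetical helper for illustration; it simply assigns consecutive
    CPU numbers to each partition.
    """
    if ncpt <= 0 or ncpus % ncpt != 0:
        raise ValueError("ncpus must be a positive multiple of ncpt")
    per_part = ncpus // ncpt
    groups = []
    for cpt in range(ncpt):
        cpus = range(cpt * per_part, (cpt + 1) * per_part)
        groups.append("%d[%s]" % (cpt, ",".join(str(c) for c in cpus)))
    return " ".join(groups)


# Emit a modprobe-style options line for 32 CPUs in 4 partitions.
print('options libcfs cpu_pattern="%s"' % build_cpu_pattern(32, 4))
```

The pattern ignores the NUMA layout entirely, which is acceptable here only because the emulated nodes carry no real topology.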

Comment by Andriy Skulysh [ 23/Sep/13 ]

ncpt is always > 0 and we only ever do cpt++, so the "if" can only be "==".
What errors could occur here? Only ones caused by a strange cfs_node_to_cpumask(), as in the emulated NUMA case.
The resulting topology can easily be examined via /proc/sys/lnet/cpu_partition_table.
Is a new parameter really needed? We would force the user to create a dummy cpu_pattern.

Comment by Wally Wang (Inactive) [ 25/Sep/13 ]

We ran into this problem, and with the patch it works fine for us. I think users and admins would at least prefer a fix over using cpu_pattern as a workaround. Is there a problem with the fix?

Comment by Wally Wang (Inactive) [ 27/Sep/13 ]

Is there any concern or 'side effect' with the fix? We'd like to adopt the fix rather than the cpu_pattern workaround.

Comment by Liang Zhen (Inactive) [ 29/Sep/13 ]

OK, I think it should be fine

Comment by Peter Jones [ 18/Nov/13 ]

Landed for 2.6
