[LU-5050] cpu partitioning oddities Created: 12/May/14  Updated: 21/Jul/18  Resolved: 22/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Christopher Morrone Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl

Issue Links:
Blocker
Related
is related to LU-8610 Better handle of HW dependency of DLC... Open
is related to LU-7553 Lustre cpu_npartitions default value ... Resolved
is related to LU-8395 Limit lnet-selftest Threads Resolved
is related to LU-11163 CPT-bound allocations can fail if NUM... Resolved
Severity: 3
Rank (Obsolete): 13953

 Description   

In tracking down the root cause of LU-5043, I went through the cpu partitioning code in Lustre and found some of it rather difficult to follow. A number of not-necessarily-safe assumptions seem to be made, with no code comments explaining what was assumed and why. I think we have some technical debt to deal with here.



 Comments   
Comment by Christopher Morrone [ 12/May/14 ]

Let's start with the following, which was obtained on a PPC64 BG/Q ION (I/O Node) running roughly Lustre 2.4.0-28chaos (see github.com/chaos/lustre).

$ cat cpu_partition_table 
0       : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
1       : 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 
2       : 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
3       : 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 

Why was this partitioning decided upon?

It looks like we wind up with 16 ptlrpcd threads (of the type named ptlrpcd_<number>). How are those 16 threads, using the default ptlrpcd_bind_policy, assigned to those Lustre cpu partitions?

Comment by Christopher Morrone [ 12/May/14 ]

The next question is about this code in ptlrpcd.c:ptlrpcd_select_pc():

        case PDL_POLICY_SAME:
                idx = cfs_smp_processor_id() % ptlrpcds->pd_nthreads;
                break;

The code comment in lustre_net.h for the enum pdl_policy_t says:

typedef enum {
        /* on the same CPU core as the caller */
        PDL_POLICY_SAME         = 1,

This is pretty clearly not the case. The code running on a particular CPU will always map to the same ptlrpcd index, not to the same CPU. That ptlrpcd can be, and under many possible settings probably is, running on a different CPU.
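
As a rough illustration of that mismatch (hypothetical numbers, not taken from the code): with 16 ptlrpcd threads the selection reduces to a simple modulo, so a caller's CPU picks a ptlrpcd index rather than a ptlrpcd running on the caller's CPU.

/* Hypothetical sketch of the mapping above, not actual Lustre code:
 * with pd_nthreads == 16, any caller CPU selects ptlrpcd (cpu % 16),
 * and that ptlrpcd may be bound to (or running on) a different CPU. */
static int select_ptlrpcd_index(int caller_cpu, int pd_nthreads)
{
	return caller_cpu % pd_nthreads;	/* e.g. CPU 40 -> ptlrpcd_8 */
}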

Comment by Liang Zhen (Inactive) [ 13/May/14 ]

Hi Chris, the client-side ptlrpcd is not using CPU partitions at all; that is something we need to improve. One purpose of CPU partitions is that we can limit Lustre (server or client) to a subset of CPUs, but we never got the resources for the client-side work.

The ideas behind CPU partitions are:

  • generate a reasonable number of partitions, where each partition has its own Lustre/LNet thread pool and each thread pool has its own locks and data; this way we can eliminate lock contention and global data contention.
    • a partition with too many cores may have locking performance issues
    • a partition with too few cores may have bad load balancing.
    • it is difficult to find a common rule that generates a proper CPU partition layout for all machines, e.g. for a chip with 48 cores we will create a few partitions for that single socket, while for a server with 8 sockets of 4 cores each, a CPU partition may cover multiple sockets.
  • the current rule is: we choose a value "N" as the initial number, where N is the lowest natural number satisfying "number_of_cores <= 2 * N^2" (see the sketch after this list); we then adjust N so that each partition has the same number of cores if possible (when we say "core", it is actually a hyperthread if hyperthreading is enabled), and we assign cores to partitions with these preferences:
    • hyperthreads of the same core in the same partition
    • cores of the same socket in the same partition
    • sockets of the same NUMA node in the same partition (if a partition covers multiple sockets)
  • these rules are flawed in some cases (the user may have to configure partitions via "libcfs cpu_pattern=..."; the Lustre manual has some examples)
    • if the total number of cores is odd, we only get one partition and everything falls through to running as in the old days (a single partition == no affinity + global locks)
    • if the total number of CPU sockets is not a power of 2, the layout can be strange, e.g. some partitions may cover only cores of a single socket while others cover cores from multiple sockets.
  • in the long term, if asymmetric CPU partitions (partitions with different numbers of cores) become a common use case, we may have to give a weight to each partition and use an algorithm like "consistent hashing" to balance the workload.
  • any suggestions for improving these rules would be great.
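
As a concrete reading of that "number_of_cores <= 2 * N^2" rule, here is a minimal sketch (my own illustration of the description above, not the libcfs code; the adjustment and core-assignment steps are omitted):

/* Sketch only: smallest N with ncores <= 2 * N^2, per the rule above.
 * e.g. 48 cores -> N = 5, since 2*5^2 = 50 >= 48 but 2*4^2 = 32 < 48. */
static int estimate_npartitions(int ncores)
{
	int n = 1;

	while (ncores > 2 * n * n)
		n++;
	return n;
}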

For PDL_POLICY_SAME, I agree the comment is wrong, and I will look into how we can improve ptlrpcd.

Comment by Christopher Morrone [ 13/May/14 ]

Are you telling me that Lustre uses at least two different unrelated cpu binding mechanisms? So the whole LIOD_BIND stuff is not related to the cpu_partition_table stuff?

Comment by Liang Zhen (Inactive) [ 15/May/14 ]

Unfortunately, yes: only the server side uses the libcfs CPU partitions to set affinity; the client side still uses native kernel APIs to set NUMA affinity for ptlrpcd. I need to work with my manager to get a plan for this, because it is not something that can be done in a couple of days.

Comment by Christopher Morrone [ 22/May/14 ]

AH! The cpu partition stuff is what broke /proc/sys/lnet/nis too!!! Geez, I was wondering why I started seeing four copies of everything in there.

Look at this:

# zwicky-lcz-oss2 /root > cat /proc/sys/lnet/nis
nid                      status alive refs peer  rtr   max    tx   min
0@lo                         up     0    2    0    0     0     0     0
0@lo                         up     0    0    0    0     0     0     0
0@lo                         up     0    0    0    0     0     0     0
0@lo                         up     0    0    0    0     0     0     0
10.1.1.2@o2ib9               up    -1    2    8    0   256   256   248
10.1.1.2@o2ib9               up    -1    3    8    0   256   256   247
10.1.1.2@o2ib9               up    -1    5    8    0   256   256   250
10.1.1.2@o2ib9               up    -1    1    8    0   256   256   255

That is not acceptable output. I'm rather appalled that this code passed review like this.

The nis file is supposed to list the network interfaces, but now we've got four copies of each interface. Somehow the concept of a network interface seems to be overloaded with the concept of network queues or something.

And code like this is not written in an understandable and maintainable manner:

			cfs_percpt_for_each(tq, i, ni->ni_tx_queues) {
				for (j = 0; ni->ni_cpts != NULL &&
				     j < ni->ni_ncpts; j++) {
					if (i == ni->ni_cpts[j])
						break;
				}

				if (j == ni->ni_ncpts)
					continue;

Comment by Christopher Morrone [ 22/May/14 ]

More and more, I am beginning to think that cpu partitions should be disabled by default in Lustre. I really don't understand how the default partitions are decided. So far I can't find a system at LLNL where the default partitioning makes sense. Without digging into the code, it looks from all of the proc files like Lustre just assumes 4 partitions. That is not a good plan.

I am very concerned by the decision to split the network interface credits between these new separate queues as well. We chose the tx credits carefully to avoid issues like all of the credits being exhausted when one client cluster goes down. Now that assumption may no longer be correct, and we'll need to multiply our credits by 4.

It seems to me that cpu partitioning only makes sense when you actually make the partitions match your hardware, which would require manual sysadmin attention. If that is the case, then cpu partitioning should not be enabled by default.

Comment by Liang Zhen (Inactive) [ 05/Jun/14 ]

Chris, I actually saw some presentations from users about performance improvements from CPU partitions (both for metadata and IO performance) in the last two LUGs, so I think it can bring us some benefits.
I understand your concerns, here are some explanations:

  • as long as we put cores that share an L3 cache or local (NUMA) memory in the same partition, I tend to think it is reasonable; we don't need or want the partition layout to strictly match sockets or the CPUs assigned to NUMA nodes. Modern CPUs can have many cores, and hundreds of threads running on all cores of a single socket can make performance bad enough; multiple partitions avoid many locking issues, reduce cache thrashing, etc.
  • I agree it's safer to increase global credits like NI credits and router buffer pool credits (peer credits can stay the same); we have actually put this in the manual (section 26.3.4), although it is indeed less convenient than before. I'm also thinking we could just bump the default values of those credits, because current LNet only consumes a small amount of memory. The only concern with larger default credits is configurations that use many networks (hundreds, like Fujitsu).
  • unfortunately, yes, the proc files for NIs and the router buffer pool are different now; because different partitions can have different real-time status, this is the only way I could find to provide useful information (the number of lines should match the number of partitions).

Comment by Liang Zhen (Inactive) [ 15/Aug/14 ]

Chris, credits are actually another thing that could benefit from CPU partitions, because we can bind an LNet NI to a CPU partition so that each partition has private NI and router-buffer credits, for example:

networks="o2ib0(ib0)[0,1], o2ib1(ib1)[2,3]"
// lustre manual 26.3.2 has a little more details.

so all messages for o2ib0 will only be handled by CPUs/threads in partitions [0,1], and all messages for o2ib1 will only be handled by CPUs/threads in partitions [2,3]. This means that if routers or servers can bind different client networks to different partitions, then those networks will have their own credits, which could be a better solution than just increasing the overall credits.
Anyway, I'm not sure whether it really matches your case; just FYI.

Comment by Alexander Zarochentsev [ 15/Jul/15 ]

Also, the Network Request Scheduler doesn't work across CPU partitions, especially the CRRN policy.

Comment by Robin Humble [ 17/Jul/15 ]

I turn off all libcfs affinity on our servers with options libcfs cpu_npartitions=1. This is because the default partitioning halves (on a 2-NUMA-zone server) the available RAM for OSS and MDS inode caching. The OSS case isn't a problem for most people, but the MDS case seems quite serious to me.

This is with Lustre 2.5.2.

The underlying asymmetry comes from the Lustre threads on the same socket as the preferred socket of the IB card being used far more frequently than the Lustre threads on the other socket. This is presumably "by design". However, the memory affinity of the Lustre threads is set so that memory must be allocated from the local zone, rather than just leaving it up to the kernel to prefer the local zone.
e.g.
/proc/<thread>/status -> Mems_allowed_list: 1
so slab cannot be allocated from zone 0 in this case.

When the zone fills up with slab entries (which commonly happens on the MDS with large filesystems), reclaim kicks in and inodes are kicked out of RAM, when in reality the other memory zone is nearly unused. So MDS inode caching is close to halved (or worse if there are >2 NUMA zones).

This situation is perhaps exacerbated on our servers because we need to set CPU affinity on our IB drivers (qib), as they hang if they are migrated (under load) away from the socket where their IRQs physically arrive. However, even without CPU affinity set on the IB driver, it strongly prefers its "home" socket, so the Lustre threads on that socket also get far more use and the memory used in each zone is still skewed. I presume other types of IB and Ethernet drivers are similar in preferring their "physical" socket over the others, but I don't know.

The default libcfs affinity policy (on OSS/MDS) seems to make sense only if there are equal numbers of equally used network cards in each NUMA zone, and is incorrect otherwise. A better choice is cpu_npartitions=1, as all RAM can then be used, and RAM speed is always >> disk speed.

Alternatively, whichever Lustre threads allocate slab could be allowed to use all memory zones, i.e. CPU affinity but not memory affinity. I find it hard to believe that memory bandwidth or latency on the server will ever be a limiting factor when compared to network speeds and latencies. At the moment it's roughly 10x, isn't it?
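
For what it's worth, here is a rough sketch of that "CPU affinity but not memory affinity" idea, using standard kernel APIs (illustrative only; the function name and context are hypothetical, not existing Lustre code):

#include <linux/sched.h>
#include <linux/cpumask.h>

/* Pin a service thread to the CPUs of its partition, but leave memory
 * placement alone, so allocations prefer the local node yet can fall
 * back to other NUMA zones instead of triggering reclaim. */
static int bind_cpus_only(struct task_struct *task,
			  const struct cpumask *partition_cpus)
{
	return set_cpus_allowed_ptr(task, partition_cpus);
}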

In the extreme case (which we have hit), the libcfs default affinity caused spurious OOMs in the Lustre threads on our OSSes. However, I doubt many other people will see this, because a) we set vfs_cache_pressure=0 on OSSes to force caching of inodes over data to make the fs more responsive, and b) there has to be a zone's worth of inodes in use. This happened when we were moving all the files from one filesystem to a new one. I had calculated that we had enough OSS RAM to do this, but libcfs affinity meant only half of it was actually usable. Ow.

I don't know if affinity also applies to OSS data pages and might reduce available OSS write-through and read caching too. I didn't investigate that.

BTW, ldiskfs_inode_cache slab entries are annoyingly 1.02k in size, so only 3 of them fit into a 4k slab page, an amplification in RAM usage of roughly 25%. But that is presumably hard to fix, and a separate bug.

Comment by Christopher Morrone [ 15/Dec/15 ]

I don't know if it was true when Liang first said it, but it is certainly not true in 2.5.4 that the client is not impacted by cpu_npartitions. The partitioning is having an effect on binding on clients.

Comment by Christopher Morrone [ 15/Dec/15 ]

I think we now have pretty overwhelming evidence that cpu_npartitions=0 is a terrible default. We need to change it to 1 by default.

Comment by Robin Humble [ 15/Dec/15 ]

I strongly agree. The current default can cause trouble quite easily, and has performance benefits only with unusual hardware configurations.

Comment by Christopher Morrone [ 15/Dec/15 ]

See additional breakage documented in LU-7553.

Comment by Andreas Dilger [ 15/Dec/15 ]

Chris, I agree that the default selection of CPU partitions is not ideal, and I'd welcome your input on how to improve it. The hard restriction that memory allocations must come from within the CPT partition is something that should be fixed. The normal CPU affinity for allocations should be enough, or the CPT allocator should clear the mask and retry on allocation failures. Also note that the CPT affinity of ptlrpcd threads was fixed in LU-6325 and CPT affinity for socklnd threads in LU-7245.
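
As an illustration of the "clear the mask and retry" idea (a hedged sketch only, not the actual CPT allocator or any landed patch):

#include <linux/slab.h>

/* Try a node-affine allocation first; if the preferred node is out of
 * memory, retry without the node restriction instead of failing. */
static void *cpt_alloc_relaxed(size_t size, int nid)
{
	void *ptr = kzalloc_node(size, GFP_NOFS, nid);

	if (ptr == NULL)
		ptr = kzalloc(size, GFP_NOFS);
	return ptr;
}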

Setting the default CPT count to one can have a serious negative impact on client IO performance, since this reduces the number of ptlrpcd threads and that will hurt parallel client RPC checksums, which was critical to improving IO performance at your site.

While you assert that the current default is invalid, it is used today on virtually every Lustre system, since it has existed since Lustre 2.2 or 2.3, and as yet there are only two or three issues reported here, so it can't be as bad as you make out; I might equally argue that the default is reasonable and you have a few systems that need manual tuning. Changing to cpu_npartitions=1 could negatively impact all other users with more normal configurations and force them to manually change their configuration to compensate.

Have you manually set cpu_npartitions=1 on your local systems since this bug was opened, and can you verify there is no performance impact from that change? Have you done any IO performance benchmarks on your new 48-core system to see what the performance difference is?

Turning off the hard memory affinity and improving the CPT calculations seem like reasonable approaches.

Comment by Christopher Morrone [ 15/Dec/15 ]

Did you set cpu_npartitions=0 and verify that there were no performance degradations on systems where the alignment was completely wrong? The onus is on you to demonstrate that your defaults are sane. We have already demonstrated that there are very common cases in which those defaults are flat out wrong and break production usage. So don't try to turn around and make it my responsibility to get performance numbers for you. You guys didn't do your homework the first time around.

Instead, you waited 18 months and thrust production outages on your customers, all because under some very limited set of hardware configurations it is "faster". That is ridiculous. The default should be cpu_npartitions=1, and then where admins want the extra performance they can set up the cpu partition tables to actually align with reality.

Ergo, fix it already.

Comment by Christopher Morrone [ 16/Dec/15 ]

After talking to Peter Jones on the phone, I suspect that you, Andreas, might have a misunderstanding of how the cpu partition table stuff works today.

By default the CPT code knows nothing about cores, sockets, SMT/hyperthreads, or which cores are mapped to which NUMA zones. The code in cfs_cpt_num_estimate() knows nothing more than the total number of online CPUs and the total number of online nodes. It then goes on to use what amounts to a guess about how it should lay out cpu partitions and memory nodes, with no knowledge at all of the actual hardware configuration.

The not terribly surprising result of this approach is that the partition table is wrong for every system I have tried it on. That includes perfectly normal x86_64 nodes.

I can't even imagine a magic algorithm that would generate a correct partition table without actual hardware knowledge.

I would argue that the current approach of using a completely wrong table is far worse than the old behavior of not using partition tables.

And since currently only a human can provide Lustre with the information about how the hardware is organized, this cpu partition table code should be disabled by default.

Disabling the cpu affinity/partitioning by default in Lustre will work for everyone. It will even perform reasonably well for everyone. Those seeking the best performance can provide a correct cpu partition layout and enable the partitioning feature.

I can't recommend a better way to automatically create the layout. The kernel may simply not provide sufficient information to make a good selection. But in that case, the obvious design solution would be to have cpu partitioning disabled by default.

Comment by Liang Zhen (Inactive) [ 16/Dec/15 ]

CPT does understand sockets, cores and hyperthreads; for example, with socket0[core0, core1] and socket1[core2, core3], libcfs will never create partitions like CPT0[core0, core2], CPT1[core1, core3].
But the number of partitions is the hard decision to make: libcfs tends to create more CPTs to allow multiple processing pipelines in the upper layers, while at the same time the number of cores within each partition can't be too small, otherwise there will be large performance jitter. So the ideal case is that libcfs creates a small number of CPTs for each NUMA node, each CPT has its own CPU pool (and thread pools), and CPTs within the same NUMA node share local memory.

However, given that the number of cores within each socket varies more today (it was mostly 2, 4, 6, or 8 a few years ago), the selection algorithm can be problematic, but I haven't found a better solution. In the meantime, I think there are at least a few things that can be changed easily:

  • set cpu_pattern="N" by default, which will create a layout that matches the NUMA topology (see the example after this list).
  • as Andreas said, allow the user to dynamically change memory affinity (disable/enable the NUMA allocator, or even provide a ratio for using the NUMA allocator?)
  • improve the NUMA allocator of libcfs to use non-affinity allocation if the affinity allocation fails.
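
For reference, the "N" pattern can already be selected explicitly today via a module option (a minimal example; the file name is just a common convention):

# e.g. /etc/modprobe.d/libcfs.conf
options libcfs cpu_pattern="N"
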
Comment by Robin Humble [ 16/Dec/15 ]

Forced memory affinity is dangerous; it should be disabled by default.
I strongly suggest you drop memory affinity entirely.

Threads that are pinned to a core already allocate preferentially from the "right" place, but they can allocate from other NUMA nodes if required. That is exactly the "fallback" behaviour you and Andreas are proposing, but via a convoluted and complex mechanism. Just drop the memory affinity and it will happen automatically and simply.

Surely a safe option should be the #1 default. The last drop of highly tuned performance can be left up to those who care to read the documentation carefully and know what they're getting into by tuning Lustre explicitly for their hardware.

Comment by Liang Zhen (Inactive) [ 16/Dec/15 ]

Lustre has quite a few frequently accessed data structures that are allocated by one thread and then used by a different set of threads with affinity, so we need these data structures to be allocated with affinity; otherwise they could always be accessed remotely.
But I agree we should relax the affinity restriction on those dynamically allocated buffers or data structures.

Comment by Christopher Morrone [ 16/Dec/15 ]

However, given that the number of cores within each socket varies more today (it was mostly 2, 4, 6, or 8 a few years ago), the selection algorithm can be problematic

Unless Lustre can also deal, by default, with processors coming and going (hyperthreading being turned on and off, for instance), that is yet another reason why this should not be enabled by default.

Comment by Christopher Morrone [ 17/Dec/15 ]

CPT does understand sockets, cores and hyperthreads; for example, with socket0[core0, core1] and socket1[core2, core3], libcfs will never create partitions like CPT0[core0, core2], CPT1[core1, core3].

Can you point me to the code that figures those out?

It certainly has looked to me like Lustre ignores cores and hyperthreads. And any honoring of sockets is unclear, and perhaps just coincidental...perhaps specific to just the hardware where the code happened to be developed.

For instance, I have seen a Lustre cpu partition contain 6 processes: four bound to SMT "processors" on one core, and the other two bound to a core they share with two processes from a second cpu partition. It is unclear why Lustre would do that if it actually knew about sockets, cores, and SMT/hyperthreads.

But hey, prove me wrong by pointing to the code.

Comment by Liang Zhen (Inactive) [ 18/Dec/15 ]

In cfs_cpt_choose_ncpus(), it calls cfs_cpu_core_siblings() to select processors from the same socket, and cfs_cpu_ht_siblings() to select processors from the same core. However, I suspect that if libcfs made a bad decision about the number of partitions, it could separate hyperthreads of the same core into different partitions. I will check this part of the code to see if I can make some fixes/improvements.
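
For readers following along, the preference order described here can be illustrated with the standard kernel topology masks (a rough sketch under my own assumptions, not the cfs_cpt_choose_ncpus() implementation):

#include <linux/topology.h>
#include <linux/cpumask.h>

/* Pick the next CPU for a partition: prefer a hyperthread sibling of
 * the last chosen CPU, then another CPU on the same socket, then any
 * remaining candidate.  (topology_sibling_cpumask() is named
 * topology_thread_cpumask() on older kernels.) */
static int pick_next_cpu(int last_cpu, const struct cpumask *candidates)
{
	int cpu;

	cpu = cpumask_any_and(topology_sibling_cpumask(last_cpu), candidates);
	if (cpu < nr_cpu_ids)
		return cpu;

	cpu = cpumask_any_and(topology_core_cpumask(last_cpu), candidates);
	if (cpu < nr_cpu_ids)
		return cpu;

	return cpumask_any(candidates);
}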

As in the previous comments, I think one way to avoid the oddities is just to set cpu_pattern="N" as the default, which will create partitions that match the NUMA topology. But I'm also concerned that this approach may create too few partitions and processing pipelines (request queues within the context of Lustre). For example, I have a machine with two sockets and two NUMA nodes, where each socket has 18*2 SMT processors. With "N" mode, libcfs creates two partitions, and I would think it's better to have 4 partitions (two partitions for each socket) to reduce service lock contention, L1/L2 cache pollution, etc. When this work was done for OpenSFS, I benchmarked it and saw that more partitions deliver better performance at first, but too many partitions (a partition per core, for example) hurt performance as well because of noise and performance jitter. Anyway, the programming model of SMP and NUMA systems encourages having multiple independent processing pipelines and using local memory when possible. However, the trade-off of these methodologies is that we have to separate the global resources (CPU, memory) into different partitions/pools.

Also, in the example above, if libcfs decided to create 8 partitions, which means 4 partitions for each socket, then it would show a problem similar to the one you described, because libcfs would split the 18*2 hyperthreads of each socket into 4 partitions of 9 hyperthreads each, and at least one core would spread across two partitions. If that happens, then I agree it doesn't make much sense and needs to be fixed.

FYI, the work for client stack affinity has been done by SGI, please check LU-6325 for details.

Comment by Christopher Morrone [ 18/Dec/15 ]

You'll note that Olaf Weber said "On a NUMA system, making sure the CPTs match the NUMA topology is left as an exercise for the sysadmin."

"N" mode is certainly closer to what I want by default then the current default, so I would say go go ahead and do that.

Comment by Gerrit Updater [ 05/Jan/16 ]

Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/17824
Subject: LU-5050 libcfs: default CPT matches NUMA topology
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7aaccbf3732c804cf6077bfc9044b38c9456fe9a

Comment by Dmitry Eremin (Inactive) [ 02/Feb/16 ]

I'm concerned about KNL support. I have an issue on it:

# modprobe -v lustre
insmod /lib/modules/3.10.0-327.3.1.el7.x86_64/extra/kernel/net/lustre/libcfs.ko cpu_pattern="N" 
modprobe: ERROR: could not insert 'lustre': Operation not permitted
# dmesg -c
[340771.351434] LNetError: 13576:0:(linux-cpu.c:899:cfs_cpt_table_create_pattern()) Invalid pattern , or too many partitions 0
[340771.351678] LNetError: 13576:0:(linux-cpu.c:1078:cfs_cpu_init()) Failed to create cptab from pattern N
# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 
cpubind: 0 
nodebind: 0 
membind: 0 1

Comment by Dmitry Eremin (Inactive) [ 02/Feb/16 ]

There is a similar issue on Xeon Phi as well.

Comment by James A Simmons [ 02/Feb/16 ]

Can you post your boot dmesg? I'd like to see what the kernel sets up as the default layout for cores and NUMA nodes.

Comment by Olaf Weber [ 02/Feb/16 ]

    868 static struct cfs_cpt_table *
    869 cfs_cpt_table_create_pattern(char *pattern)
    870 {
    ...
    881         str = cfs_trimwhite(pattern);
    882         if (*str == 'n' || *str == 'N') {
    ...
    887                 } else { /* shortcut to create CPT from NUMA & CPU topology */
    888                         node = -1;
    889                         ncpt = num_online_nodes();
    890                 }
    891         }
    ...
    901         if (ncpt == 0 ||
    902             (node && ncpt > num_online_nodes()) ||
    903             (!node && ncpt > num_online_cpus())) {
    904                 CERROR("Invalid pattern %s, or too many partitions %d\n",
    905                        pattern, ncpt);
    906                 return NULL;
    907         }

Looking at the message, a quick glance at the code suggests that the ncpt == 0 test at line 901 succeeds, which implies that num_online_nodes() == 0.

Comment by Dmitry Eremin (Inactive) [ 02/Feb/16 ]

I'm not sure I can share the full log:

[    0.000000] NUMA: Initialized distance table, cnt=2
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x183fffffff]
[    0.000000]   NODE_DATA [mem 0x183ffd9000-0x183fffffff]
[    0.000000] Initmem setup node 1 [mem 0x1840000000-0x1a3fffffff]
[    0.000000]   NODE_DATA [mem 0x1a3ffd1000-0x1a3fff7fff]
[    0.000000] Reserving 167MB of memory at 720MB for crashkernel (System RAM: 106392MB)
[    0.000000]  [ffffea0000000000-ffffea0060ffffff] PMD -> [ffff8817dfe00000-ffff88183fdfffff] on node 0
[    0.000000]  [ffffea0061000000-ffffea0068ffffff] PMD -> [ffff881a37600000-ffff881a3f5fffff] on node 1
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x1a3fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0xb6d61fff]
[    0.000000]   node   0: [mem 0xb6e75000-0xb777efff]
[    0.000000]   node   0: [mem 0xb877f000-0xb89c3fff]
[    0.000000]   node   0: [mem 0xb90c5000-0xb9caefff]
[    0.000000]   node   0: [mem 0xb9d30000-0xbb1cefff]
[    0.000000]   node   0: [mem 0xbbff2000-0xbbffffff]
[    0.000000]   node   0: [mem 0x100000000-0x183fffffff]
[    0.000000]   node   1: [mem 0x1840000000-0x1a3fffffff]
[    0.000000] On node 0 totalpages: 25139431
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 1115 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 11814 pages used for memmap
[    0.000000]   DMA32 zone: 756040 pages, LIFO batch:31
[    0.000000]   Normal zone: 380928 pages used for memmap
[    0.000000]   Normal zone: 24379392 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 2097152
[    0.000000]   Normal zone: 32768 pages used for memmap
[    0.000000]   Normal zone: 2097152 pages, LIFO batch:31
...
[    0.000000] setup_percpu: NR_CPUS:5120 nr_cpumask_bits:304 nr_cpu_ids:304 nr_node_ids:2
[    0.000000] PERCPU: Embedded 31 pages/cpu @ffff8817dd800000 s87168 r8192 d31616 u131072
[    0.000000] pcpu-alloc: s87168 r8192 d31616 u131072 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015
[    0.000000] pcpu-alloc: [0] 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031
[    0.000000] pcpu-alloc: [0] 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047
[    0.000000] pcpu-alloc: [0] 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063
[    0.000000] pcpu-alloc: [0] 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079
[    0.000000] pcpu-alloc: [0] 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095
[    0.000000] pcpu-alloc: [0] 096 097 098 099 100 101 102 103 104 105 106 107 108 109 110 111
[    0.000000] pcpu-alloc: [0] 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
[    0.000000] pcpu-alloc: [0] 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
[    0.000000] pcpu-alloc: [0] 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
[    0.000000] pcpu-alloc: [0] 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
[    0.000000] pcpu-alloc: [0] 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
[    0.000000] pcpu-alloc: [0] 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
[    0.000000] pcpu-alloc: [0] 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
[    0.000000] pcpu-alloc: [0] 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
[    0.000000] pcpu-alloc: [0] 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
[    0.000000] pcpu-alloc: [0] 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
[    0.000000] pcpu-alloc: [0] 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287
[    0.000000] pcpu-alloc: [0] 288 290 292 294 296 298 300 302 --- --- --- --- --- --- --- ---
[    0.000000] pcpu-alloc: [1] 289 291 293 295 297 299 301 303 --- --- --- --- --- --- --- ---
[    0.000000] Built 2 zonelists in Zone order, mobility grouping on.  Total pages: 26809894
[    0.000000] Policy zone: Normal
...
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=304, Nodes=2
[    0.000000] Hierarchical RCU implementation.
[    0.000000]  RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=304.
[    0.000000]  Offload RCU callbacks from all CPUs
[    0.000000]  Offload RCU callbacks from CPUs: 0-303.
[    0.000000] NR_IRQS:327936 nr_irqs:3264 0

Comment by Gerrit Updater [ 21/May/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17824/
Subject: LU-5050 libcfs: default CPT matches NUMA topology
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6b929502063ac0fa8e75a3e09259e9b2b0d0ee6e

Comment by Peter Jones [ 21/May/16 ]

Landed for 2.9

Comment by Gerrit Updater [ 23/May/16 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/20376
Subject: Revert "LU-5050 libcfs: default CPT matches NUMA topology"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7b3e52e9e6c8b47a7d2005860d2893a02b77a05c

Comment by Oleg Drokin [ 23/May/16 ]

I just reverted this change. It causes a 100% failure on start on RHEL7 (now the default):

[  303.759884] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[  303.765035] BUG: unable to handle kernel paging request at ffffffffa01dafcc
[  303.765374] IP: [<ffffffffa01cc21e>] cfs_trimwhite+0x5e/0x70 [libcfs]
[  303.765624] PGD 1c11067 PUD 1c12063 PMD a95c9067 PTE 800000009abee161
[  303.766105] Oops: 0003 [#1] SMP DEBUG_PAGEALLOC
[  303.766445] Modules linked in: libcfs(OE+) rpcsec_gss_krb5 syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm ata_generic pata_acpi i2c_piix4 ata_piix serio_raw pcspkr virtio_balloon virtio_console libata i2c_core virtio_blk floppy nfsd ip_tables
[  303.768469] CPU: 4 PID: 10077 Comm: insmod Tainted: G           OE  ------------   3.10.0-debug #1
[  303.768685] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  303.768851] task: ffff8800931f0700 ti: ffff8800af7cc000 task.ti: ffff8800af7cc000
[  303.769041] RIP: 0010:[<ffffffffa01cc21e>]  [<ffffffffa01cc21e>] cfs_trimwhite+0x5e/0x70 [libcfs]
[  303.769330] RSP: 0018:ffff8800af7cfcd0  EFLAGS: 00010246
[  303.769484] RAX: ffffffffa01dafcc RBX: ffffffffa01dafcb RCX: 0000000000000001
[  303.769669] RDX: 000000000000004e RSI: 0000000000000000 RDI: ffffffffa01dafcb
[  303.769861] RBP: ffff8800af7cfcd8 R08: 0000000000000001 R09: 0000000000000000
[  303.770489] R10: 0000000000000000 R11: ffff8800931f0fd8 R12: ffff8800ab5ca480
[  303.771130] R13: ffffffffa0222000 R14: ffffffffa01dafcb R15: 0000000000000000
[  303.771753] FS:  00007f9b47a79740(0000) GS:ffff8800bc700000(0000) knlGS:0000000000000000
[  303.772834] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  303.773434] CR2: ffffffffa01dafcc CR3: 00000000af2c6000 CR4: 00000000000006e0
[  303.774045] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  303.774682] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  303.775314] Stack:
[  303.775840]  0000000000000000 ffff8800af7cfd40 ffffffffa01c25d8 ffffffffa01e3170
[  303.777178]  ffff8800af7cfd10 00000000810aa9ee 0000000000000000 ffff8800ab5ca480
[  303.778514]  00000000b6b465f8 0000000000000000 ffff8800ab5ca480 ffffffffa0222000
[  303.779836] Call Trace:
[  303.780395]  [<ffffffffa01c25d8>] cfs_cpu_init+0x1f8/0xcd0 [libcfs]
[  303.780993]  [<ffffffffa0222000>] ? 0xffffffffa0221fff
[  303.781588]  [<ffffffffa022202f>] libcfs_init+0x2f/0x1000 [libcfs]
[  303.782210]  [<ffffffff810020e8>] do_one_initcall+0xb8/0x230
[  303.782810]  [<ffffffff810f3e6e>] load_module+0x138e/0x1bc0
[  303.783414]  [<ffffffff8139de20>] ? ddebug_proc_write+0xf0/0xf0
[  303.784007]  [<ffffffff810eff23>] ? copy_module_from_fd.isra.40+0x53/0x150
[  303.784639]  [<ffffffff810f4876>] SyS_finit_module+0xa6/0xd0
[  303.785239]  [<ffffffff81711809>] system_call_fastpath+0x16/0x1b
[  303.785853] Code: e8 b8 45 1b e1 48 01 d8 48 39 d8 77 11 eb 1c 66 0f 1f 44 00 00 48 83 e8 01 48 39 d8 74 0d 0f b6 50 ff f6 82 60 de 87 81 20 75 ea <c6> 00 00 48 89 d8 5b 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 
[  303.791791] RIP  [<ffffffffa01cc21e>] cfs_trimwhite+0x5e/0x70 [libcfs]

This corresponds to:

0xf21e is in cfs_trimwhite (libcfs/libcfs/libcfs_string.c:185).
180			if (!isspace(end[-1]))
181				break;
182			end--;
183		}
184	
185		*end = 0;
186		return str;
187	}
188	EXPORT_SYMBOL(cfs_trimwhite);
189	
Comment by Gerrit Updater [ 23/May/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20376/
Subject: Revert "LU-5050 libcfs: default CPT matches NUMA topology"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ae6fc0156d11ae730fbb284085a2050006b570c7

Comment by Olaf Weber [ 23/May/16 ]

My quick guess as to the cause is that cfs_trimwhite() modifies the input string, while the string literal "N" was stored in read-only memory. The easiest solution is likely for cfs_cpt_table_create_pattern() to make a copy of the pattern string.
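
A minimal sketch of that suggested fix (my illustration only, not the patch that eventually landed; the wrapper name is hypothetical):

#include <linux/slab.h>
#include <linux/string.h>

/* Duplicate the module-parameter string before parsing, since the
 * default pattern "N" may be a string literal in read-only memory and
 * cfs_cpt_table_create_pattern() trims it in place. */
static struct cfs_cpt_table *cpt_table_create_pattern_copy(const char *pattern)
{
	struct cfs_cpt_table *cptab;
	char *copy;

	copy = kstrdup(pattern, GFP_KERNEL);
	if (copy == NULL)
		return NULL;

	cptab = cfs_cpt_table_create_pattern(copy);
	kfree(copy);
	return cptab;
}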

Comment by James A Simmons [ 23/May/16 ]

That is one old bug. I have a question: how did it ever pass testing on RHEL7 in the first place?

Comment by Andreas Dilger [ 15/Jul/16 ]

Dmitry, could you please work on an updated version of the http://review.whamcloud.com/17824 patch? We'd like to get it included in 2.9.0.

Comment by Gerrit Updater [ 08/Sep/16 ]

Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/22377
Subject: LU-5050 libcfs: default CPT matches NUMA topology
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1916e280b1f0ac579d17a6cb9134923f1481176c

Comment by Gerrit Updater [ 14/Sep/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/22507
Subject: LU-5050 libcfs: default CPT matches NUMA topology
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cf67b83eb74bc84e90d0a02ea306405a0fc76b43

Comment by Gerrit Updater [ 21/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22377/
Subject: LU-5050 libcfs: default CPT matches NUMA topology
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4fb513d0a635ce749ddb2173e9841814622ba4a2

Comment by Peter Jones [ 22/Sep/16 ]

Landed for 2.9
