[LU-5050] cpu partitioning oddities Created: 12/May/14 Updated: 21/Jul/18 Resolved: 22/Sep/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Dmitry Eremin (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13953 |
| Description |
|
In tracking down the root cause of |
| Comments |
| Comment by Christopher Morrone [ 12/May/14 ] |
|
Let's start with the following, which was obtained on a PPC64 BG/Q ION (I/O Node) running roughly Lustre 2.4.0-28chaos (see github.com/chaos/lustre).

$ cat cpu_partition_table
0 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 : 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
2 : 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
3 : 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67

Why was this partitioning decided upon? It looks like we wind up with 16 ptlrpcd threads (of the type named ptlrpcd_<number>). How are those 16 threads, using the default ptlrpcd_bind_policy, assigned to those Lustre cpu partitions? |
| Comment by Christopher Morrone [ 12/May/14 ] |
|
Next question is about this code in ptlrpcd.c:ptlrpcd_select_pc():

	case PDL_POLICY_SAME:
		idx = cfs_smp_processor_id() % ptlrpcds->pd_nthreads;
		break;

The code comment in lustre_net.h for the enum pdl_policy_t says:

	typedef enum {
		/* on the same CPU core as the caller */
		PDL_POLICY_SAME = 1,

This is pretty clearly not the case. The code running on a particular CPU will always map to the same ptlrpcd index, not to the same CPU. That ptlrpcd can be, and under many possible settings probably is, running on a different CPU. |
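A minimal sketch (not Lustre code; the function name is mine) of why the modulo selection above yields a fixed ptlrpcd index rather than "the same CPU core as the caller":

	/* Hypothetical illustration: the caller's CPU id only picks a
	 * ptlrpcd *index*; the thread behind that index may be bound to
	 * a different CPU entirely. */
	static int select_ptlrpcd_index(int caller_cpu, int nthreads)
	{
		/* e.g. a caller on CPU 20 with 16 ptlrpcd threads always
		 * gets index 20 % 16 == 4, wherever ptlrpcd_04 runs. */
		return caller_cpu % nthreads;
	}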
| Comment by Liang Zhen (Inactive) [ 13/May/14 ] |
|
Hi Chris, client-side ptlrpcd is not using CPU partitions at all; that is something we need to improve. One purpose of CPU partitions is that we can limit Lustre (server or client) to a subset of CPUs, but we never got resources for the client-side work. The ideas behind CPU partitions are:
For PDL_POLICY_SAME, I agree the comment is wrong, and I will look into how we can improve ptlrpcd. |
| Comment by Christopher Morrone [ 13/May/14 ] |
|
Are you telling me that Lustre uses at least two different unrelated cpu binding mechanisms? So the whole LIOD_BIND stuff is not related to the cpu_partition_table stuff? |
| Comment by Liang Zhen (Inactive) [ 15/May/14 ] |
|
Unfortunately, yes. Only the server side uses the libcfs CPU partitions to set affinity; the client side still uses native kernel APIs to set NUMA affinity for ptlrpcd. I need to work with my manager to get a plan for this, because it is not something that can be done in a couple of days. |
| Comment by Christopher Morrone [ 22/May/14 ] |
|
AH! The cpu partition stuff is what broke /proc/sys/lnet/nis too!!! Geez, I was wondering why I started seeing four copies of everything in there. Look at this:

# zwicky-lcz-oss2 /root > cat /proc/sys/lnet/nis
nid              status alive refs peer rtr max  tx  min
0@lo             up         0    2    0   0   0   0    0
0@lo             up         0    0    0   0   0   0    0
0@lo             up         0    0    0   0   0   0    0
0@lo             up         0    0    0   0   0   0    0
10.1.1.2@o2ib9   up        -1    2    8   0 256 256  248
10.1.1.2@o2ib9   up        -1    3    8   0 256 256  247
10.1.1.2@o2ib9   up        -1    5    8   0 256 256  250
10.1.1.2@o2ib9   up        -1    1    8   0 256 256  255

That is not acceptable output. I'm rather appalled that this code passed review like this. The nis file is supposed to list the network interfaces, but now we've got four copies of each interface. Somehow the concept of a network interface seems to be overloaded with the concept of network queues or something.

And code like this is not written in an understandable and maintainable manner:

	cfs_percpt_for_each(tq, i, ni->ni_tx_queues) {
		for (j = 0; ni->ni_cpts != NULL &&
			    j < ni->ni_ncpts; j++) {
			if (i == ni->ni_cpts[j])
				break;
		}
		if (j == ni->ni_ncpts)
			continue;
|
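For reference, my reading of the quoted loop, expressed as a hypothetical helper (not a proposed patch; the field names come from the snippet, and it assumes ni_ncpts holds the total partition count when ni_cpts is NULL, i.e. "bound to every partition"):

	/* Return whether CPT index 'cpt' is one of the partitions this
	 * NI is bound to; a NULL ni_cpts means all partitions. */
	static bool ni_uses_cpt(struct lnet_ni *ni, int cpt)
	{
		int j;

		if (ni->ni_cpts == NULL)
			return true;
		for (j = 0; j < ni->ni_ncpts; j++)
			if (ni->ni_cpts[j] == cpt)
				return true;
		return false;
	}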
| Comment by Christopher Morrone [ 22/May/14 ] |
|
More and more, I am beginning to think that cpu partitions should be disabled by default in Lustre. I really don't understand how the default partitions are decided. So far I can't find a system at LLNL where the default partitioning makes sense. Without digging into the code, it looks from all of the proc files as though Lustre just assumes 4 partitions. That is not a good plan.

I am also very concerned by the decision to split the network interface credits between these new separate queues. We chose the tx credits carefully to avoid issues like all of the credits being exhausted when one client cluster goes down. Now that assumption may no longer be correct, and we'll need to multiply our credits by 4.

It seems to me that cpu partitioning only makes sense when you actually make the partitions match your hardware, which would require manual sysadmin attention. If that is the case, then cpu partitioning should not be enabled by default. |
| Comment by Liang Zhen (Inactive) [ 05/Jun/14 ] |
|
Chris, I actually saw some presentations from users about performance improvements from CPU partitions (both for metadata and IO performance) in the last two LUGs, so I think it can bring us some benefits.
|
| Comment by Liang Zhen (Inactive) [ 15/Aug/14 ] |
|
Chris, credits are actually another thing that could benefit from CPU partitions, because we can bind an LNet NI to a CPU partition so that each partition has private NI and router-buffer credits. For example:

networks="o2ib0(ib0)[0,1], o2ib1(ib1)[2,3]"   (Lustre manual section 26.3.2 has a little more detail)

All messages for o2ib0 will only be handled by CPUs/threads in partitions [0,1], and all messages for o2ib1 will only be handled by CPUs/threads in partitions [2,3]. That means if routers or servers bind different client networks to different partitions, then those networks each have their own credits, which could be a better solution than just increasing the overall credits. |
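For concreteness, such a binding would normally be expressed as an lnet module option along these lines (copied from Liang's example above; the interface names and partition numbers are illustrative):

	options lnet networks="o2ib0(ib0)[0,1], o2ib1(ib1)[2,3]"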
| Comment by Alexander Zarochentsev [ 15/Jul/15 ] |
|
Also, the Network Request Scheduler doesn't work across CPU partitions, especially the CRR-N policy. |
| Comment by Robin Humble [ 17/Jul/15 ] |
|
I turn off all libcfs affinity on our servers: options libcfs cpu_npartitions=1. this is because it halves (on a 2 numa zone server) the available ram for OSS and MDS inode caching. the OSS case isn't a problem for most people, but the MDS case seems quite serious to me. this is with 2.5.2 lustre.

the underlying asymmetry comes from the lustre threads on the same socket as the preferred socket of the IB card being used far more frequently than the lustre threads on the other socket. this is presumably "by design". however memory affinity of the lustre threads is set so that memory must be allocated from the local zone, rather than just leaving it up to the kernel to prefer the local zone. when the zone fills up with slab entries (which commonly happens on the MDS with large filesystems), reclaim kicks in, and inodes are kicked out of ram, when in reality the other memory zone is nearly unused. so MDS inode caching is close to halved (or worse if there are >2 numa zones).

this situation is perhaps exacerbated on our servers because we need to set cpu affinity on our IB drivers (qib) because they hang if they are migrated (under load) away from the socket where their irq's physically arrive. however even without cpu affinity set on the IB driver, it strongly prefers its "home" socket, and so the lustre threads on that socket also get far more use and the memory used in each zone is still skewed. I presume other types of IB and ethernet drivers are similar in preferring their "physical" socket over the others, but I don't know.

the default libcfs affinity policy (on OSS/MDS) seems to only make sense to me if there are equal numbers of equally used network cards in each numa zone, and is incorrect otherwise. a better choice is cpu_npartitions=1 as all ram can be used and ram speed is always >> disk speed. alternatively, whichever lustre threads allocate slab could be allowed to use all memory zones. ie. cpu affinity, but not memory affinity. I find it hard to believe that memory bandwidth or latency on the server will ever be a limiting factor when compared to network speeds and latencies. at the moment it's roughly 10x isn't it?

in the extreme case (which we have hit) the libcfs default affinity caused spurious OOM's in the lustre threads on our OSS's. however I doubt many other people will see this. this is because a) we set vfs_cache_pressure=0 on OSS's to force caching of inodes over data to make the fs more responsive b) there has to be a zone worth of inodes in use. this happened when we were moving all the files from one filesystem to a new one. I had calculated that we had enough OSS ram to do this, but libcfs affinity meant only 1/2 of it was actually usable. ow. I don't know if affinity also applies to OSS data pages and might reduce available OSS write-through and read caching too. I didn't investigate that.

BTW ldiskfs_inode_cache slab entries are annoyingly 1.02k in size, so only 3 of them fit into a 4k slab page. an amplification in ram usage by 25%. but that is presumably hard to fix, and a separate bug. |
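As a rough illustration of the distinction Robin is drawing, here is a hedged sketch using generic kernel allocator flags (not the libcfs allocator itself, which may use a different mechanism):

	#include <linux/slab.h>

	/* CPU affinity only: the allocator prefers the local NUMA node but
	 * may fall back to any other node when the local zone is full. */
	static void *alloc_prefer_local(size_t size)
	{
		return kmalloc(size, GFP_NOFS);
	}

	/* Hard memory affinity: __GFP_THISNODE forbids that fallback, so
	 * reclaim (or allocation failure) hits even while the other node
	 * still has plenty of free memory -- the behaviour described above. */
	static void *alloc_force_node(size_t size, int nid)
	{
		return kmalloc_node(size, GFP_NOFS | __GFP_THISNODE, nid);
	}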
| Comment by Christopher Morrone [ 15/Dec/15 ] |
|
I don't know if it was true when Liang first said it, but it is certainly not true in 2.5.4 that the client is not impacted by cpu_npartitions. The partitioning is having an effect on binding on clients. |
| Comment by Christopher Morrone [ 15/Dec/15 ] |
|
I think we now have pretty overwhelming evidence that cpu_npartitions=0 is a terrible default. We need to change it to 1 by default. |
| Comment by Robin Humble [ 15/Dec/15 ] |
|
I strongly agree. the current default can cause trouble quite easily, and has performance benefits only with unusual hardware configurations. |
| Comment by Christopher Morrone [ 15/Dec/15 ] |
|
See additional breakage documented in |
| Comment by Andreas Dilger [ 15/Dec/15 ] |
|
Chris, I agree that the default selection of CPU partitions is not ideal, and I'd welcome your input on how to improve it. The hard limit on memory allocation to be within the CPT partition is something that should be fixed. The normal CPU affinity for allocations should be enough, or the CPT allocator should clear the mask and retry on allocation failures. Also note that the CPT affinity of ptlrpcd threads was fixed in

Setting the default CPT count to one can have a serious negative impact on client IO performance, since it reduces the number of ptlrpcd threads, and that will hurt parallel client RPC checksums, which were critical to improving IO performance at your site.

While you assert that the current default is invalid, it is used today on virtually every Lustre system, since it has existed since Lustre 2.2 or 2.3, yet there are only two or three issues reported here, so it can't be as bad as you make out; I might equally argue that the default is reasonable and you have a few systems that need manual tuning. Changing to cpu_partitions=1 could negatively impact all other users with more normal configurations and force them to manually change their configuration to compensate.

Have you manually set cpu_partitions=1 on your local systems since this bug was opened, and can you verify there is no performance impact from that change? Have you done any IO performance benchmarks on your new 48-core system to see what the performance difference is?

Turning off the hard memory affinity and improving the CPT calculations seem like reasonable approaches. |
| Comment by Christopher Morrone [ 15/Dec/15 ] |
|
Did you set cpu_partitions=0 and verify that there were no performance degradations on systems where the alignment was completely wrong? The onus is on you to demonstrate that your defaults are sane. We have already demonstrated that there are very common cases in which those defaults are flat out wrong and break production usage. So don't try to turn around and make it my responsibility to get performance numbers for you. You guys didn't do your homework the first time around. Instead, you waited 18 months and thrust production outages on your customers all because under some very limited set of hardware configurations it is "faster". That is ridiculous. The default should be cpu_partitions=1, and then where admins want the extra performance they can set up the cpu partition tables to actually align with reality. Ergo, fix it already. |
| Comment by Christopher Morrone [ 16/Dec/15 ] |
|
Talking to Peter Jones on the phone, I suspect that Andreas, you might have a misunderstanding about how the cpu partition table stuff works today. By default the cpt code knows nothing about cores, sockets, SMT/hyperthreads, or which cores are mapped to which numa zones. The cpt code in cfs_cpt_num_estimate() knows nothing more than the total number of online cpus and the total number of online nodes. The code goes on to use some random guess about how it should lay out cpu partitions and memory nodes with no knowledge at all of the actual hardware configuration.

The not terribly surprising result of this approach is that the partition table is wrong for every system I have tried it on. That includes perfectly normal x86_64 nodes. I can't even imagine a magic algorithm that would generate a correct partition table without actual hardware knowledge.

I would argue that the current approach of using a completely wrong table is far worse than the old behavior of not using partition tables. And since currently only a human can provide Lustre with the information about how the hardware is organized, this cpu partition table code should be disabled by default.

Disabling the cpu affinity/partitioning by default in Lustre will work for everyone. It will even perform reasonably well for everyone. Those seeking the best performance can provide a correct cpu partition layout and enable the partitioning feature.

I can't recommend a better way to automatically create the layout. The kernel may simply not provide sufficient information to make a good selection. But in that case, the obvious design solution would be to have cpu partitioning disabled by default. |
| Comment by Liang Zhen (Inactive) [ 16/Dec/15 ] |
|
CPT does understand sockets, cores and hyperthreads; for example, with socket0[core0, core1] and socket1[core2, core3], libcfs will never create partitions like CPT0[core0, core2], CPT1[core1, core3]. However, given that the number of cores per socket varies much more today than it did a few years ago (when it was mostly 2, 4, 6, or 8), the selection algorithm can be problematic, and I haven't found a better solution yet. In the meantime, I think there are at least a few things that can be changed easily:
|
| Comment by Robin Humble [ 16/Dec/15 ] |
|
forced memory affinity is dangerous. it should be disabled by default. threads that are pinned to a core already malloc preferentially from the "right" place, but they can malloc from other numa nodes if required. that is exactly the "fallback" behaviour you and Andreas are proposing, but via a convoluted and complex mechanism. just drop the memory affinity and it will happen automatically and simply.

surely a safe option should be the #1 default. the last drop of highly tuned performance can be left up to those who care to read the documentation carefully and know what they're getting into by tuning lustre explicitly for their hardware. |
| Comment by Liang Zhen (Inactive) [ 16/Dec/15 ] |
|
Lustre has quite a few frequently accessed data structures that are allocated by one thread and then used by a different set of threads with affinity, so we need these data structures to be allocated with affinity; otherwise they could always be accessed remotely. |
| Comment by Christopher Morrone [ 16/Dec/15 ] |
Unless Lustre can also deal, by default, with processors coming and going (hyperthread turning on and off, for instance), that is yet another reason why this should not be enabled by default. |
| Comment by Christopher Morrone [ 17/Dec/15 ] |
CPT does understand sockets, cores and hyperthreads, for example, with socket0[core0, core1], socket1[core2, core3], libcfs will never create partitions like CPT0[core0, core2], CPT1[core1, core3].

Can you point me to the code that figures those out? It has certainly looked to me like Lustre ignores cores and hyperthreads. And any honoring of sockets is unclear, and perhaps just coincidental... perhaps specific to the hardware where the code happened to be developed. For instance, I have seen a Lustre cpu partition contain 6 processes: four bound to SMT "processors" on one core, and the other two bound to a core, overlapping with two processes from a second cpu partition on the same core. It is unclear why Lustre would do that if it actually knew about sockets, cores, and SMT/hyperthreads. But hey, prove me wrong by pointing to the code. |
| Comment by Liang Zhen (Inactive) [ 18/Dec/15 ] |
|
In cfs_cpt_choose_ncpus(), it calls cfs_cpu_core_siblings() to select processors from the same socket, and cfs_cpu_ht_siblings() to select processors from the same core. However, I suspect that if libcfs makes a bad decision about the number of partitions, it could separate hyperthreads of the same core into different partitions. I will check this part of the code to see if I can make some fixes/improvements.

As in the previous comments, I think one way to avoid the oddities is to just set cpu_pattern="N" as the default, which will create partitions that match the NUMA topology. But I'm also concerned that this approach may create too few partitions and processing pipelines (request queues, in the context of Lustre). For example, I have a machine with two sockets and NUMA nodes, and each socket has 18*2 SMT processors. With "N" mode, libcfs creates two partitions, and I would think it's better to have 4 partitions (two partitions for each socket) to reduce service lock contention, L1/L2 cache pollution, etc. When this work was done for OpenSFS, I benchmarked and saw that more partitions delivered better performance at first, but too many partitions hurt performance as well (a partition per core, for example) because of noise and performance jitter. Anyway, the programming model for SMP and NUMA systems encourages multiple independent processing pipelines that use local memory where possible. However, the trade-off of these methodologies is that we have to separate the global resources (cpu, memory) into different partitions/pools.

Also, in the example above, if libcfs decided to create 8 partitions, which means 4 partitions for each socket, then it would show the same problem you described, because libcfs would split the 18*2 hyperthreads into 4 partitions of 9 hyperthreads each, and at least one core would be spread across two partitions. If that happened, then I agree it doesn't make much sense and needs to be fixed.

FYI, the work for client stack affinity has been done by SGI, please check |
| Comment by Christopher Morrone [ 18/Dec/15 ] |
|
You'll note that Olaf Weber said "On a NUMA system, making sure the CPTs match the NUMA topology is left as an exercise for the sysadmin." "N" mode is certainly closer to what I want by default than the current default, so I would say go ahead and do that. |
| Comment by Gerrit Updater [ 05/Jan/16 ] |
|
Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/17824 |
| Comment by Dmitry Eremin (Inactive) [ 02/Feb/16 ] |
|
I'm concerned about KNL support. I have an issue on it:

# modprobe -v lustre
insmod /lib/modules/3.10.0-327.3.1.el7.x86_64/extra/kernel/net/lustre/libcfs.ko cpu_pattern="N"
modprobe: ERROR: could not insert 'lustre': Operation not permitted

# dmesg -c
[340771.351434] LNetError: 13576:0:(linux-cpu.c:899:cfs_cpt_table_create_pattern()) Invalid pattern , or too many partitions 0
[340771.351678] LNetError: 13576:0:(linux-cpu.c:1078:cfs_cpu_init()) Failed to create cptab from pattern N

# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287
cpubind: 0
nodebind: 0
membind: 0 1 |
| Comment by Dmitry Eremin (Inactive) [ 02/Feb/16 ] |
|
A similar issue occurs on Xeon Phi as well. |
| Comment by James A Simmons [ 02/Feb/16 ] |
|
Can you post your boot dmesg? I'd like to see what the kernel sets up as the default layout for cores and NUMA nodes. |
| Comment by Olaf Weber [ 02/Feb/16 ] |
868 static struct cfs_cpt_table *
869 cfs_cpt_table_create_pattern(char *pattern)
870 {
...
881 str = cfs_trimwhite(pattern);
882 if (*str == 'n' || *str == 'N') {
...
887 } else { /* shortcut to create CPT from NUMA & CPU topology */
888 node = -1;
889 ncpt = num_online_nodes();
890 }
891 }
...
901 if (ncpt == 0 ||
902 (node && ncpt > num_online_nodes()) ||
903 (!node && ncpt > num_online_cpus())) {
904 CERROR("Invalid pattern %s, or too many partitions %d\n",
905 pattern, ncpt);
906 return NULL;
907 }
Looking at the message, a quick glance at the code suggests that the ncpt == 0 test at line 901 succeeds, which implies that num_online_nodes() == 0. |
| Comment by Dmitry Eremin (Inactive) [ 02/Feb/16 ] |
|
I'm not sure I can share full log:

[ 0.000000] NUMA: Initialized distance table, cnt=2
[ 0.000000] Initmem setup node 0 [mem 0x00000000-0x183fffffff]
[ 0.000000] NODE_DATA [mem 0x183ffd9000-0x183fffffff]
[ 0.000000] Initmem setup node 1 [mem 0x1840000000-0x1a3fffffff]
[ 0.000000] NODE_DATA [mem 0x1a3ffd1000-0x1a3fff7fff]
[ 0.000000] Reserving 167MB of memory at 720MB for crashkernel (System RAM: 106392MB)
[ 0.000000] [ffffea0000000000-ffffea0060ffffff] PMD -> [ffff8817dfe00000-ffff88183fdfffff] on node 0
[ 0.000000] [ffffea0061000000-ffffea0068ffffff] PMD -> [ffff881a37600000-ffff881a3f5fffff] on node 1
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00001000-0x00ffffff]
[ 0.000000] DMA32 [mem 0x01000000-0xffffffff]
[ 0.000000] Normal [mem 0x100000000-0x1a3fffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0xb6d61fff]
[ 0.000000] node 0: [mem 0xb6e75000-0xb777efff]
[ 0.000000] node 0: [mem 0xb877f000-0xb89c3fff]
[ 0.000000] node 0: [mem 0xb90c5000-0xb9caefff]
[ 0.000000] node 0: [mem 0xb9d30000-0xbb1cefff]
[ 0.000000] node 0: [mem 0xbbff2000-0xbbffffff]
[ 0.000000] node 0: [mem 0x100000000-0x183fffffff]
[ 0.000000] node 1: [mem 0x1840000000-0x1a3fffffff]
[ 0.000000] On node 0 totalpages: 25139431
[ 0.000000] DMA zone: 64 pages used for memmap
[ 0.000000] DMA zone: 1115 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 11814 pages used for memmap
[ 0.000000] DMA32 zone: 756040 pages, LIFO batch:31
[ 0.000000] Normal zone: 380928 pages used for memmap
[ 0.000000] Normal zone: 24379392 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 2097152
[ 0.000000] Normal zone: 32768 pages used for memmap
[ 0.000000] Normal zone: 2097152 pages, LIFO batch:31
...
[ 0.000000] setup_percpu: NR_CPUS:5120 nr_cpumask_bits:304 nr_cpu_ids:304 nr_node_ids:2
[ 0.000000] PERCPU: Embedded 31 pages/cpu @ffff8817dd800000 s87168 r8192 d31616 u131072
[ 0.000000] pcpu-alloc: s87168 r8192 d31616 u131072 alloc=1*2097152
[ 0.000000] pcpu-alloc: [0] 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015
[ 0.000000] pcpu-alloc: [0] 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031
[ 0.000000] pcpu-alloc: [0] 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047
[ 0.000000] pcpu-alloc: [0] 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063
[ 0.000000] pcpu-alloc: [0] 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079
[ 0.000000] pcpu-alloc: [0] 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095
[ 0.000000] pcpu-alloc: [0] 096 097 098 099 100 101 102 103 104 105 106 107 108 109 110 111
[ 0.000000] pcpu-alloc: [0] 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
[ 0.000000] pcpu-alloc: [0] 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
[ 0.000000] pcpu-alloc: [0] 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
[ 0.000000] pcpu-alloc: [0] 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
[ 0.000000] pcpu-alloc: [0] 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
[ 0.000000] pcpu-alloc: [0] 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
[ 0.000000] pcpu-alloc: [0] 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
[ 0.000000] pcpu-alloc: [0] 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
[ 0.000000] pcpu-alloc: [0] 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
[ 0.000000] pcpu-alloc: [0] 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
[ 0.000000] pcpu-alloc: [0] 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287
[ 0.000000] pcpu-alloc: [0] 288 290 292 294 296 298 300 302 --- --- --- --- --- --- --- ---
[ 0.000000] pcpu-alloc: [1] 289 291 293 295 297 299 301 303 --- --- --- --- --- --- --- ---
[ 0.000000] Built 2 zonelists in Zone order, mobility grouping on. Total pages: 26809894
[ 0.000000] Policy zone: Normal
...
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=304, Nodes=2
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=304.
[ 0.000000] Offload RCU callbacks from all CPUs
[ 0.000000] Offload RCU callbacks from CPUs: 0-303.
[ 0.000000] NR_IRQS:327936 nr_irqs:3264 0 |
| Comment by Gerrit Updater [ 21/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17824/ |
| Comment by Peter Jones [ 21/May/16 ] |
|
Landed for 2.9 |
| Comment by Gerrit Updater [ 23/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/20376 |
| Comment by Oleg Drokin [ 23/May/16 ] |
|
I just reverted this change. it causes 100% failure on start in rhel7 (now default):

[ 303.759884] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[ 303.765035] BUG: unable to handle kernel paging request at ffffffffa01dafcc
[ 303.765374] IP: [<ffffffffa01cc21e>] cfs_trimwhite+0x5e/0x70 [libcfs]
[ 303.765624] PGD 1c11067 PUD 1c12063 PMD a95c9067 PTE 800000009abee161
[ 303.766105] Oops: 0003 [#1] SMP DEBUG_PAGEALLOC
[ 303.766445] Modules linked in: libcfs(OE+) rpcsec_gss_krb5 syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm ata_generic pata_acpi i2c_piix4 ata_piix serio_raw pcspkr virtio_balloon virtio_console libata i2c_core virtio_blk floppy nfsd ip_tables
[ 303.768469] CPU: 4 PID: 10077 Comm: insmod Tainted: G OE ------------ 3.10.0-debug #1
[ 303.768685] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 303.768851] task: ffff8800931f0700 ti: ffff8800af7cc000 task.ti: ffff8800af7cc000
[ 303.769041] RIP: 0010:[<ffffffffa01cc21e>] [<ffffffffa01cc21e>] cfs_trimwhite+0x5e/0x70 [libcfs]
[ 303.769330] RSP: 0018:ffff8800af7cfcd0 EFLAGS: 00010246
[ 303.769484] RAX: ffffffffa01dafcc RBX: ffffffffa01dafcb RCX: 0000000000000001
[ 303.769669] RDX: 000000000000004e RSI: 0000000000000000 RDI: ffffffffa01dafcb
[ 303.769861] RBP: ffff8800af7cfcd8 R08: 0000000000000001 R09: 0000000000000000
[ 303.770489] R10: 0000000000000000 R11: ffff8800931f0fd8 R12: ffff8800ab5ca480
[ 303.771130] R13: ffffffffa0222000 R14: ffffffffa01dafcb R15: 0000000000000000
[ 303.771753] FS: 00007f9b47a79740(0000) GS:ffff8800bc700000(0000) knlGS:0000000000000000
[ 303.772834] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 303.773434] CR2: ffffffffa01dafcc CR3: 00000000af2c6000 CR4: 00000000000006e0
[ 303.774045] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 303.774682] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 303.775314] Stack:
[ 303.775840] 0000000000000000 ffff8800af7cfd40 ffffffffa01c25d8 ffffffffa01e3170
[ 303.777178] ffff8800af7cfd10 00000000810aa9ee 0000000000000000 ffff8800ab5ca480
[ 303.778514] 00000000b6b465f8 0000000000000000 ffff8800ab5ca480 ffffffffa0222000
[ 303.779836] Call Trace:
[ 303.780395] [<ffffffffa01c25d8>] cfs_cpu_init+0x1f8/0xcd0 [libcfs]
[ 303.780993] [<ffffffffa0222000>] ? 0xffffffffa0221fff
[ 303.781588] [<ffffffffa022202f>] libcfs_init+0x2f/0x1000 [libcfs]
[ 303.782210] [<ffffffff810020e8>] do_one_initcall+0xb8/0x230
[ 303.782810] [<ffffffff810f3e6e>] load_module+0x138e/0x1bc0
[ 303.783414] [<ffffffff8139de20>] ? ddebug_proc_write+0xf0/0xf0
[ 303.784007] [<ffffffff810eff23>] ? copy_module_from_fd.isra.40+0x53/0x150
[ 303.784639] [<ffffffff810f4876>] SyS_finit_module+0xa6/0xd0
[ 303.785239] [<ffffffff81711809>] system_call_fastpath+0x16/0x1b
[ 303.785853] Code: e8 b8 45 1b e1 48 01 d8 48 39 d8 77 11 eb 1c 66 0f 1f 44 00 00 48 83 e8 01 48 39 d8 74 0d 0f b6 50 ff f6 82 60 de 87 81 20 75 ea <c6> 00 00 48 89 d8 5b 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44
[ 303.791791] RIP [<ffffffffa01cc21e>] cfs_trimwhite+0x5e/0x70 [libcfs]

This corresponds to:

0xf21e is in cfs_trimwhite (libcfs/libcfs/libcfs_string.c:185).
180			if (!isspace(end[-1]))
181				break;
182			end--;
183		}
184
185		*end = 0;
186		return str;
187	}
188	EXPORT_SYMBOL(cfs_trimwhite);
189
|
| Comment by Gerrit Updater [ 23/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20376/ |
| Comment by Olaf Weber [ 23/May/16 ] |
|
Quick guess as to the cause is that cfs_trimwhite() modifies the input string, while the string literal "N" was stored in read-only memory. The easiest solution is likely for cfs_cpt_table_create_pattern() to make a copy of the pattern string. |
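A minimal sketch of that suggestion (assuming a GFP_KERNEL allocation is acceptable in this init path; this is not the patch that eventually landed):

	#include <linux/slab.h>
	#include <linux/string.h>

	/* Duplicate the module-parameter string before the existing parser
	 * trims it in place, so a read-only cpu_pattern="N" is never
	 * written to. */
	static struct cfs_cpt_table *cpt_table_create_pattern_dup(const char *pattern)
	{
		struct cfs_cpt_table *cptab;
		char *copy;

		copy = kstrdup(pattern, GFP_KERNEL);
		if (copy == NULL)
			return NULL;

		cptab = cfs_cpt_table_create_pattern(copy); /* existing parser */
		kfree(copy);
		return cptab;
	}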
| Comment by James A Simmons [ 23/May/16 ] |
|
That is one old bug. I have a question. How did it ever pass testing on RHEL7 in the first place? |
| Comment by Andreas Dilger [ 15/Jul/16 ] |
|
Dmitry, could you please work on an updated version of the http://review.whamcloud.com/17824 patch. We'd like to get it included into 2.9.0. |
| Comment by Gerrit Updater [ 08/Sep/16 ] |
|
Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/22377 |
| Comment by Gerrit Updater [ 14/Sep/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/22507 |
| Comment by Gerrit Updater [ 21/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22377/ |
| Comment by Peter Jones [ 22/Sep/16 ] |
|
Landed for 2.9 |