[LU-10133] Multi-page allocation failures in mlx4/mlx5 Created: 17/Oct/17 Updated: 27/Apr/20 Resolved: 23/Nov/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak |
| Environment: |
Soak cluster - lustre-master build 3654 lustre version=2.10.54_13_g84f690e |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I am seeing multiple page allocation failures from soak-clients. Failures seem to be semi-random.
Oct 17 02:20:07 soak-17 kernel: kworker/u480:1: page allocation failure: order:8, mode:0x80d0
Oct 17 02:20:07 soak-17 kernel: CPU: 9 PID: 58714 Comm: kworker/u480:1 Tainted: G OE ------------ 3.10.0-693.2.2.el7.x86_64 #1
Oct 17 02:20:07 soak-17 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Oct 17 02:20:08 soak-17 kernel: Workqueue: rdma_cm cma_work_handler [rdma_cm]
Oct 17 02:20:08 soak-17 kernel: 00000000000080d0 00000000a9e78c95 ffff8803ee9bf848 ffffffff816a3db1
Oct 17 02:20:08 soak-17 kernel: ffff8803ee9bf8d8 ffffffff81188810 0000000000000000 ffff88043ffdb000
Oct 17 02:20:08 soak-17 kernel: 0000000000000008 00000000000080d0 ffff8803ee9bf8d8 00000000a9e78c95
Oct 17 02:20:08 soak-17 kernel: Call Trace:
Oct 17 02:20:08 soak-17 kernel: [<ffffffff816a3db1>] dump_stack+0x19/0x1b
Oct 17 02:20:08 soak-17 kernel: [<ffffffff81188810>] warn_alloc_failed+0x110/0x180
Oct 17 02:20:08 soak-17 kernel: [<ffffffff8169fd8a>] __alloc_pages_slowpath+0x6b6/0x724
Oct 17 02:20:08 soak-17 kernel: [<ffffffff8118cd85>] __alloc_pages_nodemask+0x405/0x420
Oct 17 02:20:08 soak-17 kernel: [<ffffffff81030f8f>] dma_generic_alloc_coherent+0x8f/0x140
Oct 17 02:20:08 soak-17 kernel: [<ffffffff81064341>] x86_swiotlb_alloc_coherent+0x21/0x50
Oct 17 02:20:08 soak-17 kernel: [<ffffffffc02914d3>] mlx4_buf_direct_alloc.isra.6+0xd3/0x1a0 [mlx4_core]
Oct 17 02:20:09 soak-17 kernel: [<ffffffffc029176b>] mlx4_buf_alloc+0x1cb/0x240 [mlx4_core]
Oct 17 02:20:09 soak-17 kernel: [<ffffffffc02940d0>] ? __mlx4_cmd+0x560/0x920 [mlx4_core]
Oct 17 02:20:09 soak-17 kernel: [<ffffffffc061085e>] create_qp_common.isra.31+0x62e/0x10d0 [mlx4_ib]
Oct 17 02:20:09 soak-17 kernel: [<ffffffffc061144e>] mlx4_ib_create_qp+0x14e/0x480 [mlx4_ib]
Oct 17 02:20:09 soak-17 kernel: [<ffffffffc03c9c3a>] ib_create_qp+0x7a/0x2f0 [ib_core]
Oct 17 02:20:09 soak-17 kernel: [<ffffffffc04f66d4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
Oct 17 02:20:09 soak-17 kernel: [<ffffffffc0bd8539>] kiblnd_create_conn+0xbf9/0x1960 [ko2iblnd]
Oct 17 02:20:09 soak-17 kernel: [<ffffffffc0be8649>] kiblnd_cm_callback+0x1429/0x2300 [ko2iblnd]
Oct 17 02:20:09 soak-17 kernel: [<ffffffffc04fa57c>] cma_work_handler+0x6c/0xa0 [rdma_cm]
Oct 17 02:20:09 soak-17 kernel: [<ffffffff810a881a>] process_one_work+0x17a/0x440
Oct 17 02:20:09 soak-17 kernel: [<ffffffff810a94e6>] worker_thread+0x126/0x3c0
Oct 17 02:20:09 soak-17 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
Oct 17 02:20:09 soak-17 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Oct 17 02:20:09 soak-17 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Oct 17 02:20:10 soak-17 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
Oct 17 02:20:10 soak-17 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Oct 17 02:20:10 soak-17 kernel: Mem-Info:
Oct 17 02:20:10 soak-17 kernel: active_anon:36658 inactive_anon:27590 isolated_anon:6#012 active_file:2710466 inactive_file:345768 isolated_file:10#012 unevictable:0 dirty:14 writeback:0 unstable:0#012 slab_reclaimable:30971 slab_unreclaimable:3983583#012 mapped:10108 shmem:6384 pagetables:3086 bounce:0#012 free:776253 free_pcp:359 free_cma:0
Oct 17 02:20:11 soak-17 kernel: Node 0 DMA free:15784kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15932kB managed:15848kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Oct 17 02:20:11 soak-17 kernel: lowmem_reserve[]: 0 2580 15620 15620
Oct 17 02:20:11 soak-17 kernel: Node 0 DMA32 free:132736kB min:7320kB low:9148kB high:10980kB active_anon:6472kB inactive_anon:8768kB active_file:1063620kB inactive_file:27644kB unevictable:0kB isolated(anon):24kB isolated(file):40kB present:3051628kB managed:2643828kB mlocked:0kB dirty:8kB writeback:0kB mapped:2140kB shmem:116kB slab_reclaimable:9352kB slab_unreclaimable:1306892kB kernel_stack:1152kB pagetables:1196kB unstable:0kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 17 02:20:11 soak-17 kernel: lowmem_reserve[]: 0 0 13040 13040
Oct 17 02:20:11 soak-17 kernel: Node 0 Normal free:1149812kB min:37012kB low:46264kB high:55516kB active_anon:69848kB inactive_anon:32420kB active_file:4495364kB inactive_file:737992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13353036kB mlocked:0kB dirty:24kB writeback:0kB mapped:9156kB shmem:248kB slab_reclaimable:54264kB slab_unreclaimable:6303688kB kernel_stack:7248kB pagetables:5096kB unstable:0kB bounce:0kB free_pcp:860kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 17 02:20:12 soak-17 kernel: lowmem_reserve[]: 0 0 0 0
Oct 17 02:20:12 soak-17 kernel: Node 1 Normal free:1805688kB min:45728kB low:57160kB high:68592kB active_anon:70700kB inactive_anon:69172kB active_file:5282880kB inactive_file:617436kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16498508kB mlocked:0kB dirty:24kB writeback:0kB mapped:29136kB shmem:25172kB slab_reclaimable:60268kB slab_unreclaimable:8323752kB kernel_stack:5568kB pagetables:6052kB unstable:0kB bounce:0kB free_pcp:1468kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 17 02:20:13 soak-17 kernel: lowmem_reserve[]: 0 0 0 0
Oct 17 02:20:13 soak-17 kernel: Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15784kB
Oct 17 02:20:13 soak-17 kernel: Node 0 DMA32: 2018*4kB (UEM) 1070*8kB (UEM) 670*16kB (UEM) 685*32kB (UEM) 594*64kB (UEM) 199*128kB (UEM) 80*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 133240kB
Oct 17 02:20:13 soak-17 kernel: Node 0 Normal: 8492*4kB (UEM) 5207*8kB (UEM) 3978*16kB (UEM) 8657*32kB (UEM) 8319*64kB (EM) 1594*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1152744kB
Oct 17 02:20:13 soak-17 kernel: Node 1 Normal: 14583*4kB (UEM) 8566*8kB (UEM) 5482*16kB (UEM) 13112*32kB (UEM) 11765*64kB (UEM) 2443*128kB (UM) 418*256kB (UM) 5*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 1809388kB
Oct 17 02:20:13 soak-17 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 17 02:20:13 soak-17 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 17 02:20:13 soak-17 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 17 02:20:14 soak-17 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 17 02:20:14 soak-17 kernel: 3062619 total pagecache pages
Oct 17 02:20:14 soak-17 kernel: 6 pages in swap cache
Oct 17 02:20:14 soak-17 kernel: Swap cache stats: add 13, delete 7, find 0/0
Oct 17 02:20:14 soak-17 kernel: Free swap = 16319432kB
Oct 17 02:20:14 soak-17 kernel: Total swap = 16319484kB
Oct 17 02:20:14 soak-17 kernel: 8369066 pages RAM
Oct 17 02:20:14 soak-17 kernel: 0 pages HighMem/MovableOnly
Oct 17 02:20:14 soak-17 kernel: 241261 pages reserved
Oct 17 02:20:15 soak-17 kernel: kworker/u480:1: page allocation failure: order:8, mode:0x80d0
Oct 17 02:20:15 soak-17 kernel: CPU: 9 PID: 58714 Comm: kworker/u480:1 Tainted: G OE ------------ 3.10.0-693.2.2.el7.x86_64 #1
Oct 17 02:20:15 soak-17 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Oct 17 02:20:15 soak-17 kernel: Workqueue: rdma_cm cma_work_handler [rdma_cm]
The systems appear to recover and continue. Lustre-log dump from soak-17 after the most recent failure attached.
|
| Comments |
| Comment by Andreas Dilger [ 18/Oct/17 ] |
|
It looks like there is a fix for this problem in the upstream kernel, to use kvmalloc_array() instead of kmalloc() for the qp->sq.wrid and qp->rq.wrid allocations in create_qp_common(). The main fix is:
commit e9105cdefbf64cd7aea300f934c92051e7cb7cff
Author: Li Dongyang <dongyang.li@anu.edu.au>
AuthorDate: Wed Aug 16 23:31:23 2017 +1000
Commit: Doug Ledford <dledford@redhat.com>
CommitDate: Tue Aug 22 16:48:35 2017 -0400
IB/mlx4: use kvmalloc_array to allocate wrid
We could use kvmalloc_array instead of the
kmalloc and __vmalloc combination.
After this we don't need to include linux/vmalloc.h
Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
which itself depends on the kvmalloc_array() and kvmalloc() helper functions landed in the following (relatively large) patches. For backporting, it makes sense to just land the subset of those patches that add the kvmalloc_*() functions, rather than changing all of the callsites as well. I don't see any of these helpers in the RHEL 7 kernel I have (linux-3.10.0-514.el7), though kvfree() already exists.
commit a7c3e901a46ff54c016d040847eda598a9e3e653
Author: Michal Hocko <mhocko@suse.com>
AuthorDate: Mon May 8 15:57:09 2017 -0700
Commit: Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Mon May 8 17:15:12 2017 -0700
mm: introduce kv[mz]alloc helpers
Patch series "kvmalloc", v5.
There are many open coded kmalloc with vmalloc fallback instances in the
tree. Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.
As it can be seen implementing kvmalloc requires quite an intimate
knowledge of the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.
Most callers, I could find, have been converted to use the helper
instead. This is patch 6. There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.
[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
This patch (of 9):
Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. This implementation makes sure to not make
a large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
to not warn about allocation failures. This also rules out the OOM
killer as the vmalloc is a more appropriate fallback than a disruptive
user visible action.
This patch also changes some existing users and removes helpers which
are specific for them. In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL). Those need to be
fixed separately.
While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers. Existing ones would have to die
slowly.
[sfr@canb.auug.org.au: f2fs fixup]
Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 752ade68cbd81d0321dfecc188f655a945551b25
Author: Michal Hocko <mhocko@suse.com>
AuthorDate: Mon May 8 15:57:27 2017 -0700
Commit: Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Mon May 8 17:15:13 2017 -0700
treewide: use kv[mz]alloc* rather than opencoded variants
There are many code paths opencoding kvmalloc. Let's use the helper
instead. The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator. E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation. This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously. There is no guarantee something like that happens
though.
This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.
Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
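To make the shape of the mlx4 change concrete, here is a minimal before/after sketch of the qp->sq.wrid allocation in create_qp_common(); this is illustrative, based on the commits quoted above, not the verbatim patch, and field names should be checked against the driver source:

    /* Before: order-N kmalloc with an open-coded __vmalloc() fallback;
     * the kmalloc attempt can dig deep into reclaim before it fails. */
    qp->sq.wrid = kmalloc_array(qp->sq.wqe_cnt, sizeof(u64),
                                gfp | __GFP_NOWARN);
    if (!qp->sq.wrid)
            qp->sq.wrid = __vmalloc(qp->sq.wqe_cnt * sizeof(u64),
                                    gfp, PAGE_KERNEL);

    /* After: kvmalloc_array() implements the same policy correctly --
     * it avoids OOM-killer/retry pressure on the kmalloc attempt and
     * falls back to vmalloc itself. Freed with kvfree(). */
    qp->sq.wrid = kvmalloc_array(qp->sq.wqe_cnt, sizeof(u64), gfp);

The qp->rq.wrid allocation is converted the same way. Note this only helps allocations that are allowed to live in vmalloc space; the DMA-coherent ring buffers allocated via mlx4_buf_alloc() must remain physically contiguous, which is relevant to the failures discussed later in this ticket.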
|
| Comment by Cliff White (Inactive) [ 19/Oct/17 ] |
|
Switched to the lustre-master-ib build, using MOFED instead of the in-kernel drivers. Still seeing multiple page allocation failures on multiple nodes. |
| Comment by John Hammond [ 19/Oct/17 ] |
|
Hi Cliff, do you have a crash dump from a MOFED run? |
| Comment by Sarah Liu [ 19/Oct/17 ] |
|
Hi John,
Oct 19 00:15:39 soak-17 systemd-logind: Removed session 246.
Oct 19 00:15:39 soak-17 systemd: Removed slice User Slice of root.
Oct 19 00:15:39 soak-17 systemd: Stopping User Slice of root.
Oct 19 00:15:54 soak-17 kernel: kworker/u480:3: page allocation failure: order:8, mode:0x80d0
Oct 19 00:15:54 soak-17 kernel: CPU: 5 PID: 19810 Comm: kworker/u480:3 Tainted: G OE ------------ 3.10.0-693.2.2.el7.x86_64 #1
Oct 19 00:15:54 soak-17 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Oct 19 00:15:54 soak-17 kernel: Workqueue: rdma_cm cma_work_handler [rdma_cm]
Oct 19 00:15:54 soak-17 kernel: 00000000000080d0 0000000014a032b0 ffff8806ec793868 ffffffff816a3db1
Oct 19 00:15:54 soak-17 kernel: ffff8806ec7938f8 ffffffff81188810 0000000000000000 ffff88043ffdb000
Oct 19 00:15:54 soak-17 kernel: 0000000000000008 00000000000080d0 ffff8806ec7938f8 0000000014a032b0
Oct 19 00:15:54 soak-17 kernel: Call Trace:
Oct 19 00:15:54 soak-17 kernel: [<ffffffff816a3db1>] dump_stack+0x19/0x1b
Oct 19 00:15:54 soak-17 kernel: [<ffffffff81188810>] warn_alloc_failed+0x110/0x180
Oct 19 00:15:54 soak-17 kernel: [<ffffffff8169fd8a>] __alloc_pages_slowpath+0x6b6/0x724
Oct 19 00:15:54 soak-17 kernel: [<ffffffff8118cd85>] __alloc_pages_nodemask+0x405/0x420
Oct 19 00:15:54 soak-17 kernel: [<ffffffff81030f8f>] dma_generic_alloc_coherent+0x8f/0x140
Oct 19 00:15:54 soak-17 kernel: [<ffffffff81064341>] x86_swiotlb_alloc_coherent+0x21/0x50
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc071b4c4>] mlx4_buf_direct_alloc.isra.7+0xc4/0x180 [mlx4_core]
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc071b73b>] mlx4_buf_alloc+0x1bb/0x250 [mlx4_core]
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc0552425>] create_qp_common+0x645/0x10a0 [mlx4_ib]
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc0723c7b>] ? mlx4_cq_alloc+0x4ab/0x580 [mlx4_core]
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc0553157>] mlx4_ib_create_qp+0x2a7/0x4d0 [mlx4_ib]
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc055dc40>] mlx4_ib_create_qp_wrp+0x10/0x20 [mlx4_ib]
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc04e42aa>] ib_create_qp+0x7a/0x2f0 [ib_core]
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc05eb614>] rdma_create_qp+0x34/0xb0 [rdma_cm]
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc0bcb5c9>] kiblnd_create_conn+0xbf9/0x1960 [ko2iblnd]
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc0bdb8a9>] kiblnd_cm_callback+0x1429/0x22d0 [ko2iblnd]
Oct 19 00:15:54 soak-17 kernel: [<ffffffffc05ef22c>] cma_work_handler+0x6c/0xa0 [rdma_cm]
Oct 19 00:15:54 soak-17 kernel: [<ffffffff810a881a>] process_one_work+0x17a/0x440
Oct 19 00:15:54 soak-17 kernel: [<ffffffff810a94e6>] worker_thread+0x126/0x3c0
Oct 19 00:15:54 soak-17 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
Oct 19 00:15:54 soak-17 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Oct 19 00:15:54 soak-17 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Oct 19 00:15:54 soak-17 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
Oct 19 00:15:54 soak-17 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Oct 19 00:15:54 soak-17 kernel: Mem-Info:
Oct 19 00:15:54 soak-17 kernel: active_anon:3001 inactive_anon:26225 isolated_anon:0#012 active_file:3506921 inactive_file:61152 isolated_file:10#012 unevictable:0 dirty:4 writeback:0 unstable:0#012 slab_reclaimable:30003 slab_unreclaimable:3610646#012 mapped:6288 shmem:4251 pagetables:2713 bounce:0#012 free:650891 free_pcp:3700 free_cma:0
Oct 19 00:15:54 soak-17 kernel: Node 0 DMA free:15848kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15932kB managed:15848kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Oct 19 00:15:54 soak-17 kernel: lowmem_reserve[]: 0 2580 15620 15620
Oct 19 00:15:54 soak-17 kernel: Node 0 DMA32 free:147604kB min:7320kB low:9148kB high:10980kB active_anon:2004kB inactive_anon:5980kB active_file:980924kB inactive_file:28204kB unevictable:0kB isolated(anon):0kB isolated(file):40kB present:3051628kB managed:2643828kB mlocked:0kB dirty:0kB writeback:0kB mapped:364kB shmem:124kB slab_reclaimable:15856kB slab_unreclaimable:1360408kB kernel_stack:1440kB pagetables:1256kB unstable:0kB bounce:0kB free_pcp:4128kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 19 00:15:54 soak-17 kernel: lowmem_reserve[]: 0 0 13040 13040
Oct 19 00:15:54 soak-17 kernel: Node 0 Normal free:699476kB min:37012kB low:46264kB high:55516kB active_anon:2092kB inactive_anon:38928kB active_file:5497500kB inactive_file:91372kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13353036kB mlocked:0kB dirty:4kB writeback:0kB mapped:1296kB shmem:68kB slab_reclaimable:53260kB slab_unreclaimable:6393880kB kernel_stack:6560kB pagetables:5460kB unstable:0kB bounce:0kB free_pcp:4588kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 19 00:15:54 soak-17 kernel: lowmem_reserve[]: 0 0 0 0
Oct 19 00:15:54 soak-17 kernel: Node 1 Normal free:1739896kB min:45728kB low:57160kB high:68592kB active_anon:7908kB inactive_anon:59992kB active_file:7549260kB inactive_file:125032kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16498508kB mlocked:0kB dirty:12kB writeback:0kB mapped:23492kB shmem:16812kB slab_reclaimable:50896kB slab_unreclaimable:6688296kB kernel_stack:5632kB pagetables:4136kB unstable:0kB bounce:0kB free_pcp:6760kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 19 00:15:54 soak-17 kernel: lowmem_reserve[]: 0 0 0 0
Oct 19 00:15:54 soak-17 kernel: Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15848kB
Oct 19 00:15:54 soak-17 kernel: Node 0 DMA32: 3105*4kB (UEM) 3479*8kB (UEM) 2316*16kB (UEM) 1767*32kB (UEM) 209*64kB (UM) 9*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 148380kB
Oct 19 00:15:54 soak-17 kernel: Node 0 Normal: 31182*4kB (UEM) 20577*8kB (UEM) 10519*16kB (UEM) 6073*32kB (UEM) 710*64kB (UEM) 18*128kB (UM) 0*256kB 0*512kB 1*1024kB (E) 0*2048kB 0*4096kB = 700752kB
Oct 19 00:15:54 soak-17 kernel: Node 1 Normal: 19696*4kB (UEM) 23200*8kB (UEM) 22433*16kB (UEM) 20069*32kB (UEM) 7102*64kB (UEM) 162*128kB (UEM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1741040kB
Oct 19 00:15:54 soak-17 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 19 00:15:54 soak-17 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 19 00:15:54 soak-17 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 19 00:15:54 soak-17 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 19 00:15:54 soak-17 kernel: 3571770 total pagecache pages
Oct 19 00:15:54 soak-17 kernel: 35 pages in swap cache
Oct 19 00:15:54 soak-17 kernel: Swap cache stats: add 1185, delete 1150, find 7/13
Oct 19 00:15:54 soak-17 kernel: Free swap = 16314956kB
Oct 19 00:15:54 soak-17 kernel: Total swap = 16319484kB
Oct 19 00:15:54 soak-17 kernel: 8369066 pages RAM
Oct 19 00:15:54 soak-17 kernel: 0 pages HighMem/MovableOnly
Oct 19 00:15:54 soak-17 kernel: 241261 pages reserved
Oct 19 00:15:54 soak-17 kernel: kworker/u480:3: page allocation failure: order:8, mode:0x80d0
Oct 19 00:15:54 soak-17 kernel: CPU: 21 PID: 19810 Comm: kworker/u480:3 Tainted: G OE ------------ 3.10.0-693.2.2.el7.x86_64 #1
Oct 19 00:15:54 soak-17 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
and the ib module info:
[root@soak-17 syslog]# modinfo ib_core
filename: /lib/modules/3.10.0-693.2.2.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_core.ko
license: Dual BSD/GPL
description: core kernel InfiniBand API
author: Roland Dreier
rhelversion: 7.4
srcversion: 88498DC1AE00B29161E536C
depends: mlx_compat
vermagic: 3.10.0-693.2.2.el7.x86_64 SMP mod_unload modversions
parm: send_queue_size:Size of send queue in number of work requests (int)
parm: recv_queue_size:Size of receive queue in number of work requests (int)
parm: roce_v1_noncompat_gid:Default GID auto configuration (Default: yes) (bool)
parm: force_mr:Force usage of MRs for RDMA READ/WRITE operations (bool)
[root@soak-17 syslog]#
|
| Comment by John Hammond [ 20/Oct/17 ] |
|
Can we set panic_on_oom and see if we can get a crash dump?
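For reference, a sketch of the usual procedure on these RHEL 7 nodes (this assumes kdump is already configured; vm.panic_on_oom is the stock sysctl knob):

    # panic when the OOM killer would fire, so kdump captures a vmcore
    sysctl -w vm.panic_on_oom=1
    # persist across reboots
    echo "vm.panic_on_oom = 1" > /etc/sysctl.d/99-oom-debug.conf
    # confirm the capture service is running
    systemctl status kdump

Note that these order-8 allocation failures do not by themselves invoke the OOM killer, so if no OOM occurs a dump may still need to be triggered manually (e.g. echo c > /proc/sysrq-trigger).
|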
| Comment by Cliff White (Inactive) [ 23/Oct/17 ] |
|
I will set up to do this. So far, we have a lot of allocation failures, but very few OOMs, so I may trigger a dump if we don't get one otherwise. |
| Comment by Oleg Drokin [ 08/Nov/17 ] |
|
I filed this in the Red Hat bugzilla to be ported, so I guess all interested parties should ask for it too (I guess the ticket would be private to Intel, at least for now): https://bugzilla.redhat.com/show_bug.cgi?id=1511159 |
| Comment by Gerrit Updater [ 18/Nov/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/30164 |
| Comment by Andreas Dilger [ 18/Nov/17 ] |
|
Bob, it seems I don't have the latest kernel sources on my dev system. Could you please update the patches as appropriate for the various kernels we are building? |
| Comment by Bob Glossman (Inactive) [ 19/Nov/17 ] |
|
Adding kernel patches as is done in https://review.whamcloud.com/30164 isn't a complete solution. Client builds use unpatched kernels, and we now also offer unpatched server builds as an option. This mod provides no fix in those cases. |
| Comment by Andreas Dilger [ 19/Nov/17 ] |
|
Bob, I understand it isn't a solution for unpatched clients and servers, but for Lustre 2.7 and RHEL6 we don't have unpatched servers at all, and it is also possible for users to install the patched kernel and client if they're having this problem. |
| Comment by Bob Glossman (Inactive) [ 27/Nov/17 ] |
|
Patches refreshed for currently supported versions. |
| Comment by Rick Mohr [ 30/Nov/17 ] |
|
Not sure if this is relevant, but I am seeing something very similar on my Lustre servers (Lustre 2.9, CentOS Linux release 7.3.1611, in-kernel IB support, mlx5 driver). |
| Comment by Rick Mohr [ 30/Nov/17 ] |
Nov 13 04:23:19 haven-oss1 kernel: warn_alloc_failed: 240 callbacks suppressed
Nov 13 04:23:19 haven-oss1 kernel: kworker/u32:1: page allocation failure: order:9, mode:0x80d0
Nov 13 04:23:19 haven-oss1 kernel: CPU: 13 PID: 9120 Comm: kworker/u32:1 Tainted: G OE ------------ 3.10.0-514.el7_lustre.x86_64 #1
Nov 13 04:23:19 haven-oss1 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
Nov 13 04:23:19 haven-oss1 kernel: Workqueue: rdma_cm cma_work_handler [rdma_cm]
Nov 13 04:23:19 haven-oss1 kernel: 00000000000080d0 0000000074e4e302 ffff881183c6b810 ffffffff816860f8
Nov 13 04:23:19 haven-oss1 kernel: ffff881183c6b8a0 ffffffff811869a0 0000000000000000 ffff8816bebd9000
Nov 13 04:23:19 haven-oss1 kernel: 0000000000000009 00000000000080d0 ffff881183c6b8a0 0000000074e4e302
Nov 13 04:23:19 haven-oss1 kernel: Call Trace:
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff816860f8>] dump_stack+0x19/0x1b
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff811869a0>] warn_alloc_failed+0x110/0x180
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff81681cb0>] __alloc_pages_slowpath+0x6b7/0x725
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff8118af55>] __alloc_pages_nodemask+0x405/0x420
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff81030fcf>] dma_generic_alloc_coherent+0x8f/0x140
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff81061ed1>] x86_swiotlb_alloc_coherent+0x21/0x50
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa0214bfd>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa0214f7d>] mlx5_buf_alloc_node+0x4d/0xc0 [mlx5_core]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa0215004>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa044d062>] create_kernel_qp.isra.42+0x292/0x7d0 [mlx5_ib]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa044e1ee>] create_qp_common+0xc4e/0xe00 [mlx5_ib]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff8119f25a>] ? kvfree+0x2a/0x40
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff8119f25a>] ? kvfree+0x2a/0x40
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff811de2f6>] ? kmem_cache_alloc_trace+0x1d6/0x200
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa044e68b>] mlx5_ib_create_qp+0x10b/0x4c0 [mlx5_ib]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa0410a1f>] ib_create_qp+0x3f/0x250 [ib_core]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa03aa584>] rdma_create_qp+0x34/0xb0 [rdma_cm]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa05a3437>] kiblnd_create_conn+0xad7/0x1870 [ko2iblnd]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa05b35f9>] kiblnd_cm_callback+0x1429/0x2290 [ko2iblnd]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffffa03ae3ac>] cma_work_handler+0x6c/0xa0 [rdma_cm]
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff810a7f3b>] process_one_work+0x17b/0x470
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff810a8d76>] worker_thread+0x126/0x410
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff810a8c50>] ? rescuer_thread+0x460/0x460
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff810b052f>] kthread+0xcf/0xe0
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff810bf8d6>] ? finish_task_switch+0x56/0x180
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff81696658>] ret_from_fork+0x58/0x90
Nov 13 04:23:19 haven-oss1 kernel: [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
|
| Comment by Andreas Dilger [ 01/Dec/17 ] |
|
Rick, the https://review.whamcloud.com/30164 patch is definitely for you then. It fixes mlx5 in the same way mlx4 was fixed for RHEL 7.0-7.4 and RHEL 6.x. My original version of the patch also fixed mlx4 for RHEL 6.8 and earlier, but that is already fixed in RHEL 6.9. |
| Comment by Andreas Dilger [ 06/Dec/17 ] |
|
Unfortunately, my patch https://review.whamcloud.com/30164 doesn't fix all of the allocation problems here. It also seems that fixes added to mlx4_buf_alloc() have not all been added to mlx5_buf_alloc(), which means we may need several other commits to reduce the allocation size for mlx5, and at least one small improvement for mlx4.
mlx4:
mlx5:
As a workaround, Amir suggests that adding options ko2iblnd map_on_demand=32 in /etc/modprobe.d/ko2iblnd.conf will reduce the size of the QP allocations and will reduce the frequency/severity of this problem.
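Concretely, the workaround is a one-line module option; a sketch of the modprobe.d entry (it takes effect only after ko2iblnd is reloaded or the node is rebooted):

    # /etc/modprobe.d/ko2iblnd.conf
    options ko2iblnd map_on_demand=32
|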
| Comment by Cliff White (Inactive) [ 06/Dec/17 ] |
|
Unfortunately we have been running map_on_demand=32 for quite a while now, at least a year:
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 |
| Comment by Andreas Dilger [ 06/Dec/17 ] |
|
Cliff, the ko2iblnd-opa options are applied to OPA cards only, but the problem affects mlx4 and mlx5 cards, so a separate options ko2iblnd map_on_demand=32 line needs to be added for Mellanox cards. Otherwise, the default is 256.
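Putting the two together, a sketch of what the soak configuration would need: the existing OPA-only stanza plus a default stanza that the Mellanox HCAs will pick up (this assumes the stock Lustre ko2iblnd-probe setup, where the ko2iblnd-opa options are selected only when OPA hardware is detected):

    # /etc/modprobe.d/ko2iblnd.conf
    # applied only on nodes where OPA hardware is detected:
    options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1
    # default for everything else, including mlx4/mlx5 cards:
    options ko2iblnd map_on_demand=32
|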
| Comment by Amir Shehata (Inactive) [ 06/Dec/17 ] |
|
One thing we should do before making this the default is some performance testing to see how setting map_on_demand to 32 will impact mlx4 and mlx5. It will reduce memory usage per QP, but we need to double-check any performance impact. |
| Comment by Alexey Lyashkov [ 09/Dec/17 ] |
|
My tests with map_on_demand=256 show a 1%-2% performance drop for this case. It's not a big change, I think. |
| Comment by Chris Hunter (Inactive) [ 11/Dec/17 ] |
|
We were informed these patches are in Mellanox OFED 4.2 GA. There are similar kvzalloc patches applied to the mlx5 ethernet driver.
1) mm: introduce kv[mz]alloc helpers
Upstream mlx5 patches were committed May 2017: and Aug 2017:
Also an upstream patch for mlx4:
|
| Comment by Andreas Dilger [ 13/Dec/17 ] |
|
Alexey, did you run any tests with "map_on_demand=32"? I think the default value is 256, but reducing this is important for reducing memory usage. |
| Comment by Andreas Dilger [ 13/Dec/17 ] |
|
Chris, I believe the problem has been fixed in the upstream kernel, the problem is that users are hitting this regularly on RHEL6/RHEL7 kernels (client and server) with the in-kernel OFED, so it would be good to get a fix for those systems as well. |
| Comment by James A Simmons [ 04/Jan/18 ] |
|
Which OFED/MOFED version does this fix appear in? This is for those who want to avoid patched kernels on the server side at all costs. |
| Comment by Mahmoud Hanafi [ 09/Jan/18 ] |
|
We are running with MOFED 4.1 and CentOS 7.4 servers:
options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63
We are seeing this issue:
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313194] kworker/u48:3: page allocation failure: order:5, mode:0x8010
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313196] CPU: 20 PID: 57793 Comm: kworker/u48:3 Tainted: G OE ------------ 3.10.0-693.2.2.el7.20170918.x86_64.lustre2101 #1
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313196] Hardware name: SGI.COM CH-C2112-GP2/X10DRU-i+, BIOS 1.0b 05/08/2015
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313200] Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313201] 0000000000008010 0000000022ff91e8 ffff8810a141f7e0 ffffffff81684ac1
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313202] ffff8810a141f870 ffffffff811841c0 0000000000000000 ffff88207ffd8000
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313203] 0000000000000005 0000000000008010 ffff8810a141f870 0000000022ff91e8
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313204] Call Trace:
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313205] [<ffffffff81684ac1>] dump_stack+0x19/0x1b
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313207] [<ffffffff811841c0>] warn_alloc_failed+0x110/0x180
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313209] [<ffffffff81188984>] __alloc_pages_nodemask+0x9b4/0xba0
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313211] [<ffffffff811cc688>] alloc_pages_current+0x98/0x110
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313216] [<ffffffff8118300e>] __get_free_pages+0xe/0x50
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313217] [<ffffffff8133d41e>] swiotlb_alloc_coherent+0x5e/0x150
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313221] [<ffffffff810622c1>] x86_swiotlb_alloc_coherent+0x41/0x50
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313224] [<ffffffffa05aa4c4>] mlx4_buf_direct_alloc.isra.7+0xc4/0x180 [mlx4_core]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313228] [<ffffffffa05aa73b>] mlx4_buf_alloc+0x1bb/0x250 [mlx4_core]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313233] [<ffffffffa07b8435>] create_qp_common+0x645/0x1090 [mlx4_ib]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313237] [<ffffffffa07b9104>] ? mlx4_ib_create_qp+0x254/0x4d0 [mlx4_ib]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313240] [<ffffffffa07b9157>] mlx4_ib_create_qp+0x2a7/0x4d0 [mlx4_ib]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313244] [<ffffffffa07c3c40>] mlx4_ib_create_qp_wrp+0x10/0x20 [mlx4_ib]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313248] [<ffffffffa04d02aa>] ib_create_qp+0x7a/0x2f0 [ib_core]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313253] [<ffffffffa055b2fc>] ipoib_cm_create_tx_qp_rss+0xcc/0x110 [ib_ipoib]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313257] [<ffffffffa055b9f9>] ipoib_cm_tx_init+0x89/0x2f0 [ib_ipoib]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313260] [<ffffffffa055d6b8>] ipoib_cm_tx_start+0x248/0x3c0 [ib_ipoib]
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313263] [<ffffffff810a587a>] process_one_work+0x17a/0x440
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313265] [<ffffffff810a6546>] worker_thread+0x126/0x3c0
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313266] [<ffffffff810a6420>] ? manage_workers.isra.24+0x2a0/0x2a0
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313268] [<ffffffff810ad9ef>] kthread+0xcf/0xe0
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313269] [<ffffffff810ad920>] ? insert_kthread_work+0x40/0x40
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313270] [<ffffffff81695ad8>] ret_from_fork+0x58/0x90
Jan 9 08:37:52 nbp1-oss6 kernel: [1189787.313272] [<ffffffff810ad920>] ? insert_kthread_work+0x40/0x40
So is this issue fixed in MOFED 4.2?
|
| Comment by Cliff White (Inactive) [ 17/Jan/18 ] |
|
We are seeing what may be this issue on the 2.10.3-RC1 tag. |
| Comment by Jay Lan (Inactive) [ 28/Feb/18 ] |
|
The patch https://review.whamcloud.com/30164 would change two kmalloc() calls in create_qp_common() so that a __vmalloc() call is made in case kmalloc() fails. However, both Mahmoud and Cliff White reported a failure at a different location: the mlx4_buf_alloc() call inside the create_qp_common() routine. The fix from #30164 would have no effect on our problem.
[558213.837942] [<ffffffff81686d81>] dump_stack+0x19/0x1b
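To illustrate why: the failing allocation is the QP ring buffer itself, which must be DMA-coherent and therefore physically contiguous, so there is no vmalloc fallback to reach for. A simplified sketch of the shape of that path (illustrative, not the verbatim driver code; an order:8 request is 256 contiguous pages, i.e. 1 MB):

    /* mlx4_buf_direct_alloc(), simplified: the whole queue buffer is
     * requested as one DMA-coherent region, which the page allocator
     * must satisfy with 2^order physically contiguous pages. */
    dma_addr_t t;
    buf->direct.buf = dma_alloc_coherent(&dev->persist->pdev->dev,
                                         size, &t, GFP_KERNEL);
    /* vmalloc memory is only virtually contiguous, so a kvmalloc-style
     * fallback is impossible here; relief has to come from allocating
     * the buffer in smaller chunks or from shrinking the QP itself
     * (e.g. the map_on_demand=32 workaround discussed above). */
|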
| Comment by Andreas Dilger [ 23/Nov/18 ] |
|
This issue is fixed in the MOFED 4.4 release. |