[LU-10775] (sec.c:2363:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 1048576(2097152) Created: 05/Mar/18 Updated: 19/Dec/18 Resolved: 16/Apr/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ruth Klundt (Inactive) | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 7.4 ARM client vs x86 server |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Running IOR on a locally built Lustre branch b2_10 at commit 0f6c448, a couple of initial data transfers work, but transfers quickly start to fail with server-side messages like:
(sec.c:2363:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 1048576(4194304) req@ffff880f052d8050 x1593867370500512/t0(0) o4->d0c9fb64-cf93-52c4-8daf-a80ac8484f6b@194.1.0.2@o2ib4:76/0 lens 608/448 e 0 to 0 dl 1520037046 ref 1 fl Interpret:H/2/0 rc 0/0
Configure argument: --disable-gss. Module options are all defaults on both sides; perhaps something needs to be changed for the ARM client? The server has an MDT + 3 OSTs on one node for testing, with no LNet routers and IB mlx5 connections. |
| Comments |
| Comment by Ruth Klundt (Inactive) [ 06/Mar/18 ] |
|
It appears that the page size is 64K on the ARM client, so the workaround of reducing max_pages_per_rpc to 16 gets rid of this problem. |
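The arithmetic behind this workaround can be sketched as follows. This is only an illustration, not Lustre code: the helper names are hypothetical, and the value of 256 for the default max_pages_per_rpc is an assumption chosen because it matches the 1 MB (x86) and 16 MB (ARM) RPC sizes mentioned later in this ticket.

```python
# Illustrative sketch only; helper names are hypothetical, not Lustre APIs.

X86_PAGE_SIZE = 4 * 1024     # 4K pages on the x86 server
ARM_PAGE_SIZE = 64 * 1024    # 64K pages on the ARM client (per this ticket)
ASSUMED_DEFAULT_PAGES = 256  # assumption: default max_pages_per_rpc


def rpc_bytes(page_size, max_pages_per_rpc):
    """Bulk RPC size in bytes implied by a page count and a page size."""
    return page_size * max_pages_per_rpc


def pages_for_rpc_bytes(target_bytes, page_size):
    """Page count needed to keep a bulk RPC at target_bytes."""
    return target_bytes // page_size


if __name__ == "__main__":
    x86_rpc = rpc_bytes(X86_PAGE_SIZE, ASSUMED_DEFAULT_PAGES)
    arm_rpc = rpc_bytes(ARM_PAGE_SIZE, ASSUMED_DEFAULT_PAGES)
    print("x86 RPC bytes:", x86_rpc)
    print("ARM RPC bytes:", arm_rpc)
    print("ARM pages for an x86-sized RPC:",
          pages_for_rpc_bytes(x86_rpc, ARM_PAGE_SIZE))
```

Under these assumptions, the same page count yields an RPC 16 times larger on the 64K-page client, and 16 pages of 64K bring it back to the 1 MB size the x86 side uses. On the client this setting is typically applied with something like `lctl set_param osc.*.max_pages_per_rpc=16`.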
| Comment by James A Simmons [ 06/Mar/18 ] |
|
I see the exact same errors. I thought this issue was something else. I have an early patch to resolve this, but it's not complete. Let me finish up another thing I'm working on and I will look into it. |
| Comment by James A Simmons [ 06/Mar/18 ] |
|
Give patch https://review.whamcloud.com/#/c/31559 a try |
| Comment by Ruth Klundt (Inactive) [ 07/Mar/18 ] |
|
Thanks James, will do. |
| Comment by Ruth Klundt (Inactive) [ 20/Mar/18 ] |
|
Sorry for the delay. The cluster has moved ahead on the MOFED version (MLNX_OFED_LINUX-4.2-1.4.6.0), and build/insmod of ko2iblnd is problematic now. Not sure if my build is wrong: pointing away from /usr/src/ofa-kernel to the actual kernel rpmbuild directory lets Lustre configure and build, but I get lots of these on insmod:
[170344.847903] ko2iblnd: disagrees about version of symbol ib_create_cq
After restoring the configure default for o2ib to /usr/src/ofa-kernel, 'make' works and the resulting ko2iblnd.ko loads. lctl ping generates a server-side error:
LNet: 14703:0:(o2iblnd_cb.c:2355:kiblnd_passive_connect()) Can't accept conn from 194.1.0.2@o2ib4 (version 12): max_frags 16 incompatible without FMR pool (256 wanted)
The server is running a 2.10-ish commit 2f379be, without your patch. Guess I should have patched both sides.
|
| Comment by James A Simmons [ 21/Mar/18 ] |
|
Sadly, Lustre 2.10 is missing patches needed to make it work properly. A bunch of fixes went into 2.11 to make Lustre work with newer OFED stacks or new kernel IB stacks. I have been testing with 2.11 plus my one additional patch. |
| Comment by Ruth Klundt (Inactive) [ 21/Mar/18 ] |
|
I was at 2.11 RC1 with the patch on the client side, 2.10 without the patch on the server. After removing the patch the client connects fine. I'm defaulting any module options. This seems like an interop issue to me.
|
| Comment by James A Simmons [ 21/Mar/18 ] |
|
That is unexpected considering x86 sends 1 MB packets with and without the patch. It's ARM/Power8 that is sending 16 MB packets. I can tell you that the patch on x86 platforms will work with x86 systems without the patch. I have run the upstream client, which lacks the patch, against patched servers. So we have:
patched x86 <-> patched x86 works
unpatched x86 <-> unpatched x86 works
unpatched x86 <-> patched x86 works
patched x86 <-> unpatched x86 ??? should work
patched ARM <-> patched x86 works
unpatched ARM <-> unpatched x86 fails
unpatched ARM <-> patched x86 ?? should fail since ARM is not addressed
patched ARM <-> unpatched x86 fails
Did you try the server side with the patch to see if it works? |
| Comment by James A Simmons [ 22/Mar/18 ] |
|
Actually, I realized I have been testing with an unpatched 2.11 server and it does work. The problem is that Lustre 2.10 is missing a bunch of fixes needed to properly support newer MOFED stacks. Things like queue pair management and map_on_demand have changed dramatically. Amir, can you put together a list of missing patches for 2.10 to make this work? |
| Comment by Ruth Klundt (Inactive) [ 22/Mar/18 ] |
|
whoa, thanks but no need to patch 2.10 to make this work, I'm fine with moving the server to 2.11, it's a tiny toy fs, no routers. I'm far from grasping the whole map_on_demand ish, but maybe I just needed to set it to 256, don't think I did that. ps the test cluster is under work again...so next try will be in a while.
|
| Comment by Amir Shehata (Inactive) [ 26/Mar/18 ] |
|
Ruth, is the server side running RHEL 7.2 or earlier? Looking through the code, the reason you'd get:
LNet: 14703:0:(o2iblnd_cb.c:2355:kiblnd_passive_connect()) Can't accept conn from 194.1.0.2@o2ib4 (version 12): max_frags 16 incompatible without FMR pool (256 wanted)
is that you're not using FMR. This would occur if HAVE_IB_GET_DMA_MR is defined. I believe this is defined for RHEL 7.2 and earlier. You would be able to avoid this issue by setting map-on-demand to 16 on the server side as well. Can you try that and see if it resolves the issue?
James, I consider the map-on-demand changes to be a mini-feature. Not sure if it's the best decision to backport that to 2.10. However, we might consider porting the below patch to 2.10, because it fixes a bug:
LU-10213 lnd: calculate qp max_send_wrs properly |
| Comment by James A Simmons [ 29/Mar/18 ] |
|
Setting map_on_demand to 16 is not going to help. I have tried it before. We are going to need the map-on-demand changes for b2_10. As he pointed out, using the 64K page patch that is slated for 2.12 will break interop with an x86 2.10 server when using ARM clients, since 2.10 lacks all the changes needed to make it possible. So we have a choice here: state that in order to use ARM clients you must use at least a 2.11 server, or backport a bunch of o2iblnd patches to make it possible. Also, many of the changes missing from 2.10 are what make using newer MOFED possible. Do we say you have to stay on a MOFED 3.X version for 2.10? |
| Comment by Ruth Klundt (Inactive) [ 30/Mar/18 ] |
|
Amir, the server side is RHEL 7.4; I built 2.10 at 0f6c448. The OFED is MLNX_OFED_LINUX-4.2-1.0.0.0. configure reports yes to checking if 'ib_get_dma_mr' exists, but also:
WARNING: "ib_get_dma_mr" [/build_area/lustre-release/build/conftest.ko] undefined!
> nm /lib/modules/3.10.0-693.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_core.ko | grep ib_get_dma
0000000000006e50 T ib_get_dma_mr
Setting map_on_demand=16 on the server works, traffic is moving. Thanks. (I guess that would not work if there were other clients mounting with a different setting, though.)
The client side is now: 2.11.0_RC2 + ' + rhel7.5, kernel 4.14.0-49.el7a.aarch64 and MLNX_OFED_LINUX-4.3-1.0.1.0
|
| Comment by James A Simmons [ 05/Apr/18 ] |
|
Ruth, can you join the LWG call today? |
| Comment by Ruth Klundt (Inactive) [ 05/Apr/18 ] |
|
yes I'll be on
|
| Comment by James A Simmons [ 16/Apr/18 ] |
|
Now that
|