[LU-10020] mlx5_warn:mlx5_1:dump_cqe:257:(pid 4031): dump error cqe Created: 22/Sep/17 Updated: 01/Sep/20 Resolved: 18/Dec/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have the patch from . We are testing the Lustre 2.10.1 pre-release on an mlx5 HCA host. lnet_selftest fails, and mounting the filesystem produced this error:

[ 435.503071] mlx5_warn:mlx5_1:dump_cqe:257:(pid 4031): dump error cqe
[ 435.503072] 00000000 00000000 00000000 00000000
[ 435.503072] 00000000 00000000 00000000 00000000
[ 435.503073] 00000000 00000000 00000000 00000000
[ 435.503075] 00000000 9d005304 08000069 005878d2
[ 435.503078] LNet: 4031:0:(o2iblnd_cb.c:3475:kiblnd_complete()) RDMA (tx: ffffc90063356f28) failed: 4
[ 435.503292] LNet: 4029:0:(o2iblnd_cb.c:967:kiblnd_tx_complete()) Tx -> 10.151.20.103@o2ib cookie 0x67 sending 1 waiting 0: failed 5
[ 435.503295] LNet: 4029:0:(o2iblnd_cb.c:1919:kiblnd_close_conn_locked()) Closing conn to 10.151.20.103@o2ib: error -5(waiting)
[ 435.503304] LNet: 4029:0:(rpc.c:1413:srpc_lnet_ev_handler()) LNet event status -5 type 1, RPC errors 11
[ 435.503306] LNet: 4029:0:(rpc.c:1413:srpc_lnet_ev_handler()) Skipped 1 previous similar message
[ 435.503396] LNet: 4151:0:(rpc.c:1143:srpc_client_rpc_done()) Client RPC done: service 5, peer 12345-10.151.20.103@o2ib, status SWI_STATE_REQUEST_SUBMITTED:1:-4
[ 440.503751] LNet: 4152:0:(lib-move.c:830:lnet_post_send_locked()) Dropping message for 12345-10.151.20.103@o2ib: peer not alive
[ 440.503754] LNet: 4152:0:(lib-move.c:2827:LNetPut()) Error sending PUT to 12345-10.151.20.103@o2ib: -113
[ 440.503757] LNet: 4152:0:(rpc.c:1413:srpc_lnet_ev_handler()) LNet event status -113 type 5, RPC errors 16
[ 440.503758] LNet: 4152:0:(rpc.c:1413:srpc_lnet_ev_handler()) Skipped 4 previous similar messages
[ 440.503765] LNet: 4152:0:(rpc.c:1143:srpc_client_rpc_done()) Client RPC done: service 5, peer 12345-10.151.20.103@o2ib, status SWI_STATE_REQUEST_SUBMITTED:1:-4
[ 506.581347] LNet: 4173:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.151.20.103@o2ib, timeout 64.
[ 506.581363] LNet: 4147:0:(rpc.c:1143:srpc_client_rpc_done()) Client RPC done: service 11, peer 12345-10.151.20.103@o2ib, status SWI_STATE_REQUEST_SENT:1:-4
[ 506.581367] LustreError: 4147:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.151.20.103@o2ib failed with -110
|
| Comments |
| Comment by Brad Hoagland (Inactive) [ 22/Sep/17 ] |
|
Hello, can you please confirm whether you are using the 2.10.1 pre-release on a production system? |
| Comment by Mahmoud Hanafi [ 22/Sep/17 ] |
|
This particular system is a router and not in production yet. We are testing multirail for production use.
|
| Comment by Peter Jones [ 22/Sep/17 ] |
|
Then let's move to sev 2 - sev 1 is just for production outages |
| Comment by Peter Jones [ 22/Sep/17 ] |
|
Amir, can you please advise on this one? Thanks, Peter |
| Comment by Mahmoud Hanafi [ 22/Sep/17 ] |
|
Additional info:
HCA: Mellanox Technologies MT27700 Family [ConnectX-4]
OS: SLES 12 SP2 (4.4.74-92.32.1.20170808-nasa)
OFED: MLNX OFED 3.4.2
Lustre 2.9: works. Lustre 2.10.0: works. Lustre 2.10.1: does not work.
|
| Comment by Amir Shehata (Inactive) [ 22/Sep/17 ] |
|
I wonder if this commit is the problem:

commit f87c7c2cee6fc5a0864a757917a414dc605554b3
Author: Doug Oucharek <doug.s.oucharek@intel.com>
Date:   Tue May 16 16:00:53 2017 -0700

    LU-9500 lnd: Don't Page Align remote_addr with FastReg

Can you take out this commit from your 2.10.1 tree and try it out? I'll try to reproduce locally as well. |
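For context, a rough illustrative sketch of what that commit appears to change, based on the kiblnd_fmr_map_tx() snippet quoted later in this ticket. The stand-in struct and function names (demo_*) are mine, not the Lustre definitions: the page-offset masking becomes conditional, so the FastReg path keeps the full remote address while the FMR path still masks it.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins only; the real types live in lnet/klnds/o2iblnd. */
struct demo_hdev {
        uint64_t ibh_page_mask;         /* stand-in for hdev->ibh_page_mask */
};

struct demo_rdma_frag {
        uint64_t rf_addr;               /* remote RDMA address */
};

/* Sketch of the guarded masking described by the commit subject: only the
 * non-FastReg (FMR) path adjusts rf_addr before the RDMA is posted. */
void demo_fmr_map_addr(struct demo_hdev *hdev, struct demo_rdma_frag *frag,
                       bool is_fastreg)
{
        if (!is_fastreg)
                frag->rf_addr &= ~hdev->ibh_page_mask;
}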
| Comment by Mahmoud Hanafi [ 22/Sep/17 ] |
|
Removing commit f87c7c2cee6fc5a0864a757917a414dc605554b3 fixed the problem with MOFED 3.4.2. We are building MOFED 4.1 and will test soon. |
| Comment by Amir Shehata (Inactive) [ 22/Sep/17 ] |
|
Ok. |
| Comment by Mahmoud Hanafi [ 24/Sep/17 ] |
|
I tested MOFED 4.1 and Lustre 2.10.1. Didn't get the error. |
| Comment by Amir Shehata (Inactive) [ 25/Sep/17 ] |
|
Just to clarify, is the passing test with MOFED 4.1 and Lustre 2.10.1 minus the LU-9500 commit? |
| Comment by Amir Shehata (Inactive) [ 25/Sep/17 ] |
|
I couldn't reproduce the failure with lnet-selftest on RHEL 7.3/MOFED 3.4.2/Lustre 2.10.1-RC1. Would you be able to provide us with net/neterror logs for this problem? |
| Comment by Mahmoud Hanafi [ 25/Sep/17 ] |
|
With RHEL 7.3/MOFED 3.4.2/Lustre 2.10.1-RC1 I didn't always get the "dump error cqe", but lnet-selftest wasn't working and produced lots of LNet errors. With RHEL 7.3/MOFED 3.4.2/Lustre 2.10 and with RHEL 7.3/MOFED 4.1/Lustre 2.10.1-RC1 it always works, as does removing the commit. Currently I am running with RHEL 7.3/MOFED 4.1/Lustre 2.10.1-RC1; I will revert and gather some logs. I think that to reproduce it, map_on_demand needs to be configured:

options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63 |
| Comment by Amir Shehata (Inactive) [ 26/Sep/17 ] |
|
ok, I'm able to reproduce with map_on_demand=32. I'll investigate further. |
| Comment by Amir Shehata (Inactive) [ 26/Sep/17 ] |
|
It looks like the issue is this code:

kiblnd_fmr_map_tx()
        ...
        if (!is_fastreg)
                rd->rd_frags[0].rf_addr &= ~hdev->ibh_page_mask;

For MOFED 3.4.2 it appears that we need to page-align the remote_addr even when FastReg is enabled. The next step is to install MOFED 4.1 and see if page-aligning the remote_addr triggers the failure there. If it does, then we might need to check the MOFED version to decide whether to page-align. |
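To make the effect of that mask concrete, here is a small standalone sketch. The 4 KiB page size, the sample address, and the assumption that the page mask follows the usual ~(page_size - 1) convention are illustrative, not the Lustre/o2iblnd definitions: masking with ~page_mask keeps only the offset within the page (the FMR-style "page aligned" address), whereas FastReg keeps the full address.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative values -- assumed 4 KiB page and page_mask = ~(page_size - 1). */
#define DEMO_PAGE_SIZE  4096ULL
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

int main(void)
{
        uint64_t rf_addr = 0x7f3a12345678ULL;   /* sample remote RDMA address */

        /* The quoted masking keeps only the address bits inside the page. */
        uint64_t in_page_offset = rf_addr & ~DEMO_PAGE_MASK;

        printf("full remote_addr (kept with FastReg): 0x%" PRIx64 "\n", rf_addr);
        printf("in-page offset   (FMR-style masking): 0x%" PRIx64 "\n", in_page_offset);
        return 0;
}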
| Comment by Amir Shehata (Inactive) [ 04/Oct/17 ] |
|
Please take a look at the patch and let me know if it helps in your case. |
| Comment by Mahmoud Hanafi [ 10/Oct/17 ] |
|
Will
|
| Comment by Amir Shehata (Inactive) [ 12/Oct/17 ] |
|
Please take a look at my comment here: |
| Comment by Mahmoud Hanafi [ 18/Dec/18 ] |
|
I think this can be closed. All fixes have been pushed to MOFED 4.4.2. |
| Comment by Peter Jones [ 18/Dec/18 ] |
|
ok - thanks |