[LU-10020] mlx5_warn:mlx5_1:dump_cqe:257:(pid 4031): dump error cqe Created: 22/Sep/17  Updated: 01/Sep/20  Resolved: 18/Dec/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug
Priority: Critical
Reporter: Mahmoud Hanafi
Assignee: Amir Shehata (Inactive)
Resolution: Not a Bug
Votes: 0
Labels: None

Issue Links:
Related
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

We have the patch from LU-8752 applied.

We are testing a Lustre 2.10.1 pre-release on an mlx5 HCA host. lnet_selftest fails, and mounting the filesystem produces this error:

[ 435.503071] mlx5_warn:mlx5_1:dump_cqe:257:(pid 4031): dump error cqe
[ 435.503072] 00000000 00000000 00000000 00000000
[ 435.503072] 00000000 00000000 00000000 00000000
[ 435.503073] 00000000 00000000 00000000 00000000
[ 435.503075] 00000000 9d005304 08000069 005878d2
[ 435.503078] LNet: 4031:0:(o2iblnd_cb.c:3475:kiblnd_complete()) RDMA (tx: ffffc90063356f28) failed: 4
[ 435.503292] LNet: 4029:0:(o2iblnd_cb.c:967:kiblnd_tx_complete()) Tx -> 10.151.20.103@o2ib cookie 0x67 sending 1 waiting 0: failed 5
[ 435.503295] LNet: 4029:0:(o2iblnd_cb.c:1919:kiblnd_close_conn_locked()) Closing conn to 10.151.20.103@o2ib: error -5(waiting)
[ 435.503304] LNet: 4029:0:(rpc.c:1413:srpc_lnet_ev_handler()) LNet event status -5 type 1, RPC errors 11
[ 435.503306] LNet: 4029:0:(rpc.c:1413:srpc_lnet_ev_handler()) Skipped 1 previous similar message
[ 435.503396] LNet: 4151:0:(rpc.c:1143:srpc_client_rpc_done()) Client RPC done: service 5, peer 12345-10.151.20.103@o2ib, status SWI_STATE_REQUEST_SUBMITTED:1:-4
[ 440.503751] LNet: 4152:0:(lib-move.c:830:lnet_post_send_locked()) Dropping message for 12345-10.151.20.103@o2ib: peer not alive
[ 440.503754] LNet: 4152:0:(lib-move.c:2827:LNetPut()) Error sending PUT to 12345-10.151.20.103@o2ib: -113
[ 440.503757] LNet: 4152:0:(rpc.c:1413:srpc_lnet_ev_handler()) LNet event status -113 type 5, RPC errors 16
[ 440.503758] LNet: 4152:0:(rpc.c:1413:srpc_lnet_ev_handler()) Skipped 4 previous similar messages
[ 440.503765] LNet: 4152:0:(rpc.c:1143:srpc_client_rpc_done()) Client RPC done: service 5, peer 12345-10.151.20.103@o2ib, status SWI_STATE_REQUEST_SUBMITTED:1:-4
[ 506.581347] LNet: 4173:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.151.20.103@o2ib, timeout 64.
[ 506.581363] LNet: 4147:0:(rpc.c:1143:srpc_client_rpc_done()) Client RPC done: service 11, peer 12345-10.151.20.103@o2ib, status SWI_STATE_REQUEST_SENT:1:-4
[ 506.581367] LustreError: 4147:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.151.20.103@o2ib failed with -110
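
For readers decoding the trace above: the negative numbers in the LNet lines look like negated Linux errno values, and the positive numbers after "failed" in the o2iblnd lines look like ib_wc_status completion codes from the kernel verbs API. The sketch below is a hypothetical lookup helper. decode_errno and decode_wc_status are names invented here, not Lustre functions, and the numeric values are hard-coded on the assumption that they match Linux's errno list and the ib_wc_status enum order in ib_verbs.h. Verify them against your own kernel headers.

```c
#include <stddef.h>

/* Hypothetical helpers for reading the log above; not part of Lustre. */

struct code_name {
    int code;
    const char *name;
};

/* Linux errno values that appear (negated) in the LNet messages. */
static const struct code_name errno_names[] = {
    { 4,   "EINTR (interrupted)" },
    { 5,   "EIO (I/O error)" },
    { 110, "ETIMEDOUT (timed out)" },
    { 113, "EHOSTUNREACH (no route to host)" },
};

/* Assumed ib_wc_status numbering from the kernel's ib_verbs.h. */
static const struct code_name wc_status_names[] = {
    { 4, "IB_WC_LOC_PROT_ERR (local protection error)" },
    { 5, "IB_WC_WR_FLUSH_ERR (work request flushed)" },
};

/* Linear search of a small table; returns "unknown" on a miss. */
static const char *lookup(const struct code_name *tbl, size_t n, int code)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].code == code)
            return tbl[i].name;
    return "unknown";
}

/* Accepts either sign, since the log prints errno values negated. */
const char *decode_errno(int rc)
{
    if (rc < 0)
        rc = -rc;
    return lookup(errno_names,
                  sizeof(errno_names) / sizeof(errno_names[0]), rc);
}

/* Decodes the status printed by the "RDMA ... failed" and "Tx ... failed" lines. */
const char *decode_wc_status(int status)
{
    return lookup(wc_status_names,
                  sizeof(wc_status_names) / sizeof(wc_status_names[0]),
                  status);
}
```

Under these assumptions, the first failure ("RDMA ... failed: 4") would be a local protection error, and the later -5, -113 and -110 values would be follow-on errors as the connection is closed and the peer is marked not alive.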

 Comments   
Comment by Brad Hoagland (Inactive) [ 22/Sep/17 ]

Hello,

Can you please confirm you are using 2.10.1 pre-release on a production system?

Comment by Mahmoud Hanafi [ 22/Sep/17 ]

This particular system is a router and not in production yet. We are testing multirail for production use.

Comment by Peter Jones [ 22/Sep/17 ]

Then let's move to sev 2 - sev 1 is just for production outages

Comment by Peter Jones [ 22/Sep/17 ]

Amir

Can you please advise on this one?

Thanks

Peter

Comment by Mahmoud Hanafi [ 22/Sep/17 ]

Additional Info:

HCA: Mellanox Technologies MT27700 Family [ConnectX-4]

OS: SLES 12 SP2, 4.4.74-92.32.1.20170808-nasa

OFED: MLNX OFED 3.4.2

Lustre 2.9: works

Lustre 2.10.0: works

Lustre 2.10.1: does not work

Comment by Amir Shehata (Inactive) [ 22/Sep/17 ]

I wonder if the following commit is the problem:

commit f87c7c2cee6fc5a0864a757917a414dc605554b3
Author: Doug Oucharek <doug.s.oucharek@intel.com>
Date:   Tue May 16 16:00:53 2017 -0700

    LU-9500 lnd: Don't Page Align remote_addr with FastReg

Can you take this commit out of your 2.10.1 tree and try again?

I'll try to reproduce locally as well.

Comment by Mahmoud Hanafi [ 22/Sep/17 ]

Removing commit f87c7c2cee6fc5a0864a757917a414dc605554b3 fixed the problem with MOFED 3.4.2. We are building MOFED 4.1 and will test soon.
The commit may be required with MOFED 4.x.

Comment by Amir Shehata (Inactive) [ 22/Sep/17 ]

Ok. LU-9500 was intended to get mofed 4.1 working, but it shouldn't have broken mofed3.4.2. We'll need to resolve that.

Comment by Mahmoud Hanafi [ 24/Sep/17 ]

I tested MOFED 4.1 with Lustre 2.10.1 and did not get the error.

Comment by Amir Shehata (Inactive) [ 25/Sep/17 ]

Just to clarify, is the passing test with mofed 4.1, Lustre 2.10.1 minus LU-9500? Or with LU-9500?

Comment by Amir Shehata (Inactive) [ 25/Sep/17 ]

I couldn't reproduce the failure with lnet-selftest on RHEL 7.3/MOFED 3.4.2/Lustre 2.10.1-RC1.
Are you able to consistently reproduce this with MOFED 3.4.2 + 2.10.1? This patch specifically addresses FastReg.

Would you be able to provide us with net/neterror logs for this problem?

Comment by Mahmoud Hanafi [ 25/Sep/17 ]

With RHEL 7.3/MOFED 3.4.2/Lustre 2.10.1-RC1 I didn't always get the "dump error cqe" message, but lnet-selftest wasn't working and logged lots of LNet errors.

With RHEL 7.3/MOFED 3.4.2/Lustre 2.10 and RHEL 7.3/MOFED 4.1/Lustre 2.10.1-RC1 it always works.

With LU-9500 removed, RHEL 7.3/MOFED 3.4.2/Lustre 2.10.1-RC1 also always worked.

Currently I am running RHEL 7.3/MOFED 4.1/Lustre 2.10.1-RC1. I will revert and gather some logs.

I think map_on_demand needs to be configured to reproduce it.
Here is our config:

options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63
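
Since the reproduction appears to hinge on map_on_demand being set, it can help to check an options line programmatically when comparing hosts. The following is a minimal sketch, assuming the line uses the space-separated key=value form shown above. modprobe_option is a hypothetical helper name, not part of Lustre or modprobe.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the integer value of `key` in a modprobe
 * "options" line, or `fallback` if the key is not present.
 */
long modprobe_option(const char *line, const char *key, long fallback)
{
    size_t klen = strlen(key);
    const char *p = line;

    while ((p = strstr(p, key)) != NULL) {
        /* Accept only whole tokens: preceded by whitespace, followed by '='. */
        int at_token_start = (p == line) || p[-1] == ' ' || p[-1] == '\t';

        if (at_token_start && p[klen] == '=')
            return strtol(p + klen + 1, NULL, 10);
        p += klen; /* skip this partial match, e.g. "timeout" inside "peer_timeout" */
    }
    return fallback;
}
```

With the config line above, looking up "map_on_demand" with a fallback of -1 distinguishes "explicitly set" from "absent", which is the distinction that mattered for reproducing the failure in this thread.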

Comment by Amir Shehata (Inactive) [ 26/Sep/17 ]

OK, I'm able to reproduce with map_on_demand=32.

I'll investigate further.

Comment by Amir Shehata (Inactive) [ 26/Sep/17 ]

It looks like the issue is:

kiblnd_fmr_map_tx()
...
    if (!is_fastreg) 
         rd->rd_frags[0].rf_addr &= ~hdev->ibh_page_mask;

For MOFED 3.4.2 it appears that we need to page align the remote_addr even when FastReg is enabled.

The next step is to install MOFED 4.1 and see whether page aligning the remote_addr triggers a failure. If it does, we might need to check the MOFED version to decide whether to page align.
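
To make the masking in the excerpt concrete, here is a small user-space sketch of the arithmetic. It assumes ibh_page_mask follows the usual ~(page_size - 1) convention, which should be checked against the o2iblnd source. The function names and the plain is_fastreg flag are illustrative only and are not the LND's actual code.

```c
#include <stdint.h>

/*
 * Assumption: the mask has the high (page-number) bits set, as in the
 * common ~(page_size - 1) idiom. page_size must be a power of two.
 */
static uint64_t page_mask(uint64_t page_size)
{
    return ~(page_size - 1);
}

/* addr & mask: round down to the first byte of the containing page. */
uint64_t page_base(uint64_t addr, uint64_t page_size)
{
    return addr & page_mask(page_size);
}

/* addr & ~mask: keep only the byte offset inside the page. */
uint64_t page_offset(uint64_t addr, uint64_t page_size)
{
    return addr & ~page_mask(page_size);
}

/*
 * Mirrors the shape of the quoted branch: the address is masked only
 * when FastReg is not in use; otherwise it is passed through unchanged.
 */
uint64_t remote_addr_for(uint64_t addr, uint64_t page_size, int is_fastreg)
{
    return is_fastreg ? addr : page_offset(addr, page_size);
}
```

Under that assumption, the `&= ~mask` form in the excerpt keeps the in-page offset of the first fragment rather than rounding the address down. The two sides of the `is_fastreg` branch then advertise different remote addresses, which is the behavior the two MOFED versions appear to disagree on.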

Comment by Amir Shehata (Inactive) [ 04/Oct/17 ]

Please look at LU-9983. I believe this would be the same problem. https://review.whamcloud.com/29290 should resolve that issue.

Let me know if the patch helps in your case.

Comment by Mahmoud Hanafi [ 10/Oct/17 ]

Will LU-9983 land as the solution for this issue? We have moved up to mofed4 so we are no longer seeing the issue.

Comment by Amir Shehata (Inactive) [ 12/Oct/17 ]

Please take a look at my comment here:
https://jira.hpdd.intel.com/browse/LU-10089?focusedCommentId=210745&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-210745

Comment by Mahmoud Hanafi [ 18/Dec/18 ]

I think this can be closed. All fixes have been pushed to MOFED 4.4.2.

Comment by Peter Jones [ 18/Dec/18 ]

ok - thanks
