[LU-8752] mlx5_warn:mlx5_0:dump_cqe:257: Created: 24/Oct/16  Updated: 18/Jan/18  Resolved: 18/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

mlnx ofed3.2
lustre-2.7.2-2nas-fe
Linux elrtr1 3.0.101-77.1.20160630-nasa #1 SMP Thu Jun 30 00:56:32 UTC 2016 (a082ea6) x86_64 x86_64 x86_64 GNU/Linux


Attachments: File netdebug.mlx4host.gz     File netdebug.mlx5host.gz    
Issue Links:
Blocker
Duplicate
is duplicated by LU-8693 ko2iblnd recieving IB_WC_MW_BIND_ERR ... Resolved
Related
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Running lnet selftest on a mlx5 card we get these errors.

[1477328975.069684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10912): dump error cqe
[1477328975.085684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10906): dump error cqe
[1477328975.085684] 00000000 00000000 00000000 00000000
[1477328975.085684] 00000000 00000000 00000000 00000000
[1477328975.085684] 00000000 00000000 00000000 00000000
[1477328975.085684] 00000000 08007806 2500002f 00085dd0
[1477328975.085684] LustreError: 11028:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.25@o2ib: -5
[1477328975.085684] LustreError: 11028:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer to 12345-10.151.27.25@o2ib has failed: -5
[1477328975.093683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10922): dump error cqe
[1477328975.093683] 00000000 00000000 00000000 00000000
[1477328975.093683] 00000000 00000000 00000000 00000000
[1477328975.093683] 00000000 00000000 00000000 00000000
[1477328975.093683] 00000000 08007806 25000030 000842d0
[1477328975.105683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10915): dump error cqe
[1477328975.105683] 00000000 00000000 00000000 00000000
[1477328975.105683] 00000000 00000000 00000000 00000000
[1477328975.105683] 00000000 00000000 00000000 00000000
[1477328975.105683] 00000000 08007806 25000031 000843d0
[1477328975.113683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10900): dump error cqe
[1477328975.113683] 00000000 00000000 00000000 00000000
[1477328975.113683] 00000000 00000000 00000000 00000000
[1477328975.113683] 00000000 00000000 00000000 00000000
[1477328975.113683] 00000000 08007806 25000032 000840d0
[1477328975.121683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10900): dump error cqe
[1477328975.121683] 00000000 00000000 00000000 00000000
[1477328975.121683] 00000000 00000000 00000000 00000000
[1477328975.121683] 00000000 00000000 00000000 00000000
[1477328975.121683] 00000000 08007806 25000033 000841d0
[1477328975.129683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10915): dump error cqe
[1477328975.129683] 00000000 00000000 00000000 00000000
[1477328975.129683] 00000000 00000000 00000000 00000000
[1477328975.129683] 00000000 00000000 00000000 00000000
[1477328975.129683] 00000000 08007806 2500002e 00085cd0
[1477328975.133683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10907): dump error cqe
[1477328975.133683] 00000000 00000000 00000000 00000000
[1477328975.133683] 00000000 00000000 00000000 00000000
[1477328975.133683] 00000000 00000000 00000000 00000000
[1477328975.133683] 00000000 08007806 25000034 000846d0
[1477328975.205682] 00000000 00000000 00000000 00000000
[1477328975.281682] 00000000 00000000 00000000 00000000
[1477328975.305681] 00000000 00000000 00000000 00000000
[1477328975.305681] 00000000 08007806 2500002d 000b57d0


 Comments   
Comment by James A Simmons [ 24/Oct/16 ]

Do you have map_on_demand enabled? We saw this with one of our systems as well, and when we disabled map_on_demand it went away. This issue only shows up on specific systems. If you look at your logs you will see IB_BIND_ERRORS (or something along those lines). See ticket LU-8693.

Comment by Peter Jones [ 24/Oct/16 ]

Doug

Anything to add here?

Peter

Comment by Doug Oucharek (Inactive) [ 24/Oct/16 ]

Mahmoud: can you confirm if IB_BIND_ERRORS are present in your logs?

Comment by Mahmoud Hanafi [ 24/Oct/16 ]

I don't see IB_BIND_ERRORS, but we are using map_on_demand. Our cluster-wide setting is:

options ko2iblnd timeout=150 retry_count=7 map_on_demand=32 peer_credits=63 concurrent_sends=63

Can we keep the peer_credits and concurrent_sends at 63 with map_on_demand disabled?
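
For reference, a quick hedged sketch of how to confirm which values the module actually loaded with (these are the standard ko2iblnd module parameters exposed under sysfs; adjust the list to whatever you care about):

for p in map_on_demand peer_credits concurrent_sends timeout retry_count; do
    echo -n "$p = "; cat /sys/module/ko2iblnd/parameters/$p
done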

Comment by James A Simmons [ 24/Oct/16 ]

Do you see the following in your logs:

[ 170.597651] LNet: 8714:0:(o2iblnd_cb.c:3433:kiblnd_complete()) FastReg failed: 6
[ 170.597728] LNet: 8713:0:(o2iblnd_cb.c:3444:kiblnd_complete()) RDMA (tx: c000003c6a78c5a8) failed: 5

That error code 6 you see is the ib_wc_status which is IB_WC_MW_BIND_ERR.

I found in my testing that you only need map_on_demand if you have hardware using the mlx4 driver talking to
hardware using the mlx5 driver. If it is mlx5 to mlx5 communication, then map_on_demand is not needed.
As for keeping the peer_credits and concurrent_sends settings, that depends on the setup. I found in my testing
with one Power8 client cluster that I had to reduce concurrent_sends to 31 to reduce the queue pair depth
usage. With concurrent_sends set to 64, the queue pair depth was 64K, while the hardware max is 32K. You have
to check for your hardware whether this is the case. It's all tinkering.

Comment by Mahmoud Hanafi [ 24/Oct/16 ]

If I set map_on_demand=0 then I get this error

elrtr1 login: [1477346024.576963] LNet: 81:0:(o2iblnd_cb.c:2367:kiblnd_passive_connect()) Can't accept conn from 10.151.27.22@o2ib (version 12): max_frags 32 incompatible without FMR pool (256 wanted)

If I set map_on_demand=256 on the mlx5 host I get

#mlx4 config
options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63
#mlx5 config
options ko2iblnd timeout=150 retry_count=7 map_on_demand=256  peer_credits=63 concurrent_sends=63
[1477346171.287627] mlx5_warn:mlx5_0:calc_sq_size:516:(pid 81): wqe count(65536) exceeds limits(32768)
[1477346171.315626] mlx5_warn:mlx5_0:create_kernel_qp:1078:(pid 81): err -12
[1477346171.335626] mlx5_warn:mlx5_0:create_qp_common:1823:(pid 81): err -12
[1477346171.359626] mlx5_warn:mlx5_0:__create_qp:2267:(pid 81): create_qp_common failed
[1477346171.379626] mlx5_warn:mlx5_0:calc_sq_size:516:(pid 81): wqe count(65536) exceeds limits(32768)
[1477346171.407626] mlx5_warn:mlx5_0:create_kernel_qp:1078:(pid 81): err -12
[1477346171.423625] mlx5_warn:mlx5_0:create_qp_common:1823:(pid 81): err -12
[1477346171.447625] mlx5_warn:mlx5_0:__create_qp:2267:(pid 81): create_qp_common failed
[1477346171.479625] mlx5_warn:mlx5_0:dump_cqe:257:(pid 13624): dump error cqe
[1477346171.495625] LustreError: 13730:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from 12345-10.151.27.56@o2ib has failed: -5
[1477346171.495625] LustreError: 13735:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.56@o2ib: -5
[1477346171.567624] LustreError: 13730:0:(brw_test.c:362:brw_server_rpc_done()) Skipped 1 previous similar message
Comment by Mahmoud Hanafi [ 24/Oct/16 ]

So this configuration eliminated the LNet errors, but we are still getting 'dump error cqe'.

#mlx4 config
options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=16 concurrent_sends=16

#mlx5 config
options ko2iblnd timeout=150 retry_count=7 map_on_demand=256  peer_credits=16 concurrent_sends=16

ERRORS

[1477347536.651112] mlx5_warn:mlx5_0:dump_cqe:257:(pid 14273): dump error cqe
[1477347536.671112] 00000000 00000000 00000000 00000000
[1477347536.671112] 00000000 00000000 00000000 00000000
[1477347536.671112] 00000000 00000000 00000000 00000000
[1477347536.671112] 00000000 08007806 2500005c 000e3cd0
[1477347536.671112] LustreError: 14383:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from 12345-10.151.27.56@o2ib has failed: -5
[1477347536.671112] LustreError: 14381:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.56@o2ib: -5
[1477347536.743111] LustreError: 14383:0:(brw_test.c:362:brw_server_rpc_done()) Skipped 1 previous similar message

Comment by James A Simmons [ 24/Oct/16 ]

Why map_on_demand=32? You are sending 32 * 4K in size on x86 platforms. We use map_on_demand=16 on Power8 since the page size is 64K, so 1MB divided by 64K pages equals 16 fragments for map_on_demand. For x86 platforms the optimal is map_on_demand=256, since 1MB divided by 4K pages equals 256.
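
As a quick sanity check on that arithmetic (a trivial shell calculation, nothing Lustre-specific):

# 1 MB bulk transfer / 4 KB pages (x86)    -> 256 fragments
echo $((1048576 / 4096))
# 1 MB bulk transfer / 64 KB pages (Power8) -> 16 fragments
echo $((1048576 / 65536))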

Comment by Mahmoud Hanafi [ 25/Oct/16 ]

The 32 was a recommendation from a previous LU.

Comment by Mahmoud Hanafi [ 25/Oct/16 ]

I tried turning off map_on_demand on both mlx4 and mlx5 and that worked. So we need to understand why map_on_demand causes the errors.

Comment by Jay Lan (Inactive) [ 25/Oct/16 ]

James A. Simmons wrote:
"I found in my testing you only need map_on_demand if you have hardware using the mlx4 driver talking to hardware using the mlx5 driver."

We do have a mix of mlx4 and mlx5 hardware in our fabrics, don't we, Mahmoud?

Comment by James A Simmons [ 25/Oct/16 ]

Also, if you disable map_on_demand you lose a good chunk of performance.

Comment by Doug Oucharek (Inactive) [ 25/Oct/16 ]

I thought that map_on_demand with Mellanox hardware only really helps in cases where you have a high latency network. Under normal conditions, the difference should be minimal. Are you having a different experience, James?

Comment by Mahmoud Hanafi [ 28/Oct/16 ]

Since setting map_on_demand to anything, even 256, causes this error, it must point to how the pools are created with FMR/FastReg.

[1477674572.063221] Lustre: Lustre: Build Version: 2.7.2-3.2nasC_mofed32v3f
[1477674572.123220] LNet: Using FastReg for registration
[1477674572.659216] LNet: Using FastReg for registration
[1477674572.659216] LNet: Skipped 5 previous similar messages
[1477674572.915214] LNet: Added LNI 10.151.25.168@o2ib [16/1024/0/180]
[1477674573.807206] LNet: Added LNI 10.150.25.168@o2ib233 [16/1024/0/180]
[1477674632.270692] mlx5_warn:mlx5_0:dump_cqe:257:(pid 9951): dump error cqe
[1477674632.286692] 00000000 00000000 00000000 00000000
[1477674632.286692] 00000000 00000000 00000000 00000000
[1477674632.286692] 00000000 00000000 00000000 00000000
[1477674632.286692] 00000000 08007806 2500002d 000d4fd2
[1477674632.290692] LustreError: 10088:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from 12345-10.151.27.56@o2ib has failed: -5
[1477674632.290692] LustreError: 10083:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.56@o2ib: -5
[1477674632.362691] LustreError: 10088:0:(brw_test.c:362:brw_server_rpc_done()) Skipped 1 previous similar message
Comment by Doug Oucharek (Inactive) [ 28/Oct/16 ]

Mahmoud: do you have neterror turned on (going to /var/log/messages)? I'm wondering if there is some other feedback we are getting from the lower drivers.

Comment by Mahmoud Hanafi [ 28/Oct/16 ]

Attaching +net debug output for mlx5 and mlx4 hosts.

There are no other errors in /var/log/messages or console.

Comment by Doug Oucharek (Inactive) [ 28/Oct/16 ]

Did this also include +neterror (or do you have those on by default)? Neterror is not included by default yet. I have a patch to change this but it is not getting much traction.

Comment by Mahmoud Hanafi [ 31/Oct/16 ]

Yes, the debug output included +neterror.
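
For reference, a minimal sketch of how that +net/+neterror output gets collected with the standard lctl debug controls (the exact debug mask and output path here are just illustrative):

# enable net and neterror debug, clear the ring buffer, run the test, then dump
lctl set_param debug="+net +neterror"
lctl clear
# ... run lnet selftest ...
lctl dk > /tmp/lnet-debug.log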

Comment by Mahmoud Hanafi [ 31/Oct/16 ]

The above tests were all done with a ConnectIB card. We tried an EDR card and got one piece of additional info. Just before the cqe error we get this:

[1477940064.283784] LNetError: 11647:0:(o2iblnd_cb.c:1082:kiblnd_init_rdma()) RDMA is too large for peer 10.151.27.56@o2ib (131072), src size: 1048576 dst size: 1048576
[1477940064.327784] LNetError: 11647:0:(o2iblnd_cb.c:434:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.151.27.56@o2ib: -90
[1477940064.363784] mlx5_warn:mlx5_1:dump_cqe:257:(pid 11647): dump error cqe
[1477940064.383784] 00000000 00000000 00000000 00000000
[1477940064.383784] 00000000 00000000 00000000 00000000
[1477940064.383784] 00000000 00000000 00000000 00000000
[1477940064.383784] 00000000 08007806 2500002e 026fc4d2
[1477940064.383784] LustreError: 11597:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from 12345-10.151.27.56@o2ib has failed: -5
[1477940064.383784] LustreError: 11598:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.56@o2ib: -5
[1477940064.455783] LustreError: 11597:0:(brw_test.c:362:brw_server_rpc_done()) Skipped 1 previous similar message
Comment by Doug Oucharek (Inactive) [ 31/Oct/16 ]

I can't comment on the dump_cqe message and what it means, but I do understand what is causing the "RDMA is too large for peer" message and its breakage.

This was caused by the patch associated with LU-7650. The patch was trying to make o2iblnd fragments work with PPC's different page sizes. Even after inspections (I was one of those), we did not see the breakage. To be honest, I still don't know why that patch is causing this issue. As a result, a revert patch was created for master and has landed: http://review.whamcloud.com/#/c/23439/.

I suspect that NASA has picked up the patch from LU-7650, but not the revert. Either remove the patch from LU-7650 or apply the revert patch.

Hopefully, the dump_cqe message is associated with this and will go away once this is fixed.

Comment by James A Simmons [ 31/Oct/16 ]

Only if they applied the patch. The patch for LU-7650 never landed in Lustre 2.7 or 2.8.0. In fact, it doesn't exist anywhere now except the upstream client. The problem they are seeing was first seen by Jeremy with one of the LU-3322 patches; I will have to find it in the comments. The patch was still landed because no one else was able to reproduce the problem, and I think he couldn't reproduce it after a while either. It is hit or miss.

Comment by Jay Lan (Inactive) [ 31/Oct/16 ]

I did pick up LU-7650. I don't remember why I picked that one up.

However, I picked up LU-7650 in the 2.7.2-3.1nas build, but Mahmoud reported this problem against 2.7.2-2nas. The LU-7650 patch is not in 2.7.2-2nas.

Comment by Doug Oucharek (Inactive) [ 31/Oct/16 ]

So, I assume that means the dump_cqe message happens without the LU-7650 patch but the "RDMA is too large for peer" message does not. They must then be unrelated.

I'm going to have to go on the Mellanox community board to ask about this mlx5-specific failure. We do not have a support contract with Mellanox and I doubt they will help fix this failure without it. To my knowledge, there is nothing we are doing in ko2iblnd which would break the MLX code "assuming" they did proper backwards compatibility between mlx4 and mlx5. We have never seen such an error or problem with mlx4 so I have to assume they did something to mlx5 to change things.

If NASA has a support contract with Mellanox, perhaps you can raise a ticket with them to get their input on the potential causes of these error messages. That would help us to determine if there is something we need to change in ko2iblnd.

Comment by Mahmoud Hanafi [ 01/Nov/16 ]

We have opened a case with Mellanox and sent them some debugging. Waiting to hear back, but they did say the message is related to Fast Memory Registration Mode.

Comment by Doug Oucharek (Inactive) [ 01/Nov/16 ]

Thank you Mahmoud. I have suspected the Fast Memory Registration code could be the culprit here. We had to add support for that over the older FMR code when using newer MLX cards which only support Fast Memory. As such, that code is fairly new and can have an issue. It will be great if Mellanox can point us specifically at where we are making a mistake so we can address it.

Comment by Mahmoud Hanafi [ 19/Nov/16 ]

After the host is rebooted, the error can be reproduced by reading from mlx4 to mlx5.
This will produce the dump_cqe error.

#test1
lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw read size=1M check=full

But this will work!

#test2
lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw write size=1M check=full 

What's interesting is that once the write test has been run, the read will start to work.
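
For completeness, the lnet selftest boilerplate around those add_test lines looks roughly like the following sketch (group names and NIDs are placeholders, not our actual configuration):

export LST_SESSION=$$
lst new_session read_vs_write
lst add_group mlx4_host 10.151.x.x@o2ib     # NID of the mlx4 node (placeholder)
lst add_group mlx5_host 10.151.y.y@o2ib     # NID of the mlx5 node (placeholder)
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw read size=1M check=full
lst run bulk_rw
lst stat mlx4_host mlx5_host
lst end_session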

Comment by Doug Oucharek (Inactive) [ 23/Nov/16 ]

I've been able to get mixed mlx4 and mlx5 cards in the same cluster. Using your test examples, I have been able to reproduce this issue getting the same results as you.

I verified that the issue only occurs when map_on_demand is non-zero. I have reproduced the issue with both upstream OFED and MOFED (the latest version of each). I have also reproduced it with RHEL 6.8 and 7.3.

No solution yet, but at least I have a way to investigate this now.

Comment by Mahmoud Hanafi [ 23/Nov/16 ]

Mellanox has also been able to reproduce this issue in their lab and are looking at it.

Comment by Doug Oucharek (Inactive) [ 23/Nov/16 ]

Another interesting data point:

I flipped the from and to parameters in your test making it "--from mlx5_host --to mlx4_host". In theory, the issue should happen when I then do a write as this is almost the same thing as what you have. However, in this case, I am not able to reproduce the problem with either read or write.

I believe the difference when you flip the from and to parameters is who initiates the test.

Comment by Mahmoud Hanafi [ 01/Dec/16 ]

Here is an update from Mellanox engineering:

"The failure is happening because of fast reg mr called twice, second time the mkey is not free - but it set for a check if free and its meaning the operation will succeed only if mkey is free.

In order to call for second fast reg mr customer needed to do local invalidate before that call.

(These operation are explained in IB spec)."

Comment by Doug Oucharek (Inactive) [ 01/Dec/16 ]

Thank you for that feedback.  I will dig into the code for the path where we are making this mistake.

Comment by Gerrit Updater [ 06/Dec/16 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/24162
Subject: LU-8752 lnet: Debugging Patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 235b2ce7d07ae80e31a170cb7e424f859dce97b1

Comment by Gerrit Updater [ 12/Dec/16 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/24306
Subject: LU-8752 lnet: Stop MLX5 triggering a dump_cqe
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3dcf679c7f1e2aa99d99a10deb65ac2faa30a629

Comment by Doug Oucharek (Inactive) [ 13/Dec/16 ]

I have submitted a fix to this issue above (ignore the earlier Gerrit patch as it was just for my debugging).  I have validated it on our mlx4 / mlx5 mixed cluster.  I still need to validate it on OmniPath to ensure it does not cause a problem there.

Comment by Mahmoud Hanafi [ 13/Dec/16 ]

Thanks, we will build and test it.

Comment by Doug Oucharek (Inactive) [ 13/Dec/16 ]

OmniPath is not affected by this patch as it uses FMR and not FastReg.  So this change should only affect MLX5 based cards.

Did you need me to push a 2.7FE patch?

Comment by Nathan Dauchy (Inactive) [ 13/Dec/16 ]

Yes, hopefully it is trivial, but we will want to use the 2.7fe patch. Thanks!

Comment by Doug Oucharek (Inactive) [ 14/Dec/16 ]

The 2.7FE patch is: https://review.whamcloud.com/24336

Comment by Mahmoud Hanafi [ 19/Dec/16 ]

Here is some feedback from MLX:

"

According to IB spec state should be set to Free.

Under "MEMORY REGION TYPES" section there is an explanation of states, specifically:
Table: "Memory Region States Summary"
"The following table summarizes the states of Memory Regions L_Keys and R_Keys and the operations allowed on each state:"
Looking at Property / Operation Allowed: for "Fast Register", there are 3 possible states: Invalid, Free, Valid.
The only allowed state is - Free.

"

Comment by Jay Lan (Inactive) [ 27/Dec/16 ]

Do we need this patch for Lustre 2.9 release?

Comment by Peter Jones [ 28/Dec/16 ]

Jay

Are you running with 2.9.x? I had thought that you were using 2.7.x...

Peter

Comment by James A Simmons [ 28/Dec/16 ]

ORNL needs it for the 2.8 release.

Comment by Jian Yu [ 28/Dec/16 ]

Hi James,
Here is the patch for FE 2.8.x release: https://review.whamcloud.com/24365

Comment by Jay Lan (Inactive) [ 28/Dec/16 ]

Hi Peter,
NASA Ames is running 2.7.2+ in production, but we have started testing the 2.9.0 client on SLES12 SP2.

  • jay
Comment by Doug Oucharek (Inactive) [ 03/Jan/17 ]

Mahmoud:

I suspect there is a bug in the Mellanox MLX5 driver somewhere.  My debugging showed that the first time we use the L-key/R-key for an RDMA operation (reading from MLX5), it is being rejected as invalid.  I spent a lot of time tracing the code to see if we had invoked anything which could have changed the state of the R-key/L-key, and found nothing.

That is why I came up with the solution of just invalidating the first key we set and advancing to the next one.  I have not found an issue with invalidating a free key, so I am hoping this fix will not cause any problems down the road (Dmitry's and Amir's concerns in the code review).

Doug

Comment by Gerrit Updater [ 18/Jan/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24306/
Subject: LU-8752 lnet: Stop MLX5 triggering a dump_cqe
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 783428b60a98874b4783f8da48c66019d68d84d6

Comment by Peter Jones [ 18/Jan/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:20:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.