[LU-3322] ko2iblnd support for different map_on_demand and peer_credits between systems Created: 13/May/13  Updated: 14/Jun/19  Resolved: 24/Nov/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Jeremy Filizetti Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Attachments: compatibility matrix.xlsx
Issue Links:
Duplicate
is duplicated by LU-5718 RDMA too fragmented with router Resolved
is duplicated by LU-7401 OOM after LNet initialization with no... Resolved
Related
is related to LU-7101 Lnet: Support per NI map-on-demand Resolved
is related to LU-7569 IB leaf switch caused LNet routers to... Resolved
is related to LU-5783 o2iblnd: investigate new memory regis... Resolved
is related to LU-7351 LNet router crash during bring up of ... Resolved
is related to LU-7650 ko2iblnd map_on_demand can't negotita... Resolved
is related to LU-7314 In kiblnd_rejected(), NULL pointer 'c... Resolved
is related to LU-12419 ppc64le: "LNetError: RDMA has too man... Closed
is related to LUDOC-286 Document the effects of LU-3322 Open
Severity: 3
Bugzilla ID: 20543
Rank (Obsolete): 8528

 Description   

ko2iblnd currently doesn't support different values of peer_credits or map_on_demand between systems.

After I finish some testing I will upload a patch to gerrit in the next couple of days.



 Comments   
Comment by Liang Zhen (Inactive) [ 13/May/13 ]

Adding Isaac and me to the watching list.

Comment by Peter Jones [ 13/May/13 ]

OK Jeremy, I will assign this ticket to myself for now and reassign it to an engineer when you upload the patch.

Comment by Jeremy Filizetti [ 20/Nov/13 ]

Sorry, I've been sitting on this forever. Finally uploaded at http://review.whamcloud.com/#/c/8342/

Comment by Peter Jones [ 20/Nov/13 ]

Thanks Jeremy!

Amir,

could you please review Jeremy's patch?

Thanks

Peter

Comment by Amir Shehata (Inactive) [ 21/Nov/13 ]

I looked at it and as far as I could tell the patch is ok. I would suggest we add Isaac and/or Liang to take a look at it since they know the o2iblnd driver more thoroughly.

Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ]

Amir or Jeremy,
Did we get this finished, and if so, can I mark the ticket as resolved?
Thanks,
~ jfc.

Comment by John Fuchs-Chesney (Inactive) [ 29/Jul/14 ]

Isaac or Liang,
Do you have comments to add here: http://review.whamcloud.com/#/c/8342/ as requested by Andreas?

If this ticket, or the patch itself, is no longer being worked on, can we mark it as resolved please?

Thanks,
~ jfc.

Comment by Liang Zhen (Inactive) [ 30/Jul/14 ]

Sorry for the late reply. I actually think we should have this feature, but I need to review it first.

Comment by Liang Zhen (Inactive) [ 02/Sep/14 ]

I think we should probably prioritise this patch and have this feature in 2.7, at least for the map_on_demand part, because this is something different people have been asking for for years.

Comment by Jeremy Filizetti [ 07/Sep/14 ]

New patch for master uploaded at: http://review.whamcloud.com/#/c/11794/

Comment by James A Simmons [ 13/May/15 ]

This is most excellent. I have a system that will not accept the normal 63 peer_credits we use. I'm going to test this out right now and let you know the results. I refreshed the patch.

Comment by James A Simmons [ 26/May/15 ]

I did some testing and found some issues with the following setup. On the client side I'm using the mlx5 driver from the Mellanox 2.4 stack. This driver does not support FMR but it does support PMR. The client's module parameters are as follows:

options ko2iblnd timeout=100 credits=2560 ntx=5120 peer_credits=63 concurrent_sends=63 pmr_pool_size=1280 fmr_pool_size=1280 fmr_flush_trigger=1024 map_on_demand=64

Here are the logs for the client:

[931569.400915] LNet: HW CPU cores: 160, npartitions: 16
[931611.029095] fmr_pool: Device mlx5_0 does not support FMRs
[931611.029212] LNetError: 26732:0:(o2iblnd.c:1509:kiblnd_create_fmr_pool()) Failed to create FMR pool: -38
[931611.029340] LNet: 26732:0:(o2iblnd.c:2301:kiblnd_net_init_pools()) Device does not support FMR, failing back to PMR
[931611.088611] LNet: Added LNI 10.37.202.11@o2ib1 [63/8064/0/180]

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
On the server side I'm running the default OFED stack that comes with RHEL 6.5, which uses the mlx4 driver. This driver supports FMR. The server's module parameters are as follows:

options ko2iblnd timeout=100 credits=2560 ntx=5120 peer_credits=63 concurrent_sends=63 fmr_pool_size=1280 pmr_pool_size=1280 fmr_flush_trigger=1024

So as you can see, the only difference is map_on_demand on the client. Now when I attempt to ping the server node from the client I get the following error on the SERVER:

[499128.871773] LNetError: 1982:0:(o2iblnd_cb.c:2359:kiblnd_passive_connect()) Can't accept conn from 10.37.202.11@o2ib1 (version 12): max_frags 64 incompatible without FMR/PMR pool (256 wanted)

In the reverse direction, with the server node pinging the client node, I get:

[932734.599925] LNetError: 20856:0:(o2iblnd_cb.c:2345:kiblnd_passive_connect()) Can't accept conn from 10.37.248.67@o2ib1 (version 12): max_frags 256 too large (64 wanted)

Is there any way to avoid having to reconfigure my entire LNet fabric to make this work?

Comment by Jeremy Filizetti [ 26/May/15 ]

You will need to set map_on_demand=256 on the servers as well. It's a little odd: even though the default number of frags is 256, without setting map_on_demand explicitly kiblnd_fmr_map_tx/kiblnd_pmr_map_tx is never called from kiblnd_map_tx.

Since I've never added any documentation, here are some more details from a writeup I did on this quite a while back. I haven't verified whether the module parameters have changed since then, so hopefully it's still accurate.

Summary:
Normally the Lustre ko2iblnd can only operate with identical peer_credits and map_on_demand between systems. The patch affects the active (initiator) IB connections for clients/routers and the passive (responder) IB connections for servers/routers. Passive connections will automatically negotiate down if the parameters permit, and reject if the values requested by the remote side are too high. Active connections will send their defaults initially and, if rejected, attempt to use the lower values from the reject message. If the values are higher the active connection won't retry, because that's not supported.

There are 3 parameters in ko2iblnd of interest here: peer_credits, map_on_demand, and concurrent_sends. The default settings for ko2iblnd are peer_credits=8, concurrent_sends=8, and map_on_demand=0 (disabled).

peer_credits determines how many messages you can receive from a single connection (queue pair).
map_on_demand determines the number of DMA segments per credit that are sent. Each segment is usually (always?) a page and is sent as a separate work request, so these are typically 256 * 4 KB pages going across the wire.
concurrent_sends determines how many send messages you can queue at a time to a single connection. concurrent_sends can't be less than half of peer_credits, but it needs to be <= 62 so as not to exceed the maximum number of work requests per queue pair on the standard Mellanox ConnectX[123] HCAs.

The relation between those values is: work_requests_allocated = (map_on_demand + 1) * concurrent_sends

You can see your max work requests per queue pair with:

  ibv_devinfo -v | grep max_qp_wr
    max_qp_wr: 16384
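
As a quick sanity check (my own illustration, not part of the original writeup; the values below are the ones recommended further down), the formula can be compared against max_qp_wr from the shell:

  # (map_on_demand + 1) * concurrent_sends must stay below max_qp_wr
  echo $(( (256 + 1) * 62 ))                       # 15934, which fits under max_qp_wr=16384
  ibv_devinfo -v | awk '/max_qp_wr/ {print $2}'    # the limit reported by the HCA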

My recommendation is the following for maximum compatibility.
Lustre MDS/OSS Servers and Routers:
Use the patch and run with peer_credits=124 concurrent_sends=62 map_on_demand=256

Existing Lustre clients:
Need to apply the patch or lower the current values for peer_credits and concurrent_sends to match the OSS/MDS setting.
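
For illustration only (a sketch, not taken from the ticket; tunables not shown, such as timeouts, credits and the FMR pool settings, are site-specific), the server/router recommendation above would look roughly like this in a modprobe configuration file:

options ko2iblnd peer_credits=124 concurrent_sends=62 map_on_demand=256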

Comment by James A Simmons [ 27/May/15 ]

I did some testing and your advice on setting map_on_demand=256 worked!! I'm now running nodes with very different map_on_demand settings with no problems.

Comment by Doug Oucharek (Inactive) [ 27/May/15 ]

From the code, I've always considered map_on_demand = 0 an "off" switch for FMR. So, if mlx5 does not support FMR anymore, should map_on_demand be set to zero there? Will that cause connection issues with older mlx versions with FMR turned on?

I'm happy with the patch and will give it a +1 as well. I'm also going to open an LUDOC ticket for documenting how all of this works. Our current documentation is a bit too sparse on all these tunables and how they relate.

Is there a recommendation for changing the default parameters to something more reasonable? I know that you have to turn on FMR (map_on_demand > 0) for Truescale to work at a reasonable performance rate.

Comment by Amir Shehata (Inactive) [ 11/Sep/15 ]

I have pushed a patch which addresses the race conditions that Isaac pointed out.

However, I have some questions which seem fundamental to me.

1. Do we care whether map-on-demand is set on both peers, or do we care whether the numbers of fragments communicated between the peers are compatible, i.e. the active side has a number of fragments equal to or less than the passive peer's?

In the original code, kiblnd_passive_connect() checked whether the numbers of fragments were unequal. On a peer with map-on-demand off, IBLND_RDMA_FRAGS returns 256, so a peer with map-on-demand on and set to 256 can still connect to one with map-on-demand off.

In the LU-3322 patch that code was changed to add a further restriction: the connection is also checked against whether FMR is enabled locally, even if the requested number of frags is less than the local value. What was the reason for that?

2. Why does the number of fragments on both sides need to be compatible (<=)? Why don't we simply remove that restriction completely and let each network admin determine the best values, while we provide documentation on ideal values?

I have attached a compatibility matrix Excel sheet which I hope outlines all the different cases and whether a connection is accepted or rejected. It would be great if someone vets it. Ideally we'd agree on the desired behavior, then move on to implementing that behavior. Currently it seems to me that there is disagreement on how old/new and new/new versions of the software should behave.

Comment by Jeremy Filizetti [ 18/Sep/15 ]

1. I don't think MOD needs to be a requirement for both. For the FMR explanation see 2.

2. For a client with a lower number of fragments I don't see a way to guarantee that the number of pages/frags passed into kiblnd_setup_rd_{iov,kiov} is less than 256. o2iblnd receives messages from LNet, which has LNET_MAX_IOV=256. Without FMR to create a single fragment from multiple pages/segments, you would end up sending a tx with too many fragments for the peer.
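
To put numbers on that (my own illustration, assuming 4 KB pages): a full-sized 1 MB LNet message arrives as up to LNET_MAX_IOV=256 pages, so without FMR coalescing it needs 256 RDMA fragments, which a peer advertising only 32 or 64 frags cannot accept:

  echo $(( (1024 * 1024) / 4096 ))    # 256 fragments for a 1 MB message split into 4 KB pages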

Comment by Gerrit Updater [ 07/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11794/
Subject: LU-3322 ko2iblnd: Support different configs between systems
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7f5c9753872cfa8ad47821be3fa924c74c4c8b0d

Comment by Joseph Gmitter (Inactive) [ 07/Oct/15 ]

Landed for 2.8

Comment by Mahmoud Hanafi [ 07/Oct/15 ]

Can we get this back ported to lustre 2.5.3?

Comment by Dmitry Eremin (Inactive) [ 06/Nov/15 ]

Unfortunately the peer_credits setting cannot be different.
I have on one machine:

options ko2iblnd credits=2560 ntx=5120 concurrent_sends=63 peer_credits=16

on the other:

options ko2iblnd credits=2560 ntx=5120 concurrent_sends=63

and then got the following error:

LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Can't accept conn from 192.168.3.102@o2ib, queue depth too large:  16 (<=8 wanted)
LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Can't accept conn from 192.168.3.102@o2ib, queue depth too large:  16 (<=8 wanted)
LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Skipped 134 previous similar messages
LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Can't accept conn from 192.168.3.102@o2ib, queue depth too large:  16 (<=8 wanted)
LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Skipped 265 previous similar messages
LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Can't accept conn from 192.168.3.102@o2ib, queue depth too large:  16 (<=8 wanted)
LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Skipped 520 previous similar messages
LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Can't accept conn from 192.168.3.102@o2ib, queue depth too large:  16 (<=8 wanted)
LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Skipped 1020 previous similar messages
LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Can't accept conn from 192.168.3.102@o2ib, queue depth too large:  16 (<=8 wanted)
LNetError: 2895:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Skipped 1971 previous similar messages

Comment by Dmitry Eremin (Inactive) [ 06/Nov/15 ]

This was tested on master v2_7_62_0-25-g8248c89.
One machine:

Lustre: Lustre: Build Version: 2.7.62-g8248c89-CHANGED-2.6.32-573.7.1.el6_lustre.g95557d5.x86_64
LNet: Added LNI 192.168.3.102@o2ib [16/2560/0/180]

second machine:

Lustre: Lustre: Build Version: 2.7.62-g8248c89-CHANGED-2.6.32-573.7.1.el6_lustre.g95557d5.x86_64
LNet: Added LNI 192.168.3.104@o2ib [8/2560/0/180]

Comment by Jeremy Filizetti [ 06/Nov/15 ]

In patch set 4 and earlier, the call from kiblnd_reconnect() was storing the peer's maximums in the kib_peer_t, which was not recreated and would be reused when reconnecting through kiblnd_active_connect. With patch set 5 and later this was removed, so it's likely the rejected connections won't work for map_on_demand or peer credits, including the patch cherry-picked for master. However, it should accept connections from remote hosts that don't exceed the host's maximums. So the landed patch is missing half of the functionality.

Comment by Gerrit Updater [ 06/Nov/15 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: http://review.whamcloud.com/17074
Subject: LU-3322 lnet: make connect parameters persistent
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1d262cebcdccd8ba6a2f678f131d99df55f1b692

Comment by James A Simmons [ 06/Nov/15 ]

I have to say we haven't seen any of these problems, and I have been running an mlx4 <-> mlx5 setup with different peer_credits on each end. Will try the patch.

Comment by Dmitry Eremin (Inactive) [ 09/Nov/15 ]

With patch http://review.whamcloud.com/#/c/17074/2 it works, but I still see one error message:

on one machine:
LNet: Added LNI 192.168.3.102@o2ib [16/512/0/180]

on the other machine:
LNet: Added LNI 192.168.3.104@o2ib [8/256/0/180]
LNetError: 123936:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Can't accept conn from 192.168.3.102@o2ib, queue depth too large:  16 (<=8 wanted)

So this message is a bit confusing. Can you omit it?

Comment by Jeremy Filizetti [ 09/Nov/15 ]

We could alter the error message to say something like "Can't accept conn from 192.168.3.102@o2ib, queue depth too large: 16 (<=8 wanted), returning maximum supported values to peer for reconnect" to be more specific. The problem with leaving this out is that if a client without the patch were connecting, you would see no error messages explaining why the client can't connect; it wouldn't negotiate down to the lower supported value, it would just repeatedly fail with no error messages. Perhaps the better thing to do here is to bump the IBLND_MSG_VERSION and report errors only for clients with IBLND_MSG_VERSION_2. I'll have to take a look to see if that's a possibility.

Comment by Dmitry Eremin (Inactive) [ 09/Nov/15 ]

Is it expected that old clients cannot work if they have different settings?
So, this patch should be everywhere (servers, routers and clients). Is it possible to accommodate old clients and make them work?

Comment by Jeremy Filizetti [ 09/Nov/15 ]

See comment:

https://jira.hpdd.intel.com/browse/LU-3322?focusedCommentId=116413&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-116413

Unpatched clients can connect as long as their values are less than or equal, but if they are higher the patched system is unable to allow the connection, which is why it sends its supported values back in the rejection message. Unpatched clients ignore these parameters sent back and only support connections if their values match exactly.

Comment by Dmitry Eremin (Inactive) [ 09/Nov/15 ]

Thanks Jeremy,

From my point of view it would be better to say explicitly that we will try another handshake with different settings; in that case the error message will make sense. Or check the version and report it only for old clients.

P.S. I also understand that if the server cannot operate with a high value from an old client, it's not possible to support it. We need a clear message about this.

Comment by Dmitry Eremin (Inactive) [ 10/Nov/15 ]

Another issue I have:

The server with mlx5 has:
LNet: Added LNI 192.168.3.100@o2ib [16/2560/0/180]
LNetError: 107216:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Can't accept conn from 192.168.3.8@o2ib, queue depth too large:  128 (<=16 wanted)
LNet: 107216:0:(o2iblnd_cb.c:2291:kiblnd_passive_connect()) Can't accept conn from 192.168.3.8@o2ib (version 12): max_frags 32 incompatible without FMR pool (256 wanted)

The router with mlx4 and hfi has:
LNet: Added LNI 192.168.3.8@o2ib [128/4096/0/180]
LNet: Added LNI 192.168.5.8@o2ib1 [128/4096/0/180]
# lctl ping 192.168.3.100@o2ib0
failed to ping 192.168.3.100@o2ib: Input/output error
# lctl ping 192.168.5.200@o2ib1
12345-0@lo
12345-192.168.5.200@o2ib1

The old client with hfi has:
LNet: Added LNI 192.168.5.200@o2ib1 [128/4096/0/180]
# lctl ping 192.168.3.100@o2ib0
failed to ping 192.168.3.100@o2ib: Input/output error

The client and router have the same tunables. Only the server has different tunables, but both the server and the router have the new version of Lustre with the LU-3322 patch. So, again, different tunables don't work.

Comment by Jeremy Filizetti [ 10/Nov/15 ]

This is also explained in the earlier comment. Without FMR (mlx5 doesn't support it) or some other way to coalesce the fragments, you can't connect to a client with a lower number of frags because you will end up with RDMAs that are too fragmented. I'm assuming your server is running with map_on_demand=32 and your router is running without it set?

See https://jira.hpdd.intel.com/browse/LU-3322?focusedCommentId=127857&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-127857

Comment by Dmitry Eremin (Inactive) [ 10/Nov/15 ]

Yes, the router has map_on_demand=32 but the server does not. Why can't they handshake to workable parameters? Both have the latest version.

Comment by Jeremy Filizetti [ 10/Nov/15 ]

mlx5 doesn't support FMR pools, which are Lustre's only method of mapping a group of fragments into a single fragment. o2iblnd is written so that 1 LNet message = 1 o2iblnd message (2 technically: 1 IB RDMA paired with 1 IB Send), so there is no way to split the LNet message into multiple o2iblnd messages. This could be added, but IIRC it's not a trivial change due to credit accounting and failure handling.

Comment by Dmitry Eremin (Inactive) [ 10/Nov/15 ]

Hmm. There is no way to connect mlx5 <=> mlx4. With map_on_demand=256 on mlx4 I got:

LNetError: 4927:0:(o2iblnd.c:866:kiblnd_create_conn()) Can't create QP: -22, send_wr: 65792, recv_wr: 512
LNetError: 4927:0:(o2iblnd.c:866:kiblnd_create_conn()) Skipped 3 previous similar messages

With map_on_demand=0 on mlx4 I got the following on the mlx5 node:

LNetError: 107218:0:(o2iblnd_cb.c:2655:kiblnd_rejected()) 192.168.3.8@o2ib rejected: o2iblnd no resources
LNetError: 107218:0:(o2iblnd_cb.c:2655:kiblnd_rejected()) 192.168.3.8@o2ib rejected: o2iblnd no resources
LNetError: 107218:0:(o2iblnd_cb.c:2655:kiblnd_rejected()) 192.168.3.8@o2ib rejected: o2iblnd no resources
LNetError: 107218:0:(o2iblnd_cb.c:2655:kiblnd_rejected()) 192.168.3.8@o2ib rejected: o2iblnd no resources

Comment by Jeremy Filizetti [ 10/Nov/15 ]

Lower your peer_credits and concurrent_sends on the router and you can connect with map_on_demand=0. If you lower your concurrent_sends (and possibly your peer/router credits) on the server you won't get the QP: -22 error.
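
For example (a sketch only; the exact numbers are assumptions, although peer_credits=16 matches what Dmitry later reports as working), the router side could be dropped to something like:

options ko2iblnd peer_credits=16 concurrent_sends=16 map_on_demand=0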

Comment by Dmitry Eremin (Inactive) [ 10/Nov/15 ]

Hmm. It looks like we cannot use mlx5 and OPA/TS cards in an optimal way anyway.
If I set the optimal OPA/TS settings on mlx4:

options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1

I cannot connect, because on the mlx5 side with default settings I get:

LNetError: 107215:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Can't accept conn from 192.168.3.8@o2ib, queue depth too large:  128 (<=8 wanted)
LNet: 107215:0:(o2iblnd_cb.c:2291:kiblnd_passive_connect()) Can't accept conn from 192.168.3.8@o2ib (version 12): max_frags 32 incompatible without FMR pool (256 wanted)

If I set map_on_demand=0 I get the QP: -22 error from mlx4. Only when I decrease peer_credits to 16 on mlx4 does the communication between mlx4 and mlx5 start working.

So the current solution has very limited usage, because I cannot use optimal tunables anyway. We cannot recommend it for hybrid fabrics with mlx5.

Comment by Amir Shehata (Inactive) [ 10/Nov/15 ]

If I understand the description properly, the problem on mlx5 is due to:

LNetError: 107215:0:(o2iblnd_cb.c:2264:kiblnd_passive_connect()) Can't accept conn from 192.168.3.8@o2ib, queue depth too large:  128 (<=8 wanted)

That happens because the mlx4 has peer_credits set to 128 and mlx5 has that set to the default. Try setting the mlx5 peer_credits to 128 as well.

Have you tried that?

Comment by Dmitry Eremin (Inactive) [ 10/Nov/15 ]

I cannot set peer_credits higher than 16 on mlx5 because of LU-7124.

Comment by Amir Shehata (Inactive) [ 11/Nov/15 ]

I updated http://review.whamcloud.com/#/c/17074/.

Comment by Andreas Dilger [ 11/Nov/15 ]

What about something like patch http://review.whamcloud.com/16141? Having larger-order allocations would reduce the number of fragments that need to be sent.

Comment by Dmitry Eremin (Inactive) [ 16/Nov/15 ]

Amir, unfortunately the latest patch set (#4) doesn't work with different settings either. On the mlx4 side it reports:

[643723.574292] LNet: 45941:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn from 192.168.3.102@o2ib (version 12): max_frags 256 too large (32 wanted)

Just to recall: on the mlx5 side peer_credits=16, on the mlx4 side peer_credits=128. Probably Andreas is right and we should reduce the number of allocations to make it workable.

Comment by Amir Shehata (Inactive) [ 17/Nov/15 ]

Dmitry,

The message you posted only indicates that the active side has sent a connection request to the passive side, but the passive side can't handle 256. The reply to the connection attempt should go back to the active side indicating 32, and the active side should be attempting a reconnect. Is that not happening? If the active side doesn't have map-on-demand, the negotiation will fail. What does the log on the active side say?

Comment by Dmitry Eremin (Inactive) [ 17/Nov/15 ]

Amir,

On the active side there is nothing in the kernel messages, but lst shows the error:

create session RPC failed on 12345-192.168.3.8@o2ib: Unknown error 18446744073709551503

P.S. The active side is mlx5, so it doesn't have map-on-demand.

Comment by Dmitry Eremin (Inactive) [ 17/Nov/15 ]

With map_on_demand=32 on mlx5 I get the following:

LNetError: 208538:0:(o2iblnd.c:2085:kiblnd_net_init_pools()) Can't set fmr pool size (512) < ntx / 4(1280)
LNetError: 208538:0:(o2iblnd.c:2906:kiblnd_startup()) Failed to initialize NI pools: -22
LNetError: 105-4: Error -100 starting up LNI o2ib
LNetError: 208538:0:(rpc.c:1605:srpc_startup()) LNetNIInit() has failed: -100
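
Reading that first error (my interpretation, assuming ntx=5120 as in the earlier client configuration and, presumably, the default fmr_pool_size of 512): o2iblnd requires fmr_pool_size >= ntx / 4, so an explicit fmr_pool_size of at least 1280 would be needed here:

  echo $(( 5120 / 4 ))    # 1280, the minimum fmr_pool_size for ntx=5120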

Comment by Jeremy Filizetti [ 17/Nov/15 ]

If you are still using your previous configuration you could modify things to make everything work.

1. Server with mlx5 doesn't support FMR so map_on_demand should remain 0.
2. Router can support FMR with the mlx4 (not sure about hfi) so you could use map_on_demand=256. You will need to drop your concurrent_sends to <=62 I believe for the QP to not fail creation for the mlx4.
3. For the client with hfi, I don't know whether it supports FMR, but you could use the defaults (map_on_demand=0) if not.

With all that I think your configuration will be able to connect to everything, at least from what I quickly glanced over with respect to the map_on_demand settings (a rough per-node sketch follows below).
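
As a rough sketch of what that could look like per node type (only the map_on_demand and concurrent_sends values come from the points above; everything else is left out or assumed):

# server (mlx5, no FMR support)
options ko2iblnd map_on_demand=0
# router (mlx4, FMR capable)
options ko2iblnd map_on_demand=256 concurrent_sends=62
# client (hfi, if FMR is unsupported)
options ko2iblnd map_on_demand=0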

Comment by Olaf Weber [ 17/Nov/15 ]

FWIW 18446744073709551503 = -113 assuming 2's complement. On x86 Linux, I think that's also -EHOSTUNREACH.
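
A quick way to reproduce that conversion with bash's builtin printf (my own aside; EHOSTUNREACH is 113 on Linux):

  printf '%u\n' -113    # prints 18446744073709551503, i.e. -113 viewed as an unsigned 64-bit value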

Comment by Dmitry Eremin (Inactive) [ 17/Nov/15 ]

The following parameters are compatible with mlx5 and work fine for OPA:

options ko2iblnd-opa peer_credits=62 peer_credits_hiw=64 credits=1024 concurrent_sends=62 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1

Comment by Chris Hunter (Inactive) [ 19/Nov/15 ]

There are different bundles of the OFED/InfiniBand kernel drivers available from different sources, e.g. openfabrics.org OFED, Mellanox OFED (MOFED), TrueScale OFED+ and RHEL rdma. All these packages seem to tweak the kernel drivers (e.g. different versions, additional code, etc.). To further confuse the list, there are multiple versions of the packages (e.g. MOFED 2.x, 3.x).

Which IB driver packaging are you using for testing the ko2iblnd patches? Stock OFED? MOFED?

thanks,
chris hunter

Comment by Jeremy Filizetti [ 19/Nov/15 ]

I only used stock CentOS 6 kernels for the initial work. Only after noting that mlx5 memory registration does not support FMR have I even started looking at and testing Mellanox OFED. There are not any issues with this patch and map_on_demand settings that I'm aware of, even though they seem to be getting reported here as such. The problem is that it requires too much low-level understanding of the driver to configure, and ko2iblnd does not abstract the differing hardware well enough at this point. My goal with this patch was to allow interop between systems configured for IB WAN performance and those that may come from a vendor solution with different parameters. Given the current memory registration upstream changes and lack of flexibility, ko2iblnd really needs some additional work to make things more robust and support multiple configurations. This patch only serves as a stop-gap for that larger necessary work. The best that can really be done is to make recommendations to people based on their needs here.

Comment by Chris Hunter (Inactive) [ 19/Nov/15 ]

Hi Jeremy,
Thanks for the explanation; at least it shows the challenges involved. From your comments, the default ko2iblnd-opa parameters (i.e. LU-6735) should work for ConnectX[123] and TrueScale adapters.
On our hardware we have max work requests per QP (max_qp_wr) values of 16351, 16383 or 16384.

Comment by Jeremy Filizetti [ 20/Nov/15 ]

I don't know anything about ko2iblnd-opa; I've always used a custom module parameter file for Lustre. Now that you've included the link I see this is now being included with Lustre, which I wasn't aware of before. From what I can see you should be OK to use those parameters with those adapters, but I'm not sure they are the "ideal" settings.

Comment by Doug Oucharek (Inactive) [ 20/Nov/15 ]

This ticket seems to have expanded into a catch-all for anything related to map_on_demand/peer_credits settings. I'd rather see this ticket be used for its original purpose, which Jeremy describes above. Anything new should become a new ticket (or set of tickets) so we don't get confused and link a bunch of new tickets to this one believing these patches will solve all issues in this area.

Once patch http://review.whamcloud.com/#/c/17074/ has landed, I'd like this ticket to be closed. If there are any more problems with optimized settings for specific hardware setups, please open separate tickets so they can be prioritized and addressed accordingly.

Comment by Gerrit Updater [ 24/Nov/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17074/
Subject: LU-3322 lnet: make connect parameters persistent
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4c689a573fafcfa1ca7474a275f958e00b1deddc

Comment by Joseph Gmitter (Inactive) [ 24/Nov/15 ]

http://review.whamcloud.com/17074/ has landed for 2.8.
Resolving the ticket as noted in the commentary above.
