[LU-8225] router node: Failed to create FMR pool: -38 Created: 01/Jun/16  Updated: 01/Jun/16  Resolved: 01/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Dmitry Eremin (Inactive)
Resolution: Duplicate Votes: 0
Labels: llnl
Environment:

RHEL7.2 derivative: 3.10.0-327.13.1.3chaos.ch6.x86_64 #1 SMP Wed May 11 18:38:20 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
lustre-2.8.0_0.0.llnlpreview.13-1.ch6.x86_64
router has two interfaces, omnipath on compute side:
05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
81:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 10)


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On a router node with both omnipath and mellanox interfaces, I see the following in the output of journalctl -xe:

-- Unit lnet.service has begun starting up.
kernel: LNet: Added LNI 192.168.128.187@o2ib18 [128/8192/0/180]
kernel: fmr_pool: Device mlx5_0 does not support FMRs
kernel: LNetError: 7963:0:(o2iblnd.c:1459:kiblnd_create_fmr_pool()) Failed to create FMR pool: -38
kernel: LNetError: 7963:0:(o2iblnd.c:2096:kiblnd_net_init_pools()) Can't initialize FMR pool for CPT 0: -38
kernel: LNetError: 7963:0:(o2iblnd.c:2895:kiblnd_startup()) Failed to initialize NI pools: -38
kernel: LNetError: 105-4: Error -100 starting up LNI o2ib
kernel: LNetError: 801:0:(o2iblnd_cb.c:2297:kiblnd_passive_connect()) Can't accept conn from 192.168.128.37@o2ib
kernel: LNetError: 801:0:(o2iblnd_cb.c:2297:kiblnd_passive_connect()) Skipped 20 previous similar messages
kernel: LNet: Removed LNI 192.168.128.187@o2ib18
lnet[7960]: LNET configure error 100: Network is down
systemd[1]: lnet.service: control process exited, code=exited status=1
systemd[1]: Failed to start SYSV: Part of the lustre file system..
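For readers unfamiliar with the numeric codes in the log above, here is a minimal sketch that pulls the function name and return code out of an LNetError line and maps the code to its Linux errno name. The regular expression and the helper name decode are illustrative assumptions, not part of Lustre. The errno table is hardcoded to the two values that appear in this ticket (38 and 100, Linux numbering) so the result does not depend on the platform running the script.

```python
import re

# Linux errno values that appear in the log above (Linux numbering).
LINUX_ERRNO = {
    38: ("ENOSYS", "Function not implemented"),
    100: ("ENETDOWN", "Network is down"),
}

# Hypothetical pattern for lines such as:
#   LNetError: 7963:0:(o2iblnd.c:1459:kiblnd_create_fmr_pool()) Failed to create FMR pool: -38
LNET_ERROR = re.compile(
    r"LNetError: \d+:\d+:\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*?): (?P<rc>-\d+)$"
)


def decode(line):
    """Return (function, message, errno name, errno text), or None if no match."""
    m = LNET_ERROR.search(line)
    if m is None:
        return None
    rc = -int(m.group("rc"))  # kernel code returns negative errno values
    name, text = LINUX_ERRNO.get(rc, ("UNKNOWN", "unknown errno"))
    return m.group("func"), m.group("msg"), name, text


if __name__ == "__main__":
    sample = ("kernel: LNetError: 7963:0:(o2iblnd.c:1459:kiblnd_create_fmr_pool()) "
              "Failed to create FMR pool: -38")
    print(decode(sample))
```

Read this way, the three -38 lines report ENOSYS, which is consistent with the preceding "Device mlx5_0 does not support FMRs" message, and the final error 100 is ENETDOWN, matching the "Network is down" text printed by the lnet script. Lines in other formats, such as the "105-4: Error -100" line, are not matched by this pattern and return None.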

I do not encounter this on the compute nodes, which have only omnipath, nor on the lustre servers, which have only mellanox.
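Since the failure only shows up on nodes that carry both adapter types, a quick way to flag such nodes may be useful. The sketch below classifies RDMA device names by prefix. The name mlx5_0 comes from the log above; the hfi1 prefix for Omni-Path devices is an assumption, and both function names are hypothetical.

```python
def hca_families(device_names):
    """Group RDMA device names into adapter families by (assumed) name prefix."""
    families = set()
    for name in device_names:
        if name.startswith("mlx"):
            families.add("mellanox")
        elif name.startswith("hfi1"):  # assumption: Omni-Path devices are named hfi1_N
            families.add("omnipath")
        else:
            families.add("other")
    return families


def is_mixed_router(device_names):
    """True if the node has both a Mellanox and an Omni-Path device."""
    return {"mellanox", "omnipath"} <= hca_families(device_names)


if __name__ == "__main__":
    print(is_mixed_router(["mlx5_0", "hfi1_0"]))  # router-like node
    print(is_mixed_router(["hfi1_0"]))            # compute-like node
```

On a live Linux node the device names could be taken from os.listdir("/sys/class/infiniband"), provided that directory exists on the system.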

Lustre 2.8 ships with /etc/modprobe.d/ko2iblnd.conf, which contains:

alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1
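To make it easier to see which of these settings bear on the FMR failure, here is a small sketch that splits the options line into a module name and a parameter dictionary, then picks out the FMR-related entries. The function name parse_options_line is hypothetical. Treating map_on_demand as FMR-related is an assumption on my part; the ticket itself does not state which parameter triggers FMR pool creation.

```python
def parse_options_line(line):
    """Split a modprobe 'options <module> key=value ...' line into its parts."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "options":
        raise ValueError("not a modprobe 'options' line")
    module = fields[1]
    params = {}
    for item in fields[2:]:
        key, _, value = item.partition("=")
        params[key] = value
    return module, params


LINE = ("options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 "
        "concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 "
        "fmr_flush_trigger=512 fmr_cache=1")

module, params = parse_options_line(LINE)

# Assumption: map_on_demand is grouped with the fmr_* parameters.
fmr_related = {k: v for k, v in params.items()
               if k.startswith("fmr_") or k == "map_on_demand"}

if __name__ == "__main__":
    print(module)
    print(fmr_related)
```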


 Comments   
Comment by Olaf Faaland [ 01/Jun/16 ]

This did not come up with lustre 2.5; it's new with lustre 2.8.

Comment by Olaf Faaland [ 01/Jun/16 ]

Note that this occurs when attempting to start lnet. LNet fails to start as a result.

Comment by Olaf Faaland [ 01/Jun/16 ]

Removing /etc/modprobe.d/ko2iblnd.conf allows lnet to start successfully. lctl pings from client->server and server->client (through the router) then work as expected.

Comment by Peter Jones [ 01/Jun/16 ]

Olaf

It seems that this is a duplicate of LU-5783 which is queued up for inclusion in the 2.8.1 FE release - http://review.whamcloud.com/#/c/19024/.

Dmitry

Please can you provide any further advice LLNL needs on this topic?

Thanks

Peter

Comment by Olaf Faaland [ 01/Jun/16 ]

Peter,
I agree this is a duplicate. Thanks for finding the original. If Dmitry has no other comments feel free to close this ticket.
thanks,
Olaf

Generated at Sat Feb 10 02:15:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.