[LU-9472] FastReg (MLX5) support breaks when map_on_demand > 0 Created: 09/May/17  Updated: 08/Sep/17  Resolved: 20/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Upstream
Fix Version/s: Lustre 2.10.0, Upstream

Type: Bug Priority: Critical
Reporter: Doug Oucharek (Inactive) Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: lnet

Issue Links:
Duplicate
is duplicated by LU-9932 LU-9026, LU-9500, and LU-9472 backports Resolved
Related
is related to LU-9500 MOFED 4/mlx5: Aligning non-aligned pa... Resolved
is related to LU-6215 Sync Lustre external tree with lustre... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When building against MOFED 4, the default for map_on_demand switches from 0 to 256.  This breaks MLX5-based cards, which use the FastReg support in ko2iblnd.  There are three problems with FastReg which need to be fixed:

  1. In kiblnd_fmr_pool_map() when using elements from the fpo_pool_list, if the list runs out, the current code is setting rc to -EBUSY when it should be -EAGAIN.  EAGAIN triggers the pool to be made bigger.  EBUSY just fails the transfer and connection (not what we want).
  2. Even after I fix the setting of rc in number 1, bringing down the network via "lctl network down" trips this assert: 
    [ 1172.255552] LNetError: 10176:0:(o2iblnd.c:1421:kiblnd_destroy_fmr_pool()) ASSERTION( fpo->fpo_map_count == 0 ) failed: 
  3. Every time the pool size is increased, I keep seeing this annoying log (with neterror on): 
    May  9 00:22:26 trevis-407 kernel: LNet: Using FastReg for registration

The first 2 items are blockers and must be fixed ASAP.  The 3rd might as well be addressed at the same time.
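
To illustrate item 1, here is a minimal user-space sketch of the intended return-code behaviour. It is not the ko2iblnd code: demo_pool, demo_pool_map() and demo_map_with_grow() are hypothetical names, and the "grow and retry once" policy is an assumption made for illustration. The point is only that an empty free list should report -EAGAIN so the caller can enlarge the pool, rather than -EBUSY, which the caller treats as a hard failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a FastReg pool; not the real ko2iblnd structure. */
struct demo_pool {
	int free_count;   /* descriptors currently on the free list */
	int grow_count;   /* number of times the pool has been enlarged */
};

/* Take one descriptor; report -EAGAIN (not -EBUSY) when the list is empty. */
static int demo_pool_map(struct demo_pool *pool)
{
	assert(pool != NULL);
	if (pool->free_count == 0)
		return -EAGAIN;   /* signals "grow the pool and retry" */
	pool->free_count--;
	return 0;
}

/* Assumed caller policy: grow once on -EAGAIN, pass any other error up. */
static int demo_map_with_grow(struct demo_pool *pool, int grow_by)
{
	int rc = demo_pool_map(pool);

	if (rc == -EAGAIN) {
		pool->free_count += grow_by;
		pool->grow_count++;
		rc = demo_pool_map(pool);
	}
	return rc;
}
```

With a pool holding one free descriptor, the first demo_map_with_grow() call succeeds without growing; the second hits the empty list, grows the pool, and then succeeds. Had demo_pool_map() returned -EBUSY instead, the second call would simply fail.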

 



 Comments   
Comment by Doug Oucharek (Inactive) [ 09/May/17 ]

The main problem, it turns out, is that the unmap routine is never being called for FastReg.  As such, we have to keep growing the pool and assert when trying to shut down networking (because pool items are leaking).
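
A rough user-space sketch of why the missing unmap call produces both symptoms. The names demo_fpo, demo_map(), demo_unmap() and demo_can_destroy() are hypothetical; only the map_count == 0 condition is taken from the assertion text quoted in the description. Every map increments the count, and only the matching unmap brings it back down, so a path that never unmaps leaves the count above zero at shutdown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool bookkeeping; only the counter matters for this sketch. */
struct demo_fpo {
	int map_count;   /* descriptors handed out and not yet returned */
};

/* Hand out a descriptor. */
static void demo_map(struct demo_fpo *fpo)
{
	fpo->map_count++;
}

/* Return a descriptor; this is the step that was being skipped for FastReg. */
static void demo_unmap(struct demo_fpo *fpo)
{
	assert(fpo->map_count > 0);
	fpo->map_count--;
}

/* Mirrors the shutdown condition: the pool is safe to destroy only when idle. */
static bool demo_can_destroy(const struct demo_fpo *fpo)
{
	return fpo->map_count == 0;
}
```

A pool that is mapped twice and never unmapped reports demo_can_destroy() as false, which corresponds to the failed assertion on network shutdown. A pool where each demo_map() is paired with demo_unmap() can be destroyed cleanly.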

Comment by James A Simmons [ 09/May/17 ]

Thanks for finding this.

Comment by Gerrit Updater [ 09/May/17 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/27015
Subject: LU-9472 lnd: Fix FastReg map/unmap for MLX5
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a6a1d45a72360b5cc7e9e3a65c7456fa62c19192

Comment by Alexey Lyashkov [ 15/May/17 ]

Tested the fix and it works for me.

Comment by Gerrit Updater [ 20/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27015/
Subject: LU-9472 lnd: Fix FastReg map/unmap for MLX5
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b436c75d9488222190de8b30f56d720f8ec63d6f

Comment by Peter Jones [ 20/May/17 ]

Landed for 2.10

Comment by Doug Oucharek (Inactive) [ 17/Aug/17 ]

Has this been pushed upstream yet?

Comment by James A Simmons [ 17/Aug/17 ]

Not yet.
