[LU-10129] map-on-demand set to 32 doesn't work on OPA Created: 17/Oct/17  Updated: 02/Jan/19  Resolved: 22/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug
Priority: Critical
Reporter: Amir Shehata (Inactive)
Assignee: Amir Shehata (Inactive)
Resolution: Fixed
Votes: 0
Labels: llnl

Issue Links:
Related
is related to LU-7650 ko2iblnd map_on_demand can't negotita... Resolved
is related to LU-10157 LNET_MAX_IOV hard coded to 256 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With patch https://review.whamcloud.com/29290/ applied, setting map-on-demand to 32 doesn't work on OPA. map-on-demand needs to be set to 256 for large transfers to complete.



 Comments   
Comment by Olaf Faaland [ 20/Oct/17 ]

We do have OPA fabrics, so I've added the llnl and topllnl labels to this ticket based on the following comment in LU-10089:

One thing to note, if you're using OPA you should use map-on-demand set to 256. I'm still analyzing this issue and hopefully will have a patch soon. This issue is tracked under LU-10129

I haven't seen unexplained symptoms on our OPA-connected nodes.

What would I likely see? Disconnects/reconnects? Failed BRW operations?

Comment by Amir Shehata (Inactive) [ 23/Oct/17 ]

At the LND layer RDMA writes fail. So that could translate to RPC failures/bulk write failures, and other FS issues.

Comment by Amir Shehata (Inactive) [ 24/Oct/17 ]

Here is a summary of my investigation on map-on-demand, and a proposal for a change to resolve the issues around this. I'd like feedback to see if I missed something:
https://wiki.hpdd.intel.com/display/LNet/o2iblnd+map_on_demand

Comment by James A Simmons [ 25/Oct/17 ]

While we are looking to fix this, maybe it's time to revisit LU-7650.

Comment by Gerrit Updater [ 08/Nov/17 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29995
Subject: LU-10129 lnd: rework map_on_demand behavior
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d6fb47b0e3cc1521f8f33dcc7dd78a9abcb27154

Comment by Gerrit Updater [ 29/Nov/17 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/30309
Subject: LU-10129 lnd: set device capabilities
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 200fb9081989add33457554a221ecfe087da1176

Comment by Chris Hunter (Inactive) [ 12/Dec/17 ]

Does the hfi1 module parameter "num_user_contexts" have to match the LND map_on_demand setting?

Comment by Gerrit Updater [ 17/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30309/
Subject: LU-10129 lnd: set device capabilities
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 70a73c52d5a780e580e49ccfa778c4beab340c9c

Comment by Olaf Faaland [ 20/Dec/17 ]

Do this patch and 29290 need to be backported to 2.10?  I'm not requesting it, just asking the question.

Comment by Gerrit Updater [ 22/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29995/
Subject: LU-10129 lnd: rework map_on_demand behavior
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2c38e8f5b08e7475ffc37915fe0300505a5759db

Comment by Peter Jones [ 22/Dec/17 ]

Landed for 2.11. Flagged for consideration for 2.10.x until Amir is back in the office and able to comment.

Comment by Amir Shehata (Inactive) [ 04/Jan/18 ]

These patches don't need to be ported to 2.10.x. They were instigated by:

LU-9983 ko2iblnd: allow for discontiguous fragments

which is not in 2.10.

Di committed his change:

LU-9983 osp: align the OSP request size by 4k

which should hide the initial problem of discontiguous fragments.

Comment by Chris Hunter (Inactive) [ 17/Jan/18 ]

LU-9500 proposes a similar fix for fragment alignment.
