[LU-10129] map-on-demand set to 32 doesn't work on OPA Created: 17/Oct/17 Updated: 02/Jan/19 Resolved: 22/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Amir Shehata (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Severity: | 3 |
| Description |
|
With patch https://review.whamcloud.com/29290/ applied, setting map-on-demand to 32 doesn't work on OPA. map-on-demand must be set to 256 for large transfers to complete. |
| Comments |
| Comment by Olaf Faaland [ 20/Oct/17 ] |
|
We do have OPA fabrics, so I've added the llnl and topllnl labels to this ticket based on the following comment in
I haven't seen unexplained symptoms on our OPA-connected nodes. What would I likely see? disconnects/reconnects? Failed BRW operations? |
| Comment by Amir Shehata (Inactive) [ 23/Oct/17 ] |
|
At the LND layer, RDMA writes fail. That could translate into RPC failures, bulk write failures, and other file system issues. |
| Comment by Amir Shehata (Inactive) [ 24/Oct/17 ] |
|
Here is a summary of my investigation on map-on-demand, and a proposal for a change to resolve the issues around this. I'd like feedback to see if I missed something: |
| Comment by James A Simmons [ 25/Oct/17 ] |
|
While we are looking to fix this, maybe it's time to revisit |
| Comment by Gerrit Updater [ 08/Nov/17 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29995 |
| Comment by Gerrit Updater [ 29/Nov/17 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/30309 |
| Comment by Chris Hunter (Inactive) [ 12/Dec/17 ] |
|
Does the hfi1 module parameter "num_user_contexts" have to match the LND map_on_demand setting? |
| Comment by Gerrit Updater [ 17/Dec/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30309/ |
| Comment by Olaf Faaland [ 20/Dec/17 ] |
|
Do this patch and 29290 need to be backported to 2.10? I'm not requesting it, just asking the question. |
| Comment by Gerrit Updater [ 22/Dec/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29995/ |
| Comment by Peter Jones [ 22/Dec/17 ] |
|
Landed for 2.11. Flagged for consideration for 2.10.x until Amir is back in the office and able to comment |
| Comment by Amir Shehata (Inactive) [ 04/Jan/18 ] |
|
This list of patches doesn't need to be ported to 2.10.x. They were instigated by "LU-9983 ko2iblnd: allow for discontiguous fragments", which is not in 2.10. Di committed his change "LU-9983 osp: align the OSP request size by 4k", which should hide the initial problem of discontiguous fragments. |
| Comment by Chris Hunter (Inactive) [ 17/Jan/18 ] |
|
|