[LU-10157] LNET_MAX_IOV hard coded to 256 Created: 24/Oct/17 Updated: 24/Jan/24 Resolved: 16/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.6, Lustre 2.14.0, Lustre 2.12.7 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Amir Shehata (Inactive) | Assignee: | Alexey Lyashkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
LNET_MAX_IOV is hard coded to 256, which works well for architectures with a 4096-byte PAGE_SIZE. However, on systems with page sizes larger or smaller than 4K it results in incorrect MD count calculations, causing write errors. |
| Comments |
| Comment by Amir Shehata (Inactive) [ 15/Dec/17 ] |
|
This issue can be worked around by setting max_pages_per_rpc to 16. The power8 page size is 65536, so a value of 64 yields a 4MB RPC write. When we reduced that to a 1MB RPC write (max_pages_per_rpc == 16), the RDMA write worked properly.

Looking at the code, the problem appears to be the way we calculate the number of MDs to describe the data to RDMA. In ptlrpc_register_bulk():

total_md = (desc->bd_iov_count + LNET_MAX_IOV - 1) / LNET_MAX_IOV;

total_md is the number of MDs used to transfer all the data that needs to be transferred. Unfortunately, LNET_MAX_IOV is hard-coded to 256. 256 is used because there is an underlying assumption that the page size is 4K, thus 256 * 4K = 1MB. On the power8 machine the page size is 64K, and they set max_pages_per_rpc to 64, intending 4MB RPC sizes.

As an example, take the case where we max out the number of pages at 64. Plugging the numbers into the above equation gives:

(64 + 256 - 1) / 256 = 1

So you essentially try to shove 4MB worth of data into 1 MD, while LNet expects each MD to describe only 1MB of buffers. If we instead define

#define LNET_MAX_IOV ((1024 * 1024) / PAGE_SIZE)

then on a power8 machine LNET_MAX_IOV will be 16, resulting in:

(64 + 16 - 1) / 16 = 4

4 MDs, each describing 1MB of data, which should work. This analysis is also consistent with the results of our testing: reducing max_pages_per_rpc to 16 works because no matter what, there is then at most 1MB in 1 MD. |
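A minimal standalone sketch of the arithmetic above, assuming a page-size-dependent LNET_MAX_IOV; total_md() here is an illustrative helper compiled in userspace, not the actual ptlrpc code:

#include <stdio.h>

/* Round-up MD count formula from ptlrpc_register_bulk(), with the
 * proposed page-size-dependent LNET_MAX_IOV (fragments per 1MB MD). */
static unsigned int total_md(unsigned int bd_iov_count, unsigned int page_size)
{
	unsigned int lnet_max_iov = (1024 * 1024) / page_size;

	return (bd_iov_count + lnet_max_iov - 1) / lnet_max_iov;
}

int main(void)
{
	/* x86: 4K pages, 256 pages per RPC -> 1MB RPC, 1 MD. */
	printf("4K pages,  256 frags: %u MDs\n", total_md(256, 4096));
	/* power8: 64K pages, 64 pages per RPC -> 4MB RPC, 4 MDs. */
	printf("64K pages,  64 frags: %u MDs\n", total_md(64, 65536));
	return 0;
}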
| Comment by Gerrit Updater [ 06/Mar/18 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31559 |
| Comment by Gerrit Updater [ 14/Apr/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31559/ |
| Comment by Peter Jones [ 15/Apr/18 ] |
|
Landed for 2.12 |
| Comment by Gerrit Updater [ 12/Sep/18 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33148 |
| Comment by Alexey Lyashkov [ 12/Dec/19 ] |
|
The landed patch doesn't help an ARM64 client with a 64K page size work with x86 servers when 4MB bulk is allowed. |
| Comment by Alexey Lyashkov [ 19/Dec/19 ] |
|
This is a ptlrpc bug, not an LNet one, and it should be fixed at bulk preparation time. |
| Comment by Gerrit Updater [ 31/Jan/20 ] |
|
Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/37385 |
| Comment by Gerrit Updater [ 31/Jan/20 ] |
|
Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/37386 |
| Comment by Gerrit Updater [ 31/Jan/20 ] |
|
Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/37387 |
| Comment by James A Simmons [ 31/Jan/20 ] |
|
Resetting LNET_MAX_IOV to 256 will make the Power8 nodes send 16MB instead of 1MB. This will put memory pressure on the compute nodes that will impact user applications. What I have been thinking about is breaking the relationship between LNet fragments and the page size, like what is done with the page cache. This can be done using XArrays. The XArray API has the ability to map a range of indexes to the same object; this was added for huge page support. So in the case of LNet we could allocate an XArray 256 entries in size, and then on a 64K-page system accessing index 7 would give you the first page at the proper offset, while on an x86 system it would give you the 7th page. I will work on adding the XArray infrastructure to Lustre today. |
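A kernel-style sketch of the multi-index idea described above, assuming xa_store_range() is used to make one large page visible at a range of 4K-granularity indexes; map_large_page() and lookup_frag() are hypothetical helpers, not existing Lustre code:

#include <linux/xarray.h>
#include <linux/mm.h>

static DEFINE_XARRAY(frag_xa);

/* Store one (possibly 64K) page at all of the 4K-granularity indexes it
 * covers; xa_store_range() maps the whole index range to the same entry. */
static int map_large_page(struct page *page, unsigned long first_index)
{
	unsigned int subpages = PAGE_SIZE / 4096;	/* 16 on a 64K-page system */

	return xa_err(xa_store_range(&frag_xa, first_index,
				     first_index + subpages - 1, page,
				     GFP_KERNEL));
}

/* On 64K pages, indexes 0..15 all resolve to the same struct page (the
 * caller would add the in-page offset); on x86 each index is its own page. */
static struct page *lookup_frag(unsigned long index)
{
	return xa_load(&frag_xa, index);
}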
| Comment by Alexey Lyashkov [ 01/Feb/20 ] |
|
James, it looks like you read the patch incorrectly. You mentioned this on Slack, and I already replied there. An XArray is not needed for this at all. I added code that limits an MD by transfer size in addition to the fragment count, as before:
+ if (((desc->bd_iov_count % LNET_MAX_IOV) == 0) ||
+ (desc->bd_nob_last == LNET_MTU)) {
+ desc->bd_mds_off[desc->bd_md_count] = desc->bd_iov_count;
+ desc->bd_md_count ++;
+ desc->bd_nob_last = 0;
+ LASSERT(desc->bd_md_count <= PTLRPC_BULK_OPS_COUNT);
+ }
So the client is not able to send a 16MB transfer as you suggest. The whole patch series was tested on an emulated AArch64 system with RHEL8 and a 64K page size (ConnectIB VF), and on real AArch64 hardware with RHEL8 (Mellanox CX-4 card) and the same page size; the server side was a Sonexion L300 (2.12 based + Mellanox CX-4) and Lustre 2.5 with a ConnectX-3 card. |
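A standalone sketch of the splitting rule in the quoted hunk, assuming an MD is closed either when LNET_MAX_IOV fragments have accumulated or when the bytes in the current MD reach LNET_MTU (1MB), whichever comes first; md_count() is an illustrative helper, not the actual patch code:

#include <stdio.h>

#define LNET_MTU	(1 << 20)
#define LNET_MAX_IOV	256

static unsigned int md_count(unsigned int nfrags, unsigned int frag_size)
{
	unsigned int mds = 0, frags_in_md = 0, nob_in_md = 0, i;

	for (i = 0; i < nfrags; i++) {
		frags_in_md++;
		nob_in_md += frag_size;
		if (frags_in_md == LNET_MAX_IOV || nob_in_md >= LNET_MTU) {
			mds++;		/* close the current MD */
			frags_in_md = 0;
			nob_in_md = 0;
		}
	}
	if (frags_in_md)
		mds++;			/* partially filled last MD */
	return mds;
}

int main(void)
{
	/* 64K-page client, 64 fragments (4MB): the byte limit splits it into 4 MDs. */
	printf("64 x 64K frags:  %u MDs\n", md_count(64, 65536));
	/* 4K-page client, 1024 fragments (4MB): the fragment limit also gives 4 MDs. */
	printf("1024 x 4K frags: %u MDs\n", md_count(1024, 4096));
	return 0;
}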
| Comment by James A Simmons [ 01/Feb/20 ] |
|
I'm looking at struct lnet_libmd, which has lnet_kiov_t kiov[LNET_MAX_IOV], which ends up being 256 pages, no? |
| Comment by Alexey Lyashkov [ 01/Feb/20 ] |
|
No, up to 256 fragments, depending on the overall transfer size. |
| Comment by Cory Spitz [ 07/Feb/20 ] |
|
Hmm, somehow there are two comments here from Gerrit Updater about https://review.whamcloud.com/#/c/37386, but none about https://review.whamcloud.com/#/c/37387. Please note that three pending patches (37385, 37386, and 37387) are tied to this ticket. |
| Comment by James A Simmons [ 07/Feb/20 ] |
|
I really need to think about this. I see potential for confusion between the fragment count and the number of pages; I see it in the patch for the lnet selftest userland tool, for example. I think this can be done a better way, especially since this change touches all of the LNet layer yet it's an IB-specific issue. |
| Comment by Alexey Lyashkov [ 08/Feb/20 ] |
|
I think LNet should check the MD region size before send or attach. Another possibility is to export a fragments/MTU size tunable to ptlrpc instead of using defines. |
| Comment by Gerrit Updater [ 06/Jun/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37386/ |
| Comment by Gerrit Updater [ 06/Jun/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37387/ |
| Comment by Gerrit Updater [ 16/Jun/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37385/ |
| Comment by Peter Jones [ 16/Jun/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 20/Jan/21 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41275 |
| Comment by Gerrit Updater [ 20/Jan/21 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41276 |
| Comment by Gerrit Updater [ 20/Jan/21 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41277 |
| Comment by Gerrit Updater [ 04/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41275/ |
| Comment by Gerrit Updater [ 04/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41276/ |
| Comment by Gerrit Updater [ 04/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41277/ |