[LU-10157] LNET_MAX_IOV hard coded to 256 Created: 24/Oct/17  Updated: 24/Jan/24  Resolved: 16/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.0, Lustre 2.10.6, Lustre 2.14.0, Lustre 2.12.7

Type: Bug Priority: Blocker
Reporter: Amir Shehata (Inactive) Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
duplicates LU-10775 (sec.c:2363:sptlrpc_svc_unwrap_bulk()... Resolved
Related
is related to LU-10129 map-on-demand set to 32 doesn't work ... Resolved
is related to LU-7650 ko2iblnd map_on_demand can't negotita... Resolved
is related to LU-13181 kiblnd_fmr_pool_map error on the AARC... Resolved
is related to LU-10073 lnet-selftest test_smoke: lst Error f... Resolved
is related to LU-6387 Add Power8 support to Lustre Resolved
is related to LU-10300 Can the Lustre 2.10.x clients support... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LNET_MAX_IOV is hard coded to 256, which works well for architectures with a 4096-byte PAGE_SIZE. However, on systems with page sizes larger or smaller than 4K it results in incorrect MD count calculations, causing write errors.



 Comments   
Comment by Amir Shehata (Inactive) [ 15/Dec/17 ]

This issue can be worked around by setting max_pages_per_rpc to 16. The Power8 page size is 65536 bytes, so a max_pages_per_rpc value of 64 yields a 4MB RPC write. When we reduced that to a 1MB RPC write (max_pages_per_rpc == 16), the RDMA write worked properly.
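For reference, that workaround can be applied at runtime with lctl; this is a sketch assuming a default setup, and the `osc.*.*` wildcard should be narrowed to the affected filesystem:

```shell
# Workaround (not the fix): cap RPCs at 16 pages so a 64K-page client
# sends at most 1MB per RPC. The wildcard matches all OSC devices.
lctl set_param osc.*.max_pages_per_rpc=16

# Verify the new value took effect:
lctl get_param osc.*.max_pages_per_rpc
```

Note this is a per-client tunable and resets on remount unless made persistent with `lctl set_param -P` on the MGS.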

Looking at the code the problem seems to be the way we calculate the number of MDs to describe the data to RDMA.

in ptlrpc_register_bulk()

total_md = (desc->bd_iov_count + LNET_MAX_IOV - 1) / LNET_MAX_IOV;

total_md is the number of MDs to use to transfer all the data that needs to be transferred.

Unfortunately, LNET_MAX_IOV is hard-coded to 256. 256 is used because there is an underlying assumption that the page size is 4K, thus 256 * 4K = 1M.

On the Power8 machine the page size is 64K, and max_pages_per_rpc is set to 64, intending a 4MB RPC size.

As an example, let's take the case where we max out the number of pages at 64. If you plug the numbers into the above equation, you get:

(64 + 256 - 1) / 256 = 1

So you essentially try to shove 4MB worth of data into 1 MD. LNet expects that each MD will only describe 1MB of buffers.

If we define LNET_MAX_IOV as

#define LNET_MAX_IOV ((1024 * 1024) / PAGE_SIZE)

then on a Power8 machine LNET_MAX_IOV will be 16, resulting in:

(64 + 16 - 1) / 16 = 4

4 MDs each describing 1 MB of data, which should work.

This analysis is also consistent with our test results: reducing max_pages_per_rpc to 16 worked because, no matter what, each MD then carries at most 1MB.

Comment by Gerrit Updater [ 06/Mar/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31559
Subject: LU-10157 lnet: make LNET_MAX_IOV dependent on page size
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1237508cdb9dc8e748c375dafaaea3a286ae936a

Comment by Gerrit Updater [ 14/Apr/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31559/
Subject: LU-10157 lnet: make LNET_MAX_IOV dependent on page size
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 272e49ce2d5d6883e6ca1b00a9322b3a23b2e55a

Comment by Peter Jones [ 15/Apr/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 12/Sep/18 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33148
Subject: LU-10157 lnet: make LNET_MAX_IOV dependent on page size
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: ef932917ad3a77d1e9fdeaeb39774108bf25fc19

Comment by Alexey Lyashkov [ 12/Dec/19 ]

The landed patch doesn't help an ARM64 client with a 64k page size work with x86 servers when 4MB bulk is allowed.
For other cases it limits the number of pages that can be attached to a bulk for random IO.
It looks like we need a more complex solution with both LNET_MTU and fragment-count limits, and should not rely on MAX_IOV to calculate the number of MDs for a bulk.

Comment by Alexey Lyashkov [ 19/Dec/19 ]

This is a ptlrpc bug, not an LNet one, and should be fixed at bulk preparation time.
A patch will be sent shortly; until then, 64k pages are not usable at all.

Comment by Gerrit Updater [ 31/Jan/20 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/37385
Subject: LU-10157 lnet: restore an maximal fragments count
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a4ad80c4ecb584e951426cd90d6f084886d50903

Comment by Gerrit Updater [ 31/Jan/20 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/37386
Subject: LU-10157 ptlrpc: separate number MD and refrences for bulk
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bf78709e14c6b774cd62a9ed9a93cf1b5138d3bb

Comment by Gerrit Updater [ 31/Jan/20 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/37387
Subject: LU-10157 ptlrpc: fill md correctly.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3d34163fa8aa9834c13f03b4f471b0ccd40784d6

Comment by James A Simmons [ 31/Jan/20 ]

Resetting LNET_MAX_IOV to 256 will make the Power8 nodes send 16MB instead of 1MB. This will put memory pressure on the compute nodes, which will impact user applications. What I have been thinking about is breaking the relationship between LNet fragments and the page size, like what is done with the page cache. This can be done using xarrays: the Xarray API has the ability to map a range of indexes to the same object, which was added for huge page support. So in the case of LNet we could allocate an Xarray of size 256, and on a 64K page system accessing the 7th object would give you the first page at the proper offset, while on an x86 system it would give you the 7th page. I will work on adding the xarray infrastructure to Lustre today.

Comment by Alexey Lyashkov [ 01/Feb/20 ]

James - it looks like you read the patch incorrectly. You said as much in Slack, and I already replied to it there; let's remember that discussion. An xarray is not needed for this, not at all.

I added code to limit an MD by transfer size in addition to the fragment count, as before.
So currently an MD can be 1MB with 16 fragments, or up to 256 fragments of 4k or less, whichever limit is hit first.

+       if (((desc->bd_iov_count % LNET_MAX_IOV) == 0) ||
+            (desc->bd_nob_last == LNET_MTU)) {
+               desc->bd_mds_off[desc->bd_md_count] = desc->bd_iov_count;
+               desc->bd_md_count ++;
+               desc->bd_nob_last = 0;
+               LASSERT(desc->bd_md_count <= PTLRPC_BULK_OPS_COUNT);
+       }

So the client is not able to send a 16MB transfer as you suggest.

The whole patch series was tested on a software AARCH64 setup running RHEL8 with a 64k page size (ConnectIB VF), and on real AARCH64 hardware with RHEL8 (Mellanox CX-4 card) with the same page size; the server side was a Sonexion L300 (2.12 based + Mellanox CX-4) and Lustre 2.5 + ConnectX-3 card.

Comment by James A Simmons [ 01/Feb/20 ]

I'm looking at struct lnet_libmd, which has

lnet_kiov_t kiov[LNET_MAX_IOV]

which ends up being 256 pages, no?

Comment by Alexey Lyashkov [ 01/Feb/20 ]

No, up to 256 fragments, depending on the overall transfer size.
A random IO workload with fragments <= 4k on 64k pages will fill all 256 entries in a single transfer, but the overall size is 1MB or less.

Comment by Cory Spitz [ 07/Feb/20 ]

Hmm, somehow there are two comments here from Gerrit Updater about https://review.whamcloud.com/#/c/37386, but none about https://review.whamcloud.com/#/c/37387. Please note that three pending patches (37385, 37386, and 37387) are tied to this ticket.

Comment by James A Simmons [ 07/Feb/20 ]

I need to really think about this. I see potential issues with confusion between fragment count and number of pages; I see it in the patch for the lnet-selftest userland tool, for example. I think a better way can be found, especially since this change touches all of the LNet layer yet it's an IB-specific issue.

Comment by Alexey Lyashkov [ 08/Feb/20 ]

I think LNet should check the MD region size before send or attach. Another possibility is to export fragment-count / MTU-size tunables into ptlrpc instead of using defines.

Comment by Gerrit Updater [ 06/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37386/
Subject: LU-10157 ptlrpc: separate number MD and refrences for bulk
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8a7f2d4b11801eae4c91904da9f9750a012a6b11

Comment by Gerrit Updater [ 06/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37387/
Subject: LU-10157 ptlrpc: fill md correctly.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e1ac9e74844dc75d77ef740b3a44fad2efde30c5

Comment by Gerrit Updater [ 16/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37385/
Subject: LU-10157 lnet: restore an maximal fragments count
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4072d863c240fa5466f0f616f7e9b1cfcdf0aa0e

Comment by Peter Jones [ 16/Jun/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 20/Jan/21 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41275
Subject: LU-10157 ptlrpc: separate number MD and refrences for bulk
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 0d8b5acaf0cf3e741652e93e9894c43a5bf21f2e

Comment by Gerrit Updater [ 20/Jan/21 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41276
Subject: LU-10157 ptlrpc: fill md correctly.
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 1b3f8e4ad1adeb52505aaa34be2a9246997658bf

Comment by Gerrit Updater [ 20/Jan/21 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41277
Subject: LU-10157 lnet: restore an maximal fragments count
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 385819a69da21e8caf1b4b5ef80beb1b87f2c560

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41275/
Subject: LU-10157 ptlrpc: separate number MD and refrences for bulk
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 39cc8bd8d3747583e5029e4cd9520115ab0c6ff1

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41276/
Subject: LU-10157 ptlrpc: fill md correctly.
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 8ef7c5c45567fc9cdedfe4242a6c5b73193ab9fe

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41277/
Subject: LU-10157 lnet: restore an maximal fragments count
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: e39ba78737b63128cfc7df9d49c8dd49a30ce590

Generated at Sat Feb 10 02:32:33 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.