[LU-7990] Large bulk IO support Created: 06/Apr/16 Updated: 08/Dec/20 Resolved: 05/May/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0, Lustre 2.11.0, Lustre 2.10.2 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Gu Zheng (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Add large bulk IO support, e.g. make ptlrpc able to handle 16MB IOs, to improve performance. |
| Comments |
| Comment by Gerrit Updater [ 07/Apr/16 ] |
|
Gu Zheng (gzheng@ddn.com) uploaded a new patch: http://review.whamcloud.com/19366 |
| Comment by Gerrit Updater [ 07/Apr/16 ] |
|
Gu Zheng (gzheng@ddn.com) uploaded a new patch: http://review.whamcloud.com/19367 |
| Comment by Gerrit Updater [ 07/Apr/16 ] |
|
Gu Zheng (gzheng@ddn.com) uploaded a new patch: http://review.whamcloud.com/19368 |
| Comment by Gerrit Updater [ 17/Apr/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19366/ |
| Comment by Alexey Lyashkov [ 18/Apr/16 ] |
|
I like seeing DDN working to decrease overall Lustre performance. The last merged patch replaces kmalloc allocations with vmalloc, which dramatically increases vmalloc lock contention on kernels without the per-CPU vmalloc code. |
| Comment by James A Simmons [ 18/Apr/16 ] |
|
Ouch, that needs to be fixed. |
| Comment by Jinshan Xiong (Inactive) [ 19/Apr/16 ] |
|
I guess you are referring to the memory allocation in ptlrpc_new_bulk(). With 16M RPC support, the maximum value of nfrags is 4096, which leads to a memory allocation of at most 64KB. This is less than the maximum size kmalloc() can allocate on modern systems. If the allocation can't be fulfilled by kmalloc(), it will fall back to vmalloc(). I didn't see any problem with it. |
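As a back-of-the-envelope check of those numbers, here is a minimal standalone sketch; the 16-byte per-fragment descriptor size is an assumption for illustration (roughly a page pointer plus length and offset on a 64-bit kernel), not the actual ptlrpc structure layout:

```c
#include <stdio.h>

int main(void)
{
	const unsigned long rpc_size  = 16UL << 20; /* 16MB RPC */
	const unsigned long page_size = 4096;       /* 4KB pages */
	const unsigned long frag_desc = 16;         /* assumed bytes per fragment */

	unsigned long nfrags = rpc_size / page_size; /* 4096 fragments */
	unsigned long alloc  = nfrags * frag_desc;   /* 64KB total */

	/* 64KB / 4KB pages = 16 contiguous pages = an order-4 allocation */
	printf("nfrags = %lu, allocation = %lu KB\n", nfrags, alloc >> 10);
	return 0;
}
```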
| Comment by Patrick Farrell (Inactive) [ 21/Apr/16 ] |
|
Is it? What is that maximum size? I thought it was page size times two? |
| Comment by Li Xi (Inactive) [ 22/Apr/16 ] |
|
Hi Alexey, did you actually see a performance regression? Performance is really important to us, so please open a ticket if so. |
| Comment by Gerrit Updater [ 25/Apr/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19367/ |
| Comment by Alexey Lyashkov [ 25/Apr/16 ] |
|
Li Xi, we have had lots of problems with vmalloc on RHEL6 kernels. That kernel has a single spinlock protecting vmalloc allocations. |
| Comment by Li Xi (Inactive) [ 25/Apr/16 ] |
|
Hi Alexey, thanks for sharing this information. We will run benchmarks to check this too. |
| Comment by Alexey Lyashkov [ 25/Apr/16 ] |
|
From my point of view, you should look at this from a very different angle if you want to increase the transfer size. The bulk code should be changed to accept an OSC view as input and do the region-to-SGE page conversion inside LNet, since an SGE list has no limit on the number of segments and can be better controlled by the low-level drivers. |
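A rough sketch of that direction, assuming a hypothetical OSC-level region structure (none of these names exist in Lustre/LNet; this only illustrates handing the transport a byte region and letting it build the scatter-gather list):

```c
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Hypothetical OSC-level view of a transfer region, for illustration only. */
struct bulk_region {
	struct page **pages;   /* backing pages of the region */
	unsigned int  npages;
	unsigned int  offset;  /* byte offset into the first page */
	size_t        length;  /* total bytes to transfer */
};

/* Build an SG list from the region inside the transport layer, so the
 * caller never pre-fragments the transfer into fixed per-page entries. */
static int region_to_sgl(const struct bulk_region *r,
			 struct scatterlist *sgl, unsigned int max_sge)
{
	unsigned int i, off = r->offset;
	size_t left = r->length;

	sg_init_table(sgl, max_sge);
	for (i = 0; i < r->npages && left > 0 && i < max_sge; i++) {
		size_t len = min_t(size_t, PAGE_SIZE - off, left);

		sg_set_page(&sgl[i], r->pages[i], len, off);
		left -= len;
		off = 0;       /* only the first page may be offset */
	}

	return left ? -EOVERFLOW : (int)i; /* SGEs used, or too few */
}
```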
| Comment by James A Simmons [ 25/Apr/16 ] |
|
You make valid points, Alexey. One thing I'd like to bring up is that it has been recommended that we move from kernel_sock** to the netlink API. I currently don't have the cycles for that, but it's something we should look into. |
| Comment by Andreas Dilger [ 26/Apr/16 ] |
|
A related question: does the large BRW RPC size also increase the initial ocd_grant request, so that the client can at least form one full-size RPC to the OST after a connect? The default is currently 2MB (i.e. 2x 1MB RPCs), but this initial grant should probably be increased to 8MB or more when the RPC size is increased. |
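To make the scaling concrete, a purely hypothetical sketch of the relationship being suggested; the factor of two is inferred from the quoted "2MB = 2x 1MB RPCs" default, not from actual grant code:

```c
/* Hypothetical illustration only, not actual grant code: size the
 * initial grant so a client can form full-size RPCs right after
 * connect.  The factor of 2 mirrors the quoted 2MB-for-1MB-RPCs
 * default; whether 16MB RPCs warrant 8MB, 32MB, or something else
 * is exactly the open question above. */
static unsigned long initial_grant(unsigned long max_brw_size)
{
	return 2 * max_brw_size; /* 1MB RPCs -> 2MB; 16MB RPCs -> 32MB */
}
```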
| Comment by Shuichi Ihara (Inactive) [ 26/Apr/16 ] |
|
Alexey, would you please give us the specific IO pattern where you think performance drops? We have run a lot of benchmarks and so far have not seen any performance regressions; instead, we have seen significantly improved performance. |
| Comment by Andreas Dilger [ 26/Apr/16 ] |
|
The ptlrpc_new_bulk() code is changed from calling OBD_ALLOC() to use OBD_ALLOC_LARGE(), which in Lustre 2.7.55 and later is:

```c
#define OBD_ALLOC_LARGE(ptr, size)                                \
do {                                                              \
	OBD_ALLOC_GFP(ptr, size, GFP_NOFS | __GFP_NOWARN);        \
	if (ptr == NULL)                                          \
		OBD_VMALLOC(ptr, size);                           \
} while (0)
```

so it will try kmalloc() first at whatever size is requested, and only fall back to vmalloc() if no large-order allocation is available to fulfill the request. For a fragmented 16MB RPC this is 64KB at most, so kmalloc() will likely succeed with an order-4 allocation in most cases. Also, the size of this allocation is driven by the actual RPC size, so it won't do a larger allocation if large RPCs are not enabled. I don't see any significant problem in the current implementation.

Alexey, there is work being done in the upstream kernel to allow higher-order page allocations for IO, which would reduce the number of fragments in the IB SG list and could optimize both Lustre RPCs and IB-connected storage like SRP. It would be welcome if you would investigate your proposal to use higher-order allocations or virtual mappings to reduce the SG list, and to update the socklnd code to be more efficient. Patches should probably be submitted under a different LU ticket. |
| Comment by Jinshan Xiong (Inactive) [ 26/Apr/16 ] |
|
It sounds like significant changes at the VFS level will be needed to support large-order page allocation. Right now the kernel dirties pages one by one, so pages are always allocated one at a time. Alexey, it's good to see this problem from a higher level. Please document your proposal in detail so that we can revisit it later, once we have sufficient support from the kernel. |
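For contrast, a minimal sketch of the higher-order allocation being discussed, using the generic kernel API (this is not existing Lustre code):

```c
#include <linux/gfp.h>

/* One alloc_pages() call returns 2^order physically contiguous pages,
 * versus the one-page-at-a-time allocations done today when the VFS
 * dirties pages individually. */
static struct page *alloc_io_chunk(unsigned int order)
{
	/* e.g. order 2 => four contiguous 4KB pages => one 16KB fragment */
	return alloc_pages(GFP_NOFS | __GFP_NOWARN, order);
}
```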
| Comment by Jinshan Xiong (Inactive) [ 26/Apr/16 ] |
|
Andreas - I'm reluctant to increase the initial grant because I would like to be conservative until 16M RPCs are widely used. Without this change the first RPC will be small, but after that the RPC size should be 16MB, so it's not a big deal. |
| Comment by Alexey Lyashkov [ 26/Apr/16 ] |
|
Andreas, 64KB allocations start failing easily after some uptime; you can see this in several LNet OOM tickets. So any aged system, servers included, will end up using vmalloc for this code. As for SG lists: the server-side read-only cache kills the IB OFED cache, since the pool code can't find the older map. I have worked on other possibilities from time to time while looking at reworking the o2ib LND. |
| Comment by James A Simmons [ 26/Apr/16 ] |
|
Over time these larger allocations are going to fail due to increasing memory fragmentation, so the vmalloc penalty will show up. |
| Comment by Gerrit Updater [ 04/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19368/ |
| Comment by Joseph Gmitter (Inactive) [ 05/May/16 ] |
|
Patches have landed to master for 2.9.0. |
| Comment by Gerrit Updater [ 04/May/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26955 |
| Comment by Gerrit Updater [ 24/Oct/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26955/ |
| Comment by Gerrit Updater [ 24/Oct/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29738 |
| Comment by Gerrit Updater [ 26/Oct/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29738/ |