Adapt ko2iblnd to latest RDMA changes (LU-8874)

[LU-9026] Adapt to the removal of ib_get_dma_mr() Created: 17/Jan/17  Updated: 29/Sep/17  Resolved: 23/Mar/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Upstream
Fix Version/s: Lustre 2.10.0, Upstream

Type: Technical task Priority: Blocker
Reporter: Doug Oucharek (Inactive) Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: lnet

Issue Links:
Blocker
Duplicate
is duplicated by LU-9932 LU-9026, LU-9500, and LU-9472 backports Resolved
Related
is related to LU-6215 Sync Lustre external tree with lustre... Resolved
is related to LU-9983 LBUG llog_osd.c:327:llog_osd_declare_... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

   As part of recent (last September) RDMA changes, the routine ib_get_dma_mr() was deleted and an extra "flags" parameter was added to ib_alloc_pd().  This was all done to stop upper layers from accessing the R-key for unregistered DMA memory regions.  It seems that this opens the door to accessing any physical memory you want on a remote system thereby being a security issue.

For the applicable changes, see these two kernel commits:

 
from 4.9-rc1: change of arguments number for ib_alloc_pd
commit ed082d36a7b2c27d1cda55fdfb28af18040c4a89
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Sep 5 12:56:17 2016 +0200
from 4.9-rc1: removal of ib_get_dma_mr
commit 5ef990f06bd7e3cf521b5705d898d8e43d04ea90
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Sep 5 12:56:21 2016 +0200
 

LNet uses unregistered memory regions for short messages (lower latency).  A "method" of still allowing us to work as we did before was added in the form of a PD variable called "unsafe_global_rkey" which is populated if you pass the flag "IB_PD_UNSAFE_GLOBAL_RKEY" as the new flags parameter to ib_alloc_pd().  As the name implies, this is considered unsafe and should not be done unless you are very confident in your security.  Using it will trigger a warning to the kernel console.

This change has caused a kernel maintainer to mark ko2iblnd as "BROKEN" in the 4.10 kernel build.  To get rid of this broken tag, we need to fix this quickly so I am going to make use of the unsafe flag so I do not have to re-architect ko2iblnd for short messages.  We need to test how much of a performance hit we get on short messages if we have to register all memory regions for better security.  There is not enough time to do that now.

So, this ticket was created to track the following changes to the staging version of LNet:

  1. Remove the BROKEN tag from lnet/Kconfig.
  2. Pass the IB_PD_UNSAFE_GLOBAL_RKEY flag when calling ib_alloc_pd().
  3. Everywhere we are using the device DMA mr to derive the R-key for non-registered memory regions, use the pd->unsafe_global_rkey value instead (this will get rid of our call to ib_get_dma_mr()).

Once this is landed to staging, use this ticket to also make the change to community master with the necessary compile hooks to ensure the change is only done when it makes sense.



 Comments   
Comment by Gerrit Updater [ 18/Jan/17 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/24931
Subject: LU-9026 lnet: Adapt to the removal of ib_get_dma_mr()
Project: fs/linux-staging
Branch: staging-testing
Current Patch Set: 1
Commit: 9a0076b26cd74a3bc2708163b93a04ddf3b54bdc

Comment by James A Simmons [ 19/Jan/17 ]

I submitted a patch like this when it first broke and the response from Chris Hellwig was hell no. Mind you your patch is better than mine but he didn't want to reintroduce the security issues. Still this patch can be used as a base to get ko2iblnd working against upstream so we can work on the other important bits that need to be done 

Comment by Doug Oucharek (Inactive) [ 19/Jan/17 ]

ko2iblnd was not secure before this patch, and the patch does not change that at all.  If by changing the API to "force" us to be more secure, then Hellwig should not have put in the UNSAFE flag which I can see even he is using in the NFS code as well as the nvme guys.  The issue I have with his solution is it "can" increase the latency of our small messages (even the nvme guys mention this in their code comments which is why they allow the use of the UNSAFE flag).

I agree we need to eventually review this aspect to ko2iblnd security and improve it, but as a feature not as a quick bug fix.  Given we are no worse off than we were before, I see no good argument against this fix.

Comment by James A Simmons [ 19/Jan/17 ]

Just preparing you for Hellwig  

Comment by Doug Oucharek (Inactive) [ 24/Jan/17 ]

Naive question: if we switch to not using the global R-key on the upstream client, but the Lustre servers stay the way they are today (using the global R-key), will the upstream client be able to communicate with the server?

If they can communicate, I'm ok with changing this patch to not be using the global R-key and we will see what performance change we get.

Comment by Doug Oucharek (Inactive) [ 24/Jan/17 ]

It looks like we can have a new updated upstream client communicate with a older server.

Based on my understanding of how ko2iblnd is using rkey's, to do this fix so we avoid the global rkey, we "must" use FMR/FastReg.  That implies that map_on_demand cannot be zero (and the default has to be changed from the default).

James: is this your understanding too?  Are we ok with this change?

Comment by Doug Oucharek (Inactive) [ 26/Jan/17 ]

James: With the latest patch version, we have the upstream clients use a default map_on_demand of 32 and not allow them to be zero anymore.  Autotest is trying to test them against servers based on master which have a default map_on_demand of zero.  Those two settings are not compatible so all tests immediately fail.

I cannot just change master to have a new default map_on_demand of 32 as that will break backwards compatibility between 2.10 and previous releases.  Sigh.  This change is going to be a political nightmare!

How about this: I change master so that a node which has a map_on_demand of zero encounters another node with a non-zero map_on_demand, the current node changes to match the non-zero value?  That should keep backward compatibility and prepare servers for working with the new upstream clients.  We do have to be very clear that the upstream clients CANNOT work with servers older than 2.10 unless those servers have their map_on_demand values made non-zero.  That sort of breaks our rule of a seamless backwards compatibility upgrade.

Would this be an acceptable solution?  Suggestions for a better way?

Comment by Andreas Dilger [ 26/Jan/17 ]

Doug, I'm not sure about the LNet/IB issues, but in terms of compatibility your proposal seems reasonable. It would also be possible to backport the dynamic map_on_demand handling to our maintenance branches so that the older clients can mount newer servers in the future.

Comment by Gerrit Updater [ 06/Feb/17 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/25277
Subject: LU-9026 lnd: Adapt to the removal of ib_get_dma_mr()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 54cc5d3d5e9a34c5d59b847bf0bcacccb7050755

Comment by James A Simmons [ 14/Feb/17 ]

ORNL just hit this when someone attempted to build lustre with Mellanox 4.0 stack.

Comment by Doug Oucharek (Inactive) [ 14/Feb/17 ]

We need to get these two patches (one upstream, the other on master) landed soon then.  Reviews?

Comment by James A Simmons [ 21/Feb/17 ]

Upstream patch has been submitted. Waiting for it to land since no complaints. I tested the patch for master without the MOFED 4 stack. We now have MOFED 4 in our test system so I'm going to give that a try next.

Comment by James A Simmons [ 24/Feb/17 ]

Patch has landed upstream.

Comment by Andreas Dilger [ 25/Feb/17 ]

Yay! Thanks Doug and James. Doug is also working on a couple more o2iblnd patches to avoid future similar breakage. Hopefully that will keep Christoph at bay for a while, and maybe even convince him that we want to get Lustre out of staging.

Is there some linux-ib list that should be followed to get advance warning of such API changes in the future? It may be listed in the commit message for the original patches, or the get-maintainers script (or whatever it is called, or just the MAINTAINERS file) can be used to find the list.

Comment by Alexey Lyashkov [ 02/Mar/17 ]

did any perf test run with it patch ?

Comment by Doug Oucharek (Inactive) [ 03/Mar/17 ]

I have been focused on doing some performance testing and am having some strange problems.  This patch makes the default map_on_demand 256.  With Lustre master, I am seeing a 5% reduction in performance with MLX QDR.  Overall, not too bad given we are addressing a security hold with this change (i.e. we are being forced to do this).

I downloaded the build from upstream with this change.  Things are not good there.  Ran into two bad problems:

  1. Performance is less than 1/2 from master.
  2. Cannot use map_on_demand values less than 256 without getting this error message:

[ 1125.547100] LNetError: 22552:0:(o2iblnd_cb.c:1068:kiblnd_init_rdma()) RDMA is too large for peer 192.168.1.24@o2ib (131072), src size: 1048576 dst size: 1048576

Some map_on_demand fixes/changes must be missing upstream.  Whether that missing code is also causing the huge drop in performance, I don't yet know.  I need to find an upstream build without this patch to see if performance has always been bad upstream and no one noticed.

In any case, I will be opening a ticket to address number 2 as we need to find the missing code and get it upstream.  Depending on how my additional testing for number 1 goes we will either be ok (and a 5% performance hit will accompany this change) or this patch has introduced a nasty problem which needs to be debugged before we can land it to master.

Comment by Gerrit Updater [ 03/Mar/17 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/25802
Subject: LU-9026 ko2iblnd: Test Patch to debug performance issue
Project: fs/linux-staging
Branch: staging-testing
Current Patch Set: 1
Commit: 171cd4eeda569fb5a19b6fde02240352adf7db24

Comment by Doug Oucharek (Inactive) [ 04/Mar/17 ]

Did some more testing and am starting to believe we have a general performance problem in the upstream LNet and is not caused by the patch in this ticket.  I have opened two tickets corresponding to the two items above:

  1. Performance with IB/OPA: LU-9179
  2. map_on_demand <256 a problem: LU-9180

I have not been able to validate the master version of this fix for performance issues yet because, well, I don't yet know how to build the master branch against Linux 4.9.  Help appreciated.

Comment by Dmitry Eremin (Inactive) [ 06/Mar/17 ]

Doug,
I'm porting master to Linux 4.9 above your patch. I have one more patch to complete. You can look at series in top for now https://review.whamcloud.com/25840.

Comment by Doug Oucharek (Inactive) [ 07/Mar/17 ]

I just tried to build master against 4.9 (client only).  It fails in the libcfs debug code.  Will you be addressing that?  I tried turning it off via a configure option, and that triggers a warning that you can't turn off debugging.  Sigh.

Comment by Alexey Lyashkov [ 07/Mar/17 ]

Dmitriy,

your's patch isn't complete. Based on upstream commit, you need define an
struct xattr_handlers to do same work. But you's patch just disable an xattr for lustre client. I don't think it good one.

Comment by James A Simmons [ 08/Mar/17 ]

Dough you can test out the patch with MOFED 4. Its requires this patch to work. I have been running master +M MOFED 4 to test this patch. So far I haven't see the performance degradation you are seeing. Give MOFED 4 a try.

Comment by Alexey Lyashkov [ 10/Mar/17 ]

James,

did you test a mlx5 or mlx4 cards a specially QDR ?

Comment by James A Simmons [ 13/Mar/17 ]

I tested mlx4 card (Connect-3) which is FDR. I need to try with the mlx5 driver once my Power8 node image is updated.

Comment by Alexey Lyashkov [ 13/Mar/17 ]

mlx5 cards supports a fast memory registration mode, so should lack a performance problems,
but mlx4 + QDR cards (Connect-X2 possible) should have an issues with MD rpc traffic which need constantly send a reply with different kernel address, so memory mapping cache isn't work in this case.

Comment by Gerrit Updater [ 23/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25277/
Subject: LU-9026 o2iblnd: Adapt to the removal of ib_get_dma_mr()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e4297ef38561f1e788ba73ca0c8078a09dc8c303

Comment by Peter Jones [ 23/Mar/17 ]

Landed to master which will now make testing easier.

Generated at Sat Feb 10 02:22:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.