[LU-5718] RDMA too fragmented with router Created: 08/Oct/14  Updated: 14/Jun/19  Resolved: 03/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: Johann Lombardi (Inactive) Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl

Attachments: File 4james.tgz    
Issue Links:
Duplicate
duplicates LU-7385 Bulk IO write error Resolved
duplicates LU-3322 ko2iblnd support for different map_on... Resolved
Related
is related to LU-7210 ASSERTION( peer->ibp_connecting == 0 ) Resolved
is related to LU-7569 IB leaf switch caused LNet routers to... Resolved
is related to LU-7650 ko2iblnd map_on_demand can't negotita... Resolved
is related to LU-9420 Bad Check slipped into repo Resolved
is related to LU-10252 backport change LU-5718 change 12451/... Resolved
is related to LU-7401 OOM after LNet initialization with no... Resolved
is related to LUDOC-378 Document wrq_sge as an o2iblnd parameter Resolved
is related to LU-12419 ppc64le: "LNetError: RDMA has too man... Closed
Severity: 3
Rank (Obsolete): 16043

 Description   

Got an IOR failure on the soak cluster with the following errors:

Oct  7 21:54:01 lola-23 kernel: LNetError: 3613:0:(o2iblnd_cb.c:1134:kiblnd_init_rdma()) RDMA too fragmented for 192.168.1.115@o2ib100 (256): 128/256 src 128/256 dst frags
Oct  7 21:54:01 lola-23 kernel: LNetError: 3618:0:(o2iblnd_cb.c:428:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.1.114@o2ib100: -90
Oct  7 21:54:01 lola-23 kernel: LNetError: 3618:0:(o2iblnd_cb.c:428:kiblnd_handle_rx()) Skipped 7 previous similar messages

Liang told me that this is a known issue with routing. That said, the IOR process is not killable and the only option is to reboot the client node. We should at least fail "gracefully" by returning the error to the application.



 Comments   
Comment by Liang Zhen (Inactive) [ 28/Oct/14 ]

The patch is here: http://review.whamcloud.com/12451
It's not tested yet; I need to test it.

Comment by Chris Horn [ 28/Oct/14 ]

Johann/Liang, any tips for reproducing this issue?

Comment by Liang Zhen (Inactive) [ 29/Oct/14 ]

I think Johann hit this while running some mixed workloads with routers. I will patch lnet_selftest and make it support brw with an offset, which should be able to reproduce this issue.

Comment by Alexey Lyashkov [ 30/Oct/14 ]

I'm not sure the patch is correct.

Oct  7 21:54:01 lola-23 kernel: LNetError: 3613:0:(o2iblnd_cb.c:1134:kiblnd_init_rdma()) RDMA too fragmented for 192.168.1.115@o2ib100 (256): 128/256 src 128/256 dst frags

I think the main reason for this is an incorrect calculation at the osc/ptlrpc layer, which is already responsible for checking the number of fragments for a bulk transfer.

Comment by Chris Horn [ 05/Nov/14 ]

Liang, do you have an LU ticket for the lnet_selftest enhancement you mentioned?

Comment by Liang Zhen (Inactive) [ 06/Nov/14 ]

Hi Chris, I didn't create another ticket for selftest, but I have posted a patch for it: http://review.whamcloud.com/#/c/12496/

Comment by Chris Horn [ 22/Apr/15 ]

We had a site report seeing an error with this patch when they set peer_credits > 16:

LNetError: 2641:0:(o2iblnd.c:872:kiblnd_create_conn()) Can't create QP: -12, send_wr: 16191, recv_wr: 254, send_sge: 2, recv_sge: 1
Comment by Liang Zhen (Inactive) [ 23/Apr/15 ]

Chris, I don't think this is an issue caused by this patch because it does not consume extra memory. I suspect it is about connd reconnecting aggressively when there is a connection race; I will post a patch for this.

Comment by Chris Horn [ 23/Apr/15 ]

Thanks, Liang. FWIW, they only see that error with this patch applied, and when they set "options ko2iblnd wrq_sge=1" the error goes away...

Comment by Isaac Huang (Inactive) [ 23/Apr/15 ]

Liang, I think the patch could cause increased memory overhead at the OFED layer and the layers beneath it, since init_qp_attr->cap.max_send_sge is doubled.

Comment by Alexey Lyashkov [ 23/Apr/15 ]

Isaac,

do you remember my comments about the additional memory issues with that patch?

Comment by Isaac Huang (Inactive) [ 23/Apr/15 ]

Alexey, that's the price to pay - there's no free lunch.

Comment by Gerrit Updater [ 27/Apr/15 ]

Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/14600
Subject: LU-5718 o2iblnd: avoid intensive reconnecting
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5ec48e1f63befe9c361ddac6d8baa38aa83edd34

Comment by Liang Zhen (Inactive) [ 27/Apr/15 ]

Isaac, indeed, thanks for pointing that out.
Chris, could you try this patch and see if it helps?

Comment by Alexey Lyashkov [ 27/Apr/15 ]

Per discussion with Mellanox people, they are not happy with increasing the number of fragments for their IB cards, because it needs a large array allocated via kmalloc. With Cray default settings it's a 128k allocation per connection, so it is easy to hit a problem with any new connection. With this patch the allocation will double, so a 256k allocation via kmalloc is needed.

                qp->sq.wrid  = kmalloc(qp->sq.wqe_cnt * sizeof (u64), GFP_KERNEL);
                qp->rq.wrid  = kmalloc(qp->rq.wqe_cnt * sizeof (u64), GFP_KERNEL);

I agree with Isaac, there is no free lunch - but with that patch you may stop working with a large number of connections, like router <> client links.

#define IBLND_SEND_WRS(v) ((IBLND_RDMA_FRAGS(v) + 1) * IBLND_CONCURRENT_SENDS(v))
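
For reference, plugging the Cray settings mentioned later in this ticket (concurrent_sends=63, the default 256 RDMA fragments) into this macro appears to account for both the send_wr value in the QP-creation error above and the 128k figure; this is an inference from the numbers quoted in this ticket rather than something stated explicitly:

    IBLND_SEND_WRS = (IBLND_RDMA_FRAGS + 1) * IBLND_CONCURRENT_SENDS
                   = (256 + 1) * 63
                   = 16191                  /* matches "send_wr: 16191" above */
    16191 * sizeof(u64) = 129528 bytes      /* roughly the 128k per-connection wrid kmalloc */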

Comment by Chris Horn [ 14/May/15 ]

Liang, I haven't had a chance to reproduce the QP allocation failure internally, so I haven't tested your patch. I agree with Alexey that I think a big part of our problem is the large kmallocs we're doing. The site that hit this issue is using ConnectIB cards with the mlx5 drivers (I only have access to ConnectX/mlx4 cards internally). I haven’t looked at the driver code before, but it looks to me like we're not just doing the one 256k allocation noted by Alexey (I'm pretty sure the qp->rq.wrid kmalloc is for just 2048 bytes), but it looks like we're doing four of them:

qp->rq.wqe_cnt = 256
qp->sq.wqe_cnt = 32768

         qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(*qp->sq.wrid), GFP_KERNEL); // 262144 bytes
         qp->sq.wr_data = kmalloc(qp->sq.wqe_cnt * sizeof(*qp->sq.wr_data), GFP_KERNEL); // 262144 bytes
         qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(*qp->rq.wrid), GFP_KERNEL); // 2048 bytes
         qp->sq.w_list = kmalloc(qp->sq.wqe_cnt * sizeof(*qp->sq.w_list), GFP_KERNEL); // 262144 bytes
         qp->sq.wqe_head = kmalloc(qp->sq.wqe_cnt * sizeof(*qp->sq.wqe_head), GFP_KERNEL); // 262144 bytes

The reason we have such large allocations is that we set peer_credits=126 and concurrent_sends=63 in order to deal with the huge amount of small messages generated by Lustre client pings at large scale (see https://cug.org/proceedings/attendee_program_cug2012/includes/files/pap166.pdf for details). The site that reported the QP allocation failure did try different values of peer_credits, and they found that the only values that worked were peer_credits=8 and peer_credits=16. This was on a small TDS system with just two LNet routers (I’m still waiting to find out the total number of IB peers).

Interestingly, we've deployed the multiple SGEs patch at another (very) large site that uses ConnectX/mlx4 drivers, and they have not seen this issue. So I'm wondering if there's a difference in the driver code that is making this more likely.

Comment by Gerrit Updater [ 04/Sep/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14600/
Subject: LU-5718 o2iblnd: avoid intensive reconnecting
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5dcc6f68d6ebba0be4e2a7d132d4e28da7a8361e

Comment by Alexey Lyashkov [ 04/Sep/15 ]

Can you explain why you closed this ticket with an unrelated patch?

Comment by Alexey Lyashkov [ 04/Sep/15 ]

The reconnect problem is a completely different problem and needs its own ticket; it never addressed the wrong alignment of the router buffers.
Please reopen this ticket.

Comment by Joseph Gmitter (Inactive) [ 04/Sep/15 ]

Hi Alexey,
this was in error - my apologies.

Comment by Frank Heckes (Inactive) [ 28/Sep/15 ]

The error still happens during soak testing of 2_7_59 + debug patch
(see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20150914)
while running IOR (single-shared-file mode) on a single client node.
The job hangs and the IOR process can't be killed.

There are 173 messages of the form:

Sep 27 09:58:24 lola-27 kernel: Lustre: 3698:0:(client.c:2040:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443372961/real 1443372961]  req@ffff880588fbd0c0 x1513236214216272/t0(0) o4->soaked-OST0008-osc-ffff880818748800@192.168.1.102@o2ib10:6/4 lens 608/448 e 2 to 1 dl 1443373079 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Sep 27 09:58:24 lola-27 kernel: Lustre: 3698:0:(client.c:2040:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 27 09:58:24 lola-27 kernel: Lustre: soaked-OST0008-osc-ffff880818748800: Connection to soaked-OST0008 (at 192.168.1.102@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Sep 27 09:58:24 lola-27 kernel: Lustre: soaked-OST0008-osc-ffff880818748800: Connection restored to soaked-OST0008 (at 192.168.1.102@o2ib10)
Sep 27 09:58:24 lola-27 kernel: LNetError: 3675:0:(o2iblnd_cb.c:1139:kiblnd_init_rdma()) RDMA too fragmented for 192.168.1.114@o2ib100 (256): 128/233 src 128/233 dst frags
Sep 27 09:58:24 lola-27 kernel: LNetError: 3675:0:(o2iblnd_cb.c:435:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.1.114@o2ib100: -90

which seems to correlate to the same amount of errors on OSS node (lola-2) :

Sep 27 09:57:41 lola-2 kernel: LustreError: 8847:0:(ldlm_lib.c:3017:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff8801c9e476c0 x1513236214216272/t0(0) o4->076bba0c-23e4-e9cc-96e8-bd39615184cd@192.168.1.127@o2ib100:318/0 lens 608/448 e 2 to 0 dl 1443373078 ref 1 fl Interpret:H/0/0 rc 0/0
Sep 27 09:57:41 lola-2 kernel: Lustre: soaked-OST0008: Bulk IO write error with 076bba0c-23e4-e9cc-96e8-bd39615184cd (at 192.168.1.127@o2ib100), client will retry: rc -110
Sep 27 09:58:24 lola-2 kernel: Lustre: soaked-OST0008: Client 076bba0c-23e4-e9cc-96e8-bd39615184cd (at 192.168.1.127@o2ib100) reconnecting
Comment by Gerrit Updater [ 21/Dec/15 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/17699
Subject: LU-5718 o2iblnd: Revert original fix
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 682a15bf7319907cbd281021ea9af85d160cdf94

Comment by James A Simmons [ 21/Dec/15 ]

Please don't revert it; it really did help relieve our router memory pressure. I really think LU-7210 and LU-7569 will relieve these problems.

Comment by Andreas Dilger [ 21/Dec/15 ]

Please don't revert it; it really did help relieve our router memory pressure. I really think LU-7210 and LU-7569 will relieve these problems.

James, this patch has not landed yet; even the reversion needs to go through build and test since it is so old. Are the fixes on top of LU-5718 well enough understood and tested that they present a better path forward than reverting to a state that was working for many years before it landed? I don't have much information on it, as I'm not LNet-savvy enough to make a final decision myself, but my understanding is that the current situation is worse than before the http://review.whamcloud.com/14600 patch.

Comment by Liang Zhen (Inactive) [ 23/Dec/15 ]

James, I think it is better to revert it for the time being; this patch is in the right direction but it is faulty. It opened a few race windows. Instead of adding fixes on top of it, I think it's better to just revert it and do a better implementation. I will work out another patch for the memory issue based on this patch and http://review.whamcloud.com/17527.

Andreas, I agree the situation is worse than without 14600 because it is faulty, sorry for that. But it is very helpful for the memory issue that people have hit for years, so I will rework the patch.

Comment by Gerrit Updater [ 08/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17699/
Subject: LU-5718 o2iblnd: Revert original fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3efb7683679ab2d18b4d2b256acd462596324d9c

Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ]

Given the true fix will be done in LU-7569, I'm closing this ticket as a duplicate of that ticket.

Comment by Chris Horn [ 08/Jan/16 ]

Is LU-7569 really a duplicate? Can you briefly explain how the patch there resolves the "RDMA too fragmented" issue?

Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ]

Hmm...I was assuming that http://review.whamcloud.com/#/c/14600/ was the fix for this issue and we had to revert it as it caused other problems. That patch is being redone under LU-7569 which is why I wanted to close this ticket.

As I look at the history, I'm not convinced that http://review.whamcloud.com/#/c/14600/ was addressing the original problem. Does anyone know what the state of the original issue is? I fear we have been trying to tackle too many items here.

Comment by Chris Horn [ 08/Jan/16 ]

AFAIK, the original fragmentation issue still exists. The 14600 patch was, IMO, inappropriately linked to this ticket, and never addressed the fragmentation error. Hence this ticket remained open even though the 14600 patch had landed.

Comment by James A Simmons [ 08/Jan/16 ]

The reason for 14600's creation was to fix the huge memory pressure caused by the other patch for this ticket, 12451. Patch 12451 was never merged, but 14600 was. Also, there has been debate about whether 12451 was fixing the issue in the right way, which is why it was never merged; see the comment history here. This still needs to be investigated.

Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ]

Ok. That being the case, I am re-opening this ticket to address the fragmentation of memory issue. Let's not do any more reconnection fixes here :^).

Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ]

I'm starting to believe that the fix for this issue is the same as for LU-7385. That is assuming the fragmentation error occurs due to an offset in the IOVs.

Comment by James A Simmons [ 29/Aug/16 ]

So we have two not-so-hot solutions.

Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ]

My big question is: why do we have an offset? Is this caused by partial reads/writes in the file system?

James: Is this going to result not in an error, but a crash, after your change under LU-7650? You replaced this fragment check with an overall sizing check before the loop. I'm not sure that will catch a problem when an offset is applied.

Comment by James A Simmons [ 29/Aug/16 ]

It shouldn't crash since all allocations are 1 + IBLND_MAX_RDMA_FRAGS to work around this issue. I know it's an ugly solution but it will hold us over until we move to the netlink API.

Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ]

I don't think just adding 1 to the MAX_RDMA_FRAGS is enough. Here is what I think is happening, and I really need others to tell me whether my understanding is wrong or to agree, so we can quickly move to fix this. We have customers adding LNet routers and running into bulk RDMA failures due to this issue, so fixing this has just become a very high priority.

1. A bulk operation is sent to the LNet router where the first fragment has an offset so the full 4k (assuming 4k pages) is not used in the first RDMA buffer.
2. In the code of kiblnd_init_rdma(), when it is configuring the work queue for going from the source to the destination, it will have a source whose first fragment is less than 4k and a destination whose first fragment is ready for 4k.
3. On the 1st iteration of the loop setting things up, it will set the 1st work queue item to transfer <4k (the size of the source's 1st fragment).
4. When the consume routines are called, the source will advance to fragment index 2 (transferred all bytes it has) but the destination will not advance as it has space in its 1st fragment.
5. 2nd iteration of the loop will set up work element 2 to transfer just the number of bytes the destination has left in its 1st fragment.
6. When the consume routines are called, the source will not advance because it has not transferred all bytes of its 2nd fragment, but the destination will advance to index 2.
7. This will continue until both source and destination indexes are 128. At this point we will have used 256 work queue items which is the max. It will be detected that we have used up all work queue items but are not done. That causes the "RDMA too fragmented" error message.

So, this issue seems to be caused by the fact that the 1st fragment of the source is < 4k while the destination is 4k. It causes us to use twice as many work queue items as fragments. It would seem that a proper solution would be to shift the offset of the destination forward to match the source so both source and destination have the same sized 1st fragments. I have no clue how to do that and am open to suggestions.

Another solution is to have 512 work queue items on LNet routers to accommodate this particular situation. Not sure we can do that given all the funky FMR/fast reg code out there.

Yet another solution is to do what was done in LU-7385 and use one very big fragment buffer on the routers when RDMA is in play. All the source fragments will nicely fit into the big destination fragment so we don't end up needing twice the number of work queue items as source fragments.

Thoughts from anyone?

Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ]

Another possible solution is to break the assumption that we need to fill up each destination fragment before advancing to the next destination index. If we always advance the destination index when we advance the source index, this problem would go away. However, it would mean that the destination fragments have to be the same size as the source's. But I believe James has found that this must be true anyway for multiple reasons. The code is just not in shape to have different fragment sizes.

Comment by James A Simmons [ 29/Aug/16 ]

Correct. The fragment sizes must match on both sides. The max fragment count is so important that it is transmitted over the wire. The thing is that we allocate all our buffers for the worst-case scenario at ko2iblnd initialization. We really should be allocating them dynamically based on what the remote connection can support. Anyway, it's acceptable that we handle the problem as you described. Currently I can't duplicate this problem. Are there known configurations/setups that expose this? Do you need a specific workload for this to show up?

Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ]

Don't have a profile yet. Working on it. Trying to get the file system guys to describe the need for an offset.

Seagate added an "offset" parameter to the lnet-selftest command set so you can reproduce this issue. See LU-7385.

Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ]

Also, I don't understand why this is only happening with LNet routers and not direct RDMA operations. In theory, it should happen everywhere. I'm really missing something and the code is not making it obvious.

Comment by Doug Oucharek (Inactive) [ 30/Aug/16 ]

The one scenario given to me which could cause an offset is using O_DIRECT read or write on a 512-byte sector boundary. Possibly you also have to mix this with non-O_DIRECT operations (not sure).

Comment by Doug Oucharek (Inactive) [ 31/Aug/16 ]

Another, preferable, option is to fix the original patch by Liang, http://review.whamcloud.com/12451. In that patch, was having peer_credits > 16 triggering too many send_wr's?

Comment by Olaf Weber [ 31/Aug/16 ]

Also, I don't understand why this is only happening with LNet routers and not direct RDMA operations. In theory, it should happen everywhere. I'm really missing something and the code is not making it obvious.

The explanation for this difference in behaviour is likely that the source and target both use the offset because it corresponds to (say) an offset in a file. A router on the other hand, only needs to buffer the message for forwarding, and doesn't need to replicate the offset in its buffer. Note the 0 for the offset parameter in lnet_ni_recv() below.

lnet_parse()
        if (!for_me) {
                rc = lnet_parse_forward_locked(ni, msg);
                lnet_net_unlock(cpt);

                if (rc < 0)
                        goto free_drop;

                if (rc == LNET_CREDIT_OK) {
                        lnet_ni_recv(ni, msg->msg_private, msg, 0,
                                     0, payload_length, payload_length);
                }
                return 0;
        }

The easiest approach to making RDMA work better here might be to use the offset when buffering routed messages. This should cost at most one extra page per message buffer and result in one extra fragment. If we really cannot afford to spend the extra page, we could try to use the fact that the partial page at the start + partial page at end <= one page, so in principle we can store both fragments in a single page. (There is still the extra fragment to deal with, and we may end up having to debug RDMA engines if it turns out they don't like putting two non-overlapping fragments into the same page.)

To find where this may be coming from, if you have some kind of reproducer, consider putting a WARN_ON() in lnet_md_build() that triggers when umd->start isn't a multiple of the page size. You'll probably want to limit that to the LNET_MD_IOVEC and LNET_MD_KIOV cases, because you'd likely get a warning for each LNet ping otherwise. If you can get to the point where the warnings are only triggered by the cases of interest, but the traces don't provide enough information by themselves, you can change them to a BUG_ON and dig through the core to get the actual function parameters.
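
A minimal sketch of the debug check Olaf describes, assuming umd is the user-supplied MD passed into lnet_md_build(); the option flags and field names below follow the ones mentioned in this comment and may differ slightly in the tree being patched:

        /* hedged sketch only: warn when an iovec/kiov-backed MD starts mid-page */
        WARN_ON((umd->options & (LNET_MD_IOVEC | LNET_MD_KIOV)) &&
                ((unsigned long)umd->start & (PAGE_SIZE - 1)) != 0);

As Olaf notes, once the warnings only fire for the interesting cases, the WARN_ON can be turned into a BUG_ON to capture the full parameters in a crash dump.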

Comment by Doug Oucharek (Inactive) [ 01/Sep/16 ]

I have given up trying to reproduce this issue from the file system level. Having no luck. Instead, I have updated the lnet-selftest patch which adds an offset parameter. Using that, I have been able to reproduce the issue.

Note: before the patch for LU-7650 I get an error message and failed bulk operation. After LU-7650 I get a crash (see my comment above about this). So fixing this issue has become much more important now if we don't want to revert LU-7650.

What Olaf has suggested sounds good, but I need to provide a production system a patch ASAP and don't really have the time to investigate that approach. Instead, I'm going to take Liang's original fix, http://review.whamcloud.com/12451, and see if I can resolve the problem found with peer_credits.

Comment by Alexey Lyashkov [ 01/Sep/16 ]

Olaf,

The router doesn't have such info about the offset, as the sender doesn't fill it in on the message. That information exists at the osc protocol level, where the first KIOV isn't page aligned. A short-term solution is to just allocate one whole large buffer as a single alloc_pages() on the router. But that produces a problem with TCP <> IB routing, as LNet needs to be adjusted to be able to send a large buffer to socklnd.
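
A minimal illustration of the "single large buffer" idea above, assuming 4k pages; a real change would have to plug into the existing router buffer pools and fall back when high-order allocations fail, so treat this as a sketch only:

        /* hedged sketch: one physically contiguous 1MB router buffer
         * (order-8 allocation with 4k pages) instead of 256 separate pages */
        struct page *pg = alloc_pages(GFP_KERNEL, 8);

        if (pg == NULL)
                return -ENOMEM; /* high-order allocations fail easily once memory is fragmented */

The later comments about memory fragmentation and CMA are essentially about how risky this kind of high-order allocation is on a long-running router.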

Comment by James A Simmons [ 01/Sep/16 ]

Doug, the original patch for this ticket was integrated into our default Cray 2.5 clients. On our systems it broke our routers unless wrq_sge=1 was set. Since you are under pressure, it seems logical to use it as a band-aid for the site currently suffering from this issue.

Comment by Doug Oucharek (Inactive) [ 01/Sep/16 ]

I'm wondering if a recent change, which keeps retrying QP creation while lowering the number of send_wr's on each iteration, will fix/mask the problem Cray found?

Comment by Doug Oucharek (Inactive) [ 01/Sep/16 ]

James: I'm finding the original patch is only needed on the clients as we have not seen this problem with servers rdma'ing to the routers or routers rdma'ing to the clients. So, you really only need wrq_sge=2 (default) on the clients and set it to 1 on the routers and servers.

Comment by Alexey Lyashkov [ 02/Sep/16 ]

Doug,

It looks like your investigation is incorrect.
The problem is related to the loop on the router side where the first unaligned chunk is mapped to an aligned one, which causes twice as many segments per WC to send one page.
A simple workaround is https://github.com/Xyratex/lustre-stable/commit/f9895e2423ad76147bfbb6c4974c58439782180f

but it will cause a problem with TCP <> IB routing.

Comment by Olaf Weber [ 02/Sep/16 ]

Looking a bit closer, the offset into the first page is present at the LND level (as opposed to LNet level) for the o2ib and (I think) gni LNDs. The sock LND does not have it. So there is a problem when data is routed from a TCP network to IB or GNI. It would be possible to extend the sock LND to carry this data (easier than an LNet protocol change) but some less invasive option might be preferable.

Comment by Doug Oucharek (Inactive) [ 02/Sep/16 ]

The Xyratex fix was put into Gerrit under: http://review.whamcloud.com/16141/. I made some significant changes to the patch to make it work with ksocklnd and to make the change more adaptable (controlling the larger fragments by making them a new pool). It has not been reviewed/landed yet.

What I have found when this problem occurs in production is that an offset is only applied when a client is doing an rdma write through a router. When the client is setting up the work queue elements to rdma write to the router, it runs out of elements because it is using two elements for each fragment because the fragments are out of sync (see my description of the problem above). So the client is reporting the error "Too fragmented" and is aborting the rdma operation. I have never seen the "Too fragmented" error reported on servers or routers.

This can be fixed in two ways:

1- as Liang's original fix does, just have more work queue elements via doubling the sge so the client can rdma an offset buffer into a non-offset buffer in the router.
2- as Xyratex has done and have the router's buffer be one single big fragment so each fragment is appended into the router's buffer. This prevents the number of work queue elements needed from doubling.

Fix 1 needs to be applied to the clients (and servers, if we believe an offset can ever happen from there... no evidence of that yet). Fix 2 needs to be applied to the routers only.

Which one is best? That seems to be an ongoing discussion here.

Comment by James A Simmons [ 02/Sep/16 ]

Ugh. Neither is great since they both involve increasing the memory footprint. In our experience Cray routers tend to be very memory constrained, so I would go for the client fix option. I have looked at the Xyratex solution and never understood why a new buffer is needed. Couldn't we just expand on the large buffers that already exist?

On the other hand, down the road when we move to netlink, the upper-layer problems go away. These workarounds in the LND driver would have to be cleaned up in the future.

Comment by Doug Oucharek (Inactive) [ 02/Sep/16 ]

I wanted to let users control the use of large RDMA buffers. Allocating a large number of 1M buffers made up of contiguous pages can be challenging if the system's memory has become fragmented. Depending on how much memory the router has, a customer may want to control the allocation of these buffers, falling back to fragmented large buffers when they run out of contiguous ones. Having a separate pool makes this easier to configure and adapt to each unique situation.

Comment by James A Simmons [ 02/Sep/16 ]

Oh, I see what Alyona is doing. It's just that I'm used to seeing contiguous pages allocated using alloc_contig_range() or CMA. Since this is the case, I would recommend you rename some of the RDMA labels to CMA so it makes sense to external reviewers. I need to look at these older kernels everyone uses to see what APIs are available.

Comment by Doug Oucharek (Inactive) [ 02/Sep/16 ]

Any kernel wisdom you can bring to the solution will be appreciated. I have no idea what CMA is. Out of the loop.

Comment by Alexey Lyashkov [ 03/Sep/16 ]

James,

The problem is simple: it's a bug (or feature) in o2iblnd which was copied to GNI, in the kiblnd_init_rdma() function.

        while (resid > 0) {
                if (srcidx >= srcrd->rd_nfrags) {
                        CERROR("Src buffer exhausted: %d frags\n", srcidx);
                        rc = -EPROTO;
                        break;
                }

                if (dstidx == dstrd->rd_nfrags) {
                        CERROR("Dst buffer exhausted: %d frags\n", dstidx);
                        rc = -EPROTO;
                        break;
                }

                if (tx->tx_nwrq >= conn->ibc_max_frags) {
                        CERROR("RDMA has too many fragments for peer %s (%d), "
                               "src idx/frags: %d/%d dst idx/frags: %d/%d\n",
                               libcfs_nid2str(conn->ibc_peer->ibp_nid),
                               conn->ibc_max_frags,
                               srcidx, srcrd->rd_nfrags,
                               dstidx, dstrd->rd_nfrags);
                        rc = -EMSGSIZE;
                        break;
                }

                wrknob = MIN(MIN(kiblnd_rd_frag_size(srcrd, srcidx),
                                 kiblnd_rd_frag_size(dstrd, dstidx)), resid); <<< this line is the problem

                sge = &tx->tx_sge[tx->tx_nwrq];
                sge->addr   = kiblnd_rd_frag_addr(srcrd, srcidx);
                sge->lkey   = kiblnd_rd_frag_key(srcrd, srcidx);
                sge->length = wrknob;

                wrq = &tx->tx_wrq[tx->tx_nwrq];

                wrq->next       = wrq + 1;
                wrq->wr_id      = kiblnd_ptr2wreqid(tx, IBLND_WID_RDMA);
                wrq->sg_list    = sge;
                wrq->num_sge    = 1;
                wrq->opcode     = IB_WR_RDMA_WRITE;
                wrq->send_flags = 0;

                wrq->wr.rdma.remote_addr = kiblnd_rd_frag_addr(dstrd, dstidx);
                wrq->wr.rdma.rkey        = kiblnd_rd_frag_key(dstrd, dstidx);

                srcidx = kiblnd_rd_consume_frag(srcrd, srcidx, wrknob);
                dstidx = kiblnd_rd_consume_frag(dstrd, dstidx, wrknob);

                resid -= wrknob;

                tx->tx_nwrq++;
                wrq++;
                sge++;
        }

So a source transfer with segment sizes {128, 4096} and destination segments {4096}, {4096} will be mapped into {128}, {4096-128}, {128} segments. In general this needs twice as many SGEs/WRs to send the same amount of data, as one source segment now needs two destination segments on the router. A single large buffer, however, would hold all of the source segments without knowing anything about their sizes.

Probably someone may find a better solution by just fixing this loop, which would avoid the problem entirely without any protocol or memory pool changes.
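
A small stand-alone simulation of the loop above (user-space C, no IB structures, only the MIN()/consume arithmetic) reproduces the numbers from the original report: with a 128-byte source offset against page-aligned destination fragments it trips the 256-WR limit at src/dst index 128, i.e. the "128/256 src 128/256 dst frags" in the errors above. It is illustrative only, not a test of the real code:

        #include <stdio.h>

        #define NFRAGS  256     /* max fragments and max WRs, as in the errors above */
        #define PAGE    4096
        #define OFFSET  128     /* unaligned start of the source buffer */

        int main(void)
        {
                /* source: first fragment shortened by the offset, the rest full pages;
                 * destination (router buffer): every fragment page aligned */
                long resid = (long)NFRAGS * PAGE - OFFSET;      /* bytes left to transfer */
                long src_left = PAGE - OFFSET, dst_left = PAGE;
                int srcidx = 0, dstidx = 0, nwrq = 0;

                while (resid > 0) {
                        long wrknob;

                        if (nwrq >= NFRAGS) {   /* mirrors the ibc_max_frags check */
                                printf("too fragmented: %d/%d src, %d/%d dst after %d WRs\n",
                                       srcidx, NFRAGS, dstidx, NFRAGS, nwrq);
                                return 0;
                        }

                        /* wrknob = MIN(src frag left, dst frag left, resid) */
                        wrknob = src_left < dst_left ? src_left : dst_left;
                        if (wrknob > resid)
                                wrknob = resid;

                        nwrq++;                 /* one SGE/WR per iteration */
                        src_left -= wrknob;
                        dst_left -= wrknob;
                        resid    -= wrknob;

                        if (src_left == 0) { srcidx++; src_left = PAGE; }
                        if (dst_left == 0) { dstidx++; dst_left = PAGE; }
                }
                printf("done: %d src frags, %d dst frags, %d WRs\n", srcidx, dstidx, nwrq);
                return 0;
        }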

Comment by Thomas Stibor [ 15/Sep/16 ]

Hi there,

we encountered the RDMA too fragmented problem without LNET routers on nearly all clients where a kind of strange function call pattern was executed:
Example from one of the clients:

...
Sep  2 06:05:28 lxbk0101 kernel: [2569481.266678] LNetError: 1407:0:(o2iblnd_cb.c:1140:kiblnd_init_rdma()) RDMA too fragmented for 10.20.0.250@o2ib1 (256): 241/256 src 241/256 dst frags
Sep  2 06:05:28 lxbk0101 kernel: [2569481.269159] LNetError: 1407:0:(o2iblnd_cb.c:1140:kiblnd_init_rdma()) Skipped 1 previous similar message
Sep  2 06:05:28 lxbk0101 kernel: [2569481.270325] LNetError: 1407:0:(o2iblnd_cb.c:1690:kiblnd_reply()) Can't setup rdma for GET from 10.20.0.250@o2ib1: -90
Sep  2 06:05:28 lxbk0101 kernel: [2569481.271498] LNetError: 1407:0:(o2iblnd_cb.c:1690:kiblnd_reply()) Skipped 1 previous similar message
....
....

As a consequence of the error, the Lustre client lost the corresponding OST with the message (lfs check osts): Resource temporarily unavailable (11).
However, this behavior occurred only when the process was close to reaching the soft/hard quota. Switching quota completely off, or being at least 50% away from the soft/hard quota, did not trigger the problem.

MDS/OSS are on Lustre 2.5.3, clients are on Lustre 2.6.

The (strange) function call pattern is:

.....
write(2, "Info in <CbmPlutoGenerator::Read"..., 96) = 96
write(2, "Info in <CbmPlutoGenerator::Read"..., 75) = 75
write(2, "Info in <CbmPlutoGenerator::Read"..., 96) = 96
write(1, "BoxGen: kf=1000010020, p=(0.20, "..., 285) = 285
write(1, " GTREVE_ROOT : Transporting prim"..., 8151) = 8151
lseek(18, 580977933, SEEK_SET)          = 580977933
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, "\0\0005]\3\354\0.\353\24VO*\353\0V\0<\0\0\0\0\"\241\5\r\0\0\0\0\0\0"..., 13661) = 13661
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 580991594, SEEK_SET)          = 580991594
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, "\0\0;1\3\354\0.\353\24VO*\353\0R\0<\0\0\0\0\"\241:j\0\0\0\0\0\0"..., 15153) = 15153
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 581006747, SEEK_SET)          = 581006747
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, "\0\2\3Y\3\354\0.\353\24VO*\353\0W\0<\0\0\0\0\"\241u\233\0\0\0\0\0\0"..., 131929) = 131929
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 581138676, SEEK_SET)          = 581138676
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, "\0\3\4\16\3\354\0.\353\24VO*\353\0U\0<\0\0\0\0\"\243x\364\0\0\0\0\0\0"..., 197646) = 197646
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 581336322, SEEK_SET)          = 581336322
...
...
...

We were able to simulate the pattern with a simple Ruby script and were able to reproduce the RDMA too fragmented error:

#!/usr/bin/env ruby

# Generate a 1MB buffer
data = String.new
100000.times do |i|
  data << '0123456789'
end

File.open(ARGV.first, 'w+') do |f|
  512.times do |i|
    offset = i * 1000000
    puts offset
    f.seek(offset, IO::SEEK_SET) # SEEK_SET seeks from the beginning of the file
    f.write(data) # the write call already sets the file cursor where we will seek to in the next cycle
  end
end

The RDMA too fragmented error was triggered after the 5th, 6th, or sometimes the 7th loop iteration.

We are currently running more investigations on dedicated Lustre client machines; however, it looks like by setting max_pages_per_rpc=64 (=> 4 * 64 = 256), thus matching it with

./lnet/include/lnet/types.h:#define LNET_MAX_IOV    256
klnds/o2iblnd/o2iblnd.h:#define IBLND_MAX_RDMA_FRAGS         LNET_MAX_IOV           /* max # of fragments supported */

the problem no longer occurs within the routine

if (tx->tx_nwrq == IBLND_RDMA_FRAGS(conn->ibc_version)) {
                        CERROR("RDMA too fragmented for %s (%d): "
                               "%d/%d src %d/%d dst frags\n",
                               libcfs_nid2str(conn->ibc_peer->ibp_nid),
                               IBLND_RDMA_FRAGS(conn->ibc_version),
                               srcidx, srcrd->rd_nfrags,
                               dstidx, dstrd->rd_nfrags);
                        rc = -EMSGSIZE;
                        break;
                }

So far the problem seems to be gone by setting max_pages_per_rpc=64.
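
If it helps anyone else trying this workaround: the RPC size cap is normally applied per OSC device, e.g. with something like the following (the exact pattern depends on your filesystem name, so treat it as an example rather than a prescription):

    lctl set_param osc.*.max_pages_per_rpc=64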

Comment by James A Simmons [ 20/Sep/16 ]

By setting max_pages_per_rpc=64 this means you are below the 256 page limit so you will never hit that issue.

Doug I tried the new lnet selftest patch with offsets of 64 and 256 and I didn't see any fragmentation issues. Have you been able to reproduce the problem?

Comment by Alexey Lyashkov [ 21/Sep/16 ]

James,

did you use a 1MB transfer size for lnet selftest with the offset?

P.S. max_pages_per_rpc=128 should be enough to avoid the problem, as there are twice as many fragments on the router as pages.

Comment by Doug Oucharek (Inactive) [ 22/Sep/16 ]

Thomas: I have checked, and so far everyone who has seen this issue in production was using the quotas feature. You may be on to something. One theory is that quotas can trigger a syncio which may cause a page misalignment. I'm going to look more into this.

I have also run into two examples where the "RDMA too fragmented" error was encountered when a router was not used.

Olaf: are you certain that the destination will match the same starting offset when a router is not present? If so, we should never see this error without a router.

If this issue can happen without a router, then the fix which uses large, contiguous buffers on routers won't cover all cases. The original fix by Liang, however, will.

I'm proposing that we land that fix but make the default wrq_sge=1 so it is off by default. Then anyone running into this error can turn on the fix on the systems which exhibit it.

In the meantime, I will see if I can better understand how/why quota is triggering this and see if that can be resolved.

Comment by Doug Oucharek (Inactive) [ 22/Sep/16 ]

Note: one example of this issue when routers are not present was on the discussion board: https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg12963.html

Comment by Andreas Dilger [ 22/Sep/16 ]

When quotas are enabled, it is possible that if a user hits the quota limit this will force sync writes from the client. That's similar to the client doing O_DIRECT writes.

It is a bit confusing, however, since I can't imagine why this would cause unaligned writes: the pages should have previously been fully fetched to the client, so the whole page should be written in this case and the write should be page aligned. AFAIK, only O_DIRECT should be able to generate partial-page writes; anything else is a bug IMHO.

Rather than transferring all of the pages misaligned, my strong preference would be to fix the handling of the first page, and then send the rest of the pages properly. Is the lack of ability to send the first partial page a problem at the Lustre level or at LNet? If it is a bug in the way Lustre generates the bulk requests then this could (also) be fixed, even if there also needs to be a temporary fix in LNet as well.

Comment by Gerrit Updater [ 05/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12496/
Subject: LU-5718 lnet: add offset for selftest brw
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: efcef00cb3043b5f8661174fd80626b3dc0edc50

Comment by Christopher Morrone [ 23/Feb/17 ]

Just a "me too": We are hitting this with 2.8.0 on an OmniPath network.  Client gets stuck with "RDMA has too many fragments for peer" sending to a router node.  If it matters, we have one client isolated right now in this state.  Hopefully the problem is understood at this point, but if you need some information gathering let us know what you want us to look for.

 

Comment by James A Simmons [ 24/Feb/17 ]

The problem is that the proposed patch breaks our systems. It causes our clients to go into a reconnect storm.

Comment by Doug Oucharek (Inactive) [ 24/Feb/17 ]

The summary of this issue is thus:

  • The patch associated with this ticket solves the problem on all node types, but as James is reporting above, is causing problems with his clients (PPC-based).
  • The patch associated with LU-7385 has been tested and works, but only for LNet routers.  If you see this issue on other node types, that patch will not help you.
  • If neither of the above two patches work, then we are going to need to devise a third option which does not yet exist.
  • Because there is no universal agreement on how to fix this problem, nothing has landed yet so customers are running with patches.  As such, we now have a bit of a "grab-bag" with regards to this issue.

Ultimately, it would be good to get some agreement on how to fix this and land something.

Comment by Christopher Morrone [ 24/Feb/17 ]

This issue is explicitly about a client problem.  So LU-7385 doesn't apply.  Got it.

"The fix", I assume, refers to change 12451??

We too have PPC clients. Not sure if any of them will get Lustre 2.8+. But I would not particularly like to gamble on a patch that Intel hasn't committed to yet.

Comment by Doug Oucharek (Inactive) [ 24/Feb/17 ]

Yes, the patch for this ticket is 12451.  It is "off" by default and needs a module parameter to turn it on.  As such, it should be safe.

James: if this patch is off, do you have an issue with your clients?  If so, then we have to debug that.  If not, then technically we can land the patch, and those who can use it just have to turn it on.

Comment by Christopher Morrone [ 24/Feb/17 ]

Speaking as a customer, a "fix" that requires me to manually go in and change a configuration to tell Lustre "no, really, please don't be broken" is not a terribly satisfactory solution. Please work on a solution that will make Lustre work out of the box.

Comment by James A Simmons [ 07/Mar/17 ]

I attached my router logs that show the problem with this patch.

Comment by Doug Oucharek (Inactive) [ 09/Mar/17 ]

James: I looked at your router logs.  All but rtr5 have no o2iblnd logs, only gnilnd.  Rtr5 has some new logs you must have added to debug this issue.  Stuff like "(130)++" seems to be a counter you are keeping track of.  What is it?  Is it counting queue depth?  Is that the problem you are running into?

Comment by James A Simmons [ 09/Mar/17 ]

That is from kiblnd_conn_addref(), which is an inline function defined in o2iblnd.h.

Comment by Doug Oucharek (Inactive) [ 10/Mar/17 ]

I'm not seeing the client reconnect storm in those logs.  Are neterr logs turned off?

Comment by Doug Oucharek (Inactive) [ 11/Apr/17 ]

James: give the latest patch version, 10, a try on PPC.  I believe I fixed the PPC issue with the patch.

Comment by Stephane Thiell [ 12/Apr/17 ]

Hi,

We just hit this problem on brand new 2.9 clients, only on a bigmem node, leading to deadlocked writes on our /scratch. We are using EE3 servers with LNet routers (they are all already patched for this, see DELL-221).

As we think we have a basic use case here, with only a few processes reading from a single file and writing to multiple files, apparently doing (nice) 4M I/Os before the deadlock occurred, we took a crash dump, which is available for download at the following link:

https://stanford.box.com/s/d37761k3ywukxh7im9mq8mgp9m2gkpga

It shows deadlocked writes after the RDMA too fragmented errors.

Kernel version is 3.10.0-514.10.2.el7.x86_64 on el7

Hope this helps...

Stephane

 

 

Comment by James A Simmons [ 13/Apr/17 ]

Currently my test system that has this problem is down until middle of next week. As soon as it is back I will test it.

Comment by Gerrit Updater [ 26/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/12451/
Subject: LU-5718 o2iblnd: multiple sges for work request
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fda19c748016c9f57f71278b597fd8a651268f66

Comment by Peter Jones [ 26/Apr/17 ]

Landed for 2.10

Comment by James A Simmons [ 26/Apr/17 ]

Just to let you know I'm in the process of testing this patch and the latest patch seems to be holding up. Good work Doug.

Comment by Dmitry Eremin (Inactive) [ 03/May/17 ]

After the last patch landed I got the following error:

[4020251.265904] LNetError: 95052:0:(o2iblnd_cb.c:1086:kiblnd_init_rdma()) RDMA is too large for peer 192.168.213.235@o2ib (131072), src size: 1048576 dst size: 1048576
[4020251.265941] LNetError: 95050:0:(o2iblnd_cb.c:1720:kiblnd_reply()) Can't setup rdma for GET from 192.168.213.235@o2ib: -90
[4020251.265948] LustreError: 95050:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff8816e0754c00
...
[4020251.267492] Lustre: 95098:0:(client.c:2115:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1493833318/real 1493833318]  req@ffff8817e9e48000 x1566397691863184/t0(0) o4->
[4020251.267503] Lustre: nvmelfs-OST000b-osc-ffff881c8e362000: Connection to nvmelfs-OST000b (at 192.168.213.236@o2ib) was lost; in progress operations using this service will wait for recovery to complete
...
[4020251.267965] LustreError: 95050:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff880223361400
[4020251.268058] Lustre: nvmelfs-OST000b-osc-ffff881c8e362000: Connection restored to 192.168.213.236@o2ib (at 192.168.213.236@o2ib)
...
[4020256.133400] LNetError: 95052:0:(o2iblnd_cb.c:1086:kiblnd_init_rdma()) RDMA is too large for peer 192.168.213.235@o2ib (131072), src size: 1048576 dst size: 1048576
[4020256.133561] LNetError: 95049:0:(o2iblnd_cb.c:1720:kiblnd_reply()) Can't setup rdma for GET from 192.168.213.235@o2ib: -90
[4020256.133564] LNetError: 95049:0:(o2iblnd_cb.c:1720:kiblnd_reply()) Skipped 159 previous similar messages
[4020256.133569] LustreError: 95049:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88192932fe00
[4020256.133630] Lustre: 95125:0:(client.c:2115:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1493833323/real 1493833323]  req@ffff882031360300 x1566397691866144/t0(0) o4->
[4020256.133634] Lustre: 95125:0:(client.c:2115:ptlrpc_expire_one_request()) Skipped 39 previous similar messages
[4020256.133654] Lustre: nvmelfs-OST000e-osc-ffff881c8e362000: Connection to nvmelfs-OST000e (at 192.168.213.235@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[4020256.133656] Lustre: Skipped 39 previous similar messages
[4020256.134200] Lustre: nvmelfs-OST000e-osc-ffff881c8e362000: Connection restored to 192.168.213.235@o2ib (at 192.168.213.235@o2ib)
[4020256.134202] Lustre: Skipped 39 previous similar messages

The system is partially working. I'm able to see the list of files and open small files. But large bulk transfers don't work.

Comment by Doug Oucharek (Inactive) [ 03/May/17 ]

This was addressed by a patch to LU-9420.  I would have pulled this patch to fix it under this ticket, but the patch took 2 years to land and I was not about to pull it for fear it would take another 2 years to re-land :^(.

Comment by Dmitry Eremin (Inactive) [ 03/May/17 ]

Uff. It looks like I was unlucky enough to be using a build without this fix.

Comment by Stephane Thiell [ 25/May/17 ]

Hi,
Could you please explain what is required to make the patches that landed work? We have tried 2.9 FE + patches from both LU-5718 and LU-9420 but are still seeing the problem on the routers. We have set wrq_sge=2 on the clients, and left the default wrq_sge=1 on the routers. We have not been able to patch the servers at the moment (running IEEL3), see DELL-221.

On the router with wrq_sge=1 (10.210.34.213@o2ib1 is an unpatched OSS):

[ 1111.504575] LNetError: 8688:0:(o2iblnd_cb.c:1093:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.210.34.213@o2ib1 (256), src idx/frags: 128/147 dst idx/frags: 128/147
[ 1111.522352] LNetError: 8688:0:(o2iblnd_cb.c:430:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.210.34.213@o2ib1: -90

Clients and routers are using mlx5, servers are using mlx4.

Thanks,
Stephane

Comment by Chris Horn [ 25/May/17 ]

You need to set wrq_sge=2 on the routers, too.

Comment by Doug Oucharek (Inactive) [ 25/May/17 ]

Your router needs wrq_sge=2.
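
For anyone else landing here: wrq_sge is a ko2iblnd module parameter, so on the affected nodes it would typically go into a modprobe configuration file (the file name below is just an example) and requires reloading the module or rebooting to take effect:

    # e.g. /etc/modprobe.d/ko2iblnd.conf
    options ko2iblnd wrq_sge=2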

Comment by Stanford Research Computing Center [ 25/May/17 ]

Ah! Thanks for the clarification, Chris and Doug! I was a bit lost as the parameters changed over the course of the work done in this ticket. We'll test this right away.
All the best,
Stephane

Comment by Cory Spitz [ 06/Jun/17 ]

Looks like we should have opened a LUDOC ticket to document wrq_sge.

Comment by Cory Spitz [ 09/Jun/17 ]

LUDOC-378 is linked to this issue.
