[LU-5718] RDMA too fragmented with router Created: 08/Oct/14 Updated: 14/Jun/19 Resolved: 03/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Johann Lombardi (Inactive) | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16043 |
| Description |
|
Got an IOR failure on the soak cluster with the following errors:
Oct 7 21:54:01 lola-23 kernel: LNetError: 3613:0:(o2iblnd_cb.c:1134:kiblnd_init_rdma()) RDMA too fragmented for 192.168.1.115@o2ib100 (256): 128/256 src 128/256 dst frags
Oct 7 21:54:01 lola-23 kernel: LNetError: 3618:0:(o2iblnd_cb.c:428:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.1.114@o2ib100: -90
Oct 7 21:54:01 lola-23 kernel: LNetError: 3618:0:(o2iblnd_cb.c:428:kiblnd_handle_rx()) Skipped 7 previous similar messages
Liang told me that this is a known issue with routing. That said, the IOR process is not killable and the only option is to reboot the client node. We should at least fail "gracefully" by returning the error to the application. |
| Comments |
| Comment by Liang Zhen (Inactive) [ 28/Oct/14 ] |
|
patch is here: http://review.whamcloud.com/12451 |
| Comment by Chris Horn [ 28/Oct/14 ] |
|
Johann/Liang, any tips for reproducing this issue? |
| Comment by Liang Zhen (Inactive) [ 29/Oct/14 ] |
|
I think Johann hit this while running some mixed workloads with routers. I will patch lnet_selftest to support brw with an offset, which should be able to reproduce this issue. |
| Comment by Alexey Lyashkov [ 30/Oct/14 ] |
|
I'm not sure the patch is correct.
Oct 7 21:54:01 lola-23 kernel: LNetError: 3613:0:(o2iblnd_cb.c:1134:kiblnd_init_rdma()) RDMA too fragmented for 192.168.1.115@o2ib100 (256): 128/256 src 128/256 dst frags
I think the main reason for it is an incorrect calculation at the osc/ptlrpc layer, which is already responsible for checking the number of fragments for a bulk transfer. |
| Comment by Chris Horn [ 05/Nov/14 ] |
|
Liang, do you have an LU ticket for the lnet_selftest enhancement you mentioned? |
| Comment by Liang Zhen (Inactive) [ 06/Nov/14 ] |
|
Hi Chris, I didn't create another ticket for selftest, but I have posted patch for it: http://review.whamcloud.com/#/c/12496/ |
| Comment by Chris Horn [ 22/Apr/15 ] |
|
We had a site report seeing an error with this patch when they set peer_credits > 16:
LNetError: 2641:0:(o2iblnd.c:872:kiblnd_create_conn()) Can't create QP: -12, send_wr: 16191, recv_wr: 254, send_sge: 2, recv_sge: 1 |
| Comment by Liang Zhen (Inactive) [ 23/Apr/15 ] |
|
Chris, I don't think this is an issue caused by this patch, because it does not consume extra memory. I suspect it is because connd may reconnect aggressively when there is a connection race; I will post a patch for this. |
| Comment by Chris Horn [ 23/Apr/15 ] |
|
Thanks, Liang. FWIW, they only see that error with this patch applied, and when they set "options ko2iblnd wrq_sge=1" the error goes away... |
| Comment by Isaac Huang (Inactive) [ 23/Apr/15 ] |
|
Liang, I think the patch could cause increased memory overhead at the OFED layer and the layers beneath it, since init_qp_attr->cap.max_send_sge is doubled. |
| Comment by Alexey Lyashkov [ 23/Apr/15 ] |
|
Isaac, do you remember my comments about additional memory issues with that patch?... |
| Comment by Isaac Huang (Inactive) [ 23/Apr/15 ] |
|
Alexey, that's the price to pay - there's no free lunch. |
| Comment by Gerrit Updater [ 27/Apr/15 ] |
|
Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/14600 |
| Comment by Liang Zhen (Inactive) [ 27/Apr/15 ] |
|
Isaac, indeed, thanks for pointing out. |
| Comment by Alexey Lyashkov [ 27/Apr/15 ] |
|
Per discussion with Mellanox people, they are not happy with increasing the number of fragments for their own IB cards.
qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof (u64), GFP_KERNEL);
qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof (u64), GFP_KERNEL);
I agree with Isaac that there is no free lunch, but with that patch you may stop being able to handle a large number of connections, such as router <-> client links.
#define IBLND_SEND_WRS(v) ((IBLND_RDMA_FRAGS(v) + 1) * IBLND_CONCURRENT_SENDS(v)) |
| Comment by Chris Horn [ 14/May/15 ] |
|
Liang, I haven't had a chance to reproduce the QP allocation failure internally, so I haven't tested your patch. I agree with Alexey that a big part of our problem is the large kmallocs we're doing. The site that hit this issue is using ConnectIB cards with the mlx5 drivers (I only have access to ConnectX/mlx4 cards internally). I haven't looked at the driver code before, but it looks to me like we're not just doing the one 256k allocation noted by Alexey (I'm pretty sure the qp->rq.wrid kmalloc is for just 2048 bytes), but it looks like we're doing four of them:
qp->rq.wqe_cnt = 256
qp->sq.wqe_cnt = 32768
qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(*qp->sq.wrid), GFP_KERNEL); // 262144 bytes
qp->sq.wr_data = kmalloc(qp->sq.wqe_cnt * sizeof(*qp->sq.wr_data), GFP_KERNEL); // 262144 bytes
qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(*qp->rq.wrid), GFP_KERNEL); // 2048 bytes
qp->sq.w_list = kmalloc(qp->sq.wqe_cnt * sizeof(*qp->sq.w_list), GFP_KERNEL); // 262144 bytes
qp->sq.wqe_head = kmalloc(qp->sq.wqe_cnt * sizeof(*qp->sq.wqe_head), GFP_KERNEL); // 262144 bytes
The reason we have such large allocations is that we set peer_credits=126 and concurrent_sends=63 in order to deal with the huge amount of small messages generated by Lustre client pings at large scale (see https://cug.org/proceedings/attendee_program_cug2012/includes/files/pap166.pdf for details). The site that reported the QP allocation failure did try different values of peer_credits, and they found that the only values that worked were peer_credits=8 and peer_credits=16. This was on a small TDS system with just two LNet routers (I’m still waiting to find out the total number of IB peers). Interestingly, we've deployed the multiple SGEs patch at another (very) large site that uses ConnectX/mlx4 drivers, and they have not seen this issue. So I'm wondering if there's a difference in the driver code that is making this more likely. |
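As a cross-check of the numbers above, here is a minimal standalone sketch of the arithmetic: send_wr follows from the IBLND_SEND_WRS() macro quoted earlier in this ticket, while the 32768 wqe_cnt is simply the value observed in the mlx5 driver above, not derived here.
#include <stdio.h>
#include <stdint.h>

/* Sketch only: reproduces the arithmetic discussed in this ticket.
 * send_wr comes from (IBLND_RDMA_FRAGS + 1) * IBLND_CONCURRENT_SENDS;
 * sq_wqe_cnt is the value observed in the mlx5 driver. */
int main(void)
{
	const uint32_t rdma_frags = 256;	/* IBLND_MAX_RDMA_FRAGS */
	const uint32_t concurrent_sends = 63;	/* with peer_credits=126 */
	const uint32_t sq_wqe_cnt = 32768;	/* observed qp->sq.wqe_cnt */

	/* (256 + 1) * 63 = 16191, matching "send_wr: 16191" in the QP failure */
	printf("send_wr    = %u\n", (rdma_frags + 1) * concurrent_sends);

	/* each per-WQE bookkeeping array is wqe_cnt * sizeof(u64) bytes,
	 * i.e. 262144 bytes, and there are four such kmallocs */
	printf("wrid bytes = %u\n", sq_wqe_cnt * (uint32_t)sizeof(uint64_t));
	return 0;
}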
| Comment by Gerrit Updater [ 04/Sep/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14600/ |
| Comment by Alexey Lyashkov [ 04/Sep/15 ] |
|
Could you explain why you closed this ticket with an unrelated patch? |
| Comment by Alexey Lyashkov [ 04/Sep/15 ] |
|
The reconnect problem is a completely different problem and needs its own ticket, but that patch never addressed the wrong alignment of the router buffers. |
| Comment by Joseph Gmitter (Inactive) [ 04/Sep/15 ] |
|
Hi Alexey, |
| Comment by Frank Heckes (Inactive) [ 28/Sep/15 ] |
|
The error still happens during soak testing of 2_7_59 + debug patch. There are 173 messages of the form:
Sep 27 09:58:24 lola-27 kernel: Lustre: 3698:0:(client.c:2040:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443372961/real 1443372961] req@ffff880588fbd0c0 x1513236214216272/t0(0) o4->soaked-OST0008-osc-ffff880818748800@192.168.1.102@o2ib10:6/4 lens 608/448 e 2 to 1 dl 1443373079 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Sep 27 09:58:24 lola-27 kernel: Lustre: 3698:0:(client.c:2040:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 27 09:58:24 lola-27 kernel: Lustre: soaked-OST0008-osc-ffff880818748800: Connection to soaked-OST0008 (at 192.168.1.102@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Sep 27 09:58:24 lola-27 kernel: Lustre: soaked-OST0008-osc-ffff880818748800: Connection restored to soaked-OST0008 (at 192.168.1.102@o2ib10)
Sep 27 09:58:24 lola-27 kernel: LNetError: 3675:0:(o2iblnd_cb.c:1139:kiblnd_init_rdma()) RDMA too fragmented for 192.168.1.114@o2ib100 (256): 128/233 src 128/233 dst frags
Sep 27 09:58:24 lola-27 kernel: LNetError: 3675:0:(o2iblnd_cb.c:435:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.1.114@o2ib100: -90
which seems to correlate with the same number of errors on the OSS node (lola-2):
Sep 27 09:57:41 lola-2 kernel: LustreError: 8847:0:(ldlm_lib.c:3017:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff8801c9e476c0 x1513236214216272/t0(0) o4->076bba0c-23e4-e9cc-96e8-bd39615184cd@192.168.1.127@o2ib100:318/0 lens 608/448 e 2 to 0 dl 1443373078 ref 1 fl Interpret:H/0/0 rc 0/0
Sep 27 09:57:41 lola-2 kernel: Lustre: soaked-OST0008: Bulk IO write error with 076bba0c-23e4-e9cc-96e8-bd39615184cd (at 192.168.1.127@o2ib100), client will retry: rc -110
Sep 27 09:58:24 lola-2 kernel: Lustre: soaked-OST0008: Client 076bba0c-23e4-e9cc-96e8-bd39615184cd (at 192.168.1.127@o2ib100) reconnecting |
| Comment by Gerrit Updater [ 21/Dec/15 ] |
|
Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/17699 |
| Comment by James A Simmons [ 21/Dec/15 ] |
|
Please don't revert it; it really did help relieve our router memory pressures. I really think the |
| Comment by Andreas Dilger [ 21/Dec/15 ] |
James, as yet this patch has not landed; even the reversion needs to go through build and test since it is so old. Are the fixes on top of |
| Comment by Liang Zhen (Inactive) [ 23/Dec/15 ] |
|
James, I think it is better to revert it for the time being; the patch is headed in the right direction, but it is faulty and opened a few race windows. Instead of adding fixes on top of it, I think it's better to just revert it and do a better implementation. I will work out another patch for the memory issue based on this patch and http://review.whamcloud.com/17527 . Andreas, I agree the situation is worse than without 14600 because it is faulty, sorry for that. But it is very helpful for the memory issue that people have hit for years, so I will rework the patch. |
| Comment by Gerrit Updater [ 08/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17699/ |
| Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ] |
|
Given the true fix will be done in |
| Comment by Chris Horn [ 08/Jan/16 ] |
|
Is |
| Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ] |
|
Hmm...I was assuming that http://review.whamcloud.com/#/c/14600/ was the fix for this issue and that we had to revert it because it caused other problems. That patch is being redone under
As I look at the history, I'm not convinced that http://review.whamcloud.com/#/c/14600/ was addressing the original problem. Does anyone know what the state of the original issue is? I fear we have been trying to tackle too many items here. |
| Comment by Chris Horn [ 08/Jan/16 ] |
|
AFAIK, the original fragmentation issue still exists. The 14600 patch was, IMO, inappropriately linked to this ticket, and never addressed the fragmentation error. Hence this ticket remained open even though the 14600 patch had landed. |
| Comment by James A Simmons [ 08/Jan/16 ] |
|
The reason 14600 was created was to fix the huge memory pressure caused by the other patch for this ticket, 12451. Patch 12451 was never merged, but 14600 was. Also, there has been debate about whether 12451 fixes the issue in the right way, which is why it was never merged. See the comment history here. This still needs to be investigated. |
| Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ] |
|
Ok. That being the case, I am re-opening this ticket to address the fragmentation of memory issue. Let's not do any more reconnection fixes here :^). |
| Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ] |
|
I'm starting to believe that the fix for this issue is the same as for |
| Comment by James A Simmons [ 29/Aug/16 ] |
|
So we have two not so hot solutions |
| Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ] |
|
My big question is: why do we have an offset? Is this caused by partial reads/writes in the file system? James: Is this going to result not in an error, but a crash after your change under |
| Comment by James A Simmons [ 29/Aug/16 ] |
|
It shouldn't crash, since all allocations are sized to 1 + IBLND_MAX_RDMA_FRAGS to work around this issue. I know it's an ugly solution, but it will hold us over until we move to the netlink API. |
| Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ] |
|
I don't think just adding 1 to MAX_RDMA_FRAGS is enough. Here is what I think is happening, and I really need others to tell me whether my understanding is wrong or they agree, so we can quickly move to fix this. We have customers adding LNet routers and running into bulk RDMA failures due to this issue, so fixing this has just become a very high priority.
1. A bulk operation is sent to the LNet router where the first fragment has an offset, so the full 4k (assuming 4k pages) is not used in the first RDMA buffer.
So this issue seems to be caused by the fact that the 1st fragment of the source is < 4k while the destination is 4k. It causes us to use twice as many work queue items as fragments.
It would seem that a proper solution would be to shift the offset of the destination forward to match the source, so both source and destination have the same sized 1st fragments. I have no clue how to do that and am open to suggestions.
Another solution is to have 512 work queue items on LNet routers to accommodate this particular situation. Not sure we can do that given all the funky FMR/fast reg code out there.
Yet another solution is to do what was done in
Thoughts from anyone? |
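To make the doubling concrete, here is a minimal userspace sketch (an illustration, not the kernel code) of the min(src, dst) fragment-consumption behaviour described above, for a 1 MB transfer whose source starts 128 bytes into a page while the destination fragments are page aligned:
#include <stdio.h>

#define PAGE_SIZE 4096
#define NPAGES    256			/* 1 MB transfer, LNET_MAX_IOV pages */

/* Count work requests needed when each WR covers
 * min(remaining source fragment, remaining destination fragment). */
static int count_work_requests(int src_offset)
{
	long resid = (long)NPAGES * PAGE_SIZE;
	long src_left = PAGE_SIZE - src_offset;	/* short first source frag */
	long dst_left = PAGE_SIZE;		/* page-aligned destination */
	int nwrq = 0;

	while (resid > 0) {
		long chunk = src_left < dst_left ? src_left : dst_left;

		if (chunk > resid)
			chunk = resid;
		resid -= chunk;
		src_left -= chunk;
		dst_left -= chunk;
		if (src_left == 0)
			src_left = PAGE_SIZE;
		if (dst_left == 0)
			dst_left = PAGE_SIZE;
		nwrq++;
	}
	return nwrq;
}

int main(void)
{
	printf("aligned source : %d WRs\n", count_work_requests(0));	/* 256 */
	printf("offset source  : %d WRs\n", count_work_requests(128));	/* 512 */
	return 0;
}
With a 256-entry limit, the offset case runs out of work-queue entries roughly halfway through the transfer, which matches the "128/256 src 128/256 dst frags" seen in the original error.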
| Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ] |
|
Another possible solution is to break the assumption that we need to fill up each destination fragment before advancing to the next destination index. If we always advance the destination index when we advance the source index, this problem would go away. However, it would mean that the destination fragments have to be the same size as the source. But I believe James has found that this must be true anyway for multiple reasons. The code is just not in shape to have different fragment sizes. |
| Comment by James A Simmons [ 29/Aug/16 ] |
|
Correct. The fragment sizes must match on both sides. The max fragment count is so important that it is transmitted over the wire. The thing is that we allocate all our buffers for the worst-case scenario at ko2iblnd initialization. We really should be allocating them dynamically based on what the remote connection can support. Anyway, it's acceptable to handle the problem as you described. Currently I can't duplicate this problem. Are there known configurations/setups that expose this? Do you need a specific workload for this to show up? |
| Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ] |
|
Don't have a profile yet. Working on it. Trying to get file system guys to describe the need for offset. Seagate added an "offset" parameter to the lnet-selftest command set so you can reproduce this issue. See |
| Comment by Doug Oucharek (Inactive) [ 29/Aug/16 ] |
|
Also, I don't understand why this is only happening with LNet routers and not direct RDMA operations. In theory, it should happen everywhere. I'm really missing something and the code is not making it obvious. |
| Comment by Doug Oucharek (Inactive) [ 30/Aug/16 ] |
|
The one scenario given to me which could cause an offset is using O_DIRECT read or write on a 512-byte sector boundary. Possibly you also have to mix this with non-O_DIRECT operations (not sure). |
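For illustration only, a small sketch of such an I/O pattern (the file path and sizes are made up): an O_DIRECT write that is aligned to a 512-byte sector but lands in the middle of a 4 KiB page, which would give the resulting bulk a non-zero page offset.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	/* hypothetical test file on a Lustre mount */
	int fd = open("/mnt/lustre/odirect_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);

	if (fd < 0)
		return 1;
	if (posix_memalign(&buf, 512, 512))	/* sector-aligned buffer */
		return 1;
	memset(buf, 'x', 512);

	/* offset 512 satisfies O_DIRECT sector alignment but is mid-page */
	if (pwrite(fd, buf, 512, 512) != 512)
		return 1;

	free(buf);
	close(fd);
	return 0;
}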
| Comment by Doug Oucharek (Inactive) [ 31/Aug/16 ] |
|
Another, preferable, option is to fix the original patch by Liang, http://review.whamcloud.com/12451. In that patch, was having peer_credits > 16 triggering too many send_wr's? |
| Comment by Olaf Weber [ 31/Aug/16 ] |
The explanation for this difference in behaviour is likely that the source and target both use the offset because it corresponds to (say) an offset in a file. A router, on the other hand, only needs to buffer the message for forwarding, and doesn't need to replicate the offset in its buffer. Note the 0 for the offset parameter in lnet_ni_recv() below.
lnet_parse()
	if (!for_me) {
		rc = lnet_parse_forward_locked(ni, msg);
		lnet_net_unlock(cpt);
		if (rc < 0)
			goto free_drop;
		if (rc == LNET_CREDIT_OK) {
			lnet_ni_recv(ni, msg->msg_private, msg, 0, 0, payload_length, payload_length);
		}
		return 0;
	}
The easiest approach to making RDMA work better here might be to use the offset when buffering routed messages. This should cost at most one extra page per message buffer and result in one extra fragment.
If we really cannot afford to spend the extra page, we could try to use the fact that the partial page at the start + the partial page at the end <= one page, so in principle we can store both fragments in a single page. (There is still the extra fragment to deal with, and we may end up having to debug RDMA engines if it turns out they don't like putting two non-overlapping fragments into the same page.)
To find where this may be coming from, if you have some kind of reproducer, consider putting a WARN_ON() in lnet_md_build() that triggers when umd->start isn't a multiple of the page size. You'll probably want to limit that to the LNET_MD_IOVEC and LNET_MD_KIOV cases, because you'd likely get a warning for each LNet ping otherwise. If you can get to the point where the warnings are only triggered by the cases of interest, but the traces don't provide enough information by themselves, you can change them to a BUG_ON and dig through the core to get the actual function parameters. |
| Comment by Doug Oucharek (Inactive) [ 01/Sep/16 ] |
|
I have given up trying to reproduce this issue from the file system level. Having no luck. Instead, I have updated the lnet-selftest patch which adds an offset parameter. Using that, I have been able to reproduce the issue. Note: before the patch for
What Olaf has suggested sounds good, but I need to provide a production system with a patch ASAP and don't really have the time to investigate that approach. Instead, I'm going to take Liang's original fix, http://review.whamcloud.com/12451, and see if I can resolve the problem found with peer_credits. |
| Comment by Alexey Lyashkov [ 01/Sep/16 ] |
|
Olaf, the router doesn't have such info about the offset, as the sender doesn't fill it in on the message. That information exists in the OSC protocol, as the first KIOV isn't page aligned. A short solution is to just allocate the whole large buffer as a single alloc_pages() call on the router. But that produces problems with TCP <-> IB routing, as LNet would need to be adjusted to be able to send a large buffer to socklnd. |
| Comment by James A Simmons [ 01/Sep/16 ] |
|
Doug, the original patch for this ticket was integrated into our default Cray 2.5 clients. On our systems it broke our routers unless wrq_sge=1 was set. Since you are under pressure, it seems logical to use it as a band-aid for the site currently suffering from this issue. |
| Comment by Doug Oucharek (Inactive) [ 01/Sep/16 ] |
|
I'm wondering if a recent change, which keeps retrying QP creation while lowering the number of send_wr's on each iteration, will fix/mask the problem Cray found. |
| Comment by Doug Oucharek (Inactive) [ 01/Sep/16 ] |
|
James: I'm finding the original patch is only needed on the clients as we have not seen this problem with servers rdma'ing to the routers or routers rdma'ing to the clients. So, you really only need wrq_sge=2 (default) on the clients and set it to 1 on the routers and servers. |
| Comment by Alexey Lyashkov [ 02/Sep/16 ] |
|
Doug, it looks like your investigation is incorrect. But it will cause a problem with TCP <-> IB routing. |
| Comment by Olaf Weber [ 02/Sep/16 ] |
|
Looking a bit closer, the offset into the first page is present at the LND level (as opposed to LNet level) for the o2ib and (I think) gni LNDs. The sock LND does not have it. So there is a problem when data is routed from a TCP network to IB or GNI. It would be possible to extend the sock LND to carry this data (easier than an LNet protocol change) but some less invasive option might be preferable. |
| Comment by Doug Oucharek (Inactive) [ 02/Sep/16 ] |
|
The Xyratex fix was put into Gerrit under: http://review.whamcloud.com/16141/. I made some significant changes to the patch to make it work with ksocklnd and to make the change more adaptable (controlling the larger fragments by making them a new pool). It has not been reviewed/landed yet.
What I have found when this problem occurs in production is that an offset is only applied when a client is doing an RDMA write through a router. When the client is setting up the work queue elements to RDMA write to the router, it runs out of elements because it is using two elements for each fragment, since the fragments are out of sync (see my description of the problem above). So the client is reporting the "Too fragmented" error and aborting the RDMA operation. I have never seen the "Too fragmented" error reported on servers or routers.
This can be fixed in two ways:
1- as Liang's original fix does, just have more work queue elements, via doubling the SGEs, so the client can RDMA an offset buffer into a non-offset buffer in the router.
2- the Xyratex approach above, which changes the router buffers.
Fix 1 needs to be applied to the clients (and servers, if we believe an offset can ever happen from there...no evidence of that yet). Fix 2 needs to be applied to the routers only.
Which one is best? That seems to be an ongoing discussion here. |
| Comment by James A Simmons [ 02/Sep/16 ] |
|
Ugh. Neither is great, since they both involve increasing the memory footprint. In our experience Cray routers tend to be very memory constrained, so I would go for the client fix option. I have looked at the Xyratex solution and never understood why it needs a new buffer pool. Couldn't we just expand on the large buffers that already exist? On the other hand, down the road when we move to netlink, the upper-layer problems go away. These workarounds in the LND driver would have to be cleaned up in the future. |
| Comment by Doug Oucharek (Inactive) [ 02/Sep/16 ] |
|
I wanted to let users control the use of large RDMA buffers. Allocating a large number of 1M buffers made up of contiguous pages can be challenging if the system's memory has become fragmented. Depending on how much memory the router has, a customer may want to control the allocation of these buffers, falling back to fragmented large buffers when they run out of contiguous ones. Having a separate pool makes this easier to configure and adapt to each unique situation. |
| Comment by James A Simmons [ 02/Sep/16 ] |
|
Oh, I see what Alyona is doing. It's just that I'm used to seeing contiguous pages allocated using alloc_contig_range() or CMA. Since this is the case, I would recommend you rename some of the RDMA labels to CMA so it makes sense to external reviewers. I need to look at these older kernels everyone uses to see what APIs are available. |
| Comment by Doug Oucharek (Inactive) [ 02/Sep/16 ] |
|
Any kernel wisdom you can bring to the solution will be appreciated. I have no idea what CMA is. Out of the loop. |
| Comment by Alexey Lyashkov [ 03/Sep/16 ] |
|
James, the problem is simple. It's a bug (or feature) in o2iblnd which was copied to GNI.
while (resid > 0) {
	if (srcidx >= srcrd->rd_nfrags) {
		CERROR("Src buffer exhausted: %d frags\n", srcidx);
		rc = -EPROTO;
		break;
	}
	if (dstidx == dstrd->rd_nfrags) {
		CERROR("Dst buffer exhausted: %d frags\n", dstidx);
		rc = -EPROTO;
		break;
	}
	if (tx->tx_nwrq >= conn->ibc_max_frags) {
		CERROR("RDMA has too many fragments for peer %s (%d), "
		       "src idx/frags: %d/%d dst idx/frags: %d/%d\n",
		       libcfs_nid2str(conn->ibc_peer->ibp_nid),
		       conn->ibc_max_frags,
		       srcidx, srcrd->rd_nfrags,
		       dstidx, dstrd->rd_nfrags);
		rc = -EMSGSIZE;
		break;
	}
	wrknob = MIN(MIN(kiblnd_rd_frag_size(srcrd, srcidx),
			 kiblnd_rd_frag_size(dstrd, dstidx)), resid);   <<< this line is the problem
	sge = &tx->tx_sge[tx->tx_nwrq];
	sge->addr = kiblnd_rd_frag_addr(srcrd, srcidx);
	sge->lkey = kiblnd_rd_frag_key(srcrd, srcidx);
	sge->length = wrknob;
	wrq = &tx->tx_wrq[tx->tx_nwrq];
	wrq->next = wrq + 1;
	wrq->wr_id = kiblnd_ptr2wreqid(tx, IBLND_WID_RDMA);
	wrq->sg_list = sge;
	wrq->num_sge = 1;
	wrq->opcode = IB_WR_RDMA_WRITE;
	wrq->send_flags = 0;
	wrq->wr.rdma.remote_addr = kiblnd_rd_frag_addr(dstrd, dstidx);
	wrq->wr.rdma.rkey = kiblnd_rd_frag_key(dstrd, dstidx);
	srcidx = kiblnd_rd_consume_frag(srcrd, srcidx, wrknob);
	dstidx = kiblnd_rd_consume_frag(dstrd, dstidx, wrknob);
	resid -= wrknob;
	tx->tx_nwrq++;
	wrq++;
	sge++;
}
So a source transfer with segments {128, 4096} and destination segments {4096}, {4096} will be mapped into {128}, {4096-128}, {128} segments. In general this needs twice as many SGEs/WRs to send the same amount of data, as one source segment now spans two destination segments on the router. Probably someone can find a better solution by just fixing this loop, which would avoid the problem entirely without any protocol or memory pool changes. |
| Comment by Thomas Stibor [ 15/Sep/16 ] |
|
Hi there, we encountered the RDMA too fragmented problem without LNet routers on nearly all clients where a kind of strange function call pattern was executed:
...
Sep 2 06:05:28 lxbk0101 kernel: [2569481.266678] LNetError: 1407:0:(o2iblnd_cb.c:1140:kiblnd_init_rdma()) RDMA too fragmented for 10.20.0.250@o2ib1 (256): 241/256 src 241/256 dst frags
Sep 2 06:05:28 lxbk0101 kernel: [2569481.269159] LNetError: 1407:0:(o2iblnd_cb.c:1140:kiblnd_init_rdma()) Skipped 1 previous similar message
Sep 2 06:05:28 lxbk0101 kernel: [2569481.270325] LNetError: 1407:0:(o2iblnd_cb.c:1690:kiblnd_reply()) Can't setup rdma for GET from 10.20.0.250@o2ib1: -90
Sep 2 06:05:28 lxbk0101 kernel: [2569481.271498] LNetError: 1407:0:(o2iblnd_cb.c:1690:kiblnd_reply()) Skipped 1 previous similar message
...
As a consequence of the error the Lustre client lost the corresponding OST with the message (lfs check osts): Resource temporarily unavailable (11)
MDS/OSS are running Lustre 2.5.3, clients are running Lustre 2.6.
The (strange) function call pattern is:
.....
write(2, "Info in <CbmPlutoGenerator::Read"..., 96) = 96
write(2, "Info in <CbmPlutoGenerator::Read"..., 75) = 75
write(2, "Info in <CbmPlutoGenerator::Read"..., 96) = 96
write(1, "BoxGen: kf=1000010020, p=(0.20, "..., 285) = 285
write(1, " GTREVE_ROOT : Transporting prim"..., 8151) = 8151
lseek(18, 580977933, SEEK_SET) = 580977933
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, "\0\0005]\3\354\0.\353\24VO*\353\0V\0<\0\0\0\0\"\241\5\r\0\0\0\0\0\0"..., 13661) = 13661
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 580991594, SEEK_SET) = 580991594
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, "\0\0;1\3\354\0.\353\24VO*\353\0R\0<\0\0\0\0\"\241:j\0\0\0\0\0\0"..., 15153) = 15153
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 581006747, SEEK_SET) = 581006747
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, "\0\2\3Y\3\354\0.\353\24VO*\353\0W\0<\0\0\0\0\"\241u\233\0\0\0\0\0\0"..., 131929) = 131929
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 581138676, SEEK_SET) = 581138676
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0bfe5448d0}, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, 8) = 0
write(18, "\0\3\4\16\3\354\0.\353\24VO*\353\0U\0<\0\0\0\0\"\243x\364\0\0\0\0\0\0"..., 197646) = 197646
rt_sigaction(SIGINT, {0x7f0bffd39980, [], SA_RESTORER|SA_RESTART, 0x7f0bfe5448d0}, NULL, 8) = 0
lseek(18, 581336322, SEEK_SET) = 581336322
...
We were able to simulate the pattern with a simple Ruby script and were able to produce the RDMA too fragmented error:
#!/usr/bin/env ruby
# Generate a 1MB buffer
data = String.new
100000.times do |i|
  data << '0123456789'
end
File.open(ARGV.first, 'w+') do |f|
  512.times do |i|
    offset = i * 1000000
    puts offset
    f.seek(offset, IO::SEEK_SET) # SEEK_SET seeks from the beginning of the file
    f.write(data) # the write call already sets the file cursor where we will seek to in the next cycle
  end
end
The RDMA too fragmented error was triggered after the 5th, 6th, or sometimes the 7th loop.
We are currently running more investigations on dedicated Lustre client machines; however, it looks like that by setting max_pages_per_rpc=64 => 4 * 64 = 256, thus
./lnet/include/lnet/types.h:#define LNET_MAX_IOV 256
klnds/o2iblnd/o2iblnd.h:#define IBLND_MAX_RDMA_FRAGS LNET_MAX_IOV /* max # of fragments supported */
the problem is not occurring anymore within the routine
if (tx->tx_nwrq == IBLND_RDMA_FRAGS(conn->ibc_version)) {
	CERROR("RDMA too fragmented for %s (%d): "
	       "%d/%d src %d/%d dst frags\n",
	       libcfs_nid2str(conn->ibc_peer->ibp_nid),
	       IBLND_RDMA_FRAGS(conn->ibc_version),
	       srcidx, srcrd->rd_nfrags, dstidx, dstrd->rd_nfrags);
	rc = -EMSGSIZE;
	break;
}
So far the problem seems to be gone by setting max_pages_per_rpc=64. |
| Comment by James A Simmons [ 20/Sep/16 ] |
|
By setting max_pages_per_rpc=64 you are below the 256-page limit, so you will never hit that issue. Doug, I tried the new lnet_selftest patch with offsets of 64 and 256 and I didn't see any fragmentation issues. Have you been able to reproduce the problem? |
| Comment by Alexey Lyashkov [ 21/Sep/16 ] |
|
James, did you use a 1 MB transfer size for lnet_selftest with the offset? P.S. max_pages_per_rpc=128 should be enough to avoid the problem, since the router needs up to twice as many fragments as pages. |
| Comment by Doug Oucharek (Inactive) [ 22/Sep/16 ] |
|
Thomas: I have checked, and so far everyone who has seen this issue in production was using the quotas feature. You may be on to something. One theory is that quotas can trigger a syncio which may cause a page misalignment. I'm going to look more into this. I have also run into two examples where the "RDMA too fragmented" error was encountered where a router was not used. Olaf: are you certain that the destination will match the same starting offset when a router is not present? If so, we should never see this error without a router. If this issue can happen without a router, then the fix which uses large, contiguous buffers on routers won't cover all cases. The original fix by Liang, however, will. I'm proposing that we land that fix but make the default wrq_sge=1 so it is off by default. Then anyone running into this error can turn the fix on on systems which exhibit it. In the meantime, I will see if I can better understand how/why quotas trigger this and see if that can be resolved. |
| Comment by Doug Oucharek (Inactive) [ 22/Sep/16 ] |
|
Note: one example of this issue when routers are not present was on the discussion board: https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg12963.html |
| Comment by Andreas Dilger [ 22/Sep/16 ] |
|
When quotas are enabled, it is possible that a user hitting the quota limit will force sync writes from the client. That's similar to the client doing O_DIRECT writes. It is a bit confusing, however, since I can't imagine why this would cause unaligned writes: the pages should have previously been fully fetched to the client, so the whole page should be written in this case and the write should be page aligned. AFAIK, only O_DIRECT should be able to generate partial-page writes; anything else is a bug IMHO. Rather than transferring all of the pages misaligned, my strong preference would be to fix the handling of the first page, and then send the rest of the pages properly. Is the inability to send the first partial page a problem at the Lustre level or LNet? If it is a bug in the way Lustre generates the bulk requests then this could (also) be fixed, even if there also needs to be a temporary fix in LNet as well. |
| Comment by Gerrit Updater [ 05/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12496/ |
| Comment by Christopher Morrone [ 23/Feb/17 ] |
|
Just a "me too": We are hitting this with 2.8.0 on an OmniPath network. Client gets stuck with "RDMA has too many fragments for peer" sending to a router node. If it matters, we have one client isolated right now in this state. Hopefully the problem is understood at this point, but if you need some information gathering let us know what you want us to look for.
|
| Comment by James A Simmons [ 24/Feb/17 ] |
|
The problem is that the proposed patch breaks our systems. It causes our clients to go into a reconnect storm. |
| Comment by Doug Oucharek (Inactive) [ 24/Feb/17 ] |
|
The summary of this issue is thus:
Ultimately, it would be good to get some agreement on how to fix this and land something. |
| Comment by Christopher Morrone [ 24/Feb/17 ] |
|
This issue is explicitly about a client problem. So "The fix", I assume, refers to change 12451?? We too have PPC clients. Not sure if any of them will get Lustre 2.8+. But I would not particularly like to gamble on a patch that Intel hasn't committed to yet. |
| Comment by Doug Oucharek (Inactive) [ 24/Feb/17 ] |
|
Yes, the patch for this ticket is 12451. It is "off" by default and needs a module parameter to turn it on. As such, it should be safe. James: if this patch is off, do you have an issue with your clients? If so, then we have to debug that. If not, then technically we can land the patch, and those who want to use it just have to turn it on. |
| Comment by Christopher Morrone [ 24/Feb/17 ] |
|
Speaking as a customer, a "fix" that requires me to manually go in and change a configuration to tell Lustre "no, really, please don't be broken" is not a terribly satisfactory solution. Please work on a solution that will make Lustre work out of the box. |
| Comment by James A Simmons [ 07/Mar/17 ] |
|
I attached my router logs that show the problem with this patch. |
| Comment by Doug Oucharek (Inactive) [ 09/Mar/17 ] |
|
James: I looked at your router logs. All but rtr5 have no o2iblnd logs, only gnilnd. Rtr5 has some new logs you must have added to debug this issue. Stuff like "(130)++" seems to be a counter you are keeping track of. What is it? Is it counting queue depth? Is that the problem you are running into? |
| Comment by James A Simmons [ 09/Mar/17 ] |
|
That is from kiblnd_conn_addref(), which is an inline function defined in o2iblnd.h. |
| Comment by Doug Oucharek (Inactive) [ 10/Mar/17 ] |
|
I'm not seeing the client reconnect storm in those logs. Is neterr logging turned off? |
| Comment by Doug Oucharek (Inactive) [ 11/Apr/17 ] |
|
James: give the latest patch version, 10, a try on PPC. I believe I fixed the PPC issue with the patch. |
| Comment by Stephane Thiell [ 12/Apr/17 ] |
|
Hi, we just hit this problem on brand new 2.9 clients, only on a bigmem node, leading to deadlocked writes on our /scratch. We are using EE3 servers with LNet routers (they are all already patched for this, see DELL-221). As we think we have a basic use case here (only a few processes were reading from a single file and writing to multiple files, apparently doing (nice) 4M I/Os before the deadlock occurred), we took a crash dump which is available for download at the following link:
https://stanford.box.com/s/d37761k3ywukxh7im9mq8mgp9m2gkpga
It shows deadlocked writes after the RDMA too fragmented errors.
Kernel version is 3.10.0-514.10.2.el7.x86_64 on el7.
Hope this helps...
Stephane
|
| Comment by James A Simmons [ 13/Apr/17 ] |
|
Currently my test system that has this problem is down until middle of next week. As soon as it is back I will test it. |
| Comment by Gerrit Updater [ 26/Apr/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/12451/ |
| Comment by Peter Jones [ 26/Apr/17 ] |
|
Landed for 2.10 |
| Comment by James A Simmons [ 26/Apr/17 ] |
|
Just to let you know, I'm in the process of testing this patch, and the latest version seems to be holding up. Good work, Doug. |
| Comment by Dmitry Eremin (Inactive) [ 03/May/17 ] |
|
After the last patch landed I got the following errors:
[4020251.265904] LNetError: 95052:0:(o2iblnd_cb.c:1086:kiblnd_init_rdma()) RDMA is too large for peer 192.168.213.235@o2ib (131072), src size: 1048576 dst size: 1048576
[4020251.265941] LNetError: 95050:0:(o2iblnd_cb.c:1720:kiblnd_reply()) Can't setup rdma for GET from 192.168.213.235@o2ib: -90
[4020251.265948] LustreError: 95050:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff8816e0754c00
...
[4020251.267492] Lustre: 95098:0:(client.c:2115:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1493833318/real 1493833318] req@ffff8817e9e48000 x1566397691863184/t0(0) o4->
[4020251.267503] Lustre: nvmelfs-OST000b-osc-ffff881c8e362000: Connection to nvmelfs-OST000b (at 192.168.213.236@o2ib) was lost; in progress operations using this service will wait for recovery to complete
...
[4020251.267965] LustreError: 95050:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff880223361400
[4020251.268058] Lustre: nvmelfs-OST000b-osc-ffff881c8e362000: Connection restored to 192.168.213.236@o2ib (at 192.168.213.236@o2ib)
...
[4020256.133400] LNetError: 95052:0:(o2iblnd_cb.c:1086:kiblnd_init_rdma()) RDMA is too large for peer 192.168.213.235@o2ib (131072), src size: 1048576 dst size: 1048576
[4020256.133561] LNetError: 95049:0:(o2iblnd_cb.c:1720:kiblnd_reply()) Can't setup rdma for GET from 192.168.213.235@o2ib: -90
[4020256.133564] LNetError: 95049:0:(o2iblnd_cb.c:1720:kiblnd_reply()) Skipped 159 previous similar messages
[4020256.133569] LustreError: 95049:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff88192932fe00
[4020256.133630] Lustre: 95125:0:(client.c:2115:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1493833323/real 1493833323] req@ffff882031360300 x1566397691866144/t0(0) o4->
[4020256.133634] Lustre: 95125:0:(client.c:2115:ptlrpc_expire_one_request()) Skipped 39 previous similar messages
[4020256.133654] Lustre: nvmelfs-OST000e-osc-ffff881c8e362000: Connection to nvmelfs-OST000e (at 192.168.213.235@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[4020256.133656] Lustre: Skipped 39 previous similar messages
[4020256.134200] Lustre: nvmelfs-OST000e-osc-ffff881c8e362000: Connection restored to 192.168.213.235@o2ib (at 192.168.213.235@o2ib)
[4020256.134202] Lustre: Skipped 39 previous similar messages
The system is partially working. I'm able to see the list of files and open small files, but large bulk transfers don't work. |
| Comment by Doug Oucharek (Inactive) [ 03/May/17 ] |
|
This was addressed by a patch to |
| Comment by Dmitry Eremin (Inactive) [ 03/May/17 ] |
|
Uff. It looks like I was lucky enough to be using a build without this fix. |
| Comment by Stephane Thiell [ 25/May/17 ] |
|
Hi, on the router with wrq_sge=1 (10.210.34.213@o2ib1 is an unpatched OSS):
[ 1111.504575] LNetError: 8688:0:(o2iblnd_cb.c:1093:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.210.34.213@o2ib1 (256), src idx/frags: 128/147 dst idx/frags: 128/147
[ 1111.522352] LNetError: 8688:0:(o2iblnd_cb.c:430:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.210.34.213@o2ib1: -90
Clients and routers are using mlx5, servers are using mlx4.
Thanks, |
| Comment by Chris Horn [ 25/May/17 ] |
|
You need to set wrq_sge=2 on the routers, too. |
| Comment by Doug Oucharek (Inactive) [ 25/May/17 ] |
|
Your router needs wrq_sge=2. |
| Comment by Stanford Research Computing Center [ 25/May/17 ] |
|
Ah! Thanks for the clarification, Chris and Doug! I was a bit lost, as the parameters changed over the course of the work done in this ticket. We'll test this right away. |
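For reference, based on the discussion above, the setting would typically be applied through a modprobe configuration file along these lines (the file path is only an example, and this assumes the 12451 patch is present on the nodes involved; confirm the parameter default for your Lustre version):
# /etc/modprobe.d/ko2iblnd.conf (example)
# Enable the extra scatter/gather entry on clients and routers so
# offset bulk fragments can still be mapped without running out of WRs.
options ko2iblnd wrq_sge=2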
| Comment by Cory Spitz [ 06/Jun/17 ] |
|
Looks like we should have opened a LUDOC ticket to document wrq_sge. |
| Comment by Cory Spitz [ 09/Jun/17 ] |
|
|