[LU-14733] brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103 Created: 03/Jun/21  Updated: 18/Mar/22  Resolved: 24/Jul/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.8, Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Olaf Faaland Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: LTS12, llnl
Environment:

lustre-2.12.6_9.llnl client
kernel-4.18.0-305.0.0.1toss.t4.x86_64
RHEL84


Attachments: Text File 01-move_null.patch     Text File 02-post_state.patch     Text File build.txt     Text File diff.txt     Text File dk.opal188.llnl.gov.7.txt     Text File dk.opal63.llnl.gov.7.txt     Text File dmesg.opal188.txt     Text File dmesg.opal63.txt     File kprobes-off.sh     File kprobes.sh     Text File linux-kernel-test.patch     Text File move_null.patch     Text File post_state.patch     Text File trace1.txt     Text File trace2.txt    
Issue Links:
Duplicate
is duplicated by LU-13976 duplicate IB_WR_LOCAL_INV causing ice... Resolved
Related
is related to LU-15116 crash when writing files in parallel ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

lnet_selftest fails between two nodes over Omnipath

dk.opal63.llnl.gov.7:00000001:00020000:43.0:1622598261.714620:0:129525:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103 

Bulk transfers work over Infiniband (although in that test 1 of the nodes was RHEL 7.9 and an earlier Lustre patch stack).  Bulk transfers also work over tcp using ksocklnd.

lctl pings work fine between the same two nodes.

mpibench and other MPI applications also work fine over Omnipath between two nodes.

See https://github.com/LLNL/lustre/releases/tag/2.12.6_9.llnl for the patch stack



 Comments   
Comment by Olaf Faaland [ 03/Jun/21 ]

These nodes were RHEL 8.3 based and have been upgraded to RHEL 8.4.  At the same time, they got an updated Lustre client with the following OS-compat patches:

  • 12d13f33c0 (tag: 2.12.6_9.llnl) LU-13783 osc: handle removal of NR_UNSTABLE_NFS
  • 3cdb219927 LU-12355 llite: MS_* flags and SB_* flags split
  • 03e48854cb LU-12355 llite: totalram_pages changed to atomic_long_t
  • b055fe9f7a (tag: 2.12.6_8.llnl) LU-14690 kernel: new kernel [RHEL 8.4 4.18.0-305.el8]
  • 335e03049d (tag: 2.12.6_7.llnl) LU-14673 sec: annotate algorithms taking optional key
Comment by Olaf Faaland [ 03/Jun/21 ]

See also LU-14690

Comment by Peter Jones [ 03/Jun/21 ]

Serguei

Can you please advise?

Thanks

Peter

Comment by Olaf Faaland [ 04/Jun/21 ]

For my reference, my local ticket is TOSS-5228

Comment by Olaf Faaland [ 14/Jun/21 ]

Hi Serguei, do you have any update or questions on this? Thanks

Comment by Serguei Smirnov [ 14/Jun/21 ]

Olaf,

So far it looks possible that there is some sort of incompatibility in how ib_post_send is called, but I don't have anything concrete in this direction yet. Do you have any config logs (from building lustre)? I'm not that familiar with Omnipath either. Which version of Omnipath are you using? Basically I want to make sure we end up calling ib_post_send correctly.

Thanks,

Serguei.

Comment by Olaf Faaland [ 14/Jun/21 ]

Hi Serguei,
I've attached build.txt - not the config.log, but at least the stdout from ./configure. I'll look into Omnipath info.
thanks,
Olaf

Comment by Serguei Smirnov [ 15/Jun/21 ]

Hi Olaf,

Here's some more detail regarding what I'd like to try with the OPA build:

lnet/autoconf/lustre-lnet.m4 has a check for ib_post_send() and ib_post_recv() to see if they require const ptr parameters. Could you please try removing this check and build without it? (See attached diff file). I suspect there may be an issue with using the wrong header file when building. The kernel code for 4.18.0 appears to define these functions without the "const" and I think that's what we should be using for OPA, but the stdout you provided indicates that as a result of the configure the "const" version is used. (diff.txt)

Thanks,

Serguei.

Comment by Olaf Faaland [ 17/Jun/21 ]

Hi Serguei,

Sorry, I didn't see your message for some reason. My test was slightly different, but I think still produced the result you need. With the config check sabotaged so HAVE_IB_POST_SEND_RECV_CONST is not defined, the build fails with

/g/g0/faaland1/rpmbuild/BUILD/lustre-2.12.6_9.llnl_2_gd3006c6/lnet/klnds/o2iblnd/o2iblnd_cb.c:1002:46: error: passing argument 3 of 'ib_post_send' from incompatible pointer type [-Werror=incompatible-pointer-types]
    rc = ib_post_send(conn->ibc_cmid->qp, wr, &bad);
                                              ^~~~
In file included from /usr/src/kernels/4.18.0-305.0.0.1toss.t4.x86_64/include/rdma/ib_addr.h:20,
                 from /usr/src/kernels/4.18.0-305.0.0.1toss.t4.x86_64/include/rdma/rdma_cm.h:12,
                 from /g/g0/faaland1/rpmbuild/BUILD/lustre-2.12.6_9.llnl_2_gd3006c6/lnet/klnds/o2iblnd/o2iblnd.h:71,
                 from /g/g0/faaland1/rpmbuild/BUILD/lustre-2.12.6_9.llnl_2_gd3006c6/lnet/klnds/o2iblnd/o2iblnd_cb.c:37:
/usr/src/kernels/4.18.0-305.0.0.1toss.t4.x86_64/include/rdma/ib_verbs.h:3799:37: note: expected 'const struct ib_send_wr **' but argument is of type 'struct ib_send_wr **'
           const struct ib_send_wr **bad_send_wr)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~

My change was

diff --git a/lnet/autoconf/lustre-lnet.m4 b/lnet/autoconf/lustre-lnet.m4
index 6d4a05d4ef..98490ea9ae 100644
--- a/lnet/autoconf/lustre-lnet.m4
+++ b/lnet/autoconf/lustre-lnet.m4
@@ -539,6 +539,8 @@ AS_IF([test $ENABLEO2IB != "no"], [
        # In MOFED 4.6, the second and third parameters for
        # ib_post_send() and ib_post_recv() are declared with
        # 'const'.
+       #
+       # SABOTAGE: force this to fail with extra argument to ib_post_send
        tmp_flags="$EXTRA_KCFLAGS"
        EXTRA_KCFLAGS="-Werror"
        LB_CHECK_COMPILE([if 'ib_post_send() and ib_post_recv()' have const parameters],
@@ -555,7 +557,7 @@ AS_IF([test $ENABLEO2IB != "no"], [
                #include <rdma/ib_verbs.h>
        ],[
                ib_post_send(NULL, (const struct ib_send_wr *)NULL,
-                            (const struct ib_send_wr **)NULL);
+                            (const struct ib_send_wr **)NULL, NULL);
        ],[
                AC_DEFINE(HAVE_IB_POST_SEND_RECV_CONST, 1,
                        [ib_post_send and ib_post_recv have const parameters])


thanks,
Olaf

Comment by Serguei Smirnov [ 17/Jun/21 ]

Olaf,

I suppose this means we need to look for a different reason why ib_post_send would fail with EINVAL (22) error code.

So far I wasn't able to find the source code for this function provided by Omnipath for RH8.4. Oddly, I also failed to locate Omnipath release notes for RH8.4. Perhaps someone from CN can comment.

Btw, although this is unlikely to be related to this issue, perhaps you want to port LU-13182 as well? (see last couple of comments in  LU-14690)

Thanks,

Serguei.

Comment by Olaf Faaland [ 17/Jun/21 ]

Hi Serguei, I'm told:

"OPA hardware does not have hardware verbs support, so it uses the rdmavt layer to provide software translation. You want to look at rvt_post_send() in drivers/infiniband/sw/rdmavt/qp.c."

We're using the in-kernel OPA driver, so in this case the one in linux-4.18.0-305.el8

thanks

Comment by Serguei Smirnov [ 17/Jun/21 ]

Olaf,

Have you tried building lustre master and testing if it has the same problem on your RH8.4 OPA machine? There's a chance that one of the o2iblnd patches there may help. Otherwise I'm thinking I'm going to have to add some debug messages to LLNL patch stack so we can use that to get more data.

Thanks,

Serguei.

Comment by Olaf Faaland [ 17/Jun/21 ]

Just to document this where it's visible to everyone:

Ran perftest (perftest-4.4-37.0.t4.x86_64) from opal63 (compute), used both opal64 (compute) and opal187 (router) as servers. These are the same nodes, running the same OS, where lnet_selftest fails.

All these tests succeeded:

ib_write_bw opal64-hsi0
ib_read_bw opal64-hsi0
ib_send_bw opal64-hsi0
ib_atomic_bw opal64-hsi0
ib_write_lat opal187-hsi0
ib_send_bw -a -b -R -F opal187-hsi0
ib_send_bw -a -b -R -F opal64-hsi0
ib_send_bw -a -b -R -F -q 3 opal64-hsi0
ib_send_bw -a -b -R -F -q 3 opal187-hsi0
ib_read_bw -a -b -R -F -q 3 opal187-hsi0
ib_atomic_bw -b -R -F -q 3 opal187-hsi0
ib_send_bw -a -b -R -F -q 3 opal187-hsi0
ib_write_bw -a -b -R -F -q 3 opal187-hsi0
ib_read_bw -a -b -R -F -q 3 opal187-hsi0

Those arguments mean:

 -R, --rdma_cm Connect QPs with rdma_cm and run test on those QPs
 -z, --com_rdma_cm Communicate with rdma_cm module to exchange data - use regular QPs
 -m, --mtu=<mtu> QP Mtu size (default: active_mtu from ibv_devinfo)
 -c, --connection=<type> Connection type RC/UC/UD/XRC/DC/SRD (default RC).
 -d, --ib-dev=<dev> Use IB device <dev> (default: first device found)
 -i, --ib-port=<port> Use network port <port> of IB device (default: 1)
 -s, --size=<size> Size of message to exchange (default: 1)
 -a, --all Run sizes from 2 till 2^23
 -n, --iters=<iters> Number of exchanges (at least 100, default: 1000)
 -x, --gid-index=<index> Test uses GID with GID index taken from command
 -V, --version Display version number
 -e, --events Sleep on CQ events (default poll)
 -F, --CPU-freq Do not fail even if cpufreq_ondemand module
 -I, --inline_size=<size> Max size of message to be sent in inline mode
 -u, --qp-timeout=<timeout> QP timeout = (4 uSec)*(2^timeout) (default: 14)
 -S, --sl=<sl> Service Level (default 0) 
Comment by Olaf Faaland [ 17/Jun/21 ]

Hi Serguei,

> Have you tried building lustre master and testing if it has the same problem on your RH8.4 OPA machine?

We were in the process of setting that up when we lost power to the machine.  As soon as it's back up, we will test that.

Comment by Gian-Carlo Defazio [ 18/Jun/21 ]

We've now tried lnet_selftest on the OPA compute nodes opal63 and opal64 using lustre-2.14.0_2.llnl https://github.com/LLNL/lustre/tree/2.14.0_2.llnl. The results were the same as with lustre-2.12.6_9.llnl, that is, the test failed and we saw bad numbers for the results (lots of 0s) and the same error in the kernel dumps as before:

(o2iblnd_cb.c:957:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.128.5@o2ib18

We're in the process of repeating the same test using lustre master.

Comment by Olaf Faaland [ 19/Jun/21 ]

The test with master hit a list corruption BUG, so it was inconclusive.  I won't be able to track it down now, I'm leaving for vacation soon and need to prepare.

Comment by Dennis Dalessandro [ 21/Jun/21 ]

Can someone post instructions or a test script so we can take a look? I downloaded the LLNL lustre tarball and built it on a RHEL 8.4 box (stock kernel) and would like to give it a shot, but I'm not very familiar with LNet selftest.

Comment by Serguei Smirnov [ 21/Jun/21 ]

Dennis,

There are instructions here: https://wiki.lustre.org/LNET_Selftest

Basically, two nodes with LNet running are needed

1) Make sure selftest is loaded on both nodes: modprobe lnet_selftest

2) Make sure LNet is configured on both nodes. Run "lnetctl net show" to list local nids on each.

3) Make the nodes "discover" each other: "lnetctl discover <peer nid>"

4) Copy the wrapper script onto one of the nodes. Use primary nids to fill in "TO" and "FROM"

5) Run the script

 

Comment by Mike Marciniszyn [ 21/Jun/21 ]

The 305 kernel may be the first RHEL kernel with fmr removed...

Comment by Dennis Dalessandro [ 22/Jun/21 ]

I am able to reproduce an EINVAL error internally. It looks like some of the post_send calls work, but the ones from kiblnd_sd always fail. Here is a kprobe that I used to dump the return value of rvt_post_send():

r:testprobe rdmavt:rvt_post_send $retval

These threads always succeed:

lst_t_00_24-7492 [039] d... 88520.827471: testprobe: (kiblnd_post_tx_locked+0x857/0xa50 [ko2iblnd] <- rvt_post_send) arg1=0x0

kiblnd_connd-7547 [000] d... 88520.827888: testprobe: (ib_send_mad+0x235/0x420 [ib_core] <- rvt_post_send) arg1=0x0

monitor_thread-7556 [032] d... 88521.503931: testprobe: (kiblnd_post_tx_locked+0x857/0xa50 [ko2iblnd] <- rvt_post_send) arg1=0x0

These threads always fail:

kiblnd_sd_00_01-7549 [014] d... 88520.827678: testprobe: (kiblnd_post_tx_locked+0x857/0xa50 [ko2iblnd] <- rvt_post_send) arg1=0xffffffea

Note 0xffffffea decodes to -22 in decimal.
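The decode can be double-checked by reinterpreting the 32-bit return value as a signed integer. The following is only a minimal userspace sketch; the helper name decode_retval is made up for illustration and is not part of rdmavt or the kprobe tooling.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: interpret a raw 32-bit kprobe $retval as a
 * signed (two's complement) integer without relying on
 * implementation-defined conversions.
 */
static int32_t decode_retval(uint32_t raw)
{
	int64_t wide = raw;

	if (wide > INT32_MAX)
		wide -= (int64_t)1 << 32;	/* fold into the negative range */
	return (int32_t)wide;
}
```

Passing 0xffffffea to this helper yields -22, which is -EINVAL on Linux.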

 

Comment by Mike Marciniszyn [ 24/Jun/21 ]

The -EINVAL is happening because a post_send of an IB_WR_LOCAL_INV operation is failing. So this is an issue with the fastreg stuff, as I expected.

kprobes-off.sh, kprobes.sh, and trace1.txt contain the kprobes used and the tracing.

I'm trying to get more details now.

 

Comment by Mike Marciniszyn [ 24/Jun/21 ]

I should point out that our upstream CI testing validates NFS/RDMA, iSER, and SRP use of the fast reg feature and has been passing without issue.

Comment by Mike Marciniszyn [ 26/Jun/21 ]

I have attached refined kprobes and trace2.txt.

Comment by Mike Marciniszyn [ 26/Jun/21 ]

Here is the invalidate code:

int rvt_invalidate_rkey(struct rvt_qp *qp, u32 rkey)
{
<snip>
        mr = rcu_dereference(
                rkt->table[(rkey >> (32 - dev->dparms.lkey_table_size))]);
        if (unlikely(!mr || mr->lkey != rkey || qp->ibqp.pd != mr->pd)) *** rkey is != mr->lkey
                goto bail;
<snip>
bail:
        rcu_read_unlock();
        return -EINVAL;
}

Here is an excerpt from the trace:

*** focusing on keys that begin with 0x16f700
     lst_t_00_21-7489  [011] d... 426264.174874: rvt_invalidate_rkey_p: (rvt_invalidate_rkey+0x0/0x60 [rdmavt]) qpn=0xe0 pd=0xffffa0ab4831b780 rkey=0x16f7000
     lst_t_00_21-7489  [011] d.Z. 426264.174879: rvt_invalidate_rkey_1: (rvt_invalidate_rkey+0x26/0x60 [rdmavt]) mr=0xffffc0dfca0d3b78 table_ptr=0xffffc0dfca0d3b78
     lst_t_00_21-7489  [011] d... 426264.174881: rvt_invalidate_rkey_2: (rvt_invalidate_rkey+0x29/0x60 [rdmavt]) rkey=0x16f7000 mr_lkey=0x16f7000
     lst_t_00_21-7489  [011] d... 426264.174882: rvt_invalidate_rkey: (rvt_post_send+0x525/0x800 [rdmavt] <- rvt_invalidate_rkey) ret=0x0
*** 0x16f7000 has been invalidated ***
     lst_t_00_21-7489  [011] d... 426264.174884: rvt_fast_reg_mr_p: (rvt_fast_reg_mr+0x0/0x70 [rdmavt]) qpn=0xe0 ibmr=0xffffa0a316a16a00 pd=0xffffa0ab4831b780 key=0x16f7001
     lst_t_00_21-7489  [011] d... 426264.174886: rvt_fast_reg_mr: (rvt_post_send+0x1a3/0x800 [rdmavt] <- rvt_fast_reg_mr) ret=0x0
*** 0x16f7001 has been written into ibmr.mr keys because of the above fast reg ***
*** then an invalidate is posted for 0x16f7000
 kiblnd_sd_00_03-7551  [015] d... 426264.175096: rvt_invalidate_rkey_p: (rvt_invalidate_rkey+0x0/0x60 [rdmavt]) qpn=0xe0 pd=0xffffa0ab4831b780 rkey=0x16f7000
 kiblnd_sd_00_03-7551  [015] d.Z. 426264.175100: rvt_invalidate_rkey_1: (rvt_invalidate_rkey+0x26/0x60 [rdmavt]) mr=0xffffc0dfca0d3b78 table_ptr=0xffffc0dfca0d3b78
*** the key in the mr is the one fastreg'ed from 426264.174884
 kiblnd_sd_00_03-7551  [015] d... 426264.175101: rvt_invalidate_rkey_2: (rvt_invalidate_rkey+0x29/0x60 [rdmavt]) rkey=0x16f7000 mr_lkey=0x16f7001
 kiblnd_sd_00_03-7551  [015] d... 426264.175103: rvt_invalidate_rkey: (rvt_post_send+0x525/0x800 [rdmavt] <- rvt_invalidate_rkey) ret=0xffffffea
 kiblnd_sd_00_03-7551  [015] d.Z. 426264.175104: rvt_post_send_err1: (rvt_post_send+0x475/0x800 [rdmavt]) wr=0xffffa0a2e7da7ed0 wr_opcode=7 err=-22
*** and the post send fails ***
 kiblnd_sd_00_03-7551  [015] d... 426264.175105: rvt_post_send: (kiblnd_post_tx_locked+0x857/0xa50 [ko2iblnd] <- rvt_post_send) ret=0xffffffea

It looks to me like Lustre is losing track of the keys for a particular MR?
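For readers following the key values in the trace, here is a small userspace sketch of how a key is bumped and why the old and new keys land on the same MR. inc_rkey is modelled on the kernel's ib_inc_rkey(); the 16-bit table size is an assumption chosen for illustration (the real value comes from dev->dparms.lkey_table_size), and lkey_table_index is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Modelled on ib_inc_rkey(): only the low 8 bits (the "key" byte)
 * advance, wrapping within the byte; the upper 24 bits are preserved.
 */
static uint32_t inc_rkey(uint32_t rkey)
{
	const uint32_t mask = 0x000000ff;

	return ((rkey + 1) & mask) | (rkey & ~mask);
}

/* ASSUMPTION: table size in bits, picked only for this illustration. */
#define LKEY_TABLE_SIZE_BITS 16

/* Hypothetical helper mirroring the table lookup index in rvt_invalidate_rkey(). */
static uint32_t lkey_table_index(uint32_t key)
{
	return key >> (32 - LKEY_TABLE_SIZE_BITS);
}
```

Under that assumption, 0x16f7000 and its bumped value 0x16f7001 index the same table slot, so the lookup finds the same MR and only the exact key comparison can tell them apart.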

Comment by Olaf Faaland [ 28/Jun/21 ]

Thanks, Mike and Dennis.

Serguei, please let us know when you (or someone) are working on this new information Cornelis came up with.  Thank you.

Comment by Mike Marciniszyn [ 28/Jun/21 ]

It looks like Lustre is sending a gratuitous invalidate because of this code fragment:

                /* There appears to be a bug in MLX5 code where you must
                 * invalidate the rkey of a new FastReg pool before first
                 * using it. Thus, I am marking the FRD invalid here. */
                frd->frd_valid = false;

This is not wrong, but it is different from other ULPs.

The following code is then executed before any fast reg has happened:

                                if (!frd->frd_valid) {
                                        struct ib_rdma_wr *inv_wr;
                                        __u32 key = is_rx ? mr->rkey : mr->lkey;

                                        inv_wr = &frd->frd_inv_wr;
                                        memset(inv_wr, 0, sizeof(*inv_wr));

                                        inv_wr->wr.opcode = IB_WR_LOCAL_INV;
                                        inv_wr->wr.wr_id  = IBLND_WID_MR;
                                        inv_wr->wr.ex.invalidate_rkey = key;

                                        /* Bump the key */
                                        key = ib_inc_rkey(key); 
                                        *** updates keys in ib_mr, but not the rvt_mregion keys ***
                                        ib_update_fast_reg_key(mr, key);
                                }

The following code uses the struct rvt_mregion keys to validate, doesn't see the above key change in the ibmr, and fails the invalidate. The rkey is correct, but the mr->lkey hasn't changed to match until the next fast reg:

int rvt_invalidate_rkey(struct rvt_qp *qp, u32 rkey)
{
        struct rvt_dev_info *dev = ib_to_rvt(qp->ibqp.device);
        struct rvt_lkey_table *rkt = &dev->lkey_table;
        struct rvt_mregion *mr;

        if (rkey == 0)
                return -EINVAL;

        rcu_read_lock();
        mr = rcu_dereference(
                rkt->table[(rkey >> (32 - dev->dparms.lkey_table_size))]);
        if (unlikely(!mr || mr->lkey != rkey || qp->ibqp.pd != mr->pd))
                goto bail;

        atomic_set(&mr->lkey_invalid, 1);
        rcu_read_unlock();
        return 0;

bail:
        rcu_read_unlock();
        return -EINVAL;
}

I'm working on the following fix:

diff --git a/drivers/infiniband/sw/rdmavt/mr.c b/drivers/infiniband/sw/rdmavt/mr.c
index 601d18dd..528727f 100644
--- a/drivers/infiniband/sw/rdmavt/mr.c
+++ b/drivers/infiniband/sw/rdmavt/mr.c
@@ -691,6 +691,7 @@ int rvt_invalidate_rkey(struct rvt_qp *qp, u32 rkey)
        struct rvt_dev_info *dev = ib_to_rvt(qp->ibqp.device);
        struct rvt_lkey_table *rkt = &dev->lkey_table;
        struct rvt_mregion *mr;
+       struct rvt_mr *rmr;

        if (rkey == 0)
                return -EINVAL;
@@ -698,7 +699,11 @@ int rvt_invalidate_rkey(struct rvt_qp *qp, u32 rkey)
        rcu_read_lock();
        mr = rcu_dereference(
                rkt->table[(rkey >> (32 - dev->dparms.lkey_table_size))]);
-       if (unlikely(!mr || mr->lkey != rkey || qp->ibqp.pd != mr->pd))
+       if (unlikely(!mr || qp->ibqp.pd != mr->pd))
+               goto bail;
+       /* isolate parent */
+       rmr = container_of(mr, struct rvt_mr, mr);
+       if (rmr->ibmr.type != IB_MR_TYPE_MEM_REG || rmr->ibmr.rkey != rkey)
                goto bail;

        atomic_set(&mr->lkey_invalid, 1);
Comment by Serguei Smirnov [ 28/Jun/21 ]

Hi,

To me this looks like a very good candidate to fix the issue. Thanks for taking the time to look into this!

Thanks,

Serguei.

Comment by Mike Marciniszyn [ 29/Jun/21 ]

> To me this looks like a very good candidate to fix the issue.

Another potential fix is to delete the MLX5 hack.   That should work as well.

Comment by Mike Marciniszyn [ 29/Jun/21 ]

The patch didn't work.

I need to do more analysis.

Comment by Olaf Faaland [ 29/Jun/21 ]

Hi Serguei,

Do you (or Amir, or anyone else you have easy access to) know if the MLX issue that prompted that hack has been fixed? The JIRA issue was LU-8752 and the commit was:

commit 783428b60a98874b4783f8da48c66019d68d84d6
Author: Doug Oucharek <doug.s.oucharek@intel.com>
Date:   Mon Dec 12 09:31:37 2016 -0800


    LU-8752 lnet: Stop MLX5 triggering a dump_cqe
    
    We have found that MLX5 will trigger a dump_cqe if we don't
    invalidate the rkey on a newly alloated MR for FastReg usage.
    
    This fix just tags the MR as invalid on its creation if we are
    using FastReg and that will force it to do an invalidate of the
    rkey on first usage.
    
...

diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
index e919008d44..ee5a01f9fa 100644
--- a/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1536,7 +1536,10 @@ static int kiblnd_alloc_freg_pool(kib_fmr_poolset_t *fps, kib_fmr_pool_t *fpo)
                        goto out_middle;
                }
 
-               frd->frd_valid = true;
+               /* There appears to be a bug in MLX5 code where you must
+                * invalidate the rkey of a new FastReg pool before first
+                * using it. Thus, I am marking the FRD invalid here. */
+               frd->frd_valid = false;
 
                list_add_tail(&frd->frd_list, &fpo->fast_reg.fpo_pool_list);
                fpo->fast_reg.fpo_pool_size++;

 
Comment by Olaf Faaland [ 30/Jun/21 ]

I reverted

LU-8752 lnet: Stop MLX5 triggering a dump_cqe

and tested. Initially I had nonzero bandwidth, which is different than I recall seeing before. After a few seconds the bandwidth recorded went to 0. lctl dk shows:

00000800:00020000:16.0F:1625017325.639837:0:514295:0:(o2iblnd_cb.c:1031:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.128.3@o2ib18
00000800:00020000:6.0F:1625017325.639911:0:514296:0:(o2iblnd_cb.c:1031:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.128.3@o2ib18
00000800:00000100:6.0:1625017325.639916:0:514296:0:(o2iblnd_cb.c:2101:kiblnd_close_conn_locked()) Closing conn to 192.168.128.3@o2ib18: error -22(waiting)
00000400:00000100:8.0F:1625017325.639943:0:514293:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -22 type 5, RPC errors 1
00000400:00000100:6.0:1625017325.639947:0:514296:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -22 type 5, RPC errors 2
00000001:00020000:7.0F:1625017325.639951:0:514856:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-192.168.128.3@o2ib18: -22
00000001:00020000:53.0:1625017325.639966:0:514858:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-192.168.128.3@o2ib18: -22
00000400:00000100:53.0:1625017325.639968:0:514858:0:(rpc.c:905:srpc_server_rpc_done()) Server RPC 000000007f1b43fb done: service brw_test, peer 12345-192.168.128.3@o2ib18, status SWI_STATE_BULK_STARTED:-5
00000001:00020000:53.0:1625017325.639971:0:514858:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer to 12345-192.168.128.3@o2ib18 has failed: -5
00000800:00000100:17.0:1625017325.640281:0:514294:0:(o2iblnd_cb.c:3734:kiblnd_complete()) RDMA (tx: 0000000005d7d115) failed: 5
00000800:00020000:6.0:1625017325.640283:0:514296:0:(o2iblnd_cb.c:1031:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.128.3@o2ib18
00000800:00000100:6.0:1625017325.640284:0:514296:0:(o2iblnd_cb.c:2101:kiblnd_close_conn_locked()) Closing conn to 192.168.128.3@o2ib18: error -22(waiting)
00000400:00000100:8.0:1625017325.640285:0:514293:0:(lib-msg.c:698:lnet_attempt_msg_resend()) msg 0@<0:0>->192.168.128.3@o2ib18 exceeded retry count 0
00000800:00020000:17.0:1625017325.640286:0:514294:0:(o2iblnd_cb.c:1031:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.128.3@o2ib18
00000400:00000100:8.0:1625017325.640287:0:514293:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -5 type 5, RPC errors 3
00000800:00000100:6.0:1625017325.640288:0:514296:0:(o2iblnd_cb.c:3734:kiblnd_complete()) RDMA (tx: 0000000029f85d1a) failed: 5
00000400:00000100:17.0:1625017325.640290:0:514294:0:(lib-msg.c:698:lnet_attempt_msg_resend()) msg 0@<0:0>->192.168.128.3@o2ib18 exceeded retry count 0
Comment by Mike Marciniszyn [ 30/Jun/21 ]

I do know that the proposed patch is just wrong. Testing against the struct rvt_mregion lkey should be correct.

I need to add another kprobe with the original kernel to look at the keys for all ib_alloc_mr() allocations from birth and track from that point to failures.

A portion of the comment says:

This fix just tags the MR as invalid on its creation if we are
using FastReg and that will force it to do an invalidate of the
rkey on first usage.

The inference is that the invalidate only happens on first MR use, but I don't see anywhere that sets frd_valid to true. It looks like it will happen all the time for all MRs.

Comment by Mike Marciniszyn [ 30/Jun/21 ]

It is starting to look to me like there is a concurrency issue where somehow the old key is subsequently passed to an invalidate.

Here is an invalidate for rkey 0x100:

     lst_t_00_09-3288  [046] d...  6592.029697: rvt_invalidate_rkey_p: (rvt_invalidate_rkey+0x0/0x60 [rdmavt]) qpn=0x8 pd=0xffff9ebe1a48a200 rkey=0x100
     lst_t_00_09-3288  [046] d...  6592.029700: rvt_invalidate_rkey_1: (rvt_invalidate_rkey+0x29/0x60 [rdmavt]) table_ptr=0xffffada64a4d1000
     lst_t_00_09-3288  [046] d.Z.  6592.029702: rvt_invalidate_rkey_2: (rvt_invalidate_rkey+0x36/0x60 [rdmavt]) mr_lkey=0x100 ib_mr_type=0 ib_mr_rkey=0x101 ib_mr_lkey=0x101
     lst_t_00_09-3288  [046] d...  6592.029705: rvt_invalidate_rkey: (rvt_post_send+0x525/0x800 [rdmavt] <- rvt_invalidate_rkey) ret=0x0

Here is the one that fails:

kiblnd_sd_00_00-3359  [036] d...  6592.029947: rvt_invalidate_rkey_p: (rvt_invalidate_rkey+0x0/0x60 [rdmavt]) qpn=0x8 pd=0xffff9ebe1a48a200 rkey=0x100
     lst_t_00_13-3292  [050] d...  6592.029947: rvt_lkey_ok_p: (rvt_lkey_ok+0x0/0x380 [rdmavt]) pd=0xffff9ebe1a48a200 sge_lkey=0x0
     lst_t_00_18-3297  [001] d.Z.  6592.029948: rvt_invalidate_rkey_2: (rvt_invalidate_rkey+0x36/0x60 [rdmavt]) mr_lkey=0x60700 ib_mr_type=0 ib_mr_rkey=0x60701 ib_mr_lkey=0x60701
     lst_t_00_13-3292  [050] d...  6592.029948: rvt_lkey_ok: (rvt_post_send+0x2dc/0x800 [rdmavt] <- rvt_lkey_ok) ret=0x1
 kiblnd_sd_00_00-3359  [036] d...  6592.029950: rvt_invalidate_rkey_1: (rvt_invalidate_rkey+0x29/0x60 [rdmavt]) table_ptr=0xffffada64a4d1000
     lst_t_00_18-3297  [001] d...  6592.029951: rvt_invalidate_rkey: (rvt_post_send+0x525/0x800 [rdmavt] <- rvt_invalidate_rkey) ret=0x0
 kiblnd_sd_00_00-3359  [036] d.Z.  6592.029952: rvt_invalidate_rkey_2: (rvt_invalidate_rkey+0x36/0x60 [rdmavt]) mr_lkey=0x101 ib_mr_type=0 ib_mr_rkey=0x101 ib_mr_lkey=0x101
     lst_t_00_13-3292  [050] dN..  6592.029952: rvt_post_send: (kiblnd_post_tx_locked+0x857/0xa50 [ko2iblnd] <- rvt_post_send) ret=0x0
     lst_t_00_18-3297  [001] d...  6592.029952: rvt_fast_reg_mr_p: (rvt_fast_reg_mr+0x0/0x70 [rdmavt]) qpn=0x4 ibmr=0xffff9ebe3fd0e600 pd=0xffff9ebe1a48a200 key=0x60701
          <idle>-0     [010] d.h.  6592.029952: rvt_lkey_ok_p: (rvt_lkey_ok+0x0/0x380 [rdmavt]) pd=0xffff9ebe1a48a200 sge_lkey=0x0
          <idle>-0     [010] d.h.  6592.029953: rvt_lkey_ok: (rvt_get_rwqe+0x2c8/0x450 [rdmavt] <- rvt_lkey_ok) ret=0x1
     lst_t_00_12-3291  [049] dN..  6592.029953: rvt_invalidate_rkey_p: (rvt_invalidate_rkey+0x0/0x60 [rdmavt]) qpn=0xa pd=0xffff9ebe1a48a200 rkey=0x70800
     lst_t_00_18-3297  [001] d...  6592.029954: rvt_fast_reg_mr: (rvt_post_send+0x1a3/0x800 [rdmavt] <- rvt_fast_reg_mr) ret=0x0
 kiblnd_sd_00_00-3359  [036] d...  6592.029955: rvt_invalidate_rkey: (rvt_post_send+0x525/0x800 [rdmavt] <- rvt_invalidate_rkey) ret=0xffffffea

Note the different CPU and task name.

Comment by Olaf Faaland [ 06/Jul/21 ]

Hi Mike and Serguei,

Do you have any update on this issue?

I'm expecting to get my OPA test cluster back again today, at which point I plan to more thoroughly compare the behavior of lnet_selftest both with and without the patch to invalidate the rkey before the mr is used.

thanks,

Olaf

Comment by Mike Marciniszyn [ 06/Jul/21 ]

There are two issues with the existing code and hfi1.

First there is what I suspect is a misplaced racy assignment:

void
kiblnd_fmr_pool_unmap(struct kib_fmr *fmr, int status)
{
<snip> This code returns the frd to the list
                if (frd) {
                        frd->frd_valid = false;
                        spin_lock(&fps->fps_lock);
                        list_add_tail(&frd->frd_list, &fpo->fast_reg.fpo_pool_list);
                        spin_unlock(&fps->fps_lock);
                        fmr->fmr_frd = NULL; <- I think this should be before add
                }
<snip> I suspect the NULL should be before adding to the list to avoid a race with the map
<snip> pulling the descriptor from the list
}

The mapping looks like:

int kiblnd_fmr_pool_map(struct kib_fmr_poolset *fps, struct kib_tx *tx,
                        struct kib_rdma_desc *rd, u32 nob, u64 iov,
                        struct kib_fmr *fmr)
{
<snip>This code dequeues the kib_fast_reg_descriptor from a list
                                struct kib_fast_reg_descriptor *frd;
#ifdef HAVE_IB_MAP_MR_SG
                                struct ib_reg_wr *wr;
                                int n;
#else
                                struct ib_rdma_wr *wr;
                                struct ib_fast_reg_page_list *frpl;
#endif
                                struct ib_mr *mr;

                                frd = list_first_entry(&fpo->fast_reg.fpo_pool_list,
                                                        struct kib_fast_reg_descriptor,
                                                        frd_list);
                                list_del(&frd->frd_list);
                                spin_unlock(&fps->fps_lock);
<snip> This code sets up the invalidate operation embedded in the kib_fast_reg_descriptor
                                if (!frd->frd_valid) {
                                        struct ib_rdma_wr *inv_wr;
                                        __u32 key = is_rx ? mr->rkey : mr->lkey;

                                        inv_wr = &frd->frd_inv_wr;
                                        memset(inv_wr, 0, sizeof(*inv_wr));

                                        inv_wr->wr.opcode = IB_WR_LOCAL_INV;
                                        inv_wr->wr.wr_id  = IBLND_WID_MR;
                                        inv_wr->wr.ex.invalidate_rkey = key;

                                        /* Bump the key */
                                        key = ib_inc_rkey(key);
                                        ib_update_fast_reg_key(mr, key);
                                }
<snip> The code goes on to register the pages in ib_map_mr_sg and sets up the
<snip> frd_fastreg_wr embedded in the kib_fast_reg_descriptor
<snip>
<snip> The code then ties the kib_fast_reg_descriptor to the kib_fmr in the kib_tx
                                fmr->fmr_key  = is_rx ? mr->rkey : mr->lkey;
                                fmr->fmr_frd  = frd; <--- here
                                fmr->fmr_pool = fpo;
                                return 0;
<snip> At this point no posts have been done and they are deferred to
<snip> kiblnd_post_tx_locked().
}

The post send looks like:

static int
kiblnd_post_tx_locked(struct kib_conn *conn, struct kib_tx *tx, int credit)
__must_hold(&conn->ibc_lock)
{
<snip> This code sees frd from above and prepends WRs from the kib_fast_reg_descriptor
                if (frd != NULL) {
                        if (!frd->frd_valid) {
                                wr = &frd->frd_inv_wr.wr;
                                wr->next = &frd->frd_fastreg_wr.wr;
                        } else {
                                wr = &frd->frd_fastreg_wr.wr;
                        }
                        frd->frd_fastreg_wr.wr.next = &tx->tx_wrq[0].wr;
                        will_post = true;
                }
<snip> The post is here
               if (lnet_send_error_simulation(tx->tx_lntmsg[0], &tx->tx_hstatus))
                        rc = -EINVAL;
                else
#ifdef HAVE_IB_POST_SEND_RECV_CONST
                        rc = ib_post_send(conn->ibc_cmid->qp, wr,
                                          (const struct ib_send_wr **)&bad);
#else
                        rc = ib_post_send(conn->ibc_cmid->qp, wr, &bad);
#endif

        conn->ibc_last_send = ktime_get();

        if (rc == 0)
                return 0; <-- return here with the WRs from the mapping complete
<snip> At this point there is a tx with everything ready to go, BUT
<snip> every post using the tx until it is unmapped will send the invalidate and fast reg,
<snip> and the invalidate carries the OLD key forever, since nothing has been done to
<snip> remember and disable the stale WRs for the next post using the tx.
}

Every time the kib_tx is used after the first post, there will be a superfluous OLD invalidate followed by an OLD fast reg that sets the key to what it already is.

For hfi1 the second invalidate will get a -EINVAL return code because the keys don't match.
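
As a rough illustration of why the keys stop matching, the following standalone C sketch models the sequence described above. inc_rkey() mirrors the arithmetic of the kernel's ib_inc_rkey() (only the low 8 bits of the key change). strict_local_inv() and replay_posts() are hypothetical helpers invented for this sketch; they assume a provider that rejects an invalidate whose key is not exactly the MR's current key. This is a simplified model of the behavior reported for hfi1, not the rdmavt code.

```c
#include <errno.h>
#include <stdint.h>

/* Same arithmetic as the kernel's ib_inc_rkey(): only the low 8 bits change. */
static uint32_t inc_rkey(uint32_t rkey)
{
	const uint32_t mask = 0x000000ff;

	return ((rkey + 1) & mask) | (rkey & ~mask);
}

/* Hypothetical provider check: the invalidate must name the MR's current key. */
static int strict_local_inv(uint32_t mr_key, uint32_t inv_key)
{
	return mr_key == inv_key ? 0 : -EINVAL;
}

/*
 * Hypothetical replay of one mapped kib_tx being posted 'posts' times.
 * The invalidate and fast reg WRs are built once at map time and then
 * re-sent unchanged on every post. Returns the result of the last
 * invalidate that was attempted.
 */
static int replay_posts(uint32_t initial_key, int posts)
{
	uint32_t mr_key = initial_key;                  /* key the MR holds now */
	uint32_t stale_inv_key = initial_key;           /* captured in frd_inv_wr */
	uint32_t stale_reg_key = inc_rkey(initial_key); /* captured in frd_fastreg_wr */
	int rc = 0;

	for (int i = 0; i < posts; i++) {
		rc = strict_local_inv(mr_key, stale_inv_key);
		if (rc != 0)
			break;
		mr_key = stale_reg_key; /* fast reg installs the bumped key */
	}
	return rc;
}
```

In this model replay_posts(key, 1) succeeds, while replay_posts(key, 2) fails on the second invalidate because the MR already holds the bumped key.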

There are two possible fixes:

  1. Add state to the kib_fast_reg_descriptor that tracks whether the fast reg WRs have been posted, and patch the post logic to post those WRs only if they have not already been posted.
  2. Make the rdmavt invalidate code tolerate the OLD invalidate and fast reg by comparing only the key bits above bit 7.
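
The two options listed above can be sketched in standalone C as follows. This is only an illustration under stated assumptions: struct wr and struct frd_sketch are minimal stand-ins for the kernel structures, and the posted field, build_chain(), unmap_sketch(), keys_match_ignoring_low_byte() and chain_len() are hypothetical names that are not taken from the attached patches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for a work request; not the real struct ib_send_wr. */
struct wr {
	struct wr *next;
};

/* Minimal stand-in for kib_fast_reg_descriptor. */
struct frd_sketch {
	struct wr inv_wr;
	struct wr fastreg_wr;
	bool valid;  /* plays the role of frd_valid */
	bool posted; /* hypothetical: registration WRs were already sent */
};

/*
 * Option 1: prepend the invalidate / fast reg WRs only on the first post.
 * Simplification: real code would set 'posted' only after the post succeeds.
 */
static struct wr *build_chain(struct frd_sketch *frd, struct wr *tx_wr)
{
	struct wr *head;

	if (frd == NULL || frd->posted)
		return tx_wr;

	if (!frd->valid) {
		head = &frd->inv_wr;
		head->next = &frd->fastreg_wr;
	} else {
		head = &frd->fastreg_wr;
	}
	frd->fastreg_wr.next = tx_wr;
	frd->posted = true;
	return head;
}

/* Unmapping re-arms the descriptor so the next mapping posts its WRs again. */
static void unmap_sketch(struct frd_sketch *frd)
{
	frd->posted = false;
}

/* Option 2: treat keys as equal when everything above the low 8 bits agrees. */
static bool keys_match_ignoring_low_byte(uint32_t mr_key, uint32_t wr_key)
{
	return (mr_key >> 8) == (wr_key >> 8);
}

/* Helper for inspecting a chain: counts the WRs reachable from 'w'. */
static size_t chain_len(const struct wr *w)
{
	size_t n = 0;

	for (; w != NULL; w = w->next)
		n++;
	return n;
}
```

With a descriptor whose valid flag is false, the first build_chain() call yields a three-entry chain (invalidate, fast reg, send) and later calls yield only the send WR until unmap_sketch() is called. Option 1 keeps the change inside the LND, while option 2 relaxes the provider's check.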

I'm about to attach the two patches for option 1. The change seems to fix the issue, and lnet_selftest works fine.

I'm getting ready to test the invalidate patch, but I should point out our current code works with SRP, iSer, and NFS RDMA as is.

Comment by Mike Marciniszyn [ 07/Jul/21 ]

linux-kernel-test.patch contains a potential upstream patch.

Older versions attached were incorrect.

Comment by Mike Marciniszyn [ 07/Jul/21 ]

linux-kernel-test.patch solves the issue, at the expense of extra processing whenever the kib_tx is reused.

Comment by Serguei Smirnov [ 07/Jul/21 ]

Mike,

I tried the LND change you suggested on my setup using MOFED, and it appears to be fine.

Please submit the LND patch for review, it will be easier to track with your ownership.

Thanks,

Serguei.

Comment by Mike Marciniszyn [ 07/Jul/21 ]

Please submit the LND patch for review, it will be easier to track with your ownership.

Which one? There are two patches. One is a bug fix in unmap. The other is the fix to ensure duplicate WRs are not sent after the initial posts.

Comment by Peter Jones [ 07/Jul/21 ]

Mike

If you need help getting your gerrit account sorted out then it's best to email me directly rather than using JIRA

Peter

Comment by Serguei Smirnov [ 07/Jul/21 ]

Mike, 

I was referring to the bug fix in the kiblnd_fmr_pool_unmap. 

I have no way of testing the upstream patch with OPA. Is it possible that the kiblnd_fmr_pool_unmap fix is sufficient by itself?

Thanks,

Serguei.

Comment by Mike Marciniszyn [ 07/Jul/21 ]

I have no way of testing the upstream patch with OPA. Is it possible that the kiblnd_fmr_pool_unmap fix is sufficient by itself?

I tested the rdmavt upstream fix by rebuilding the 8.4 GA kernel from srpm adding the upstream patch.   The unmap fix is not sufficient by itself.  I determined that early on. 

We need either to get an upstream patch pulled by Jason and point Red Hat to it, or to use the second Lustre patch, which avoids the superfluous WRs that trigger the EINVAL.

It seems to me that a Lustre fix might be quicker?

Do we need a separate Jira for the map/unmap race fix?

Comment by Serguei Smirnov [ 07/Jul/21 ]

I don't think we need a separate ticket for the map/unmap race fix. Peter can correct me if he disagrees.

Comment by Olaf Faaland [ 07/Jul/21 ]

Mike,

Thanks for tracking this down.  Do you know why we saw this first with RHEL 8.4, but not RHEL 8.3?

Comment by Mike Marciniszyn [ 08/Jul/21 ]

Do you know why we saw this first with RHEL 8.4, but not RHEL 8.3?

Upstream removed FMR as of 5.8 and it looks like the 8.4 RDMA code took that in.

8.3 still has the FMR and it appears that this code will prefer it:

#ifdef HAVE_FMR_POOL_API
        if (dev->ibd_dev_caps & IBLND_DEV_CAPS_FMR_ENABLED)
                rc = kiblnd_alloc_fmr_pool(fps, fpo);
        else
#endif /* HAVE_FMR_POOL_API */
                rc = kiblnd_alloc_freg_pool(fps, fpo, dev->ibd_dev_caps);
        if (rc)
                goto out_fpo;

        fpo->fpo_deadline = ktime_get_seconds() + IBLND_POOL_DEADLINE;
        fpo->fpo_owner = fps;
        *pp_fpo = fpo;

        return 0;

I don't see any override when both are available?
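
To make the selection order explicit, here is a small standalone C sketch of the decision in the excerpt above. The names choose_pool(), enum pool_kind and the CAP_* bit values are placeholders made up for this illustration, and the compile-time HAVE_FMR_POOL_API check is modeled as a boolean parameter so that both cases can be exercised in one build.

```c
#include <stdbool.h>

/* Placeholder capability bits; the real IBLND_DEV_CAPS_* values may differ. */
#define CAP_FMR_ENABLED     0x1u
#define CAP_FASTREG_ENABLED 0x2u

enum pool_kind {
	POOL_FMR,
	POOL_FASTREG
};

/*
 * Mirrors the order of the excerpt: FMR is chosen whenever the FMR pool API
 * was compiled in and the device advertises FMR, even if fast registration
 * is also available. Otherwise the fast reg pool is used.
 */
static enum pool_kind choose_pool(bool have_fmr_pool_api, unsigned int dev_caps)
{
	if (have_fmr_pool_api && (dev_caps & CAP_FMR_ENABLED))
		return POOL_FMR;
	return POOL_FASTREG;
}
```

Under this reading, a kernel that still provides the FMR pool API never reaches the fast reg path on a device that supports both, which would be consistent with the problem first showing up on RHEL 8.4.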

Comment by Mike Marciniszyn [ 08/Jul/21 ]

Here are the patches for Lustre:

  1. 01-move_null.patch
  2. 02-post_state.patch

Comment by Gerrit Updater [ 08/Jul/21 ]

Mike Marciniszyn (mike.marciniszyn@cornelisnetworks.com) uploaded a new patch: https://review.whamcloud.com/44189
Subject: LU-14733 o2iblnd: Move racy NULL assignment
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 071c8c7faa322be1a42d424ff0d83c3113c80140

Comment by Gerrit Updater [ 08/Jul/21 ]

Mike Marciniszyn (mike.marciniszyn@cornelisnetworks.com) uploaded a new patch: https://review.whamcloud.com/44190
Subject: LU-14733 o2iblnd: Avoid double posting invalidate
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 778962a825eeee5b4754664a6baabf61e982aa38

Comment by Mike Marciniszyn [ 08/Jul/21 ]

So far I have unit tested both bulk read and write with OPA cards and RHEL 8.4.

I'm trying to find an MLX card to test with that as well.

Comment by Gerrit Updater [ 12/Jul/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/44189/
Subject: LU-14733 o2iblnd: Move racy NULL assignment
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 023113fb8946f3565529e7327fdcd90ab9db3ba3

Comment by Gerrit Updater [ 12/Jul/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/44190/
Subject: LU-14733 o2iblnd: Avoid double posting invalidate
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5930576791e864529e6ef9b46f3e09cc4b635fc2

Comment by Olaf Faaland [ 12/Jul/21 ]

I saw those two Lustre patches were merged to master.  Has someone been able to test them on MLX to confirm they don't cause new issues there?  Thanks

Comment by Serguei Smirnov [ 12/Jul/21 ]

Before Mike pushed them, I tried these patches on my local setup that uses LTS MOFED 4.9 and cx-2 cards. Ran lnet_selftest, didn't see any issues.

Comment by Mike Marciniszyn [ 12/Jul/21 ]

Before Mike pushed them, I tried these patches on my local setup that uses LTS MOFED 4.9 and cx-2 cards. Ran lnet_selftest, didn't see any issues.

I will also test on some MLX cards. I will try the stock 8.4 kernel vs. MOFED though.

Comment by Olaf Faaland [ 12/Jul/21 ]

Great, thank you both.

Comment by Gerrit Updater [ 12/Jul/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44216
Subject: LU-14733 o2iblnd: Move racy NULL assignment
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: d33a2dcf34fcea3adaf46d1a806405a9c334adc0

Comment by Gerrit Updater [ 12/Jul/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44217
Subject: LU-14733 o2iblnd: Avoid double posting invalidate
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 8c70059776f2bdc7a68a27e35bb5dc36763d3dd6

Comment by Mike Marciniszyn [ 12/Jul/21 ]

05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

I was able to run lnet_selftest on a vanilla RH 8.4 install:

[root@ioperf-05 ~]# ./lnet_wrapper_read
LST_SESSION = 6780
SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
192.168.25.9@o2ib1 are added to session
192.168.25.10@o2ib1 are added to session
Test was added successfully
bulk_read is running now
Capturing statistics for 30 secs [LNet Rates of lfrom]
[R] Avg: 21275    RPC/s Min: 21275    RPC/s Max: 21275    RPC/s
[W] Avg: 10638    RPC/s Min: 10638    RPC/s Max: 10638    RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 10639.30 MiB/s Min: 10639.30 MiB/s Max: 10639.30 MiB/s
[W] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[LNet Rates of lto]
[R] Avg: 10637    RPC/s Min: 10637    RPC/s Max: 10637    RPC/s
[W] Avg: 21275    RPC/s Min: 21275    RPC/s Max: 21275    RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[W] Avg: 10639.10 MiB/s Min: 10639.10 MiB/s Max: 10639.10 MiB/s
[LNet Rates of lfrom]
[R] Avg: 21294    RPC/s Min: 21294    RPC/s Max: 21294    RPC/s
[W] Avg: 10647    RPC/s Min: 10647    RPC/s Max: 10647    RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 10647.90 MiB/s Min: 10647.90 MiB/s Max: 10647.90 MiB/s
[W] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[LNet Rates of lto]
[R] Avg: 10647    RPC/s Min: 10647    RPC/s Max: 10647    RPC/s
[W] Avg: 21294    RPC/s Min: 21294    RPC/s Max: 21294    RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[W] Avg: 10647.90 MiB/s Min: 10647.90 MiB/s Max: 10647.90 MiB/s
[LNet Rates of lfrom]
[R] Avg: 21275    RPC/s Min: 21275    RPC/s Max: 21275    RPC/s
[W] Avg: 10637    RPC/s Min: 10637    RPC/s Max: 10637    RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 10639.50 MiB/s Min: 10639.50 MiB/s Max: 10639.50 MiB/s
[W] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[LNet Rates of lto]
[R] Avg: 10637    RPC/s Min: 10637    RPC/s Max: 10637    RPC/s
[W] Avg: 21276    RPC/s Min: 21276    RPC/s Max: 21276    RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[W] Avg: 10639.50 MiB/s Min: 10639.50 MiB/s Max: 10639.50 MiB/s
[LNet Rates of lfrom]
[R] Avg: 21297    RPC/s Min: 21297    RPC/s Max: 21297    RPC/s
[W] Avg: 10649    RPC/s Min: 10649    RPC/s Max: 10649    RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 10649.22 MiB/s Min: 10649.22 MiB/s Max: 10649.22 MiB/s
[W] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[LNet Rates of lto]
[R] Avg: 10648    RPC/s Min: 10648    RPC/s Max: 10648    RPC/s
[W] Avg: 21294    RPC/s Min: 21294    RPC/s Max: 21294    RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.62     MiB/s Min: 1.62     MiB/s Max: 1.62     MiB/s
[W] Avg: 10649.22 MiB/s Min: 10649.22 MiB/s Max: 10649.22 MiB/s
[LNet Rates of lfrom]
[R] Avg: 21304    RPC/s Min: 21304    RPC/s Max: 21304    RPC/s
[W] Avg: 10652    RPC/s Min: 10652    RPC/s Max: 10652    RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 10653.70 MiB/s Min: 10653.70 MiB/s Max: 10653.70 MiB/s
[W] Avg: 1.63     MiB/s Min: 1.63     MiB/s Max: 1.63     MiB/s
[LNet Rates of lto]
[R] Avg: 10652    RPC/s Min: 10652    RPC/s Max: 10652    RPC/s
[W] Avg: 21305    RPC/s Min: 21305    RPC/s Max: 21305    RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.63     MiB/s Min: 1.63     MiB/s Max: 1.63     MiB/s
[W] Avg: 10653.70 MiB/s Min: 10653.70 MiB/s Max: 10653.70 MiB/s

lfrom:
Total 0 error nodes in lfrom
lto:
Total 0 error nodes in lto
1 batch in stopping
Batch is stopped
session is ended

The only issue I ran into was a panic trying to reboot. The servers required a power cycle.

Jul 12 19:09:14 ioperf-06 kernel: reboot          D    0 10092   9865 0x00004080
Jul 12 19:09:14 ioperf-06 kernel: Call Trace:
Jul 12 19:09:14 ioperf-06 kernel: __schedule+0x2c4/0x700
Jul 12 19:09:14 ioperf-06 kernel: ? __switch_to_asm+0x35/0x70
Jul 12 19:09:14 ioperf-06 kernel: ? __switch_to_asm+0x35/0x70
Jul 12 19:09:14 ioperf-06 kernel: schedule+0x38/0xa0
Jul 12 19:09:14 ioperf-06 kernel: schedule_timeout+0x246/0x2f0
Jul 12 19:09:14 ioperf-06 kernel: ? __switch_to_asm+0x41/0x70
Jul 12 19:09:14 ioperf-06 kernel: ? __switch_to+0x10c/0x480
Jul 12 19:09:14 ioperf-06 kernel: ? __schedule+0x2cc/0x700
Jul 12 19:09:14 ioperf-06 kernel: wait_for_completion+0x97/0x100
Jul 12 19:09:14 ioperf-06 kernel: cma_remove_one+0x23f/0x310 [rdma_cm]
Jul 12 19:09:14 ioperf-06 kernel: remove_client_context+0x8b/0xd0 [ib_core]
Jul 12 19:09:14 ioperf-06 kernel: disable_device+0x8c/0x130 [ib_core]
Jul 12 19:09:14 ioperf-06 kernel: __ib_unregister_device+0x35/0xa0 [ib_core]
Jul 12 19:09:14 ioperf-06 kernel: ib_unregister_device+0x21/0x30 [ib_core]
Jul 12 19:09:14 ioperf-06 kernel: __mlx5_ib_remove+0x38/0x60 [mlx5_ib]
Jul 12 19:09:14 ioperf-06 kernel: mlx5_detach_device+0xb2/0xc0 [mlx5_core]
Jul 12 19:09:14 ioperf-06 kernel: mlx5_unload_one+0x80/0x120 [mlx5_core]
Jul 12 19:09:14 ioperf-06 kernel: shutdown+0x144/0x1d0 [mlx5_core]
Jul 12 19:09:14 ioperf-06 kernel: pci_device_shutdown+0x34/0x60
Jul 12 19:09:14 ioperf-06 kernel: device_shutdown+0x161/0x212
Jul 12 19:09:14 ioperf-06 kernel: kernel_restart+0xe/0x30
Jul 12 19:09:14 ioperf-06 kernel: __do_sys_reboot+0x1d2/0x210
Jul 12 19:09:14 ioperf-06 kernel: ? syscall_trace_enter+0x1d3/0x2c0
Jul 12 19:09:14 ioperf-06 kernel: ? __audit_syscall_exit+0x249/0x2a0
Jul 12 19:09:14 ioperf-06 kernel: do_syscall_64+0x5b/0x1a0
Jul 12 19:09:14 ioperf-06 kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
Jul 12 19:09:14 ioperf-06 kernel: RIP: 0033:0x7f73aa2825f7
Jul 12 19:09:14 ioperf-06 kernel: Code: Unable to access opcode bytes at RIP 0x7f73aa2825cd.
Jul 12 19:09:14 ioperf-06 kernel: RSP: 002b:00007ffed2874d28 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
Jul 12 19:09:14 ioperf-06 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f73aa2825f7
Jul 12 19:09:14 ioperf-06 kernel: RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
Jul 12 19:09:14 ioperf-06 kernel: RBP: 00007ffed2874d70 R08: 0000000000000002 R09: 0000000000000000
Jul 12 19:09:14 ioperf-06 kernel: R10: 000000000000004b R11: 0000000000000246 R12: 0000000000000001
Jul 12 19:09:14 ioperf-06 kernel: R13: 00000000fffffffe R14: 0000000000000006 R15: 0000000000000000

Comment by Gerrit Updater [ 13/Jul/21 ]

Gian-Carlo DeFazio (defazio1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/44295
Subject: LU-14733 o2iblnd: Move racy NULL assignment
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 8db93a26f7e99f140ccd4a0fd3e35f4e9f71b8ec

Comment by Gerrit Updater [ 13/Jul/21 ]

Gian-Carlo DeFazio (defazio1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/44296
Subject: LU-14733 o2iblnd: Avoid double posting invalidate
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 1105bfbfad5002bc673ff568c0720bcebc5d095a

Comment by Olaf Faaland [ 19/Jul/21 ]

Mike, Serguei, this works on our test system

Comment by Peter Jones [ 24/Jul/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 10/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44216/
Subject: LU-14733 o2iblnd: Move racy NULL assignment
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 173d60a3274d19bc1d9811b6e1b09aac2b25f221

Comment by Gerrit Updater [ 28/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44217/
Subject: LU-14733 o2iblnd: Avoid double posting invalidate
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 96d7dcf4e773e6026a590e4596ef30ac8a4a5061

Comment by Gerrit Updater [ 14/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/44295/
Subject: LU-14733 o2iblnd: Move racy NULL assignment
Project: fs/lustre-release
Branch: b2_14
Current Patch Set:
Commit: 380be07fcca1f76564d1f29e58f2d8d5f8f530c8

Comment by Gerrit Updater [ 14/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/44296/
Subject: LU-14733 o2iblnd: Avoid double posting invalidate
Project: fs/lustre-release
Branch: b2_14
Current Patch Set:
Commit: 29da7cba3e7b3461d895010c7f7284b9649aba52

Generated at Sat Feb 10 03:12:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.