[LU-12385] 2.12.2: IOR read testing reliably inducing LNetError: Timed out RDMA with ...@o2ib1 (6): c: 0, oc: 0, rc: 63 Created: 04/Jun/19  Updated: 28/Sep/19  Resolved: 20/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Major
Reporter: Matt Rásó-Barnett (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Attachments: PNG File Screenshot from 2019-06-05 14-20-52.png     File clients-peer-credits-128.log     File clients-peer-credits-16.run1.log     File clients-peer-credits-16.run2.log     File clients-peer-credits-16.run3.log     File clients.20190604.log     Text File qp_stats_client_good.txt     Text File qp_stats_server_good.txt     File servers-peer-credits-128.log     File servers-peer-credits-16.run1.log     File servers-peer-credits-16.run2.log     File servers-peer-credits-16.run3.log     File servers.20190604.log    
Issue Links:
Related
is related to LU-10213 o2iblnd: Potential discrepancy when a... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hello,
I've been doing some more benchmark testing for IO500 today and have been reliably hitting what appears to be an LNET/RDMA/Network problem since upgrading our servers and clients to Lustre 2.12.2.

The test environment is:

Server Environment
# 24 Identical servers, each server contains 12x NVMe SSDs.
# Each server is both MDS and OSS
lustre-2.12.2-1.el7.x86_64
kernel-3.10.0-957.10.1.el7_lustre.x86_64

# Fabric is Intel OPA, using distro's in-kernel OFED
# Each server has 2x OPA cards, configured with LNET multi-rail
[root@dac-e-1 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib1
      local NI(s):
        - nid: 10.47.18.1@o2ib1
          status: up
          interfaces:
              0: ib0
        - nid: 10.47.18.25@o2ib1
          status: up
          interfaces:
              0: ib1

# LNET tunables
        tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
              peercredits_hiw: 64
              map_on_demand: 256
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
Client Environment
# Test involves 10 clients
lustre-client-2.12.2-1.el7.x86_64
kernel-3.10.0-957.10.1.el7.x86_64

# Client has 1x Intel OPA card
    - net type: o2ib1
      local NI(s):
        - nid: 10.47.20.61@o2ib1
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 21957610
              recv_count: 29253267
              drop_count: 2
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
              peercredits_hiw: 64
              map_on_demand: 256
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          lnd tunables:
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1]"
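
For reference, a minimal sketch of how o2iblnd tunables like these are typically applied on both servers and clients, assuming a modprobe fragment such as /etc/modprobe.d/ko2iblnd.conf (the file name and exact line below are illustrative rather than a copy of our configuration):

# illustrative only - values mirror the lnetctl output above
options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 ntx=2048 conns_per_peer=4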

I'm testing with the 'ior_easy' phase of IO500, which just writes a single file per-process to achieve a maximum bandwidth number.

The write phase of the test completes fine; however, after a few minutes of the read phase the clients start to show LNetError messages such as:

Client Errors
Jun 04 19:47:08 cpu-e-1149 kernel: Lustre: Mounted fs1-client
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.47.18.14@o2ib1 (5): c: 0, oc: 0, rc: 63
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Skipped 1 previous similar message
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1559674918/real 1559674924]  req@ffff8ddc96787500 x1635429905047040/t0(0) o3->fs1-OST0089-osc-ffff8df3bf0c8000@10.47.18.12@o2ib1:6/4 lens 488/440 e 0 to 1 dl 1559674962 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 36 previous similar messages
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection to fs1-OST0089 (at 10.47.18.12@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection restored to 10.47.18.12@o2ib1 (at 10.47.18.12@o2ib1)
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.47.18.10@o2ib1 - queue depth reduced from 128 to 63  to allow for qp creation
Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 83 previous similar messages

And the servers:

Server Errors
Jun 04 20:02:08 dac-e-10 kernel: Lustre: fs1-OST006f: Client 899ea795-c4b3-f3f8-43dd-db593fc06301 (at 10.47.21.126@o2ib1) reconnecting
Jun 04 20:02:08 dac-e-10 kernel: Lustre: Skipped 1 previous similar message
Jun 04 20:02:59 dac-e-10 kernel: LustreError: 303530:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(1048576)  req@ffff8b787babd050 x1635429904859568/t0(0) o3->899ea795-c4b3-f3f8-43dd-db593fc06301@10.47.21.126@o2ib1:527/0 lens 488/440 e 3 to 0 dl 1559674997 ref 1 fl Interpret:/0/0 rc 0/0
Jun 04 20:02:59 dac-e-10 kernel: LustreError: 303530:0:(ldlm_lib.c:3268:target_bulk_io()) Skipped 1 previous similar message
Jun 04 20:02:59 dac-e-10 kernel: Lustre: fs1-OST0074: Bulk IO read error with 899ea795-c4b3-f3f8-43dd-db593fc06301 (at 10.47.21.126@o2ib1), client will retry: rc -110
Jun 04 20:02:59 dac-e-10 kernel: Lustre: Skipped 6 previous similar messages

I've attached log files to the ticket with full logs for all the servers and the clients.

The IOR output, including the command used, is:

IOR-3.3alpha1: MPI Coordinated Test of Parallel I/O
Began               : Tue Jun  4 19:56:12 2019
Command line        : /home/mjr208/projects/benchmarking/io-500-src-stonewall-fix/bin/ior -r -R -a POSIX --posix.odirect -C -t 1m -b 140g -F -i 1 -C -Q 1 -g -G 27 -k -e -o /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/ior_file_easy -O stoneWallingStatusFile=/dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/stonewall
Machine             : Linux cpu-e-829
TestID              : 0
StartTime           : Tue Jun  4 19:56:12 2019
Path                : /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy
FS                  : 412.6 TiB   Used FS: 8.5%   Inodes: 960.0 Mi   Used Inodes: 0.0%

Options: 
api                 : POSIX
apiVersion          : 
test filename       : /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/ior_file_easy
access              : file-per-process
type                : independent
segments            : 1
ordering in a file  : sequential
tasks               : 320
clients per node    : 32
repetitions         : 1
xfersize            : 1 MiB
blocksize           : 140 GiB
aggregate filesize  : 43.75 TiB

Results: 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
WARNING: Expected aggregate file size       = 48103633715200.
WARNING: Stat() of aggregate file size      = 38543976038400.
WARNING: Using actual aggregate bytes moved = 38543976038400.
read      16882      146800640  1024.00    0.023208   2177.30    0.023764   2177.33    0   
Max Read:  16882.35 MiB/sec (17702.42 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
read        16882.35   16882.35   16882.35       0.00   16882.35   16882.35   16882.35       0.00 2177.32777     0    320  32    1   1     2        1         0    0      1 150323855360  1048576 36758400.0 POSIX      0
Finished            : Tue Jun  4 20:32:29 2019

My main cause for suspicion here is that this same hardware and client combination was running the same benchmark this morning; however, at that time the servers were running Lustre 2.12.0 and the clients were on 2.10.7.

Is there anything notable in the logs provided as to what might be going on?

Thanks,
Matt



 Comments   
Comment by Peter Jones [ 04/Jun/19 ]

Sonia

Can you please advise here?

Thanks

Peter

Comment by Sonia Sharma (Inactive) [ 05/Jun/19 ]

Hi Matt,

Is this OPA or MLX? And are you seeing any performance drop along with these error messages?

These messages in the logs, 

Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.47.18.14@o2ib1 (5): c: 0, oc: 0, rc: 63

They indicate that the message is queued on the o2iblnd internal queue and hasn't been posted for the default timeout of 50 seconds.

Later in the logs, I see this message

Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.47.18.10@o2ib1 - queue depth reduced from 128 to 63  to allow for qp creation

What happens is that when the initial request is received to create the connection, o2iblnd tries to create the queue pair (initially sized to the number of peer_credits), but if that fails it reduces the queue depth by 1 and retries the qp creation. This continues until it either successfully creates the qp or reduces the queue depth below 2.
So, if there are no more error messages once the queue depth is negotiated and a connection is established, then this message is purely informational.

Comment by Matt Rásó-Barnett (Inactive) [ 05/Jun/19 ]

Hi Sonia, this is OPA.
I am seeing a severe performance drop once these messages appear. The test is meant to run for roughly 300-400 seconds, and I can see the expected bandwidth for the first 4-5 minutes, but then these errors appear and bandwidth completely drops to a few MiB/s until the test eventually completes.

I am only seeing this when I upgrade the clients to 2.12.2. Our production clients usually run 2.10.7 and I've re-run this test with the 2.10.7 clients talking to the same 2.12.2 servers with no problem. The kernel version of the two sets of clients is the same.

If the first set of messages is due to messages queued on the o2iblnd internal queue, is this because of congestion on the link? What else could be the cause?

Comment by Matt Rásó-Barnett (Inactive) [ 05/Jun/19 ]

This graph shows the total read bandwidth as reported by the servers' obdfilter.$OST.stats read_bytes during the log period I sent above. Essentially, around the time those messages start appearing (~20:02) the total read bandwidth drops away completely, until each OST is doing just a few MiB/s.
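
For reference, a minimal sketch of one way to sample the same counters on each server (the read_bytes values are cumulative, so successive samples need to be differenced to get bandwidth):

lctl get_param obdfilter.*.stats | grep -E 'obdfilter|read_bytes'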

Comment by Peter Jones [ 05/Jun/19 ]

mrb

Does the same behaviour manifest with 2.12.1 clients?

Peter

Comment by Matt Rásó-Barnett (Inactive) [ 05/Jun/19 ]

Hi Peter, I just tried it and unfortunately yes I did get the same behaviour.

I'm going to try with 2.12.0 on the same clients now and see if any change.

Matt

Comment by Sonia Sharma (Inactive) [ 05/Jun/19 ]

Hi Matt,

Some other site was facing similar issues with OPA, and removing this patch - https://review.whamcloud.com/#/c/30310/ - helped them, so maybe we can give it a try.
I would also check whether TID-RDMA is disabled; if enabled, it might affect performance.

Comment by Matt Rásó-Barnett (Inactive) [ 05/Jun/19 ]

Hi Sonia, I'm pretty sure TID-RDMA is disabled, here is our hfi1.conf:

options hfi1 krcvqs=8 pcie_caps=0x51 rcvhdrcnt=8192

I haven't been brave enough to ever try TID-RDMA again, as we turned it on when we first got our OPA cluster and there was a horrendous memory leak in the HFI1 driver...

Regarding the patch, I will see if I can give that a go tomorrow.

I tried 2.12.0 earlier today and it also had the same problem, so I'm hitting this issue with all 2.12.X versions unfortunately.

Comment by Peter Jones [ 05/Jun/19 ]

Sonia

While I agree that we have had some reports about that fix from sites using OPA at scale, isn't it also in 2.10.7 (which seems to be ok)?

Peter

Comment by Matt Rásó-Barnett (Inactive) [ 06/Jun/19 ]

Hi Sonia, Peter,
This is unfortunately blocking me from using 2.12 for our IO500 submission, as we have a set of clients with two OPA cards that we were going to use for a 10-node challenge run (so multi-rail servers and multi-rail clients). I also saw a very good improvement in the 'pfind' scores with 2.12, which I would like to take advantage of.

What I wanted to ask was if you had any guidance on further debugging I can do that might help track down this issue, since I can reliably reproduce it in our environment?

  • Would lctl debug traces from clients/servers around the time of the errors be of use? Is there any particular debug filter I should use other than 'all' to make it less noisy? (A rough sketch of what I had in mind follows this list.)
  • Do you have any guidance on what LNET or hfi1 tunables other large-scale OPA sites have had success with at 2.12? We are using the default LNET tunables for 2.12.
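
A rough sketch of the sort of capture I had in mind (the flags and buffer size are just my guess at a starting point - happy to use whatever you prefer):

# on the affected clients/servers, before starting the run
lctl set_param debug_mb=512
lctl set_param debug=+net
lctl set_param debug=+neterror
# run the reproducer, then dump the kernel debug buffer on each node
lctl dk /tmp/lustre-debug.$(hostname -s).log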

Thanks,
Matt

Comment by Sonia Sharma (Inactive) [ 06/Jun/19 ]

Hi Matt,

Looking closely at the graph, I see that the sharp performance drop happens at 20:02, and if we look at the logs, that is the same time that the queue depth is reduced. The tx timeout and RDMA timeout messages have a timestamp earlier than 20:02.

Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.47.18.14@o2ib1 (5): c: 0, oc: 0, rc: 63
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Skipped 1 previous similar message
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1559674918/real 1559674924]  req@ffff8ddc96787500 x1635429905047040/t0(0) o3->fs1-OST0089-osc-ffff8df3bf0c8000@10.47.18.12@o2ib1:6/4 lens 488/440 e 0 to 1 dl 1559674962 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 36 previous similar messages
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection to fs1-OST0089 (at 10.47.18.12@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection restored to 10.47.18.12@o2ib1 (at 10.47.18.12@o2ib1)
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.47.18.10@o2ib1 - queue depth reduced from 128 to 63  to allow for qp creation

So I would suggest removing the patch I mentioned above and running the performance tests once more. I am suggesting this because that patch does the reduction of the connection queue depth, and it is possible that with heavy traffic the reduced queue depth leads to more tx being queued up, resulting in the performance drop.
We might get a different error by removing this patch, but to cover our bases I would recommend trying it once.
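
If it helps, one way to build a tree with that change reverted is roughly the following (a sketch only, assuming anonymous access to the Gerrit project at that URL; the /1 patch-set suffix is a placeholder, so take the latest patch set shown on the change page):

# fetch Gerrit change 30310 (LU-10213) and revert it on top of the build tree
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/10/30310/1
git revert FETCH_HEAD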

Comment by Amir Shehata (Inactive) [ 12/Jul/19 ]

Hi Matt,

Any updates on this issue?

Looking at the last comments, there was a suggestion to revert https://review.whamcloud.com/#/c/30310/, which has proven problematic on OPA clusters.

Comment by James A Simmons [ 12/Jul/19 ]

Didn't a fix land to 2.13 for this?

Comment by James A Simmons [ 12/Jul/19 ]

I think the fix needed is 

https://review.whamcloud.com/34945

from LU-12279. Amir is that correct?

The patch landed for 2.12.2, what version are you running?

Comment by Peter Jones [ 12/Jul/19 ]

It says 2.12.2 in the initial report

Comment by James A Simmons [ 12/Jul/19 ]

So master is most likely broken as well.

Comment by Peter Jones [ 12/Jul/19 ]

Yes

Comment by Amir Shehata (Inactive) [ 12/Jul/19 ]

The patch I pointed to is problematic for OPA and I'm not sure why exactly. But on larger clusters they see timeouts on the transmit queue. If Matt is able to revert the patch and try his test it would verify if this is the issue on his system. Sonia had suggested that earlier on, but from the responses which came after, it's not clear if it has been tried.

One potential workaround is to check the type of HW and apply the logic selectively. But I would feel better if we root-caused why OPA behaves differently. There has been some chatter regarding changing the hfi configuration parameters. I need to investigate the impact of these hfi module parameters and whether they could be related to this particular issue.

Comment by Peter Jones [ 12/Jul/19 ]

I know from discussions with Matt that reverting LU-10213 did clear up the issue

Comment by Matt Rásó-Barnett (Inactive) [ 12/Jul/19 ]

Hi Amir, yes, as Peter says I did end up reverting that patch against master and it did appear to resolve the issue. Since the IO500 I've reverted everything back to 2.12.2 and I've been hitting the problem again, so I will test just reverting this patch against 2.12.2 and see if I can confirm that this resolves the problem again.
Cheers,
Matt

Comment by Amir Shehata (Inactive) [ 13/Jul/19 ]

Hi Matt,

Do you have cycles to help me root cause the issue? I don't have a setup to reproduce it. Having access to a setup that reproduces the problem will help us arrive at a solution more quickly.

Comment by Matt Rásó-Barnett (Inactive) [ 13/Jul/19 ]

Hi Amir, yes that would be great - I think I have a reliable reproducer. How do you want to proceed on this - is there something I could gather or do you want to observe?

Cheers,

Matt

Comment by Amir Shehata (Inactive) [ 16/Jul/19 ]

Hi Matt

I've been talking with mmarcini2 from Intel; we'd like some stats from the HFI driver.

With the LU-10213 patch (which causes the problem) applied, can you collect:

 hfi1stats -c 5

This will report hfi statistics at a 5-second interval.

If you can grab the statistics before the test starts to get a baseline, then run the reproducer while recording the statistics on the clients and the servers, that would be great. Ideally we'll have the stats on the nodes which encounter the problem.
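
Something along these lines on each client and server would work (the output path is just a suggestion):

# start a little before IOR so the first samples act as the baseline;
# stop it (Ctrl-C / kill) once the run is over
hfi1stats -c 5 > /tmp/hfi1stats.$(hostname -s).log &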

Comment by Mike Marciniszyn [ 16/Jul/19 ]

Matt,

 

What is the version of the hfi1 driver?

Comment by Matt Rásó-Barnett (Inactive) [ 17/Jul/19 ]

Hi Mike,
The clients are using an IFS release, at the following version:

$ opaconfig -V
10.8.0.0.204

The servers are using the inbox driver, and are on RHEL7.6, with: kernel: 3.10.0-957.10.1.el7_lustre.x86_64

I should confirm again that reverting the https://jira.whamcloud.com/browse/LU-10213 patch does appear to fix/work around the problem; I've moved the servers and clients to 2.12.2 with this patch reverted and I don't see the same errors any more.

@Amir - I will redeploy back onto 2.12.2 again at the end of the week and collect these stats.

I've also been speaking with James Erwin from Intel, who reached out to me too - he identified that our servers did not have the recommended irqbalance configuration, so this is being corrected too.

Thanks,
Matt

Comment by Mike Marciniszyn [ 17/Jul/19 ]

@Matt Rásó-Barnett

What is the scale involved here?

What is the minimum configuration?

Comment by Matt Rásó-Barnett (Inactive) [ 17/Jul/19 ]

I've recently been doing tests with 32 clients and 24 Lustre servers when hitting the problem. However, I was also hitting the problem with 10 client nodes a few weeks ago when I opened the ticket.

I also always appear to trigger the problem during the IOR read phase of the benchmark I'm running - I'm not sure what the significance of that is, if any, but this will primarily be the clients reading data from the servers.

Comment by Mike Marciniszyn [ 17/Jul/19 ]

A read means that Lustre will send a Read RPC to the server(s) and the server will post RDMA write(s) and a send.  Each RDMA write/send pair consumes a Lustre credit.

Can you get a copy of:

/sys/kernel/debug/hfi1/hfi1_0/qp_stats

when Lustre has live QPs?

I want to see what the QP sizes are on both clients and servers.

I will need the same when you go back to 2.12.2.
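
Something like the following on a client and a server while the IOR job is in flight would be fine (the path under /tmp is just a suggestion; this needs root and debugfs mounted):

cat /sys/kernel/debug/hfi1/hfi1_0/qp_stats > /tmp/qp_stats.$(hostname -s).txt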

Comment by Matt Rásó-Barnett (Inactive) [ 19/Jul/19 ]

Hi Mike, attached are qp_stats from a server and a client during an IOR read test, when using the 'fixed' version of 2.12.2 with the patch reverted.

I will collect the same once I've redeployed back to 2.12.2 and the problem occurs, hopefully later today.

Cheers,
Matt

qp_stats_client_good.txt qp_stats_server_good.txt

Comment by Mike Marciniszyn [ 31/Jul/19 ]

Amir,

Is there a status on the issue here?

My understanding is that there are some Lustre settings that can allow Lustre 2.12.2 to avoid the credit starvation issue that may be happening.

Is there a formal patch?

My understanding is that the detour is to change peer_credits_hiw to roughly half of peer_credits, or 32?

 

Comment by Mike Marciniszyn [ 31/Jul/19 ]

There is also another potential detour if running 2.12.2.

Add the following hfi1 module parameter options line:

max_qp_wrs=32896
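
For example, appended to the hfi1.conf options line Matt posted earlier (illustrative; the hfi1 module needs to be reloaded, or the node rebooted, for it to take effect):

options hfi1 krcvqs=8 pcie_caps=0x51 rcvhdrcnt=8192 max_qp_wrs=32896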

 

Comment by Amir Shehata (Inactive) [ 12/Aug/19 ]

Hi Matt,

Would you be able to try your reproducer using peer_credits=16 and peer_credits_hiw=8? You don't have to explicitly set peer_credits_hiw - it'll get set to half of peer_credits if it's not set in the module parameters.

There are two intents here.

  1. I want to see if we need a high value for peer_credits
  2. I want to eliminate the scenario where we ever hit the code where it reduces the queue depth

It would be interesting to see the result of your testing with these two parameters set.

Please note that this test should be done with the LU-10213 patch applied.
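
For example, something like this in the ko2iblnd modprobe options on both clients and servers (a sketch only - put it wherever your o2iblnd options currently live, and reload the LNet modules afterwards):

options ko2iblnd peer_credits=16 peer_credits_hiw=8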

Comment by Peter Jones [ 20/Aug/19 ]

mrb when will you be able to run this test? We'd like to have a robust fix in place ahead of 2.12.3...

Comment by Matt Rásó-Barnett (Inactive) [ 20/Aug/19 ]

Hi Peter, Amir

Sorry for the delay on my end - I should have sent an update sooner but I've been away on leave for a few weeks. However I'm back at work now and will get this picked up.

I'll try and get a test running on this today and see if I can get a reply to you all in a couple of days - I'm a bit out of sync with the state of things since I was away, so I'll need to get set back up, but it shouldn't take too long to get the test done.

Matt

 

Comment by Peter Jones [ 20/Aug/19 ]

Thanks mrb - I suspected as much

Comment by Matt Rásó-Barnett (Inactive) [ 21/Aug/19 ]

Hi Amir,

I ran the reproducer test last night and there did seem to be an improvement.

I redeployed all servers and clients to vanilla 2.12.2 and ran the reproducer test with the default OPA ko2iblnd settings of peer_credits=128 and peer_credits_hiw=64, to be sure that we still see the issue. As expected, this hit the problem again during the IOR read phase.

Attached are logs from this run:
servers-peer-credits-128.log clients-peer-credits-128.log

Then I changed to peer_credits=16 and peer_credits_hiw=8 on both clients and servers and re-ran the test.

This time the test completed successfully, although there were still slowdowns and the read performance is still lower than I would expect by a factor of 2. However, I didn't see the 'Timed out RDMA' errors from before, so maybe there is a different reason for this.

Attached are logs from these runs:

clients-peer-credits-16.run1.log servers-peer-credits-16.run1.log

clients-peer-credits-16.run2.log servers-peer-credits-16.run2.log

clients-peer-credits-16.run3.log servers-peer-credits-16.run3.log

I haven't tried anything else yet - I was thinking to try Mike's suggestion of the HFI module parameter: max_qp_wrs=32896 as well still. I also am still using krcvqs=8 in hfi1.conf, which James Erwin mentioned is very high, so I was going to reduce this back to the default and retest as well.

But the peer_credits change does seem to be a big improvement.

Is there anything else I should collect at the moment?

Many thanks for the help on this,

Matt

Comment by Mike Marciniszyn [ 21/Aug/19 ]

I haven't tried anything else yet - I was thinking to try Mike's suggestion of the HFI module parameter: max_qp_wrs=32896 as well still. I also am still using krcvqs=8 in hfi1.conf, which James Erwin mentioned is very high, so I was going to reduce this back to the default and retest as well.

The above module parameter will allow for the creation of a send queue that will support 128 credits. The downside is a large per-QP memory footprint that might cause issues at scale.

The krcvqs setting is large, and 4 is probably enough, but that is having no impact on the "credit" issue.

Comment by Amir Shehata (Inactive) [ 21/Aug/19 ]

I would suggest the following three tests to see the impact of the peer_credits on the performance.

If we can run the reproducer with

  1. peer_credits=32 and peer_credits_hiw=16
  2. peer_credits=63 and peer_credits_hiw=32

and see if this is going to improve performance and by how much.

Then run again with Mike's suggested HFI modification and increase peer_credits to 128 and peer_credits_hiw to 64.

This will help us determine the best default values to use for these credits.
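
Before each run it is worth double-checking which values actually took effect, e.g. (a sketch):

lnetctl net show -v
cat /sys/module/ko2iblnd/parameters/peer_credits /sys/module/ko2iblnd/parameters/peer_credits_hiw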

Comment by Matt Rásó-Barnett (Inactive) [ 22/Aug/19 ]

Just a short update to say that a test last night with peer_credits=32 and peer_credits_hiw=16 already seemed to recover a lot of the performance for me. I didn't get a chance to do much more, so I will most likely update tomorrow after I've done more testing on this and the other configurations.

Cheers,

Matt

Comment by Matt Rásó-Barnett (Inactive) [ 29/Aug/19 ]

Sorry for being so slow to update on this.

I've done a few further tests now, with a few iterations of the benchmark for each of these configurations:

  1. peer_credits=32 and peer_credits_hiw=16
  2. peer_credits=63 and peer_credits_hiw=32
  3. peer_credits=32 and peer_credits_hiw=16 and max_qp_wrs=32896
  4. peer_credits=63 and peer_credits_hiw=32 and max_qp_wrs=32896

I haven't yet tried with: peer_credits=128 and peer_credits_hiw=64 and max_qp_wrs=32896 - I will try that tonight.

However, the performance of these four configurations was largely the same, and none of them showed the 'Timed out RDMA' error messages in the logs.

I will try with peer_credits=128 again tonight and update again if that shows any improvement.

Cheers,

Matt

Comment by Amir Shehata (Inactive) [ 29/Aug/19 ]

These results seem to argue that we should set the defaults to peer_credits=32 and peer_credits_hiw=16.

We should also implement a fix to ensure that if peer_credits is reduced dynamically, peer_credits_hiw is adjusted as well.

Comment by Matt Rásó-Barnett (Inactive) [ 02/Sep/19 ]

I tried with peer_credits=128, peer_credits_hiw=64 and max_qp_wrs=32896 over the weekend and this also appeared to fix the problem. Read performance was again similar to the other configurations tried with lower peer_credits.

So it seems that this hfi module parameter increase is also a fix for this issue.

Comment by Gerrit Updater [ 06/Sep/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36072
Subject: LU-12385 lnet: update opa defaults
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0b74daac778694097a445b92812ef8c99ffb1881

Comment by Amir Shehata (Inactive) [ 06/Sep/19 ]

Hi Matt,

There are 3 patches which should resolve this issue:

https://review.whamcloud.com/36072 LU-12385 lnet: update opa defaults
https://review.whamcloud.com/36073 LU-12621 o2iblnd: cache max_qp_wr
https://review.whamcloud.com/35578 LU-12569 o2iblnd: Make credits hiw connection aware

LU-12385: based on your testing, I changed the defaults to peer_credits=32 and peer_credits_hiw=16.

LU-12621: changes the algorithm to cache the maximum number of work requests per QP which a device can handle, and makes sure we do not exceed it when creating the QP.

LU-12569: if we do end up reducing the queue depth, makes sure to use the negotiated queue depth when calculating the hiw.

Comment by Peter Jones [ 06/Sep/19 ]

mrb are you able to verify whether these three patches do, as hoped, address this issue?

Comment by Matt Rásó-Barnett (Inactive) [ 09/Sep/19 ]

Hi Peter, yes I'll give this a test - shall I apply these on top of branch b2_12 for testing?
Cheers,
Matt

Comment by Peter Jones [ 09/Sep/19 ]

If you can, that would be perfect, but testing on master would work too. Basically, as long as you can run without the patches and reproduce this behaviour, then run with them and avoid it.
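
For reference, one way to pull the three changes onto b2_12 would be roughly the following (a sketch only, assuming anonymous access to the Gerrit project at that URL; the patch-set suffixes are placeholders, so take the latest from each change):

git checkout b2_12
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/72/36072/1 && git cherry-pick FETCH_HEAD
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/73/36073/1 && git cherry-pick FETCH_HEAD
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/78/35578/1 && git cherry-pick FETCH_HEAD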

Comment by Peter Jones [ 12/Sep/19 ]

Matt

Do you expect to get to this shortly? I understand that you are probably pressed with other priorities (not least prepping for your LAD talk) but I'd like to move to plan B for how we handle this issue for 2.12.3 if we're not going to be able to validate whether this proposed fix works soon.

Thanks

Peter

Comment by Matt Rásó-Barnett (Inactive) [ 12/Sep/19 ]

Hi Peter, yes sorry - I can get this done before the weekend for sure, I'll send an update tomorrow once I've done the test.
Cheers,
Matt

Comment by Peter Jones [ 12/Sep/19 ]

Great - thanks Matt!

Comment by Matt Rásó-Barnett (Inactive) [ 12/Sep/19 ]

Hi Peter,
I've run the test just now and can confirm that these three patches resolve the issue for me. I re-ran the test case against clients and servers running upstream 2.12.2, which showed the problem as expected. Then I took the current tip of branch b2_12, applied those three patches from gerrit on top, and re-ran the test, which didn't show the issue.
Cheers,
Matt

Comment by Peter Jones [ 12/Sep/19 ]

Excellent - thanks!

Comment by Gerrit Updater [ 20/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36072/
Subject: LU-12385 lnet: update opa defaults
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7f199dbf0261b89afe0dc8185db4403ae0efdefa

Comment by Peter Jones [ 20/Sep/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 21/Sep/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36252
Subject: LU-12385 lnet: update opa defaults
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 371ba5e6c74cfa3a709254a97905f3506092f446

Comment by Gerrit Updater [ 28/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36252/
Subject: LU-12385 lnet: update opa defaults
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: eaed79bd261177b27b9993aa2ccb167d381c1e36
