[LU-12385] 2.12.2: IOR read testing reliably inducing LNetError: Timed out RDMA with ...@o2ib1 (6): c: 0, oc: 0, rc: 63 Created: 04/Jun/19 Updated: 28/Sep/19 Resolved: 20/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Matt Rásó-Barnett (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | |||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hello,

The test environment is:

Server Environment

# 24 identical servers, each server contains 12x NVMe SSDs.
# Each server is both MDS and OSS
lustre-2.12.2-1.el7.x86_64
kernel-3.10.0-957.10.1.el7_lustre.x86_64

# Fabric is Intel OPA, using the distro's in-kernel OFED
# Each server has 2x OPA cards, configured with LNET multi-rail
[root@dac-e-1 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib1
      local NI(s):
        - nid: 10.47.18.1@o2ib1
          status: up
          interfaces:
              0: ib0
        - nid: 10.47.18.25@o2ib1
          status: up
          interfaces:
              0: ib1

# LNET tunables
tunables:
    peer_timeout: 180
    peer_credits: 128
    peer_buffer_credits: 0
    credits: 1024
    peercredits_hiw: 64
    map_on_demand: 256
    concurrent_sends: 256
    fmr_pool_size: 2048
    fmr_flush_trigger: 512
    fmr_cache: 1
    ntx: 2048
    conns_per_peer: 4

Client Environment

# Test involves 10 clients
lustre-client-2.12.2-1.el7.x86_64
kernel-3.10.0-957.10.1.el7.x86_64

# Client has 1x Intel OPA card
    - net type: o2ib1
      local NI(s):
        - nid: 10.47.20.61@o2ib1
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 21957610
              recv_count: 29253267
              drop_count: 2
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
              peercredits_hiw: 64
              map_on_demand: 256
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          lnd tunables:
              dev cpt: 0
              tcp bonding: 0
          CPT: "[0,1]"

I'm testing with the 'ior_easy' phase of IO500, which just writes a single file per process to achieve a maximum bandwidth number. The write phase of the test completes fine; however, after a few minutes of the read phase, the clients start to show LNetError messages such as:

Client Errors

Jun 04 19:47:08 cpu-e-1149 kernel: Lustre: Mounted fs1-client
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.47.18.14@o2ib1 (5): c: 0, oc: 0, rc: 63
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Skipped 1 previous similar message
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1559674918/real 1559674924] req@ffff8ddc96787500 x1635429905047040/t0(0) o3->fs1-OST0089-osc-ffff8df3bf0c8000@10.47.18.12@o2ib1:6/4 lens 488/440 e 0 to 1 dl 1559674962 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 36 previous similar messages
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection to fs1-OST0089 (at 10.47.18.12@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection restored to 10.47.18.12@o2ib1 (at 10.47.18.12@o2ib1)
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.47.18.10@o2ib1 - queue depth reduced from 128 to 63 to allow for qp creation
Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 83 previous similar messages

And the servers:

Server Errors

Jun 04 20:02:08 dac-e-10 kernel: Lustre: fs1-OST006f: Client 899ea795-c4b3-f3f8-43dd-db593fc06301 (at 10.47.21.126@o2ib1) reconnecting
Jun 04 20:02:08 dac-e-10 kernel: Lustre: Skipped 1 previous similar message
Jun 04 20:02:59 dac-e-10 kernel: LustreError: 303530:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(1048576) req@ffff8b787babd050 x1635429904859568/t0(0) o3->899ea795-c4b3-f3f8-43dd-db593fc06301@10.47.21.126@o2ib1:527/0 lens 488/440 e 3 to 0 dl 1559674997 ref 1 fl Interpret:/0/0 rc 0/0
Jun 04 20:02:59 dac-e-10 kernel: LustreError: 303530:0:(ldlm_lib.c:3268:target_bulk_io()) Skipped 1 previous similar message
Jun 04 20:02:59 dac-e-10 kernel: Lustre: fs1-OST0074: Bulk IO read error with 899ea795-c4b3-f3f8-43dd-db593fc06301 (at 10.47.21.126@o2ib1), client will retry: rc -110
Jun 04 20:02:59 dac-e-10 kernel: Lustre: Skipped 6 previous similar messages

I've attached log files to the ticket with full logs for all the servers and the clients.

The IOR output with the command used is:

IOR-3.3alpha1: MPI Coordinated Test of Parallel I/O
Began              : Tue Jun 4 19:56:12 2019
Command line       : /home/mjr208/projects/benchmarking/io-500-src-stonewall-fix/bin/ior -r -R -a POSIX --posix.odirect -C -t 1m -b 140g -F -i 1 -C -Q 1 -g -G 27 -k -e -o /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/ior_file_easy -O stoneWallingStatusFile=/dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/stonewall
Machine            : Linux cpu-e-829
TestID             : 0
StartTime          : Tue Jun 4 19:56:12 2019
Path               : /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy
FS                 : 412.6 TiB   Used FS: 8.5%   Inodes: 960.0 Mi   Used Inodes: 0.0%

Options:
api                : POSIX
apiVersion         :
test filename      : /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/ior_file_easy
access             : file-per-process
type               : independent
segments           : 1
ordering in a file : sequential
tasks              : 320
clients per node   : 32
repetitions        : 1
xfersize           : 1 MiB
blocksize          : 140 GiB
aggregate filesize : 43.75 TiB

Results:
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
WARNING: Expected aggregate file size = 48103633715200.
WARNING: Stat() of aggregate file size = 38543976038400.
WARNING: Using actual aggregate bytes moved = 38543976038400.
read 16882 146800640 1024.00 0.023208 2177.30 0.023764 2177.33 0

Max Read: 16882.35 MiB/sec (17702.42 MB/sec)

Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum
read 16882.35 16882.35 16882.35 0.00 16882.35 16882.35 16882.35 0.00 2177.32777 0 320 32 1 1 2 1 0 0 1 150323855360 1048576 36758400.0 POSIX 0
Finished : Tue Jun 4 20:32:29 2019

My main cause for suspicion here is that this same hardware and client combination was running the same benchmark this morning; however, at that time the servers were running Lustre 2.12.0 and the clients were on 2.10.7. Is there anything notable in the logs provided as to what might be going on?

Thanks, |
| Comments |
| Comment by Peter Jones [ 04/Jun/19 ] |
|
Sonia, can you please advise here? Thanks, Peter |
| Comment by Sonia Sharma (Inactive) [ 05/Jun/19 ] |
|
Hi Matt,

Is this OPA or MLX? And are you seeing any performance drop as well, along with these error messages?

These messages in the logs:

Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.47.18.14@o2ib1 (5): c: 0, oc: 0, rc: 63

indicate that the message is queued on the o2iblnd internal queue and hasn't been posted for the default timeout of 50 seconds. Later in the logs, I see this message:

Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.47.18.10@o2ib1 - queue depth reduced from 128 to 63 to allow for qp creation

What happens is that when the initial request to create the connection is received, o2iblnd tries to create the queue pair (initially sized to the number of peer credits). If that fails, it reduces the queue depth by 1 and retries the qp creation, and this continues until it either successfully creates the qp or the queue depth drops below 2. |
| Comment by Matt Rásó-Barnett (Inactive) [ 05/Jun/19 ] |
|
Hi Sonia, this is OPA. I am only seeing this when I upgrade the clients to 2.12.2. Our production clients usually run 2.10.7, and I've re-run this test with the 2.10.7 clients talking to the same 2.12.2 servers with no problem. The kernel version of the two sets of clients is the same. If the first set of messages is due to messages being queued on the o2iblnd internal queue, is that because of congestion on the link? What else could be the cause? |
| Comment by Matt Rásó-Barnett (Inactive) [ 05/Jun/19 ] |
|
This graph shows the total read bandwidth as reported by the servers' obdfilter.$OST.stats read_bytes during the log period I sent above. Essentially, around the time those messages start appearing (~20:02), the total read bandwidth drops away completely, until each OST is doing just a few MiB/s. |
| Comment by Peter Jones [ 05/Jun/19 ] |
|
Does the same behaviour manifest with 2.12.1 clients? Peter |
| Comment by Matt Rásó-Barnett (Inactive) [ 05/Jun/19 ] |
|
Hi Peter, I just tried it and unfortunately yes, I did get the same behaviour. I'm going to try 2.12.0 on the same clients now and see if anything changes. Matt |
| Comment by Sonia Sharma (Inactive) [ 05/Jun/19 ] |
|
Hi Matt, another site was facing similar issues with OPA, and removing this patch - https://review.whamcloud.com/#/c/30310/ - helped them. So maybe we can give it a try. |
| Comment by Matt Rásó-Barnett (Inactive) [ 05/Jun/19 ] |
|
Hi Sonia, I'm pretty sure TID-RDMA is disabled; here is our hfi1.conf:

options hfi1 krcvqs=8 pcie_caps=0x51 rcvhdrcnt=8192

I haven't been brave enough to ever try TID-RDMA again, as we turned it on when we first got our OPA cluster and there was a horrendous memory leak in the HFI1 driver... Regarding the patch, I will see if I can give that a go tomorrow. I tried 2.12.0 earlier today and it also had the same problem, so I'm hitting this issue with all 2.12.x versions unfortunately. |
| Comment by Peter Jones [ 05/Jun/19 ] |
|
Sonia, while I agree that we have had some reports about that fix from sites using OPA at scale, isn't it also in 2.10.7 (which seems to be ok)? Peter |
| Comment by Matt Rásó-Barnett (Inactive) [ 06/Jun/19 ] |
|
Hi Sonia, Peter, what I wanted to ask is whether you have any guidance on further debugging I could do that might help track down this issue, since I can reliably reproduce it in our environment.
Thanks, |
| Comment by Sonia Sharma (Inactive) [ 06/Jun/19 ] |
|
Hi Matt,

Looking closely at the graph, I see that the sharp performance drop happens at 20:02, and if we look at the logs, that is the same time that the queue depth is reduced. The tx timeout and RDMA timeout messages have timestamps earlier than 20:02.

Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.47.18.14@o2ib1 (5): c: 0, oc: 0, rc: 63
Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Skipped 1 previous similar message
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1559674918/real 1559674924] req@ffff8ddc96787500 x1635429905047040/t0(0) o3->fs1-OST0089-osc-ffff8df3bf0c8000@10.47.18.12@o2ib1:6/4 lens 488/440 e 0 to 1 dl 1559674962 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 36 previous similar messages
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection to fs1-OST0089 (at 10.47.18.12@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection restored to 10.47.18.12@o2ib1 (at 10.47.18.12@o2ib1)
Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.47.18.10@o2ib1 - queue depth reduced from 128 to 63 to allow for qp creation

So I would suggest removing the patch I mentioned above and running the performance tests again. I am suggesting this because that patch does the reduction of the connection queue-depth, and it is possible that with heavy traffic the reduced queue-depth leads to more tx being queued up, resulting in the performance drop. |
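For anyone building a test tree with that change removed, a minimal sketch of one way to do it (this assumes a git checkout of lustre-release for the branch being built, and that the landed commit's message references the Gerrit change URL; the commit id is a placeholder):

# locate the landed commit for https://review.whamcloud.com/#/c/30310/ in this branch
git log --oneline --grep='review.whamcloud.com/30310'
# revert it locally, then rebuild and repackage Lustre as usual
git revert <commit-id>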
| Comment by Amir Shehata (Inactive) [ 12/Jul/19 ] |
|
Hi Matt, any updates on this issue? It looks like in the last comments there was a suggestion to revert https://review.whamcloud.com/#/c/30310/. That patch has proven problematic on OPA clusters. |
| Comment by James A Simmons [ 12/Jul/19 ] |
|
Didn't a fix land in 2.13 for this? |
| Comment by James A Simmons [ 12/Jul/19 ] |
|
I think the fix needed is https://review.whamcloud.com/34945. That patch landed for 2.12.2; what version are you running? |
| Comment by Peter Jones [ 12/Jul/19 ] |
|
It says 2.12.2 in the initial report |
| Comment by James A Simmons [ 12/Jul/19 ] |
|
So master is most likely broken as well. |
| Comment by Peter Jones [ 12/Jul/19 ] |
|
Yes |
| Comment by Amir Shehata (Inactive) [ 12/Jul/19 ] |
|
The patch I pointed to is problematic for OPA and I'm not sure exactly why, but on larger clusters they see timeouts on the transmit queue. If Matt is able to revert the patch and try his test, it would verify whether this is the issue on his system. Sonia suggested that earlier on, but from the responses which came after, it's not clear if it has been tried. One potential workaround is to check the type of HW and apply the logic selectively, but I would feel better if we root-cause why OPA behaves differently. There has been some chatter regarding changing the hfi configuration parameters. I need to investigate the impact of these hfi module parameters and whether they could be related to this particular issue. |
| Comment by Peter Jones [ 12/Jul/19 ] |
|
I know from discussions with Matt that reverting that patch against master appeared to resolve the issue. |
| Comment by Matt Rásó-Barnett (Inactive) [ 12/Jul/19 ] |
|
Hi Amir, yes, as Peter says I did end up reverting that patch against master and it did appear to resolve the issue. Since the IO500 I've reverted everything back to 2.12.2 and I've been hitting the problem again, so I will test just reverting this patch against 2.12.2 and see if I can confirm that this resolves the problem again. |
| Comment by Amir Shehata (Inactive) [ 13/Jul/19 ] |
|
Hi Matt, do you have cycles to help me root-cause the issue? I don't have a setup to reproduce it. Having access to a setup that reproduces the problem will help us arrive at a solution more quickly. |
| Comment by Matt Rásó-Barnett (Inactive) [ 13/Jul/19 ] |
|
Hi Amir, yes that would be great - I think I have a reliable reproducer. How do you want to proceed on this - is there something I could gather or do you want to observe? Cheers, Matt |
| Comment by Amir Shehata (Inactive) [ 16/Jul/19 ] |
|
Hi Matt,

Talking with mmarcini2 from Intel, we'd like some stats from the HFI driver, using:

hfi1stats -c 5

This will report hfi statistics at a 5-second interval. If you can grab the statistics before the test starts to get a baseline, then run the reproducer of the problem while recording the statistics on the clients and the servers, that'll be great. Ideally we'll have the stats from the nodes which encounter the problem. |
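For the collection itself, something along these lines should work on each client and server; the output directory and the 60-second baseline window are arbitrary assumptions here, only 'hfi1stats -c 5' comes from the request above:

mkdir -p /tmp/hfi1stats
# short baseline capture before starting the reproducer
timeout 60 hfi1stats -c 5 > /tmp/hfi1stats/$(hostname)-baseline.txt
# then record continuously while the IOR test runs (stop with Ctrl-C when done)
hfi1stats -c 5 > /tmp/hfi1stats/$(hostname)-during-ior.txt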
| Comment by Mike Marciniszyn [ 16/Jul/19 ] |
|
Matt,
What is the version of the hfi1 driver? |
| Comment by Matt Rásó-Barnett (Inactive) [ 17/Jul/19 ] |
|
Hi Mike,

$ opaconfig -V
10.8.0.0.204

The servers are using the inbox driver and are on RHEL 7.6, with kernel 3.10.0-957.10.1.el7_lustre.x86_64.

I should confirm again that reverting the https://jira.whamcloud.com/browse/LU-10213 patch does appear to fix/work around the problem, so I've moved the servers and clients to 2.12.2 with this patch reverted and I don't see the same errors any more.

@Amir - I will redeploy back onto 2.12.2 again at the end of the week and collect these stats. I've also been speaking with James Erwin from Intel, who reached out to me too - he's identified that our servers did not have the recommended irqbalance configuration that they should have had, so this is being corrected too.

Thanks, |
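On the irqbalance point, a quick way to see what a server is currently doing is below; the actual recommended settings come from Intel's OPA tuning guidance rather than anything in this ticket, so this only shows the current state:

# confirm irqbalance is running and see which arguments/policy it was started with
systemctl status irqbalance
cat /etc/sysconfig/irqbalance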
| Comment by Mike Marciniszyn [ 17/Jul/19 ] |
|
What is the scale involved here? What is the minimum configuration? |
| Comment by Matt Rásó-Barnett (Inactive) [ 17/Jul/19 ] |
|
I've recently been doing tests with 32 clients and 24 Lustre servers when hitting the problem. However, I was already hitting the problem a few weeks ago, when I opened the ticket, with 10 client nodes. I also always appear to trigger the problem during the IOR read phase of the benchmark I'm running - I'm not sure what the significance of that is, if any, but this will primarily be the clients reading data from the servers. |
| Comment by Mike Marciniszyn [ 17/Jul/19 ] |
|
A read means that Lustre will send a Read RPC to the server(s), and the server will post RDMA write(s) and a send. Each RDMA write/send pair consumes a Lustre credit. Can you get a copy of /sys/kernel/debug/hfi1/hfi1_0/qp_stats while Lustre has live QPs? I want to see the sizes of the QPs on both clients and servers. I will need the same when you go back to 2.12.2. |
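Grabbing that file could be as simple as the following on each node of interest, while the IOR test is running; the output path is an arbitrary choice and the debugfs path is the one named above (this assumes debugfs is mounted at /sys/kernel/debug):

cat /sys/kernel/debug/hfi1/hfi1_0/qp_stats > /tmp/qp_stats.$(hostname).$(date +%Y%m%d-%H%M%S)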
| Comment by Matt Rásó-Barnett (Inactive) [ 19/Jul/19 ] |
|
Hi Mike, attached are qp_stats from a server and a client during an IOR read test, when using the 'fixed' version of 2.12.2 with the patch reverted. I will collect the same once I've redeployed back to 2.12.2 and the problem reoccurs, hopefully later today. Cheers,
| Comment by Mike Marciniszyn [ 31/Jul/19 ] |
|
Amir, is there a status on the issue here? My understanding is that there are some Lustre settings that can allow Lustre 2.12.2 to avoid the credit starvation issue that may be happening. Is there a formal patch? My understanding is that the detour is to change peer_credits_hiw to roughly half of peer_credits, or 32?
|
| Comment by Mike Marciniszyn [ 31/Jul/19 ] |
|
There is also another potential detour if running 2.12.2. Add the following hfi1 module parameter to the options line:

max_qp_wrs=32896
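For example, assuming the hfi1 options live in /etc/modprobe.d/hfi1.conf as in the file Matt quoted earlier, the options line would become something like the following (a reboot or hfi1 module reload is needed for it to take effect):

options hfi1 krcvqs=8 pcie_caps=0x51 rcvhdrcnt=8192 max_qp_wrs=32896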
|
| Comment by Amir Shehata (Inactive) [ 12/Aug/19 ] |
|
Hi Matt, would you be able to try your reproducer using peer_credits=16 and peer_credits_hiw=8? You don't have to explicitly set peer_credits_hiw; it will get set to half of peer_credits if it's not set in the module parameters. There are two intents here.
It would be interesting to see the result of your testing with these two parameters set. Please note that this test should be done with the unmodified 2.12.2, without the patch reverted. |
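A sketch of how this could be set, assuming the options are placed in an /etc/modprobe.d/ko2iblnd.conf-style file and Lustre/LNet is unmounted and reloaded afterwards; the lnetctl check just confirms which values are actually in use (the same tunables block shown in the ticket description):

# hypothetical module options file; merge with any existing ko2iblnd options line
options ko2iblnd peer_credits=16 peer_credits_hiw=8

# after reloading, verify the values actually in effect
lnetctl net show -v | grep -E 'peer_credits|peercredits_hiw'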
| Comment by Peter Jones [ 20/Aug/19 ] |
|
mrb when will you be able to run this test? We'd like to have a robust fix in place ahead of 2.12.3... |
| Comment by Matt Rásó-Barnett (Inactive) [ 20/Aug/19 ] |
|
Hi Peter, Amir, sorry for the delay on my end - I should have sent an update sooner, but I've been away on leave for a few weeks. However, I'm back at work now and will get this picked up. I'll try to get a test running on this today and see if I can get a reply to you all in a couple of days - I'm a bit out of sync with the state of things since I was away, so I'll need to get set back up, but it shouldn't take too long to get the test done. Matt
|
| Comment by Peter Jones [ 20/Aug/19 ] |
|
Thanks mrb - I suspected as much |
| Comment by Matt Rásó-Barnett (Inactive) [ 21/Aug/19 ] |
|
Hi Amir,

I ran the reproducer test last night and there did seem to be an improvement. I redeployed all servers and clients to vanilla 2.12.2 and ran the reproducer test with the default OPA ko2iblnd settings of:

peer_credits=128 peer_credits_hiw=64

to be sure that we still see the issue. As expected, this hit the problem again in the IOR read phase. Attached are logs from this run:

Then I changed peer_credits=16 and peer_credits_hiw=8 on both clients and servers and re-ran the test. This time the test completed successfully, although there were still slowdowns and the read performance is still lower than I would expect by a factor of 2. However, I didn't see the 'Timed out RDMA' errors from before, so maybe there is a different reason for that. Attached are logs from these runs:

clients-peer-credits-16.run1.log
clients-peer-credits-16.run2.log
clients-peer-credits-16.run3.log

I haven't tried anything else yet - I was thinking to also try Mike's suggestion of the HFI module parameter max_qp_wrs=32896. I am also still using krcvqs=8 in hfi1.conf, which James Erwin mentioned is very high, so I was going to reduce that back to the default and retest as well. But the peer_credits change does seem to be a big improvement. Is there anything else I should collect at the moment?

Many thanks for the help on this,
Matt |
| Comment by Mike Marciniszyn [ 21/Aug/19 ] |
|
The above module parameter will allow for the creation of a Send Q that will support 128 credits. The downside is a large per-QP memory footprint that might cause issues at scale. The krcvqs value is large, and probably 4 is enough, but that is having no impact on the "credit" issue. |
| Comment by Amir Shehata (Inactive) [ 21/Aug/19 ] |
|
I would suggest the following three tests to see the impact of the peer_credits on the performance. If we can run the reproducer with
and see if this is going to improve performance, and by how much. Then run again with Mike's suggested HFI modifications and increase to peer_credits=128 and peer_credits_hiw=64. This will help us determine the best default values to use for these credits. |
| Comment by Matt Rásó-Barnett (Inactive) [ 22/Aug/19 ] |
|
Just a short update to say that a test last night with peer_credits=32 and peer_credits_hiw=16 already seemed to recover a lot of the performance for me. I didn't get a chance to do much more, so I will most likely update tomorrow after I've done more testing on this and the other configurations. Cheers, Matt |
| Comment by Matt Rásó-Barnett (Inactive) [ 29/Aug/19 ] |
|
Sorry for being so slow to update on this. I've done a few further tests now, with a few iterations of the benchmark for each of these configurations:
I haven't yet tried peer_credits=128, peer_credits_hiw=64 and max_qp_wrs=32896 - I will try that tonight. However, the performance of the other four configurations was largely the same, and none showed the 'Timed out RDMA' log error messages. I will try with peer_credits=128 again tonight and update again if that shows any improvement. Cheers, Matt |
| Comment by Amir Shehata (Inactive) [ 29/Aug/19 ] |
|
These results seem to argue that we should set the defaults to peer_credits=32 and peer_credits_hiw=16, as well as implement a fix to ensure that if peer_credits is reduced dynamically, peer_credits_hiw is also adjusted. |
| Comment by Matt Rásó-Barnett (Inactive) [ 02/Sep/19 ] |
|
I tried with peer_credits=128, peer_credits_hiw=64 and max_qp_wrs=32896 over the weekend and this also appeared to fix the problem. Read performance was again similar to the other configurations tried with lower peer_credits. So it seems that this hfi module param increase is also a fix to this issue. |
| Comment by Gerrit Updater [ 06/Sep/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36072 |
| Comment by Amir Shehata (Inactive) [ 06/Sep/19 ] |
|
Hi Matt, there are 3 patches which should resolve this issue:

https://review.whamcloud.com/36072 LU-12385 lnet: update opa defaults
https://review.whamcloud.com/36073 LU-12621 o2iblnd: cache max_qp_wr
https://review.whamcloud.com/35578 LU-12569 o2iblnd: Make credits hiw connection aware
|
| Comment by Peter Jones [ 06/Sep/19 ] |
|
mrb are you able to verify whether these three patches, as hoped, address this issue? |
| Comment by Matt Rásó-Barnett (Inactive) [ 09/Sep/19 ] |
|
Hi Peter, yes I'll give this a test - shall I apply these on top of branch b2_12 for testing? |
| Comment by Peter Jones [ 09/Sep/19 ] |
|
If you can, that would be perfect, but testing on master would work too. Basically, as long as you can run without the patches and reproduce this behaviour, then run with them and avoid it. |
| Comment by Peter Jones [ 12/Sep/19 ] |
|
Matt Do you expect to get to this shortly? I understand that you are probably pressed with other priorities (not least prepping for your LAD talk) but I'd like to move to plan B for how we handle this issue for 2.12.3 if we're not going to be able to validate whether this proposed fix works soon. Thanks Peter |
| Comment by Matt Rásó-Barnett (Inactive) [ 12/Sep/19 ] |
|
Hi Peter, yes sorry - I can get this done before the weekend for sure, I'll send an update tomorrow once I've done the test. |
| Comment by Peter Jones [ 12/Sep/19 ] |
|
Great - thanks Matt! |
| Comment by Matt Rásó-Barnett (Inactive) [ 12/Sep/19 ] |
|
Hi Peter, |
| Comment by Peter Jones [ 12/Sep/19 ] |
|
Excellent - thanks! |
| Comment by Gerrit Updater [ 20/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36072/ |
| Comment by Peter Jones [ 20/Sep/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 21/Sep/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36252 |
| Comment by Gerrit Updater [ 28/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36252/ |