Thanks for the tips, Serguei. Based on what you wrote, I did a little math and came up with some suggestions.
On each orelic router, I have around 2200 peers (based on `lnetctl peer show`). Of those, only 34 peers are on tcp; the rest are on various o2ib networks. Following my understanding of your recommendations, I would make the following changes on orelic:
- o2iblnd
  - peer_credits: 32 (up from 8)
  - credits: 65536 (up from 1024)
  - conns_per_peer: leave at 1 (MLNX std)
  - concurrent_sends: 64 (up from 8)
  - peercredits_hiw: 16 (up from 4)
- tcp (200 Gb link)
  - peer_credits: 16 (up from 8, increase due to high b/w of network)
  - credits: 512 (16 * 34 = 544; up from 8)
- Router buffers
  - (# o2ib peers * o2ib peer_credits) + (# tcp peers * tcp peer_credits)
  - (2166 * 32) + (34 * 16) = 69856, so round down to 65536 (up from 2048/16384/2048)
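
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above. The helper names are made up for illustration and are not part of any Lustre tooling. Rounding down to a power of two is my assumption, based on the 69856 -> 65536 and 544 -> 512 steps.

```python
# Illustrative only: these helpers are hypothetical, not Lustre/LNet APIs.

def total_peer_credits(networks):
    """Sum (peer count * peer_credits) over all networks."""
    return sum(peers * peer_credits for peers, peer_credits in networks.values())

def round_down_pow2(n):
    """Largest power of two that is <= n (n must be >= 1)."""
    return 1 << (n.bit_length() - 1)

# (peer count, proposed peer_credits) per network type, from the list above
networks = {
    "o2ib": (2166, 32),
    "tcp": (34, 16),
}

tcp_credits = networks["tcp"][0] * networks["tcp"][1]
router_buffers = total_peer_credits(networks)

print(tcp_credits, round_down_pow2(tcp_credits))        # 544 512
print(router_buffers, round_down_pow2(router_buffers))  # 69856 65536
```

Changing the peer counts or peer_credits values in `networks` recomputes both totals.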
Do the proposed numbers look sane? Should I set all 3 router buffers to 65536? Note these orelic routers have 256 GB of RAM.
Thanks!
Cameron
Cameron, are you planning to run any performance tests to compare "before" and "after"?