Thanks for the tips, Serguei. Based on what you wrote, I did a little math and came up with some suggestions.
On each orelic router, I have around 2200 peers (based on `lnetctl peer show`). Of those, only 34 peers are on tcp while the remaining are on various o2ib networks. Following my understanding of your recommendations, I would make the following changes on orelic:
- o2iblnd
  - peer_credits: 32 (up from 8)
  - credits: 65536 (up from 1024)
  - conns_per_peer: leave at 1 (MLNX std)
  - concurrent_sends: 64 (up from 8)
  - peercredits_hiw: 16 (up from 4)
- tcp (200 Gb link)
  - peer_credits: 16 (up from 8, increase due to high b/w of network)
  - credits: 512 (16 * 34 = 544; up from 8)
- Router buffers
  - (# o2ib peers * o2ib peer_credits) + (# tcp peers * tcp peer_credits)
  - (2166 * 32) + (34 * 16) = 69856, so round down to 65536 (up from 2048/16384/2048)
Do those numbers look sane? Should I set all 3 buffers to 65536? Note these orelic routers have 256 GB of RAM.
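For anyone who wants to re-run the arithmetic with their own peer counts, here is a minimal Python sketch of the sizing calculation above. The function names are made up for illustration, and rounding down to the nearest power of two is an assumption about how 69856 became 65536 and 544 became 512; it is not an LNet requirement that I know of.

```python
def credits_needed(peer_groups):
    """Sum peers * peer_credits over (peers, peer_credits) pairs."""
    return sum(peers * peer_credits for peers, peer_credits in peer_groups)


def round_down_pow2(n):
    """Largest power of two <= n (assumed rounding rule, see lead-in)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1 << (n.bit_length() - 1)


# Peer counts and peer_credits values taken from the proposal above.
o2ib = (2166, 32)   # (# o2ib peers, o2ib peer_credits)
tcp = (34, 16)      # (# tcp peers, tcp peer_credits)

tcp_credits = round_down_pow2(credits_needed([tcp]))   # 544 -> 512
total = credits_needed([o2ib, tcp])                     # 69856
router_buffers = round_down_pow2(total)                 # 65536

print(tcp_credits, total, router_buffers)
```

Changing the two tuples is enough to redo the estimate for a router with a different mix of o2ib and tcp peers.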
Thanks!
Cameron
Since setting those values yesterday, at least one of the nodes that was logging lots of errors (asp4) cleaned up immediately and has remained clean since.