socklnd needs improved interface selection and configuration
(LU-14064)
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Technical task | Priority: | Minor |
| Reporter: | Shuichi Ihara | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Description |
|
When the server receives messages from clients, those messages are distributed into CPTs (CPU partitions) and then passed to the upper layers. However, if there are LNet routers between the clients and the servers, the hashing is based on the router's NID rather than the clients' NIDs.

Without LNet router:

nid                  refs state last max rtr min tx min queue
0@lo                    1 NA    -1     0   0   0  0   0      0
10.0.0.34@o2ib12        7 NA    -1     8   8   8  2 -20   3616
10.0.11.226@o2ib12      1 NA    -1     8   8   8  8  -8      0
10.0.0.39@o2ib12        5 NA    -1     8   8   8  4 -18   2560
10.0.0.31@o2ib12        5 NA    -1     8   8   8  4 -20   1984
10.0.0.35@o2ib12        5 NA    -1     8   8   8  4 -18   1752
10.0.0.36@o2ib12        6 NA    -1     8   8   8  3 -19   2544
10.0.0.32@o2ib12        1 NA    -1     8   8   8  8 -18      0
10.0.0.33@o2ib12        6 NA    -1     8   8   8  3 -17   2312
10.0.11.225@o2ib12      1 NA    -1     8   8   8  8  -8      0
10.0.0.40@o2ib12        6 NA    -1     8   8   8  3 -19   3056
10.0.0.38@o2ib12        1 NA    -1     8   8   8  8 -21      0
10.0.11.227@o2ib12      1 NA    -1     8   8   8  8  -8      0
10.0.0.37@o2ib12        6 NA    -1     8   8   8  3 -18   3248

These messages are handled by LNet threads in different CPTs, because the hash is taken over the clients' NIDs:

top - 01:05:44 up 1 day, 16:09, 2 users, load average: 39.70, 18.43, 7.46
Tasks: 1442 total, 75 running, 1367 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 50.6 sy, 0.0 ni, 48.3 id, 1.0 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 15369398+total, 13057227+free, 18096748 used, 5024956 buff/cache
KiB Swap: 11075580 total, 11075580 free, 0 used. 13475987+avail Mem

  PID USER  PR NI VIRT RES SHR S %CPU %MEM  TIME+   COMMAND
17303 root  20  0    0   0   0 S 17.8  0.0 0:03.16 kworker/u40:1
17601 root  20  0    0   0   0 S  9.2  0.0 0:00.47 ll_ost19_004
17642 root  20  0    0   0   0 S  7.9  0.0 0:00.33 ll_ost19_007
16187 root  20  0    0   0   0 R  7.3  0.0 0:07.56 kiblnd_sd_03_00
16192 root  20  0    0   0   0 R  7.3  0.0 0:07.57 kiblnd_sd_08_00
16198 root  20  0    0   0   0 R  7.3  0.0 0:11.14 kiblnd_sd_14_00
16201 root  20  0    0   0   0 R  7.3  0.0 0:07.70 kiblnd_sd_17_00
16632 root  20  0    0   0   0 R  7.3  0.0 0:07.10 mdt03_000
16634 root  20  0    0   0   0 R  7.3  0.0 0:06.95 mdt03_002
16647 root  20  0    0   0   0 R  7.3  0.0 0:07.24 mdt08_000
16649 root  20  0    0   0   0 R  7.3  0.0 0:07.06 mdt08_002

With LNet router:

It's the same test from 10 clients, but the messages go through the LNet router.

nid                   refs state last max rtr min tx min queue
0@lo                     1 NA    -1     0   0   0  0   0     0
192.168.11.35@o2ib10     2 NA    -1     0   0   0  0   0     0
10.0.11.226@o2ib12       1 NA    -1     8   8   8  8  -8     0
192.168.11.36@o2ib10     2 NA    -1     0   0   0  0   0     0
10.12.11.135@o2ib12     13 up    -1     8   8   8  3 -94  3248
192.168.11.40@o2ib10     2 NA    -1     0   0   0  0   0     0
192.168.11.32@o2ib10     2 NA    -1     0   0   0  0   0     0
192.168.11.33@o2ib10     2 NA    -1     0   0   0  0   0     0
192.168.11.37@o2ib10     2 NA    -1     0   0   0  0   0     0
192.168.11.34@o2ib10     2 NA    -1     0   0   0  0   0     0
10.0.11.225@o2ib12       1 NA    -1     8   8   8  8  -7     0
192.168.11.39@o2ib10     2 NA    -1     0   0   0  0   0     0
10.0.11.227@o2ib12       1 NA    -1     8   8   8  8  -8     0
192.168.11.38@o2ib10     2 NA    -1     0   0   0  0   0     0
192.168.11.31@o2ib10     2 NA    -1     0   0   0  0   0     0

All of the traffic then goes into cpt=2, while the other 19 CPTs (19 CPU cores) are idle:

Tasks: 1067 total, 3 running, 1064 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 5.0 sy, 0.0 ni, 94.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 15369398+total, 14010728+free, 12679048 used, 907648 buff/cache
KiB Swap: 11075580 total, 11075580 free, 0 used. 14017849+avail Mem

  PID USER  PR NI VIRT RES SHR S %CPU %MEM  TIME+   COMMAND
13044 root  20  0    0   0   0 R  7.0  0.0 0:01.81 kiblnd_sd_02_00
14146 root  20  0    0   0   0 S  6.6  0.0 0:00.93 mdt02_005
13489 root  20  0    0   0   0 R  6.3  0.0 0:01.37 mdt02_002

It would be good to have better hashing to distribute messages across the CPTs on the server, to improve metadata performance and IOPS when an LNet router is present. |
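To illustrate the mechanism at fault, here is a minimal C sketch of next-hop NID hashing as described above. The names and the hash itself are simplified for illustration; this is not the actual lnet_cpt_of_nid() implementation.

/* Sketch only: CPT chosen by hashing the NID of the immediate
 * next hop, as the description above explains. */
#include <stdint.h>
#include <stdio.h>

static unsigned int cpt_of_nid(uint64_t nid, unsigned int ncpts)
{
	/* any deterministic hash of the NID; the real function differs */
	return (unsigned int)(((nid * 0x9e3779b97f4a7c15ULL) >> 32) % ncpts);
}

int main(void)
{
	unsigned int ncpts = 20;
	uint64_t client, router = 0x0a0c0b87;	/* e.g. 10.12.11.135 */

	/* direct-connected clients: the hash varies per client NID,
	 * so incoming work spreads across CPTs */
	for (client = 0x0a00001f; client <= 0x0a000028; client++)
		printf("client cpt=%u\n", cpt_of_nid(client, ncpts));

	/* routed traffic: every message arrives from the router NID,
	 * so all of it hashes to a single CPT */
	printf("router cpt=%u\n", cpt_of_nid(router, ncpts));
	return 0;
}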
| Comments |
| Comment by Andreas Dilger [ 07/May/21 ] |
|
It would be good to understand what the best way to distribute RPC handling across CPTs is. Should these be evenly distributed to balance load, or is there a benefit to make some affinity based on the OST/MDT target and/or object FID? Does the initial CPT selection also affect later RPC processing? |
| Comment by Shuichi Ihara [ 07/May/21 ] |
|
From what I have observed, threads that are CPT-aware (e.g. mdtXX_YYY, mdt_ioXX_YYY, ll_ostXX_YYY, ptlrpcd_XX_YY, etc., where XX is the CPT ID) rely on the CPT that was initially selected by LNet.

# cat /sys/kernel/debug/lnet/cpu_partition_table
0  : 0
1  : 1
2  : 2
3  : 3
4  : 4
5  : 5
6  : 6
7  : 7
8  : 8
9  : 9
10 : 10
11 : 11
12 : 12
13 : 13
14 : 14
15 : 15
16 : 16
17 : 17
18 : 18
19 : 19

Tasks: 1245 total, 3 running, 1242 sleeping, 0 stopped, 0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us, 99.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu7  :  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 15369398+total, 13857009+free, 12370992 used, 2752888 buff/cache
KiB Swap: 11075580 total, 11075580 free, 0 used. 14048668+avail Mem

  PID USER  PR NI VIRT RES SHR S %CPU %MEM  TIME+   COMMAND
19443 root  20  0    0   0   0 S  7.6  0.0 0:10.23 ptlrpc_hr02_000
19476 root  20  0    0   0   0 R  5.6  0.0 0:15.99 kiblnd_sd_02_00
20577 root  20  0    0   0   0 R  5.6  0.0 0:07.73 mdt02_003
20586 root  20  0    0   0   0 S  5.0  0.0 0:03.68 mdt02_009
20826 root  20  0    0   0   0 S  5.0  0.0 0:02.87 mdt02_058
20813 root  20  0    0   0   0 S  4.0  0.0 0:00.88 mdt02_045
20823 root  20  0    0   0   0 S  4.0  0.0 0:02.90 mdt02_055
20663 root  20  0    0   0   0 S  3.6  0.0 0:06.75 mdt02_017
20667 root  20  0    0   0   0 S  3.6  0.0 0:04.53 mdt02_021
20816 root  20  0    0   0   0 S  3.6  0.0 0:02.77 mdt02_048
20591 root  20  0    0   0   0 S  3.3  0.0 0:00.66 mdt_rdpg02_002
20670 root  20  0    0   0   0 S  3.3  0.0 0:05.15 mdt02_024
20786 root  20  0    0   0   0 S  3.3  0.0 0:00.31 mdt_rdpg02_004
20795 root  20  0    0   0   0 S  3.3  0.0 0:00.29 mdt_rdpg02_013
20814 root  20  0    0   0   0 S  3.3  0.0 0:03.30 mdt02_046
20792 root  20  0    0   0   0 S  3.0  0.0 0:00.28 mdt_rdpg02_010
20250 root  20  0    0   0   0 S  2.6  0.0 0:00.59 ll_ost02_002
20660 root  20  0    0   0   0 S  2.6  0.0 0:02.86 mdt02_014
20822 root  20  0    0   0   0 S  2.6  0.0 0:02.46 mdt02_054
19906 root  20  0    0   0   0 S  2.0  0.0 0:07.22 jbd2/dm-4-8
20585 root  20  0    0   0   0 S  2.0  0.0 0:06.23 mdt02_008
20611 root  20  0    0   0   0 S  2.0  0.0 0:00.06 ll_ost_io02_025
20589 root  20  0    0   0   0 S  1.7  0.0 0:00.10 ll_ost_io02_005
20599 root  20  0    0   0   0 S  1.7  0.0 0:00.06 ll_ost_io02_013
20604 root  20  0    0   0   0 S  1.7  0.0 0:00.05 ll_ost_io02_018
20797 root  20  0    0   0   0 S  1.7  0.0 0:00.07 mdt_rdpg02_015
20587 root  20  0    0   0   0 S  1.3  0.0 0:00.07 ll_ost_io02_003
20593 root  20  0    0   0   0 S  1.3  0.0 0:00.09 ll_ost_io02_007
20600 root  20  0    0   0   0 S  1.3  0.0 0:00.06 ll_ost_io02_014
20602 root  20  0    0   0   0 S  1.3  0.0 0:00.05 ll_ost_io02_016
20606 root  20  0    0   0   0 S  1.3  0.0 0:00.04 ll_ost_io02_020
20610 root  20  0    0   0   0 S  1.3  0.0 0:00.04 ll_ost_io02_024 |
| Comment by Andreas Dilger [ 08/May/21 ] |
|
If the same CPT is used for the entire lifetime of an RPC, this may be a factor in the single-client IO performance limits we've seen in the past. It probably makes sense to base CPT affinity not (only?) on the source NID, but rather on the target MDT/OST and FID (plus object offset/hash?), so that load is distributed across CPTs even if only a single client is doing IO. One option would be for Lustre to register a callback with LNet that maps the RPC to a CPT, so that requests could be directed to a specific core for a particular OST, object, and offset, giving page/cache locality for each object. We already have request_in_callback() that decodes the request, but I think that is only called after the CPT is selected. |
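A hypothetical sketch of what such a FID/offset-based mapping could compute; example_fid and cpt_of_fid are invented names for illustration, and LNet has no such callback hook today.

/* Hypothetical sketch of FID/offset-based CPT selection; not an
 * existing LNet or Lustre interface. */
#include <stdint.h>

struct example_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

static unsigned int cpt_of_fid(const struct example_fid *fid,
			       uint64_t offset, unsigned int ncpts)
{
	uint64_t sum = fid->f_seq + fid->f_oid + fid->f_ver;

	/* mix in the 1GB-aligned file region (mod 256) so streaming
	 * IO to one large object is spread over multiple CPTs */
	sum += (offset >> 30) & 0xff;
	return (unsigned int)(sum % ncpts);
}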
| Comment by Shuichi Ihara [ 10/May/21 ] |
|
If the client sends RPCs to multiple servers, I don't think the client performance limit comes from here.
Let's assume 20 CPTs and 40 clients whose IP addresses are 192.[1-40].1.1:

# for i in `seq 1 40`; do lnetctl cpt-of-nid --nid 192.$i.1.1@tcp9 --ncpt 20; done | grep value | sort | uniq -c
8 value: 0
8 value: 12
8 value: 16
8 value: 4
8 value: 8
The server distributes these NIDs to only 5 CPTs, which means it uses only #CPU x 5/#CPT of the machine in this case. Here is a different case: still 20 CPTs and 40 clients, but the IP address range is 192.1.1.[1-40]:

[root@sky06 ~]# for i in `seq 1 40`; do lnetctl cpt-of-nid --nid 192.1.1.$i@tcp9 --ncpt 20; done | grep value | sort | uniq -c
2 value: 0
2 value: 1
3 value: 10
2 value: 12
3 value: 13
2 value: 14
3 value: 15
1 value: 16
1 value: 17
2 value: 18
2 value: 19
3 value: 2
1 value: 3
2 value: 4
2 value: 5
2 value: 6
1 value: 7
2 value: 8
4 value: 9
In the best case we would expect 2 clients per CPT, but the distribution is quite unbalanced (CPT 11 is not even used). Since there is no chance of CPT re-distribution later in LNet, this is a real limitation: we need to consider which NID goes into which CPT and how to distribute NIDs well across all CPTs. |
| Comment by Andreas Dilger [ 11/May/21 ] |
|
Definitely it seems like the existing hash function does not result in a good distribution of NIDs across CPTs. Since most clusters will have (fairly) linear NID assignments to clients, I'm wondering if it makes sense to have something simple like (sum of NID bytes % ncpts) so that we get a very uniform distribution across CPTs? That would give good results up to 256 CPTs on the server, so a little while yet (I'm not sure if there is some other CPT limit).

Separately, there is a question whether a 1:1 mapping of NID->CPT makes sense at all. Limiting a single-rail client to a single CPT on the server seems like it can cause bottlenecks for clients (we've previously seen reports of e.g. a 3GB/s limit for a single client, no matter what we do to optimize it). The client is working hard to submit IO in parallel from multiple threads/cores (e.g. ptlrpcd and osc.*.max_rpcs_in_flight), so it doesn't make sense to funnel this back into a single CPT on the server again.

Amir, Shuichi, what is the typical number of CPTs and cores per CPT on AI400 and ES18XKE systems for OSTs and MDTs? Maybe it is not such a big deal if only a small number of CPTs is typically configured, or is the "one CPT per core" described here the normal server configuration? Also, it seems uncommon that there would be a single LNet router for a system? The clusters I've seen typically have tens of routers to get good bandwidth.

One option would be to stick with the current initial 1:1 NID->CPT mapping (possibly with a better mapping function), but then implement a "work stealing" mechanism at the ptlrpc level, so that in the heavily loaded case there is good CPU affinity for processing incoming requests, but if clients < CPTs (also the router case) then the work is still distributed across cores. |
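As a concrete illustration of that proposal (a sketch only, not the patch that eventually landed), summing the bytes of an IPv4 NID and taking the result modulo ncpts maps consecutive NIDs round-robin onto the CPTs:

/* Sketch of "(sum of NID bytes % ncpts)" for IPv4 NIDs. */
#include <stdint.h>

static unsigned int cpt_of_nid_sum(uint32_t ipv4_addr, unsigned int ncpts)
{
	unsigned int sum = 0;
	int i;

	for (i = 0; i < 4; i++)
		sum += (ipv4_addr >> (8 * i)) & 0xff;

	return sum % ncpts;
}

/* With 20 CPTs, clients 192.1.1.1 .. 192.1.1.40 map to
 * (192+1+1+1) % 20 = 15, then 16, 17, ... wrapping around all
 * 20 CPTs exactly twice - a perfectly even distribution. */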
| Comment by Shuichi Ihara [ 12/May/21 ] |
|
This is one of the reasons (not covering all cases) why we sometimes saw a long tail in IOR or mdtest at specific client thread counts. Today's CPT setup is a bit awkward. To reduce contention and context switching between CPU cores, one core per CPT was the best configuration in my IOPS and metadata testing on servers with 8 CPU cores (provided all client NIDs hash well across all CPTs). So one CPU core per CPT is good for maximizing CPU utilization with less contention today, but we also need to figure out how many CPU cores (CPTs) can participate in serving a single client when one CPU per CPT is configured on the server. Ideally, more dynamic load balancing (per RPC, per LNet message, or per object?) would be preferable to a 1:1 NID<->CPT mapping... |
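For reference, the CPT layout itself is controlled by libcfs module parameters. An illustrative /etc/modprobe.d fragment for an 8-core server with one core per CPT might look like this (the values are examples only, not a recommended configuration):

# 8 CPTs, one per core (illustrative values)
options libcfs cpu_npartitions=8
# or pin each partition to an explicit core:
# options libcfs cpu_pattern="0[0] 1[1] 2[2] 3[3] 4[4] 5[5] 6[6] 7[7]"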
| Comment by Amir Shehata (Inactive) [ 13/May/21 ] |
|
Just to chime in with some of the details. We have to look at both the send and receive paths regarding CPT distribution.

Send Path

On a call to LNetGet()/LNetPut() we just lock the current CPT, i.e. whichever CPT the core we're currently executing on belongs to. I believe this is determined by the CPT the ptlrpc thread making the LNetGet()/LNetPut() call is bound to. While under that lock we perform local NI and remote NID selection for multi-rail. Local interface selection is subject to multiple criteria: NUMA, credit availability, round robin, etc. Remote NID selection is based on credit availability and round robin. If the final destination is remote, we select a gateway NID using the same remote selection criteria. Once the selection is complete, we calculate a new CPT based on the NID of the selected next-hop destination; that could be the final destination NID or a gateway NID. This can result in switching the CPT lock. The new CPT is then used by the LND to queue the work to the correct LND scheduler thread; the LNDs spawn multiple thread pools, one per CPT partition. If we're always sending to the same final destination NID, or we have a single-interface router as the next hop, then all work will be funnelled to the same CPT. Resources at the LND level are partitioned by CPT, which means we could run into resource restrictions in that case.

Receive Path

When a connection is established, we determine the CPT of the connection based on the NID of the immediate next hop we're receiving the message from, either the original source or a router. All subsequent messages received on that connection are processed by a scheduler bound to that CPT (there can be a pool of them). In the receive path we're subject to the same issue described in this ticket.

Consideration

As a short-term fix, we can always hash the NID of the final destination or the original source of the message. I don't believe there are any drawbacks to doing this, and it's a relatively easy change to make. It will avoid the router-in-the-middle problem we're seeing, but it doesn't completely resolve the problem.

As far as I understand, the reason for hashing the NID is to distribute the workload across different partitions in order to avoid using one lock. The hashing function is strictly an LNet thing; it's implemented in lnet_cpt_of_nid(). I have a fundamental question regarding the decision to go with NID hashing for work distribution. Why was it implemented that way? Is there an advantage to always having messages from/to some peer be processed by the same set of threads locally? Why not change the implementation to simple round-robin selection, such that work is evenly distributed across all CPTs without regard to the NID? Is there an advantage to keeping all messages of a specific RPC locked to the same CPT partition? I know that for NUMA we want to lock the local interface used for the entire RPC duration, to make sure we use the correct interface for the bulk RDMA transfer, but as far as I understand we don't have the same restriction on per-LND messages. |
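The short-term fix described above amounts to changing which NID is fed into the hash. Roughly, as a sketch (the struct and field names are invented for illustration, not actual LNet structures):

/* Sketch of "hash the ends, not the hop"; names are invented. */
#include <stdint.h>

struct msg_example {
	uint64_t msg_src;	/* original source NID */
	uint64_t msg_dst;	/* final destination NID */
	uint64_t msg_next_hop;	/* router NID when routed */
};

static unsigned int hash_nid(uint64_t nid, unsigned int ncpts)
{
	return (unsigned int)(((nid * 0x9e3779b97f4a7c15ULL) >> 32) % ncpts);
}

/* today: the CPT follows the next hop, so one router => one CPT */
static unsigned int cpt_today(const struct msg_example *m, unsigned int n)
{
	return hash_nid(m->msg_next_hop, n);
}

/* proposed: hash the original source on receive and the final
 * destination on send, so routed traffic still spreads across CPTs */
static unsigned int cpt_proposed(const struct msg_example *m,
				 unsigned int n, int receiving)
{
	return hash_nid(receiving ? m->msg_src : m->msg_dst, n);
}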
| Comment by Andreas Dilger [ 13/May/21 ] |
|
On the one hand, keeping RPCs from a single client on one CPT improves CPU cache/object/locking locality if different clients are each working on different objects (the typical file-per-process workload). However, as Shuichi pointed out, the hash function is poor for typical use, and a simple round-robin by NID (sum of NID bytes % ncpt) would likely be more uniform for common configurations.

If the CPT were based on the object FID, then CPU cache locality would likely be the same or better for the many-clients/many-objects case, and significantly improved for the few-clients/many-objects case. To avoid congestion in the many-clients/single-object case, the FID-based CPT mapping should also take the file offset into account (e.g. ((sum of FID bytes + (offset/1GB % 256)) % ncpt)), so that RPCs (and NUMA page cache allocations) are distributed across cores.

However, there are potentially considerations at the LNet level for keeping the peer state on a single CPT (I see lnet_cpt_of_nid() is used when locking), so it may not be possible/practical to change the CPT of a NID arbitrarily. As for the exact details behind this, spelunking in Git history ("git log -S lnet_cpt_of_nid") shows |
| Comment by Andreas Dilger [ 13/May/21 ] |
|
I think the first thing to test here is just a simple change to lnet_cpt_of_nid() to use the "(sum of NID bytes % ncpt)" method instead of the hash function. This will make the CPT distribution flat for consecutive NIDs, and is low risk. It will not help the router/single-client case. Anything beyond that needs an understanding of LNet locking, to determine whether e.g. the "source NID" could be used instead of the router NID, or whether there are specifics of the peer state that depend on the router NID. |
| Comment by Amir Shehata (Inactive) [ 13/May/21 ] |
|
We can test changing lnet_cpt_of_nid(). However, I think it's also worth trying to use the source/final destination NID. Implementation-wise, I think as long as we use a consistent NID for hashing it will work; the locking code/strategy in LNet will not change. We can compare the two approaches and assess which one gives the most benefit. |
| Comment by Shuichi Ihara [ 13/Jun/21 ] |
|
I see. We had an option to disable strict CPT binding with <service>_cpu_bind=0, which was introduced by |
| Comment by Serguei Smirnov [ 16/Jun/21 ] |
|
This has exposed issues in the LND design. Currently in the LND, the CPT is allocated for the connection to the router and doesn't change while the connection is up, even though the connection is handling txs to multiple remote peers.

In other words: cpt_of_nid determines the CPT based on the NID, and the scheduler is assigned to the connection at creation and doesn't change for the duration of the connection's lifetime. In the router case, the same connection to the router is used to send to multiple final destinations, which causes the same scheduler to be used regardless of the final destination.

It is proposed to change the mechanism of scheduler thread selection so that the final destination NID is used to select a CPT, and a scheduler from that CPT's pool, on a per-tx basis; that is, to assign the scheduler at the time the message is sent. Both socklnd and o2iblnd will be redesigned as part of this effort to achieve a more even distribution of load across CPTs. The level of design and development effort is going to be at the level of a feature request. |
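In outline, the proposal moves scheduler selection from connection-creation time to tx-queue time. A rough sketch (all names invented; the real redesign is in the socklnd/o2iblnd patches for this ticket):

/* Sketch: per-tx scheduler selection keyed by final destination NID. */
#include <stdint.h>

#define NCPTS 20

struct sched_example {
	int cpt;
	/* scheduler thread state would live here */
} scheds[NCPTS];

static unsigned int cpt_of_nid64(uint64_t nid)
{
	return (unsigned int)(((nid * 0x9e3779b97f4a7c15ULL) >> 32) % NCPTS);
}

/* old model: scheduler chosen once per connection from the next-hop
 * NID, so every tx on a router connection used the same scheduler */

/* new model: scheduler chosen as each tx is queued, from the tx's
 * final destination, so txs to different peers spread across CPTs */
static struct sched_example *sched_for_tx(uint64_t final_dst_nid)
{
	return &scheds[cpt_of_nid64(final_dst_nid)];
}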
| Comment by Andreas Dilger [ 09/Dec/21 ] |
|
Assuming that the last NID in the above list should be 172.16.171.67 (or 172.16.172.67) instead of 172.16.170.67 twice, the simple "sum of NID bytes mod 20" gives a uniform distribution across the 20 CPTs:

for N in 172.16.167.67 172.16.173.67 172.16.174.67 172.16.175.67 172.16.168.67
172.16.176.67 172.16.177.67 172.16.169.67 172.16.170.67 172.16.171.67; do
echo $(( (${N//./+}) % 20 ));
done | sort -n
2
3
4
5
6
8
9
10
11
12
It wouldn't quite be perfect for 10 CPTs, with one duplicate CPT and one unused CPT, but still better than the current algorithm. |
| Comment by Patrick Farrell [ 09/Dec/21 ] |
|
For what it's worth, I'm sure we can do better, but for example, a perfectly random "choose 10 from 20 without removal" generates 2.1 collisions on average. The result of our hash algorithm in the example above - which I agree is badly written and should probably be replaced - is 3. That's hardly far out: if 2 is the most common result, 3 would happen quite often.

The problem is that what we want is a uniform distribution, where the client NIDs are spread evenly across the CPTs. We use pseudo-random as a proxy for uniform because it's quick to calculate, and for large node counts it works fine in practice. The problem becomes acute when node counts and CPT counts are of similar magnitude.

Andreas, yours generates a uniform distribution because of the modulus and the relatively small range of values, I think. Given random inputs, the modulus would generate a random distribution just mapped into the CPT space. Given ordered inputs, it generates a uniform output. This input is basically ordered - they all differ only in the third octet, the other three octets are the same. (FWIW, you accidentally wrote mod 256 instead of mod 20, but you used 20 in your script.)

I mean, perhaps (NID_value) % (num_CPTs) is a reasonable thing. The issue is that if someone had NIDs that differed by a multiple of num_cpts, they'd all pile up on the same CPT. (Like, literally every single one.) That's presumably unacceptable. I have a thought on this I'll put in a separate comment. |
| Comment by Patrick Farrell [ 09/Dec/21 ] |
|
Here's a simple way to guarantee a uniform distribution that should also be perfectly scalable (i.e., no lookups or anything), but it requires a protocol change, I think.

If each client connection (at the LNet level) is assigned a unique number at the time of connection*, we can just take that number modulo the CPT count to determine the CPT. It would be simple, fast, reliable, and would generate a uniform distribution regardless of what the client NIDs look like. Note the key here - uniform. If we have 10 clients and 20 CPTs, we are guaranteed 1 client per CPT on 10 distinct CPTs. Same if we have 10,000 clients and 20 CPTs - the distribution is perfectly uniform, and mapping a client to a CPT requires a single modulus operation on the number presented by the client.

*(since it's part of the original connection, it's OK that this would require the overhead of managing a global (or per-NIC or whatever) counter - it would only be paid at connection time) |
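A minimal sketch of the counter idea, assuming (as the comment notes) a protocol change to carry the connection number; all names are invented:

/* Sketch: uniform CPT assignment via a per-connection counter. */
#include <stdint.h>

static uint64_t next_conn_id;	/* would be an atomic64 in the kernel */

struct conn_example {
	uint64_t conn_id;	/* assigned once, at connection time */
};

static void conn_assign_id(struct conn_example *c)
{
	c->conn_id = next_conn_id++;
}

static unsigned int cpt_of_conn(const struct conn_example *c,
				unsigned int ncpts)
{
	/* consecutive ids walk the CPTs: 10 clients on 20 CPTs
	 * occupy 10 distinct CPTs - a uniform distribution */
	return (unsigned int)(c->conn_id % ncpts);
}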
| Comment by Andreas Dilger [ 10/Dec/21 ] |
|
I definitely understand that my "simple hash" generates a uniform distribution because of the way the input NIDs change by uniform increments, but my expectation is that this is actually the most common case in an HPC cluster - clients either having sequential NIDs on the same subnet (the single-interface case), or, as in this case, multiple NIDs per client with the same low octet ("node number") but with the second octet on different sequential subnets.

Playing the devil's advocate for a minute - the assignment of a sequential integer to the connection is perfectly uniform only as long as no clients disconnect. After that, depending on the client connection pattern, the distribution may become non-uniform, but it should only be an issue in pathological cases. Even with disconnect/reconnect, each group of new connections would be uniformly distributed across the CPTs, so the only way this could become problematic is if there is some systematic reason clients are disconnecting from one CPT and connecting to a new one. That would typically be attributed to a problem with the CPT (i.e. it is overloaded or something), so it isn't a bad thing if it happens.

The other drawback of the counter is that the CPT selection is non-deterministic (it depends on the order of peer connections and on previous connection history), so in some cases e.g. 8 NIDs would use CPTs 0-7, and then if those nodes reboot and reconnect they will use CPTs 8-15, which may have performance pathologies (NUMA socket affinity vs. NVMe PCI bus, etc.) that could be complex to debug, while the simple hash is at least deterministic for each client.

I'm not against your approach, just bringing up some things to ponder. I suspect that the difference between the two methods is only of interest when client connections <= CPTs, and they will be largely the same in the majority of cases. |
| Comment by Gerrit Updater [ 20/Jan/22 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46233 |
| Comment by Gerrit Updater [ 31/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46233/ |
| Comment by Peter Jones [ 31/Jan/22 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 22/Mar/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50382 |