James, since you asked, a few notes about NUMA.
Reasoning about NUMA is a lot like thinking about a cluster, except you spend a lot of time worrying about cache lines instead of files. (Much of the terminology is also similar, which becomes confusing when discussing NUMA issues for systems that are part of a cluster.) It is worth noting that NUMA considerations already apply once a system has more than one socket. Ideally, the process driving I/O, the memory involved, and the interface involved all live on the same socket.
In practice, the memory placement may have been done by some user space process outside our control, and the same goes for the process that initiates the I/O. Selecting an interface that is close in the system topology is useful, but that already assumes a multi-rail style configuration. Much of the time there is no choice anyway, because there is only one interface. (LNet routers are an exception: there we do have full control over the location of all buffers and threads relative to the interfaces used.)
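To make that concrete, here is a minimal kernel-style sketch of what having that control looks like. This is not LNet code: setup_near_interface and its arguments are invented for this example, but dev_to_node(), kmalloc_node(), and kthread_create_on_node() are standard kernel helpers for placing memory and threads on the NUMA node reported for the interface's device.

#include <linux/device.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/slab.h>

/*
 * Hypothetical example: place an I/O buffer and its worker thread on the
 * NUMA node of the interface's underlying device, so the buffer, the
 * thread, and the NIC all live on the same socket when possible.
 */
static int setup_near_interface(struct device *nic_dev, size_t buf_size,
				int (*worker)(void *buf))
{
	int node = dev_to_node(nic_dev);	/* may be NUMA_NO_NODE */
	struct task_struct *task;
	void *buf;

	/* kmalloc_node() falls back to any node if node == NUMA_NO_NODE */
	buf = kmalloc_node(buf_size, GFP_KERNEL, node);
	if (!buf)
		return -ENOMEM;

	/* the thread's stack and task_struct are allocated on the same node */
	task = kthread_create_on_node(worker, buf, node, "io_worker");
	if (IS_ERR(task)) {
		kfree(buf);
		return PTR_ERR(task);
	}
	wake_up_process(task);

	/* a real implementation would keep buf and task in per-interface state */
	return 0;
}

The same idea applies to router buffer pools: allocate each pool on the node of the interface it feeds, and keep the threads that service it there too.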
So the main concern becomes doing the best we can in the areas we do control, in particular avoiding cache line bouncing. Placing a data structure like ksock_peer in the same CPT as the interface helps here, but only a little. The layout of the ksock_peer structure is actually a good example of what not to do. Take a look at the first few members, which will likely all end up in the same cache line:
typedef struct ksock_peer
{
	struct list_head	ksnp_list;	/* hash table linkage; semi-constant, read by lookups */
	cfs_time_t		ksnp_last_alive;/* updated by threads working on this peer */
	lnet_process_id_t	ksnp_id;	/* peer identity; semi-constant, read by lookups */
	atomic_t		ksnp_refcount;	/* updated by threads working on this peer */
	int			ksnp_sharecount;
	/* ... remaining members ... */
ksnp_list and ksnp_id are semi-constant, and are read by any thread that looks up a peer in the hash table (shared/read access). In contrast, ksnp_refcount and ksnp_last_alive are updated by threads doing work for this particular peer (exclusive/write access). So a lookup of some unrelated peer, which walks the same hash chain and therefore reads ksnp_list, causes a cache line bounce between the CPU doing the lookup and the CPU managing the I/O for this peer. This particular case can be mitigated by being very careful about the layout of a data structure, and by making sure that the threads that modify it run on the same socket, even if that socket is not where the data structure lives.
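To illustrate the layout half of that mitigation, here is a hypothetical regrouping of the same fields. It is a sketch, not an actual patch: the struct name is made up, it reuses the Lustre types from the excerpt above (so it only compiles inside the LNet tree), and the placement of ksnp_sharecount is a guess since it is not discussed here. The semi-constant fields that lookups read stay together, and the write-hot fields are pushed onto their own cache line.

#include <linux/cache.h>	/* ____cacheline_aligned_in_smp */

/* Sketch only: same members as ksock_peer, regrouped by access pattern. */
struct ksock_peer_sketch {
	/* read-mostly: shared by any thread doing a hash-table lookup */
	struct list_head	ksnp_list;
	lnet_process_id_t	ksnp_id;
	int			ksnp_sharecount;	/* assumed read-mostly */

	/*
	 * write-hot: touched only by threads working on this peer;
	 * start a new cache line so lookups of other peers do not
	 * keep pulling it away from the writer.
	 */
	atomic_t		ksnp_refcount ____cacheline_aligned_in_smp;
	cfs_time_t		ksnp_last_alive;
};

The alignment costs some padding per peer, and it only pays off if the writer threads are also kept on one socket, which is the second half of the mitigation above.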
Patches for this work have already landed, and the multi-rail work filled in the remaining gaps.