Liang, your proposal is reasonable. But since we're considering how this part of the code ought to work, I'd like to try to think this through once more.
The LOCAL/ROUND policy mechanism queues work for processing by different threads. The partner mechanism then enables idle threads to steal work that was queued for another thread. Limiting the number of partners reduces the number of queues to scan in ptlrpcd_check() and the number of threads to wake in ptlrpc_set_add_new_req().
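To make the cost argument concrete, the scan an idle ptlrpcd thread performs can be sketched as below. The struct layout and names here are illustrative stand-ins, not the real ptlrpcd_ctl, but they show why the partner count directly bounds the work done per wakeup:

```c
#include <stddef.h>

/* Illustrative stand-in for a ptlrpcd thread's state; the real
 * ptlrpcd_ctl is more involved. */
struct pc_sketch {
	int			 pc_pending;	 /* requests queued here */
	struct pc_sketch	*pc_partners[8]; /* threads we may steal from */
	int			 pc_npartners;
};

/* Return the queue to take work from, or NULL if everything is idle.
 * Own queue first, then each partner's queue: the number of partners
 * bounds the length of this scan, which is the cost being limited. */
static struct pc_sketch *scan_for_work(struct pc_sketch *pc)
{
	if (pc->pc_pending > 0)
		return pc;
	for (int i = 0; i < pc->pc_npartners; i++)
		if (pc->pc_partners[i]->pc_pending > 0)
			return pc->pc_partners[i];
	return NULL;
}
```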
It is in this context that pinning a ptlrpcd thread to a specific CPU makes sense, when combined with the code in ptlrpcd_select_pc() that avoids queuing work on the thread for the current CPU. But that other CPU is unlikely to be idle either. Ultimately I'm not convinced that Lustre should attempt this kind of fine-grained load-balancing across CPUs.
The alternative approach is to limit the binding of ptlrpcd threads to NUMA nodes and/or CPTs. It seems to me that on the systems where we care about these things, we want the CPT boundaries to match the NUMA boundaries anyway. Therefore I'd say that only binding to CPT boundaries should be good enough.
So I end up with binding ptlrpcd threads to CPTs, but not to individual CPUs. The code in ptlrpcd_select_pc() that avoids "the ptlrpcd for the current CPU" goes away, because there no longer is such a thing. For partner policy, only implement what you called the CPT_PARTNER policy. By default, create one ptlrpcd thread per CPU in a CPT, with every thread in a CPT being a partner of every other thread in that CPT.
If a ptlrpcd_max_partners tunable is implemented, then one thing to consider is how the partner sets are computed. Currently the "partner of" relation is commutative: if A is a partner of B, then B is a partner of A. Setting things up like that is tricky: it means partitioning the ptlrpcd threads for a CPT into a number of cliques, with each clique containing the right number of threads. This feels like more work than we should bother with.
Since the only point of the partner relation is to define which additional queues a ptlrpcd thread may take work from, I'd suggest that each thread always look at the queues of the next ptlrpcd_max_partners threads (with wrap-around). For example, if max_ptlrpcds = 5 and ptlrpcd_max_partners = 2, then the partner relation is like this:

  thread 0: threads 1, 2
  thread 1: threads 2, 3
  thread 2: threads 3, 4
  thread 3: threads 4, 0
  thread 4: threads 0, 1
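Taken literally, the wrap rule reduces to simple modular arithmetic. A minimal sketch (partner_index is an illustrative name, not from the patchset):

```c
#include <assert.h>

/* Index of thread i's j-th partner among nthreads ptlrpcd threads
 * in a CPT, following the "next threads with wrap" rule: thread i
 * takes extra work from threads i+1 .. i+max_partners, mod nthreads. */
static int partner_index(int i, int j, int nthreads)
{
	return (i + 1 + j) % nthreads;
}
```

With max_ptlrpcds = 5 and ptlrpcd_max_partners = 2, thread 3's partners come out as threads 4 and 0, and thread 4 wraps to threads 0 and 1. Note the relation is deliberately not commutative: thread 0 scans threads 1 and 2, but thread 1 scans threads 2 and 3, which is what lets us skip the clique construction entirely.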
On a system with multiple CPTs this would be the relation between the ptlrpcd threads within each CPT.
For implementation this would mean code much like the current patchset, but in addition:
- LIOD_BIND and related code is removed (this includes removing the new ptlrpcd_ctl.pc_cpu field).
- pdb_policy_t and related code is removed (this includes removing the ptlrpcd_bind_policy tunable).
- pdl_policy_t and related code is removed.
Landed for 2.8