warning "fix this code to use new CPU partition APIs"
endif
In our own experimentation on large NUMA systems, we found substantial benefits from confining the existing ptlrpcds to the NUMA node originating the IO, using the taskset command to set affinity. Unfortunately, this only works for one node at a time.
To obtain the best case for all nodes, we need to create ptlrpcds confined to each node and to select ptlrpcds on the same node as the originating IO.
We plan to submit a patch against master to complete this.
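For illustration, the sketch below shows the kind of per-node confinement this describes, using standard kernel APIs (cpumask_of_node, set_cpus_allowed_ptr); the helper name is hypothetical and this is not the submitted patch.

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/topology.h>

/*
 * Hypothetical helper: confine a ptlrpcd task to the CPUs of one NUMA
 * node, so that IO originating on that node is serviced locally.  This
 * is the kernel-side equivalent of what taskset does from user space,
 * applied to every node rather than just one.
 */
static int ptlrpcd_bind_to_node(struct task_struct *task, int node)
{
        const struct cpumask *mask = cpumask_of_node(node);

        if (cpumask_empty(mask))
                return -ENODEV;

        return set_cpus_allowed_ptr(task, mask);
}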
Issue Links
is related to
LU-6580: Poor read performance with many ptlrpcd threads on the client
Olaf Weber (Inactive)
added a comment - Liang, your proposal is reasonable. But since we're considering how this part of the code ought to work, I'd like to try to think this through once more.
The LOCAL/ROUND policy mechanism queues work for processing by different threads. The partner mechanism then enables idle threads to steal work that was queued for another thread. Limiting the number of partners reduces the number of queues to scan in ptlrpcd_check() and the number of threads to wake in ptlrpc_set_add_new_req().
It is in this context that pinning a ptlrpcd thread to a specific CPU makes sense, when combined with the code in ptlrpcd_select_pc() that avoids queuing work on the thread for the current CPU. But that other CPU is unlikely to be idle either. Ultimately I'm not convinced that Lustre should attempt this kind of fine-grained load-balancing across CPUs.
The alternative approach is to limit the binding of ptlrpcd threads to NUMA nodes and/or CPTs. It seems to me that on the systems where we care about these things, we want the CPT boundaries to match the NUMA boundaries anyway. Therefore I'd say that only binding to CPT boundaries should be good enough.
So I end up with binding ptlrpcd threads to CPTs, but not to individual CPUs. The code in ptlrpcd_select_pc() that avoids "the ptlrpcd for the current CPU" goes away, because there no longer is such a thing. For partner policy, only implement what you called the CPT_PARTNER policy. By default create one ptlrpcd thread per CPU in a CPT, and all threads in the same CPT are its partners.
If a ptlrpcd_max_partners is implemented, then one thing to consider is how the partner sets are computed. Currently the "partner of" relation is symmetric: if A is a partner of B, then B is a partner of A. Setting things up like that is tricky: it means partitioning the ptlrpcd threads of a CPT into cliques, each containing the right number of threads. This feels like more work than we should bother with.
Since the only point of the partner relation is to define which additional queues a ptlrpcd thread may take work from, I'd suggest that threads always look at the queues of the next few threads (with wrap-around). For example, if max_ptlrpcds = 5 and ptlrpcd_max_partners = 2, then the partner relation looks like this:
0 -> 1, 2
1 -> 2, 3
2 -> 3, 4
3 -> 4, 0
4 -> 0, 1
On a system with multiple CPTs, this would be the relation between the ptlrpcd threads within each CPT.
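For reference, a minimal sketch of the wrap-around partner selection described above; the function name and the flat partner array are illustrative, not code from the patch.

/*
 * Non-commutative "next N threads, with wrap" partner relation:
 * thread i may take work from the queues of threads
 * i+1 .. i+max_partners (modulo nthreads) within its CPT.
 * With nthreads = 5 and max_partners = 2 this reproduces the table
 * above: 0 -> 1,2; 1 -> 2,3; ...; 4 -> 0,1.
 */
static void ptlrpcd_fill_partners(int *partners, int i, int nthreads,
                                  int max_partners)
{
        int k;

        for (k = 0; k < max_partners; k++)
                partners[k] = (i + k + 1) % nthreads;
}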
For implementation this would mean code much like the current patchset, but in addition:
LIOD_BIND and related code is removed (this includes removing the new ptlrpcd_ctl.pc_cpu field).
pdb_policy_t and related code is removed (this includes removing the ptlrpcd_bind_policy tunable).
pdl_policy_t and related code is removed.
Liang Zhen (Inactive)
added a comment - Olaf, thanks for explaining; your last comment about removing the policies sounds very reasonable to me.
Also, because it's difficult to ask everybody to test performance before landing it, I think the safe way to implement this feature is to let users configure the system so that it works the same way as today; that way they can switch back to the old behaviour if the new one does not work as well.
If we look at how the current ptlrpcd works:
thread indexes are mapped to CPUs, and each thread is bound to the NUMA node that its CPU belongs to
PDB_POLICY_PAIR: half of the threads (threads 1, 3, 5, 7, ...) have NUMA affinity, while the other half are free to run on any CPU. This is the default policy.
PDB_POLICY_NEIGHBOR & PDB_POLICY_FULL: all threads have NUMA affinity to a NUMA node (probably nobody has a system configured without NUMA); the difference is that PDB_POLICY_FULL has no partners while PDB_POLICY_NEIGHBOR has partner threads.
PDB_POLICY_NONE: no affinity, no partner.
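For reference, a sketch of the existing binding policies as summarized above; this is a paraphrase written from this comment, so the real enum in the ptlrpcd code may differ in values, ordering, and comments.

/* Paraphrase of the current binding policies described above. */
typedef enum {
        PDB_POLICY_NONE,        /* no CPU affinity, no partner threads */
        PDB_POLICY_FULL,        /* all threads NUMA-bound, no partners */
        PDB_POLICY_PAIR,        /* default: odd-numbered threads are
                                 * NUMA-bound, the rest run anywhere */
        PDB_POLICY_NEIGHBOR,    /* all threads NUMA-bound, each with
                                 * partner threads for work stealing */
} pdb_policy_t;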
So your second comment should be able to cover most of these cases, except PDB_POLICY_PAIR. However, with PDB_POLICY_PAIR, even though some threads have no CPU affinity, we should expect them to run on the CPUs of the same NUMA node as their partners (although this may not always be true in the current implementation). That means it is essentially no different from PDB_POLICY_NEIGHBOR except for the number of partners, so I think your proposal should be able to satisfy all requirements.
Now let's get back to your patch: binding a thread to a particular CPU is new behaviour, which is another reason I'm asking about it. If you think this is a valuable thing to have, consider an example like this:
socket0: numa[0,1]
socket1: numa[2,3]
numa0: cpu[0-3], cpu[16-19]
numa1: cpu[4-7], cpu[20-23]
numa2: cpu[8-11], cpu[24-27]
numa3: cpu[12-15], cpu[28-31]
core0: cpu[0,16]
core1: cpu[1,17]
core2: cpu[2,18]
....
core15: cpu[15,31]
and the user created two CPU partitions that match the socket topology for some reason (just as an example; in this case the better way would be to match the NUMA topology)
cpu_pattern="0[0-7,16-23] 1[8-15, 24-31]"
then we probably can:
create a struct ptlrpcd for each CPT (already in your patch); each ptlrpcd has 16 threads
as you described, the number of partners is configurable, with a default of 1.
replace all current policies with these partner policies:
CPT_PARTNER: number of partners can be 0-15
threads in ptlrpcd[0] run on cpu[0-7,16-23]
threads in ptlrpcd[1] run on cpu[8-15,24-31]
NUMA_PARTNER (default): number of partners can be 0-7
threads[0-7] in ptlrpcd[0] run on cpu[0-3,16-19]
threads[8-15] in ptlrpcd[0] run on cpu[4-7,20-23]
threads[0-7] in ptlrpcd[1] run on cpu[8-11,24-27]
threads[8-15] in ptlrpcd[1] run on cpu[12-15,28-31]
CORE_PARTNER: number of partners can be 0 or 1
threads[0,1] in ptlrpcd[0] run on cpu[0,16]
threads[2,3] in ptlrpcd[0] run on cpu[1,17]
threads[4,5] in ptlrpcd[0] run on cpu[2,18]
threads[6,7] in ptlrpcd[0] run on cpu[3,19]
threads[8,9] in ptlrpcd[0] run on cpu[4,20]
threads[10,11] in ptlrpcd[0] run on cpu[5,21]
threads[12,13] in ptlrpcd[0] run on cpu[6,22]
threads[14,15] in ptlrpcd[0] run on cpu[7,23]
threads[0,1] in ptlrpcd[1] run on cpu[8,24]
threads[2,3] in ptlrpcd[1] run on cpu[9,25]
threads[4,5] in ptlrpcd[1] run on cpu[10,26]
threads[6,7] in ptlrpcd[1] run on cpu[11,27]
threads[8,9] in ptlrpcd[1] run on cpu[12,28]
threads[10,11] in ptlrpcd[1] run on cpu[13,29]
threads[12,13] in ptlrpcd[1] run on cpu[14,30]
threads[14,15] in ptlrpcd[1] run on cpu[15,31]
Does this approach look reasonable to you?
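To make the CORE_PARTNER mapping above concrete, here is a sketch of the index arithmetic for this particular 32-CPU example; the helper hard-codes the example layout (8 cores per CPT, hyperthread siblings offset by 16) and is purely illustrative.

/*
 * Example topology: 2 CPTs with 16 ptlrpcd threads each, 16 cores with
 * 2 hyperthreads per core, and siblings offset by 16 (core k = cpu[k, k+16]).
 * Under CORE_PARTNER, threads (2k, 2k+1) of ptlrpcd[cpt] share core
 * cpt * 8 + k, i.e. CPUs { cpt * 8 + k, cpt * 8 + k + 16 }.
 */
static void core_partner_cpus(int cpt, int thread, int *cpu0, int *cpu1)
{
        int core = cpt * 8 + thread / 2;        /* physical core index */

        *cpu0 = core;           /* first hyperthread of the core */
        *cpu1 = core + 16;      /* sibling hyperthread */
}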
Olaf Weber (Inactive)
added a comment - To expand on my comment a bit: I think a plausible approach to this would be to remove the various "policy" variables and replace them with a tunable scheme like the following:
CPT-aware ptlrpcd thread placement: ptlrpcd_max is the maximum number of ptlrpcd threads per CPT (maybe excluding hyperthreads), and ptlrpcd_max_partners is the maximum number of partner threads for each ptlrpcd thread. (So 0 <= ptlrpcd_max_partners < ptlrpcd_max, and partners are always selected from the same CPT, just as ptlrpcd_max is applied on a per-CPT basis.) At this point I think that change still implements the intent of the original changes, but I also think it goes a bit beyond what can be done without more extensive testing and feedback from the various stakeholders.
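A minimal sketch of how the constraint 0 <= ptlrpcd_max_partners < ptlrpcd_max could be enforced per CPT; the tunable names follow this comment, but the clamping helper itself is an assumption, not code from the patch.

/*
 * Clamp ptlrpcd_max_partners so that, within each CPT,
 * 0 <= ptlrpcd_max_partners < ptlrpcd_max (the per-CPT thread count).
 */
static int ptlrpcd_partners_per_cpt(int ptlrpcd_max, int ptlrpcd_max_partners)
{
        if (ptlrpcd_max < 1)
                ptlrpcd_max = 1;
        if (ptlrpcd_max_partners < 0)
                ptlrpcd_max_partners = 0;
        if (ptlrpcd_max_partners >= ptlrpcd_max)
                ptlrpcd_max_partners = ptlrpcd_max - 1;

        return ptlrpcd_max_partners;
}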
Olaf Weber (Inactive)
added a comment - Liang, thanks for the comments.
To summarize:
I can do a cleanup patch to remove one of PDL_POLICY_LOCAL/PDL_POLICY_ROUND.
I can also do a cleanup patch to remove PDL_POLICY_PREFERRED and its related infrastructure, in this case the 'index' argument to ptlrpcd_select_pc().
There is a followup patch in the series that removes the ptlrpcd_use_cpts tunable.
With respect to binding ptlrpcd threads to a single core: I worked on a "conservative" assumption that the load-balancing done through PDL_POLICY_LOCAL/ROUND/SAME is supposed to work. Once threads are allowed to wander (whether across the system or just within a single CPT) we might ask whether it makes sense to remove pdl_policy_t and related code completely. In addition, we might want to reconsider what, if anything, pdb_policy_t (PDB_POLICY_NONE/FULL/PAIR/NEIGHBOR) should be doing.
Liang Zhen (Inactive)
added a comment - Olaf, thanks for working this out.
I have posted some comments on gerrit, in summary:
it's unnecessary to have both PDL_POLICY_LOCAL and PDL_POLICY_ROUND if PDL_POLICY_ROUND can also limit selection to threads in the same CPT.
I'd suggest removing PDL_POLICY_PREFERRED to simplify the code (or only keeping the definition as a memo and removing the implementation); even in current Lustre versions it is not correctly implemented.
it's probably not a good idea to pin a thread to a single core, because the OS scheduler then has no chance to migrate the thread to other cores even when the current core is busy all the time (occupied by softirq, or by an application on the client).
would it be possible to just use PDB_POLICY_NONE instead of ptlrpcd_use_cpts?
Gerrit Updater
added a comment - Olaf Weber (olaf@sgi.com) uploaded a new patch: http://review.whamcloud.com/13973
Subject: LU-6325 ptlrpc: CPT-aware placement of ptlrpc_request_set
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 906a8603a911d670164fd296abff1bea80e7f668
Gerrit Updater
added a comment - Olaf Weber (olaf@sgi.com) uploaded a new patch: http://review.whamcloud.com/13972
Subject: LU-6325 ptlrpc: make ptlrpcd threads cpt-aware
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e30c8a2890af4b252de42655c19286f4c47ddbe4
Olaf Weber (Inactive)
added a comment - The code we developed at SGI has the following highlights:
When CPTs are defined, all ptlrpcd threads are bound to (the CPUs in) a CPT.
On a NUMA system, making sure the CPTs match the NUMA topology is left as an exercise for the sysadmin.
You can specify which CPTs are populated with ptlrpcd threads (if, for whatever reason, you think parts of the system should not be running ptlrpcd threads at all). This is a kernel module parameter.
The max_ptlrpcds tunable doesn't set a global maximum number of ptlrpcd threads, instead it sets the maximum number of ptlrpcd threads per CPT. The use-case we have in mind is wanting to run one ptlrpcd thread per core on a system with hyperthreading enabled.
Bound threads (per the ptlrpcd_bind_policy) are bound to a single CPU.
Partner threads are always in the same CPT.
PDL_POLICY_LOCAL/PDL_POLICY_ROUND select another thread bound to the same CPT, unless there are no ptlrpcd threads in the current CPT at all.
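For illustration, module parameters along the lines described above might be declared as follows; max_ptlrpcds is named in the comment, while ptlrpcd_cpts is a hypothetical name for the parameter that restricts which CPTs run ptlrpcd threads, so neither declaration should be read as the actual SGI patch.

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Maximum number of ptlrpcd threads per CPT. */
static int max_ptlrpcds;
module_param(max_ptlrpcds, int, 0644);
MODULE_PARM_DESC(max_ptlrpcds, "Maximum number of ptlrpcd threads per CPT");

/* Hypothetical: CPT expression selecting which CPTs get ptlrpcd threads. */
static char *ptlrpcd_cpts;
module_param(ptlrpcd_cpts, charp, 0644);
MODULE_PARM_DESC(ptlrpcd_cpts,
                 "CPT expression restricting where ptlrpcd threads are created");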