
CPT bound ptlrpcd's are unimplemented

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.8.0
    • None
    • 3
    • 17711

    Description

      ptlrpcd_select_pc() has the comment:

          #ifdef CFS_CPU_MODE_NUMA
          #warning "fix this code to use new CPU partition APIs"
          #endif

      In our own experimentation on large NUMA systems, we found substantial benefits from confining the existing ptlrpcd's to the NUMA node originating the IO, using the taskset command to set affinity. Unfortunately, this only works for one node at a time.
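
      For reference, the same kind of confinement can be expressed programmatically. The sketch below is not Lustre code; it is a minimal standalone program that pins the calling process to an explicit CPU list, much like taskset -c does, with a hypothetical CPU list standing in for the CPUs of the IO-originating NUMA node.

          /* Minimal sketch: confine the calling process to a CPU list. */
          #define _GNU_SOURCE
          #include <sched.h>
          #include <stdio.h>
          #include <stdlib.h>

          int main(void)
          {
              /* hypothetical CPU list for one NUMA node */
              int node_cpus[] = { 0, 1, 2, 3, 16, 17, 18, 19 };
              cpu_set_t mask;
              size_t i;

              CPU_ZERO(&mask);
              for (i = 0; i < sizeof(node_cpus) / sizeof(node_cpus[0]); i++)
                  CPU_SET(node_cpus[i], &mask);

              if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
                  perror("sched_setaffinity");
                  return EXIT_FAILURE;
              }
              printf("confined to %d CPUs\n", CPU_COUNT(&mask));
              return EXIT_SUCCESS;
          }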

      To obtain the best case for all nodes, we need to create ptlrpcd's confined to each node and to select ptlrpcd's on the same node.

      We plan to submit a patch against master to complete this.

      Attachments

        Issue Links

          Activity

            [LU-6325] CPT bound ptlrpcd's are unimplemented

            thanks Olaf, I will look into it soon.

             liang Liang Zhen (Inactive) added a comment

            I've updated http://review.whamcloud.com/13972 with an implementation of the proposed changes. The new tunable is num_ptlrpcd_partners. The ptlrpcd threads on a CPT are divided into groups of partner threads of size num_ptlrpcd_partners+1. So an alternative name for the tunable would be ptlrpcd_partner_group_size (with settings at +1 compared to the current implementation).

            • the default for num_ptlrpcd_partners is 1, which matches the pair policy that is the current default.
             • if num_ptlrpcd_partners == -1, then all ptlrpcd threads in a CPT will be partners of each other (i.e., a single partner group per CPT).
             • if num_ptlrpcd_partners + 1 is larger than the number of ptlrpcd threads we want to create on a CPT, then no additional ptlrpcd threads are created, and they all go into a single partner group.
             • if the number of ptlrpcd threads for a CPT is larger than, but not a multiple of, num_ptlrpcd_partners + 1, then the number of threads will be increased to make it a multiple.
             olaf Olaf Weber (Inactive) added a comment
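
             For illustration, here is a small standalone sketch of the grouping rules listed above. The helper name and exact rounding in the actual patch may differ; this only shows the arithmetic described in the bullets.

                /* Sketch: partner group size and effective thread count per CPT. */
                #include <stdio.h>

                static void ptlrpcd_layout(int nthreads, int num_partners,
                                           int *group_size, int *effective)
                {
                    if (num_partners == -1) {
                        /* all ptlrpcd threads in the CPT partner each other */
                        *group_size = nthreads;
                        *effective = nthreads;
                        return;
                    }
                    *group_size = num_partners + 1;
                    if (*group_size >= nthreads) {
                        /* one partner group, no extra threads created */
                        *group_size = nthreads;
                        *effective = nthreads;
                    } else {
                        /* round thread count up to a multiple of the group size */
                        *effective = ((nthreads + *group_size - 1) / *group_size) *
                                     *group_size;
                    }
                }

                int main(void)
                {
                    int gs, nt;

                    ptlrpcd_layout(5, 1, &gs, &nt);   /* default: pairs      */
                    printf("5 threads, 1 partner  -> group %d, %d threads\n", gs, nt);
                    ptlrpcd_layout(5, 2, &gs, &nt);   /* 5 not multiple of 3 */
                    printf("5 threads, 2 partners -> group %d, %d threads\n", gs, nt);
                    ptlrpcd_layout(5, -1, &gs, &nt);  /* one group per CPT   */
                    printf("5 threads, -1         -> group %d, %d threads\n", gs, nt);
                    return 0;
                }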

            Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/14049
            Subject: LU-6325 libcfs: easy config for NUMA-match CPT table
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: badb5d5181bd7b4b5628add5117e70536240d5bf

             gerrit Gerrit Updater added a comment

            I don't think it is going to be a problem if threads check from different subsets of the queues. But if that is suspected to be an issue, then we should probably make the partner relation commutative (as it is today) and enforce that if ptlrpcd_max_partners > 0 then max_ptlrpcds == N * ptlrpcd_max_partners for some N, and increase a too-low setting of max_ptlrpcds if necessary.

            The limit on the number of partners is to avoid a thundering herd pounding on the queue, which is a struct ptlrpc_request_set protected by spinlocks. Another way to address this is to have one request queue per partner set (this assumes commutative partnership). Or even have one request queue per CPT and drop partnership as a separate concept. In that case only a single wake_up() call needs to be made when a request is queued. But this will not scale well on a large system when CPTs are not used.
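
             To make the "one request queue per CPT" alternative concrete, here is a minimal userspace sketch (pthreads, not Lustre code, with hypothetical names): all requests go into a single locked queue, and queuing a request wakes exactly one idle worker.

                #include <pthread.h>
                #include <stdio.h>
                #include <stdlib.h>
                #include <unistd.h>

                struct req { struct req *next; int id; };

                static struct req *queue_head;
                static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
                static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;

                static void enqueue(int id)
                {
                    struct req *r = malloc(sizeof(*r));

                    r->id = id;
                    pthread_mutex_lock(&queue_lock);
                    r->next = queue_head;
                    queue_head = r;
                    pthread_mutex_unlock(&queue_lock);
                    /* a single wakeup: any one idle worker may take the request */
                    pthread_cond_signal(&queue_cond);
                }

                static void *worker(void *arg)
                {
                    (void)arg;
                    for (;;) {
                        struct req *r;

                        pthread_mutex_lock(&queue_lock);
                        while (queue_head == NULL)
                            pthread_cond_wait(&queue_cond, &queue_lock);
                        r = queue_head;
                        queue_head = r->next;
                        pthread_mutex_unlock(&queue_lock);
                        printf("handled request %d\n", r->id);
                        free(r);
                    }
                    return NULL;
                }

                int main(void)
                {
                    pthread_t tid[4];
                    int i;

                    for (i = 0; i < 4; i++)
                        pthread_create(&tid[i], NULL, worker, NULL);
                    for (i = 0; i < 8; i++)
                        enqueue(i);
                    sleep(1);  /* let workers drain the queue, then exit */
                    return 0;
                }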

            Easier NUMA configuration would certainly be welcome.

             olaf Olaf Weber (Inactive) added a comment

             I like this way better; it will greatly simplify the code. One thing we should also consider is how to wake partners in ptlrpcd_add_rqset(). Assuming we choose the next 2 threads as partners, ptlrpcd_add_rqset() wakes up thread(N+1) and thread(N+2). If both thread(N+1) and thread(N+2) check the previous two threads, then thread(N+1) will take requests from thread(N-1) and thread(N). This is probably not a problem, or we can enforce the number of partners to be a power of two, so a thread knows which subset it should check from. Which way sounds better to you?
            Anyway, I think we should go ahead based on our current agreement.
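
             As a small illustration of the power-of-two idea (assumed values, not actual patch code): when the partner group size (partners + 1) is a power of two, a thread can derive its group directly from its own index with a mask, so there is no ambiguity about which queues it should check.

                #include <stdio.h>

                int main(void)
                {
                    int group_size = 4;   /* partners + 1, a power of two */
                    int idx;

                    for (idx = 0; idx < 8; idx++) {
                        /* first thread of this thread's group */
                        int base = idx & ~(group_size - 1);

                        printf("thread %d checks threads %d..%d\n",
                               idx, base, base + group_size - 1);
                    }
                    return 0;
                }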

             Another thing: I think we should provide an easy way to ask libcfs to create a CPT table that matches the NUMA topology. With current libcfs, the user has to check the NUMA topology, or set cpu_npartitions to the number of NUMA nodes, both of which are inconvenient. I can work out a patch to improve this.

             liang Liang Zhen (Inactive) added a comment

            Liang, your proposal is reasonable. But since we're considering how this part of the code ought to work, I'd like to try to think this through once more.

            The LOCAL/ROUND policy mechanism queues work for processing by different threads. The partner mechanism then enables idle threads to steal work that was queued for another thread. Limiting the number of partners reduces the number of queues to scan in ptlrpcd_check() and the number of threads to wake in ptlrpc_set_add_new_req().

            It is in this context that pinning a ptlrpcd thread to a specific CPU makes sense, when combined with the code in ptlrpcd_select_pc() that avoids queuing work on the thread for the current CPU. But that other CPU is unlikely to be idle either. Ultimately I'm not convinced that Lustre should attempt this kind of fine-grained load-balancing across CPUs.

            The alternative approach is to limit the binding of ptlrpcd threads to NUMA nodes and/or CPTs. It seems to me that on the systems where we care about these things, we want the CPT boundaries to match the NUMA boundaries anyway. Therefore I'd say that only binding to CPT boundaries should be good enough.

            So I end up with binding ptlrpcd threads to CPTs, but not to individual CPUs. The code in ptlrpcd_select_pc() that avoids "the ptlrpcd for the current CPU" goes away, because there no longer is such a thing. For partner policy, only implement what you called the CPT_PARTNER policy. By default create one ptlrpcd thread per CPU in a CPT, and all threads in the same CPT are its partners.

             If a ptlrpcd_max_partners is implemented, then one thing to consider is how the partner sets are computed. Currently the "partner of" relation is commutative: if A is a partner of B, then B is a partner of A. Setting things up like that is tricky: it means partitioning the ptlrpcd threads for a CPT into a number of cliques, with each clique containing the right number of threads. This feels like more work than we should bother with.

             Since the only point of the partner relation is to define which additional queues a ptlrpcd thread may take work from, I'd suggest having each thread always look at the queues of the next threads (with wrap). For example, if max_ptlrpcds = 5 and ptlrpcd_max_partners = 2, then the partner relation is like this:

               0 -> 1, 2
               1 -> 2, 3
               2 -> 3, 4
               3 -> 4, 0
               4 -> 0, 1
            

             On a system with multiple CPTs this would be the relation between the 5 ptlrpcd threads within each CPT.
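
             A standalone sketch of this wrap-around relation (same numbers as the example above; the variable names are illustrative):

                #include <stdio.h>

                int main(void)
                {
                    int max_ptlrpcds = 5;    /* ptlrpcd threads per CPT */
                    int max_partners = 2;    /* extra queues each thread may scan */
                    int i, k;

                    for (i = 0; i < max_ptlrpcds; i++) {
                        printf("%d ->", i);
                        for (k = 1; k <= max_partners; k++)
                            printf(" %d", (i + k) % max_ptlrpcds);
                        printf("\n");
                    }
                    return 0;
                }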

            For implementation this would mean code much like the current patchset, but in addition:

            • LIOD_BIND and related code is removed (this includes removing the new ptlrpcd_ctl.pc_cpu field).
            • pdb_policy_t and related code is removed (this includes removing the ptlrpcd_bind_policy tunable).
            • pdl_policy_t and related code is removed.
             olaf Olaf Weber (Inactive) added a comment

             Olaf, thanks for explaining; your last comment about removing policies sounds very reasonable to me.
             Also, because it's difficult to ask everybody to test performance before landing it, I think the safe way to implement this feature is to let users configure the system to work the same way it does today, so they can switch back to the old behaviour if the new one doesn't work as well.

             If we think over the current ptlrpcd implementation:

             • the thread index is mapped to a CPU, and the thread is bound to the NUMA node that this CPU belongs to
             • PDB_POLICY_PAIR: half of the threads (threads 1,3,5,7...) have NUMA affinity, the other half are free to run on any CPU. This is the default policy.
             • PDB_POLICY_NEIGHBOR & PDB_POLICY_FULL: all threads have NUMA affinity (probably nobody has a system configured without NUMA); the difference is that PDB_POLICY_FULL has no partners while PDB_POLICY_NEIGHBOR has partner threads.
            • PDB_POLICY_NONE: no affinity, no partner.

             So your second comment should be able to cover most of these cases except PDB_POLICY_PAIR. However, with PDB_POLICY_PAIR, even though some threads have no CPU affinity, we should expect those threads to run on the CPUs of the same NUMA node their partners run on (although this may not always be true in the current implementation). That means it is essentially no different from PDB_POLICY_NEIGHBOR except for the number of partners, so I think your proposal should be able to satisfy all requirements.

             Now let's get back to your patch. Binding a thread to a particular CPU is new behaviour, and this is another reason I'm asking about it. If you think this is a valuable thing to have, consider an example like this:

            socket0: numa[0,1]
            socket1: numa[2,3]
            numa0: cpu[0-3], cpu[16-19]
            numa1: cpu[4-7], cpu[20-23]
            numa2: cpu[8-11], cpu[24-27]
            numa3: cpu[12-15], cpu[28-31]
            core0: cpu[0,16]
            core1: cpu[1,17]
            core2: cpu[2,18]
            ....
            core15: cpu[15,31]
            

             and the user created two CPU partitions which match the socket topology for some reason (just as an example; the better way would be to match the NUMA topology in this case)

            cpu_pattern="0[0-7,16-23] 1[8-15, 24-31]"
            

            then we probably can:

             • create a struct ptlrpcd for each CPT (already in your patch); each ptlrpcd has 16 threads
             • as you described, the number of partners is configurable, with a default of 1.
            • replace all current policies with these partner policies:
               • CPT_PARTNER, number of partners can be 0-15
                • threads in ptlrpcd[0] run on cpu[0-7,16-23]
                 • threads in ptlrpcd[1] run on cpu[8-15,24-31]
               • NUMA_PARTNER (default), number of partners can be 0-7
                • thread[0-7] in ptlrpcd[0] run on cpu[0-3,16-19]
                • threads[8-15] in ptlrpcd[0] run on cpu[4-7,20-23]
                • thread[0-7] in ptlrpcd[1] run on cpu[8-11,24-27]
                • threads[8-15] in ptlrpcd[1] run on cpu[12-15,28-31]
               • CORE_PARTNER, number of partners can be 0 or 1
                • thread[0,1] in ptlrpcd[0] run on cpu[0,16]
                • thread[2,3] in ptlrpcd[0] run on cpu[1,17]
                • thread[4,5] in ptlrpcd[0] run on cpu[2,18]
                • thread[6,7] in ptlrpcd[0] run on cpu[3,19]
                • thread[8,9] in ptlrpcd[0] run on cpu[4,20]
                 • thread[10,11] in ptlrpcd[0] run on cpu[5,21]
                 • thread[12,13] in ptlrpcd[0] run on cpu[6,22]
                 • thread[14,15] in ptlrpcd[0] run on cpu[7,23]
                • thread[0,1] in ptlrpcd[1] run on cpu[8,24]
                • thread[2,3] in ptlrpcd[1] run on cpu[9,25]
                • thread[4,5] in ptlrpcd[1] run on cpu[10,26]
                • thread[6,7] in ptlrpcd[1] run on cpu[11,27]
                • thread[8,9] in ptlrpcd[1] run on cpu[12,28]
                 • thread[10,11] in ptlrpcd[1] run on cpu[13,29]
                 • thread[12,13] in ptlrpcd[1] run on cpu[14,30]
                 • thread[14,15] in ptlrpcd[1] run on cpu[15,31]

            Is this way reasonable to you?

             liang Liang Zhen (Inactive) added a comment
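
             For the NUMA_PARTNER placement in the comment above, the thread-to-node mapping can be written out as a small standalone sketch (the topology and the 16-threads-per-CPT figure are the hypothetical ones from the example):

                #include <stdio.h>

                int main(void)
                {
                    /* CPUs of each NUMA node in the example topology */
                    const char *numa_cpus[4] = {
                        "cpu[0-3,16-19]", "cpu[4-7,20-23]",
                        "cpu[8-11,24-27]", "cpu[12-15,28-31]",
                    };
                    int nthreads = 16;       /* ptlrpcd threads per CPT */
                    int nodes_per_cpt = 2;   /* 2 NUMA nodes per socket */
                    int cpt, t;

                    for (cpt = 0; cpt < 2; cpt++) {
                        for (t = 0; t < nthreads; t++) {
                            int node = cpt * nodes_per_cpt +
                                       t / (nthreads / nodes_per_cpt);

                            printf("ptlrpcd[%d] thread[%2d] -> numa%d %s\n",
                                   cpt, t, node, numa_cpus[node]);
                        }
                    }
                    return 0;
                }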

            To expand on my comment a bit: I think a plausible approach to this would be to remove the various "policy" variables and replace them with a tunable scheme like the following:
             CPT-aware ptlrpcd thread placement: ptlrpcd_max is the maximum number of ptlrpcd threads per CPT (maybe excluding hyperthreads), and ptlrpcd_max_partners is the maximum number of partner threads of each ptlrpcd thread. (So 0 <= ptlrpcd_max_partners < ptlrpcd_max, and partners are always selected from the same CPT, just as ptlrpcd_max is applied on a per-CPT basis.) At this point I think that change still implements the intent of the original changes, but I also think it goes a bit beyond what can be done without more extensive testing and feedback from the various stakeholders.

             olaf Olaf Weber (Inactive) added a comment

            Liang, thanks for the comments.
            To summarize:

            • I can do a cleanup patch to remove one of PDL_POLICY_LOCAL/PDL_POLICY_ROUND.
            • I can also do a cleanup patch to remove PDL_POLICY_PREFERRED and its related infrastructure, in this case the 'index' argument to ptlrpcd_select_pc().
            • There is a followup patch in the series that removes the ptlrpcd_use_cpts tunable.

            With respect to binding ptlrpcd threads to a single core: I worked on a "conservative" assumption that the load-balancing done through PDL_POLICY_LOCAL/ROUND/SAME is supposed to work. Once threads are allowed to wander (whether across the system or just within a single CPT) we might ask whether it makes sense to remove pdl_policy_t and related code completely. In addition, we might want to reconsider what, if anything, pdb_policy_t (PDB_POLICY_NONE/FULL/PAIR/NEIGHBOR) should be doing.

             olaf Olaf Weber (Inactive) added a comment

             Olaf, thanks for working this out.
            I have posted some comments on gerrit, in summary:

             • it's unnecessary to have both PDL_POLICY_LOCAL and PDL_POLICY_ROUND if PDL_POLICY_ROUND can also be limited to threads in the same CPT.
             • I'd suggest removing PDL_POLICY_PREFERRED to simplify the code (or only keeping the definition as a memo but removing the implementation); even in current Lustre versions it is not correctly implemented.
             • it's probably not a good idea to confine a thread to a single core, because the OS scheduler then has no chance to migrate the thread to other cores even if the current core is busy all the time (occupied by softirq, or by the application on the client).
             • would it be possible to just use PDB_POLICY_NONE instead of ptlrpcd_use_cpts?
             liang Liang Zhen (Inactive) added a comment

            sure, I will review them

             liang Liang Zhen (Inactive) added a comment

            People

              Assignee: liang Liang Zhen (Inactive)
              Reporter: schamp Stephen Champion
              Votes: 0
              Watchers: 14
