Please post the SMP patches to bugzilla and then push the changes to a branch on review.whamcloud.com/lustre. Keep the branch current as changes are made to the patches during the inspection/landing process, because we'll also be testing this branch at LLNL and possibly other places.
Liang Zhen (Inactive)
added a comment -
Background of the project
I already landed tens of standalone patches while I was at Oracle; all of the following descriptions are about patches that have not landed yet.
A fat server is divided into several processing partitions (or cpu-partitions, CP for short). Each partition contains some CPU cores (or NUMA nodes), a memory pool, a message queue, and a thread pool; it is a little like the concept of a virtual machine, although much simpler. This is not a new idea, I am just introducing the term to replace what I used to call cpu_node + cpu-affinity, which was confusing because "node" is already used for NUMA in the Linux kernel.
Although we still have a single namespace on the MDS, we can have several virtual processing partitions. The lifetime of a request should be kept local to one processing partition as much as possible, to reduce data/thread migration between CPUs and to reduce lock contention and cacheline conflicts.
LND has a message queue and a thread pool for each partition
LNet has an EQ callback for each partition
The ptlrpc service has a message queue and a thread pool for each partition
The number of CPU partitions is configurable (via the new libcfs parameter "cpu_npartitions"); libcfs will also estimate "cpu_npartitions" automatically based on the number of CPUs. With cpu_npartitions=N (N > 1), the Lustre stack can have N standalone processing flows (unless they contend on the same resource); with cpu_npartitions=1 we have only one partition and Lustre should behave like the current master.
The user can also provide a string pattern for the cpu-partitions, e.g. "0[0, 2, 4, 6] 1[1, 3, 5, 7]": this gives Lustre two cpu-partitions, the first containing cores [0, 2, 4, 6] and the second containing cores [1, 3, 5, 7]. NB: the numbers inside the brackets can also be NUMA node IDs. (A hypothetical example of both configuration styles is shown after this list.)
The number of cpu partitions should be smaller than the number of cores (or hyper-threads): modern machines can have hundreds of cores or more, and a per-core scheduler (thread) pool has some major downsides: a) poor load balance, especially on small/medium clusters; b) too many threads overall
On clients/OSS/routers, cpu_npartitions == number of NUMA nodes, so the dataflow of a request/reply is localized to a NUMA node
On the MDS, a cpu partition will contain only a few cores (or hyper-threads), e.g. on a system with 4 NUMA nodes and 64 cores we would have 8 cpu-partitions
We might hash different objects (by FID) to different cpu-partitions in the future; it is an advanced feature which we do not have now
We can bind an LNet NI to a specified cpu partition for NUMIOA performance
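As a purely illustrative example of the two configuration styles above, module options could look something like the following. Only "cpu_npartitions" is named in this description; the pattern option name and exact syntax here are my assumptions, not a final interface:

    # let libcfs carve N partitions itself (hypothetical value)
    options libcfs cpu_npartitions=4
    # or spell the layout out explicitly: partition 0 = cores 0,2,4,6; partition 1 = cores 1,3,5,7
    options libcfs cpu_pattern="0[0,2,4,6] 1[1,3,5,7]"

With the first form libcfs chooses which cores go into which partition; with the second the administrator fixes the layout (the bracketed numbers could equally be NUMA node IDs, as noted above).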
These things helped a lot with the performance of tests using many target directories, but they do not help shared-directory performance at all: I never saw 20K+ files/sec in shared-directory creation/removal tests (non-zero stripe), and with 1.8 the number is < 7K files/sec.
The new pdirops patch works fine (LU-50). Andreas has already reviewed the prototype, and I have now posted the second version for review. This pdirops patch is different from the old one (from 5 or 6 years ago?).
The old version is dynlock based; it has too many inline changes to ext4, and we would probably need to change even more to support an N-level htree (large directory), probably changing ext4/namei.c until it looks quite like our IAM.
The new version is htree_lock based. Although the htree_lock implementation itself is big and complex, it requires only a few inline changes to ext4; htree_lock is more like an embedded component that can be enabled/disabled easily and needs very few logic changes in current ext4. The patch is big (about 2K lines), but it has only about 200 LOC of inline changes (including the changes for the N-level htree), and half of those inline changes are just adding a new parameter to some functions. (A simplified sketch of the locking idea follows these pdirops notes.)
The htree_lock based pdirops patch has the same performance as an IAM dir, but without any interoperability issue
With the pdirops patch on my branch, I got 65K+ files/sec open-create+close (narrow-striped files, aggregate performance on the server) in my latest test on toro
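To make the htree_lock idea above more concrete, here is a minimal userspace sketch, entirely my own simplification rather than code from the patch: a normal directory operation takes the whole-tree lock in shared mode and then an exclusive lock only on the hash bucket of the target name, so creates/unlinks in different buckets of the same directory can run in parallel, while an htree split still takes the tree lock exclusively. All names and sizes below are invented for illustration.

/* pdir_sketch.c -- simplified illustration of bucket-granularity directory locking */
#include <pthread.h>
#include <stdio.h>

#define PDIR_NBUCKETS 128               /* invented bucket count */

struct pdir {
	pthread_rwlock_t tree_lock;                  /* whole-htree lock */
	pthread_mutex_t  bucket_lock[PDIR_NBUCKETS]; /* per-hash-bucket locks */
};

static unsigned pdir_hash(const char *name)
{
	unsigned h = 5381;
	while (*name != '\0')
		h = h * 33 + (unsigned char)*name++;
	return h % PDIR_NBUCKETS;
}

static void pdir_init(struct pdir *dir)
{
	int i;

	pthread_rwlock_init(&dir->tree_lock, NULL);
	for (i = 0; i < PDIR_NBUCKETS; i++)
		pthread_mutex_init(&dir->bucket_lock[i], NULL);
}

/* create/unlink style operation: shared tree lock + exclusive bucket lock */
static void pdir_modify(struct pdir *dir, const char *name)
{
	unsigned b = pdir_hash(name);

	pthread_rwlock_rdlock(&dir->tree_lock);   /* lets other buckets proceed */
	pthread_mutex_lock(&dir->bucket_lock[b]); /* serializes only this bucket */
	printf("modify '%s' in bucket %u\n", name, b);
	pthread_mutex_unlock(&dir->bucket_lock[b]);
	pthread_rwlock_unlock(&dir->tree_lock);
}

/* htree split/grow: takes the whole tree exclusively */
static void pdir_split(struct pdir *dir)
{
	pthread_rwlock_wrlock(&dir->tree_lock);
	printf("split htree\n");
	pthread_rwlock_unlock(&dir->tree_lock);
}

int main(void)
{
	struct pdir dir;

	pdir_init(&dir);
	pdir_modify(&dir, "file-a");   /* from different threads, these only block */
	pdir_modify(&dir, "file-b");   /* each other if the names hash to one bucket */
	pdir_split(&dir);
	return 0;
}

The real patch additionally has to handle lock ordering across htree splits, the N-level tree, and readers, but this bucket-granularity locking is the basic reason it can match IAM-dir performance without the interoperability issue mentioned above.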
The MDD transaction scheduler will be dropped entirely. If you still remember it, I previously mentioned that I wanted to add a relatively small thread pool in MDD to take over backend filesystem transactions; the pool size would be moderate, just enough to drive full metadata throughput, and the idea was mainly for shared-directory performance. However, the pdirops patch works very well, so I will drop it completely (there is more detail in my mail to lustre-devel a few weeks ago).
Key components for landing
Tens of standalone patches
They have no dependencies on each other; their sizes range from a few LOC to a few hundred LOC
Some of them will land in 2.1; all of them will land in 2.2
pdirops patch + BH LRU size kernel patch
The pdirops patch is big, but easy to maintain (at least it is easy to port to RHEL6)
I've sent another mail to Andreas and bzzz explaining why we need to increase the BH LRU size; it will be a small patch
We would like to land them in 2.2 if we get enough resources or funding for this
cpu-partition patches
This is the biggest chunk; it includes several large patches spread across the stack layers (libcfs, LNet, LNDs, the ptlrpc server side, and some small changes to other modules such as mdt and ost)
The patches have dependencies on each other
The biggest challenge is the inspection of LNet + the LNDs; Isaac might give us some help with the review, and Lai Siyao will be the other inspector
If we get funding for this we intend to land them in 2.2; otherwise it is more realistic to land them in 2.3
Liang Zhen (Inactive)
added a comment - Hi, I think you need this patch: http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=6bdb62e3d7ed206bcaef0cd3499f9cb56dc1fb92
I've synced my branch with master, so you just need to pull it from my branch.
a patched mdtest which parses an expression for multi-mount