Please post the SMP patches to bugzilla and then push the changes to a branch on review.whamcloud.com/lustre. Keep the branch current as changes are made to the patches during the inspection/landing process, because we'll also be testing this branch at LLNL and possibly other places.
SMP Node Affinity Scope Statement
Introduction: The following scope statement applies to the SMP Node Affinity project within the SFSDEV001 contract/SOW, dated 08/01/2011.
Problem Statement: Followers of CPU design are observing steady increases in core count....
Liang Zhen (Inactive)
added a comment - Background of the project
I already landed tens of standalone patches while I was at Oracle; everything described below is about patches that have not landed yet.
A fat server is divided into several processing partitions (or CPU partitions, "CP" for short). Each partition contains some CPU cores (or NUMA nodes), a memory pool, a message queue, and a thread pool; it is a little like the concept of a virtual machine, although much simpler. This is not a new idea; I am just introducing the term to replace what I used to call cpu_node + cpu-affinity, which was confusing because "node" is already used to refer to NUMA nodes in the Linux kernel.
Although we still have a single namespace on the MDS, we can have several virtual processing partitions. The lifetime of a request should be kept on one processing partition as much as possible, to reduce data/thread migration between CPUs as well as lock contention and cacheline conflicts:
LND has a message queue and a thread pool for each partition
LNet has an EQ callback for each partition
the ptlrpc service has a message queue and a thread pool for each partition (a rough sketch of the per-partition structure follows below)
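To make the per-partition structure concrete, here is a minimal userspace C sketch; the type and field names (cpu_partition_t, cpt_msgq, and so on) are illustrative only, not the actual libcfs data structures:

#include <pthread.h>

struct msg;                               /* opaque request/message type */

typedef struct cpu_partition {
        unsigned long    cpt_cpumask;     /* cores (or NUMA nodes) owned by this partition */
        pthread_mutex_t  cpt_lock;        /* protects the partition-private queue */
        struct msg      *cpt_msgq;        /* partition-private message queue */
        pthread_t       *cpt_threads;     /* partition-private service thread pool */
        int              cpt_nthreads;    /* threads bound to cpt_cpumask */
} cpu_partition_t;

/* the intent: a request that arrives on partition i is queued, handled and
 * replied to by threads of partition i only, so its data stays warm in that
 * partition's caches and no cross-partition locks are taken on the fast path */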
The number of CPU partitions is configurable (via the new libcfs parameter "cpu_npartitions"); libcfs will also estimate "cpu_npartitions" automatically from the number of CPUs. With cpu_npartitions=N (N > 1), the Lustre stack can have N standalone processing flows (unless they contend on the same resource); with cpu_npartitions=1 there is only one partition and Lustre should behave like the current "master".
The user can also provide a string pattern for the CPU partitions, e.g. "0[0, 2, 4, 6] 1[1, 3, 5, 7]": this gives Lustre two CPU partitions, the first containing cores [0, 2, 4, 6] and the second containing cores [1, 3, 5, 7]. NB: the numbers inside the brackets can be NUMA node IDs as well.
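For illustration, the knobs above might be set from modprobe configuration roughly as follows; "cpu_npartitions" is the parameter named above, while the option used here to pass the pattern string ("cpu_pattern") is an assumption and may be named differently in the actual patches:

# /etc/modprobe.d/lustre.conf (sketch)
options libcfs cpu_npartitions=2
# or an explicit pattern: partition 0 = cores 0,2,4,6 and partition 1 = cores 1,3,5,7
# options libcfs cpu_pattern="0[0,2,4,6] 1[1,3,5,7]"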
The number of CPU partitions is less than the number of cores (or hyper-threads), because a modern computer can have hundreds of cores or more, and a per-core scheduler (thread) pool has two major downsides: a) poor load balance, especially on small/medium clusters, and b) too many threads overall.
On clients/OSS/routers, cpu_npartitions == the number of NUMA nodes, so the dataflow of a request/reply is localized to one NUMA node.
On the MDS, a CPU partition will represent a few cores (or hyper-threads); e.g. on a system with 4 NUMA nodes and 64 cores we have 8 CPU partitions (8 cores per partition, two partitions per NUMA node).
We might hash different objects (by FID) to different CPU partitions in the future; this is an advanced feature that we do not have yet.
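A minimal sketch of what such FID-based routing could look like; the function and argument names are hypothetical, not an existing Lustre interface:

/* hypothetical: pick the CPU partition that will service a given object */
int cpt_of_fid(unsigned long long fid_seq, unsigned int fid_oid, int ncpt)
{
        /* any cheap, stable hash of the FID works; the point is that all
         * operations on one object are steered to the same partition */
        return (int)((fid_seq ^ fid_oid) % (unsigned int)ncpt);
}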
We can bind an LNet NI to a specified CPU partition for NUMIOA performance.
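As an aside, later LNet documentation restricts an NI to specific CPTs with a bracket suffix on the interface; whether these patches already use this exact syntax is an assumption here:

# sketch: bind the o2ib NI on ib0 to CPT 0 only (syntax assumed, check the patch/docs)
options lnet networks="o2ib0(ib0)[0]"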
These things helped a lot with performance in many-target-directory tests, but they do not help shared-directory performance at all: I never saw 20K+ files/sec in shared-directory creation/removal tests (non-zero stripe); with 1.8 the number is < 7K files/sec.
The new pdirops patch works fine (LU-50); Andreas has already reviewed the prototype, and I have now posted the second version for review. The pdirops patch is different from the old one (which we had 5 or 6 years ago?).
The old version is dynlock-based; it has too many inline changes to ext4, and supporting an N-level htree (large directory) would probably require even more, likely changing ext4/namei.c to become quite like our IAM.
The new version is htree_lock-based; although the implementation of htree_lock is big and complex, it requires only a few inline changes to ext4. htree_lock is more like an embedded component: it can be enabled/disabled easily and needs very little logic change in current ext4. The patch is big (about 2K lines), but it has only about 200 LOC of inline changes (including the changes for the N-level htree), and half of those are just adding a new parameter to some functions.
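To show the intent (this is not the real ext4/htree_lock code), here is a small userspace model of the pdirops locking idea: a create takes the directory tree lock in shared mode plus an exclusive lock only on the bucket its name hashes to, so operations on different names proceed in parallel:

#include <pthread.h>

#define PDIR_NBUCKETS 64

/* toy model: one rwlock for the whole htree, one mutex per hash bucket */
struct pdir {
        pthread_rwlock_t tree_lock;
        pthread_mutex_t  bucket_lock[PDIR_NBUCKETS];
};

static unsigned int name_hash(const char *name)
{
        unsigned int h = 5381;
        while (*name)
                h = h * 33 + (unsigned char)*name++;
        return h % PDIR_NBUCKETS;
}

static void pdir_create(struct pdir *d, const char *name)
{
        unsigned int b = name_hash(name);

        pthread_rwlock_rdlock(&d->tree_lock);    /* shared: many creators at once */
        pthread_mutex_lock(&d->bucket_lock[b]);  /* exclusive only on this name's bucket */
        /* ... insert the directory entry ... */
        pthread_mutex_unlock(&d->bucket_lock[b]);
        pthread_rwlock_unlock(&d->tree_lock);
}

/* an operation that reshapes the htree itself (e.g. a node split in a large
 * directory) would instead take tree_lock exclusively with
 * pthread_rwlock_wrlock(&d->tree_lock) and briefly block concurrent creators */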
The htree_lock-based pdirops patch has the same performance as the IAM directory, but without any interop issues.
With the pdirops patch on my branch, we can get 65K+ files/sec open-create+close (narrow striping; aggregate performance on the server) in my latest test on toro.
The MDD transaction scheduler will be dropped entirely. If you still remember it, I previously mentioned that I wanted to add a relatively small thread pool in MDD to take over backend filesystem transactions; the pool size would be moderate, just enough to drive full metadata throughput, and the idea was mainly aimed at shared-directory performance. However, the pdirops patch works very well, so I will drop it completely (there is more detail in my mail to lustre-devel a few weeks ago).
Key components for landing:
Tens of standalone patches
no dependencies on each other; their sizes range from a few LOC to a few hundred LOC
some of them will land in 2.1; all of them will land in 2.2
pdirops patch + BH LRU size kernel patch
the pdirops patch is big, but easy to maintain (at least it is easy to port to RHEL6)
I've sent another mail to Andreas and bzzz explaining why we need to increase the BH LRU size; it will be a small patch
we would like to land these in 2.2 if we get enough resources or funding for this
cpu-partition patches
this is the biggest chunk; it includes several large patches spread across the stack layers (libcfs, LNet, LNDs, the ptlrpc server side, and some small changes to other modules like mdt and ost)
these patches have dependencies on each other
The biggest challenge is the inspection of LNet + the LNDs; Isaac might help us with the review, and Lai Siyao will be the other inspector
if we get funding for this, we intend to land them in 2.2; otherwise it is more realistic for us to land them in 2.3
Liang Zhen (Inactive)
added a comment - Hi, I think you need this patch: http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=6bdb62e3d7ed206bcaef0cd3499f9cb56dc1fb92
I've synced my branch with master, so you just need to pull it from my branch.
Sebastien Buisson (Inactive)
added a comment - Hi Liang,
Thanks for the quick fix, now I can 'modprobe ptlrpc' without any issue.
But unfortunately I am hitting another problem: when I try to mount the MDT, I get the following error:
[root@berlin27 lustre]# mount -t lustre -o acl,user_xattr /dev/sda9 /mnt/t100full/mdt/0
mount.lustre: mount /dev/sda9 at /mnt/t100full/mdt/0 failed: Invalid argument
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.
[root@berlin27 lustre]# dmesg -c
LDISKFS-fs warning (device sda9): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
LDISKFS-fs (sda9): mounted filesystem with ordered data mode
LDISKFS-fs warning (device sda9): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
LDISKFS-fs (sda9): Unrecognized mount option "64bithash" or missing value
LustreError: 16785:0:(obd_mount.c:1399:server_kernel_mount()) ll_kern_mount failed: rc = -22
LustreError: 16785:0:(obd_mount.c:1680:server_fill_super()) Unable to mount device /dev/sda9: -22
LustreError: 16785:0:(obd_mount.c:2154:lustre_fill_super()) Unable to mount (-22)
Did I miss something? Like a specific e2fsprogs version (we are currently using 1.41.10.sun2)?
I must say that we do not have this error if we start the same MDT with our Lustre 2.0.0.1 rpms installed.
TIA,
Sebastien.
Liang Zhen (Inactive)
added a comment - Hi, it should be fixed; it was a typo that made my code access cpu_to_node[-1]. It has been there for months, but I never hit it on many different hardware setups... please pull from git for the fix.
BTW: I've changed the default value of cpu_mode to 2 for my testing, so please add cpu_mode=0 to modprobe.conf on the OSS or clients.
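For context, the failure mode described above (indexing a per-CPU map with -1) has this general shape; the snippet is an illustration of the bug class only, not the actual Lustre source:

#include <stdio.h>

static int cpu_to_node_map[8];           /* toy per-CPU -> NUMA-node map */

static int lookup_cpu(int wanted)        /* hypothetical helper: returns -1 when not found */
{
        return wanted < 8 ? wanted : -1;
}

int main(void)
{
        int cpu = lookup_cpu(64);
        /* int node = cpu_to_node_map[cpu];  BUG: reads cpu_to_node_map[-1] when not found */
        int node = cpu >= 0 ? cpu_to_node_map[cpu] : 0;   /* checked access */
        printf("node=%d\n", node);
        return 0;
}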
Sebastien Buisson (Inactive)
added a comment - For information, the problem occurs on the first instruction of the cfs_cpumap_alloc() function, i.e.
LIBCFS_ALLOC_ALIGNED(cpumap, sizeof(cfs_cpumap_t));
Sebastien Buisson (Inactive)
added a comment - Yes, after small adjustments to the patch, I see that ext4_pdirop.patch is applied at compile time in order to build ldiskfs.
I was hoping we could at least run some obdfilter-survey tests, in order to have nice figures to present at LUG.
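For reference, an obdfilter-survey run against local OST devices typically looks something like the line below (from lustre-iokit; the parameter values are illustrative and should be checked against the lustre-iokit documentation):

nobjhi=2 thrhi=64 size=1024 case=disk sh obdfilter-survey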
Liang Zhen (Inactive)
added a comment - (edited) Sebastien, sorry, I have never seen this and in fact I never tried my branch on 2.6.32, but I will look into this issue very soon. BTW, are you able to apply the new kernel patch (ext4_pdirop.patch) to 2.6.32?
Re: the last two comments. Patches have landed. I'll copy an update of the patch list from Richard Henwood here:
c3e985e7e98f41ebf5ecb78887dcd2554601f7ef  LU-56 ptlrpc: post rqbd with flag LNET_INS_LOCAL
8bbd62a7c0d2fc48d8f11e78d92bb42809968bba  LU-56 ptlrpc: CPT affinity ptlrpc RS handlers
d800fc41a1abdaf7aaf6c0e3e7ddcdec489985a8  LU-56 ptlrpc: partitioned ptlrpc service
b43a6b1800265608cfa18159d4d0d006a1c23015  LU-56 o2iblnd: CPT affinity o2iblnd
82e02a17c0c645a8d156e51b8d8da5eaa68b8f5b  LU-56 lnet: re-finalize failed ACK or routed message
1a73553d15b459208cbf7279ea6e5e5a110c632b  LU-56 ksocklnd: CPT affinity socklnd
6b4b780895dfdeaca316862fbf1696983608f96d  LU-56 lnet: wrong assertion for optimized GET
913c8e22cfc4fc5f52c4f0d6d3f0b4b86a7ac58c  LU-56 lnet: tuning wildcard portals rotor
c03783fce46ae0b40db0680388df6e2d6fca5008  LU-56 lnet: SMP improvements for LNet selftest
7b2ab9beae02080797ff2da5105eaddadd67c151  LU-56 ldlm: SMP improvement for ldlm_lock_cancel
07b8db220e48782369f48d86213c5d404a628ded  LU-56 ptlrpc: Reduce at_lock dance
c48a869557fe7663f4f3370b130d4c248958180e  LU-56 libcfs: CPT affinity workitem scheduler
8a5b8dbda960b155f669c13602504f1233a84c7e  LU-56 obdclass: SMP improvement for lu_key
e531dc437c56a08a65de9074a511faa55184712b  LU-56 lnet: multiple cleanups for inspection
e069296630240947f1815e505067fd48033909f7  LU-56 lnet: allow user to bind NI on CPTs
a07e9d350b3e500c7be877f6dcf54380b86a9cbe  LU-56 lnet: Partitioned LNet networks
5e1957841df3e771f3d72d8ea59180213430bbb9  LU-56 lnet: cleanup for rtrpool and LNet counter
279bbc81e03dc74d273ec12b4d9e703ca94404c4  LU-56 lnet: Partitioned LNet resources (ME/MD/EQ)
582c110231cf06bcd7e5e0b3bdf4f2058e18ebe4  LU-56 ptlrpc: cleanup of ptlrpc_unregister_service
ff0c89a73e141ce019ee2a94e5d01a8a37dd830a  LU-56 ptlrpc: svc thread starting/stopping cleanup
25766da50b627648b04549ff3fb55af12acbcb4b  LU-56 lnet: reduce stack usage of "match" functions
c7bff5640caff778d4cfca229672a2cc67b350d6  LU-56 lnet: Granulate LNet lock
24564b398f53009521aeda5d653e57fe8b525775  LU-56 ptlrpc: partition data for ptlrpc service
698d3088622b4610a84bd508f2b707a7a2dd1e3e  LU-56 lnet: code cleanup for lib-move.c
38fcdd3966da09517ca176b962230b7dae43514c  LU-56 lnet: match-table for Portals
f0aa1eef72e7438c2bd4b3eee821fefbc50d1f8e  LU-56 lnet: code cleanup for lib-md.c
75a8f4b4aa9ad6bf697aedece539e62111e9029a  LU-56 lnet: split lnet_commit_md and cleanup
06093c1f24da938418a0243259b5307c9fc338d5  LU-56 lnet: LNet message event cleanup
2118a8b92cec2df85d1bdbe2e58b389d83fe06b2  LU-56 lnet: eliminate a few locking dance in LNet
51a5b4df5bbbf5fd12c73d2722b230e93fe93327  LU-56 lnet: parse RC ping in event callback
b9bad9bd7d1c3271df916ee62091106e3f3c98b7  LU-56 lnet: router-checker (RC) cleanup
4fcc56be68c8c1667fbd91721d084874a2f05c3e  LU-56 ptlrpc: common code to validate nthreads
ed22093b2d569fd0e93f35504580171114bf212d  LU-56 lnet: move "match" functions to lib-ptl.c
a096d858b671f28fd4c5e6197b51643cd0780a50  LU-56 lnet: allow to create EQ with zero eq_size
c1366da8f43ecfb98ef3bdcf629eec8a2fc9cd4c  LU-56 lnet: cleanup for LNet Event Queue
3211f6862cbbe96642db540e6593f3c614f9528c  LU-56 lnet: new internal object lnet_peer_table
7a51ad347960ef2b9d1dfad14644c0bca35b80b6  LU-56 ptlrpc: clean up ptlrpc svc initializing APIs
facf5086667874c405c9ef6ce7f8f737868ffefd  LU-56 lnet: container for LNet message
c3a57ec36441c75df03cfbec8f718e053aaad12a  LU-56 lnet: abstract container for EQ/ME/MD
4bd9bf53728260d38efc74cac981318fe31280cd  LU-56 lnet: add lnet_*_free_locked for LNet
c8da7bfbe0505175869973b25281b152940774b0  LU-56 libcfs: more common APIs in libcfs
b76f327de7836a854f204d28e61de52bc03011b1  LU-56 libcfs: export a few symbols from libcfs
3a92c850b094019e556577ec6cab5907538dcbf5  LU-56 libcfs: NUMA allocator and code cleanup
617e8e1229637908d4cce6725878dd5668960420  LU-56 libcfs: implementation of cpu partition
19ec037c0a9427250b87a69c53beb153d533ab1c  LU-56 libcfs: move range expression parser to libcfs