Details
Type: Task
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.3.0
None
4564
Description
Please post the SMP patches to bugzilla and then push the changes to a branch on review.whamcloud.com/lustre. Keep the branch current as changes are made to the patches during the inspection/landing process, because we'll also be testing this branch at LLNL and possibly other places.
Attachments
Issue Links
- is related to
- LU-14676 Better hash distribution to different CPTs when LNET router is exist (Resolved)
- LUDOC-97 SMP Documentation (Closed)
Trackbacks
- SMP Node Affinity Scope Statement Introduction The following scope statement applies to the SMP Node Affinity Scope Statement project within the SFSDEV001 contract/SOW dates 08/01/2011. Problem Statement Followers of CPU design are observing steady increases in core count....
- Lustre Community Development in Progress Features are being developed for future Lustre releases both at Whamcloud and by other organizations in the Lustre community. These will be eligible for inclusion in future Lustre releases as per our processes
Activity
Cray has been testing (a little) for functionality only. No significant issues to report so far.
Re: the last two comments. Patches have landed. I'll copy an update of the patch list from Richard Henwood here:
c3e985e7e98f41ebf5ecb78887dcd2554601f7ef LU-56 ptlrpc: post rqbd with flag LNET_INS_LOCAL
8bbd62a7c0d2fc48d8f11e78d92bb42809968bba LU-56 ptlrpc: CPT affinity ptlrpc RS handlers
d800fc41a1abdaf7aaf6c0e3e7ddcdec489985a8 LU-56 ptlrpc: partitioned ptlrpc service
b43a6b1800265608cfa18159d4d0d006a1c23015 LU-56 o2iblnd: CPT affinity o2iblnd
82e02a17c0c645a8d156e51b8d8da5eaa68b8f5b LU-56 lnet: re-finalize failed ACK or routed message
1a73553d15b459208cbf7279ea6e5e5a110c632b LU-56 ksocklnd: CPT affinity socklnd
6b4b780895dfdeaca316862fbf1696983608f96d LU-56 lnet: wrong assertion for optimized GET
913c8e22cfc4fc5f52c4f0d6d3f0b4b86a7ac58c LU-56 lnet: tuning wildcard portals rotor
c03783fce46ae0b40db0680388df6e2d6fca5008 LU-56 lnet: SMP improvements for LNet selftest
7b2ab9beae02080797ff2da5105eaddadd67c151 LU-56 ldlm: SMP improvement for ldlm_lock_cancel
07b8db220e48782369f48d86213c5d404a628ded LU-56 ptlrpc: Reduce at_lock dance
c48a869557fe7663f4f3370b130d4c248958180e LU-56 libcfs: CPT affinity workitem scheduler
8a5b8dbda960b155f669c13602504f1233a84c7e LU-56 obdclass: SMP improvement for lu_key
e531dc437c56a08a65de9074a511faa55184712b LU-56 lnet: multiple cleanups for inspection
e069296630240947f1815e505067fd48033909f7 LU-56 lnet: allow user to bind NI on CPTs
a07e9d350b3e500c7be877f6dcf54380b86a9cbe LU-56 lnet: Partitioned LNet networks
5e1957841df3e771f3d72d8ea59180213430bbb9 LU-56 lnet: cleanup for rtrpool and LNet counter
279bbc81e03dc74d273ec12b4d9e703ca94404c4 LU-56 lnet: Partitioned LNet resources (ME/MD/EQ)
582c110231cf06bcd7e5e0b3bdf4f2058e18ebe4 LU-56 ptlrpc: cleanup of ptlrpc_unregister_service
ff0c89a73e141ce019ee2a94e5d01a8a37dd830a LU-56 ptlrpc: svc thread starting/stopping cleanup
25766da50b627648b04549ff3fb55af12acbcb4b LU-56 lnet: reduce stack usage of "match" functions
c7bff5640caff778d4cfca229672a2cc67b350d6 LU-56 lnet: Granulate LNet lock
24564b398f53009521aeda5d653e57fe8b525775 LU-56 ptlrpc: partition data for ptlrpc service
698d3088622b4610a84bd508f2b707a7a2dd1e3e LU-56 lnet: code cleanup for lib-move.c
38fcdd3966da09517ca176b962230b7dae43514c LU-56 lnet: match-table for Portals
f0aa1eef72e7438c2bd4b3eee821fefbc50d1f8e LU-56 lnet: code cleanup for lib-md.c
75a8f4b4aa9ad6bf697aedece539e62111e9029a LU-56 lnet: split lnet_commit_md and cleanup
06093c1f24da938418a0243259b5307c9fc338d5 LU-56 lnet: LNet message event cleanup
2118a8b92cec2df85d1bdbe2e58b389d83fe06b2 LU-56 lnet: eliminate a few locking dance in LNet
51a5b4df5bbbf5fd12c73d2722b230e93fe93327 LU-56 lnet: parse RC ping in event callback
b9bad9bd7d1c3271df916ee62091106e3f3c98b7 LU-56 lnet: router-checker (RC) cleanup
4fcc56be68c8c1667fbd91721d084874a2f05c3e LU-56 ptlrpc: common code to validate nthreads
ed22093b2d569fd0e93f35504580171114bf212d LU-56 lnet: move "match" functions to lib-ptl.c
a096d858b671f28fd4c5e6197b51643cd0780a50 LU-56 lnet: allow to create EQ with zero eq_size
c1366da8f43ecfb98ef3bdcf629eec8a2fc9cd4c LU-56 lnet: cleanup for LNet Event Queue
3211f6862cbbe96642db540e6593f3c614f9528c LU-56 lnet: new internal object lnet_peer_table
7a51ad347960ef2b9d1dfad14644c0bca35b80b6 LU-56 ptlrpc: clean up ptlrpc svc initializing APIs
facf5086667874c405c9ef6ce7f8f737868ffefd LU-56 lnet: container for LNet message
c3a57ec36441c75df03cfbec8f718e053aaad12a LU-56 lnet: abstract container for EQ/ME/MD
4bd9bf53728260d38efc74cac981318fe31280cd LU-56 lnet: add lnet_*_free_locked for LNet
c8da7bfbe0505175869973b25281b152940774b0 LU-56 libcfs: more common APIs in libcfs
b76f327de7836a854f204d28e61de52bc03011b1 LU-56 libcfs: export a few symbols from libcfs
3a92c850b094019e556577ec6cab5907538dcbf5 LU-56 libcfs: NUMA allocator and code cleanup
617e8e1229637908d4cce6725878dd5668960420 LU-56 libcfs: implementation of cpu partition
19ec037c0a9427250b87a69c53beb153d533ab1c LU-56 libcfs: move range expression parser to libcfs
http://review.whamcloud.com/#change,2346
http://review.whamcloud.com/#change,2461
http://review.whamcloud.com/#change,2523
http://review.whamcloud.com/#change,2558
http://review.whamcloud.com/#change,2638
http://review.whamcloud.com/#change,2718
http://review.whamcloud.com/#change,2725
http://review.whamcloud.com/#change,2729
FYI - They're still in review.
- Background of the project
- Tens of standalone patches were already landed while I was at Oracle; all of the following descriptions are about patches not yet landed.
- A fat server is divided into several processing partitions (or cpu-partitions, abbreviated CP). Each partition contains some CPU cores (or NUMA nodes), a memory pool, a message queue, and a thread pool. It is a little like the concept of a virtual machine, although much simpler. This is not a new idea; I just introduce it as a concept to replace what I used to call cpu_node + cpu-affinity, which was confusing because "node" is already used to represent NUMA in the Linux kernel.
- Although we still have a single namespace on the MDS, we can have several virtual processing partitions, and the lifetime of a request should be localized to one processing partition as much as possible, to reduce data/thread migration between CPUs and to reduce lock contention and cacheline conflicts.
- The LND has a message queue and a thread pool for each partition
- LNet has an EQ callback for each partition
- The ptlrpc service has a message queue and a thread pool for each partition
- The number of CPU partitions is configurable (via the new libcfs parameter "cpu_npartitions"); libcfs will also estimate "cpu_npartitions" automatically based on the number of CPUs. With cpu_npartitions=N (N > 1), the Lustre stack can have N standalone processing flows (unless they are contending for the same resource); if the system is configured with cpu_npartitions=1, we have only one partition and Lustre should behave like the current "master".
- The user can also provide a string pattern for the cpu-partitions, e.g. "0[0, 2, 4, 6] 1[1, 3, 5, 7]"; this way Lustre will have two cpu-partitions, the first containing cores [0, 2, 4, 6] and the second containing cores [1, 3, 5, 7]. NB: the numbers inside the brackets can be NUMA IDs as well (an example configuration appears below, after this list of notes).
- The number of CPU partitions is less than the number of cores (or hyper-threads). Because a modern computer can have hundreds of cores or more, a per-core scheduler (thread) pool has some major downsides: a) bad load balance, especially on small/medium-size clusters; b) too many threads overall.
- On clients/OSSes/routers, cpu_npartitions == number of NUMA nodes, and the request/reply dataflow is localized to a NUMA node.
- On the MDS, a CPU partition will contain a few cores (or hyper-threads), e.g. on a system with 4 NUMA nodes and 64 cores we have 8 cpu-partitions.
- We might hash different objects (by FID) to different cpu-partitions in the future; that is an advanced feature we do not have now.
- We can bind an LNet NI to a specified CPU partition for NUMIOA performance.
- These things helped a lot with performance in tests using many target directories, but they do not help shared-directory performance at all; I never saw 20K+ files/sec in shared-directory creation/removal tests (non-zero stripe). With 1.8 the number is < 7K files/sec.
- The new pdirops patch works fine (LU-50); Andreas has already reviewed the prototype, and I have now posted the second version for review. The pdirops patch is different from the old one (the one we had 5 or 6 years ago?): the old version was dynlock based, it had too many inline changes for ext4, and we would probably have needed to change even more to support an N-level htree (large directory), likely changing ext4/namei.c to make it quite like our IAM.
- The new version is htree_lock based; although the implementation of htree_lock is big and complex, it requires only a few inline changes to ext4. htree_lock is more like an embedded component: it can be enabled/disabled easily and needs very few logic changes to current ext4. The patch is big (about 2K lines), but has only about 200 LOC of inline changes (including changes for the N-level htree), and half of those inline changes just add a new parameter to some functions.
- The htree_lock based pdirops patch has the same performance as the IAM dir, but without any interop issues.
- With the pdirops patch on my branch, we can get 65K+ files/sec opencreate+close (narrow-striped files, aggregate performance on the server) in my latest test on toro.
- The MDD transaction scheduler will be dropped entirely. If you still remember it, I mentioned previously that I wanted to add a relatively small thread pool in MDD to take over backend filesystem transactions; the pool size would be moderate, just enough to drive full throughput of metadata operations, and the idea was mainly for shared-directory performance. However, the pdirops patch works very well, so I will drop that idea entirely (there is more detail in my mail to lustre-devel a few weeks ago).
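An example configuration for the notes above (a minimal sketch): the lines below show how the partition count, or an explicit pattern, might be passed to libcfs through modprobe.conf. "cpu_npartitions" is the parameter named above; the pattern parameter name "cpu_pattern" is only an assumption here, since the comment describes the string format but not the parameter that carries it.
# Sketch only: "cpu_pattern" is a hypothetical name for the string-pattern parameter.
# Let libcfs split the node into two CPU partitions on its own:
options libcfs cpu_npartitions=2
# Or spell the partitions out: partition 0 gets cores 0,2,4,6 and partition 1
# gets cores 1,3,5,7 (the bracketed numbers may also be NUMA node IDs):
# options libcfs cpu_pattern="0[0,2,4,6] 1[1,3,5,7]"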
- key component for landing
- Tens of standalone patches
- No dependencies on each other; their sizes range from a few LOC to a few hundred LOC
- Some of them will land on 2.1; all of them will land on 2.2
- pdirop patch + BH LRU size kernel patch
- The pdirop patch is big, but easy to maintain (at least it is easy to port to RHEL6)
- I've sent another mail to Andreas and bzzz to explain why we need to increase the BH LRU size; it will be a small patch
- We would like to land them on 2.2 if we get enough resources or funding for this
- cpu-partition patches
- This is the biggest chunk; it includes several large patches spread over the stack layers (libcfs, LNet, the LNDs, the ptlrpc server side, and some small changes to other modules like mdt and ost)
- The patches have dependencies on each other
- The biggest challenge is inspection of LNet + the LNDs; isaac might give us some help with review, and Lai Siyao will be the other inspector
- If we get funding for this we intend to land them on 2.2; otherwise it is more realistic for us to land them on 2.3
Hi, I think you need this patch: http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=6bdb62e3d7ed206bcaef0cd3499f9cb56dc1fb92
I've synced my branch with master, so you just need to pull it from my branch.
Hi Liang,
Thanks for the quick fix, now I can 'modprobe ptlrpc' without any issue.
But unfortunately I am hitting another problem: when I try to mount the MDT, I get the following error:
[root@berlin27 lustre]# mount -t lustre -o acl,user_xattr /dev/sda9 /mnt/t100full/mdt/0
mount.lustre: mount /dev/sda9 at /mnt/t100full/mdt/0 failed: Invalid argument
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.
[root@berlin27 lustre]# dmesg -c
LDISKFS-fs warning (device sda9): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
LDISKFS-fs (sda9): mounted filesystem with ordered data mode
LDISKFS-fs warning (device sda9): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
LDISKFS-fs (sda9): Unrecognized mount option "64bithash" or missing value
LustreError: 16785:0:(obd_mount.c:1399:server_kernel_mount()) ll_kern_mount failed: rc = -22
LustreError: 16785:0:(obd_mount.c:1680:server_fill_super()) Unable to mount device /dev/sda9: -22
LustreError: 16785:0:(obd_mount.c:2154:lustre_fill_super()) Unable to mount (-22)
Did I miss something? Like a specific e2fsprogs version (we are currently using 1.41.10.sun2)?
I must say that we do not have this error if we start the same MDT with our Lustre 2.0.0.1 rpms installed.
TIA,
Sebastien.
Also, binding LNet NIs to CPU partitions has changed as well... now it should look like "o2ib0:0(ib0), o2ib1:1(ib1)"
Hi, it should be fixed; it was a typo that made my code access cpu_to_node[-1]... it had been there for months but I never hit it on many different kinds of hardware... please pull the fix from git.
btw: I've changed the default value of cpu_mode to 2 for my testing, so please add cpu_mode=0 to modprobe.conf on the OSS or clients.
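For illustration, a minimal modprobe.conf sketch of the two settings mentioned above. It assumes the NI-to-partition binding string goes into the usual lnet "networks" option and that cpu_mode belongs to libcfs; neither placement is spelled out in the comments, so treat both as assumptions.
# Sketch only: option placement is assumed, not confirmed above.
# Bind NI o2ib0 to CPU partition 0 (interface ib0) and o2ib1 to partition 1 (ib1):
options lnet networks="o2ib0:0(ib0), o2ib1:1(ib1)"
# Run with cpu_mode=0 on OSS and client nodes, as requested above:
options libcfs cpu_mode=0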
For information, the problem occurs on the first instruction of the cfs_cpumap_alloc() function, i.e.
LIBCFS_ALLOC_ALIGNED(cpumap, sizeof(cfs_cpumap_t));
SMP node affinity presentation for Lustre team Tech-call