Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.3.0
    • Affects Version/s: Lustre 2.3.0
    • Labels: None
    • 4564

    Description

      Please post the SMP patches to bugzilla and then push the changes to a branch on review.whamcloud.com/lustre. Keep the branch current as changes are made to the patches during the inspection/landing process, because we'll also be testing this branch at LLNL and possibly other places.

      Attachments

      Issue Links

      Activity

            [LU-56] Finish SMP scalability work

            liang Liang Zhen (Inactive) added a comment - patched mdtest which parses expression for multi-mount

            jlevi Jodi Levi (Inactive) added a comment - Please let me know if more work is needed for this and I will reopen the ticket.

            liang Liang Zhen (Inactive) added a comment - SMP node affinity OpenSFS demonstration

            liang Liang Zhen (Inactive) added a comment - SMP node affinity presentation for Lustre team Tech-call

            spitzcor Cory Spitz added a comment -

            Cray has been testing (a little) for functionality only. No significant issues to report so far.

            spitzcor Cory Spitz added a comment -

            Re: the last two comments. Patches have landed. I'll copy an update of the patch list from Richard Henwood here:

            c3e985e7e98f41ebf5ecb78887dcd2554601f7ef LU-56 ptlrpc: post rqbd with flag LNET_INS_LOCAL
            8bbd62a7c0d2fc48d8f11e78d92bb42809968bba LU-56 ptlrpc: CPT affinity ptlrpc RS handlers
            d800fc41a1abdaf7aaf6c0e3e7ddcdec489985a8 LU-56 ptlrpc: partitioned ptlrpc service
            b43a6b1800265608cfa18159d4d0d006a1c23015 LU-56 o2iblnd: CPT affinity o2iblnd
            82e02a17c0c645a8d156e51b8d8da5eaa68b8f5b LU-56 lnet: re-finalize failed ACK or routed message
            1a73553d15b459208cbf7279ea6e5e5a110c632b LU-56 ksocklnd: CPT affinity socklnd
            6b4b780895dfdeaca316862fbf1696983608f96d LU-56 lnet: wrong assertion for optimized GET
            913c8e22cfc4fc5f52c4f0d6d3f0b4b86a7ac58c LU-56 lnet: tuning wildcard portals rotor
            c03783fce46ae0b40db0680388df6e2d6fca5008 LU-56 lnet: SMP improvements for LNet selftest
            7b2ab9beae02080797ff2da5105eaddadd67c151 LU-56 ldlm: SMP improvement for ldlm_lock_cancel
            07b8db220e48782369f48d86213c5d404a628ded LU-56 ptlrpc: Reduce at_lock dance
            c48a869557fe7663f4f3370b130d4c248958180e LU-56 libcfs: CPT affinity workitem scheduler
            8a5b8dbda960b155f669c13602504f1233a84c7e LU-56 obdclass: SMP improvement for lu_key
            e531dc437c56a08a65de9074a511faa55184712b LU-56 lnet: multiple cleanups for inspection
            e069296630240947f1815e505067fd48033909f7 LU-56 lnet: allow user to bind NI on CPTs
            a07e9d350b3e500c7be877f6dcf54380b86a9cbe LU-56 lnet: Partitioned LNet networks
            5e1957841df3e771f3d72d8ea59180213430bbb9 LU-56 lnet: cleanup for rtrpool and LNet counter
            279bbc81e03dc74d273ec12b4d9e703ca94404c4 LU-56 lnet: Partitioned LNet resources (ME/MD/EQ)
            582c110231cf06bcd7e5e0b3bdf4f2058e18ebe4 LU-56 ptlrpc: cleanup of ptlrpc_unregister_service
            ff0c89a73e141ce019ee2a94e5d01a8a37dd830a LU-56 ptlrpc: svc thread starting/stopping cleanup
            25766da50b627648b04549ff3fb55af12acbcb4b LU-56 lnet: reduce stack usage of "match" functions
            c7bff5640caff778d4cfca229672a2cc67b350d6 LU-56 lnet: Granulate LNet lock
            24564b398f53009521aeda5d653e57fe8b525775 LU-56 ptlrpc: partition data for ptlrpc service
            698d3088622b4610a84bd508f2b707a7a2dd1e3e LU-56 lnet: code cleanup for lib-move.c
            38fcdd3966da09517ca176b962230b7dae43514c LU-56 lnet: match-table for Portals
            f0aa1eef72e7438c2bd4b3eee821fefbc50d1f8e LU-56 lnet: code cleanup for lib-md.c
            75a8f4b4aa9ad6bf697aedece539e62111e9029a LU-56 lnet: split lnet_commit_md and cleanup
            06093c1f24da938418a0243259b5307c9fc338d5 LU-56 lnet: LNet message event cleanup
            2118a8b92cec2df85d1bdbe2e58b389d83fe06b2 LU-56 lnet: eliminate a few locking dance in LNet
            51a5b4df5bbbf5fd12c73d2722b230e93fe93327 LU-56 lnet: parse RC ping in event callback
            b9bad9bd7d1c3271df916ee62091106e3f3c98b7 LU-56 lnet: router-checker (RC) cleanup
            4fcc56be68c8c1667fbd91721d084874a2f05c3e LU-56 ptlrpc: common code to validate nthreads
            ed22093b2d569fd0e93f35504580171114bf212d LU-56 lnet: move "match" functions to lib-ptl.c
            a096d858b671f28fd4c5e6197b51643cd0780a50 LU-56 lnet: allow to create EQ with zero eq_size
            c1366da8f43ecfb98ef3bdcf629eec8a2fc9cd4c LU-56 lnet: cleanup for LNet Event Queue
            3211f6862cbbe96642db540e6593f3c614f9528c LU-56 lnet: new internal object lnet_peer_table
            7a51ad347960ef2b9d1dfad14644c0bca35b80b6 LU-56 ptlrpc: clean up ptlrpc svc initializing APIs
            facf5086667874c405c9ef6ce7f8f737868ffefd LU-56 lnet: container for LNet message
            c3a57ec36441c75df03cfbec8f718e053aaad12a LU-56 lnet: abstract container for EQ/ME/MD
            4bd9bf53728260d38efc74cac981318fe31280cd LU-56 lnet: add lnet_*_free_locked for LNet
            c8da7bfbe0505175869973b25281b152940774b0 LU-56 libcfs: more common APIs in libcfs
            b76f327de7836a854f204d28e61de52bc03011b1 LU-56 libcfs: export a few symbols from libcfs
            3a92c850b094019e556577ec6cab5907538dcbf5 LU-56 libcfs: NUMA allocator and code cleanup
            617e8e1229637908d4cce6725878dd5668960420 LU-56 libcfs: implementation of cpu partition
            19ec037c0a9427250b87a69c53beb153d533ab1c LU-56 libcfs: move range expression parser to libcfs

            ian Ian Colle (Inactive) added a comment - edited

            http://review.whamcloud.com/#change,2346
            http://review.whamcloud.com/#change,2461
            http://review.whamcloud.com/#change,2523
            http://review.whamcloud.com/#change,2558
            http://review.whamcloud.com/#change,2638
            http://review.whamcloud.com/#change,2718
            http://review.whamcloud.com/#change,2725
            http://review.whamcloud.com/#change,2729

            FYI - They're still in review.

            simmonsja James A Simmons added a comment - Can you post links to your patches here? Thank you.
            liang Liang Zhen (Inactive) added a comment -

            • Background of the project
              • I already landed tens of standalone patches while I was at Oracle; everything described below is about patches that have not landed yet.
              • A fat server is divided into several processing partitions (cpu-partitions, abbreviated CP); each partition contains some CPU cores (or NUMA nodes), a memory pool, a message queue, and a thread pool. It is a little like the concept of a virtual machine, although much simpler. This is not a new idea; I am just introducing the term to replace what I used to call cpu_node + cpu-affinity, which was confusing because "node" is already used for NUMA nodes in the Linux kernel.
              • Although we still have a single namespace on the MDS, we can have several virtual processing partitions. The lifetime of a request should be localized to one processing partition as much as possible, to reduce data/thread migration between CPUs and to reduce lock contention and cacheline conflicts:
                • the LND has a message queue and thread pool for each partition
                • LNet has an EQ callback for each partition
                • each ptlrpc service has a message queue and thread pool for each partition
              • The number of CPU partitions is configurable (via the new libcfs parameter "cpu_npartitions"); libcfs will also estimate "cpu_npartitions" automatically from the number of CPUs. With cpu_npartitions=N (N > 1), the Lustre stack can have N standalone processing flows (unless they contend on the same resource); with cpu_npartitions=1 there is only one partition, and Lustre should behave like the current master.
              • The user can also provide a string pattern for the cpu-partitions, e.g. "0[0, 2, 4, 6] 1[1, 3, 5, 7]": Lustre will then have two cpu-partitions, the first containing cores [0, 2, 4, 6] and the second cores [1, 3, 5, 7]. NB: the numbers inside the brackets can be NUMA node IDs as well (see the parsing sketch after this comment).
              • The number of CPU partitions should be smaller than the number of cores (or hyper-threads): modern machines can have hundreds of cores or more, and a per-core scheduler (thread) pool has two major downsides: a) poor load balance, especially on small/medium clusters; b) too many threads overall.
                • on clients/OSSes/routers, cpu_npartitions == the number of NUMA nodes, so the dataflow of a request/reply is localized on one NUMA node
                • on the MDS, a cpu-partition will contain a few cores (or hyper-threads); e.g. on a system with 4 NUMA nodes and 64 cores we would have 8 cpu-partitions
                • we might hash different objects (by FID) to different cpu-partitions in the future; that is an advanced feature we do not have yet
                • we can bind an LNet NI to a specified cpu-partition for NUMIOA performance
              • These things helped a lot with the performance of many-target-directory tests, but they do not help shared-directory performance at all: I never saw 20K+ files/sec in shared-directory creation/removal tests (non-zero stripe), and with 1.8 the number is < 7K files/sec.
              • The new pdirops patch works fine (LU-50); Andreas has already reviewed the prototype, and I have now posted a second version for review. The pdirops patch is different from the old one (from 5 or 6 years ago?):
                • the old version was dynlock based; it had too many inline changes to ext4, and we would probably need to change even more to support an N-level htree (large directories), probably reworking ext4/namei.c until it looked quite like our IAM
                • the new version is htree_lock based; although the implementation of htree_lock is big and complex, it needs only a few inline changes to ext4. htree_lock is more like an embedded component: it can be enabled/disabled easily and requires very few logic changes to current ext4. The patch is big (about 2K lines), but only about 200 LOC are inline changes (including those for the N-level htree), and half of those just add a new parameter to some functions (see the conceptual locking sketch after this comment)
                • the htree_lock based pdirops patch has the same performance as an IAM directory, but without any interop issues
                • with the pdirops patch on my branch, I got 65K+ files/sec opencreate+close (narrow-striped files, aggregate performance on the server) in my latest test on toro
              • The MDD transaction scheduler will be dropped entirely. If you remember, I previously mentioned adding a relatively small thread pool in MDD to take over backend filesystem transactions; the pool would be just big enough to drive full metadata throughput, mainly to help shared-directory performance. However, the pdirops patch works very well, so I will drop the scheduler entirely (there is more detail in my mail to lustre-devel a few weeks ago).
            • Key components for landing
              • Tens of standalone patches
                • no dependencies on each other; they range from a few LOC to a few hundred LOC
                • some of them will land on 2.1; all of them will land on 2.2
              • pdirops patch + BH LRU size kernel patch
                • the pdirops patch is big but easy to maintain (at least it is easy to port to RHEL6)
                • I have sent another mail to Andreas and bzzz explaining why we need to increase the BH LRU size; it will be a small patch
                • we would like to land these on 2.2 if we get enough resources or funding for this
              • cpu-partition patches
                • the biggest chunk: several large patches spread over the stack layers (libcfs, LNet, LND, the ptlrpc server side, and some small changes to other modules such as mdt and ost)
                • the patches have dependencies on each other
                • the biggest challenge is the inspection of LNet + the LNDs; isaac might help us review, and Lai Siyao will be the other inspector
                • if we get funding for this we aim to land them on 2.2; otherwise it is more realistic to land them on 2.3
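
            To make the cpu-partition description concrete, below is a minimal, self-contained C sketch. It is not the libcfs implementation: struct cpt_table, cpt_table_parse and cpt_of_hash are hypothetical names. It parses a pattern such as "0[0, 2, 4, 6] 1[1, 3, 5, 7]" into a partition table and hashes an object ID to one partition, the basic mechanism that keeps a request's whole lifetime on one set of cores.

              #include <stdio.h>
              #include <stdlib.h>

              /* Hypothetical, simplified cpu-partition table (NOT the libcfs
               * code): each partition owns a set of cores (or NUMA nodes);
               * work is hashed to one partition so that its whole lifetime
               * stays on those cores. */
              #define MAX_CPTS  16
              #define MAX_CORES 64

              struct cpt_table {
                  int ncpts;                      /* number of partitions */
                  int ncores[MAX_CPTS];           /* cores per partition */
                  int core[MAX_CPTS][MAX_CORES];  /* core (or NUMA) IDs */
              };

              /* Parse a pattern such as "0[0, 2, 4, 6] 1[1, 3, 5, 7]"; the
               * numbers inside the brackets may be core or NUMA node IDs. */
              static int cpt_table_parse(struct cpt_table *tab, const char *pattern)
              {
                  const char *p = pattern;
                  char *end;

                  tab->ncpts = 0;
                  while (*p != '\0') {
                      long cpt, id;

                      while (*p == ' ')
                          p++;
                      if (*p == '\0')
                          break;
                      cpt = strtol(p, &end, 10);
                      if (end == p || *end != '[' ||
                          cpt != tab->ncpts || cpt >= MAX_CPTS)
                          return -1;  /* partitions must appear as 0,1,2,... */
                      p = end + 1;
                      tab->ncores[cpt] = 0;
                      while (*p != ']') {
                          id = strtol(p, &end, 10);
                          if (end == p || tab->ncores[cpt] >= MAX_CORES)
                              return -1;
                          tab->core[cpt][tab->ncores[cpt]++] = (int)id;
                          p = end;
                          if (*p == ',' || *p == ' ')
                              p++;
                      }
                      p++;            /* skip ']' */
                      tab->ncpts++;
                  }
                  return tab->ncpts > 0 ? 0 : -1;
              }

              /* Localize an object to one partition by hashing an ID (e.g. a
               * FID): every message and thread for that object then runs on
               * the same partition. */
              static int cpt_of_hash(const struct cpt_table *tab, unsigned long long id)
              {
                  return (int)(id % (unsigned long long)tab->ncpts);
              }

              int main(void)
              {
                  struct cpt_table tab;

                  if (cpt_table_parse(&tab, "0[0, 2, 4, 6] 1[1, 3, 5, 7]") == 0)
                      printf("ncpts=%d, object 42 -> partition %d\n",
                             tab.ncpts, cpt_of_hash(&tab, 42));
                  return 0;
              }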
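            Similarly, the idea behind pdirops can be illustrated with a small conceptual C sketch. This is not the htree_lock patch (which supports shared/exclusive lock modes and N-level htrees); pdir, pdir_init and pdir_create are hypothetical names. The point it demonstrates: instead of one big per-directory mutex, each operation takes a lock covering only the piece of the directory it touches, so concurrent creates in a shared directory stop serializing on a single lock.

              #include <pthread.h>
              #include <stdio.h>

              /* Conceptual sketch only: model "lock just the htree block you
               * touch" as a hash from name -> one of N bucket locks, instead
               * of a single per-directory mutex. */
              #define DIR_LOCK_BUCKETS 64

              struct pdir {
                  pthread_mutex_t bucket[DIR_LOCK_BUCKETS];
              };

              static void pdir_init(struct pdir *d)
              {
                  for (int i = 0; i < DIR_LOCK_BUCKETS; i++)
                      pthread_mutex_init(&d->bucket[i], NULL);
              }

              static unsigned name_hash(const char *name)
              {
                  unsigned h = 5381;

                  while (*name != '\0')
                      h = h * 33 + (unsigned char)*name++;
                  return h;
              }

              /* Two threads creating different names usually hash to
               * different buckets, so they proceed in parallel within the
               * same directory. */
              static void pdir_create(struct pdir *d, const char *name)
              {
                  pthread_mutex_t *lk = &d->bucket[name_hash(name) % DIR_LOCK_BUCKETS];

                  pthread_mutex_lock(lk);
                  printf("inserting %s under bucket lock %ld\n",
                         name, (long)(lk - d->bucket));
                  pthread_mutex_unlock(lk);
              }

              int main(void)
              {
                  struct pdir d;

                  pdir_init(&d);
                  pdir_create(&d, "file-0");  /* in practice: many threads */
                  pdir_create(&d, "file-1");
                  return 0;
              }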

            liang Liang Zhen (Inactive) added a comment - Hi, I think you need this patch: http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=6bdb62e3d7ed206bcaef0cd3499f9cb56dc1fb92 I've synced my branch with master, so you just need to pull it from my branch.

            People

              Assignee: liang Liang Zhen (Inactive)
              Reporter: rread Robert Read
              Votes: 0
              Watchers: 17
