difference of single client's performance between b2_1 and 2.1.2RC0

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.1.2
    • None
    • CentOS6.2, the latest b2_1 branch and 2.1.2RC0
    • 3
    • 6394

    Description

      Performance differences between b2_1 and 2.1.2RC0

      During single-client performance testing (LU-1408), I saw another single-client performance regression on b2_1 (the latest as checked out on 05/15/12, http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=e9daba96e0a6bbe898b3d6207b2fe4bdd3293181) relative to the 2.1.2RC0 tag.

      Could any changes made after 2.1.2RC0 have caused this regression?

      Both versions of Lustre on the client were tested on the same hardware, the same OS (CentOS, including the kernel), and the same network.
      The servers are running lustre-2.2.

      I also tested on the client with FDR InfiniBand as well as QDR InfiniBand, without Lustre checksums.

          Activity

            [LU-1413] difference of single client's performance between b2_1 and 2.1.2RC0
            pjones Peter Jones added a comment -

            So this seems to be a duplicate of LU-1408 as it has the same root cause - a performance regression introduced by LU-969

            Good job, Ihara.

            jay Jinshan Xiong (Inactive) added the comment above.

            I ran a few "git bisect bad/good" steps, benchmarking each commit, and finally figured out which commit caused this regression.
            The regression starts with commit b9cbe3616b6e0b44c7835b1aec65befb85f848f9 (LU-969 debug: reduce stack usage).

            Here are the bisect log and the performance numbers (IOR with 4 processes on one client) for each commit.

            # git rev-list 68ed546e43dbc4ba31b409e9dbf8a65ef9a7f425..HEAD
            ab83c57df451f4907752be6dad0ce8d87b98d60b bad (write 1.4GB/s, read 1.4GB/s)
            8676a50913f0572d47e987483a45167d9e9faacd
            36309a984850cd89f2b62938db5d56431834fd36
            12a618ebb00b940678785cfef8050b3ea9f0ad04
            849dc60cf1f8d46d9ccfa60bcf6e118d7aeafed3
            ad8dee856e37ef8c5ac4ba8466ce14e941ccd268
            e9daba96e0a6bbe898b3d6207b2fe4bdd3293181
            57c7b312b242a24768baf42cc88dec450f068948
            adec9ed03b9374088bec8c6e9e2dcc9b5c24902f
            1e56c6d0bcd0a0b75f0f16060da015612f948134
            0fe07cd4252d8c478e8d05d80b877b81f8ad2ed9
            d4f36787d67c02e2ce7b21a891ff71bc709b3cb5 bad (write 1.4GB/s, read 1.4GB/s)
            85aeb732e0dc300247c1f941ac22d88a62957cb3
            4c2b197ec3054acce53f4b149a7fb1921caaf4e5
            abff03a2c27e7abe3c56c856a367118f0038aad4
            2e6aeae984ea03b5e756b4fb6816ac5eafcd5660
            e4694e0153e8323f401ee7f95c3fe391aee38792 bad (write 1.4GB/s, read 1.4GB/s)
            9f70f65536e31e297c5cf495247978ddb187fbe8 bad (write 1.4GB/s, read 1.4GB/s)
            b9cbe3616b6e0b44c7835b1aec65befb85f848f9 bad (write 1.4GB/s, read 1.4GB/s)
            1702f156bf61210a937f40ae5b8e9d8832a3e59a ok  (write 2.6GB/s, read 2.8GB/s)
            6a9def181397a20c48b8f55ef4f6c29f2fc18ed6
            8abf07f93019d7f3cddd9a796f9486dd56be2f13 ok  (write 2.8GB/s, read 3.1GB/s)
            9aba7a41b0ee2f598cf1a9a634c374cc294fdfe6
            2a65b7b49a1cb0aa3ea46e9a0612708f1a8474b8
            9812260ca4296c198ad7bc63f1a64718f159597b
            042980026c596ff08c97764bbcf7a1e710fd4f5a
            2d2f1cf4f63f02cdb2ef03c53b971877d565749a
            afd6ac5f37f72bdbc921b598aecdfc2e15de0875
            9cfb8177b2ba56fb04c09d38939ca300c05a4ce6
            1fa441093995a076e73d7e9e1062012fb98f6ee4
            f1ceb64aa3d2cfefb32d1e73957c574acf843be6
            08baca07bb19fb34f1a6a1bbe943d9ef318463fd
            822cb2cdd59328944ea401eeede664abd00a64ca
            ac58a77eed62e0a8fb07f150fe0af7fdde9c556e
            7d4a0e9564c759b558806d6be5394fa72ee85d31
            64599b0816c012480485722e876253871c267511
            a57c2dd5cd878c522d8d67e383c9981ef2cce823
            c6d5b6ff72ed40ecce1a7c47542371f2dba5ad5f
            ee3a35d1f1bc9923466a5b025ca30b855d478ebb
            63ec7c9fac90d8020bad5eeb900df8c539662d56
            4e2e7f232ad001cfb91c46c18d529603d4995941
            d7c60130c4bd32c8edd48452a3538953c425a63a ok  (write 2.6GB/s, read 3.0GB/s)
            
            # git bisect start
            # git bisect bad  
            # git bisect good d7c60130c4bd32c8edd48452a3538953c425a63a
            Bisecting: 20 revisions left to test after this (roughly 4 steps)
            [8abf07f93019d7f3cddd9a796f9486dd56be2f13] LU-577 tests: FAIL replay-single test_70b rundbench load
            
            # git bisect good 
            Bisecting: 10 revisions left to test after this (roughly 3 steps)
            [d4f36787d67c02e2ce7b21a891ff71bc709b3cb5] LU-630 lnet: only router checks peer health
            
            # git bisect bad 
            Bisecting: 4 revisions left to test after this (roughly 2 steps)
            [e4694e0153e8323f401ee7f95c3fe391aee38792] LU-1095 debug: Standardize, suppress mount/umount messages
            
            # git bisect bad
            Bisecting: 2 revisions left to test after this (roughly 1 step)
            [1702f156bf61210a937f40ae5b8e9d8832a3e59a] LU-1361 build: enable kabi on rhel6
            
            # git bisect good
            Bisecting: 0 revisions left to test after this (roughly 1 step)
            [9f70f65536e31e297c5cf495247978ddb187fbe8] LU-1095 mgs: remove message from console
            
            # git bisect bad
            Bisecting: 0 revisions left to test after this (roughly 0 steps)
            [b9cbe3616b6e0b44c7835b1aec65befb85f848f9] LU-969 debug: reduce stack usage
            
            # git bisect bad
            b9cbe3616b6e0b44c7835b1aec65befb85f848f9 is the first bad commit
            commit b9cbe3616b6e0b44c7835b1aec65befb85f848f9
            Author: Hongchao Zhang <hongchao.zhang@whamcloud.com>
            Date:   Mon Mar 12 16:11:47 2012 +0800
            
                LU-969 debug: reduce stack usage
                
                1, libcfs_debug_vmsg2 to accept libcfs_debug_msg_data struture
                   to replace SUBSYSTEM, __FILE__, __FUNCTION__, __LINE__ and
                   cdls on the stack
                
                2, CDEBUG, DEBUG_CAPA use static libcfs_debug_msg_data
                
                3, remove the local variable in RETURN/GOTO/__CHECK_STACK
                
                4, reduce stack in recovery thread by moving lu_env,
                   ptlrpc_thread to heap.
                
                Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
                Signed-off-by: Hongchao Zhang <hongchao.zhang@whamcloud.com>
                Signed-off-by: Bob Glossman <bogl@whamcloud.com>
                Change-Id: I75fe53027f56e27255b5f558e8fd57c7db833648
                Reviewed-on: http://review.whamcloud.com/2668
                Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
                Tested-by: Hudson
                Reviewed-by: Oleg Drokin <green@whamcloud.com>
            
            :040000 040000 5e599b4fa74f5757eec4df24b7b76daf624f7b42 1055e80b29131ae03cc30a6252809fb249a843d0 M	libcfs
            :040000 040000 2e31d6ad09f31455f1e07b6a726d4a651291798e a23402122c63a1858a3f246d094873f4cf40991c M	lustre
            

            To double-check, I rebased b2_1 to remove the LU-969 patch and ran the benchmark again.

            # git checkout -b b2_1-lu1413 b2_1
            # git rebase -i 1702f156bf61210a937f40ae5b8e9d8832a3e59a
            
            remove 
            pick b9cbe36 LU-969 debug: reduce stack usage
            .
            .
            ".git/rebase-merge/git-rebase-todo" 31L, 1602C written
            Successfully rebased and updated refs/heads/b2_1-lu1413.
            
            IOR result
            
            Max Write: 2592.10 MiB/sec (2718.01 MB/sec)
            Max Read:  3058.55 MiB/sec (3207.13 MB/sec)
            
            

            We get the same numbers that we saw on the 2.1.2RC0 tag.

            ihara Shuichi Ihara (Inactive) added the comment above.
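
            The interactive rebase in the comment above drops the commit by deleting its "pick" line by hand. A non-interactive way to do the same thing, offered only as a sketch, is to replay everything after the commit onto that commit's parent with "git rebase --onto". The helper name drop_commit is hypothetical. The sketch assumes a linear history and that the later commits apply cleanly without the dropped one.

```shell
# drop_commit COMMIT BRANCH
# Hypothetical helper: rewrite BRANCH so that COMMIT is no longer in its history.
# It replays the commits in COMMIT..BRANCH onto COMMIT's parent.
# Assumes linear history; the rebase stops on conflicts like any other rebase.
drop_commit() {
    git rebase --onto "$1^" "$1" "$2"
}

# Usage sketch, run on a scratch branch as in the comment above:
#   git checkout -b b2_1-lu1413 b2_1
#   drop_commit b9cbe3616b6e0b44c7835b1aec65befb85f848f9 b2_1-lu1413
```

            Like the interactive rebase, this rewrites the commits that follow the dropped one, so it is only suitable for a throwaway test branch.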

            OK, that's a helpful way to find where the performance regression starts. I will try it.

            ihara Shuichi Ihara (Inactive) added the comment above.

            I didn't find the culprit. A common way to find it is with git bisect.

            jxiong@mac: b21$ git rev-list  v2_1_2_0_RC0..HEAD
            165bc30fb0a403289be6e0831c7913951a63d259
            e9daba96e0a6bbe898b3d6207b2fe4bdd3293181
            57c7b312b242a24768baf42cc88dec450f068948
            adec9ed03b9374088bec8c6e9e2dcc9b5c24902f
            1e56c6d0bcd0a0b75f0f16060da015612f948134
            0fe07cd4252d8c478e8d05d80b877b81f8ad2ed9
            d4f36787d67c02e2ce7b21a891ff71bc709b3cb5
            85aeb732e0dc300247c1f941ac22d88a62957cb3
            4c2b197ec3054acce53f4b149a7fb1921caaf4e5
            abff03a2c27e7abe3c56c856a367118f0038aad4
            2e6aeae984ea03b5e756b4fb6816ac5eafcd5660
            e4694e0153e8323f401ee7f95c3fe391aee38792
            9f70f65536e31e297c5cf495247978ddb187fbe8
            b9cbe3616b6e0b44c7835b1aec65befb85f848f9
            1702f156bf61210a937f40ae5b8e9d8832a3e59a
            6a9def181397a20c48b8f55ef4f6c29f2fc18ed6
            8abf07f93019d7f3cddd9a796f9486dd56be2f13
            9aba7a41b0ee2f598cf1a9a634c374cc294fdfe6
            2a65b7b49a1cb0aa3ea46e9a0612708f1a8474b8
            9812260ca4296c198ad7bc63f1a64718f159597b
            042980026c596ff08c97764bbcf7a1e710fd4f5a
            2d2f1cf4f63f02cdb2ef03c53b971877d565749a
            afd6ac5f37f72bdbc921b598aecdfc2e15de0875
            9cfb8177b2ba56fb04c09d38939ca300c05a4ce6
            1fa441093995a076e73d7e9e1062012fb98f6ee4
            f1ceb64aa3d2cfefb32d1e73957c574acf843be6
            08baca07bb19fb34f1a6a1bbe943d9ef318463fd
            822cb2cdd59328944ea401eeede664abd00a64ca
            ac58a77eed62e0a8fb07f150fe0af7fdde9c556e
            7d4a0e9564c759b558806d6be5394fa72ee85d31
            64599b0816c012480485722e876253871c267511
            a57c2dd5cd878c522d8d67e383c9981ef2cce823
            c6d5b6ff72ed40ecce1a7c47542371f2dba5ad5f
            ee3a35d1f1bc9923466a5b025ca30b855d478ebb
            63ec7c9fac90d8020bad5eeb900df8c539662d56
            4e2e7f232ad001cfb91c46c18d529603d4995941
            

            and:

            jxiong@mac: b21$ git rev-list --bisect v2_1_2_0_RC0..HEAD
            2a65b7b49a1cb0aa3ea46e9a0612708f1a8474b8

            This way we will find which commit degrades performance.

            jay Jinshan Xiong (Inactive) added the comment above.
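
            The good/bad loop suggested above can also be automated with "git bisect run". It treats exit code 0 as good, 125 as "skip this commit", and any other code from 1 to 127 as bad. The following is a minimal sketch of the decision step only. The function name classify_bw, the script names in the usage comment, and the 2000 MB/s threshold are assumptions for illustration, not something used in this ticket.

```shell
# classify_bw MEASURED_MBPS THRESHOLD_MBPS
# Hypothetical helper for use under `git bisect run`:
#   returns 0   (good) when the measured bandwidth meets the threshold,
#   returns 1   (bad)  when it falls below the threshold,
#   returns 125 (skip) when no numeric measurement is available,
#                      for example because the build or the benchmark failed.
classify_bw() {
    measured=$1
    threshold=$2
    case $measured in
        ''|*[!0-9.]*) return 125 ;;
    esac
    awk -v m="$measured" -v t="$threshold" 'BEGIN { exit !(m >= t) }'
}

# Usage sketch. Both scripts are assumed, site-specific wrappers:
# run_ior.sh rebuilds and reinstalls the client, runs IOR, and prints the
# write bandwidth in MB/s; bisect_step.sh defines classify_bw and ends with
#   classify_bw "$(./run_ior.sh)" 2000
#
#   git bisect start HEAD v2_1_2_0_RC0
#   git bisect run ./bisect_step.sh
```

            The threshold needs to sit well between the two regimes being compared, because benchmark noise close to the cutoff would mislabel commits and send the bisect down the wrong path.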

            2.1.2RC0 was tagged on Apr 23.

            jxiong@mac: b21$ git show v2_1_2_0_RC0
            tag v2_1_2_0_RC0
            Tagger: Oleg Drokin <green@whamcloud.com>
            Date: Mon Apr 23 14:46:57 2012 -0400

            There shouldn't be any significant changes to client code. I'll check it.

            jay Jinshan Xiong (Inactive) added the comment above.
            pjones Peter Jones added a comment -

            Oleg

            Could any of the recent b2_1 landings be responsible for this seeming performance regression?

            Peter

            Test results.

            ihara Shuichi Ihara (Inactive) added the comment above.

            People

              green Oleg Drokin
              ihara Shuichi Ihara (Inactive)