[LU-1413] difference of single client's performance between b2_1 and 2.1.2RC0 Created: 16/May/12  Updated: 20/May/12  Resolved: 20/May/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Shuichi Ihara (Inactive) Assignee: Oleg Drokin
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS6.2, the latest b2_1 branch and 2.1.2RC0


Attachments: 2.1.2RC0-b2_1.xlsx
Issue Links:
Related
is related to LU-744 Single client's performance degradati... Resolved
is related to LU-969 2.1 client stack overruns Resolved
is related to LU-1408 single client's performance regressio... Resolved
Severity: 3
Rank (Obsolete): 6394

 Description   

Performance differences between b2_1 and 2.1.2RC0

During single-client performance testing (LU-1408), I saw another single-client performance regression on b2_1 (the latest as of 05/15/12, http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=e9daba96e0a6bbe898b3d6207b2fe4bdd3293181) relative to the 2.1.2RC0 tag.

Have any changes landed after 2.1.2RC0 that could cause this regression?

Both versions of Lustre on the client were tested on the same hardware, the same OS (CentOS, including the kernel), and the same network.
The servers are running lustre-2.2.

I also tested the client with both FDR and QDR InfiniBand, without Lustre checksums.



 Comments   
Comment by Shuichi Ihara (Inactive) [ 16/May/12 ]

Test results.

Comment by Peter Jones [ 16/May/12 ]

Oleg

Could any of the recent b2_1 landings be responsible for this seeming performance regression?

Peter

Comment by Jinshan Xiong (Inactive) [ 16/May/12 ]

2.1.2RC0 was tagged on Apr 23.

jxiong@mac: b21$ git show v2_1_2_0_RC0
tag v2_1_2_0_RC0
Tagger: Oleg Drokin <green@whamcloud.com>
Date: Mon Apr 23 14:46:57 2012 -0400

There shouldn't be any significant changes to client code. I'll check it.

Comment by Jinshan Xiong (Inactive) [ 16/May/12 ]

I didn't find the culprit. A common way to track it down is git bisect.

jxiong@mac: b21$ git rev-list  v2_1_2_0_RC0..HEAD
165bc30fb0a403289be6e0831c7913951a63d259
e9daba96e0a6bbe898b3d6207b2fe4bdd3293181
57c7b312b242a24768baf42cc88dec450f068948
adec9ed03b9374088bec8c6e9e2dcc9b5c24902f
1e56c6d0bcd0a0b75f0f16060da015612f948134
0fe07cd4252d8c478e8d05d80b877b81f8ad2ed9
d4f36787d67c02e2ce7b21a891ff71bc709b3cb5
85aeb732e0dc300247c1f941ac22d88a62957cb3
4c2b197ec3054acce53f4b149a7fb1921caaf4e5
abff03a2c27e7abe3c56c856a367118f0038aad4
2e6aeae984ea03b5e756b4fb6816ac5eafcd5660
e4694e0153e8323f401ee7f95c3fe391aee38792
9f70f65536e31e297c5cf495247978ddb187fbe8
b9cbe3616b6e0b44c7835b1aec65befb85f848f9
1702f156bf61210a937f40ae5b8e9d8832a3e59a
6a9def181397a20c48b8f55ef4f6c29f2fc18ed6
8abf07f93019d7f3cddd9a796f9486dd56be2f13
9aba7a41b0ee2f598cf1a9a634c374cc294fdfe6
2a65b7b49a1cb0aa3ea46e9a0612708f1a8474b8
9812260ca4296c198ad7bc63f1a64718f159597b
042980026c596ff08c97764bbcf7a1e710fd4f5a
2d2f1cf4f63f02cdb2ef03c53b971877d565749a
afd6ac5f37f72bdbc921b598aecdfc2e15de0875
9cfb8177b2ba56fb04c09d38939ca300c05a4ce6
1fa441093995a076e73d7e9e1062012fb98f6ee4
f1ceb64aa3d2cfefb32d1e73957c574acf843be6
08baca07bb19fb34f1a6a1bbe943d9ef318463fd
822cb2cdd59328944ea401eeede664abd00a64ca
ac58a77eed62e0a8fb07f150fe0af7fdde9c556e
7d4a0e9564c759b558806d6be5394fa72ee85d31
64599b0816c012480485722e876253871c267511
a57c2dd5cd878c522d8d67e383c9981ef2cce823
c6d5b6ff72ed40ecce1a7c47542371f2dba5ad5f
ee3a35d1f1bc9923466a5b025ca30b855d478ebb
63ec7c9fac90d8020bad5eeb900df8c539662d56
4e2e7f232ad001cfb91c46c18d529603d4995941

and:

jxiong@mac: b21$ git rev-list --bisect v2_1_2_0_RC0..HEAD
2a65b7b49a1cb0aa3ea46e9a0612708f1a8474b8

This way we will find which commit degrades performance.

Comment by Shuichi Ihara (Inactive) [ 16/May/12 ]

OK, that will make it easy to find where the performance regression starts. I will try it.

Comment by Shuichi Ihara (Inactive) [ 20/May/12 ]

I ran a few "git bisect bad/good" steps, benchmarked each commit, and finally figured out which commit caused this regression.
The regression started from commit b9cbe3616b6e0b44c7835b1aec65befb85f848f9 (LU-969 debug: reduce stack usage).

Here are the bisect log and the performance numbers (IOR with 4 processes on one client) for each commit.

# git rev-list 68ed546e43dbc4ba31b409e9dbf8a65ef9a7f425..HEAD
ab83c57df451f4907752be6dad0ce8d87b98d60b bad (write 1.4GB/s, read 1.4GB/s)
8676a50913f0572d47e987483a45167d9e9faacd
36309a984850cd89f2b62938db5d56431834fd36
12a618ebb00b940678785cfef8050b3ea9f0ad04
849dc60cf1f8d46d9ccfa60bcf6e118d7aeafed3
ad8dee856e37ef8c5ac4ba8466ce14e941ccd268
e9daba96e0a6bbe898b3d6207b2fe4bdd3293181
57c7b312b242a24768baf42cc88dec450f068948
adec9ed03b9374088bec8c6e9e2dcc9b5c24902f
1e56c6d0bcd0a0b75f0f16060da015612f948134
0fe07cd4252d8c478e8d05d80b877b81f8ad2ed9
d4f36787d67c02e2ce7b21a891ff71bc709b3cb5 bad (write 1.4GB/s, read 1.4GB/s)
85aeb732e0dc300247c1f941ac22d88a62957cb3
4c2b197ec3054acce53f4b149a7fb1921caaf4e5
abff03a2c27e7abe3c56c856a367118f0038aad4
2e6aeae984ea03b5e756b4fb6816ac5eafcd5660
e4694e0153e8323f401ee7f95c3fe391aee38792 bad (write 1.4GB/s, read 1.4GB/s)
9f70f65536e31e297c5cf495247978ddb187fbe8 bad (write 1.4GB/s, read 1.4GB/s)
b9cbe3616b6e0b44c7835b1aec65befb85f848f9 bad (write 1.4GB/s, read 1.4GB/s)
1702f156bf61210a937f40ae5b8e9d8832a3e59a ok  (write 2.6GB/s, read 2.8GB/s)
6a9def181397a20c48b8f55ef4f6c29f2fc18ed6
8abf07f93019d7f3cddd9a796f9486dd56be2f13 ok  (write 2.8GB/s, read 3.1GB/s)
9aba7a41b0ee2f598cf1a9a634c374cc294fdfe6
2a65b7b49a1cb0aa3ea46e9a0612708f1a8474b8
9812260ca4296c198ad7bc63f1a64718f159597b
042980026c596ff08c97764bbcf7a1e710fd4f5a
2d2f1cf4f63f02cdb2ef03c53b971877d565749a
afd6ac5f37f72bdbc921b598aecdfc2e15de0875
9cfb8177b2ba56fb04c09d38939ca300c05a4ce6
1fa441093995a076e73d7e9e1062012fb98f6ee4
f1ceb64aa3d2cfefb32d1e73957c574acf843be6
08baca07bb19fb34f1a6a1bbe943d9ef318463fd
822cb2cdd59328944ea401eeede664abd00a64ca
ac58a77eed62e0a8fb07f150fe0af7fdde9c556e
7d4a0e9564c759b558806d6be5394fa72ee85d31
64599b0816c012480485722e876253871c267511
a57c2dd5cd878c522d8d67e383c9981ef2cce823
c6d5b6ff72ed40ecce1a7c47542371f2dba5ad5f
ee3a35d1f1bc9923466a5b025ca30b855d478ebb
63ec7c9fac90d8020bad5eeb900df8c539662d56
4e2e7f232ad001cfb91c46c18d529603d4995941
d7c60130c4bd32c8edd48452a3538953c425a63a ok  (write 2.6GB/s, read 3.0GB/s)
# git bisect start
# git bisect bad  
# git bisect good d7c60130c4bd32c8edd48452a3538953c425a63a
Bisecting: 20 revisions left to test after this (roughly 4 steps)
[8abf07f93019d7f3cddd9a796f9486dd56be2f13] LU-577 tests: FAIL replay-single test_70b rundbench load

# git bisect good 
Bisecting: 10 revisions left to test after this (roughly 3 steps)
[d4f36787d67c02e2ce7b21a891ff71bc709b3cb5] LU-630 lnet: only router checks peer health

# git bisect bad 
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[e4694e0153e8323f401ee7f95c3fe391aee38792] LU-1095 debug: Standardize, suppress mount/umount messages

# git bisect bad
Bisecting: 2 revisions left to test after this (roughly 1 step)
[1702f156bf61210a937f40ae5b8e9d8832a3e59a] LU-1361 build: enable kabi on rhel6

# git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[9f70f65536e31e297c5cf495247978ddb187fbe8] LU-1095 mgs: remove message from console

# git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[b9cbe3616b6e0b44c7835b1aec65befb85f848f9] LU-969 debug: reduce stack usage

# git bisect bad
b9cbe3616b6e0b44c7835b1aec65befb85f848f9 is the first bad commit
commit b9cbe3616b6e0b44c7835b1aec65befb85f848f9
Author: Hongchao Zhang <hongchao.zhang@whamcloud.com>
Date:   Mon Mar 12 16:11:47 2012 +0800

    LU-969 debug: reduce stack usage
    
    1, libcfs_debug_vmsg2 to accept libcfs_debug_msg_data struture
       to replace SUBSYSTEM, __FILE__, __FUNCTION__, __LINE__ and
       cdls on the stack
    
    2, CDEBUG, DEBUG_CAPA use static libcfs_debug_msg_data
    
    3, remove the local variable in RETURN/GOTO/__CHECK_STACK
    
    4, reduce stack in recovery thread by moving lu_env,
       ptlrpc_thread to heap.
    
    Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
    Signed-off-by: Hongchao Zhang <hongchao.zhang@whamcloud.com>
    Signed-off-by: Bob Glossman <bogl@whamcloud.com>
    Change-Id: I75fe53027f56e27255b5f558e8fd57c7db833648
    Reviewed-on: http://review.whamcloud.com/2668
    Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
    Tested-by: Hudson
    Reviewed-by: Oleg Drokin <green@whamcloud.com>

:040000 040000 5e599b4fa74f5757eec4df24b7b76daf624f7b42 1055e80b29131ae03cc30a6252809fb249a843d0 M	libcfs
:040000 040000 2e31d6ad09f31455f1e07b6a726d4a651291798e a23402122c63a1858a3f246d094873f4cf40991c M	lustre
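
The manual good/bad marking above could in principle be handed to git bisect run, which treats exit status 0 as good, 125 as "skip this commit", and any other status from 1 to 127 as bad. What follows is only a sketch under assumptions: THRESHOLD_MIB, the script name, and the run_benchmark helper are hypothetical and do not come from this ticket. A real run_benchmark would have to rebuild and reinstall the client, remount, and run IOR, all of which is site-specific.

```shell
# Sketch of a predicate for `git bisect run` (not the procedure used in this ticket).
# THRESHOLD_MIB is an assumed cut-off between "good" and "bad" write throughput.
THRESHOLD_MIB=${THRESHOLD_MIB:-2000}

# Print the MiB/sec figure from an IOR summary line such as
#   Max Write: 2592.10 MiB/sec (2718.01 MB/sec)
# $1 is "Write" or "Read"; IOR output is read from stdin.
ior_max_mib() {
    awk -v op="$1" '$1 == "Max" && $2 == (op ":") { print $3; exit }'
}

# Return 0 when the measured value ($1) reaches the threshold ($2), 1 otherwise.
is_good() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 >= t + 0) }'
}

# Hypothetical placeholder: build the checked-out tree, remount the client,
# run IOR, and print IOR's output on stdout.
run_benchmark() {
    echo "run_benchmark: not implemented" >&2
    return 1
}

# Only act when called with "run", so the helpers can be sourced on their own.
if [ "${1:-}" = "run" ]; then
    out=$(run_benchmark) || exit 125          # cannot test this commit: skip it
    write=$(printf '%s\n' "$out" | ior_max_mib Write)
    [ -n "$write" ] || exit 125               # no result parsed: skip it
    is_good "$write" "$THRESHOLD_MIB"         # exit status becomes good/bad
fi
```

Saved as, say, bisect-ior.sh (a made-up name), it might be driven with "git bisect start HEAD d7c60130c4bd32c8edd48452a3538953c425a63a" followed by "git bisect run sh bisect-ior.sh run". The 2000 MiB/sec default simply sits between the roughly 1.4GB/s and 2.6GB/s results listed above.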

To double-check, I rebased b2_1 with the LU-969 patch removed and benchmarked again.

# git checkout -b b2_1-lu1413 b2_1
# git rebase -i 1702f156bf61210a937f40ae5b8e9d8832a3e59a

Remove this line from the rebase todo list:
pick b9cbe36 LU-969 debug: reduce stack usage
.
.
".git/rebase-merge/git-rebase-todo" 31L, 1602C written
Successfully rebased and updated refs/heads/b2_1-lu1413.
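
The same removal can also be done without opening an editor. A minimal sketch, assuming the history between the commit and the branch tip is linear; drop_commit is a hypothetical helper name, while git rebase --onto is standard git.

```shell
# drop_commit BRANCH COMMIT
# Replay every commit after COMMIT on BRANCH onto COMMIT's parent,
# which leaves BRANCH without COMMIT. Assumes linear history.
drop_commit() {
    git rebase --onto "$2^" "$2" "$1"
}

# Example call for this ticket (left as a comment so nothing runs by accident):
#   drop_commit b2_1-lu1413 b9cbe3616b6e0b44c7835b1aec65befb85f848f9
```

As with the interactive rebase, conflicts in later commits that touch the same code would still have to be resolved by hand.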

IOR result

Max Write: 2592.10 MiB/sec (2718.01 MB/sec)
Max Read:  3058.55 MiB/sec (3207.13 MB/sec)

We get the same numbers that we saw on the 2.1.2RC0 tag.
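
To put the regression in relative terms, the per-commit figures from the bisect list can be compared directly. This is a small sketch; pct_drop is a hypothetical helper, and the inputs are the GB/s values reported above for 1702f15 (ok) and b9cbe36 (bad).

```shell
# pct_drop GOOD BAD: percentage by which BAD falls short of GOOD.
pct_drop() {
    awk -v g="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (g - b) / g * 100 }'
}

pct_drop 2.6 1.4   # write: prints 46.2
pct_drop 2.8 1.4   # read:  prints 50.0
```

In other words, with LU-969 applied the client loses roughly half of its single-client throughput in this test.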

Comment by Jinshan Xiong (Inactive) [ 20/May/12 ]

Good job, Ihara.

Comment by Peter Jones [ 20/May/12 ]

So this seems to be a duplicate of LU-1408, as it has the same root cause: a performance regression introduced by LU-969.
