[LU-1666] hdr->coh_page_guard contention (single shared file performance) Created: 24/Jul/12 Updated: 13/Apr/14 Resolved: 13/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.1.5 |
| Fix Version/s: | Lustre 2.4.0, Lustre 2.1.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Brian Behlendorf | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | client, performance |
| Environment: | Chaos 5, single client |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Epic: | client |
| Rank (Obsolete): | 4054 |
| Description |
|
Shared file performance with the CLIO code has been observed to be considerably worse than file per process. In particular, during reads it's easy for our clients to become cpu bound. This workload is of particular interest because we expect numerous codes to operate in this mode on Sequoia. Running oprofile clearly shows that the vast majority of cpu time is going to contention on the hdr->coh_page_guard spin lock. Below is a report of all functions consuming >0.1% of the cpu during the read phase.

samples  cum. samples  %        cum. %   app name     symbol name
3815407   3815407      26.3390  26.3390  obdclass.ko  cl_vmpage_page
3671445   7486852      25.3452  51.6842  vmlinux      _atomic_dec_and_lock
1215715   8702567       8.3925  60.0767  obdclass.ko  cl_page_find0
 876063   9578630       6.0478  66.1244  obdclass.ko  cl_object_top
 760239  10338869       5.2482  71.3726  obdclass.ko  cl_page_delete0
 262614  10601483       1.8129  73.1855  vmlinux      copy_user_generic_string
 141309  10742792       0.9755  74.1610  obdclass.ko  cl_page_put
 111365  10854157       0.7688  74.9298  osc.ko       osc_checksum_bulk
 105015  10959172       0.7250  75.6548  lvfs.ko      lprocfs_counter_add
  99702  11058874       0.6883  76.3430  vmlinux      unlock_page
  96783  11155657       0.6681  77.0112  lustre.ko    vvp_page_export
  87458  11243115       0.6038  77.6149  vmlinux      put_page
  82223  11325338       0.5676  78.1825  lvfs.ko      lprocfs_counter_sub
  77318  11402656       0.5338  78.7163  obdclass.ko  cl_object_attr_lock
  77143  11479799       0.5325  79.2488  obdclass.ko  cl_page_list_init
  75322  11555121       0.5200  79.7688  vmlinux      kmem_cache_free
  74627  11629748       0.5152  80.2840  osc.ko       osc_ap_completion

A closer look at the top five consumers shows them all contended on this spin lock.

vma samples cum. samples % cum. % linenr info app name symbol name

0000000000050f30 3815407 3815407 26.3390 26.3390 cl_page.c:723 obdclass.ko cl_vmpage_page
  0000000000050f30 212 212 0.0056 0.0056 cl_page.c:723
  0000000000050f31 197 409 0.0052 0.0107 cl_page.c:723
  0000000000050f34 12 421 3.1e-04 0.0110 cl_page.c:723
  0000000000050f36 184 605 0.0048 0.0159 cl_page.c:723
  0000000000050f38 395 1000 0.0104 0.0262 cl_page.c:723
  0000000000050f39 118 1118 0.0031 0.0293 cl_page.c:723
  0000000000050f3d 164 1282 0.0043 0.0336 cl_page.c:723
  0000000000050f42 27 1309 7.1e-04 0.0343 libcfs_debug.h:198
  0000000000050f49 16 1325 4.2e-04 0.0347 cl_page.c:727
  0000000000050f53 191 1516 0.0050 0.0397 cl_page.c:723
  0000000000050f56 7 1523 1.8e-04 0.0399 cl_page.c:723
  0000000000050f59 6 1529 1.6e-04 0.0401 libcfs_debug.h:198
  0000000000050f68 170 1699 0.0045 0.0445 cl_page.c:728
  0000000000050f6b 128 1827 0.0034 0.0479 cl_page.c:728
  0000000000050f71 13 1840 3.4e-04 0.0482 cl_page.c:740
  0000000000050f74 8 1848 2.1e-04 0.0484 cl_page.c:740
  0000000000050f79 182 2030 0.0048 0.0532 cl_object.h:2626
  0000000000050f7c 343 2373 0.0090 0.0622 cl_page.c:741
  0000000000050f80 48 2421 0.0013 0.0635 cl_page.c:741
  0000000000050f88 3714181 3716602 97.3469 97.4104 cl_page.c:742
  0000000000050f8c 1089 3717691 0.0285 97.4389 cl_page.c:742
  0000000000050f91 233 3717924 0.0061 97.4450 cl_object.h:2637
  0000000000050f95 15918 3733842 0.4172 97.8622 cl_object.h:2626
  0000000000050f99 3835 3737677 0.1005 97.9627 cl_page.c:744
  0000000000050f9c 420 3738097 0.0110 97.9737 cl_page.c:744
  0000000000050f9e 1016 3739113 0.0266 98.0004 cl_page.c:744
  0000000000050fb1 3584 3742697 0.0939 98.0943 cl_page.c:743
  0000000000050fb5 161 3742858 0.0042 98.0985 cl_page.c:742
  0000000000050fba 480 3743338 0.0126 98.1111 spinlock.h:174
  0000000000050fbd 269 3743607 0.0071 98.1182 spinlock.h:174
  0000000000050fc0 48983 3792590 1.2838 99.4020 spinlock.h:174
  0000000000050fc3 2 3792592 5.2e-05 99.4020 cl_page.c:750
  0000000000050fcc 103 3792695 0.0027 99.4047 libcfs_debug.h:198
  0000000000050fd3 84 3792779 0.0022 99.4069 libcfs_debug.h:198
  0000000000050ff7 125 3792904 0.0033 99.4102 cl_page.c:752
  0000000000050ffb 29 3792933 7.6e-04 99.4110 cl_page.c:752
  0000000000050ffc 229 3793162 0.0060 99.4170 cl_page.c:752
  0000000000050ffe 117 3793279 0.0031 99.4200 cl_page.c:752
  0000000000051000 98 3793377 0.0026 99.4226 cl_page.c:752
  0000000000051001 204 3793581 0.0053 99.4280 cl_page.c:752
  0000000000051028 55 3793636 0.0014 99.4294 atomic_64.h:176
  000000000005102d 30 3793666 7.9e-04 99.4302 atomic_64.h:176
  0000000000051031 7819 3801485 0.2049 99.6351 cl_page.c:123
  0000000000051035 6 3801491 1.6e-04 99.6353 cl_page.c:124
  0000000000051039 69 3801560 0.0018 99.6371 cl_page.c:124
  000000000005103d 583 3802143 0.0153 99.6524 cl_page.c:124
  0000000000051041 365 3802508 0.0096 99.6619 atomic_64.h:93
  0000000000051045 11064 3813572 0.2900 99.9519 atomic_64.h:93
  0000000000051086 30 3813602 7.9e-04 99.9527 cl_page.c:750
  000000000005108d 1805 3815407 0.0473 100.000 cl_page.c:750

0000000000052420 1215715 8702567 8.3925 60.0767 cl_page.c:402 obdclass.ko cl_page_find0
  0000000000052420 95 95 0.0078 0.0078 cl_page.c:402
  0000000000052421 55 150 0.0045 0.0123 cl_page.c:402
  0000000000052424 40 190 0.0033 0.0156 cl_page.c:402
  000000000005242b 85 275 0.0070 0.0226 cl_page.c:402
  000000000005242f 86 361 0.0071 0.0297 cl_page.c:402
  0000000000052433 55 416 0.0045 0.0342 cl_page.c:402
  0000000000052437 4 420 3.3e-04 0.0345 cl_page.c:402
  000000000005243b 73 493 0.0060 0.0406 cl_page.c:402
  000000000005243f 58 551 0.0048 0.0453 cl_page.c:402
  0000000000052444 42 593 0.0035 0.0488 cl_page.c:402
  0000000000052448 5 598 4.1e-04 0.0492 cl_page.c:402
  000000000005244c 124 722 0.0102 0.0594 cl_page.c:402
  000000000005244f 49 771 0.0040 0.0634 cl_page.c:402
  0000000000052453 75 846 0.0062 0.0696 cl_object.h:2631
  0000000000052457 8013 8859 0.6591 0.7287 cl_page.c:402
  000000000005245a 60 8919 0.0049 0.7336 cl_page.c:402
  000000000005245d 32 8951 0.0026 0.7363 cl_object.h:2631
  0000000000052461 5510 14461 0.4532 1.1895 cl_page.c:409
  0000000000052465 3 14464 2.5e-04 1.1898 cl_page.c:409
  000000000005246e 35 14499 0.0029 1.1926 cl_page.c:410
  0000000000052473 119 14618 0.0098 1.2024 libcfs_debug.h:198
  000000000005247a 41 14659 0.0034 1.2058 cl_page.c:412
  0000000000052484 155 14814 0.0127 1.2185 libcfs_debug.h:198
  00000000000524a8 648 15462 0.0533 1.2718 cl_object.h:2626
  00000000000524ac 82 15544 0.0067 1.2786 atomic_64.h:93
  00000000000524b0 83372 98916 6.8579 8.1364 libcfs_debug.h:198
  00000000000524b8 3 98919 2.5e-04 8.1367 cl_page.c:417
  00000000000524c2 191 99110 0.0157 8.1524 libcfs_debug.h:198
  0000000000052512 86 99196 0.0071 8.1595 cl_object.h:2631
  0000000000052517 2425 101621 0.1995 8.3589 libcfs_debug.h:198
  000000000005251e 42 101663 0.0035 8.3624 cl_object.h:2631
  0000000000052522 767 102430 0.0631 8.4255 cl_page.c:339
  0000000000052530 154 102584 0.0127 8.4382 libcfs_debug.h:198
  0000000000052536 13 102597 0.0011 8.4392 thread_info.h:216
  000000000005253f 700 103297 0.0576 8.4968 cl_page.c:341
  0000000000052546 363 103660 0.0299 8.5267 cl_page.c:341
  0000000000052551 106 103766 0.0087 8.5354 cl_page.c:341
  0000000000052558 32 103798 0.0026 8.5380 cl_page.c:341
  0000000000052562 85 103883 0.0070 8.5450 cl_page.c:341
  0000000000052565 5 103888 4.1e-04 8.5454 cl_page.c:341
  0000000000052568 51 103939 0.0042 8.5496 cl_page.c:341
  000000000005256e 1 103940 8.2e-05 8.5497 libcfs_fail.h:81
  0000000000052575 52 103992 0.0043 8.5540 libcfs_fail.h:81
  0000000000052580 56 104048 0.0046 8.5586 cl_page.c:341
  0000000000052585 1 104049 8.2e-05 8.5587 cl_page.c:341
  0000000000052588 55600 159649 4.5734 13.1321 cl_page.c:341
  000000000005258a 197 159846 0.0162 13.1483 cl_page.c:341
  0000000000052591 6 159852 4.9e-04 13.1488 cl_page.c:341
  0000000000052598 30 159882 0.0025 13.1513 cl_page.c:341
  000000000005259d 96 159978 0.0079 13.1592 libcfs_debug.h:198
  00000000000525a4 2 159980 1.6e-04 13.1593 cl_page.c:341
  00000000000525ae 103 160083 0.0085 13.1678 libcfs_debug.h:198
  00000000000525b7 1 160084 8.2e-05 13.1679 cl_page.c:344
  00000000000525bb 21 160105 0.0017 13.1696 atomic_64.h:35
  00000000000525c2 869 160974 0.0715 13.2411 cl_page.c:345
  00000000000525c7 79 161053 0.0065 13.2476 cl_page.c:346
  00000000000525d2 56 161109 0.0046 13.2522 cl_page.c:348
  00000000000525d6 46 161155 0.0038 13.2560 cl_page.c:354
  00000000000525da 19 161174 0.0016 13.2575 cl_page.c:354
  00000000000525e1 86 161260 0.0071 13.2646 cl_page.c:326
  00000000000525e9 22 161282 0.0018 13.2664 cl_page.c:350
  00000000000525f0 19 161301 0.0016 13.2680 cl_page.c:354
  00000000000525f7 23 161324 0.0019 13.2699 cl_page.c:348
  00000000000525fb 82 161406 0.0067 13.2766 cl_page.c:351
  00000000000525ff 1 161407 8.2e-05 13.2767 list.h:30
  0000000000052603 18 161425 0.0015 13.2782 list.h:31
  0000000000052607 26 161451 0.0021 13.2803 cl_page.c:352
  000000000005260b 70 161521 0.0058 13.2861 list.h:30
  000000000005260f 16 161537 0.0013 13.2874 list.h:31
  0000000000052613 20 161557 0.0016 13.2891 cl_page.c:353
  0000000000052617 12 161569 9.9e-04 13.2900 list.h:30
  000000000005261b 89 161658 0.0073 13.2974 list.h:31
  000000000005261f 26 161684 0.0021 13.2995 cl_page.c:354
  0000000000052624 21 161705 0.0017 13.3012 cl_page.c:356
  0000000000052628 1605 163310 0.1320 13.4332 cl_page.c:357
  000000000005262c 30095 193405 2.4755 15.9087 cl_page.c:357
  0000000000052630 48 193453 0.0039 15.9127 cl_page.c:357
  0000000000052633 60 193513 0.0049 15.9176 cl_page.c:357
  0000000000052637 4 193517 3.3e-04 15.9180 cl_page.c:357
  0000000000052639 43 193560 0.0035 15.9215 cl_page.c:357
  000000000005263d 51 193611 0.0042 15.9257 cl_page.c:357
  0000000000052641 9 193620 7.4e-04 15.9264 cl_page.c:357
  0000000000052645 4 193624 3.3e-04 15.9268 cl_page.c:357
  0000000000052648 44 193668 0.0036 15.9304 cl_page.c:357
  0000000000052650 142 193810 0.0117 15.9421 cl_page.c:359
  0000000000052655 6356 200166 0.5228 16.4649 cl_page.c:359
  0000000000052658 568 200734 0.0467 16.5116 cl_page.c:359
  000000000005265d 124 200858 0.0102 16.5218 cl_page.c:360
  0000000000052660 63 200921 0.0052 16.5270 cl_page.c:360
  0000000000052663 3 200924 2.5e-04 16.5272 cl_page.c:360
  0000000000052666 24 200948 0.0020 16.5292 cl_page.c:360
  0000000000052669 88 201036 0.0072 16.5364 cl_page.c:360
  000000000005266b 47 201083 0.0039 16.5403 cl_page.c:362
  0000000000052674 103 201186 0.0085 16.5488 cl_page.c:357
  0000000000052679 4520 205706 0.3718 16.9206 cl_page.c:357
  000000000005267c 103 205809 0.0085 16.9290 cl_page.c:357
  0000000000052680 3 205812 2.5e-04 16.9293 cl_page.c:357
  0000000000052682 81 205893 0.0067 16.9360 cl_page.c:357
  0000000000052686 80 205973 0.0066 16.9425 cl_page.c:357
  000000000005268a 19 205992 0.0016 16.9441 atomic_64.h:93
  000000000005268e 11 206003 9.0e-04 16.9450 atomic_64.h:93
  0000000000052692 25935 231938 2.1333 19.0783 atomic_64.h:93
  0000000000052696 5512 237450 0.4534 19.5317 atomic_64.h:93
  000000000005269a 5961 243411 0.4903 20.0220 atomic_64.h:93
  00000000000526a6 87 243498 0.0072 20.0292 libcfs_debug.h:198
  0000000000052718 191 243689 0.0157 20.0449 cl_page.c:430
  000000000005271b 9 243698 7.4e-04 20.0457 cl_page.c:430
  0000000000052723 2761 246459 0.2271 20.2728 cl_page.c:438
  0000000000052726 1 246460 8.2e-05 20.2728 cl_page.c:430
  0000000000052729 17 246477 0.0014 20.2742 cl_page.c:438
  000000000005272f 59 246536 0.0049 20.2791 atomic_64.h:93
  0000000000052733 44355 290891 3.6485 23.9276 libcfs_debug.h:198
  000000000005273a 2 290893 1.6e-04 23.9277 libcfs_debug.h:198
  0000000000052761 50 290943 0.0041 23.9318 cl_page.c:497
  0000000000052764 53 290996 0.0044 23.9362 cl_page.c:497
  0000000000052768 33 291029 0.0027 23.9389 cl_page.c:497
  000000000005276c 80 291109 0.0066 23.9455 cl_page.c:497
  0000000000052770 58 291167 0.0048 23.9503 cl_page.c:497
  0000000000052774 77 291244 0.0063 23.9566 cl_page.c:497
  0000000000052778 45 291289 0.0037 23.9603 cl_page.c:497
  0000000000052779 155 291444 0.0127 23.9731 cl_page.c:497
  00000000000527b8 128 291572 0.0105 23.9836 cl_page.c:448
  0000000000052850 105 291677 0.0086 23.9922 cl_page.c:461
  0000000000052854 2 291679 1.6e-04 23.9924 cl_page.c:461
  0000000000052858 1 291680 8.2e-05 23.9925 cl_page.c:461
  000000000005285f 96 291776 0.0079 24.0004 cl_page.c:461
  0000000000052864 901444 1193220 74.1493 98.1496 cl_page.c:462
  0000000000052868 106 1193326 0.0087 98.1584 cl_page.c:462
  000000000005286c 4 1193330 3.3e-04 98.1587 cl_page.c:462
  0000000000052870 31 1193361 0.0025 98.1612 cl_page.c:462
  0000000000052874 60 1193421 0.0049 98.1662 cl_page.c:462
  000000000005287c 81 1193502 0.0067 98.1728 cl_page.c:463
  000000000005287e 2 1193504 1.6e-04 98.1730 cl_page.c:463
  0000000000052882 42 1193546 0.0035 98.1765 cl_page.c:463
  0000000000052888 58 1193604 0.0048 98.1812 cl_page.c:482
  000000000005288d 111 1193715 0.0091 98.1904 cl_page.c:482
  000000000005288f 16 1193731 0.0013 98.1917 cl_page.c:483
  0000000000052894 302 1194033 0.0248 98.2165 cl_page.c:483
  000000000005289a 32 1194065 0.0026 98.2192 cl_page.c:484
  000000000005289e 136 1194201 0.0112 98.2303 cl_page.c:484
  00000000000528a2 105 1194306 0.0086 98.2390 cl_page.c:485
  00000000000528a6 270 1194576 0.0222 98.2612 cl_page.c:487
  00000000000528aa 83 1194659 0.0068 98.2680 cl_page.c:487
  00000000000528ad 25 1194684 0.0021 98.2701 cl_page.c:487
  00000000000528b0 29 1194713 0.0024 98.2725 cl_page.c:487
  00000000000528b5 13876 1208589 1.1414 99.4138 spinlock.h:174
  00000000000528b8 14 1208603 0.0012 99.4150 spinlock.h:174
  00000000000528bb 6973 1215576 0.5736 99.9886 spinlock.h:174
  00000000000528be 18 1215594 0.0015 99.9900 cl_page.c:491
  00000000000528c7 18 1215612 0.0015 99.9915 libcfs_debug.h:198
  00000000000528ce 103 1215715 0.0085 100.000 libcfs_debug.h:198

000000000004d840 876063 9578630 6.0478 66.1244 cl_object.c:167 obdclass.ko cl_object_top
  000000000004d840 6463 6463 0.7377 0.7377 cl_object.c:167
  000000000004d841 9476 15939 1.0817 1.8194 cl_object.c:167
  000000000004d844 438 16377 0.0500 1.8694 cl_object.c:167
  000000000004d845 569 16946 0.0649 1.9343 cl_object.c:167
  000000000004d849 185 17131 0.0211 1.9555 cl_object.c:167
  000000000004d84e 276 17407 0.0315 1.9870 cl_object.h:2626
  000000000004d851 17130 34537 1.9553 3.9423 cl_object.h:2626
  000000000004d858 97 34634 0.0111 3.9534 cl_object.c:171
  000000000004d85b 254 34888 0.0290 3.9824 cl_object.c:171
  000000000004d862 424776 459664 48.4869 52.4693 cl_object.c:171
  000000000004d867 693 460357 0.0791 52.5484 list.h:186
  000000000004d86b 368051 828408 42.0119 94.5603 lu_object.h:748
  000000000004d86f 10 828418 0.0011 94.5615 lu_object.h:748
  000000000004d878 662 829080 0.0756 94.6370 libcfs.h:320
  000000000004d880 616 829696 0.0703 94.7073 libcfs.h:320
  000000000004d889 539 830235 0.0615 94.7689 libcfs.h:323
  000000000004d88d 3 830238 3.4e-04 94.7692 cl_object.h:2603
  000000000004d890 29 830267 0.0033 94.7725 lu_object.h:749
  000000000004d893 391 830658 0.0446 94.8172 cl_object.h:2603
  000000000004d895 630 831288 0.0719 94.8891 libcfs_debug.h:198
  000000000004d89c 159 831447 0.0181 94.9072 cl_object.c:175
  000000000004d8a6 614 832061 0.0701 94.9773 libcfs_debug.h:198
  000000000004d8cc 11 832072 0.0013 94.9786 cl_object.c:177
  000000000004d8cf 156 832228 0.0178 94.9964 cl_object.c:177
  000000000004d8d3 8 832236 9.1e-04 94.9973 cl_object.c:177
  000000000004d8d4 662 832898 0.0756 95.0728 cl_object.c:177
  000000000004d8d5 221 833119 0.0252 95.0981 cl_object.c:177
  000000000004d8d6 584 833703 0.0667 95.1647 cl_object.h:2603
  000000000004d8de 517 834220 0.0590 95.2237 cl_object.h:2587
  000000000004d8e2 29587 863807 3.3773 98.6010 cl_object.h:2587
  000000000004d8e6 9113 872920 1.0402 99.6412 cl_object.h:2603
  000000000004d8e9 3143 876063 0.3588 100.000 cl_object.h:2603

00000000000510d0 760239 10338869 5.2482 71.3726 cl_page.c:1126 obdclass.ko cl_page_delete0
  00000000000510d0 12 12 0.0016 0.0016 cl_page.c:1126
  00000000000510d1 38 50 0.0050 0.0066 cl_page.c:1126
  00000000000510d4 9 59 0.0012 0.0078 cl_page.c:1126
  00000000000510d6 14 73 0.0018 0.0096 cl_page.c:1126
  00000000000510d8 42 115 0.0055 0.0151 cl_page.c:1126
  00000000000510da 6 121 7.9e-04 0.0159 cl_page.c:1126
  00000000000510dc 6 127 7.9e-04 0.0167 cl_page.c:1126
  00000000000510dd 4 131 5.3e-04 0.0172 cl_page.c:1126
  00000000000510e1 26 157 0.0034 0.0207 cl_page.c:1126
  00000000000510e6 4 161 5.3e-04 0.0212 libcfs_debug.h:198
  00000000000510ed 32 193 0.0042 0.0254 cl_page.c:1128
  00000000000510f7 4 197 5.3e-04 0.0259 cl_page.c:1126
  00000000000510fa 8 205 0.0011 0.0270 cl_page.c:1126
  00000000000510fd 28 233 0.0037 0.0306 cl_page.c:1126
  0000000000051100 6 239 7.9e-04 0.0314 libcfs_debug.h:198
  000000000005110f 5 244 6.6e-04 0.0321 libcfs_debug.h:198
  0000000000051112 3 247 3.9e-04 0.0325 libcfs_debug.h:198
  000000000005111b 25 272 0.0033 0.0358 cl_page.c:102
  0000000000051120 2 274 2.6e-04 0.0360 cl_page.c:102
  0000000000051125 2 276 2.6e-04 0.0363 cl_page.c:1130
  000000000005112e 2 278 2.6e-04 0.0366 cl_page.c:1131
  0000000000051133 20 298 0.0026 0.0392 cl_page.c:1131
  0000000000051139 4 302 5.3e-04 0.0397 cl_page.c:1136
  000000000005113c 2 304 2.6e-04 0.0400 cl_page.c:1136
  0000000000051141 22 326 0.0029 0.0429 cl_page.c:1146
  0000000000051143 14 340 0.0018 0.0447 cl_page.c:1146
  0000000000051146 1 341 1.3e-04 0.0449 cl_page.c:1146
  0000000000051149 7 348 9.2e-04 0.0458 cl_page.c:1146
  000000000005114e 1 349 1.3e-04 0.0459 cl_page.c:1147
  0000000000051159 28 377 0.0037 0.0496 cl_page.c:1147
  0000000000051167 98 475 0.0129 0.0625 cl_page.c:1149
  0000000000051169 3 478 3.9e-04 0.0629 cl_page.c:1149
  0000000000051173 32 510 0.0042 0.0671 cl_page.c:102
  0000000000051178 167 677 0.0220 0.0891 cl_page.c:102
  000000000005117d 21 698 0.0028 0.0918 cl_page.c:102
  0000000000051180 5 703 6.6e-04 0.0925 cl_page.c:1170
  0000000000051185 117 820 0.0154 0.1079 cl_page.c:1170
  0000000000051189 24 844 0.0032 0.1110 cl_page.c:1170
  000000000005118e 62 906 0.0082 0.1192 cl_page.c:1170
  0000000000051190 38 944 0.0050 0.1242 cl_page.c:1170
  0000000000051194 855 1799 0.1125 0.2366 cl_page.c:1170
  0000000000051198 192 1991 0.0253 0.2619 cl_page.c:1170
  000000000005119d 31 2022 0.0041 0.2660 cl_page.c:1170
  00000000000511a0 41 2063 0.0054 0.2714 cl_page.c:1170
  00000000000511a3 4 2067 5.3e-04 0.2719 cl_page.c:1170
  00000000000511a5 159 2226 0.0209 0.2928 cl_page.c:1170
  00000000000511a9 143 2369 0.0188 0.3116 cl_page.c:1170
  00000000000511ad 44 2413 0.0058 0.3174 cl_page.c:1170
  00000000000511b2 6 2419 7.9e-04 0.3182 cl_page.c:1170
  00000000000511b7 152 2571 0.0200 0.3382 cl_page.c:1170
  00000000000511bc 24 2595 0.0032 0.3413 libcfs_debug.h:198
  00000000000511c3 3 2598 3.9e-04 0.3417 cl_page.c:1172
  00000000000511cd 28 2626 0.0037 0.3454 libcfs_debug.h:198
  00000000000511f1 7 2633 9.2e-04 0.3463 cl_page.c:1173
  00000000000511f2 16 2649 0.0021 0.3484 cl_page.c:1173
  00000000000511f4 4 2653 5.3e-04 0.3490 cl_page.c:1173
  00000000000511f6 7 2660 9.2e-04 0.3499 cl_page.c:1173
  00000000000511f8 11 2671 0.0014 0.3513 cl_page.c:1173
  00000000000511fa 21 2692 0.0028 0.3541 cl_page.c:1173
  00000000000511fb 4 2696 5.3e-04 0.3546 cl_page.c:1173
  0000000000051216 2 2698 2.6e-04 0.3549 cl_page.c:1150
  0000000000051219 4 2702 5.3e-04 0.3554 cl_page.c:1150
  000000000005121c 39 2741 0.0051 0.3605 cl_page.c:1150
  0000000000051228 82 2823 0.0108 0.3713 cl_page.c:1164
  000000000005122c 1095 3918 0.1440 0.5154 cl_page.c:1164
  0000000000051235 23 3941 0.0030 0.5184 cl_page.c:1165
  000000000005123d 120 4061 0.0158 0.5342 spinlock.h:174
  0000000000051243 2020 6081 0.2657 0.7999 spinlock.h:174
  000000000005124a 717 6798 0.0943 0.8942 cl_page.c:1155
  0000000000051253 19 6817 0.0025 0.8967 cl_object.h:2626
  0000000000051257 194 7011 0.0255 0.9222 cl_object.h:2626
  000000000005125a 2233 9244 0.2937 1.2159 cl_page.c:1160
  000000000005125e 44 9288 0.0058 1.2217 cl_page.c:1160
  0000000000051261 36 9324 0.0047 1.2265 cl_page.c:1160
  0000000000051266 750688 760012 98.7437 99.9701 cl_page.c:1161
  000000000005126a 120 760132 0.0158 99.9859 cl_page.c:1161
  0000000000051273 107 760239 0.0141 100.000 cl_page.c:1163

We can start by trying the easy stuff to improve performance here. For example, as suggested in the cl_object_header structure definition, aligning the spin locks on cache lines should help. This may be particularly true on Sequoia's PowerPC arch, where we have 68 slow cores and relatively long latencies to memory. That said, I don't see much contention in the oprofile results for the other locks in the cache line. If that fails, we're going to need to look at optimizing what's done under the spin locks. At first glance I see several while and for loops, which is always concerning. |
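To make the cache-line idea concrete, here is a minimal C sketch (illustrative only, not the actual Lustre change; the struct is abbreviated from the real cl_object_header, and ____cacheline_aligned is the stock kernel macro):

#include <linux/cache.h>
#include <linux/spinlock.h>
#include <linux/radix-tree.h>

/* Pad each hot lock onto its own cache line so waiters spinning on
 * coh_page_guard do not also bounce the line holding coh_lock_guard
 * or the radix tree root. */
struct cl_object_header_sketch {
        spinlock_t             coh_page_guard ____cacheline_aligned;
        struct radix_tree_root coh_tree;      /* guarded by coh_page_guard */
        spinlock_t             coh_lock_guard ____cacheline_aligned;
        /* ... remaining cl_object_header fields unchanged ... */
};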
| Comments |
| Comment by Brian Behlendorf [ 24/Jul/12 ] |
|
This problem can be easily reproduced with a single client and the following fio script (run as "fio ssf.fio"):

; ssf.fio, single shared file
[global]
directory=/p/lstest/behlendo/fio/
filename=ssf
size=256g
blocksize=1m
direct=0
ioengine=sync
numjobs=128
group_reporting=1
fallocate=none
runtime=60
stonewall

[write]
rw=randwrite:256
rw_sequencer=sequential
fsync_on_close=1

[read]
rw=randread:256
rw_sequencer=sequential
|
| Comment by Brian Behlendorf [ 24/Jul/12 ] |
|
For the record, these results include the following patch. It looks like you guys are already well aware that this lock is heavily contended. If there is other work currently underway to address this, could you please point me to it?

http://review.whamcloud.com/#change,911 (Patch set #16)

Finally, I've also verified that I'm able to create significant contention on this lock (100% on 16 cores) even in the file-per-process case when there are enough threads performing I/O. So resolving the lock contention here should improve a variety of client workloads. |
| Comment by Peter Jones [ 24/Jul/12 ] |
|
Jinshan, could you please comment on this one? Thanks. Peter |
| Comment by Jinshan Xiong (Inactive) [ 26/Jul/12 ] |
|
Right now the spinlock in the object header protects both individual pages and the radix tree; we can split it by functionality to reduce contention. However, the lock protecting the radix tree will still be under heavy contention. Distributing these locks onto different cache lines won't help here, because you're writing the same file, so it is the same lock that is being contended. |
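A minimal sketch of that split (the coh_tree_guard name is hypothetical; the eventual patch may carve things up differently):

/* Sketch only: one lock for the per-page cp_child chains, a second for
 * the radix tree, so cl_vmpage_page() lookups no longer serialize
 * against radix tree insertions and deletions in cl_page_find0(). */
struct cl_object_header_sketch {
        spinlock_t             coh_page_guard; /* per-page state, cp_child chains */
        spinlock_t             coh_tree_guard; /* hypothetical: coh_tree only */
        struct radix_tree_root coh_tree;
};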
| Comment by Brian Behlendorf [ 26/Jul/12 ] |
|
One way or another we're going to have to reduce the contention on this lock or we'll never see good client performance numbers. To quantify this we ran some simple single-client fio benchmarks for a single shared file (SSF) and a file per process (FPP) workload. We did this for both 16-core x86_64 clients (Grove) and 68-core PowerPC clients (Sequoia). I've attached both fio scripts, which describe the exact test cases. Results are in MB/s.

Sequoia  | FPP    | SSF    |
---------+--------+--------+
Write    |   1329 |    169 |
---------+--------+--------+
Read     |    778 |   1474 |
---------+--------+--------+

Grove    | FPP    | SSF    |
---------+--------+--------+
Write    |   1697 |    549 |
---------+--------+--------+
Read     |   1643 |   1039 |
---------+--------+--------+

These results are very interesting. They suggest that this spin lock is currently the limiting factor for both FPP I/O and SSF reads. Also notice that the effect is more severe on the PowerPC clients, which have ~4x the cores (68), run at a lower clock rate, and have significantly slower access to memory. Splitting the lock sounds like a good place to start. That's straightforward enough, and we can test what sort of dividends it pays. But even a quick glance at the code and the profiling results shows other areas which can be improved. For example, check out the cl_vmpage_page() function in the profiling results above. A total of 26.3% of the cpu cycles on the node went to this function. 97% of that time was spent waiting to acquire the spin lock, but more interestingly, 3% was spent in this for loop covered by the lock. This is absolutely something which can be optimized.

hdr = cl_object_header(cl_object_top(obj));
cfs_spin_lock(&hdr->coh_page_guard);
for (page = (void *)vmpage->private;
     page != NULL; page = page->cp_child) {
        if (cl_object_same(page->cp_obj, obj)) {
                cl_page_get_trust(page);
                break;
        }
}
cfs_spin_unlock(&hdr->coh_page_guard);
|
| Comment by Brian Behlendorf [ 26/Jul/12 ] |
|
Fio workload scripts |
| Comment by Brian Behlendorf [ 30/Jul/12 ] |
|
In conjunction with this issue we'll be working with IBM to get ticket spinlocks implemented in the PowerPC kernel. This will improve fairness between the lock waiters, ensuring none of them is starved for excessively long, which is entirely possible right now. With so many cores it seems entirely plausible to me that one core might be starved for several minutes. |
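For reference, the mechanism being requested, as a minimal C11 sketch (illustrative only; a real kernel ticket lock is arch-specific and would use cpu_relax() in the spin loop):

#include <stdatomic.h>

/* Each waiter atomically takes the next ticket, then spins until the
 * "serving" counter reaches it, so the lock is granted in strict FIFO
 * order and no CPU can be starved indefinitely. Zero-initialize. */
struct ticket_lock {
        atomic_uint next;       /* next ticket to hand out */
        atomic_uint serving;    /* ticket currently allowed in */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
        unsigned int me = atomic_fetch_add(&l->next, 1);
        while (atomic_load(&l->serving) != me)
                ;               /* spin until our ticket comes up */
}

static void ticket_lock_release(struct ticket_lock *l)
{
        atomic_fetch_add(&l->serving, 1);
}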
| Comment by Prakash Surya (Inactive) [ 01/Aug/12 ] |
|
We've since ruled out the need for ticket spinlocks. A simple test was constructed which did not show any one CPU being unfairly starved: each CPU was able to acquire a global shared lock a roughly equal number of times, and with roughly equal wait times. That result, in conjunction with the fact that the node has only a single NUMA zone, suggests that ticket spinlocks would not be as great a benefit as we had hoped. |
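A sketch of that kind of fairness probe (hypothetical, not the actual test; CPU pinning via pthread_setaffinity_np is omitted for brevity): threads hammer one shared lock for a fixed interval, and roughly equal per-thread acquisition counts indicate no starvation.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 16

static pthread_spinlock_t lock;
static volatile int stop;
static long counts[NTHREADS];

static void *worker(void *arg)
{
        long id = (intptr_t)arg;
        while (!stop) {
                pthread_spin_lock(&lock);
                counts[id]++;           /* stand-in for the critical section */
                pthread_spin_unlock(&lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t t[NTHREADS];

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
        sleep(10);
        stop = 1;
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);
        for (int i = 0; i < NTHREADS; i++)
                printf("thread %2d: %ld acquisitions\n", i, counts[i]);
        return 0;
}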
| Comment by Ned Bass [ 08/Aug/12 ] |
|
As a reference for those of us trying to understand the implicated code, here is a call trace for cl_vmpage_page().

0xffffffffa04d4190 : cl_vmpage_page+0x0/0x1a0 [obdclass]
0xffffffffa04d5983 : cl_page_find0+0x303/0x730 [obdclass]
0xffffffffa04d5de1 : cl_page_find+0x11/0x20 [obdclass]
0xffffffffa0b543dd : ll_readahead+0xa2d/0xec0 [lustre]
0xffffffffa0b78895 : vvp_io_read_page+0x295/0x2e0 [lustre]
0xffffffffa04de145 : cl_io_read_page+0x95/0x130 [obdclass]
0xffffffffa0b533f0 : ll_readpage+0xa0/0x1d0 [lustre]
0xffffffff8111262c : generic_file_aio_read+0x1fc/0x700 [kernel]
0xffffffffa0b78ec6 : vvp_io_read_start+0x146/0x350 [lustre]
0xffffffffa04dc512 : cl_io_start+0x72/0xf0 [obdclass]
0xffffffffa04dfe24 : cl_io_loop+0xb4/0x160 [obdclass]
0xffffffffa0b32a77 : ll_file_io_generic+0x337/0x4f0 [lustre]
0xffffffffa0b33919 : ll_file_aio_read+0x149/0x290 [lustre]
0xffffffffa0b33bd3 : ll_file_read+0x173/0x270 [lustre]
0xffffffff811782d5 : vfs_read+0xb5/0x1a0 [kernel]
0xffffffff81178411 : sys_read+0x51/0x90 [kernel]
0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
|
| Comment by Ned Bass [ 10/Aug/12 ] |
|
Here's some data from /proc/lock_stat for the fio single-shared-file read workload. I believe the _atomic_dec_and_lock is from cl_page_put(). cl_page_find0+0x464 is where the lock is taken for a radix_tree_insert().

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
&cl_page_guard_class: 40974077 41330086 1.72 5319.93 622070050.62 45807971 48525879 1.16 583.13 14269160.92
--------------------
&cl_page_guard_class 15772469 [<ffffffff8128fb35>] _atomic_dec_and_lock+0x55/0x80
&cl_page_guard_class 20602549 [<ffffffffa04ae508>] cl_vmpage_page+0x58/0x1a0 [obdclass]
&cl_page_guard_class 300 [<ffffffffa04ae1fb>] cl_page_delete0+0x19b/0x3b0 [obdclass]
&cl_page_guard_class 4954768 [<ffffffffa04af9c4>] cl_page_find0+0x464/0x750 [obdclass]
--------------------
&cl_page_guard_class 15656563 [<ffffffff8128fb35>] _atomic_dec_and_lock+0x55/0x80
&cl_page_guard_class 20518390 [<ffffffffa04ae508>] cl_vmpage_page+0x58/0x1a0 [obdclass]
&cl_page_guard_class 277 [<ffffffffa04ae1fb>] cl_page_delete0+0x19b/0x3b0 [obdclass]
&cl_page_guard_class 5154856 [<ffffffffa04af9c4>] cl_page_find0+0x464/0x750 [obdclass]
|
| Comment by Ned Bass [ 10/Aug/12 ] |
|
Hi Jinshan, I tried this experimental patch to avoid taking coh_page_guard in cl_vmpage_page():

diff --git a/lustre/obdclass/cl_page.c b/lustre/obdclass/cl_page.c
index 0639499..027075e 100644
--- a/lustre/obdclass/cl_page.c
+++ b/lustre/obdclass/cl_page.c
@@ -736,6 +736,15 @@ struct cl_page *cl_vmpage_page(cfs_page_t *vmpage, struct cl_object *obj)
* This loop assumes that ->private points to the top-most page. This
* can be rectified easily.
*/
+
+ page = (void *)vmpage->private;
+ if (page == NULL)
+ RETURN(page);
+ if (cl_object_same(page->cp_obj, obj)) {
+ cl_page_get_trust(page);
+ RETURN(page);
+ }
+
hdr = cl_object_header(cl_object_top(obj));
cfs_spin_lock(&hdr->coh_page_guard);
for (page = (void *)vmpage->private;
I found that it reduced the contentions/acquisitions ratio for coh_page_guard by about 10%, but increased read throughput on the ssf.fio test on Grove by about 25%. However, the node is still CPU-bound during the test due to contention on this lock. My assumption here is that we only need the lock to ensure consistency of the page->cp_child linked list, and that the cl_page we're looking for is very often associated with the top-most object, in which case we don't need to traverse the list. Anyway, these results are a good indication that we can reap some big performance gains if we can find ways to take this lock less often. |
| Comment by Jinshan Xiong (Inactive) [ 13/Aug/12 ] |
|
Hi Ned, your patch seems to work for me. BTW, can you please try not grabbing coh_page_guard at all and see how it goes? It may be safe to do it this way because we hold the vmpage lock when calling cl_vmpage_page(), so the cl_page list can't go away; the cl_object of the page can't be destroyed either, since a page already exists. |
| Comment by Ned Bass [ 13/Aug/12 ] |
|
Never taking coh_page_guard in cl_vmpage_page() seems to work well. However, when I instrumented the code to count iterations in the for loop, it always performed only one iteration. I think for this workload it never follows the cl_page pointer, so we'll need more testing or careful reasoning to say for sure that this is safe. Here are some results (MB/s) for the ssf.fio test above on Grove with and without taking the lock. Reads get a nice bump from this optimization.

         | SSF    | SSF      |
Grove    | locked | unlocked |
---------+--------+----------+
Write    |    836 |      719 |
---------+--------+----------+
Read     |   1232 |     1698 |
---------+--------+----------+
The point of contention now shifts to the _atomic_dec_and_lock() call in cl_page_put(). I don't understand why the lock would be contended here, since it only takes the lock if the refcount decrements to zero. In that case, who else would be holding the lock? Here is oprofile output and lock_stat data for the read phase of ssf.fio when we don't take coh_page_guard:

CPU: Intel Sandy Bridge microarchitecture, speed 2601 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
vma              samples  cum. samples  %        cum. %   linenr info               app name     symbol name
ffffffff8126c5e0 327867   327867        28.1153  28.1153  dec_and_lock.c:21         vmlinux      _atomic_dec_and_lock
00000000000524b0 92029    419896         7.8917  36.0070  cl_page.c:402             obdclass.ko  cl_page_find0
000000000004d900 82820    502716         7.1020  43.1090  cl_object.c:167           obdclass.ko  cl_object_top
0000000000051160 80924    583640         6.9394  50.0484  cl_page.c:1123            obdclass.ko  cl_page_delete0
0000000000000000 59278    642918         5.0832  55.1316  (no location information) lustre       /lustre
ffffffff812778c0 23310    666228         1.9989  57.1305  copy_user_64.S:240        vmlinux      copy_user_generic_string
00000000000520b0 20577    686805         1.7645  58.8950  cl_page.c:652             obdclass.ko  cl_page_put

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
&cl_page_guard_class: 31974716 31992762 1.42 3949.95 161714748.58 55017484 62473133 1.39 1542.40 20544116.54
--------------------
&cl_page_guard_class 779139 [<ffffffffa04a536b>] cl_page_delete0+0x19b/0x3b0 [obdclass]
&cl_page_guard_class 22460626 [<ffffffff8128fb35>] _atomic_dec_and_lock+0x55/0x80
&cl_page_guard_class 8752997 [<ffffffffa04a6994>] cl_page_find0+0x464/0x750 [obdclass]
--------------------
&cl_page_guard_class 872283 [<ffffffffa04a536b>] cl_page_delete0+0x19b/0x3b0 [obdclass]
&cl_page_guard_class 21478799 [<ffffffff8128fb35>] _atomic_dec_and_lock+0x55/0x80
&cl_page_guard_class 9641680 [<ffffffffa04a6994>] cl_page_find0+0x464/0x750 [obdclass]
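For context on why this call contends, the kernel's generic _atomic_dec_and_lock() is roughly the following (paraphrased from lib/dec_and_lock.c of this era; treat it as a sketch):

int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
{
        /* Subtract 1 from the counter unless that drops it to 0 (i.e. it was 1) */
        if (atomic_add_unless(atomic, -1, 1))
                return 0;

        /* Otherwise do it the slow way: the count may reach zero */
        spin_lock(lock);
        if (atomic_dec_and_test(atomic))
                return 1;       /* caller frees the object, then unlocks */
        spin_unlock(lock);
        return 0;
}

The lock is only taken when the reference count is about to hit zero; as the next comment observes, in this workload nearly every page is used exactly once, so nearly every cl_page_put() lands on the slow path and grabs coh_page_guard.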
|
| Comment by Jinshan Xiong (Inactive) [ 13/Aug/12 ] |
|
Hi Ned, after taking a closer look, I realize this change will revive bug 19537, so I think the correct way to reduce the contention on coh_page_guard is to define a per-page spinlock for this purpose. The reason coh_page_guard sees such high contention is that every page is used only once, so everyone has to grab coh_page_guard in cl_page_put(). I will work out a patch soon. |
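A minimal sketch of that per-page direction (the cp_guard field name is hypothetical, and cfs_atomic_dec_and_lock is assumed to be the libcfs wrapper for atomic_dec_and_lock; http://review.whamcloud.com/3627 is the actual patch):

/* Sketch only: give each cl_page its own guard so the final reference
 * drop serializes per page rather than across the whole object. */
struct cl_page {
        cfs_atomic_t    cp_ref;
        cfs_spinlock_t  cp_guard;       /* hypothetical per-page lock */
        /* ... */
};

/* cl_page_put() then contends only on this page's lock; removing the
 * page from the object's radix tree can still take coh_page_guard,
 * but only on this final-put path. */
if (cfs_atomic_dec_and_lock(&page->cp_ref, &page->cp_guard)) {
        /* last reference: tear down and free the page */
}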
| Comment by Jinshan Xiong (Inactive) [ 13/Aug/12 ] |
|
Hi Ned, can you please try http://review.whamcloud.com/3627 and see if it can help? |
| Comment by Ned Bass [ 14/Aug/12 ] |
|
Initial results are very good! First try with ssf.fio got 2274 MB/s, the best read number I've seen yet. All CPUs are still busy during the read phase, but in the 91-94% range, no longer 100%. Oprofile stats from the read phase:

vma              samples  cum. samples  %       cum. %   linenr info               app name     symbol name
0000000000052500 1533121  1533121       8.2425   8.2425  cl_page.c:401             obdclass.ko  cl_page_find0
0000000000000000 1353993  2887114       7.2794  15.5219  (no location information) lustre       /lustre
ffffffff8126c5e0 1045457  3932571       5.6207  21.1425  dec_and_lock.c:21         vmlinux      _atomic_dec_and_lock
ffffffff812778c0 888524   4821095       4.7769  25.9195  copy_user_64.S:240        vmlinux      copy_user_generic_string
0000000000050ff0 603333   5424428       3.2437  29.1632  cl_page.c:719             obdclass.ko  cl_vmpage_page
|
| Comment by James A Simmons [ 16/Aug/12 ] |
|
I have also been doing some benchmarks with this patch. For single-client performance, using all of the memory, I used the following command line:

IOR -a POSIX -i 5 -C -v -g -w -r -e -b 32G -t 4m -o /lustre/barry/scratch/jsimmons/wc_ir_ior_test_2629.arthur-sys0.ccs.ornl.gov/testfile.out

Without patch - Single Shared file:
With patch - Single Shared file:

With multiple clients (18 nodes) and a file stripe count of 28, I get the following with the command:

IOR -a POSIX -i 5 -C -v -g -w -r -e -b 64m -t 4m -o /lustre/barry/scratc

without patch: Max Write: 2163.72 MiB/sec (2268.83 MB/sec)
with patch: Max Write: 2293.95 MiB/sec (2405.38 MB/sec)

I'm also seeing improvement in the file-per-process case. This patch is a step in the right direction. |
| Comment by Ned Bass [ 16/Aug/12 ] |
|
Hit this assertion with patch http://review.whamcloud.com/3627 running the ssf.fio test on a BGQ IO node.

2012-08-16 15:46:27.296017 {DefaultControlEventListener} [mmcs]{16}.0.2: kernel:LustreError: 5769:0:(osc_page.c:431:osc_page_delete()) ASSERTION( 0 ) failed:
2012-08-16 15:46:27.296967 {DefaultControlEventListener} [mmcs]{16}.0.2: ^GMessage from syslogd@(none) at Aug 16 15:46:27 ...
2012-08-16 15:46:27.297419 {DefaultControlEventListener} [mmcs]{16}.0.2: kernel:LustreError: 5769:0:(osc_page.c:431:osc_page_delete()) LBUG
2012-08-16 15:46:27.297863 {DefaultControlEventListener} [mmcs]{16}.10.3: Kernel panic - not syncing: LBUG
2012-08-16 15:46:27.298321 {DefaultControlEventListener} [mmcs]{16}.10.3: Call Trace:
2012-08-16 15:46:27.299124 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa550] [c000000000008190] .show_stack+0x7c/0x184 (unreliable)
2012-08-16 15:46:27.299627 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa600] [c000000000428ebc] .panic+0x80/0x1a8
2012-08-16 15:46:27.300102 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa690] [8000000002fd12c0] .lbug_with_loc+0xb0/0xc0 [libcfs]
2012-08-16 15:46:27.300671 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa720] [8000000007e7f3cc] .osc_page_delete+0x2cc/0x390 [osc]
2012-08-16 15:46:27.301057 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa7f0] [80000000053fb8c0] .cl_page_delete0+0x150/0x5c0 [obdclass]
2012-08-16 15:46:27.301479 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa8d0] [80000000053fbdc8] .cl_page_delete+0x98/0x1a0 [obdclass]
2012-08-16 15:46:27.301970 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa990] [80000000097a5f74] .ll_invalidatepage+0xc4/0x1b0 [lustre]
2012-08-16 15:46:27.302370 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaaa30] [80000000097bc568] .vvp_page_discard+0xf8/0x1a0 [lustre]
2012-08-16 15:46:27.302768 {DefaultControlEventListener} [mmcs]{16}.0.2: [c00000034aaaaac0] [80000000053f7e84] .cl_page_invoid+0x104/0x220 [obdclass]
2012-08-16 15:46:27.303311 {DefaultControlEventListener} [mmcs]{16}.0.2: [c00000034aaaab90] [8000000007e7c89c] .discard_pagevec+0xdc/0x150 [osc]
2012-08-16 15:46:27.303778 {DefaultControlEventListener} [mmcs]{16}.0.2: [c00000034aaaac50] [8000000007e7cf5c] .osc_lru_shrink+0x64c/0xf90 [osc]
2012-08-16 15:46:27.304146 {DefaultControlEventListener} [mmcs]{16}.0.2: [c00000034aaaadb0] [8000000007e7f7d8] .osc_page_init+0x348/0xcf0 [osc]
2012-08-16 15:46:27.304672 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaaf20] [80000000053fda78] .cl_page_find0+0x2f8/0x990 [obdclass]
2012-08-16 15:46:27.305536 {DefaultControlEventListener} [mmcs]{16}.0.2: ^GMessage from syslogd@(none) at Aug 16 15:46:27 ...
2012-08-16 15:46:27.306031 {DefaultControlEventListener} [mmcs]{16}.0.2: kernel:Kernel panic - not syncing: LBUG
2012-08-16 15:46:27.306421 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab040] [8000000008897730] .lov_page_init_raid0+0x240/0x820 [lov]
2012-08-16 15:46:27.306875 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab150] [8000000008893bd4] .lov_page_init+0x84/0x120 [lov]
2012-08-16 15:46:27.307234 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab1f0] [80000000053fda78] .cl_page_find0+0x2f8/0x990 [obdclass]
2012-08-16 15:46:27.313843 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab310] [8000000009787a4c] .ll_cl_init+0x11c/0x540 [lustre]
2012-08-16 15:46:27.314287 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab400] [800000000978a194] .ll_prepare_write+0xa4/0x240 [lustre]
2012-08-16 15:46:27.314838 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab4d0] [80000000097a635c] .ll_write_begin+0xcc/0x220 [lustre]
2012-08-16 15:46:27.315272 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab5a0] [c00000000009589c] .generic_file_buffered_write+0x140/0x354
2012-08-16 15:46:27.315789 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab6e0] [c00000000009600c] .__generic_file_aio_write+0x374/0x3d8
2012-08-16 15:46:27.316270 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab7e0] [80000000097bf864] .vvp_io_write_start+0x114/0x300 [lustre]
2012-08-16 15:46:27.316783 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab8c0] [8000000005408f3c] .cl_io_start+0xdc/0x1d0 [obdclass]
2012-08-16 15:46:27.317267 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab970] [800000000540f3a4] .cl_io_loop+0x144/0x250 [obdclass]
2012-08-16 15:46:27.317758 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaba30] [800000000975509c] .ll_file_io_generic+0x33c/0x510 [lustre]
2012-08-16 15:46:27.318270 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaabb30] [80000000097556b4] .ll_file_aio_write+0x1e4/0x340 [lustre]
2012-08-16 15:46:27.318726 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaabc00] [8000000009755970] .ll_file_write+0x160/0x2d0 [lustre]
2012-08-16 15:46:27.319202 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaabce0] [c0000000000c8234] .vfs_write+0xd0/0x1c4
2012-08-16 15:46:27.319739 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaabd80] [c0000000000c8424] .SyS_write+0x54/0x98
2012-08-16 15:46:27.320127 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaabe30] [c000000000000580] syscall_exit+0x0/0x2c
|
| Comment by Prakash Surya (Inactive) [ 16/Aug/12 ] |
|
Ned, I saw that in the logs as well. Have you looked at |
| Comment by Ned Bass [ 16/Aug/12 ] |
|
No I hadn't noticed that bug, thanks for the pointer. |
| Comment by Jinshan Xiong (Inactive) [ 16/Aug/12 ] |
|
Hi Ned, have you had any luck grabbing the page debug info, such as |
| Comment by Ned Bass [ 17/Aug/12 ] |
|
Yes, sorry I neglected to include it above.

2012-08-16 15:46:27.275649 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_cache.c:2338:osc_teardown_async_page()) extent c0000003642caf38@{[361184 -> 361199/361199], [3|1|-|active|wi|c00000029bfc0240], [1048576|16|+|-|c000000295f440b0|16|(null)]} trunc at 361195.
2012-08-16 15:46:27.276084 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) page@c000000343b9ed40[2 c0000003c0a2ee98:2889515 ^(null)_c000000343b9ead0 4 0 1 (null) (null) 0x0]
2012-08-16 15:46:27.276522 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) page@c000000343b9ead0[1 c00000036d6e4e68:361195 ^c000000343b9ed40_(null) 4 0 1 (null) (null) 0x0]
2012-08-16 15:46:27.276915 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) vvp-page@c0000003633dce90(0:0:0) vm@c0000000015ddce8 71 2:0 0 2889515 lru
2012-08-16 15:46:27.277362 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) lov-page@c000000349140930
2012-08-16 15:46:27.277779 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) osc-page@c00000038c90d328: 1< 0x845fed 2 0 + - > 2< 23671275520 0 65536 0x0 0x420 | (null) c00000029159a7b0 c00000029bfc0240 > 3< + c0000003e9990740 0 0 0 > 4< 0 0 8 18677760 - | + - + - > 5< + - + - | 0 - | 64 - ->
2012-08-16 15:46:27.278174 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) end page@c000000343b9ed40
2012-08-16 15:46:27.278663 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) Trying to teardown failed: -16
|
| Comment by Jinshan Xiong (Inactive) [ 17/Aug/12 ] |
|
This problem is introduced in |
| Comment by Jinshan Xiong (Inactive) [ 17/Aug/12 ] |
|
Please use patch set 20 of |
| Comment by Gregoire Pichon [ 05/Dec/12 ] |
|
One of our customers (TGCC - lustre 2.1.3) is facing a performance issue while reading the same file from many tasks on one client at the same time. Performance drops as the number of tasks increases. I have reproduced the issue and obtained these results:

      tasks  Mean  Total (MB/s)
dd    1      339   339
dd    2      107   214
dd    4       78   312
dd    8       64   512
dd    16      26   428

Applying the patch http://review.whamcloud.com/#change,3627 on lustre 2.1.3 gives no real improvement.

      tasks  Mean  Total (MB/s)
dd    1      542   542
dd    2      118   236
dd    4       74   298
dd    8       73   591
dd    16      52   833

The profiling shows contention in the CLIO page routines.

samples  %        linenr info                app name             symbol name
1149893  34.8489  (no location information)  libpython2.6.so.1.0  /usr/lib64/libpython2.6.so.1.0
338676   10.2640  dec_and_lock.c:21          vmlinux              _atomic_dec_and_lock
335241   10.1599  cl_page.c:404              obdclass.ko          cl_page_find0
293448    8.8933  cl_page.c:729              obdclass.ko          cl_vmpage_page
235856    7.1479  filemap.c:963              vmlinux              grab_cache_page_nowait
142686    4.3243  cl_page.c:661              obdclass.ko          cl_page_put
80065     2.4265  filemap.c:667              vmlinux              find_get_page
46623     1.4130  copy_user_64.S:240         vmlinux              copy_user_generic_string
34127     1.0343  (no location information)  libpthread-2.12.so   pthread_rwlock_rdlock
33685     1.0209  (no location information)  libc-2.12.so         getenv
31725     0.9615  (no location information)  libpthread-2.12.so   pthread_rwlock_unlock
30492     0.9241  swap.c:288                 vmlinux              activate_page
28528     0.8646  (no location information)  libc-2.12.so         __dcigettext
23703     0.7183  (no location information)  libc-2.12.so         __strlen_sse42
22345     0.6772  vvp_page.c:120             lustre.ko            vvp_page_unassume
20646     0.6257  swap.c:183                 vmlinux              put_page
18525     0.5614  intel_idle.c:215           vmlinux              intel_idle
12774     0.3871  open.c:961                 vmlinux              sys_close
12702     0.3849  filemap.c:579              vmlinux              unlock_page
12685     0.3844  (no location information)  libc-2.12.so         __memcpy_ssse3
12134     0.3677  cl_page.c:874              obdclass.ko          cl_page_invoid
11734     0.3556  (no location information)  libpthread-2.12.so   __close_nocancel
10573     0.3204  cl_page.c:884              obdclass.ko          cl_page_owner_clear
10510     0.3185  (no location information)  libc-2.12.so         strerror_r
9783      0.2965  cl_page.c:1068             obdclass.ko          cl_page_unassume
9472      0.2871  (no location information)  oprofiled            /usr/bin/oprofiled
9438      0.2860  (no location information)  libc-2.12.so         __strncmp_sse2
8826      0.2675  rw.c:724                   lustre.ko            ll_readahead
8769      0.2658  filemap.c:527              vmlinux              page_waitqueue
8095      0.2453  entry_64.S:470             vmlinux              system_call_after_swapgs
7756      0.2351  cl_page.c:561              obdclass.ko          cl_page_state_set0
7615      0.2308  (no location information)  libc-2.12.so         __stpcpy_sse2
7366      0.2232  entry_64.S:462             vmlinux              system_call
7097      0.2151  sched.c:4430               vmlinux              find_busiest_group
6898      0.2091  (no location information)  libperl.so           /usr/lib64/perl5/CORE/libperl.so
6887      0.2087  (no location information)  libc-2.12.so         __ctype_b_loc
6780      0.2055  wait.c:251                 vmlinux              __wake_up_bit
6631      0.2010  cl_page.c:139              obdclass.ko          cl_page_at_trusted
6588      0.1997  (no location information)  libc-2.12.so         __strlen_sse2
6336      0.1920  lov_io.c:239               lov.ko               lov_page_stripe
5289      0.1603  lov_io.c:208               lov.ko               lov_sub_get
5263      0.1595  ring_buffer.c:2834         vmlinux              rb_get_reader_page
4595      0.1393  lvfs_lib.c:76              lvfs.ko              lprocfs_counter_add
4373      0.1325  (no location information)  libc-2.12.so         strerror
4347      0.1317  lov_io.c:252               lov.ko               lov_page_subio
4337      0.1314  cl_page.c:1036             obdclass.ko          cl_page_assume
4310      0.1306  (no location information)  libc-2.12.so         __mempcpy_sse2
3903      0.1183  lov_page.c:92              lov.ko               lov_page_own
3866      0.1172  mutex.c:409                vmlinux              __mutex_lock_slowpath
3841      0.1164  lcommon_cl.c:1097          lustre.ko            cl2vm_page
3698      0.1121  radix-tree.c:414           vmlinux              radix_tree_lookup_slot
3506      0.1063  vvp_page.c:109             lustre.ko            vvp_page_assume
3491      0.1058  lu_object.c:1115           obdclass.ko          lu_object_locate
3490      0.1058  (no location information)  ipmi_si.ko           port_inb
3457      0.1048  ring_buffer.c:3221         vmlinux              ring_buffer_consume
3431      0.1040  cl_page.c:898              obdclass.ko          cl_page_owner_set

Does the patch
I can open another JIRA ticket if needed and provide additional information (log, dump, profiling) as you wish. |
| Comment by Ned Bass [ 05/Dec/12 ] |
|
Hi Gregoire, You got a factor of two performance increase at 16 tasks, so why do you say there was no real improvement? I think the same optimization should apply with minimal adaptation to 2.1, since the same contention point exists there. |
| Comment by Gregoire Pichon [ 06/Dec/12 ] |
|
Ned, |
| Comment by Gregoire Pichon [ 12/Dec/12 ] |
|
As proposed by Jinshan, I have opened a separate ticket for the parallel read performance issue: |
| Comment by Andreas Dilger [ 18/Feb/13 ] |
|
Gregoire submitted a b2_1 version of this patch at http://review.whamcloud.com/5428. |