[LU-1666] hdr->coh_page_guard contention (single shared file performance) Created: 24/Jul/12  Updated: 13/Apr/14  Resolved: 13/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.5
Fix Version/s: Lustre 2.4.0, Lustre 2.1.5

Type: Bug Priority: Major
Reporter: Brian Behlendorf Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 1
Labels: client, performance
Environment:

Chaos 5, single client


Attachments: File fpp.fio     File ssf.fio    
Issue Links:
Related
is related to LU-2481 performance issue of parallel read Resolved
Severity: 3
Epic: client
Rank (Obsolete): 4054

 Description   

Shared file performance with the CLIO code has been observed to be considerably worse than file per process. In particular, during reads it's easy for our clients to become cpu bound. This workload is of particular interest because we expect numerous codes to operate in this mode on Sequoia.

Running oprofile clearly shows that the vast majority of cpu time is going to contention on the hdr->coh_page_guard spin lock. Below is a report of all functions consuming >0.1% of the cpu during the read phase.

samples  cum. samples  %        cum. %     app name                 symbol name
3815407  3815407       26.3390  26.3390    obdclass.ko              cl_vmpage_page
3671445  7486852       25.3452  51.6842    vmlinux                  _atomic_dec_and_lock
1215715  8702567        8.3925  60.0767    obdclass.ko              cl_page_find0
876063   9578630        6.0478  66.1244    obdclass.ko              cl_object_top
760239   10338869       5.2482  71.3726    obdclass.ko              cl_page_delete0
262614   10601483       1.8129  73.1855    vmlinux                  copy_user_generic_string
141309   10742792       0.9755  74.1610    obdclass.ko              cl_page_put
111365   10854157       0.7688  74.9298    osc.ko                   osc_checksum_bulk
105015   10959172       0.7250  75.6548    lvfs.ko                  lprocfs_counter_add
99702    11058874       0.6883  76.3430    vmlinux                  unlock_page
96783    11155657       0.6681  77.0112    lustre.ko                vvp_page_export
87458    11243115       0.6038  77.6149    vmlinux                  put_page
82223    11325338       0.5676  78.1825    lvfs.ko                  lprocfs_counter_sub
77318    11402656       0.5338  78.7163    obdclass.ko              cl_object_attr_lock
77143    11479799       0.5325  79.2488    obdclass.ko              cl_page_list_init
75322    11555121       0.5200  79.7688    vmlinux                  kmem_cache_free
74627    11629748       0.5152  80.2840    osc.ko                   osc_ap_completion

A closer look at the top five consumers shows them all contending on this spin lock.

vma      samples  cum. samples  %        cum. %     linenr info                 app name                 symbol name
0000000000050f30 3815407  3815407       26.3390  26.3390    cl_page.c:723               obdclass.ko              cl_vmpage_page
  0000000000050f30 212      212            0.0056   0.0056    cl_page.c:723
  0000000000050f31 197      409            0.0052   0.0107    cl_page.c:723
  0000000000050f34 12       421           3.1e-04   0.0110    cl_page.c:723
  0000000000050f36 184      605            0.0048   0.0159    cl_page.c:723
  0000000000050f38 395      1000           0.0104   0.0262    cl_page.c:723
  0000000000050f39 118      1118           0.0031   0.0293    cl_page.c:723
  0000000000050f3d 164      1282           0.0043   0.0336    cl_page.c:723
  0000000000050f42 27       1309          7.1e-04   0.0343    libcfs_debug.h:198
  0000000000050f49 16       1325          4.2e-04   0.0347    cl_page.c:727
  0000000000050f53 191      1516           0.0050   0.0397    cl_page.c:723
  0000000000050f56 7        1523          1.8e-04   0.0399    cl_page.c:723
  0000000000050f59 6        1529          1.6e-04   0.0401    libcfs_debug.h:198
  0000000000050f68 170      1699           0.0045   0.0445    cl_page.c:728
  0000000000050f6b 128      1827           0.0034   0.0479    cl_page.c:728
  0000000000050f71 13       1840          3.4e-04   0.0482    cl_page.c:740
  0000000000050f74 8        1848          2.1e-04   0.0484    cl_page.c:740
  0000000000050f79 182      2030           0.0048   0.0532    cl_object.h:2626
  0000000000050f7c 343      2373           0.0090   0.0622    cl_page.c:741
  0000000000050f80 48       2421           0.0013   0.0635    cl_page.c:741
  0000000000050f88 3714181  3716602       97.3469  97.4104    cl_page.c:742
  0000000000050f8c 1089     3717691        0.0285  97.4389    cl_page.c:742
  0000000000050f91 233      3717924        0.0061  97.4450    cl_object.h:2637
  0000000000050f95 15918    3733842        0.4172  97.8622    cl_object.h:2626
  0000000000050f99 3835     3737677        0.1005  97.9627    cl_page.c:744
  0000000000050f9c 420      3738097        0.0110  97.9737    cl_page.c:744
  0000000000050f9e 1016     3739113        0.0266  98.0004    cl_page.c:744
  0000000000050fb1 3584     3742697        0.0939  98.0943    cl_page.c:743
  0000000000050fb5 161      3742858        0.0042  98.0985    cl_page.c:742
  0000000000050fba 480      3743338        0.0126  98.1111    spinlock.h:174
  0000000000050fbd 269      3743607        0.0071  98.1182    spinlock.h:174
  0000000000050fc0 48983    3792590        1.2838  99.4020    spinlock.h:174
  0000000000050fc3 2        3792592       5.2e-05  99.4020    cl_page.c:750
  0000000000050fcc 103      3792695        0.0027  99.4047    libcfs_debug.h:198
  0000000000050fd3 84       3792779        0.0022  99.4069    libcfs_debug.h:198
  0000000000050ff7 125      3792904        0.0033  99.4102    cl_page.c:752
  0000000000050ffb 29       3792933       7.6e-04  99.4110    cl_page.c:752
  0000000000050ffc 229      3793162        0.0060  99.4170    cl_page.c:752
  0000000000050ffe 117      3793279        0.0031  99.4200    cl_page.c:752
  0000000000051000 98       3793377        0.0026  99.4226    cl_page.c:752
  0000000000051001 204      3793581        0.0053  99.4280    cl_page.c:752
  0000000000051028 55       3793636        0.0014  99.4294    atomic_64.h:176
  000000000005102d 30       3793666       7.9e-04  99.4302    atomic_64.h:176
  0000000000051031 7819     3801485        0.2049  99.6351    cl_page.c:123
  0000000000051035 6        3801491       1.6e-04  99.6353    cl_page.c:124
  0000000000051039 69       3801560        0.0018  99.6371    cl_page.c:124
  000000000005103d 583      3802143        0.0153  99.6524    cl_page.c:124
  0000000000051041 365      3802508        0.0096  99.6619    atomic_64.h:93
  0000000000051045 11064    3813572        0.2900  99.9519    atomic_64.h:93
  0000000000051086 30       3813602       7.9e-04  99.9527    cl_page.c:750
  000000000005108d 1805     3815407        0.0473  100.000    cl_page.c:750

0000000000052420 1215715  8702567        8.3925  60.0767    cl_page.c:402               obdclass.ko              cl_page_find0
  0000000000052420 95       95             0.0078   0.0078    cl_page.c:402
  0000000000052421 55       150            0.0045   0.0123    cl_page.c:402
  0000000000052424 40       190            0.0033   0.0156    cl_page.c:402
  000000000005242b 85       275            0.0070   0.0226    cl_page.c:402
  000000000005242f 86       361            0.0071   0.0297    cl_page.c:402
  0000000000052433 55       416            0.0045   0.0342    cl_page.c:402
  0000000000052437 4        420           3.3e-04   0.0345    cl_page.c:402
  000000000005243b 73       493            0.0060   0.0406    cl_page.c:402
  000000000005243f 58       551            0.0048   0.0453    cl_page.c:402
  0000000000052444 42       593            0.0035   0.0488    cl_page.c:402
  0000000000052448 5        598           4.1e-04   0.0492    cl_page.c:402
  000000000005244c 124      722            0.0102   0.0594    cl_page.c:402
  000000000005244f 49       771            0.0040   0.0634    cl_page.c:402
  0000000000052453 75       846            0.0062   0.0696    cl_object.h:2631
  0000000000052457 8013     8859           0.6591   0.7287    cl_page.c:402
  000000000005245a 60       8919           0.0049   0.7336    cl_page.c:402
  000000000005245d 32       8951           0.0026   0.7363    cl_object.h:2631
  0000000000052461 5510     14461          0.4532   1.1895    cl_page.c:409
  0000000000052465 3        14464         2.5e-04   1.1898    cl_page.c:409
  000000000005246e 35       14499          0.0029   1.1926    cl_page.c:410
  0000000000052473 119      14618          0.0098   1.2024    libcfs_debug.h:198
  000000000005247a 41       14659          0.0034   1.2058    cl_page.c:412
  0000000000052484 155      14814          0.0127   1.2185    libcfs_debug.h:198
  00000000000524a8 648      15462          0.0533   1.2718    cl_object.h:2626
  00000000000524ac 82       15544          0.0067   1.2786    atomic_64.h:93
  00000000000524b0 83372    98916          6.8579   8.1364    libcfs_debug.h:198
  00000000000524b8 3        98919         2.5e-04   8.1367    cl_page.c:417
  00000000000524c2 191      99110          0.0157   8.1524    libcfs_debug.h:198
  0000000000052512 86       99196          0.0071   8.1595    cl_object.h:2631
  0000000000052517 2425     101621         0.1995   8.3589    libcfs_debug.h:198
  000000000005251e 42       101663         0.0035   8.3624    cl_object.h:2631
  0000000000052522 767      102430         0.0631   8.4255    cl_page.c:339
  0000000000052530 154      102584         0.0127   8.4382    libcfs_debug.h:198
  0000000000052536 13       102597         0.0011   8.4392    thread_info.h:216
  000000000005253f 700      103297         0.0576   8.4968    cl_page.c:341
  0000000000052546 363      103660         0.0299   8.5267    cl_page.c:341
  0000000000052551 106      103766         0.0087   8.5354    cl_page.c:341
  0000000000052558 32       103798         0.0026   8.5380    cl_page.c:341
  0000000000052562 85       103883         0.0070   8.5450    cl_page.c:341
  0000000000052565 5        103888        4.1e-04   8.5454    cl_page.c:341
  0000000000052568 51       103939         0.0042   8.5496    cl_page.c:341
  000000000005256e 1        103940        8.2e-05   8.5497    libcfs_fail.h:81
  0000000000052575 52       103992         0.0043   8.5540    libcfs_fail.h:81
  0000000000052580 56       104048         0.0046   8.5586    cl_page.c:341
  0000000000052585 1        104049        8.2e-05   8.5587    cl_page.c:341
  0000000000052588 55600    159649         4.5734  13.1321    cl_page.c:341
  000000000005258a 197      159846         0.0162  13.1483    cl_page.c:341
  0000000000052591 6        159852        4.9e-04  13.1488    cl_page.c:341
  0000000000052598 30       159882         0.0025  13.1513    cl_page.c:341
  000000000005259d 96       159978         0.0079  13.1592    libcfs_debug.h:198
  00000000000525a4 2        159980        1.6e-04  13.1593    cl_page.c:341
  00000000000525ae 103      160083         0.0085  13.1678    libcfs_debug.h:198
  00000000000525b7 1        160084        8.2e-05  13.1679    cl_page.c:344
  00000000000525bb 21       160105         0.0017  13.1696    atomic_64.h:35
  00000000000525c2 869      160974         0.0715  13.2411    cl_page.c:345
  00000000000525c7 79       161053         0.0065  13.2476    cl_page.c:346
  00000000000525d2 56       161109         0.0046  13.2522    cl_page.c:348
  00000000000525d6 46       161155         0.0038  13.2560    cl_page.c:354
  00000000000525da 19       161174         0.0016  13.2575    cl_page.c:354
  00000000000525e1 86       161260         0.0071  13.2646    cl_page.c:326
  00000000000525e9 22       161282         0.0018  13.2664    cl_page.c:350
  00000000000525f0 19       161301         0.0016  13.2680    cl_page.c:354
  00000000000525f7 23       161324         0.0019  13.2699    cl_page.c:348
  00000000000525fb 82       161406         0.0067  13.2766    cl_page.c:351
  00000000000525ff 1        161407        8.2e-05  13.2767    list.h:30
  0000000000052603 18       161425         0.0015  13.2782    list.h:31
  0000000000052607 26       161451         0.0021  13.2803    cl_page.c:352
  000000000005260b 70       161521         0.0058  13.2861    list.h:30
  000000000005260f 16       161537         0.0013  13.2874    list.h:31
  0000000000052613 20       161557         0.0016  13.2891    cl_page.c:353
  0000000000052617 12       161569        9.9e-04  13.2900    list.h:30
  000000000005261b 89       161658         0.0073  13.2974    list.h:31
  000000000005261f 26       161684         0.0021  13.2995    cl_page.c:354
  0000000000052624 21       161705         0.0017  13.3012    cl_page.c:356
  0000000000052628 1605     163310         0.1320  13.4332    cl_page.c:357
  000000000005262c 30095    193405         2.4755  15.9087    cl_page.c:357
  0000000000052630 48       193453         0.0039  15.9127    cl_page.c:357
  0000000000052633 60       193513         0.0049  15.9176    cl_page.c:357
  0000000000052637 4        193517        3.3e-04  15.9180    cl_page.c:357
  0000000000052639 43       193560         0.0035  15.9215    cl_page.c:357
  000000000005263d 51       193611         0.0042  15.9257    cl_page.c:357
  0000000000052641 9        193620        7.4e-04  15.9264    cl_page.c:357
  0000000000052645 4        193624        3.3e-04  15.9268    cl_page.c:357
  0000000000052648 44       193668         0.0036  15.9304    cl_page.c:357
  0000000000052650 142      193810         0.0117  15.9421    cl_page.c:359
  0000000000052655 6356     200166         0.5228  16.4649    cl_page.c:359
  0000000000052658 568      200734         0.0467  16.5116    cl_page.c:359
  000000000005265d 124      200858         0.0102  16.5218    cl_page.c:360
  0000000000052660 63       200921         0.0052  16.5270    cl_page.c:360
  0000000000052663 3        200924        2.5e-04  16.5272    cl_page.c:360
  0000000000052666 24       200948         0.0020  16.5292    cl_page.c:360
  0000000000052669 88       201036         0.0072  16.5364    cl_page.c:360
  000000000005266b 47       201083         0.0039  16.5403    cl_page.c:362
  0000000000052674 103      201186         0.0085  16.5488    cl_page.c:357
  0000000000052679 4520     205706         0.3718  16.9206    cl_page.c:357
  000000000005267c 103      205809         0.0085  16.9290    cl_page.c:357
  0000000000052680 3        205812        2.5e-04  16.9293    cl_page.c:357
  0000000000052682 81       205893         0.0067  16.9360    cl_page.c:357
  0000000000052686 80       205973         0.0066  16.9425    cl_page.c:357
  000000000005268a 19       205992         0.0016  16.9441    atomic_64.h:93
  000000000005268e 11       206003        9.0e-04  16.9450    atomic_64.h:93
  0000000000052692 25935    231938         2.1333  19.0783    atomic_64.h:93
  0000000000052696 5512     237450         0.4534  19.5317    atomic_64.h:93
  000000000005269a 5961     243411         0.4903  20.0220    atomic_64.h:93
  00000000000526a6 87       243498         0.0072  20.0292    libcfs_debug.h:198
  0000000000052718 191      243689         0.0157  20.0449    cl_page.c:430
  000000000005271b 9        243698        7.4e-04  20.0457    cl_page.c:430
  0000000000052723 2761     246459         0.2271  20.2728    cl_page.c:438
  0000000000052726 1        246460        8.2e-05  20.2728    cl_page.c:430
  0000000000052729 17       246477         0.0014  20.2742    cl_page.c:438
  000000000005272f 59       246536         0.0049  20.2791    atomic_64.h:93
  0000000000052733 44355    290891         3.6485  23.9276    libcfs_debug.h:198
  000000000005273a 2        290893        1.6e-04  23.9277    libcfs_debug.h:198
  0000000000052761 50       290943         0.0041  23.9318    cl_page.c:497
  0000000000052764 53       290996         0.0044  23.9362    cl_page.c:497
  0000000000052768 33       291029         0.0027  23.9389    cl_page.c:497
  000000000005276c 80       291109         0.0066  23.9455    cl_page.c:497
  0000000000052770 58       291167         0.0048  23.9503    cl_page.c:497
  0000000000052774 77       291244         0.0063  23.9566    cl_page.c:497
  0000000000052778 45       291289         0.0037  23.9603    cl_page.c:497
  0000000000052779 155      291444         0.0127  23.9731    cl_page.c:497
  00000000000527b8 128      291572         0.0105  23.9836    cl_page.c:448
  0000000000052850 105      291677         0.0086  23.9922    cl_page.c:461
  0000000000052854 2        291679        1.6e-04  23.9924    cl_page.c:461
  0000000000052858 1        291680        8.2e-05  23.9925    cl_page.c:461
  000000000005285f 96       291776         0.0079  24.0004    cl_page.c:461
  0000000000052864 901444   1193220       74.1493  98.1496    cl_page.c:462
  0000000000052868 106      1193326        0.0087  98.1584    cl_page.c:462
  000000000005286c 4        1193330       3.3e-04  98.1587    cl_page.c:462
  0000000000052870 31       1193361        0.0025  98.1612    cl_page.c:462
  0000000000052874 60       1193421        0.0049  98.1662    cl_page.c:462
  000000000005287c 81       1193502        0.0067  98.1728    cl_page.c:463
  000000000005287e 2        1193504       1.6e-04  98.1730    cl_page.c:463
  0000000000052882 42       1193546        0.0035  98.1765    cl_page.c:463
  0000000000052888 58       1193604        0.0048  98.1812    cl_page.c:482
  000000000005288d 111      1193715        0.0091  98.1904    cl_page.c:482
  000000000005288f 16       1193731        0.0013  98.1917    cl_page.c:483
  0000000000052894 302      1194033        0.0248  98.2165    cl_page.c:483
  000000000005289a 32       1194065        0.0026  98.2192    cl_page.c:484
  000000000005289e 136      1194201        0.0112  98.2303    cl_page.c:484
  00000000000528a2 105      1194306        0.0086  98.2390    cl_page.c:485
  00000000000528a6 270      1194576        0.0222  98.2612    cl_page.c:487
  00000000000528aa 83       1194659        0.0068  98.2680    cl_page.c:487
  00000000000528ad 25       1194684        0.0021  98.2701    cl_page.c:487
  00000000000528b0 29       1194713        0.0024  98.2725    cl_page.c:487
  00000000000528b5 13876    1208589        1.1414  99.4138    spinlock.h:174
  00000000000528b8 14       1208603        0.0012  99.4150    spinlock.h:174
  00000000000528bb 6973     1215576        0.5736  99.9886    spinlock.h:174
  00000000000528be 18       1215594        0.0015  99.9900    cl_page.c:491
  00000000000528c7 18       1215612        0.0015  99.9915    libcfs_debug.h:198
  00000000000528ce 103      1215715        0.0085  100.000    libcfs_debug.h:198

000000000004d840 876063   9578630        6.0478  66.1244    cl_object.c:167             obdclass.ko              cl_object_top
  000000000004d840 6463     6463           0.7377   0.7377    cl_object.c:167
  000000000004d841 9476     15939          1.0817   1.8194    cl_object.c:167
  000000000004d844 438      16377          0.0500   1.8694    cl_object.c:167
  000000000004d845 569      16946          0.0649   1.9343    cl_object.c:167
  000000000004d849 185      17131          0.0211   1.9555    cl_object.c:167
  000000000004d84e 276      17407          0.0315   1.9870    cl_object.h:2626
  000000000004d851 17130    34537          1.9553   3.9423    cl_object.h:2626
  000000000004d858 97       34634          0.0111   3.9534    cl_object.c:171
  000000000004d85b 254      34888          0.0290   3.9824    cl_object.c:171
  000000000004d862 424776   459664        48.4869  52.4693    cl_object.c:171
  000000000004d867 693      460357         0.0791  52.5484    list.h:186
  000000000004d86b 368051   828408        42.0119  94.5603    lu_object.h:748
  000000000004d86f 10       828418         0.0011  94.5615    lu_object.h:748
  000000000004d878 662      829080         0.0756  94.6370    libcfs.h:320
  000000000004d880 616      829696         0.0703  94.7073    libcfs.h:320
  000000000004d889 539      830235         0.0615  94.7689    libcfs.h:323
  000000000004d88d 3        830238        3.4e-04  94.7692    cl_object.h:2603
  000000000004d890 29       830267         0.0033  94.7725    lu_object.h:749
  000000000004d893 391      830658         0.0446  94.8172    cl_object.h:2603
  000000000004d895 630      831288         0.0719  94.8891    libcfs_debug.h:198
  000000000004d89c 159      831447         0.0181  94.9072    cl_object.c:175
  000000000004d8a6 614      832061         0.0701  94.9773    libcfs_debug.h:198
  000000000004d8cc 11       832072         0.0013  94.9786    cl_object.c:177
  000000000004d8cf 156      832228         0.0178  94.9964    cl_object.c:177
  000000000004d8d3 8        832236        9.1e-04  94.9973    cl_object.c:177
  000000000004d8d4 662      832898         0.0756  95.0728    cl_object.c:177
  000000000004d8d5 221      833119         0.0252  95.0981    cl_object.c:177
  000000000004d8d6 584      833703         0.0667  95.1647    cl_object.h:2603
  000000000004d8de 517      834220         0.0590  95.2237    cl_object.h:2587
  000000000004d8e2 29587    863807         3.3773  98.6010    cl_object.h:2587
  000000000004d8e6 9113     872920         1.0402  99.6412    cl_object.h:2603
  000000000004d8e9 3143     876063         0.3588  100.000    cl_object.h:2603

00000000000510d0 760239   10338869       5.2482  71.3726    cl_page.c:1126              obdclass.ko              cl_page_delete0
  00000000000510d0 12       12             0.0016   0.0016    cl_page.c:1126
  00000000000510d1 38       50             0.0050   0.0066    cl_page.c:1126
  00000000000510d4 9        59             0.0012   0.0078    cl_page.c:1126
  00000000000510d6 14       73             0.0018   0.0096    cl_page.c:1126
  00000000000510d8 42       115            0.0055   0.0151    cl_page.c:1126
  00000000000510da 6        121           7.9e-04   0.0159    cl_page.c:1126
  00000000000510dc 6        127           7.9e-04   0.0167    cl_page.c:1126
  00000000000510dd 4        131           5.3e-04   0.0172    cl_page.c:1126
  00000000000510e1 26       157            0.0034   0.0207    cl_page.c:1126
  00000000000510e6 4        161           5.3e-04   0.0212    libcfs_debug.h:198
  00000000000510ed 32       193            0.0042   0.0254    cl_page.c:1128
  00000000000510f7 4        197           5.3e-04   0.0259    cl_page.c:1126
  00000000000510fa 8        205            0.0011   0.0270    cl_page.c:1126
  00000000000510fd 28       233            0.0037   0.0306    cl_page.c:1126
  0000000000051100 6        239           7.9e-04   0.0314    libcfs_debug.h:198
  000000000005110f 5        244           6.6e-04   0.0321    libcfs_debug.h:198
  0000000000051112 3        247           3.9e-04   0.0325    libcfs_debug.h:198
  000000000005111b 25       272            0.0033   0.0358    cl_page.c:102
  0000000000051120 2        274           2.6e-04   0.0360    cl_page.c:102
  0000000000051125 2        276           2.6e-04   0.0363    cl_page.c:1130
  000000000005112e 2        278           2.6e-04   0.0366    cl_page.c:1131
  0000000000051133 20       298            0.0026   0.0392    cl_page.c:1131
  0000000000051139 4        302           5.3e-04   0.0397    cl_page.c:1136
  000000000005113c 2        304           2.6e-04   0.0400    cl_page.c:1136
  0000000000051141 22       326            0.0029   0.0429    cl_page.c:1146
  0000000000051143 14       340            0.0018   0.0447    cl_page.c:1146
  0000000000051146 1        341           1.3e-04   0.0449    cl_page.c:1146
  0000000000051149 7        348           9.2e-04   0.0458    cl_page.c:1146
  000000000005114e 1        349           1.3e-04   0.0459    cl_page.c:1147
  0000000000051159 28       377            0.0037   0.0496    cl_page.c:1147
  0000000000051167 98       475            0.0129   0.0625    cl_page.c:1149
  0000000000051169 3        478           3.9e-04   0.0629    cl_page.c:1149
  0000000000051173 32       510            0.0042   0.0671    cl_page.c:102
  0000000000051178 167      677            0.0220   0.0891    cl_page.c:102
  000000000005117d 21       698            0.0028   0.0918    cl_page.c:102
  0000000000051180 5        703           6.6e-04   0.0925    cl_page.c:1170
  0000000000051185 117      820            0.0154   0.1079    cl_page.c:1170
  0000000000051189 24       844            0.0032   0.1110    cl_page.c:1170
  000000000005118e 62       906            0.0082   0.1192    cl_page.c:1170
  0000000000051190 38       944            0.0050   0.1242    cl_page.c:1170
  0000000000051194 855      1799           0.1125   0.2366    cl_page.c:1170
  0000000000051198 192      1991           0.0253   0.2619    cl_page.c:1170
  000000000005119d 31       2022           0.0041   0.2660    cl_page.c:1170
  00000000000511a0 41       2063           0.0054   0.2714    cl_page.c:1170
  00000000000511a3 4        2067          5.3e-04   0.2719    cl_page.c:1170
  00000000000511a5 159      2226           0.0209   0.2928    cl_page.c:1170
  00000000000511a9 143      2369           0.0188   0.3116    cl_page.c:1170
  00000000000511ad 44       2413           0.0058   0.3174    cl_page.c:1170
  00000000000511b2 6        2419          7.9e-04   0.3182    cl_page.c:1170
  00000000000511b7 152      2571           0.0200   0.3382    cl_page.c:1170
  00000000000511bc 24       2595           0.0032   0.3413    libcfs_debug.h:198
  00000000000511c3 3        2598          3.9e-04   0.3417    cl_page.c:1172
  00000000000511cd 28       2626           0.0037   0.3454    libcfs_debug.h:198
  00000000000511f1 7        2633          9.2e-04   0.3463    cl_page.c:1173
  00000000000511f2 16       2649           0.0021   0.3484    cl_page.c:1173
  00000000000511f4 4        2653          5.3e-04   0.3490    cl_page.c:1173
  00000000000511f6 7        2660          9.2e-04   0.3499    cl_page.c:1173
  00000000000511f8 11       2671           0.0014   0.3513    cl_page.c:1173
  00000000000511fa 21       2692           0.0028   0.3541    cl_page.c:1173
  00000000000511fb 4        2696          5.3e-04   0.3546    cl_page.c:1173
  0000000000051216 2        2698          2.6e-04   0.3549    cl_page.c:1150
  0000000000051219 4        2702          5.3e-04   0.3554    cl_page.c:1150
  000000000005121c 39       2741           0.0051   0.3605    cl_page.c:1150
  0000000000051228 82       2823           0.0108   0.3713    cl_page.c:1164
  000000000005122c 1095     3918           0.1440   0.5154    cl_page.c:1164
  0000000000051235 23       3941           0.0030   0.5184    cl_page.c:1165
  000000000005123d 120      4061           0.0158   0.5342    spinlock.h:174
  0000000000051243 2020     6081           0.2657   0.7999    spinlock.h:174
  000000000005124a 717      6798           0.0943   0.8942    cl_page.c:1155
  0000000000051253 19       6817           0.0025   0.8967    cl_object.h:2626
  0000000000051257 194      7011           0.0255   0.9222    cl_object.h:2626
  000000000005125a 2233     9244           0.2937   1.2159    cl_page.c:1160
  000000000005125e 44       9288           0.0058   1.2217    cl_page.c:1160
  0000000000051261 36       9324           0.0047   1.2265    cl_page.c:1160
  0000000000051266 750688   760012        98.7437  99.9701    cl_page.c:1161
  000000000005126a 120      760132         0.0158  99.9859    cl_page.c:1161
  0000000000051273 107      760239         0.0141  100.000    cl_page.c:1163

We can start by trying the easy stuff to improve performance here. For example, as suggested in the cl_object_header structure definition, aligning the spin locks on cache lines should help. This may be particularly true on Sequoia's PowerPC architecture, where we have 68 slow cores and relatively long latencies to memory. That said, I don't see much contention in the oprofile results for the other locks in the cache line. If that fails, we're going to need to look at optimizing what's done under the spin locks. On first glance I see several while loops and for loops, which is always concerning.



 Comments   
Comment by Brian Behlendorf [ 24/Jul/12 ]

This problem can be easily reproduced with a single client and the following fio script.

fio ssf.fio
; ssf.fio, single shared file
[global]
directory=/p/lstest/behlendo/fio/
filename=ssf
size=256g
blocksize=1m
direct=0
ioengine=sync
numjobs=128
group_reporting=1
fallocate=none
runtime=60
stonewall

[write]
rw=randwrite:256
rw_sequencer=sequential
fsync_on_close=1

[read]
rw=randread:256
rw_sequencer=sequential
Comment by Brian Behlendorf [ 24/Jul/12 ]

For the record, these results include the following patch. It looks like you are already well aware that this lock is heavily contended. If there is other work currently underway to address this, could you please point me to it.

http://review.whamcloud.com/#change,911 (Patch set #16)

Finally, I've also verified that I'm able to create significant contention on this lock (100% on 16 cores) even in the file per process case when there are enough threads performing I/O. So resolving the lock contention here should improve a variety of client workloads.

Comment by Peter Jones [ 24/Jul/12 ]

Jinshan

Could you please comment on this one?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 26/Jul/12 ]

Right now the spinlock in the object header protects both individual page state and the radix tree; we can split it by functionality to reduce contention. However, the lock protecting the radix tree will still be heavily contended. Moving this lock onto its own cache line won't help, because you're writing to the same file, so the same lock will be under contention.

Comment by Brian Behlendorf [ 26/Jul/12 ]

One way or another we're going to have to reduce the contention on this lock or we'll never see good client performance numbers. To quantify this we ran some simple single client fio benchmarks for a single shared file (SSF) and a file per process (FPP) workload. We did this for both 16-core x86_64 clients (Grove) and 68-core PowerPC clients (Sequoia). I've attached both fio scripts which clearly describe the exact test cases.

    Sequoia |   FPP  |   SSF  |
   ---------+--------+--------+      
      Write | ^ 1329 | #  169 |
   ---------+--------+--------+
       Read | ^  778 | ^ 1474 |
   ---------+--------+--------+

      Grove |   FPP  |   SSF  |
   ---------+--------+--------+      
      Write | ^ 1697 | * 549  |
   ---------+--------+--------+
       Read | ^ 1643 | ^ 1039 |
   ---------+--------+--------+

  • All files striped over 8 OSTs
  • All results in MB/s
  • (^) Profiling indicates the performance was limited by LU-1666
  • (#) Profiling indicates the performance was limited by LU-1669

These results are very interesting. They suggest that this spin lock is currently the limiting factor for both FPP I/O and SSF reads. Also notice that the effect is more severe on the PowerPC clients, which have ~4x the cores (68 vs. 16), running at a lower clock rate and with significantly slower access to memory.

Splitting the lock sounds like a good place to start. That's straightforward enough, and we can test what sort of dividends it pays. But even a quick glance at the code and profiling results shows other areas which can be improved.

For example, check out the cl_vmpage_page() function in the profiling results above. A total of 26.3% of the cpu cycles on the node went to this function. Of that time, 97% was spent waiting to acquire the spin lock, but more interestingly, 3% was spent in the for loop below, which runs under the lock. This is absolutely something which can be optimized.

        hdr = cl_object_header(cl_object_top(obj));
        cfs_spin_lock(&hdr->coh_page_guard);
        for (page = (void *)vmpage->private;
             page != NULL; page = page->cp_child) {
                if (cl_object_same(page->cp_obj, obj)) {
                        cl_page_get_trust(page);
                        break;
                }
        }
        cfs_spin_unlock(&hdr->coh_page_guard);
Comment by Brian Behlendorf [ 26/Jul/12 ]

Fio workload scripts

Comment by Brian Behlendorf [ 30/Jul/12 ]

In conjunction with this issue we'll be pursuing, with IBM, getting ticket spinlocks implemented in the PowerPC kernel. This will improve fairness between lock waiters, ensuring none of them are starved for excessively long, which is entirely possible right now. With so many cores it seems entirely plausible to me that one core could be starved for several minutes.

Comment by Prakash Surya (Inactive) [ 01/Aug/12 ]

We've since ruled out the need for ticketed spinlocks. A simple test was constructed which did not show any one CPU being unfairly starved. Each CPU was able to acquire a global shared lock a roughly equal number of times, and with roughly equal wait times. That result, in conjunction with the fact that the node has only a single NUMA zone, suggests that ticketed spin locks would not be as great a benefit as we had hoped.

Comment by Ned Bass [ 08/Aug/12 ]

As a reference for those of us trying to understand the implicated code, here is a call trace for cl_vmpage_page().

 0xffffffffa04d4190 : cl_vmpage_page+0x0/0x1a0 [obdclass]
 0xffffffffa04d5983 : cl_page_find0+0x303/0x730 [obdclass]
 0xffffffffa04d5de1 : cl_page_find+0x11/0x20 [obdclass]
 0xffffffffa0b543dd : ll_readahead+0xa2d/0xec0 [lustre]
 0xffffffffa0b78895 : vvp_io_read_page+0x295/0x2e0 [lustre]
 0xffffffffa04de145 : cl_io_read_page+0x95/0x130 [obdclass]
 0xffffffffa0b533f0 : ll_readpage+0xa0/0x1d0 [lustre]
 0xffffffff8111262c : generic_file_aio_read+0x1fc/0x700 [kernel]
 0xffffffffa0b78ec6 : vvp_io_read_start+0x146/0x350 [lustre]
 0xffffffffa04dc512 : cl_io_start+0x72/0xf0 [obdclass]
 0xffffffffa04dfe24 : cl_io_loop+0xb4/0x160 [obdclass]
 0xffffffffa0b32a77 : ll_file_io_generic+0x337/0x4f0 [lustre]
 0xffffffffa0b33919 : ll_file_aio_read+0x149/0x290 [lustre]
 0xffffffffa0b33bd3 : ll_file_read+0x173/0x270 [lustre]
 0xffffffff811782d5 : vfs_read+0xb5/0x1a0 [kernel]
 0xffffffff81178411 : sys_read+0x51/0x90 [kernel]
 0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
Comment by Ned Bass [ 10/Aug/12 ]

Here's some data from /proc/lock_stats for the fio single-shared-file read workload. I believe the _atomic_dec_and_lock is from cl_page_put(). cl_page_find0+0x464 is where the lock is taken for a radix_tree_insert().

lock_stat version 0.3 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                    &cl_page_guard_class:      40974077       41330086           1.72        5319.93   622070050.62       45807971       48525879           1.16         583.13    14269160.92
                    --------------------
                    &cl_page_guard_class       15772469          [<ffffffff8128fb35>] _atomic_dec_and_lock+0x55/0x80
                    &cl_page_guard_class       20602549          [<ffffffffa04ae508>] cl_vmpage_page+0x58/0x1a0 [obdclass]
                    &cl_page_guard_class            300          [<ffffffffa04ae1fb>] cl_page_delete0+0x19b/0x3b0 [obdclass]
                    &cl_page_guard_class        4954768          [<ffffffffa04af9c4>] cl_page_find0+0x464/0x750 [obdclass]
                    --------------------
                    &cl_page_guard_class       15656563          [<ffffffff8128fb35>] _atomic_dec_and_lock+0x55/0x80
                    &cl_page_guard_class       20518390          [<ffffffffa04ae508>] cl_vmpage_page+0x58/0x1a0 [obdclass]
                    &cl_page_guard_class            277          [<ffffffffa04ae1fb>] cl_page_delete0+0x19b/0x3b0 [obdclass]
                    &cl_page_guard_class        5154856          [<ffffffffa04af9c4>] cl_page_find0+0x464/0x750 [obdclass]
Comment by Ned Bass [ 10/Aug/12 ]

Hi Jinshan,

I tried this experimental patch to avoid taking coh_page_guard in cl_vmpage_page()

diff --git a/lustre/obdclass/cl_page.c b/lustre/obdclass/cl_page.c
index 0639499..027075e 100644
--- a/lustre/obdclass/cl_page.c
+++ b/lustre/obdclass/cl_page.c
@@ -736,6 +736,15 @@ struct cl_page *cl_vmpage_page(cfs_page_t *vmpage, struct cl_object *obj)
          * This loop assumes that ->private points to the top-most page. This
          * can be rectified easily.
          */
+
+        page = (void *)vmpage->private;
+        if (page == NULL)
+                RETURN(page);
+        if (cl_object_same(page->cp_obj, obj)) {
+                cl_page_get_trust(page);
+                RETURN(page);  
+        }
+
         hdr = cl_object_header(cl_object_top(obj));
         cfs_spin_lock(&hdr->coh_page_guard);
         for (page = (void *)vmpage->private;

I found that it reduced the contention/acquisition ratio for coh_page_guard by only about 10%, yet it increased read throughput on the ssf.fio test on grove by about 25%. However, the node is still CPU-bound during the test due to contention on this lock.

My assumption here is that we only need the lock to ensure consistency of the page->cp_child linked list, and that the cl_page we're looking for is very often associated with the top-most object, in which case we don't need to traverse the list. Anyway, these results are a good indication that we can reap some big performance gains if we can find ways to take this lock less often.

Comment by Jinshan Xiong (Inactive) [ 13/Aug/12 ]

Hi Ned, your patch seems to work for me. BTW, can you please try not grabbing coh_page_guard at all and see how it goes?

It may be safe to do it this way because we're holding the vmpage lock when calling cl_vmpage_page(), so the cl_page list can't go away; also, the cl_object of the page can't be destroyed either, since a page for it already exists.

Comment by Ned Bass [ 13/Aug/12 ]

Never taking coh_page_guard in cl_vmpage_page() seems to work well. However, when I instrumented the code to count iterations of the for loop, it always performed only one iteration. I think for this workload it never follows the cp_child pointer, so we'll need more testing or careful reasoning to say for sure that this is safe.

Here are some results for the ssf.fio test above on grove with and without taking the lock. Reads get a nice bump from this optimization.

         |   SSF  |   SSF    |                                                  
   Grove | locked | unlocked |                                                  
---------+--------+----------+                                                  
   Write |   836  |   719    |                                                  
---------+--------+----------+                                                  
    Read |  1232  |   1698   |                                                  
---------+--------+----------+                                                  

The point of contention now shifts to the _atomic_dec_and_lock() call in cl_page_put(). I don't understand why the lock would be contended here since it only takes the lock if the refcount decrements to zero. In that case, who else would be holding the lock?

Here is oprofile output and lock_stat data for the read phase of ssf.fio when we don't take coh_page_guard:

CPU: Intel Sandy Bridge microarchitecture, speed 2601 MHz (estimated)              
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
vma      samples  cum. samples  %        cum. %     linenr info                 app name                 symbol name
ffffffff8126c5e0 327867   327867        28.1153  28.1153    dec_and_lock.c:21           vmlinux                  _atomic_dec_and_lock
00000000000524b0 92029    419896         7.8917  36.0070    cl_page.c:402               obdclass.ko              cl_page_find0
000000000004d900 82820    502716         7.1020  43.1090    cl_object.c:167             obdclass.ko              cl_object_top
0000000000051160 80924    583640         6.9394  50.0484    cl_page.c:1123              obdclass.ko              cl_page_delete0
0000000000000000 59278    642918         5.0832  55.1316    (no location information)   lustre                   /lustre
ffffffff812778c0 23310    666228         1.9989  57.1305    copy_user_64.S:240          vmlinux                  copy_user_generic_string
00000000000520b0 20577    686805         1.7645  58.8950    cl_page.c:652               obdclass.ko              cl_page_put
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                    &cl_page_guard_class:      31974716       31992762           1.42        3949.95   161714748.58       55017484       62473133           1.39        1542.40    20544116.54
                    --------------------
                    &cl_page_guard_class         779139          [<ffffffffa04a536b>] cl_page_delete0+0x19b/0x3b0 [obdclass]
                    &cl_page_guard_class       22460626          [<ffffffff8128fb35>] _atomic_dec_and_lock+0x55/0x80
                    &cl_page_guard_class        8752997          [<ffffffffa04a6994>] cl_page_find0+0x464/0x750 [obdclass]
                    --------------------
                    &cl_page_guard_class         872283          [<ffffffffa04a536b>] cl_page_delete0+0x19b/0x3b0 [obdclass]
                    &cl_page_guard_class       21478799          [<ffffffff8128fb35>] _atomic_dec_and_lock+0x55/0x80
                    &cl_page_guard_class        9641680          [<ffffffffa04a6994>] cl_page_find0+0x464/0x750 [obdclass]

Comment by Jinshan Xiong (Inactive) [ 13/Aug/12 ]

Hi Ned,

After taking a closer look, I realize this change will revive bug 19537, so I think the correct way to reduce the contention on coh_page_guard is to define a per-page spinlock for this purpose.

The reason it has high contention on coh_page_guard is that every page is used only once, so everyone has to grab coh_page_guard in cl_page_put(). I will work out a patch soon.

Comment by Jinshan Xiong (Inactive) [ 13/Aug/12 ]

Hi Ned, can you please try http://review.whamcloud.com/3627 and see if it can help?

Comment by Ned Bass [ 14/Aug/12 ]

Initial results are very good! The first try with ssf.fio got 2274 MB/s, the best read number I've seen yet. All CPUs are still busy during the read phase, but in the 91-94% range, no longer at 100%.

Oprofile stats from read phase:

vma      samples  cum. samples  %        cum. %     linenr info                 app name                 symbol name
0000000000052500 1533121  1533121        8.2425   8.2425    cl_page.c:401               obdclass.ko              cl_page_find0
0000000000000000 1353993  2887114        7.2794  15.5219    (no location information)   lustre                   /lustre
ffffffff8126c5e0 1045457  3932571        5.6207  21.1425    dec_and_lock.c:21           vmlinux                  _atomic_dec_and_lock
ffffffff812778c0 888524   4821095        4.7769  25.9195    copy_user_64.S:240          vmlinux                  copy_user_generic_string
0000000000050ff0 603333   5424428        3.2437  29.1632    cl_page.c:719               obdclass.ko              cl_vmpage_page
Comment by James A Simmons [ 16/Aug/12 ]

I have also been doing some benchmarks with this patch. For single-client performance, using all the memory, I ran the following command line:

IOR -a POSIX -i 5 -C -v -g -w -r -e -b 32G -t 4m -o /lustre/barry/scratch/jsimmons/wc_ir_ior_test_2629.arthur-sys0.ccs.ornl.gov/testfile.out

Without patch -

Single Shared file:
Max Write: 244.93 MiB/sec (256.83 MB/sec)
Max Read: 141.05 MiB/sec (147.90 MB/sec)

With patch -

Single Shared file:
Max Write: 296.17 MiB/sec (310.56 MB/sec)
Max Read: 967.77 MiB/sec (1014.78 MB/sec)

With multiple clients (18 nodes) and a file stripe count of 28, I get the following with this command:

IOR -a POSIX -i 5 -C -v -g -w -r -e -b 64m -t 4m -o /lustre/barry/scratch/jsimmons/wc_ir_ior_test_2627.arthur-sys0.ccs.ornl.gov/testfile.out

without patch

Max Write: 2163.72 MiB/sec (2268.83 MB/sec)
Max Read: 2217.49 MiB/sec (2325.21 MB/sec)

with patch

Max Write: 2293.95 MiB/sec (2405.38 MB/sec)
Max Read: 3273.48 MiB/sec (3432.49 MB/sec)

I'm also seeing improvement in the file-per-process case. This patch is a step in the right direction.

Comment by Ned Bass [ 16/Aug/12 ]

Hit this assertion with patch http://review.whamcloud.com/3627 running ssf.fio test on a BGQ IO node.

2012-08-16 15:46:27.296017 {DefaultControlEventListener} [mmcs]{16}.0.2:  kernel:LustreError: 5769:0:(osc_page.c:431:osc_page_delete()) ASSERTION( 0 ) failed: 
2012-08-16 15:46:27.296967 {DefaultControlEventListener} [mmcs]{16}.0.2: ^GMessage from syslogd@(none) at Aug 16 15:46:27 ...
2012-08-16 15:46:27.297419 {DefaultControlEventListener} [mmcs]{16}.0.2:  kernel:LustreError: 5769:0:(osc_page.c:431:osc_page_delete()) LBUG
2012-08-16 15:46:27.297863 {DefaultControlEventListener} [mmcs]{16}.10.3: Kernel panic - not syncing: LBUG
2012-08-16 15:46:27.298321 {DefaultControlEventListener} [mmcs]{16}.10.3: Call Trace:
2012-08-16 15:46:27.299124 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa550] [c000000000008190] .show_stack+0x7c/0x184 (unreliable)
2012-08-16 15:46:27.299627 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa600] [c000000000428ebc] .panic+0x80/0x1a8
2012-08-16 15:46:27.300102 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa690] [8000000002fd12c0] .lbug_with_loc+0xb0/0xc0 [libcfs]
2012-08-16 15:46:27.300671 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa720] [8000000007e7f3cc] .osc_page_delete+0x2cc/0x390 [osc]
2012-08-16 15:46:27.301057 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa7f0] [80000000053fb8c0] .cl_page_delete0+0x150/0x5c0 [obdclass]
2012-08-16 15:46:27.301479 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa8d0] [80000000053fbdc8] .cl_page_delete+0x98/0x1a0 [obdclass]
2012-08-16 15:46:27.301970 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaa990] [80000000097a5f74] .ll_invalidatepage+0xc4/0x1b0 [lustre]
2012-08-16 15:46:27.302370 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaaa30] [80000000097bc568] .vvp_page_discard+0xf8/0x1a0 [lustre]
2012-08-16 15:46:27.302768 {DefaultControlEventListener} [mmcs]{16}.0.2: [c00000034aaaaac0] [80000000053f7e84] .cl_page_invoid+0x104/0x220 [obdclass]
2012-08-16 15:46:27.303311 {DefaultControlEventListener} [mmcs]{16}.0.2: [c00000034aaaab90] [8000000007e7c89c] .discard_pagevec+0xdc/0x150 [osc]
2012-08-16 15:46:27.303778 {DefaultControlEventListener} [mmcs]{16}.0.2: [c00000034aaaac50] [8000000007e7cf5c] .osc_lru_shrink+0x64c/0xf90 [osc]
2012-08-16 15:46:27.304146 {DefaultControlEventListener} [mmcs]{16}.0.2: [c00000034aaaadb0] [8000000007e7f7d8] .osc_page_init+0x348/0xcf0 [osc]
2012-08-16 15:46:27.304672 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaaf20] [80000000053fda78] .cl_page_find0+0x2f8/0x990 [obdclass]
2012-08-16 15:46:27.305536 {DefaultControlEventListener} [mmcs]{16}.0.2: ^GMessage from syslogd@(none) at Aug 16 15:46:27 ...
2012-08-16 15:46:27.306031 {DefaultControlEventListener} [mmcs]{16}.0.2:  kernel:Kernel panic - not syncing: LBUG
2012-08-16 15:46:27.306421 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab040] [8000000008897730] .lov_page_init_raid0+0x240/0x820 [lov]
2012-08-16 15:46:27.306875 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab150] [8000000008893bd4] .lov_page_init+0x84/0x120 [lov]
2012-08-16 15:46:27.307234 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab1f0] [80000000053fda78] .cl_page_find0+0x2f8/0x990 [obdclass]
2012-08-16 15:46:27.313843 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab310] [8000000009787a4c] .ll_cl_init+0x11c/0x540 [lustre]
2012-08-16 15:46:27.314287 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab400] [800000000978a194] .ll_prepare_write+0xa4/0x240 [lustre]
2012-08-16 15:46:27.314838 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab4d0] [80000000097a635c] .ll_write_begin+0xcc/0x220 [lustre]
2012-08-16 15:46:27.315272 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab5a0] [c00000000009589c] .generic_file_buffered_write+0x140/0x354
2012-08-16 15:46:27.315789 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab6e0] [c00000000009600c] .__generic_file_aio_write+0x374/0x3d8
2012-08-16 15:46:27.316270 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab7e0] [80000000097bf864] .vvp_io_write_start+0x114/0x300 [lustre]
2012-08-16 15:46:27.316783 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab8c0] [8000000005408f3c] .cl_io_start+0xdc/0x1d0 [obdclass]
2012-08-16 15:46:27.317267 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaab970] [800000000540f3a4] .cl_io_loop+0x144/0x250 [obdclass]
2012-08-16 15:46:27.317758 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaaba30] [800000000975509c] .ll_file_io_generic+0x33c/0x510 [lustre]
2012-08-16 15:46:27.318270 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaabb30] [80000000097556b4] .ll_file_aio_write+0x1e4/0x340 [lustre]
2012-08-16 15:46:27.318726 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaabc00] [8000000009755970] .ll_file_write+0x160/0x2d0 [lustre]
2012-08-16 15:46:27.319202 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaabce0] [c0000000000c8234] .vfs_write+0xd0/0x1c4
2012-08-16 15:46:27.319739 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaabd80] [c0000000000c8424] .SyS_write+0x54/0x98
2012-08-16 15:46:27.320127 {DefaultControlEventListener} [mmcs]{16}.10.3: [c00000034aaabe30] [c000000000000580] syscall_exit+0x0/0x2c
Comment by Prakash Surya (Inactive) [ 16/Aug/12 ]

Ned, I saw that in the logs as well. Have you looked at LU-1723? It looks similar, but the stacks are different.

Comment by Ned Bass [ 16/Aug/12 ]

No I hadn't noticed that bug, thanks for the pointer.

Comment by Jinshan Xiong (Inactive) [ 16/Aug/12 ]

Hi Ned, have you had any luck grabbing the page debug info, such as in LU-1723?

Comment by Ned Bass [ 17/Aug/12 ]

Yes, sorry, I neglected to include it above.

2012-08-16 15:46:27.275649 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_cache.c:2338:osc_teardown_async_page()) extent c0000003642caf38@{[361184 -> 361199/361199], [3|1|-|active|wi|c00000029bfc0240], [1048576|16|+|-|c000000295f440b0|16|(null)]} trunc at 361195.
2012-08-16 15:46:27.276084 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) page@c000000343b9ed40[2 c0000003c0a2ee98:2889515 ^(null)_c000000343b9ead0 4 0 1 (null) (null) 0x0]
2012-08-16 15:46:27.276522 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) page@c000000343b9ead0[1 c00000036d6e4e68:361195 ^c000000343b9ed40_(null) 4 0 1 (null) (null) 0x0]
2012-08-16 15:46:27.276915 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) vvp-page@c0000003633dce90(0:0:0) vm@c0000000015ddce8 71 2:0 0 2889515 lru
2012-08-16 15:46:27.277362 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) lov-page@c000000349140930
2012-08-16 15:46:27.277779 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) osc-page@c00000038c90d328: 1< 0x845fed 2 0 + - > 2< 23671275520 0 65536 0x0 0x420 | (null) c00000029159a7b0 c00000029bfc0240 > 3< + c0000003e9990740 0 0 0 > 4< 0 0 8 18677760 - | + - + - > 5< + - + - | 0 - | 64 - ->
2012-08-16 15:46:27.278174 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) end page@c000000343b9ed40
2012-08-16 15:46:27.278663 {DefaultControlEventListener} [mmcs]{16}.10.3: LustreError: 5769:0:(osc_page.c:430:osc_page_delete()) Trying to teardown failed: -16
Comment by Jinshan Xiong (Inactive) [ 17/Aug/12 ]

This problem was introduced in LU-744; I will work out a patch soon.

Comment by Jinshan Xiong (Inactive) [ 17/Aug/12 ]

Please use patch set 20 of LU-744.

Comment by Gregoire Pichon [ 05/Dec/12 ]

One of our customers (TGCC - lustre 2.1.3) is facing a performance issue when reading the same file from many tasks on one client at the same time. Performance drops as the number of tasks increases.

I have reproduced the issue and obtained these results:

    tasks     Mean    Total (MB/s)
dd      1      339      339
dd      2      107      214
dd      4       78      312
dd      8       64      512
dd     16       26      428

Applying the patch http://review.whamcloud.com/#change,3627 on lustre 2.1.3 gives no real improvement.

    tasks     Mean    Total (MB/s)
dd      1      542      542
dd      2      118      236
dd      4       74      298
dd      8       73      591
dd     16       52      833

The profiling shows contention in CLIO page routines.
(I suppose the libpython entry comes from the mpirun synchronization of the 16 tasks)

samples  %        linenr info                 app name                 symbol name
1149893  34.8489  (no location information)   libpython2.6.so.1.0      /usr/lib64/libpython2.6.so.1.0
338676   10.2640  dec_and_lock.c:21           vmlinux                  _atomic_dec_and_lock
335241   10.1599  cl_page.c:404               obdclass.ko              cl_page_find0
293448    8.8933  cl_page.c:729               obdclass.ko              cl_vmpage_page
235856    7.1479  filemap.c:963               vmlinux                  grab_cache_page_nowait
142686    4.3243  cl_page.c:661               obdclass.ko              cl_page_put
80065     2.4265  filemap.c:667               vmlinux                  find_get_page
46623     1.4130  copy_user_64.S:240          vmlinux                  copy_user_generic_string
34127     1.0343  (no location information)   libpthread-2.12.so       pthread_rwlock_rdlock
33685     1.0209  (no location information)   libc-2.12.so             getenv
31725     0.9615  (no location information)   libpthread-2.12.so       pthread_rwlock_unlock
30492     0.9241  swap.c:288                  vmlinux                  activate_page
28528     0.8646  (no location information)   libc-2.12.so             __dcigettext
23703     0.7183  (no location information)   libc-2.12.so             __strlen_sse42
22345     0.6772  vvp_page.c:120              lustre.ko                vvp_page_unassume
20646     0.6257  swap.c:183                  vmlinux                  put_page
18525     0.5614  intel_idle.c:215            vmlinux                  intel_idle
12774     0.3871  open.c:961                  vmlinux                  sys_close
12702     0.3849  filemap.c:579               vmlinux                  unlock_page
12685     0.3844  (no location information)   libc-2.12.so             __memcpy_ssse3
12134     0.3677  cl_page.c:874               obdclass.ko              cl_page_invoid
11734     0.3556  (no location information)   libpthread-2.12.so       __close_nocancel
10573     0.3204  cl_page.c:884               obdclass.ko              cl_page_owner_clear
10510     0.3185  (no location information)   libc-2.12.so             strerror_r
9783      0.2965  cl_page.c:1068              obdclass.ko              cl_page_unassume
9472      0.2871  (no location information)   oprofiled                /usr/bin/oprofiled
9438      0.2860  (no location information)   libc-2.12.so             __strncmp_sse2
8826      0.2675  rw.c:724                    lustre.ko                ll_readahead
8769      0.2658  filemap.c:527               vmlinux                  page_waitqueue
8095      0.2453  entry_64.S:470              vmlinux                  system_call_after_swapgs
7756      0.2351  cl_page.c:561               obdclass.ko              cl_page_state_set0
7615      0.2308  (no location information)   libc-2.12.so             __stpcpy_sse2
7366      0.2232  entry_64.S:462              vmlinux                  system_call
7097      0.2151  sched.c:4430                vmlinux                  find_busiest_group
6898      0.2091  (no location information)   libperl.so               /usr/lib64/perl5/CORE/libperl.so
6887      0.2087  (no location information)   libc-2.12.so             __ctype_b_loc
6780      0.2055  wait.c:251                  vmlinux                  __wake_up_bit
6631      0.2010  cl_page.c:139               obdclass.ko              cl_page_at_trusted
6588      0.1997  (no location information)   libc-2.12.so             __strlen_sse2
6336      0.1920  lov_io.c:239                lov.ko                   lov_page_stripe
5289      0.1603  lov_io.c:208                lov.ko                   lov_sub_get
5263      0.1595  ring_buffer.c:2834          vmlinux                  rb_get_reader_page
4595      0.1393  lvfs_lib.c:76               lvfs.ko                  lprocfs_counter_add
4373      0.1325  (no location information)   libc-2.12.so             strerror
4347      0.1317  lov_io.c:252                lov.ko                   lov_page_subio
4337      0.1314  cl_page.c:1036              obdclass.ko              cl_page_assume
4310      0.1306  (no location information)   libc-2.12.so             __mempcpy_sse2
3903      0.1183  lov_page.c:92               lov.ko                   lov_page_own
3866      0.1172  mutex.c:409                 vmlinux                  __mutex_lock_slowpath
3841      0.1164  lcommon_cl.c:1097           lustre.ko                cl2vm_page
3698      0.1121  radix-tree.c:414            vmlinux                  radix_tree_lookup_slot
3506      0.1063  vvp_page.c:109              lustre.ko                vvp_page_assume
3491      0.1058  lu_object.c:1115            obdclass.ko              lu_object_locate
3490      0.1058  (no location information)   ipmi_si.ko               port_inb
3457      0.1048  ring_buffer.c:3221          vmlinux                  ring_buffer_consume
3431      0.1040  cl_page.c:898               obdclass.ko              cl_page_owner_set

Does the LU-1666 patch need any adaptations to be applied to the lustre 2.1.3 branch?
Would this use case require additional patches to be fixed?

I can open another JIRA ticket if needed and provide additional information (logs, dumps, profiling) as you wish.
Thanks.

Comment by Ned Bass [ 05/Dec/12 ]

Hi Gregoire,

You got a factor of two performance increase at 16 tasks, so why do you say there was no real improvement? I think the same optimization should apply with minimal adaptation to 2.1, since the same contention point exists there.

Comment by Gregoire Pichon [ 06/Dec/12 ]

Ned,
It's true the LU-1666 patch gives some improvement at 16 tasks. Still, I would have expected per-task performance to stay at the level of the one-reader case (or only slightly below it), which is what I observe on ext4, for instance.

Comment by Gregoire Pichon [ 12/Dec/12 ]

As proposed by Jinshan, I have opened a separate ticket for the parallel read performance issue: LU-2481 "performance issue of parallel read".

Comment by Andreas Dilger [ 18/Feb/13 ]

Gregoire submitted a b2_1 version of this patch at http://review.whamcloud.com/5428.

Generated at Sat Feb 10 01:18:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.