[LU-3753] MDS operation rate dropping sporadically Created: 14/Aug/13  Updated: 10/Mar/14  Resolved: 10/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Daire Byrne (Inactive) Assignee: Alex Zhuravlev
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File bravo.txt.bz2     JPEG File mds-pauses.jpeg     JPEG File mds-pauses2.jpeg    
Issue Links:
Related
is related to LU-3786 performance-sanity test_6: mkdir hung... Resolved
Severity: 3
Rank (Obsolete): 9680

 Description   

Hi,

I have been investigating why our metadata rate keeps dropping to nothing every 30 seconds or so, for a few seconds at a time. Looking at the debug logs, it appears to be stuck doing many osp_sync_request_commit_cb() calls. I wouldn't expect these to block all other operations, so maybe this is not the real issue and they are simply the only operations occurring when we see the rate drop to zero. I'm pretty sure the RAID and storage path is fine. Here is a snippet illustrating the issue:

/proc/fs/lustre/mds/MDS/mdt/stats @ 1375970781.677198
Name Cur.Count Cur.Rate #Events Unit last min avg max stddev
req_waittime 0 0 803871957 [usec] 0 2 6.63 10477 8.15
req_qdepth 0 0 803871957 [reqs] 0 0 0.00 12 0.07
req_active 0 0 803871957 [reqs] 0 1 2.67 16 1.79
req_timeout 0 0 803871957 [sec] 0 1 15.50 37 14.29
reqbuf_avail 0 0 1709293570[bufs] 0 47 63.79 64 0.59
ldlm_ibits_enqueue 0 0 499159709 [reqs] 0 1 1.00 1 0.00
mds_getattr 0 0 3152776 [usec] 0 8 346.15 1453985 6393.00
mds_getattr_lock 0 0 158194 [usec] 0 9 83.28 381776 1418.81
mds_connect 0 0 76 [usec] 0 14 3477.13 147911 21249.32
mds_disconnect 0 0 19 [usec] 0 26 1178.21 15598 3757.50
mds_getstatus 0 0 4 [usec] 0 9 13.75 16 3.20
mds_statfs 0 0 2904 [usec] 0 5 19.87 2115 67.53
mds_sync 0 0 3 [usec] 0 116 10933.67 32562 18730.69
mds_getxattr 0 0 50841 [usec] 0 6 12.89 92 3.91
obd_ping 0 0 530449 [usec] 0 3 11.87 3851 8.76

00000004:00080000:4.0:1375970781.689383:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff8821062c4000, transno 8617232460
00000004:00080000:4.0:1375970781.689384:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff882d394d4c00, transno 8617232461
00000004:00080000:4.0:1375970781.689384:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883e03485800, transno 8617232462
00000004:00080000:4.0:1375970781.689385:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883479fcdc00, transno 8617232463
00000004:00080000:4.0:1375970781.689386:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883515c1f400, transno 8617232464
00000004:00080000:4.0:1375970781.689387:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883d76783000, transno 8617232465

I am also attaching a graph showing the MDS rate drops over a few minutes. I can upload some full debug logs from the MDS when the rate drops to zero if that helps.

The workload runs on two clients and copies, hard-links and unlinks thousands of directory trees. Both clients suffer performance drops at the same time as the MDS rate drops to zero.
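
For reference, a rough sketch of how this kind of per-second view can be captured (the stats path is the one shown above; the llstat utility shipped with Lustre can produce the same per-interval deltas, e.g. "llstat -i 1 <stats file>"):

# dump the MDS request stats once per second
while true; do
    echo "=== $(date +%s.%N) ==="
    cat /proc/fs/lustre/mds/MDS/mdt/stats
    sleep 1
done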

Cheers,

Daire



 Comments   
Comment by Alex Zhuravlev [ 15/Aug/13 ]

are you running with ldiskfs or ZFS ?

Comment by Daire Byrne (Inactive) [ 15/Aug/13 ]

No ZFS just the standard ldiskfs.

Comment by Alex Zhuravlev [ 15/Aug/13 ]

ok, thanks. I've been working on a patch to batch MDS-OST synchronization, but it's not a priority at the moment and it's not quite ready yet..

Comment by Daire Byrne (Inactive) [ 15/Aug/13 ]

Ok. Well, we are happy to be early adopters and test it whenever. Our workload seems to show this effect quite well.

Even without the batching, though, is it expected that the MDS rate would drop to zero while the synchronisation calls occur? I would have thought this would have shown up at other sites testing v2.4, or in artificial benchmarks, as it is a major performance regression relative to v1.8 IMHO.

Is there anything we can do to alleviate this effect in the meantime? e.g. tweak the number of threads, local client locks, async journals, etc.
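
For what it's worth, these are the kinds of knobs I have in mind (a sketch only; the parameter names are assumptions for a 2.4 ldiskfs setup, so please check "lctl list_param" on the actual nodes):

lctl get_param mds.MDS.mdt.threads_max       # MDS service thread count
lctl get_param obdfilter.*.sync_journal      # on the OSSes: 0 = async journal commits
lctl get_param ldlm.namespaces.*.lru_size    # on the clients: local lock LRU size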

Cheers.

Comment by Alex Zhuravlev [ 15/Aug/13 ]

that's true, even with no batching those threads shouldn't be consuming much CPU or blocking user requests. I'll think about using lower priorities for these threads and things like schedule().

Comment by Alex Zhuravlev [ 18/Aug/13 ]

would you mind trying http://review.whamcloud.com/#/c/7343/ please?

Comment by James Braid (Inactive) [ 20/Aug/13 ]

Hi Alex

I've applied the patch. It seems like it may have made a very small improvement. Is there more debugging info we can provide to assist? We still see frequent huge performance drops on our clients.

Here is the collectl output from a client - note how the write rate drops to zero and then comes back again.

#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
  28  17 39967  89269 561M   9M   8G   8G   1G  12G   4096M     4G   513295   2133
  32  20 39782  95362 692M   9M   8G   8G   1G  12G   4096M     4G   571402   2340
  30  18 39539  97551 459M   9M   8G   8G   1G  12G   4096M     4G   589446   2446
  34  22 43882 104601 539M   9M   8G   8G   1G  12G   4096M     4G   584120   2417
  24  16 32688  75601 529M   9M   8G   8G   1G  12G   4096M     4G   459916   1883
  27  16 34082  81696 481M   9M   8G   8G   1G  12G   4096M     4G   549843   2231
   9   5 13125  24345 602M   9M   8G   8G   1G  12G   4096M     4G   165923    651
   1   0   657    171 636M   9M   8G   8G   1G  12G   4096M     4G        0      0
   0   0   186    241 637M   9M   8G   8G   1G  12G   4096M     4G        0      0
   0   0    99    165 638M   9M   8G   8G   1G  12G   4096M     4G        0      0
   0   0   429    505 638M   9M   8G   8G   1G  12G   4096M     4G        0      0
  16  10 23053  51084 470M   9M   8G   8G   1G  12G   4096M     4G   318397   1298
  32  20 39637  91021 341M   9M   8G   8G   1G  12G   4096M     4G   593453   2422
  35  22 43514  98099 288M   9M   8G   8G   1G  12G   4096M     4G   646799   2640
  28  18 38110  87991 367M   9M   8G   8G   1G  12G   4096M     4G   520423   2141
  10   6 15983  30703 372M   9M   8G   8G   1G  12G   4096M     4G   215376    880
   9   5 15507  28817 253M   9M   8G   8G   1G  12G   4096M     4G   208647    870
  30  19 38929  94465 233M   9M   8G   8G   1G  12G   4096M     4G   541840   2236
  27  17 36664  85198 166M   9M   8G   8G   1G  12G   4096M     4G   508441   2139
  32  20 40896  94522 614M   9M   8G   8G   1G  12G   4096M     4G   556058   2326
  29  17 39580  88633 444M   9M   8G   8G   1G  12G   4096M     4G   560600   2305
  27  17 38357  81376 605M   9M   8G   8G   1G  12G   4096M     4G   488711   2036
  28  17 38639  86526 698M   9M   8G   8G   1G  12G   4096M     4G   522031   2182
  28  18 38191  88627 659M   9M   8G   8G   1G  12G   4096M     4G   539818   2217
   6   4 11154  22075 630M   9M   8G   8G   1G  12G   4096M     4G   158178    621
   0   0   101    160 629M   9M   8G   8G   1G  12G   4096M     4G        0      0
   0   0   163    240 629M   9M   8G   8G   1G  12G   4096M     4G        0      0
   0   0   260    371 630M   9M   8G   8G   1G  12G   4096M     4G        0      0
  13   8 22627  50627 421M   9M   8G   8G   1G  12G   4096M     4G   262699   1110
  29  18 40373  90854 581M   9M   8G   8G   1G  12G   4096M     4G   531336   2236
  33  22 40703 102870 599M   9M   8G   8G   1G  12G   4096M     4G   541039   2288
  31  20 41802 103343 724M   9M   8G   8G   1G  12G   4096M     4G   565820   2372
  29  19 38808  94022 660M   9M   8G   8G   1G  12G   4096M     4G   516110   2175
  30  19 40539 101639 533M   9M   8G   8G   1G  12G   4096M     4G   547567   2279
  28  17 39459  92811 527M   9M   8G   8G   1G  12G   4096M     4G   535465   2232
  29  19 39342  89179 595M   9M   8G   8G   1G  12G   4096M     4G   528147   2185
  29  19 41263  91443 513M   9M   8G   8G   1G  12G   4096M     4G   522968   2151
  23  15 36688  78033 550M   9M   8G   8G   1G  12G   4096M     4G   452367   1866
  32  20 40583  99159 591M   9M   8G   8G   1G  12G   4096M     4G   565728   2312
  32  21 39717  94413 650M   9M   8G   8G   1G  12G   4096M     4G   566728   2310
  31  20 39367  92265 858M   9M   8G   8G   1G  12G   4096M     4G   576691   2359
  31  20 39148  91641 650M   9M   8G   8G   1G  12G   4096M     4G   576751   2355
  29  18 37997  90133 376M   9M   8G   8G   1G  12G   4096M     4G   550331   2270
  16  11 21152  42319 734M   9M   8G   8G   1G  12G   4096M     4G   253585   1036
   0   0   556   1003 723M   9M   8G   8G   1G  12G   4096M     4G     6288     26
Comment by Alex Zhuravlev [ 20/Aug/13 ]

can you grab the same stats from MDS please?

Comment by James Braid (Inactive) [ 20/Aug/13 ]

Hi Alex

collectl doesn't seem to work with the 2.x MDS, but here are the stats with the non-Lustre-specific info. The drops in interrupts/ctxsw coincide with low client performance.

#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre MDS-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map Gattr+ Sattr+   Sync  Unlnk 
   0   0 14990  19822 141G  76G 338M  43G  31G 125M      0      0      0      0 
   0   0 19252  26953 141G  76G 338M  43G  31G 125M      0      0      0      0 
   1   1 33080  50769 141G  76G 338M  43G  31G 125M      0      0      0      0 
   2   2 46215  74568 141G  76G 337M  43G  31G 125M      0      0      0      0 
   2   2 46757  75867 141G  76G 337M  43G  31G 125M      0      0      0      0 
   2   2 46323  71085 141G  76G 338M  43G  31G 125M      0      0      0      0 
   2   2 46438  74262 141G  76G 338M  43G  31G 125M      0      0      0      0 
   3   3 46285  72427 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 44278  67984 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 43048  66573 141G  76G 338M  43G  31G 125M      0      0      0      0 
   3   3 41457  63118 141G  76G 338M  43G  31G 125M      0      0      0      0 
   3   3 43484  66497 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 41665  64117 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 41082  61940 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 40711  60277 141G  76G 338M  43G  31G 125M      0      0      0      0 
   0   0 10233  15577 141G  76G 336M  43G  31G 125M      0      0      0      0 
   0   0   161    243 141G  76G 338M  43G  31G 125M      0      0      0      0 
   0   0  5366   7845 141G  76G 338M  43G  31G 125M      0      0      0      0 
   4   4 46224  70689 141G  76G 338M  43G  31G 125M      0      0      0      0 
   3   3 41865  62386 141G  76G 338M  43G  31G 125M      0      0      0      0 
   4   4 43553  66924 141G  76G 338M  43G  31G 125M      0      0      0      0 
   3   3 43779  67676 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 41717  63335 141G  76G 338M  43G  31G 125M      0      0      0      0 
   0   0   969   1122 141G  76G 338M  43G  31G 125M      0      0      0      0 
   0   0   228    277 141G  76G 338M  43G  31G 125M      0      0      0      0 
   0   0   337    823 141G  76G 338M  43G  31G 125M      0      0      0      0 
   0   0   151    243 141G  76G 338M  43G  31G 125M      0      0      0      0 
   0   0   187    262 141G  76G 338M  43G  31G 125M      0      0      0      0 
   1   1 21902  33261 141G  76G 336M  43G  31G 125M      0      0      0      0 
   4   4 43533  67827 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 43230  66712 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 41791  64745 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 41640  63042 141G  76G 338M  43G  31G 125M      0      0      0      0 
   2   2 43619  51296 141G  76G 337M  43G  31G 125M      0      0      0      0 
   2   2 50002  79991 141G  76G 337M  43G  31G 125M      0      0      0      0 
   2   2 55406  88595 141G  76G 338M  43G  31G 125M      0      0      0      0 
   2   2 48724  77648 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 46859  75466 141G  76G 338M  43G  31G 125M      0      0      0      0 
   3   3 47779  76225 141G  76G 337M  43G  31G 125M      0      0      0      0 
   3   3 46867  75415 141G  76G 337M  43G  32G 125M      0      0      0      0 
   3   3 44883  71390 141G  76G 337M  43G  32G 125M      0      0      0      0 
   3   3 44000  69896 141G  76G 338M  43G  32G 125M      0      0      0      0 
   2   2 38649  61769 141G  76G 338M  43G  32G 125M      0      0      0      0 
   0   0   721    824 141G  76G 338M  43G  32G 125M      0      0      0      0 
   0   0   180    240 141G  76G 338M  43G  32G 125M      0      0      0      0 
   0   0   333    829 141G  76G 338M  43G  32G 125M      0      0      0      0 
   0   0  9742  11493 141G  76G 338M  43G  32G 125M      0      0      0      0 
   0   0 15017  17764 141G  76G 338M  43G  32G 125M      0      0      0      0 
Comment by Alex Zhuravlev [ 20/Aug/13 ]

is there a way to include low-level (block) I/O in the report?

Comment by James Braid (Inactive) [ 20/Aug/13 ]

Hi Alex

Here is the collectl output with disk info included:

#<----CPU[HYPER]-----><-----------Memory-----------><----------Disks----------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes 
   0   0 15419   9478 121G  90G 345M  51G  38G 151M    624    127   9536   2266 
   0   0  3174   1543 121G  90G 345M  51G  38G 151M     72     18   2156    514 
   0   0   694    305 121G  90G 346M  51G  38G 152M      0      0     64     15 
   0   0 10080   6090 121G  90G 345M  51G  38G 152M    456     84   7048   1603 
   0   0 18893  12913 121G  90G 345M  51G  38G 152M   1052    205  15158   3395 
   3   3 66971  62305 121G  90G 346M  51G  38G 153M   4392    892  73522  15416 
   3   3 74302  69772 121G  90G 345M  51G  38G 154M   5320   1103  83760  17784 
   3   3 78258  73317 121G  90G 345M  51G  38G 154M   5188   1059  84000  19377 
   4   4 81342  72783 121G  90G 345M  51G  38G 155M   5372   1078  83952  20370 
   4   4 79303  73135 121G  90G 345M  51G  38G 155M   5172   1115  86600  20962 
   4   4 82544  73869 121G  90G 345M  51G  38G 156M   6528   1214  89628  20826 
   3   3 70805  58015 121G  90G 345M  51G  38G 157M   3748    774 199530  25660 
   0   0 26491  21617 121G  90G 345M  51G  38G 157M   1716    370  20936   4663 
   2   1 64310  55434 121G  90G 345M  51G  38G 170M   4312    884  50792  12617 
   1   1 48974  41562 121G  90G 345M  51G  38G 171M   3124    661  40344   9558 
   3   3 87339  84892 121G  90G 345M  51G  38G 171M   6664   1441  87324  21470 
   3   3 89330  84011 121G  90G 345M  51G  38G 172M   7856   1589  85492  21058 
   5   3 89911  81401 121G  90G 345M  51G  38G 178M   7864   1479  80904  20404 
   4   4 84957  81147 121G  90G 345M  51G  38G 160M   7076   1492  83104  20472 
   4   4 85601  82894 121G  90G 345M  51G  38G 160M   6652   1362  80336  20092 
   4   4 81877  76701 121G  90G 345M  51G  38G 160M   6988   1357  83660  19825 
   4   4 76911  70142 121G  90G 345M  51G  38G 160M   6032   1140  86088  18022 
   3   3 75907  66537 121G  90G 345M  51G  38G 160M   5120   1055  81692  18017 
   3   3 65582  58071 121G  90G 343M  51G  38G 161M   4676    899  70736  15957 
   0   0  1158    890 121G  90G 346M  51G  38G 161M      0      0    122     41 
   0   0  1421    800 121G  90G 345M  51G  38G 161M     40     10    560    124 
   0   0   646    312 121G  90G 346M  51G  38G 161M      4      1      0      0 
   0   0  3744   2393 121G  90G 345M  51G  38G 161M    244     33   2161    518 
   3   3 67809  59011 121G  90G 345M  51G  38G 161M   5064    851  73344  16346 
   3   3 75089  65527 121G  90G 345M  51G  38G 161M   5020    996  83584  18005 
   1   1 24734  20954 121G  90G 345M  51G  38G 161M   2036    346  25896   5893 
   0   0  1517    427 121G  90G 346M  51G  38G 161M      0      0      4      1 
   1   1 37255  31162 121G  90G 345M  51G  38G 161M   2056    427  39380   8705 
   4   4 79364  70818 121G  90G 345M  51G  38G 161M   5636   1088  89616  20124 
   4   3 74635  70406 121G  90G 345M  51G  38G 161M   4776   1029  88984  19081 
   3   3 74168  69116 121G  90G 345M  51G  38G 162M   5896   1059  85292  18438 
   4   4 79242  71251 121G  90G 345M  51G  38G 162M   5800   1149  89636  19608 
   3   3 77076  70043 121G  90G 345M  51G  38G 162M   5344   1041  86020  18137 
   3   3 76394  68878 121G  90G 345M  51G  38G 162M   5172   1028  89460  19349 
   1   1 26792  24128 121G  90G 344M  51G  38G 162M   1684    337  30656   6404 
   0   0   561    255 121G  90G 346M  51G  38G 162M      0      0      4      1 
   0   0  3377   2410 121G  90G 345M  51G  38G 162M    108     27   3176    673 
   3   3 57772  50792 121G  90G 345M  51G  38G 162M   3864    738  67184  14451 
   4   4 76063  70173 121G  90G 344M  51G  38G 162M   4812    967  89464  18906 
   3   3 68062  63098 121G  90G 346M  51G  38G 162M   4432    857  74896  15953 
   3   3 71401  64585 121G  90G 345M  51G  38G 162M   4616    921  75924  16704 
   2   2 47990  41065 121G  90G 345M  51G  38G 163M   2380    538  55948  11602 
   0   0  4580   2754 121G  90G 346M  51G  38G 163M      8      2  39576   3569 

Here is perf top output while this capture was taking place

 22.97%  [kernel]                 [k] _spin_lock
 13.40%  [jbd2]                   [k] journal_clean_one_cp_list
  1.53%  [kernel]                 [k] find_busiest_group
  1.43%  [kernel]                 [k] kmem_cache_free
  1.22%  [kernel]                 [k] _spin_lock_irqsave
  1.21%  [kernel]                 [k] kmem_cache_alloc
  1.00%  [jbd2]                   [k] __jbd2_journal_clean_checkpoint_list
  0.67%  [kernel]                 [k] _read_lock_irqsave
  0.60%  [kernel]                 [k] schedule
  0.55%  [kernel]                 [k] mempool_free
  0.53%  [kernel]                 [k] _spin_unlock_irqrestore
  0.51%  [kernel]                 [k] _spin_lock_irq
Comment by James Braid (Inactive) [ 23/Aug/13 ]

Hi Alex

Do you have any more ideas about this? We will have to roll back to Lustre 1.8.x on this system very soon unless we can find any way to improve the performance.

Thanks!

Comment by Alex Zhuravlev [ 23/Aug/13 ]

sorry, still thinking about this.. from the stats it doesn't look like the MDS is busy with anything at all. A bit of MDS-OST traffic should show up as a high number of RPCs (and so interrupts), I'd think.

what journal size do you use on the MDS?
can you paste /proc/fs/jbd2/<mds_device>/info after the test please?

Comment by James Braid (Inactive) [ 23/Aug/13 ]

Hi Alex

Here is the info from our 2.4.0 system (slow performance):

bmds1 /root # dumpe2fs -h /dev/lustre/mdt1
dumpe2fs 1.42.7.wc1 (12-Apr-2013)
Filesystem volume name:   bravo-MDT0000
Last mounted on:          /
Filesystem UUID:          f1416ea8-190c-4bd7-a8f4-27b013f278e0
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              3221225472
Block count:              3221225472
Reserved block count:     161061273
Free blocks:              2565247465
Free inodes:              2896159999
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      256
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         32768
Inode blocks per group:   4096
RAID stride:              256
RAID stripe width:        256
Flex block group size:    16
Filesystem created:       Wed Jul  3 14:49:25 2013
Last mount time:          Tue Aug 20 14:08:37 2013
Last write time:          Tue Aug 20 14:08:37 2013
Mount count:              9
Maximum mount count:      -1
Last checked:             Wed Jul  3 14:49:25 2013
Check interval:           0 (<none>)
Lifetime writes:          60 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      047637b0-966e-403b-9c32-0c549afe4180
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
Journal features:         journal_incompat_revoke
Journal size:             400M
Journal length:           102400
Journal sequence:         0x4160245e
Journal start:            14452

bmds1 /root # cat /proc/fs/jbd2/dm-12-8/info
69887438 transaction, each up to 25600 blocks
average:
  0ms waiting for transaction
  3ms running transaction
  0ms transaction was being locked
  0ms flushing data (in ordered mode)
  1ms logging transaction
  828us average transaction commit time
  8 handles per transaction
  20 blocks per transaction
  21 logged blocks per transaction

And from our 1.8.9 system (good performance):

[root@mds ~]# dumpe2fs -h /dev/md3
dumpe2fs 1.41.90.wc3 (28-May-2011)
Filesystem volume name:   lustre-MDT0000
Last mounted on:          /
Filesystem UUID:          fa42ca46-0585-426b-9308-761aaf6b7a11
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file uninit_bg
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              3805970432
Block count:              1902976990
Reserved block count:     95148849
Free blocks:              1283903845
Free inodes:              3519235719
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         16384
Fragments per group:      16384
Inodes per group:         32768
Inode blocks per group:   4096
Filesystem created:       Tue Dec 13 12:06:20 2011
Last mount time:          Sun May  5 11:42:06 2013
Last write time:          Sun May  5 11:42:06 2013
Mount count:              26
Maximum mount count:      20
Last checked:             Tue Dec 13 12:06:20 2011
Check interval:           0 (<none>)
Lifetime writes:          1815 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       3735290412
Default directory hash:   half_md4
Directory Hash Seed:      b589d099-f7c1-4677-ad85-dfc6024efb09
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             400M
Journal length:           102400
Journal sequence:         0x00959d1d
Journal start:            64343

[root@mds ~]# cat /proc/fs/jbd2/md3-8/info
1894322 transaction, each upto 25600 blocks
average:
  6ms waiting for transaction
  5000ms running transaction
  7ms transaction was being locked
  17ms flushing data (in ordered mode)
  60ms logging transaction
  0us average transaction commit time
  11796 handles per transaction
  3223 blocks per transaction
  3233 logged blocks per transaction
Comment by James Braid (Inactive) [ 23/Aug/13 ]

Hmm, just noticed the RAID stride values look a bit strange... I didn't format this system so I'll have to check in with Daire about what is going on there.

Comment by Alex Zhuravlev [ 23/Aug/13 ]

bmds1 /root # cat /proc/fs/jbd2/dm-12-8/info
69887438 transaction, each up to 25600 blocks
average:
0ms waiting for transaction
3ms running transaction
0ms transaction was being locked
0ms flushing data (in ordered mode)
1ms logging transaction
828us average transaction commit time
8 handles per transaction
20 blocks per transaction
21 logged blocks per transaction

if this was taken after some time (test or workload), then something is pretty wrong here.
the transactions are way too short (compare with the 1.8 numbers), which is usually a result of syncs (in explicit or implicit form).

do you use ACLs? can you check /proc/fs/lustre/mdd/*MDT0000/sync_permission: if it's 1, set it to 0 and try again?
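
something like this (a sketch; writing the proc file directly or using lctl set_param should be equivalent):

# check the current value; 1 means permission changes (chmod etc.) are committed synchronously
cat /proc/fs/lustre/mdd/*MDT0000/sync_permission
# turn it off and re-run the workload
lctl set_param mdd.*.sync_permission=0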

Comment by James Braid (Inactive) [ 23/Aug/13 ]

Hi Alex

That info was captured whilst running our production workload (dirvish backups).

We don't use ACLs but the dirvish processes do a very large number of chmod() operations.

I tried setting sync_permission to 0 but it doesn't appear to make much difference. Here is another sample of the jbd2 info:

bmds1 /root # cat /proc/fs/jbd2/dm-12-8/info                                                                                      
70236645 transaction, each up to 25600 blocks
average:
  0ms waiting for transaction
  3ms running transaction
  0ms transaction was being locked
  0ms flushing data (in ordered mode)
  1ms logging transaction
  254103us average transaction commit time
  9 handles per transaction
  20 blocks per transaction
  21 logged blocks per transaction
Comment by Alex Zhuravlev [ 24/Aug/13 ]

sorry to ask, but.. Are you sure dm-12-8 is /dev/lustre/mdt1 running bravo-MDT0000?
the stats you pasted are [censored] bad. this looks more like an OST with async_journal disabled.
can you check with debugfs -R stats <dm-12-8> please?

Comment by James Braid (Inactive) [ 24/Aug/13 ]

Hi Alex, yep, we only have a single LVM volume on the MDS. I've attached a file with the output of debugfs (and the jbd2 info and dumpe2fs).

Comment by Alex Zhuravlev [ 25/Aug/13 ]

ok, the good news is that there should be a significant improvement once the transactions are normal in terms of size and lifetime (as seen with 1.8).
i'm a bit surprised sync_permission=0 didn't help, have to think a bit..

Comment by Daire Byrne (Inactive) [ 02/Sep/13 ]

On further investigation, sync_permission=0 does help - it increases the overall IOPS on the MDT, but the drops in transactions are still present. I've attached a graph which shows sync_permission=0 at the beginning, sync_permission=1 for the middle section, and sync_permission=0 again for the final part of the plot.

Comment by Daire Byrne (Inactive) [ 02/Sep/13 ]

sync_permission=0, sync_permission=1 comparison. The rate is almost double with sync_permission=0.

Comment by Alex Zhuravlev [ 02/Sep/13 ]

thanks! I think it makes sense to grab the stats again (CPU, I/O, jbd info) with sync_permission=0.

Comment by Daire Byrne (Inactive) [ 02/Sep/13 ]

Well, the system hasn't been rebooted in a few weeks, so the JBD stats probably still contain a lot of sync_permission=1 data too. Either way, it doesn't look much different than before:

[root@bmds1 ~]# cat /proc/fs/jbd2/dm-12-8/info 
70424050 transaction, each up to 25600 blocks
average: 
  1ms waiting for transaction
  14ms running transaction
  0ms transaction was being locked
  0ms flushing data (in ordered mode)
  3ms logging transaction
  1061027us average transaction commit time
  55 handles per transaction
  47 blocks per transaction
  48 logged blocks per transaction
Comment by Alex Zhuravlev [ 02/Sep/13 ]

it's started to change, which is good. how about CPU and I/O? given this is less blocked by I/O now, I'd hope to see another bottleneck..

Comment by Daire Byrne (Inactive) [ 02/Sep/13 ]

collectl snippet

#<----CPU[HYPER]-----><-----------Memory-----------><----------Disks----------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes 
   1   1 23117  46950 239G   7G 388M   5G   2G 145M   3528    606  40624   2133 
   1   1 25639  53105 239G   7G 388M   5G   2G 145M   3712    586  25092    839 
   0   0 13059  23901 239G   7G 387M   5G   2G 145M   1708    239  22912   2751 
   0   0  7551  14811 239G   7G 388M   5G   2G 145M   1404    169  27388   2277 
   0   0  4868  12429 239G   7G 388M   5G   2G 145M      0      0  29500   5330 
   0   0  1807   4783 239G   7G 388M   5G   2G 145M      0      0  13836   2221 
   0   0  1508   3047 239G   7G 388M   5G   2G 145M      0      0  16784   1690 
   0   0  3434   6783 239G   7G 388M   5G   2G 145M      0      0  21604   3566 
   0   0  3689   5870 239G   7G 388M   5G   2G 145M      0      0  24728   3260 
   0   0  2288   3000 239G   7G 388M   5G   2G 145M      0      0  12500   1308 
   0   0  1249   1440 239G   7G 388M   5G   2G 145M      0      0  12832   1087 
   0   0   218    376 239G   7G 388M   5G   2G 145M      0      0    596     55 
   0   0  9066  39054 239G   7G 388M   5G   2G 145M      0      0  57684   9864 
   0   0 22196  31497 239G   7G 388M   5G   2G 145M    968    146  86420  15432 
   1   1 35652  54341 239G   7G 387M   5G   2G 145M   3316    525 127146  15395 
   2   2 42714 101785 239G   7G 387M   5G   2G 145M  11668   1394     18     16 
   3   3 54471 135731 239G   7G 388M   5G   2G 145M  13656   1807      0      0 
   3   3 51028 126309 239G   7G 388M   5G   2G 145M  14252   1855     16      3 
   3   3 49706 153783 239G   7G 387M   5G   2G 145M   8460   1316 125290  10353 

What I see is that when the MDS rate drops to zero there are still a lot of writes being pushed to disk, and then when the MDS rate recovers the writes drop off for a few seconds before continuing again.

I also rebooted the server to reset the jbd stats, so we only have sync_permission=0 data.

[root@bmds1 ~]# cat /proc/fs/jbd2/dm-12-8/info 
187 transaction, each up to 25600 blocks
average: 
  319ms waiting for transaction
  5001ms running transaction
  49ms transaction was being locked
  28ms flushing data (in ordered mode)
  821ms logging transaction
  359017us average transaction commit time
  10301 handles per transaction
  8828 blocks per transaction
  8854 logged blocks per transaction
Comment by Daire Byrne (Inactive) [ 09/Sep/13 ]

Alex,

I have spent some time trying to replicate our workload in a test environment but I have failed so far to reproduce the effect. Perhaps it is tied to our hardware/config in some way. We can provide you with remote access to the servers if you think that would be useful to you. The workload runs 24/7 and the servers are not in production so we can survive the odd crash/reboot if needed.

Comment by Alex Zhuravlev [ 10/Sep/13 ]

sure, I'd be happy to have a look at the live data.

Comment by Daire Byrne (Inactive) [ 10/Sep/13 ]

Great. Can you start a webex? Otherwise I can set up SSH, keys, etc.

Comment by Alex Zhuravlev [ 10/Sep/13 ]

I'd prefer SSH if possible.

Comment by Daire Byrne (Inactive) [ 10/Sep/13 ]

Okay. Can you tell me your IP and a private email that I can send the details to?

Comment by Alex Zhuravlev [ 10/Sep/13 ]

I'm seeing this:
req_waittime 22 22 2195399824[usec] 540 2 6.55 26729 7.74
req_waittime 0 0 2195399824[usec] 0 2 6.55 26729 7.74
req_waittime 17 17 2195399841[usec] 441 2 6.55 26729 7.74
req_waittime 0 0 2195399841[usec] 0 2 6.55 26729 7.74
req_waittime 3 3 2195399844[usec] 63 2 6.55 26729 7.74
req_waittime 17 17 2195399861[usec] 894 2 6.55 26729 7.74

and vmstat at the same time:
procs ----------memory--------- --swap- ----io--- -system- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 4416 8109060 105329448 114312 0 0 8 8 355 620 0 0 98 2 0
0 0 4416 8110104 105329448 114312 0 0 0 0 210 558 0 0 100 0 0
0 0 4416 8112220 105329456 114308 0 0 0 6224 1557 2293 0 0 99 0 0
0 0 4416 8112492 105329472 114300 0 0 0 9600 1530 3156 0 0 100 0 0
0 0 4416 8114856 105329472 114312 0 0 0 4624 456 2005 0 0 100 0 0
0 0 4416 8116680 105329488 114312 0 0 0 16 232 532 0 0 100 0 0

another example:
req_active 341 341 2199083146[reqs] 3336 1 4.28 38 2.77
req_active 5 5 2199083151[reqs] 11 1 4.28 38 2.77
req_active 0 0 2199083151[reqs] 0 1 4.28 38 2.77
req_active 17 17 2199083168[reqs] 95 1 4.28 38 2.77
req_active 0 0 2199083168[reqs] 0 1 4.28 38 2.77
req_active 0 0 2199083168[reqs] 0 1 4.28 38 2.77
req_active 0 0 2199083168[reqs] 0 1 4.28 38 2.77
req_active 0 0 2199083168[reqs] 0 1 4.28 38 2.77
req_active 0 0 2199083168[reqs] 0 1 4.28 38 2.77

0 2 4416 10686668 102630000 133880 0 0 16 3340 810 899 0 0 93 7 0
0 0 4416 10686924 102630016 133904 0 0 0 4360 1387 1217 0 0 98 2 0
0 0 4416 10688228 102630040 133876 0 0 0 9008 1030 1240 0 0 100 0 0
0 0 4416 10688856 102630040 133892 0 0 0 0 289 678 0 0 100 0 0
0 1 4416 10689848 102630072 133868 0 0 0 11532 1117 1118 0 0 98 2 0
0 0 4416 10691804 102630080 133896 0 0 0 4112 492 763 0 0 99 1 0
0 1 4416 10692256 102630096 133888 0 0 0 5824 892 728 0 0 97 3 0
0 1 4416 10693264 102630096 133900 0 0 0 0 220 412 0 0 97 3 0

as you can see the system was idling.

also I noticed the following in the jbd stats:

1046886us average transaction commit time, i.e. more than 1 second to commit a transaction. this is very long.

did you use the same hardware/software setup with 1.8 ?

Comment by Daire Byrne (Inactive) [ 10/Sep/13 ]

Unfortunately we did not use the same hardware/software for our v1.8 filesystem (which is still in production). The thing I don't understand is that when the system becomes idle there is no IO wait, and read speeds from the device seem to be okay, which is why I thought it was unlikely to be a RAID device issue.

I agree the jbd commit time is unreasonably high. Again, it was not clear to me at what level this slowness occurs - HW RAID, LVM, RAID10, ext4... So you think Lustre is just waiting on the journal to be written, and that is why it stops processing requests?

Comment by Daire Byrne (Inactive) [ 10/Sep/13 ]

I wonder if mounting with "nobarrier" will help? e.g. #LU-979
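
i.e. something like this (a sketch; the mount point is assumed, and the option is passed through to ldiskfs via the Lustre mount):

umount /mnt/mdt
mount -t lustre -o nobarrier /dev/lustre/mdt1 /mnt/mdt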

Comment by Daire Byrne (Inactive) [ 10/Sep/13 ]

"nobarrier" had no effect.

Comment by Daire Byrne (Inactive) [ 12/Sep/13 ]

So I think the best course of action now is to install Lustre v1.8.9 on this hardware and compare the performance with the same workload. Unfortunately we will have to change one element - the v1.8.9 Whamcloud kernel (EL6.2) does not support the RAID10 LVM layout we are currently using (EL6.4), so we will have to use an MD RAID10 with LVM on top.

I will report back later next week once this work has been done. Thanks for your time.

Comment by Alex Zhuravlev [ 12/Sep/13 ]

if you have additional storage on the MDS node, I'd try with an external journal.. but 1.8 would be interesting, definitely.

Comment by Daire Byrne (Inactive) [ 12/Sep/13 ]

Well, we can easily add some SSDs to the MDS server. Is it fairly easy to add an external journal to an existing MDT filesystem? It is worth a try before we reformat for v1.8.9.

Comment by Daire Byrne (Inactive) [ 12/Sep/13 ]

I'll try this:

# create a 1 GiB (262144 x 4k blocks) external journal device on the SSD
mke2fs -b 4096 -O journal_dev /dev/sdb 262144
# drop the existing internal journal (the MDT must be unmounted)
tune2fs -O ^has_journal /dev/lustre/mdt1
# re-add a journal on the external device and set the journal_data default mount option
tune2fs -o journal_data -j -J device=/dev/sdb /dev/lustre/mdt1

Comment by Daire Byrne (Inactive) [ 12/Sep/13 ]

Well, it helped with the journal transaction times, but the constant pausing is still present. Actually it may be even more frequent and regular now. Oh well, a reformat to v1.8.9 will be more instructive.

bmds1 /root # cat /proc/fs/jbd2/sdb/info
2283745 transaction, each up to 65536 blocks
average: 
  0ms waiting for transaction
  2ms running transaction
  0ms transaction was being locked
  0ms flushing data (in ordered mode)
  0ms logging transaction
  152us average transaction commit time
  4 handles per transaction
  12 blocks per transaction
  13 logged blocks per transaction
Comment by Alex Zhuravlev [ 12/Sep/13 ]

hmm, 3 handles per transaction isn't good at all. sync_permission=0 ?

Comment by Daire Byrne (Inactive) [ 12/Sep/13 ]

Yeah, I obviously forgot to apply sync_permission=0. So now it looks like:

66188 transaction, each up to 65536 blocks
average: 
  2ms waiting for transaction
  49ms running transaction
  3ms transaction was being locked
  0ms flushing data (in ordered mode)
  0ms logging transaction
  48370us average transaction commit time
  196 handles per transaction
  167 blocks per transaction
  168 logged blocks per transaction

So far it seems like the periods of "no activity" have been reduced, but the rate of transactions still varies quite a lot. I think we'll still go ahead and compare the same hardware on v1.8.9. It just feels like the ext4/RAID sync is really expensive with the current config. We need the metadata rate to be more consistent and predictable.
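
As a crude check of how expensive plain synchronous writes are on this storage path outside of Lustre (a sketch; the scratch path is hypothetical):

# 1000 x 4k writes, each followed by a data sync; high per-write latency here
# would point at the RAID/LVM layer rather than at Lustre itself
dd if=/dev/zero of=/scratch/syncfile bs=4k count=1000 oflag=dsync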

Cheers.

Comment by Alex Zhuravlev [ 12/Sep/13 ]

could you grab another vmstat output ?

Comment by Daire Byrne (Inactive) [ 12/Sep/13 ]

The load and IO wait seem to be consistently higher than they used to be, so I guess that is a good sign that we are doing more work overall than before the external journal. In this snippet the req_waittime was around 4,000/s and then increased to around 14,000/s for some time.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0 35      0 153068944 76469912 530164    0    0  5340  2080 33554 73880  0  2 69 29  0
 2 33      0 153049840 76478664 530904    0    0  5888     0 33074 75321  0  2 69 29  0
 0 36      0 153039680 76488400 529812    0    0  6712  6450 39886 89002  0  2 72 26  0
 2 33      0 153029568 76494808 530492    0    0  4284 59700 36049 136657  0  2 70 28  0
 0 14      0 153025168 76500336 531188    0    0  4240  2064 23191 47842  0  1 71 28  0
 0 16      0 153028528 76501656 531352    0    0  1240  4168 14398 27798  0  1 81 19  0
 2 35      0 153010848 76510256 529632    0    0  6136  4144 38677 82843  0  2 70 28  0
 0 15      0 153009856 76513080 531240    0    0  2360  4144 20287 36438  0  1 80 19  0
 2 31      0 153002480 76520320 529444    0    0  5808 37772 35445 136320  0  2 71 26  0
 0 32      0 152982288 76532400 530672    0    0  8752  6192 47392 105233  0  3 69 28  0
 0 14      0 152969552 76539408 529064    0    0  5896  6240 30985 60270  0  2 73 25  0
 0 31      0 152956768 76545224 530080    0    0  4768  8304 29606 58896  0  2 80 19  0
 0 24      0 152930224 76562528 529352    0    0 14392  4224 60441 143758  0  4 70 26  0
 5 26      0 152912704 76578672 529636    0    0 14068 71680 58179 184156  0  4 63 33  0
 1 23      0 152887104 76594720 531316    0    0 13708  4160 59833 142103  0  4 68 28  0
 1 21      0 152859648 76609600 530468    0    0 11764  2088 55286 128557  0  3 63 33  0
 2 29      0 152835968 76624656 530008    0    0 12752  6208 49436 108952  0  3 68 29  0
 0 28      0 152812096 76639664 530384    0    0 12116  2080 56040 134488  0  4 73 24  0
Comment by Alex Zhuravlev [ 12/Sep/13 ]

it doesn't look like it's stalling at any point. are you still seeing gaps with llstat?

Comment by Daire Byrne (Inactive) [ 12/Sep/13 ]

We're seeing the drops to zero transactions a lot less frequently now, and they never last more than a couple of seconds. Maybe every 30+ mins instead of every couple of mins... Certainly a lot better than it was. The overall throughput graphs are showing a noticeable increase too. Here's an example of a stall that happened just now:

req_waittime              10357      10357      57808929  [usec]        70735          2         6.59       9887      4.80 
req_waittime              11138      11138      57820067  [usec]        74281          2         6.59       9887      4.80 
req_waittime              13940      13940      57834007  [usec]        97575          2         6.59       9887      4.80 
req_waittime              11907      11907      57845914  [usec]        82263          2         6.59       9887      4.80 
req_waittime              11521      11521      57857435  [usec]        79703          2         6.59       9887      4.80 
req_waittime              2642       2642       57860077  [usec]        17844          2         6.59       9887      4.80 
req_waittime              5          5          57860082  [usec]          128          2         6.59       9887      4.80 
req_waittime              1          1          57860083  [usec]           32          2         6.59       9887      4.80 
req_waittime              2          2          57860085  [usec]           43          2         6.59       9887      4.80 
req_waittime              0          0          57860085  [usec]            0          2         6.59       9887      4.80 
req_waittime              902        902        57860987  [usec]         6832          2         6.59       9887      4.80 
req_waittime              11040      11040      57872027  [usec]        73828          2         6.59       9887      4.80 
req_waittime              8804       8804       57880831  [usec]        61287          2         6.59       9887      4.80 
req_waittime              9867       9867       57890698  [usec]        66307          2         6.59       9887      4.79 
req_waittime              11310      11310      57902008  [usec]        75206          2         6.59       9887      4.79 
req_waittime              11138      11138      57913146  [usec]        74315          2         6.59       9887      4.79 
req_waittime              9425       9425       57922571  [usec]        61696          2         6.59       9887      4.79 

----total-cpu-usage---- -dsk/total- -net/bond0- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  0   5  73  22   0   0|  11M    0 |9051k 9150k|   0     0 |  52k  158k
  0   4  74  22   0   0|  12M   85M|8408k 8479k|   0     0 |  48k  152k
  0   1  74  24   0   0|6424k  525k|9324k 9598k|   0     0 |  23k   51k
  0   0  88  12   0   0|8192B 4096B|6604k 6973k|   0     0 |1366  2319 
  0   0  69  31   0   0|   0    16k|  70k   37k|   0     0 | 423   561 
  0   0  69  31   0   0|   0     0 |  82k   86k|   0     0 | 359   483 
  0   0  69  31   0   0|   0   116k|  14k 5632B|   0     0 | 228   358 
  0   0  69  31   0   0|   0    36k|9634B 5207B|   0     0 | 359   502 
  0   3  75  22   0   0|  10M   30M|  14k 5888B|   0     0 |  41k  100k
  0   3  72  25   0   0|  11M   15M|5462k 5440k|   0     0 |  42k  103k
  0   2  73  25   0   0|  11M   12M|7685k 7958k|   0     0 |  38k   92k
  0   3  65  32   0   0|  11M   17M|7911k 7944k|   0     0 |  47k  113k
Comment by Alex Zhuravlev [ 16/Sep/13 ]

this might be due to journal overflow, which happens sometimes, and it takes a bit to free the journal. if this theory is correct, then similar (though probably less frequent) behavior should be observed with 1.8.

Comment by Daire Byrne (Inactive) [ 16/Sep/13 ]

Well, since putting the external journal in place the performance of this workload has almost doubled. The responsiveness of the filesystem has also noticeably improved. We may put off downgrading to v1.8 for a while longer while we test v2.4 further for other potential issues. Thanks again for your help.

Comment by Daire Byrne (Inactive) [ 09/Jan/14 ]

Hi,

I just wanted to update this ticket with recent changes to our config. We have now replaced the 10K SAS drives in our external Dell RAID with SSDs (RAID0 over 2x RAID5, 9 disks). With our extremely heavy link/unlink workload we are seeing much better peak IOPS performance. The journal is internal in this case.

req_waittime              20878      20878      21619361  [usec]       150732          2         7.37     176237     70.37 
req_waittime              21540      21540      21640901  [usec]       160872          2         7.37     176237     70.33 
req_waittime              18086      18086      21658987  [usec]       128436          2         7.37     176237     70.30 
req_waittime              20076      20076      21679063  [usec]       143419          2         7.37     176237     70.27 
req_waittime              20525      20525      21699588  [usec]       144492          2         7.37     176237     70.24 
req_waittime              15152      15152      21714740  [usec]       108671          2         7.37     176237     70.21 
req_waittime              12877      12877      21727617  [usec]        94028          2         7.37     176237     70.19 
req_waittime              18741      18741      21746358  [usec]       136609          2         7.37     176237     70.16 
req_waittime              20261      20261      21766619  [usec]       145094          2         7.37     176237     70.13 
req_waittime              10070      10070      21776689  [usec]        73515          2         7.37     176237     70.11 
req_waittime              6038       6038       21782727  [usec]        51293          2         7.37     176237     70.10 
req_waittime              22204      22204      21804931  [usec]       155824          2         7.37     176237     70.07 
req_waittime              21095      21095      21826026  [usec]       153739          2         7.37     176237     70.03 
req_waittime              15435      15435      21841461  [usec]       111966          2         7.37     176237     70.01 
req_waittime              20672      20672      21862133  [usec]       147581          2         7.37     176237     69.98 
req_waittime              22573      22573      21884706  [usec]       160090          2         7.37     176237     69.94 
req_waittime              13807      13807      21898513  [usec]        98643          2         7.37     176237     69.92 
req_waittime              20257      20257      21918770  [usec]       144935          2         7.37     176237     69.89 
req_waittime              20398      20398      21939168  [usec]       144100          2         7.37     176237     69.85 
req_waittime              20208      20208      21959376  [usec]       143570          2         7.37     176237     69.82 
req_waittime              6276       6276       21965652  [usec]        44278          2         7.37     176237     69.81 
req_waittime              3850       3850       21969502  [usec]        28082          2         7.37     176237     69.81 
req_waittime              12482      12482      21981984  [usec]        90562          2         7.37     176237     69.79
..

However, I am now seeing that the issue whereby the IOPS drop to zero at regular intervals returns after many hours of operation. This halves the average IOPS, so it has a fairly big effect on performance. Once I remount the MDT, performance returns to normal/maximum again for X hours. It's almost like the performance drops off once the server memory/cache fills up (256 GB).
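
A few things I can check the next time it degrades, to test the memory/cache theory without a remount (a sketch):

free -g                                    # overall memory/cache picture
grep -E 'Slab|SReclaimable' /proc/meminfo  # how much is sitting in kernel slabs
slabtop -o -s c | head -15                 # biggest slab consumers (ldlm locks, ldiskfs inodes, ...)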

req_waittime              4889       4889       1102506250[usec]        35266          2         8.02    2613228    507.60 
req_waittime              10978      10978      1102517228[usec]        85139          2         8.02    2613228    507.60 
req_waittime              9717       9717       1102526945[usec]        73508          2         8.02    2613228    507.59 
req_waittime              13839      13839      1102540784[usec]       102205          2         8.02    2613228    507.59 
req_waittime              13343      13343      1102554127[usec]        98177          2         8.02    2613228    507.59 
req_waittime              10959      10959      1102565086[usec]        83997          2         8.02    2613228    507.58 
req_waittime              8660       8660       1102573746[usec]        63587          2         8.02    2613228    507.58 
req_waittime              1307       1307       1102575053[usec]         8431          2         8.02    2613228    507.58 
req_waittime              0          0          1102575053[usec]            0          2         8.02    2613228    507.58 
req_waittime              1169       1169       1102576222[usec]         7085          2         8.02    2613228    507.58 
req_waittime              3012       3012       1102579234[usec]        20816          2         8.02    2613228    507.58 
req_waittime              12367      12367      1102591601[usec]        92169          2         8.02    2613228    507.58 
req_waittime              13004      13004      1102604605[usec]        99554          2         8.02    2613228    507.58 
req_waittime              13152      13152      1102617757[usec]        99387          2         8.02    2613228    507.57 
req_waittime              12478      12478      1102630235[usec]        94129          2         8.02    2613228    507.57 
req_waittime              15342      15342      1102645577[usec]       111265          2         8.02    2613228    507.57 
req_waittime              6443       6443       1102652020[usec]        47035          2         8.02    2613228    507.56 
req_waittime              983        983        1102653003[usec]         6440          2         8.02    2613228    507.56 
req_waittime              0          0          1102653003[usec]            0          2         8.02    2613228    507.56 
req_waittime              1068       1068       1102654071[usec]         6716          2         8.02    2613228    507.56 
req_waittime              3278       3278       1102657349[usec]        24421          2         8.02    2613228    507.56 
req_waittime              13877      13877      1102671226[usec]       103323          2         8.02    2613228    507.56 
req_waittime              12384      12384      1102683610[usec]        92282          2         8.02    2613228    507.56 
req_waittime              13866      13866      1102697476[usec]        98032          2         8.01    2613228    507.55 
req_waittime              12153      12153      1102709629[usec]        86540          2         8.01    2613228    507.55 
req_waittime              10753      10753      1102720382[usec]        79395          2         8.01    2613228    507.55 
req_waittime              4344       4344       1102724726[usec]        36456          2         8.01    2613228    507.55 
req_waittime              70         70         1102724796[usec]          443          2         8.01    2613228    507.55 
req_waittime              0          0          1102724796[usec]            0          2         8.01    2613228    507.55 
req_waittime              191        191        1102724987[usec]         1507          2         8.01    2613228    507.55 
req_waittime              5686       5686       1102730673[usec]        38777          2         8.01    2613228    507.55 
req_waittime              11976      11976      1102742649[usec]        91311          2         8.01    2613228    507.54 
req_waittime              12811      12811      1102755460[usec]        92589          2         8.01    2613228    507.54 
req_waittime              12153      12153      1102767613[usec]        89900          2         8.01    2613228    507.54 
req_waittime              11515      11515      1102779128[usec]        87047          2         8.01    2613228    507.54 
req_waittime              11057      11057      1102790185[usec]        82195          2         8.01    2613228    507.53 
req_waittime              5531       5531       1102795716[usec]        39728          2         8.01    2613228    507.53 
req_waittime              20         20         1102795736[usec]          167          2         8.01    2613228    507.53
..
# cat /proc/fs/jbd2/md0-8/info
594841 transaction, each up to 25600 blocks
average: 
  0ms waiting for transaction
  2ms running transaction
  0ms transaction was being locked
  0ms flushing data (in ordered mode)
  1ms logging transaction
  183570us average transaction commit time
  10 handles per transaction
  31 blocks per transaction
  32 logged blocks per transaction
# dumpe2fs -h /dev/md0
dumpe2fs 1.42.7.wc1 (12-Apr-2013)
Filesystem volume name:   bravo-MDT0000
Last mounted on:          /
Filesystem UUID:          468d314b-55f5-4d52-99ab-ab4baa527c73
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              3121512448
Block count:              1560747776
Reserved block count:     78037388
Free blocks:              726316155
Free inodes:              2690673937
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         16384
Fragments per group:      16384
Inodes per group:         32768
Inode blocks per group:   4096
RAID stride:              128
RAID stripe width:        256
Flex block group size:    16
Filesystem created:       Wed Dec 11 12:35:07 2013
Last mount time:          Thu Jan  9 11:14:43 2014
Last write time:          Thu Jan  9 11:14:43 2014
Mount count:              12
Maximum mount count:      -1
Last checked:             Wed Dec 11 12:35:07 2013
Check interval:           0 (<none>)
Lifetime writes:          72 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      b3732abb-0697-4314-bec1-104d3524a83f
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
Journal features:         journal_incompat_revoke
Journal size:             400M
Journal length:           102400
Journal sequence:         0x00136afd
Journal start:            86536
Comment by Daire Byrne (Inactive) [ 10/Mar/14 ]

You can definitely close this now. For our workloads an SSD-based MDT is required, and it is now working well for us.

Cheers,

Comment by Peter Jones [ 10/Mar/14 ]

ok thanks Daire
