[LU-3753] MDS operation rate dropping sporadically Created: 14/Aug/13 Updated: 10/Mar/14 Resolved: 10/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Daire Byrne (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9680 |
| Description |
|
Hi, I have been investigating why our metadata rate keeps dropping to nothing every 30 seconds or so, for a few seconds at a time. Looking at the debug logs, the MDS appears to be stuck doing many osp_sync_request_commit_cb() calls. I wouldn't expect these to block all other operations, so perhaps this is not the real issue and these are simply the only operations occurring when we see the rate drop to zero. I'm pretty sure the RAID and storage path is all fine. Here is a snippet illustrating the issue:
/proc/fs/lustre/mds/MDS/mdt/stats @ 1375970781.677198
00000004:00080000:4.0:1375970781.689383:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff8821062c4000, transno 8617232460
I am also attaching a graph showing the MDS rate drops over a few minutes, and I can upload full debug logs from the MDS when the rate drops to zero if that helps. The workload runs on two clients and copies, hard links and unlinks thousands of directory trees. The two clients suffer performance drops at the same time as the MDS rate drops to zero.
Cheers, Daire
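For reference, the per-interval samples above are taken against the MDT stats file shown; a simple way to reproduce that kind of sampling (a sketch only, the exact llstat invocation may differ between Lustre versions) is:
# llstat -i 10 /proc/fs/lustre/mds/MDS/mdt/stats
|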
| Comments |
| Comment by Alex Zhuravlev [ 15/Aug/13 ] |
|
are you running with ldiskfs or ZFS ? |
| Comment by Daire Byrne (Inactive) [ 15/Aug/13 ] |
|
No ZFS, just the standard ldiskfs. |
| Comment by Alex Zhuravlev [ 15/Aug/13 ] |
|
OK, thanks. I've been working on a patch to batch MDS-OST synchronization, but it's not a priority at the moment and it's not quite ready yet. |
| Comment by Daire Byrne (Inactive) [ 15/Aug/13 ] |
|
OK. Well, we are happy to be early adopters and test it whenever. Our workload seems to show up this effect quite well. Even without the batching, though, is it expected that the MDS rate would drop to zero while the synchronisation calls occur? I would have thought this would have shown up at other sites testing v2.4 or in artificial benchmarks, as it is a major performance regression compared to v1.8 IMHO. Is there anything we can do to alleviate this effect in the meantime? e.g. tweak no. of threads, local client locks, async journals etc. Cheers. |
| Comment by Alex Zhuravlev [ 15/Aug/13 ] |
|
That's true; even with no batching those threads shouldn't be consuming much CPU or blocking user requests. I'll think about using lower priorities for these threads and things like schedule(). |
| Comment by Alex Zhuravlev [ 18/Aug/13 ] |
|
Would you mind trying http://review.whamcloud.com/#/c/7343/ please? |
| Comment by James Braid (Inactive) [ 20/Aug/13 ] |
|
Hi Alex, I've applied the patch. It seems like it may have made a very small improvement. Is there more debugging info we can provide to assist? We still see frequent, huge performance drops on our clients. Here is collectl output from a client; note how the writes drop to zero and then come back again.
#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes
28 17 39967 89269 561M 9M 8G 8G 1G 12G 4096M 4G 513295 2133
32 20 39782 95362 692M 9M 8G 8G 1G 12G 4096M 4G 571402 2340
30 18 39539 97551 459M 9M 8G 8G 1G 12G 4096M 4G 589446 2446
34 22 43882 104601 539M 9M 8G 8G 1G 12G 4096M 4G 584120 2417
24 16 32688 75601 529M 9M 8G 8G 1G 12G 4096M 4G 459916 1883
27 16 34082 81696 481M 9M 8G 8G 1G 12G 4096M 4G 549843 2231
9 5 13125 24345 602M 9M 8G 8G 1G 12G 4096M 4G 165923 651
1 0 657 171 636M 9M 8G 8G 1G 12G 4096M 4G 0 0
0 0 186 241 637M 9M 8G 8G 1G 12G 4096M 4G 0 0
0 0 99 165 638M 9M 8G 8G 1G 12G 4096M 4G 0 0
0 0 429 505 638M 9M 8G 8G 1G 12G 4096M 4G 0 0
16 10 23053 51084 470M 9M 8G 8G 1G 12G 4096M 4G 318397 1298
32 20 39637 91021 341M 9M 8G 8G 1G 12G 4096M 4G 593453 2422
35 22 43514 98099 288M 9M 8G 8G 1G 12G 4096M 4G 646799 2640
28 18 38110 87991 367M 9M 8G 8G 1G 12G 4096M 4G 520423 2141
10 6 15983 30703 372M 9M 8G 8G 1G 12G 4096M 4G 215376 880
9 5 15507 28817 253M 9M 8G 8G 1G 12G 4096M 4G 208647 870
30 19 38929 94465 233M 9M 8G 8G 1G 12G 4096M 4G 541840 2236
27 17 36664 85198 166M 9M 8G 8G 1G 12G 4096M 4G 508441 2139
32 20 40896 94522 614M 9M 8G 8G 1G 12G 4096M 4G 556058 2326
29 17 39580 88633 444M 9M 8G 8G 1G 12G 4096M 4G 560600 2305
27 17 38357 81376 605M 9M 8G 8G 1G 12G 4096M 4G 488711 2036
28 17 38639 86526 698M 9M 8G 8G 1G 12G 4096M 4G 522031 2182
28 18 38191 88627 659M 9M 8G 8G 1G 12G 4096M 4G 539818 2217
6 4 11154 22075 630M 9M 8G 8G 1G 12G 4096M 4G 158178 621
0 0 101 160 629M 9M 8G 8G 1G 12G 4096M 4G 0 0
0 0 163 240 629M 9M 8G 8G 1G 12G 4096M 4G 0 0
0 0 260 371 630M 9M 8G 8G 1G 12G 4096M 4G 0 0
13 8 22627 50627 421M 9M 8G 8G 1G 12G 4096M 4G 262699 1110
29 18 40373 90854 581M 9M 8G 8G 1G 12G 4096M 4G 531336 2236
33 22 40703 102870 599M 9M 8G 8G 1G 12G 4096M 4G 541039 2288
31 20 41802 103343 724M 9M 8G 8G 1G 12G 4096M 4G 565820 2372
29 19 38808 94022 660M 9M 8G 8G 1G 12G 4096M 4G 516110 2175
30 19 40539 101639 533M 9M 8G 8G 1G 12G 4096M 4G 547567 2279
28 17 39459 92811 527M 9M 8G 8G 1G 12G 4096M 4G 535465 2232
29 19 39342 89179 595M 9M 8G 8G 1G 12G 4096M 4G 528147 2185
29 19 41263 91443 513M 9M 8G 8G 1G 12G 4096M 4G 522968 2151
23 15 36688 78033 550M 9M 8G 8G 1G 12G 4096M 4G 452367 1866
32 20 40583 99159 591M 9M 8G 8G 1G 12G 4096M 4G 565728 2312
32 21 39717 94413 650M 9M 8G 8G 1G 12G 4096M 4G 566728 2310
31 20 39367 92265 858M 9M 8G 8G 1G 12G 4096M 4G 576691 2359
31 20 39148 91641 650M 9M 8G 8G 1G 12G 4096M 4G 576751 2355
29 18 37997 90133 376M 9M 8G 8G 1G 12G 4096M 4G 550331 2270
16 11 21152 42319 734M 9M 8G 8G 1G 12G 4096M 4G 253585 1036
0 0 556 1003 723M 9M 8G 8G 1G 12G 4096M 4G 6288 26
|
| Comment by Alex Zhuravlev [ 20/Aug/13 ] |
|
Can you grab the same stats from the MDS, please? |
| Comment by James Braid (Inactive) [ 20/Aug/13 ] |
|
Hi Alex, collectl doesn't seem to work with the 2.x MDS, but here are the stats with the non-Lustre-specific info. The drops in interrupts/csw coincide with low client performance.
#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre MDS-------->
#cpu sys inter ctxsw Free Buff Cach Inac Slab Map Gattr+ Sattr+ Sync Unlnk
0 0 14990 19822 141G 76G 338M 43G 31G 125M 0 0 0 0
0 0 19252 26953 141G 76G 338M 43G 31G 125M 0 0 0 0
1 1 33080 50769 141G 76G 338M 43G 31G 125M 0 0 0 0
2 2 46215 74568 141G 76G 337M 43G 31G 125M 0 0 0 0
2 2 46757 75867 141G 76G 337M 43G 31G 125M 0 0 0 0
2 2 46323 71085 141G 76G 338M 43G 31G 125M 0 0 0 0
2 2 46438 74262 141G 76G 338M 43G 31G 125M 0 0 0 0
3 3 46285 72427 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 44278 67984 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 43048 66573 141G 76G 338M 43G 31G 125M 0 0 0 0
3 3 41457 63118 141G 76G 338M 43G 31G 125M 0 0 0 0
3 3 43484 66497 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 41665 64117 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 41082 61940 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 40711 60277 141G 76G 338M 43G 31G 125M 0 0 0 0
0 0 10233 15577 141G 76G 336M 43G 31G 125M 0 0 0 0
0 0 161 243 141G 76G 338M 43G 31G 125M 0 0 0 0
0 0 5366 7845 141G 76G 338M 43G 31G 125M 0 0 0 0
4 4 46224 70689 141G 76G 338M 43G 31G 125M 0 0 0 0
3 3 41865 62386 141G 76G 338M 43G 31G 125M 0 0 0 0
4 4 43553 66924 141G 76G 338M 43G 31G 125M 0 0 0 0
3 3 43779 67676 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 41717 63335 141G 76G 338M 43G 31G 125M 0 0 0 0
0 0 969 1122 141G 76G 338M 43G 31G 125M 0 0 0 0
0 0 228 277 141G 76G 338M 43G 31G 125M 0 0 0 0
0 0 337 823 141G 76G 338M 43G 31G 125M 0 0 0 0
0 0 151 243 141G 76G 338M 43G 31G 125M 0 0 0 0
0 0 187 262 141G 76G 338M 43G 31G 125M 0 0 0 0
1 1 21902 33261 141G 76G 336M 43G 31G 125M 0 0 0 0
4 4 43533 67827 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 43230 66712 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 41791 64745 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 41640 63042 141G 76G 338M 43G 31G 125M 0 0 0 0
2 2 43619 51296 141G 76G 337M 43G 31G 125M 0 0 0 0
2 2 50002 79991 141G 76G 337M 43G 31G 125M 0 0 0 0
2 2 55406 88595 141G 76G 338M 43G 31G 125M 0 0 0 0
2 2 48724 77648 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 46859 75466 141G 76G 338M 43G 31G 125M 0 0 0 0
3 3 47779 76225 141G 76G 337M 43G 31G 125M 0 0 0 0
3 3 46867 75415 141G 76G 337M 43G 32G 125M 0 0 0 0
3 3 44883 71390 141G 76G 337M 43G 32G 125M 0 0 0 0
3 3 44000 69896 141G 76G 338M 43G 32G 125M 0 0 0 0
2 2 38649 61769 141G 76G 338M 43G 32G 125M 0 0 0 0
0 0 721 824 141G 76G 338M 43G 32G 125M 0 0 0 0
0 0 180 240 141G 76G 338M 43G 32G 125M 0 0 0 0
0 0 333 829 141G 76G 338M 43G 32G 125M 0 0 0 0
0 0 9742 11493 141G 76G 338M 43G 32G 125M 0 0 0 0
0 0 15017 17764 141G 76G 338M 43G 32G 125M 0 0 0 0
|
| Comment by Alex Zhuravlev [ 20/Aug/13 ] |
|
Is there a way to include low-level (block) I/O in the report?
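With collectl that should just be a matter of adding the disk subsystem to the subsystem list, e.g. something along these lines (the exact flags may vary by collectl version):
# collectl -scdm --verbose
|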
| Comment by James Braid (Inactive) [ 20/Aug/13 ] |
|
Hi Alex, here is the collectl output with disk info included.
#<----CPU[HYPER]-----><-----------Memory-----------><----------Disks----------->
#cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrit Writes
0 0 15419 9478 121G 90G 345M 51G 38G 151M 624 127 9536 2266
0 0 3174 1543 121G 90G 345M 51G 38G 151M 72 18 2156 514
0 0 694 305 121G 90G 346M 51G 38G 152M 0 0 64 15
0 0 10080 6090 121G 90G 345M 51G 38G 152M 456 84 7048 1603
0 0 18893 12913 121G 90G 345M 51G 38G 152M 1052 205 15158 3395
3 3 66971 62305 121G 90G 346M 51G 38G 153M 4392 892 73522 15416
3 3 74302 69772 121G 90G 345M 51G 38G 154M 5320 1103 83760 17784
3 3 78258 73317 121G 90G 345M 51G 38G 154M 5188 1059 84000 19377
4 4 81342 72783 121G 90G 345M 51G 38G 155M 5372 1078 83952 20370
4 4 79303 73135 121G 90G 345M 51G 38G 155M 5172 1115 86600 20962
4 4 82544 73869 121G 90G 345M 51G 38G 156M 6528 1214 89628 20826
3 3 70805 58015 121G 90G 345M 51G 38G 157M 3748 774 199530 25660
0 0 26491 21617 121G 90G 345M 51G 38G 157M 1716 370 20936 4663
2 1 64310 55434 121G 90G 345M 51G 38G 170M 4312 884 50792 12617
1 1 48974 41562 121G 90G 345M 51G 38G 171M 3124 661 40344 9558
3 3 87339 84892 121G 90G 345M 51G 38G 171M 6664 1441 87324 21470
3 3 89330 84011 121G 90G 345M 51G 38G 172M 7856 1589 85492 21058
5 3 89911 81401 121G 90G 345M 51G 38G 178M 7864 1479 80904 20404
4 4 84957 81147 121G 90G 345M 51G 38G 160M 7076 1492 83104 20472
4 4 85601 82894 121G 90G 345M 51G 38G 160M 6652 1362 80336 20092
4 4 81877 76701 121G 90G 345M 51G 38G 160M 6988 1357 83660 19825
4 4 76911 70142 121G 90G 345M 51G 38G 160M 6032 1140 86088 18022
3 3 75907 66537 121G 90G 345M 51G 38G 160M 5120 1055 81692 18017
3 3 65582 58071 121G 90G 343M 51G 38G 161M 4676 899 70736 15957
0 0 1158 890 121G 90G 346M 51G 38G 161M 0 0 122 41
0 0 1421 800 121G 90G 345M 51G 38G 161M 40 10 560 124
0 0 646 312 121G 90G 346M 51G 38G 161M 4 1 0 0
0 0 3744 2393 121G 90G 345M 51G 38G 161M 244 33 2161 518
3 3 67809 59011 121G 90G 345M 51G 38G 161M 5064 851 73344 16346
3 3 75089 65527 121G 90G 345M 51G 38G 161M 5020 996 83584 18005
1 1 24734 20954 121G 90G 345M 51G 38G 161M 2036 346 25896 5893
0 0 1517 427 121G 90G 346M 51G 38G 161M 0 0 4 1
1 1 37255 31162 121G 90G 345M 51G 38G 161M 2056 427 39380 8705
4 4 79364 70818 121G 90G 345M 51G 38G 161M 5636 1088 89616 20124
4 3 74635 70406 121G 90G 345M 51G 38G 161M 4776 1029 88984 19081
3 3 74168 69116 121G 90G 345M 51G 38G 162M 5896 1059 85292 18438
4 4 79242 71251 121G 90G 345M 51G 38G 162M 5800 1149 89636 19608
3 3 77076 70043 121G 90G 345M 51G 38G 162M 5344 1041 86020 18137
3 3 76394 68878 121G 90G 345M 51G 38G 162M 5172 1028 89460 19349
1 1 26792 24128 121G 90G 344M 51G 38G 162M 1684 337 30656 6404
0 0 561 255 121G 90G 346M 51G 38G 162M 0 0 4 1
0 0 3377 2410 121G 90G 345M 51G 38G 162M 108 27 3176 673
3 3 57772 50792 121G 90G 345M 51G 38G 162M 3864 738 67184 14451
4 4 76063 70173 121G 90G 344M 51G 38G 162M 4812 967 89464 18906
3 3 68062 63098 121G 90G 346M 51G 38G 162M 4432 857 74896 15953
3 3 71401 64585 121G 90G 345M 51G 38G 162M 4616 921 75924 16704
2 2 47990 41065 121G 90G 345M 51G 38G 163M 2380 538 55948 11602
0 0 4580 2754 121G 90G 346M 51G 38G 163M 8 2 39576 3569
Here is perf top output while this capture was taking place:
22.97% [kernel] [k] _spin_lock
13.40% [jbd2] [k] journal_clean_one_cp_list
1.53% [kernel] [k] find_busiest_group
1.43% [kernel] [k] kmem_cache_free
1.22% [kernel] [k] _spin_lock_irqsave
1.21% [kernel] [k] kmem_cache_alloc
1.00% [jbd2] [k] __jbd2_journal_clean_checkpoint_list
0.67% [kernel] [k] _read_lock_irqsave
0.60% [kernel] [k] schedule
0.55% [kernel] [k] mempool_free
0.53% [kernel] [k] _spin_unlock_irqrestore
0.51% [kernel] [k] _spin_lock_irq
|
| Comment by James Braid (Inactive) [ 23/Aug/13 ] |
|
Hi Alex, do you have any more ideas about this? We will have to roll back to Lustre 1.8.x on this system very soon unless we can find some way to improve the performance. Thanks! |
| Comment by Alex Zhuravlev [ 23/Aug/13 ] |
|
Sorry, still thinking about this. From the stats it doesn't look like the MDS is busy with anything at all. A bit of MDS-OST traffic should be seen as a high number of RPCs (and so interrupts), I'd think. What journal size do you use on the MDS? |
| Comment by James Braid (Inactive) [ 23/Aug/13 ] |
|
Hi Alex, here is the info from our 2.4.0 system (slow performance):
bmds1 /root # dumpe2fs -h /dev/lustre/mdt1
dumpe2fs 1.42.7.wc1 (12-Apr-2013)
Filesystem volume name: bravo-MDT0000
Last mounted on: /
Filesystem UUID: f1416ea8-190c-4bd7-a8f4-27b013f278e0
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 3221225472
Block count: 3221225472
Reserved block count: 161061273
Free blocks: 2565247465
Free inodes: 2896159999
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 256
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 32768
Inode blocks per group: 4096
RAID stride: 256
RAID stripe width: 256
Flex block group size: 16
Filesystem created: Wed Jul 3 14:49:25 2013
Last mount time: Tue Aug 20 14:08:37 2013
Last write time: Tue Aug 20 14:08:37 2013
Mount count: 9
Maximum mount count: -1
Last checked: Wed Jul 3 14:49:25 2013
Check interval: 0 (<none>)
Lifetime writes: 60 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 047637b0-966e-403b-9c32-0c549afe4180
Journal backup: inode blocks
User quota inode: 3
Group quota inode: 4
Journal features: journal_incompat_revoke
Journal size: 400M
Journal length: 102400
Journal sequence: 0x4160245e
Journal start: 14452
bmds1 /root # cat /proc/fs/jbd2/dm-12-8/info
69887438 transaction, each up to 25600 blocks
average:
0ms waiting for transaction
3ms running transaction
0ms transaction was being locked
0ms flushing data (in ordered mode)
1ms logging transaction
828us average transaction commit time
8 handles per transaction
20 blocks per transaction
21 logged blocks per transaction
And from our 1.8.9 system (good performance):
[root@mds ~]# dumpe2fs -h /dev/md3
dumpe2fs 1.41.90.wc3 (28-May-2011)
Filesystem volume name: lustre-MDT0000
Last mounted on: /
Filesystem UUID: fa42ca46-0585-426b-9308-761aaf6b7a11
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file uninit_bg
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 3805970432
Block count: 1902976990
Reserved block count: 95148849
Free blocks: 1283903845
Free inodes: 3519235719
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1024
Blocks per group: 16384
Fragments per group: 16384
Inodes per group: 32768
Inode blocks per group: 4096
Filesystem created: Tue Dec 13 12:06:20 2011
Last mount time: Sun May 5 11:42:06 2013
Last write time: Sun May 5 11:42:06 2013
Mount count: 26
Maximum mount count: 20
Last checked: Tue Dec 13 12:06:20 2011
Check interval: 0 (<none>)
Lifetime writes: 1815 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
First orphan inode: 3735290412
Default directory hash: half_md4
Directory Hash Seed: b589d099-f7c1-4677-ad85-dfc6024efb09
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 400M
Journal length: 102400
Journal sequence: 0x00959d1d
Journal start: 64343
[root@mds ~]# cat /proc/fs/jbd2/md3-8/info
1894322 transaction, each upto 25600 blocks
average:
6ms waiting for transaction
5000ms running transaction
7ms transaction was being locked
17ms flushing data (in ordered mode)
60ms logging transaction
0us average transaction commit time
11796 handles per transaction
3223 blocks per transaction
3233 logged blocks per transaction
|
| Comment by James Braid (Inactive) [ 23/Aug/13 ] |
|
Hmm, just noticed the RAID stride values look a bit strange... I didn't format this system so I'll have to check in with Daire about what is going on there. |
| Comment by Alex Zhuravlev [ 23/Aug/13 ] |
|
bmds1 /root # cat /proc/fs/jbd2/dm-12-8/info
If this was taken after some time (under test or real workload), then something is pretty wrong here. Do you use ACLs? Can you check /proc/fs/lustre/mdd/*MDT0000/sync_permission: if it's 1, set it to 0 and try again?
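Concretely, that would be something like the following (lctl syntax; echoing into the proc file above works just as well):
# lctl get_param mdd.*.sync_permission
# lctl set_param mdd.*.sync_permission=0
|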
| Comment by James Braid (Inactive) [ 23/Aug/13 ] |
|
Hi Alex, that info was captured whilst running our production workload (dirvish backups). We don't use ACLs, but the dirvish processes do a very large number of chmod() operations. I tried setting sync_permission to 0 but it doesn't appear to make much difference. Here is another sample of the jbd2 info:
bmds1 /root # cat /proc/fs/jbd2/dm-12-8/info
70236645 transaction, each up to 25600 blocks
average:
0ms waiting for transaction
3ms running transaction
0ms transaction was being locked
0ms flushing data (in ordered mode)
1ms logging transaction
254103us average transaction commit time
9 handles per transaction
20 blocks per transaction
21 logged blocks per transaction
|
| Comment by Alex Zhuravlev [ 24/Aug/13 ] |
|
Sorry to ask, but are you sure dm-12-8 is /dev/lustre/mdt1 running bravo-MDT0000? |
| Comment by James Braid (Inactive) [ 24/Aug/13 ] |
|
Hi Alex, yep, we only have a single LVM volume on the MDS. I've attached a file with the output of debugfs (and the jbd2 info and dumpe2fs). |
| Comment by Alex Zhuravlev [ 25/Aug/13 ] |
|
ok, the good news is that there should be a significant improvement once the transactions are normal in terms of size and lifetime (as seen with 1.8). |
| Comment by Daire Byrne (Inactive) [ 02/Sep/13 ] |
|
On further investigation, sync_permission=0 does help: it increases the overall IOPS on the MDT, but the drops in transactions are still present. I've attached a graph which shows sync_permission=0 at the beginning, sync_permission=1 for the middle section, and then sync_permission=0 again for the final part of the plot. |
| Comment by Daire Byrne (Inactive) [ 02/Sep/13 ] |
|
sync_permission=0, sync_permission=1 comparison. The rate is almost double with sync_permission=0. |
| Comment by Alex Zhuravlev [ 02/Sep/13 ] |
|
Thanks! I think it makes sense to grab the stats again (CPU, I/O, jbd info) with sync_permission=0.
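For example, a simple loop run alongside the collectl capture would cover the jbd side (a sketch only; the device path is the one from the earlier comments):
# while sleep 10; do date; cat /proc/fs/jbd2/dm-12-8/info; done > /tmp/jbd2-info.log
|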
| Comment by Daire Byrne (Inactive) [ 02/Sep/13 ] |
|
Well, I don't think the system has been rebooted in a few weeks, so the JBD stats probably still contain a lot of sync_permission=1 data too. Either way it doesn't look much different than before:
[root@bmds1 ~]# cat /proc/fs/jbd2/dm-12-8/info
70424050 transaction, each up to 25600 blocks
average:
1ms waiting for transaction
14ms running transaction
0ms transaction was being locked
0ms flushing data (in ordered mode)
3ms logging transaction
1061027us average transaction commit time
55 handles per transaction
47 blocks per transaction
48 logged blocks per transaction
|
| Comment by Alex Zhuravlev [ 02/Sep/13 ] |
|
It's started to change, which is good. How about CPU and I/O? Given this is less blocked by I/O now, I'd hope to see another bottleneck emerge. |
| Comment by Daire Byrne (Inactive) [ 02/Sep/13 ] |
|
collectl snippet:
#<----CPU[HYPER]-----><-----------Memory-----------><----------Disks----------->
#cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrit Writes
1 1 23117 46950 239G 7G 388M 5G 2G 145M 3528 606 40624 2133
1 1 25639 53105 239G 7G 388M 5G 2G 145M 3712 586 25092 839
0 0 13059 23901 239G 7G 387M 5G 2G 145M 1708 239 22912 2751
0 0 7551 14811 239G 7G 388M 5G 2G 145M 1404 169 27388 2277
0 0 4868 12429 239G 7G 388M 5G 2G 145M 0 0 29500 5330
0 0 1807 4783 239G 7G 388M 5G 2G 145M 0 0 13836 2221
0 0 1508 3047 239G 7G 388M 5G 2G 145M 0 0 16784 1690
0 0 3434 6783 239G 7G 388M 5G 2G 145M 0 0 21604 3566
0 0 3689 5870 239G 7G 388M 5G 2G 145M 0 0 24728 3260
0 0 2288 3000 239G 7G 388M 5G 2G 145M 0 0 12500 1308
0 0 1249 1440 239G 7G 388M 5G 2G 145M 0 0 12832 1087
0 0 218 376 239G 7G 388M 5G 2G 145M 0 0 596 55
0 0 9066 39054 239G 7G 388M 5G 2G 145M 0 0 57684 9864
0 0 22196 31497 239G 7G 388M 5G 2G 145M 968 146 86420 15432
1 1 35652 54341 239G 7G 387M 5G 2G 145M 3316 525 127146 15395
2 2 42714 101785 239G 7G 387M 5G 2G 145M 11668 1394 18 16
3 3 54471 135731 239G 7G 388M 5G 2G 145M 13656 1807 0 0
3 3 51028 126309 239G 7G 388M 5G 2G 145M 14252 1855 16 3
3 3 49706 153783 239G 7G 387M 5G 2G 145M 8460 1316 125290 10353
What I see is that when the MDS rate drops to zero there are still a lot of writes being pushed to disk, and then when the MDS rate recovers the writes drop for a few seconds before continuing again. I also rebooted the server to reset the jbd stats so we only have sync_permission=0 data.
[root@bmds1 ~]# cat /proc/fs/jbd2/dm-12-8/info
187 transaction, each up to 25600 blocks
average:
319ms waiting for transaction
5001ms running transaction
49ms transaction was being locked
28ms flushing data (in ordered mode)
821ms logging transaction
359017us average transaction commit time
10301 handles per transaction
8828 blocks per transaction
8854 logged blocks per transaction
|
| Comment by Daire Byrne (Inactive) [ 09/Sep/13 ] |
|
Alex, I have spent some time trying to replicate our workload in a test environment but I have failed so far to reproduce the effect. Perhaps it is tied to our hardware/config in some way. We can provide you with remote access to the servers if you think that would be useful to you. The workload runs 24/7 and the servers are not in production so we can survive the odd crash/reboot if needed. |
| Comment by Alex Zhuravlev [ 10/Sep/13 ] |
|
sure, I'd be happy to have a look at the live data. |
| Comment by Daire Byrne (Inactive) [ 10/Sep/13 ] |
|
Great. Can you start a WebEx? Otherwise I can set up SSH, keys etc. |
| Comment by Alex Zhuravlev [ 10/Sep/13 ] |
|
I'd prefer SSH if possible. |
| Comment by Daire Byrne (Inactive) [ 10/Sep/13 ] |
|
Okay. Can you tell me your IP and a private email that I can send the details to? |
| Comment by Alex Zhuravlev [ 10/Sep/13 ] |
|
I'm seeing this:
and vmstat at the same time:
another example:
0 2 4416 10686668 102630000 133880 0 0 16 3340 810 899 0 0 93 7 0
As you can see, the system was idling. I also noticed the following in the jbd stats:
1046886us average transaction commit time
i.e. more than 1 second to commit a transaction, which is very long. Did you use the same hardware/software setup with 1.8? |
| Comment by Daire Byrne (Inactive) [ 10/Sep/13 ] |
|
Unfortunately we did not use the same hardware/software for our v1.8 filesystem (which is still in production). The thing I don't understand is that when the system becomes idle there is no IO wait, and read speeds from the device seem to be okay, which is why I thought it was unlikely to be a RAID device issue. I agree the jbd commit time is unreasonably high. Again, it was not clear to me at what level this slowness occurred - HWRAID, LVM, RAID10, ext4... So you think Lustre is just waiting on the journal to be written, and that is why it stops processing requests? |
| Comment by Daire Byrne (Inactive) [ 10/Sep/13 ] |
|
I wonder if mounting with "nobarrier" will help? e.g. # |
| Comment by Daire Byrne (Inactive) [ 10/Sep/13 ] |
|
"nobarrier" had no effect. |
| Comment by Daire Byrne (Inactive) [ 12/Sep/13 ] |
|
So I think the best course of action now is to install Lustre v1.8.9 on this hardware and compare the performance with the same workload. Unfortunately we will have to change one element: the v1.8.9 Whamcloud kernel (EL6.2) does not support the RAID10 LVM layout we are currently using (EL6.4), so we will have to use an MD RAID10 with LVM on top, roughly as sketched below. I will report back later next week once this work has been done. Thanks for your time.
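The replacement layout would look something like this (device names and disk count are illustrative only, not our actual configuration):
# mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[b-e]
# pvcreate /dev/md0
# vgcreate lustre /dev/md0
# lvcreate -l 100%FREE -n mdt1 lustre
|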
| Comment by Alex Zhuravlev [ 12/Sep/13 ] |
|
If you have additional storage on the MDS node, I'd try with an external journal. But 1.8 would be interesting, definitely. |
| Comment by Daire Byrne (Inactive) [ 12/Sep/13 ] |
|
Well, we can easily add some SSDs to the MDS server. Is it fairly easy to create an external journal for an existing MDT filesystem? It is worth a try before we reformat for v1.8.9. |
| Comment by Daire Byrne (Inactive) [ 12/Sep/13 ] |
|
I'll try this:
mke2fs -b 4096 -O journal_dev /dev/sdb 262144
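Attaching it to the existing MDT should then be a matter of swapping the internal journal for the external device, roughly along these lines (with the MDT unmounted, device names as above, and an fsck in between for safety):
# tune2fs -O ^has_journal /dev/lustre/mdt1
# e2fsck -f /dev/lustre/mdt1
# tune2fs -j -J device=/dev/sdb /dev/lustre/mdt1
|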
| Comment by Daire Byrne (Inactive) [ 12/Sep/13 ] |
|
Well, it helped with the journal transaction times, but the constant pausing is still present. Actually it may be even more frequent and regular now. Oh well, a reformat to v1.8.9 will be more instructive.
bmds1 /root # cat /proc/fs/jbd2/sdb/info
2283745 transaction, each up to 65536 blocks
average:
0ms waiting for transaction
2ms running transaction
0ms transaction was being locked
0ms flushing data (in ordered mode)
0ms logging transaction
152us average transaction commit time
4 handles per transaction
12 blocks per transaction
13 logged blocks per transaction
|
| Comment by Alex Zhuravlev [ 12/Sep/13 ] |
|
Hmm, 3 handles per transaction isn't good at all. Is sync_permission=0 set? |
| Comment by Daire Byrne (Inactive) [ 12/Sep/13 ] |
|
Yeah, I obviously forgot to apply sync_permission=0. So now it looks like:
66188 transaction, each up to 65536 blocks
average:
2ms waiting for transaction
49ms running transaction
3ms transaction was being locked
0ms flushing data (in ordered mode)
0ms logging transaction
48370us average transaction commit time
196 handles per transaction
167 blocks per transaction
168 logged blocks per transaction
It seems so far like the periods of "no activity" have been reduced, but the rate of transactions still varies quite a lot. I think we'll still go ahead and compare the same hardware on v1.8.9. It just feels like the ext4/RAID sync is really expensive with the current config. We need the metadata rate to be more consistent and predictable. Cheers. |
| Comment by Alex Zhuravlev [ 12/Sep/13 ] |
|
could you grab another vmstat output ? |
| Comment by Daire Byrne (Inactive) [ 12/Sep/13 ] |
|
The load and IO wait seem to be consistently higher than they used to be, so I guess that is a good sign that we are doing more work overall than we were before the external journal. In this snippet the req_waittime was around 4,000/s and then increased to around 14,000/s for some time.
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 35 0 153068944 76469912 530164 0 0 5340 2080 33554 73880 0 2 69 29 0
2 33 0 153049840 76478664 530904 0 0 5888 0 33074 75321 0 2 69 29 0
0 36 0 153039680 76488400 529812 0 0 6712 6450 39886 89002 0 2 72 26 0
2 33 0 153029568 76494808 530492 0 0 4284 59700 36049 136657 0 2 70 28 0
0 14 0 153025168 76500336 531188 0 0 4240 2064 23191 47842 0 1 71 28 0
0 16 0 153028528 76501656 531352 0 0 1240 4168 14398 27798 0 1 81 19 0
2 35 0 153010848 76510256 529632 0 0 6136 4144 38677 82843 0 2 70 28 0
0 15 0 153009856 76513080 531240 0 0 2360 4144 20287 36438 0 1 80 19 0
2 31 0 153002480 76520320 529444 0 0 5808 37772 35445 136320 0 2 71 26 0
0 32 0 152982288 76532400 530672 0 0 8752 6192 47392 105233 0 3 69 28 0
0 14 0 152969552 76539408 529064 0 0 5896 6240 30985 60270 0 2 73 25 0
0 31 0 152956768 76545224 530080 0 0 4768 8304 29606 58896 0 2 80 19 0
0 24 0 152930224 76562528 529352 0 0 14392 4224 60441 143758 0 4 70 26 0
5 26 0 152912704 76578672 529636 0 0 14068 71680 58179 184156 0 4 63 33 0
1 23 0 152887104 76594720 531316 0 0 13708 4160 59833 142103 0 4 68 28 0
1 21 0 152859648 76609600 530468 0 0 11764 2088 55286 128557 0 3 63 33 0
2 29 0 152835968 76624656 530008 0 0 12752 6208 49436 108952 0 3 68 29 0
0 28 0 152812096 76639664 530384 0 0 12116 2080 56040 134488 0 4 73 24 0
|
| Comment by Alex Zhuravlev [ 12/Sep/13 ] |
|
It doesn't look like it's stalling at any point. Are you still seeing gaps with llstat? |
| Comment by Daire Byrne (Inactive) [ 12/Sep/13 ] |
|
We're seeing the drop to zero transactions a lot less frequently now, and they never last more than a couple of seconds. Maybe every 30+ mins instead of every couple of mins... Certainly a lot better than it was. The overall throughput graphs are showing a noticeable increase too. Here's an example of a stall that happened just now:
req_waittime 10357 10357 57808929 [usec] 70735 2 6.59 9887 4.80
req_waittime 11138 11138 57820067 [usec] 74281 2 6.59 9887 4.80
req_waittime 13940 13940 57834007 [usec] 97575 2 6.59 9887 4.80
req_waittime 11907 11907 57845914 [usec] 82263 2 6.59 9887 4.80
req_waittime 11521 11521 57857435 [usec] 79703 2 6.59 9887 4.80
req_waittime 2642 2642 57860077 [usec] 17844 2 6.59 9887 4.80
req_waittime 5 5 57860082 [usec] 128 2 6.59 9887 4.80
req_waittime 1 1 57860083 [usec] 32 2 6.59 9887 4.80
req_waittime 2 2 57860085 [usec] 43 2 6.59 9887 4.80
req_waittime 0 0 57860085 [usec] 0 2 6.59 9887 4.80
req_waittime 902 902 57860987 [usec] 6832 2 6.59 9887 4.80
req_waittime 11040 11040 57872027 [usec] 73828 2 6.59 9887 4.80
req_waittime 8804 8804 57880831 [usec] 61287 2 6.59 9887 4.80
req_waittime 9867 9867 57890698 [usec] 66307 2 6.59 9887 4.79
req_waittime 11310 11310 57902008 [usec] 75206 2 6.59 9887 4.79
req_waittime 11138 11138 57913146 [usec] 74315 2 6.59 9887 4.79
req_waittime 9425 9425 57922571 [usec] 61696 2 6.59 9887 4.79
----total-cpu-usage---- -dsk/total- -net/bond0- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 5 73 22 0 0| 11M 0 |9051k 9150k| 0 0 | 52k 158k
0 4 74 22 0 0| 12M 85M|8408k 8479k| 0 0 | 48k 152k
0 1 74 24 0 0|6424k 525k|9324k 9598k| 0 0 | 23k 51k
0 0 88 12 0 0|8192B 4096B|6604k 6973k| 0 0 |1366 2319
0 0 69 31 0 0| 0 16k| 70k 37k| 0 0 | 423 561
0 0 69 31 0 0| 0 0 | 82k 86k| 0 0 | 359 483
0 0 69 31 0 0| 0 116k| 14k 5632B| 0 0 | 228 358
0 0 69 31 0 0| 0 36k|9634B 5207B| 0 0 | 359 502
0 3 75 22 0 0| 10M 30M| 14k 5888B| 0 0 | 41k 100k
0 3 72 25 0 0| 11M 15M|5462k 5440k| 0 0 | 42k 103k
0 2 73 25 0 0| 11M 12M|7685k 7958k| 0 0 | 38k 92k
0 3 65 32 0 0| 11M 17M|7911k 7944k| 0 0 | 47k 113k
|
| Comment by Alex Zhuravlev [ 16/Sep/13 ] |
|
This might be due to journal overflow, which happens sometimes; it takes a while to free the journal. If the theory is correct, then similar (though probably less frequent) behavior should be observed with 1.8.
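If journal overflow is the suspect, a larger journal might be worth trying as a mitigation; on ldiskfs that means recreating the journal at a bigger size, roughly as follows (a sketch only, with the MDT unmounted and the device names used earlier in this ticket; an external journal device would instead be recreated larger with mke2fs -O journal_dev):
# tune2fs -O ^has_journal /dev/lustre/mdt1
# e2fsck -f /dev/lustre/mdt1
# tune2fs -j -J size=1024 /dev/lustre/mdt1
|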
| Comment by Daire Byrne (Inactive) [ 16/Sep/13 ] |
|
Well since putting the external journal in place the performance of this workload has almost doubled. Responsiveness of the filesystem has also noticeably improved. We may put off downgrading to v1.8 for a while longer while we further test v2.4 for other potential issues. Thanks for your help again. |
| Comment by Daire Byrne (Inactive) [ 09/Jan/14 ] |
|
Hi, I just wanted to update this ticket with recent changes to our config. We have now replaced the 10K SAS drives in our external Dell RAID with SSDs (RAID0 over 2x RAID5, 9 disks). With our extremely heavy link/unlink workload we are seeing much better peak IOPS. The journal is internal in this case.
req_waittime 20878 20878 21619361 [usec] 150732 2 7.37 176237 70.37
req_waittime 21540 21540 21640901 [usec] 160872 2 7.37 176237 70.33
req_waittime 18086 18086 21658987 [usec] 128436 2 7.37 176237 70.30
req_waittime 20076 20076 21679063 [usec] 143419 2 7.37 176237 70.27
req_waittime 20525 20525 21699588 [usec] 144492 2 7.37 176237 70.24
req_waittime 15152 15152 21714740 [usec] 108671 2 7.37 176237 70.21
req_waittime 12877 12877 21727617 [usec] 94028 2 7.37 176237 70.19
req_waittime 18741 18741 21746358 [usec] 136609 2 7.37 176237 70.16
req_waittime 20261 20261 21766619 [usec] 145094 2 7.37 176237 70.13
req_waittime 10070 10070 21776689 [usec] 73515 2 7.37 176237 70.11
req_waittime 6038 6038 21782727 [usec] 51293 2 7.37 176237 70.10
req_waittime 22204 22204 21804931 [usec] 155824 2 7.37 176237 70.07
req_waittime 21095 21095 21826026 [usec] 153739 2 7.37 176237 70.03
req_waittime 15435 15435 21841461 [usec] 111966 2 7.37 176237 70.01
req_waittime 20672 20672 21862133 [usec] 147581 2 7.37 176237 69.98
req_waittime 22573 22573 21884706 [usec] 160090 2 7.37 176237 69.94
req_waittime 13807 13807 21898513 [usec] 98643 2 7.37 176237 69.92
req_waittime 20257 20257 21918770 [usec] 144935 2 7.37 176237 69.89
req_waittime 20398 20398 21939168 [usec] 144100 2 7.37 176237 69.85
req_waittime 20208 20208 21959376 [usec] 143570 2 7.37 176237 69.82
req_waittime 6276 6276 21965652 [usec] 44278 2 7.37 176237 69.81
req_waittime 3850 3850 21969502 [usec] 28082 2 7.37 176237 69.81
req_waittime 12482 12482 21981984 [usec] 90562 2 7.37 176237 69.79
..
However, I am now seeing that the issue whereby the IOPS drop to zero at regular intervals returns after many hours of operation. This halves the average IOPS, so it has a fairly big effect on performance. Once I remount the MDT, the performance returns to normal/maximum again for X hours. It's almost like the performance drops off once the server memory/cache fills up (256 GB).
req_waittime 4889 4889 1102506250[usec] 35266 2 8.02 2613228 507.60
req_waittime 10978 10978 1102517228[usec] 85139 2 8.02 2613228 507.60
req_waittime 9717 9717 1102526945[usec] 73508 2 8.02 2613228 507.59
req_waittime 13839 13839 1102540784[usec] 102205 2 8.02 2613228 507.59
req_waittime 13343 13343 1102554127[usec] 98177 2 8.02 2613228 507.59
req_waittime 10959 10959 1102565086[usec] 83997 2 8.02 2613228 507.58
req_waittime 8660 8660 1102573746[usec] 63587 2 8.02 2613228 507.58
req_waittime 1307 1307 1102575053[usec] 8431 2 8.02 2613228 507.58
req_waittime 0 0 1102575053[usec] 0 2 8.02 2613228 507.58
req_waittime 1169 1169 1102576222[usec] 7085 2 8.02 2613228 507.58
req_waittime 3012 3012 1102579234[usec] 20816 2 8.02 2613228 507.58
req_waittime 12367 12367 1102591601[usec] 92169 2 8.02 2613228 507.58
req_waittime 13004 13004 1102604605[usec] 99554 2 8.02 2613228 507.58
req_waittime 13152 13152 1102617757[usec] 99387 2 8.02 2613228 507.57
req_waittime 12478 12478 1102630235[usec] 94129 2 8.02 2613228 507.57
req_waittime 15342 15342 1102645577[usec] 111265 2 8.02 2613228 507.57
req_waittime 6443 6443 1102652020[usec] 47035 2 8.02 2613228 507.56
req_waittime 983 983 1102653003[usec] 6440 2 8.02 2613228 507.56
req_waittime 0 0 1102653003[usec] 0 2 8.02 2613228 507.56
req_waittime 1068 1068 1102654071[usec] 6716 2 8.02 2613228 507.56
req_waittime 3278 3278 1102657349[usec] 24421 2 8.02 2613228 507.56
req_waittime 13877 13877 1102671226[usec] 103323 2 8.02 2613228 507.56
req_waittime 12384 12384 1102683610[usec] 92282 2 8.02 2613228 507.56
req_waittime 13866 13866 1102697476[usec] 98032 2 8.01 2613228 507.55
req_waittime 12153 12153 1102709629[usec] 86540 2 8.01 2613228 507.55
req_waittime 10753 10753 1102720382[usec] 79395 2 8.01 2613228 507.55
req_waittime 4344 4344 1102724726[usec] 36456 2 8.01 2613228 507.55
req_waittime 70 70 1102724796[usec] 443 2 8.01 2613228 507.55
req_waittime 0 0 1102724796[usec] 0 2 8.01 2613228 507.55
req_waittime 191 191 1102724987[usec] 1507 2 8.01 2613228 507.55
req_waittime 5686 5686 1102730673[usec] 38777 2 8.01 2613228 507.55
req_waittime 11976 11976 1102742649[usec] 91311 2 8.01 2613228 507.54
req_waittime 12811 12811 1102755460[usec] 92589 2 8.01 2613228 507.54
req_waittime 12153 12153 1102767613[usec] 89900 2 8.01 2613228 507.54
req_waittime 11515 11515 1102779128[usec] 87047 2 8.01 2613228 507.54
req_waittime 11057 11057 1102790185[usec] 82195 2 8.01 2613228 507.53
req_waittime 5531 5531 1102795716[usec] 39728 2 8.01 2613228 507.53
req_waittime 20 20 1102795736[usec] 167 2 8.01 2613228 507.53
..
# cat /proc/fs/jbd2/md0-8/info
594841 transaction, each up to 25600 blocks
average:
0ms waiting for transaction
2ms running transaction
0ms transaction was being locked
0ms flushing data (in ordered mode)
1ms logging transaction
183570us average transaction commit time
10 handles per transaction
31 blocks per transaction
32 logged blocks per transaction
# dumpe2fs -h /dev/md0
dumpe2fs 1.42.7.wc1 (12-Apr-2013)
Filesystem volume name: bravo-MDT0000
Last mounted on: /
Filesystem UUID: 468d314b-55f5-4d52-99ab-ab4baa527c73
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 3121512448
Block count: 1560747776
Reserved block count: 78037388
Free blocks: 726316155
Free inodes: 2690673937
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1024
Blocks per group: 16384
Fragments per group: 16384
Inodes per group: 32768
Inode blocks per group: 4096
RAID stride: 128
RAID stripe width: 256
Flex block group size: 16
Filesystem created: Wed Dec 11 12:35:07 2013
Last mount time: Thu Jan 9 11:14:43 2014
Last write time: Thu Jan 9 11:14:43 2014
Mount count: 12
Maximum mount count: -1
Last checked: Wed Dec 11 12:35:07 2013
Check interval: 0 (<none>)
Lifetime writes: 72 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b3732abb-0697-4314-bec1-104d3524a83f
Journal backup: inode blocks
User quota inode: 3
Group quota inode: 4
Journal features: journal_incompat_revoke
Journal size: 400M
Journal length: 102400
Journal sequence: 0x00136afd
Journal start: 86536
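One cheap way to test the memory/cache theory above, short of a full MDT remount, might be to drop the kernel's clean caches when the slowdown appears and see whether the rate recovers:
# sync; echo 3 > /proc/sys/vm/drop_caches
|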
| Comment by Daire Byrne (Inactive) [ 10/Mar/14 ] |
|
You can definitely close this now. For our workloads an SSD-based MDT is required, and it is now working well for us. Cheers, Daire |
| Comment by Peter Jones [ 10/Mar/14 ] |
|
ok thanks Daire |