Lustre / LU-3753

MDS operation rate dropping sporadically

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 2.4.0
    • None
    • 3
    • 9680

    Description

      Hi,

      I have been investigating why our metadata rate keeps dropping to nothing for a few seconds at a time, roughly every 30 seconds. Looking at the debug logs, the MDS appears to be stuck doing many osp_sync_request_commit_cb() calls. I wouldn't have thought this should block all other operations, so maybe this is not the issue and these are simply the only operations occurring when we see the rate drop to zero. I'm pretty sure the RAID and storage path is all fine. Here are a stats snapshot and a debug-log snippet illustrating the issue:

      /proc/fs/lustre/mds/MDS/mdt/stats @ 1375970781.677198
      Name Cur.Count Cur.Rate #Events Unit last min avg max stddev
      req_waittime 0 0 803871957 [usec] 0 2 6.63 10477 8.15
      req_qdepth 0 0 803871957 [reqs] 0 0 0.00 12 0.07
      req_active 0 0 803871957 [reqs] 0 1 2.67 16 1.79
      req_timeout 0 0 803871957 [sec] 0 1 15.50 37 14.29
      reqbuf_avail 0 0 1709293570 [bufs] 0 47 63.79 64 0.59
      ldlm_ibits_enqueue 0 0 499159709 [reqs] 0 1 1.00 1 0.00
      mds_getattr 0 0 3152776 [usec] 0 8 346.15 1453985 6393.00
      mds_getattr_lock 0 0 158194 [usec] 0 9 83.28 381776 1418.81
      mds_connect 0 0 76 [usec] 0 14 3477.13 147911 21249.32
      mds_disconnect 0 0 19 [usec] 0 26 1178.21 15598 3757.50
      mds_getstatus 0 0 4 [usec] 0 9 13.75 16 3.20
      mds_statfs 0 0 2904 [usec] 0 5 19.87 2115 67.53
      mds_sync 0 0 3 [usec] 0 116 10933.67 32562 18730.69
      mds_getxattr 0 0 50841 [usec] 0 6 12.89 92 3.91
      obd_ping 0 0 530449 [usec] 0 3 11.87 3851 8.76

      00000004:00080000:4.0:1375970781.689383:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff8821062c4000, transno 8617232460
      00000004:00080000:4.0:1375970781.689384:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff882d394d4c00, transno 8617232461
      00000004:00080000:4.0:1375970781.689384:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883e03485800, transno 8617232462
      00000004:00080000:4.0:1375970781.689385:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883479fcdc00, transno 8617232463
      00000004:00080000:4.0:1375970781.689386:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883515c1f400, transno 8617232464
      00000004:00080000:4.0:1375970781.689387:0:4870:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883d76783000, transno 8617232465
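
      A minimal sketch (assuming the stats path above and stock bash/awk) of one way to make the stalls visible: sample the cumulative req_waittime event counter once per second and print the delta, so the pauses show up as near-zero samples.

      #!/bin/bash
      # Sketch: sample the cumulative req_waittime event count (column 4 of the
      # stats line) once per second and print the delta; stalls appear as ~0 req/s.
      STATS=/proc/fs/lustre/mds/MDS/mdt/stats
      prev=$(awk '$1 == "req_waittime" { print $4 }' "$STATS")
      while sleep 1; do
          cur=$(awk '$1 == "req_waittime" { print $4 }' "$STATS")
          printf '%s  %d req/s\n' "$(date +%T)" "$((cur - prev))"
          prev=$cur
      done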

      I am also attaching a graph showing the MDS rate drops over a few minutes. I can upload some full debug logs from the MDS when the rate drops to zero if that helps.

      The workload is running on two clients and copies, hard links and unlinks thousands of directory trees. The two clients suffer performance drops at the same time as the MDS rate drops to zero.

      Cheers,

      Daire

      Attachments

        1. bravo.txt.bz2 (1.35 MB)
        2. mds-pauses.jpeg (76 kB)
        3. mds-pauses2.jpeg (97 kB)


          Activity


            The load and I/O wait seem to be consistently higher than they used to be, so I guess that is a good sign that we are doing more work overall than we were before the external journal. In this snippet the req_waittime rate was around 4,000/s and then increased to around 14,000/s for some time.

            procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
             r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
             0 35      0 153068944 76469912 530164    0    0  5340  2080 33554 73880  0  2 69 29  0
             2 33      0 153049840 76478664 530904    0    0  5888     0 33074 75321  0  2 69 29  0
             0 36      0 153039680 76488400 529812    0    0  6712  6450 39886 89002  0  2 72 26  0
             2 33      0 153029568 76494808 530492    0    0  4284 59700 36049 136657  0  2 70 28  0
             0 14      0 153025168 76500336 531188    0    0  4240  2064 23191 47842  0  1 71 28  0
             0 16      0 153028528 76501656 531352    0    0  1240  4168 14398 27798  0  1 81 19  0
             2 35      0 153010848 76510256 529632    0    0  6136  4144 38677 82843  0  2 70 28  0
             0 15      0 153009856 76513080 531240    0    0  2360  4144 20287 36438  0  1 80 19  0
             2 31      0 153002480 76520320 529444    0    0  5808 37772 35445 136320  0  2 71 26  0
             0 32      0 152982288 76532400 530672    0    0  8752  6192 47392 105233  0  3 69 28  0
             0 14      0 152969552 76539408 529064    0    0  5896  6240 30985 60270  0  2 73 25  0
             0 31      0 152956768 76545224 530080    0    0  4768  8304 29606 58896  0  2 80 19  0
             0 24      0 152930224 76562528 529352    0    0 14392  4224 60441 143758  0  4 70 26  0
             5 26      0 152912704 76578672 529636    0    0 14068 71680 58179 184156  0  4 63 33  0
             1 23      0 152887104 76594720 531316    0    0 13708  4160 59833 142103  0  4 68 28  0
             1 21      0 152859648 76609600 530468    0    0 11764  2088 55286 128557  0  3 63 33  0
             2 29      0 152835968 76624656 530008    0    0 12752  6208 49436 108952  0  3 68 29  0
             0 28      0 152812096 76639664 530384    0    0 12116  2080 56040 134488  0  4 73 24  0
            
            daire Daire Byrne (Inactive) added a comment
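
            To line the vmstat samples up with the MDS rate drops, a minimal sketch for keeping a timestamped vmstat log (assumes gawk for strftime(); the log path is just a placeholder):

            vmstat 1 | awk '{ print strftime("%H:%M:%S"), $0; fflush() }' | tee /tmp/vmstat-mds.log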

            Could you grab another vmstat output?

            bzzz Alex Zhuravlev added a comment

            Yeah, I obviously forgot to apply sync_permissions=0. So now it looks like:

            66188 transaction, each up to 65536 blocks
            average: 
              2ms waiting for transaction
              49ms running transaction
              3ms transaction was being locked
              0ms flushing data (in ordered mode)
              0ms logging transaction
              48370us average transaction commit time
              196 handles per transaction
              167 blocks per transaction
              168 logged blocks per transaction
            

            So far it seems like the periods of "no activity" have been reduced, but the transaction rate still varies quite a lot. I think we'll still go ahead and compare the same hardware on v1.8.9. It just feels like the ext4/RAID sync is really, really expensive with the current config. We need the metadata rate to be more consistent and predictable.

            Cheers.

            daire Daire Byrne (Inactive) added a comment
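
            A minimal sketch of checking and setting the permission-sync tunable being discussed, assuming it is exposed as the mdd.*.sync_permission parameter (the exact name is an assumption; verify with lctl list_param mdd.* on the running MDS):

            lctl get_param mdd.*.sync_permission        # current value (parameter name assumed)
            lctl set_param mdd.*.sync_permission=0      # disable synchronous permission changes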

            hmm, 3 handles per transaction isn't good at all. sync_permission=0?

            bzzz Alex Zhuravlev added a comment

            Well it helped with the journal transaction times but the constant pausing is still present. Actually it may be even more frequent and regular now. Oh well, a reformat to v1.8.9 will be more instructive.

            bmds1 /root # cat /proc/fs/jbd2/sdb/info
            2283745 transaction, each up to 65536 blocks
            average: 
              0ms waiting for transaction
              2ms running transaction
              0ms transaction was being locked
              0ms flushing data (in ordered mode)
              0ms logging transaction
              152us average transaction commit time
              4 handles per transaction
              12 blocks per transaction
              13 logged blocks per transaction
            
            daire Daire Byrne (Inactive) added a comment
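
            A quick sketch for checking whether journal commits stall at the same time as the MDS rate drops, by diffing the transaction counter in the jbd2 info file over a fixed interval (path taken from the output above):

            INFO=/proc/fs/jbd2/sdb/info
            t0=$(awk 'NR == 1 { print $1 }' "$INFO"); sleep 10
            t1=$(awk 'NR == 1 { print $1 }' "$INFO")
            echo "$(( (t1 - t0) / 10 )) jbd2 transactions/s over the last 10s"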

            I'll try this:

            mke2fs -b 4096 -O journal_dev /dev/sdb 262144
            tune2fs -O ^has_journal /dev/lustre/mdt1
            tune2fs -o journal_data -j -J device=/dev/sdb /dev/lustre/mdt1

            daire Daire Byrne (Inactive) added a comment
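
            Expanded into a fuller sequence as a sketch (the /mnt/mdt mount point is a placeholder; the device names are the ones above); the MDT has to be offline before tune2fs will change the journal:

            umount /mnt/mdt                                    # stop the MDT first (mount point assumed)
            mke2fs -b 4096 -O journal_dev /dev/sdb 262144      # 1 GiB (262144 x 4 KiB blocks) external journal
            tune2fs -O ^has_journal /dev/lustre/mdt1           # drop the internal journal
            tune2fs -o journal_data -j -J device=/dev/sdb /dev/lustre/mdt1   # attach the external journal
            dumpe2fs -h /dev/lustre/mdt1 | grep -i journal     # sanity-check before remounting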

            Well, we can easily add some SSDs to the MDS server. Is it fairly easy to add an external journal to an existing MDT filesystem? It is worth a try before we reformat for v1.8.9.

            daire Daire Byrne (Inactive) added a comment

            If you have additional storage on the MDS node, I'd try with an external journal... but 1.8 would be interesting, definitely.

            bzzz Alex Zhuravlev added a comment

            So I think the best course of action now is to install Lustre v1.8.9 on this hardware and compare the performance with the same workload. Unfortunately we will have to change one element: the v1.8.9 Whamcloud kernel (EL6.2) does not support the LVM RAID10 layout we are currently using (EL6.4), so we will have to use an MD RAID10 with LVM on top.

            I will report back later next week once this work has been done. Thanks for your time.

            daire Daire Byrne (Inactive) added a comment
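
            A sketch of that fallback layout (MD RAID10 with LVM on top); the member devices and their count are placeholders, and the VG/LV names are chosen to match the /dev/lustre/mdt1 path used earlier:

            mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[c-f]   # member disks assumed
            pvcreate /dev/md0
            vgcreate lustre /dev/md0
            lvcreate -l 100%FREE -n mdt1 lustre                                # yields /dev/lustre/mdt1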

            "nobarrier" had no effect.

            daire Daire Byrne (Inactive) added a comment

            I wonder if mounting with "nobarrier" will help? See e.g. LU-979.

            daire Daire Byrne (Inactive) added a comment
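
            Two common ways to apply "nobarrier" to an ldiskfs MDT, as a sketch only (not verified on this system; note that tunefs.lustre --mountfsoptions replaces the existing option string rather than appending, and the /mnt/mdt mount point is a placeholder):

            mount -t lustre -o nobarrier /dev/lustre/mdt1 /mnt/mdt                         # one-off, at mount time
            tunefs.lustre --mountfsoptions="errors=remount-ro,nobarrier" /dev/lustre/mdt1  # persistent; replaces existing options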

            People

              bzzz Alex Zhuravlev
              daire Daire Byrne (Inactive)
              Votes: 0
              Watchers: 6
