[LU-2649] slow IO at NOAA Created: 18/Jan/13 Updated: 08/Feb/13 Resolved: 08/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Cliff White (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | ptr | | |
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6189 |
| Description |
|
(Sorry, this is kind of a book.) For the past couple of weeks, NOAA has been seeing significantly reduced performance on one of their filesystems, scratch2. Both scratch2 and scratch1 use the same type of hardware and are connected to the same IB switches and network. The problem was first reported by a user doing a lot of opens and reads. Streaming rates appear to be normal, but the open rates have dropped significantly. We ran mdtest to confirm the drop:

[Dennis.Nelson@fe7 ~/mdtest-1.8.3]$ mpirun -np 8
mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
8 tasks, 15912 files/directories
SUMMARY: (of 3 iterations)
– finished at 01/14/2013 15:35:29 –

[Dennis.Nelson@fe7 ~/mdtest-1.8.3]$ mpirun -np 8
mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
8 tasks, 15912 files/directories
SUMMARY: (of 3 iterations)
– finished at 01/14/2013 15:28:37 –

We've investigated a few different areas without much luck. First we looked at the storage to make sure it looked OK, and it does look remarkably clean; drive and LUN latencies are all low. We also looked at iostat on the MDS to make sure that the average wait time was in line with what is seen on scratch1, and it also all looks normal. We then looked at the average request wait time on the client and found that requests to the OSTs and MDTs are taking significantly longer to complete on scratch2 than on scratch1:

4326.63
[Kit.Westneat@bmem5 ~]$ for x in /proc/fs/lustre/osc/scratch1-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
883.17
scratch1 mdc avg req_waittime 191.04
scratch2 mdc avg req_waittime 1220.99

We investigated the IB network some, but couldn't find anything obviously wrong. One interesting thing to note is that some of the OSSes on scratch2 use asymmetric routes between themselves and the test client (bmem5). For example:

[root@lfs-mds-2-1 ~]# ibtracert -C mlx4_1 -P2 62 31
From ca {0x0002c903000faa4a} portnum 1 lid 62-62 "bmem5 mlx4_1"
[1] -> switch port {0x00066a00e3002e7c}[2] lid 4-4 "QLogic 12300 GUID=0x00066a00e3002e7c"
[20] -> switch port {0x00066a00e30032de}[2] lid 2-2 "QLogic 12300 GUID=0x00066a00e30032de"
[28] -> switch port {0x00066a00e3003338}[10] lid 6-6 "QLogic 12300 GUID=0x00066a00e3003338"
[32] -> ca port {0x0002c903000f9194}[2] lid 31-31 "lfs-oss-2-01 HCA-2"
To ca {0x0002c903000f9192} portnum 2 lid 31-31 "lfs-oss-2-01 HCA-2"

[root@lfs-mds-2-1 ~]# ibtracert -C mlx4_1 -P2 31 62
From ca {0x0002c903000f9192} portnum 2 lid 31-31 "lfs-oss-2-01 HCA-2"
[2] -> switch port {0x00066a00e3003338}[32] lid 6-6 "QLogic 12300 GUID=0x00066a00e3003338"
[1] -> switch port {0x00066a00e3003339}[19] lid 3-3 "QLogic 12300 GUID=0x00066a00e3003339"
[10] -> switch port {0x00066a00e3002e7c}[28] lid 4-4 "QLogic 12300 GUID=0x00066a00e3002e7c"
[2] -> ca port {0x0002c903000faa4b}[1] lid 62-62 "bmem5 mlx4_1"
To ca {0x0002c903000faa4a} portnum 1 lid 62-62 "bmem5 mlx4_1"

The path from the client to the OSS uses switch 0x...32de, but the return path uses switch 0x...3339. I'm not sure whether this has any effect on latency, but it's interesting to note.

The slow average request time also holds from the MDTs to the OSTs:

[root@lfs-mds-1-1 ~]# for x in /proc/fs/lustre/osc/scratch1-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
2445.21
[root@lfs-mds-2-2 ~]# for x in /proc/fs/lustre/osc/scratch2-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'

One individual OSS had an average response time of 61s. Checking the brw_stats for that OST, we found that the vast majority of IOs complete in less than 1s: less than 1% of all IO took 1s or more, and most IOs (51%) finished in <= 16ms. This leads me to believe that we can rule out the storage.

While looking at loading the OST metadata using debugfs, we realized that scratch2 has its caches enabled on the OSTs, while scratch1 does not. We disabled the read and write caches, and performance increased, though still not to the level of scratch1:

mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
8 tasks, 15912 files/directories
SUMMARY: (of 3 iterations)
– finished at 01/18/2013 22:34:53 –

The next thing we're going to try is small-message LNET testing to verify that latency isn't worse on the scratch2 paths than on the scratch1 paths. The OSSes themselves are not reporting abnormal req_waittimes: oss-2-08 reports an average wait time of 185.83ms and oss-1-01 reports 265.72ms (checked using llstat ost).

I was wondering if you all had any suggestions for what we could check next. Is there any information I can get you? Also, what exactly do the req_waittimes measure? Am I reading them correctly? Thanks, |
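For reference, a minimal sketch of the same calculation broken out per OST rather than summed, assuming the 1.8 osc stats layout used in the one-liners above (column 2 of the req_waittime line is the sample count, column 7 the cumulative sum in microseconds); this is useful for spotting whether a few slow OSTs dominate the average:

for f in /proc/fs/lustre/osc/scratch2-OST*/stats; do
    ost=$(echo "$f" | grep -o 'OST[0-9a-f]*')
    # average req_waittime in usec = cumulative sum / sample count
    awk -v ost="$ost" '/^req_waittime/ {printf "%s %.2f\n", ost, $7 / $2}' "$f"
done | sort -rn -k2 | head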
| Comments |
| Comment by Cliff White (Inactive) [ 18/Jan/13 ] |
|
The mdtest results would point toward the MDS/network. I believe req_waittime from llstat ost is the time the IO spends in the disk queue. I will ask our networking people for further advice. |
| Comment by Cliff White (Inactive) [ 18/Jan/13 ] |
|
They have suggested using LNET self-test to determine if the routing differences are significant. A test such as: |
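The suggested test itself did not survive in this export; purely as an illustrative sketch of an lnet_selftest run between one client and one server (the NIDs are hypothetical placeholders, and the lnet_selftest module must be loaded on every node involved):

export LST_SESSION=$$
lst new_session rw_test
lst add_group clients 192.168.1.10@o2ib      # hypothetical client NID
lst add_group servers 192.168.1.20@o2ib      # hypothetical MDS/OSS NID
lst add_batch bulk
lst add_test --batch bulk --concurrency 8 --from clients --to servers brw write size=1M
lst run bulk
lst stat clients servers                     # sample rates; Ctrl-C to stop
lst end_session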
| Comment by Kit Westneat (Inactive) [ 18/Jan/13 ] |
|
LNET selftest actually performed better against the scratch2 (slow) MDT than against the scratch1 MDT from the same client: scratch2 was 2.9-3 GB/s and scratch1 was around 2.8. I also tried a single-RPC-in-flight test to see if latency was an issue, but both ran at the same speed, 1.5-1.6 GB/s. With a 4k test, RPC rates were about 10% lower on scratch2 at concurrency 8. When I lowered the concurrency to 1, I got 30k RPCs on scratch2 vs. 40k RPCs on scratch1. That is somewhat significant, but it is still not a large enough difference to explain the giant dirstat differences. |
| Comment by Kit Westneat (Inactive) [ 20/Jan/13 ] |
|
Any more ideas for things to check? The metadata performance delta between the filesystems is still fairly significant and we still haven't narrowed it down to any component. |
| Comment by Kit Westneat (Inactive) [ 20/Jan/13 ] |
|
Just to throw another question into the equation: I looked at the import files on scratch1 and scratch2 right after running mdtest on both, and they both say 31s as their service estimates, though scratch1 has a waittime of 182ms vs. 262ms for scratch2. |
| Comment by Niu Yawei (Inactive) [ 21/Jan/13 ] |
|
It looks like a network problem between the MDTs and OSTs (and it looks like the connection between the client and OSTs isn't good either). I think we should run LNET selftest between the MDTs and OSTs rather than from the client to the MDTs. |
| Comment by Kit Westneat (Inactive) [ 21/Jan/13 ] |
|
I ran MDS <-> OSS testing with lnet_selftest. There was a lot of variation in the numbers due to the active workloads on them (the system is in production), but I couldn't see a significant difference between the scratch1 filesystem and scratch2. I will try to get better data, but at the moment the LST data is inconclusive. I was looking at more routes and it looks like some of the scratch1 routes are asymmetric as well, so I don't think that can be pointed at as the root cause. I am still seeing a 2x avg req_waittime between the MDT and OSTs on scratch2 vs scratch1 though. I'll try to look at the individual OSTs to see if I can see a pattern. |
| Comment by Kit Westneat (Inactive) [ 21/Jan/13 ] |
|
Hi Niu, I was wondering if you could expand on what makes it seem like a networking issue as opposed to anything else. The customer is interested. Also, is there anything else we could check to definitively rule out any other issues? Thanks, |
| Comment by Doug Oucharek (Inactive) [ 21/Jan/13 ] |
|
Question: have the tunable LND parameters (credits, peer_credits, etc.) been modified on any of these systems? To help see whether credits are playing a role in throttling at the LNet layer, can you post the results of these "cat" commands on the systems with potentially slow networking:

cat /proc/sys/lnet/nis

These proc files will show information about credits as well as potential networking errors. Are there LNet routers involved? If so, can you post the result of this "cat" command on those routers:

cat /proc/sys/lnet/buffers

That one will show whether we are running out of routing buffers and thereby creating a bottleneck. |
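A rough way to scan those files for signs of throttling, assuming the stock /proc/sys/lnet layout (and assuming /proc/sys/lnet/peers was among the files requested, since it carries the per-peer credit counters): any negative value in a min-credits column means messages have had to queue at some point.

# flag NI or peer lines whose min-credit columns have gone negative (low-water marks)
grep -e '-[0-9]' /proc/sys/lnet/nis /proc/sys/lnet/peers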
| Comment by Cliff White (Inactive) [ 21/Jan/13 ] |
|
To some extent, we are going by your initial problem statement: you said you checked the storage and it appeared to be performing properly. In the case of the mdtest results you reported, those depend quite a bit on network and MDS storage performance. Anything further you can check related to storage performance would help eliminate that as an issue. |
| Comment by Kit Westneat (Inactive) [ 21/Jan/13 ] |
|
Ok, I'll attach that output from OSSes and MDSes on scratch1 and scratch2. In getting this info, I realized that I had been testing the wrong fabric before for the OSS <-> MDS tests. The LST tests again looked pretty decent except for when I did the MDS as the server, the OSS as the client and did a write-only test. Then it bounces between 1.2-2.2 GB/s (most of the time around 2GB/s though). If I did the MDS as the client, and/or did a read-only test, the results were around 3GB/s. This behavior is seen on both scratch1 and scratch2 filesystems. The IO to disk looks decent, but I was wondering if there was some way to rule out the layers above the disk. Is there a way to see how long IOs are waiting in Lustre's queues before going to the disk? Is it possible for ldiskfs to add significant time? Is there anything else we can check? |
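One hedged pointer on the queuing question: the per-OST brw_stats histograms already referenced above (1.8 obdfilter proc paths assumed) include "disk I/Os in flight" and "I/O time (1/1000s)" sections, which give at least some visibility into time spent above the raw block device.

# dump the per-OST histograms on an OSS; compare the "I/O time" and
# "disk I/Os in flight" sections between scratch1 and scratch2 OSTs
cat /proc/fs/lustre/obdfilter/scratch2-OST*/brw_stats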
| Comment by Kit Westneat (Inactive) [ 21/Jan/13 ] |
|
lnet data |
| Comment by Doug Oucharek (Inactive) [ 22/Jan/13 ] |
|
In the collected LNet stats there are a lot of negative "min credits" values. That value is the low-water mark for credits for either a network interface or a peer (depending on which proc file is being viewed). A negative number indicates that queuing has been going on and shows how deep the queue became. This tells me that some level of throttling is happening, though I cannot say for sure that it is enough to create the bandwidth reductions you are seeing. One way to find out is to bump up the credits and peer_credits on critical nodes like the MDS and OSSes. See the Lustre manual for the module parameters used to modify credits and peer_credits. |
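For reference, credits and peer_credits are ko2iblnd module parameters. A sketch of how they might be raised follows; the values below are placeholders, not a recommendation, and sizing guidance is in the Lustre manual:

# e.g. in /etc/modprobe.d/ko2iblnd.conf (illustrative values only)
options ko2iblnd credits=256 peer_credits=16
# the module must be reloaded (stop Lustre, lustre_rmmod, remount) to take effect;
# note that on o2ib the setting must be consistent across nodes on the same
# fabric (see the discussion near the end of this ticket)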
| Comment by Kit Westneat (Inactive) [ 22/Jan/13 ] |
|
One thing I noticed is that on the OSS, the min for the MDS peer line is less negative for scratch2 (-14) than for its scratch1 counterparts (-421). What does this mean? Does it mean that there is less queuing on the scratch2 network than on scratch1? |
| Comment by Doug Oucharek (Inactive) [ 22/Jan/13 ] |
|
That is correct. There seems to be less queuing on the scratch2 network. |
| Comment by Kit Westneat (Inactive) [ 23/Jan/13 ] |
|
Hmm, would that imply that the network is most likely not a factor in scratch2 being slower? If there are not many requests waiting in the queue, it seems like the problem is in processing the requests and getting them into the queue. We have made some progress: CPU speed regulation was enabled on the OSSes and MDSes, and turning it off has improved performance across the board. There is still a large performance difference between scratch1 and scratch2, though, of the same magnitude as before. Are there any other stats we should check? |
| Comment by Kit Westneat (Inactive) [ 25/Jan/13 ] |
|
I did a trace and rpctrace debug dump. It looks like the time delta is almost entirely in the ldlm code. Do you have any thoughts as to what might cause that? I am going to try to enable ldlm debugging and see if I can get any more information. |
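A minimal sketch of one way to capture that DLM tracing with the 1.8-era proc interface and lctl (the log path is arbitrary; extra debug flags add measurable overhead, so they should be removed afterwards):

echo +dlmtrace > /proc/sys/lnet/debug    # add DLM tracing to the debug mask
# ... reproduce the slow operation ...
lctl dk /tmp/lustre-dlm-trace.log        # dump and clear the kernel debug buffer
echo -dlmtrace > /proc/sys/lnet/debug    # restore the previous mask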
| Comment by Cliff White (Inactive) [ 25/Jan/13 ] |
|
Can you attach/upload the trace data? |
| Comment by Cliff White (Inactive) [ 25/Jan/13 ] |
|
Also, have you tried adjusting the peer credits, as Doug mentioned above? |
| Comment by Kit Westneat (Inactive) [ 25/Jan/13 ] |
|
rpctrace and trace of a single request with timestamp deltas added |
| Comment by Kit Westneat (Inactive) [ 25/Jan/13 ] |
|
The customer won't take a downtime to change the credits unless I can say that I think there is a high probability of solving the issue, and I don't have that level of confidence right now. |
| Comment by Kit Westneat (Inactive) [ 25/Jan/13 ] |
|
dlm traces, I'm not sure if these actually add anything |
| Comment by Kit Westneat (Inactive) [ 28/Jan/13 ] |
|
Hello, any updates on this issue? I have been having regular status calls about it, so any ideas would be very helpful in keeping the customer content. |
| Comment by Cliff White (Inactive) [ 28/Jan/13 ] |
|
I am talking to our engineers, hoping to have some ideas for you soon. |
| Comment by Oleg Drokin [ 28/Jan/13 ] |
|
It seems pretty clear that MDS operations as a whole are slower on one filesystem than on the other, and that this is not affected by OST performance (note how directory operations are also degraded, and those don't really touch the OSTs). |
| Comment by Kit Westneat (Inactive) [ 28/Jan/13 ] |
|
There is a variable background load, to be sure. scratch2 is often more loaded, but I've tried to run the tests when there isn't very much activity on the MDSes (as seen by the export stats). I'll try to get a more precise measurement of relative activity via rpcstats. When you say sdiostats, which stats in particular are you thinking of? I've used iostat to watch the disk activity and I don't see a lot of differences between the two. One thing that occurred to me is that scratch2 has 20 OSSes, while scratch1 only has 16 (and scratch2 has 25% more OSTs as well). Could the time spent in the LDLM code be proportional to the number of OSSes/OSTs? It would explain the 25% reduction in performance, to be sure. |
| Comment by Oleg Drokin [ 28/Jan/13 ] |
|
LDLM on the MDS does not deal with OSTs at all. The only place on the MDT that does is LOV (and the OSCs), for object pre-allocation (also setattrs and such). I am interested in iostat data about typical IO completion times and IO sizes. |
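One common way to collect what Oleg is asking for, assuming the sysstat iostat of that era (await is the average completion time in ms, avgrq-sz the average request size in sectors):

# extended per-device statistics, refreshed every 5 seconds, on the MDS and OSSes
iostat -xk 5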
| Comment by Kit Westneat (Inactive) [ 31/Jan/13 ] |
|
I looked at the /proc/fs/lustre/mds/*/stats and /proc/fs/lustre/mdt/MDS/mds/stats files to see how loaded the system was. scratch2 does seem more loaded, but that might just have been when I was looking. I tried to find a moment when scratch2 was less loaded in order to run the benchmark; mdtest showed some improvement, but the dirstats were still significantly lower than on scratch1. Clearly the load on the system has some effect, and there is a downtime next week that we'll use to test. What sorts of tests would be useful on a quiet system? I was going to enable various debug flags in order to hopefully trace out what is going on. Is there anything else you would like to look at? What could cause dirstat performance to be so much worse than the other metadata operations? Here is some example req_* output from llstat mds for scratch2. I was wondering why reqbuf_avail on scratch2 was so much higher? |
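A sketch of one way to watch relative MDS load while a benchmark runs, using the same stats file and the llstat utility mentioned above (the 10-second interval is arbitrary):

# print per-interval deltas of the MDS request counters every 10 seconds
llstat -i 10 /proc/fs/lustre/mds/scratch2-MDT0000/stats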
| Comment by Kit Westneat (Inactive) [ 31/Jan/13 ] |
|
IO statistics for the sd devices, they look very similar between the FSes |
| Comment by Kit Westneat (Inactive) [ 31/Jan/13 ] |
|
differences in /proc/fs/lustre/mds/scratch2-MDT0000/stats over the run of an mdtest. You can see scratch2 was much more loaded during this time. The opens, closes, and unlinks stick out as major differences too. |
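A sketch of how such a before/after comparison can be produced (temporary file names are arbitrary; this simply snapshots the cumulative counters around the benchmark):

cat /proc/fs/lustre/mds/scratch2-MDT0000/stats > /tmp/mds-stats.before
# ... run mdtest from the client ...
cat /proc/fs/lustre/mds/scratch2-MDT0000/stats > /tmp/mds-stats.after
diff /tmp/mds-stats.before /tmp/mds-stats.after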
| Comment by Kit Westneat (Inactive) [ 31/Jan/13 ] |
|
I've attached the test plan for the downtime next week, can you review it and see if there is anything else we should be looking at or trying? |
| Comment by Cliff White (Inactive) [ 31/Jan/13 ] |
|
What about the peer credits tuning mentioned earlier? |
| Comment by Kit Westneat (Inactive) [ 31/Jan/13 ] |
|
What are the risks in making that change and how likely do you think it is that those changes will affect performance? |
| Comment by Cliff White (Inactive) [ 31/Jan/13 ] |
|
I think the risks are minimal, shouldn't hurt anything. I am not certain how likely it is that it will help this issue. |
| Comment by Kit Westneat (Inactive) [ 31/Jan/13 ] |
|
Ok, the reason I ask is that any changes have to be approved through a change review meeting, and I'm sure that they will ask me those questions. Actually another question I have is: can those be set only on the servers, or do they need to be set on the clients too? I remember the last time I played around with ko2iblnd settings, there were some that had to be set on both the client and the servers, or they would stop communicating. |
| Comment by Doug Oucharek (Inactive) [ 01/Feb/13 ] |
|
Credit changes affect the "throttling" on the system they are set on. I suspect the important systems to change here are just the servers. The clients (assuming there are many of them) should be fine as is. The changes don't need to be the same on both sides of a connection. |
| Comment by Oleg Drokin [ 02/Feb/13 ] |
|
Note that if you enable extra debug flags, you will likely see slower metadata performance (significantly slower for certain flags such as +inode or +dlmtrace). It looks like scratch1 is at times almost 2x less loaded than scratch2, and I suspect that on a fully quiet system you will see the same level of performance. dirstat is only affected by network latency and by how busy the MDS and disk are (and even then I suspect that for the mdtest run the created directories are fully in cache, so the disk plays no role other than when it slows down other threads). |
| Comment by Kit Westneat (Inactive) [ 04/Feb/13 ] |
|
It seems like credits changes need to be made on all systems (clients and servers). I got these errors trying to ping from a node with credit changes to one without: |
| Comment by Doug Oucharek (Inactive) [ 05/Feb/13 ] |
|
Looks like I was wrong about the "credits" tuneable. For IB, it controls the QP depth which should be symmetric (and we ensure it is symmetric as seen in the error messages). All the systems on the same IB network (could be logical networks) need to have the same credits setting. |
| Comment by Kit Westneat (Inactive) [ 08/Feb/13 ] |
|
we can close this. We found a potential microcode issue with the Intel chipset and upgraded that. We also split the two filesystems over different fabrics. Metadata improved on both filesystems. |
| Comment by Peter Jones [ 08/Feb/13 ] |
|
ok thanks Kit! |